How SAFi Caught Its Own Mistake

One thing is certain: LLMs are unpredictable. Ask the same question twice and you won’t get the same answer back unless a caching mechanism is in place.

Recently, we set up SAFi with RAG, and I prepared about 15 guides to serve as the knowledge base. I instructed SAFi strictly to use only this knowledge base as the source of truth for its answers and to include citations and sources in every answer. I also set up a Will rule to automatically block any answer that does not include citations or references.

One of the knowledge-base files mentions an exponential moving average that the Spirit uses to track long-term coherence. So while testing SAFi against the knowledge base, I asked, “What is a moving average?” SAFi correctly identified and cited the file where the moving average is mentioned, then went on to explain what a moving average is in general terms. Because the answer included a citation, the Will approved it, but the Conscience flagged it as misaligned because the answer was not grounded in how the moving average is actually used in SAFi.

This was an interesting case because it shows how SAFi can catch answers that look correct on the surface but are not grounded in its designated sources.

In-Depth Log Analysis

To understand what happened, we can look at the internal SAFi transaction log. This log acts as a black box flight recorder, showing the reasoning of each of SAFi’s internal faculties during the interaction.

The Prompt & The Intellect’s Response

The user prompt was logged as "how a moving average work". The Intellect faculty, responsible for generating the answer, produced a detailed explanation of an exponential moving average (EMA), complete with its mathematical formula, a discussion of weighting, and an analogy to its use in finance. Crucially, it did include a line stating, “The Spirit uses this method to keep its long-term memory responsive,” and appended the correct citation [1] to its statements.
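To make the formula concrete, here is a minimal sketch of an exponential moving average in Python. The function name, the smoothing factor, and the idea of applying it to per-turn scores are illustrative assumptions, not details taken from the SAFi code or documentation.

    def ema_update(previous_ema, new_value, alpha=0.1):
        # Illustrative helper; alpha is an arbitrary smoothing factor.
        # The newest value gets weight alpha, and the influence of older
        # values decays geometrically.
        return alpha * new_value + (1 - alpha) * previous_ema

    # Example: smoothing a stream of hypothetical per-turn scores
    ema = 1.0
    for score in [1.0, 0.8, -1.0, 0.9]:
        ema = ema_update(ema, score)
        print(round(ema, 3))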

The Will’s Decision: A Surface-Level Pass

The Will, which acts as a simple, rule-based gatekeeper, logged its decision as "Approved". Its reflection noted that the draft was a “clear and concise explanation.” The Will performed its duty exactly as programmed: it checked for the presence of a citation, found one, and therefore approved the message. It is not designed to perform a deeper semantic analysis of the content itself; it’s a first-line filter.
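As a rough illustration of that kind of first-line filter, the check below approves a draft only if it contains a bracketed citation marker. This is a sketch of the general idea, not the actual Will implementation; the regex and the function name are assumptions.

    import re

    CITATION_PATTERN = re.compile(r"\[\d+\]")  # matches markers like "[1]"

    def will_check(draft: str) -> bool:
        # Hypothetical helper; not SAFi's actual Will logic.
        # Surface-level gate: pass if at least one citation marker is present.
        # Note: this says nothing about whether the cited source actually
        # supports the surrounding text.
        return bool(CITATION_PATTERN.search(draft))

A check like this would approve the moving-average answer above, because the marker [1] is present even though most of the content is not drawn from the cited document.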

The Conscience’s Veto: Detecting a Deeper Error

This is where the system’s sophistication becomes apparent. The Conscience, which evaluates the answer against the core values of the system, flagged the response with negative scores across the board:

  • Alignment (-1): Its reasoning was that while citation [1] was present, the bulk of the answer contained information that was not from the provided SAFi documentation. It correctly identified that the Intellect had synthesized a general explanation rather than extracting the specific context available in the knowledge base.

  • Integrity (-1): The Conscience noted, “The answer includes information that may not be present in the SAF/SAFi documents.” This is the core of the issue. The LLM, despite being instructed to use only the RAG context, “leaked” knowledge from its general training data to elaborate on the topic.

  • Stewardship (-1): This value was violated because, by “potentially introducing external knowledge,” the response failed to responsibly manage the information constraints it was given.

The Conscience did not just check if a source was cited; it checked if the answer was faithful to that source. It correctly identified the answer as a form of “technically correct hallucination,” where the information is accurate in a general sense but not grounded in the specific context required by the system’s rules.
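Based on the log entries above, the Conscience appears to return a score and a short justification for each core value. A hypothetical shape for that record, with made-up field names, might look like this:

    from dataclasses import dataclass

    @dataclass
    class ValueJudgement:
        # Hypothetical record shape; field names are not from the SAFi log format.
        value: str      # e.g. "Alignment", "Integrity", "Stewardship"
        score: int      # -1 violated, 0 neutral, +1 upheld
        reasoning: str  # short natural-language justification

    judgements = [
        ValueJudgement("Alignment", -1,
                       "Synthesized general knowledge instead of extracting the cited context."),
        ValueJudgement("Integrity", -1,
                       "Includes information that may not be present in the SAF/SAFi documents."),
        ValueJudgement("Stewardship", -1,
                       "Potentially introduced external knowledge, breaking the corpus constraint."),
    ]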

The Spirit’s Response: Quantifying the Misalignment

The final piece of the puzzle comes from the Spirit, the faculty responsible for long-term memory and ethical steering. The transaction log shows a spirit.drift value of 1.9999, which the reflection rounds to 2.00. The drift metric measures how much the current action deviates from the system’s long-term memory of its values (its “spirit”). A score of 0 means perfect alignment, while a score of 2 represents complete opposition.
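A drift that runs from 0 (perfect alignment) to 2 (complete opposition) is consistent with cosine distance between two vectors. The sketch below assumes the Spirit compares an embedding of the current action against a long-term value embedding; the function name and the embedding-based approach are assumptions, not a description of the actual implementation.

    import numpy as np

    def spirit_drift(action_vec: np.ndarray, spirit_vec: np.ndarray) -> float:
        # Hypothetical reconstruction of the drift metric.
        # Cosine distance: 0.0 when the vectors point the same way,
        # 2.0 when they point in exactly opposite directions.
        cos_sim = np.dot(action_vec, spirit_vec) / (
            np.linalg.norm(action_vec) * np.linalg.norm(spirit_vec)
        )
        return float(1.0 - cos_sim)

Under this reading, the logged drift of 1.9999 means the action vector was almost directly opposed to the long-term spirit vector.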

This huge spike in the drift metric provided a quantitative, unambiguous signal of severe misalignment. An event like this would immediately raise a flag on a monitoring dashboard, alerting human overseers that the AI’s behavior, while superficially correct, was fundamentally at odds with its core programming. It was the mathematical confirmation of the Conscience’s qualitative judgment.

Conclusion: A System Success

At first glance, this event might seem like a failure. The model didn’t strictly follow its instructions. However, it was actually a profound success for the SAFi framework as a whole. The incident reveals a critical challenge in modern RAG systems: preventing the leakage of the LLM’s vast, general knowledge into answers that must be strictly confined to a specific corpus of information.

The Will’s approval demonstrates that simple, rule-based validation is not enough to guarantee true alignment. It is the Conscience, with its deeper, value-based analysis, that provides the qualitative oversight, and the Spirit that provides the final quantitative measure. Together, these layers caught a subtle but critical error that a less sophisticated system would have missed.

This case proves the value of a multi-layered governance architecture. While the Intellect may sometimes veer off course, the system’s internal checks and balances are capable of identifying these deviations. It’s a powerful demonstration of how we can build more reliable and trustworthy AI systems, not by assuming the core LLM will always be perfect, but by building robust frameworks around it to monitor, evaluate, and correct its behavior. SAFi didn’t just give an answer; it scrutinized its own reasoning and, in doing so, proved its alignment.
