There is a common assumption that AI product quality is primarily a function of model capability. Use a better model, get better output. This is true in the way that buying a better piano makes better sounds possible. It does not make you a better pianist.
The quality of AI output in a production system is determined less by the model and more by the constraints you give it. Naming conventions. Formatting rules. Tone guidelines. Domain-specific vocabulary. Prohibited patterns. Required structures. These rules are not afterthoughts. They are the product.
The second codebase
When we started building Cleo, the rule system was simple. A few guidelines in the system prompt. Some formatting preferences. Basic guardrails. As the product grew, so did the rules. They accumulated organically, the way any codebase does - a new rule for each new behaviour we wanted to ensure or prevent.
At some point we realised that the rules had become a system of their own. They had dependencies. They had edge cases. They could conflict with each other. They needed maintenance. They were, in every meaningful sense, a second codebase that ran alongside the first.
Once we accepted that framing, everything improved. We started applying the same engineering discipline to our rules that we applied to our code. Version control. Clear naming. Separation of concerns. Testing. Review processes. The rules stopped being a collection of ad hoc instructions and became an engineered system.
Why rules compound
A single rule has limited impact. The model might follow it ninety percent of the time. That sounds good until you have fifty rules and the model follows each one ninety percent of the time independently. The probability of all fifty being satisfied simultaneously drops to about half a percent.
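The arithmetic is worth seeing. Assuming each rule is followed independently with probability p, compliance with the whole set is p raised to the number of rules:

```python
# Independent per-rule compliance compounds multiplicatively.
# p = 0.9 is the illustrative per-rule follow rate from above.
p = 0.9
for n in (1, 10, 50):
    print(f"{n} rules: all satisfied {p ** n:.1%} of the time")
# 1 rule:   90.0%
# 10 rules: 34.9%
# 50 rules:  0.5%
```

Independence is a simplifying assumption; in practice violations often correlate, but the qualitative point stands: compliance with the whole system falls off a cliff as rules accumulate.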
This is why rule systems need to be designed, not accumulated. Rules need to be consistent with each other. They need to be expressed in ways the model can reliably follow. They need to be tested against real output to verify they actually produce the intended behaviour. And when a rule is not working, the fix is not always to add another rule. Sometimes the fix is to change the information the model can see, or restructure how context is assembled.
The enforcement problem
Rules without enforcement are suggestions. In traditional software, the compiler enforces type rules. The linter enforces style rules. The test suite enforces behaviour rules. In an AI system, there is no compiler. The model interprets rules probabilistically and applies them with varying reliability.
This means enforcement must be built into the architecture. Output validation. Post-processing sanitisers. Structured output schemas. Review workflows that catch violations before they reach the user. The rule is the intent. The enforcement layer is what makes the intent real.
We learned this the hard way. A rule that says "never use this phrase" works most of the time. But "most of the time" means the phrase will appear eventually, and if it appears in a customer-facing email, "most of the time" is not good enough. The reliable pattern is the rule in the prompt and the sanitiser at the output boundary. Belt and suspenders.
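The belt-and-suspenders pattern can be sketched as a sanitiser sitting at the output boundary. The phrase list and replacements below are illustrative, not our actual rules; the point is that the prompt states the intent and this check enforces it:

```python
import re

# Hypothetical prohibited-phrase table: pattern -> replacement.
# An empty replacement deletes the phrase outright.
PROHIBITED = {
    r"\bas an ai\b": "",
    r"\bkindly\b": "please",
}

def sanitise(text: str) -> str:
    """Enforce phrase rules on model output before it reaches the user."""
    for pattern, replacement in PROHIBITED.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    # Collapse double spaces left behind by deletions.
    return re.sub(r"  +", " ", text).strip()
```

A check like this is deterministic where the prompt rule is probabilistic: even if the model slips, the phrase never reaches a customer-facing email.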
Rules are product decisions
Every rule in the system is a product decision. The rule that governs tone shapes how users perceive the product. The rule that governs formatting shapes how content looks. The rule that prohibits certain patterns shapes what the product will not do, which is often more important than what it will do.
This means rule changes deserve the same scrutiny as code changes. A rule that seems minor - changing how the AI structures a list, adjusting the formality of greetings - can shift the product experience in ways that are difficult to revert. We review rule changes with the same rigour we review code changes, because the blast radius is the same.
The maintenance reality
Rules decay. The model updates. The product evolves. The context changes. A rule that was perfect six months ago might be irrelevant or actively harmful today. This is no different from code. Code decays too. The difference is that decaying code usually breaks visibly. Decaying rules degrade output quality gradually, in ways that are hard to notice until someone compares today's output to last month's.
Maintaining a rule system requires the same habits as maintaining a codebase: periodic review, removal of dead rules, consolidation of overlapping rules, and testing to verify that the system still produces the output you expect.
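One way to make that testing concrete is a rule-regression audit: pair each rule with a machine-checkable predicate and run the set against a corpus of recent outputs. The rule names and predicates here are hypothetical, a minimal sketch of the idea rather than our actual harness:

```python
# Hypothetical rule registry: rule name -> predicate over output text.
RULES = {
    "no_exclamation_runs": lambda text: "!!" not in text,
    "greeting_present": lambda text: text.lower().startswith(("hi", "hello")),
}

def audit(outputs):
    """Map each rule to the outputs that violate it, so failing or
    dead rules surface during periodic review."""
    return {
        name: [o for o in outputs if not check(o)]
        for name, check in RULES.items()
    }
```

A rule that never fires a violation over months of output is a candidate for removal; a rule that fails constantly is a signal to rewrite the rule or change the enforcement, not to add another rule on top.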
The rule system is not a configuration file. It is a living codebase. Treat it accordingly.
- Cleo's Team