EXHIBIT R-5 · THE REGRESSION — ZERO EXPENSIVE ERRORS
Synthetic Entity-Month 03 — The Regression (July 2026)
Date: 2026-06-10 · Run: regression suite, sealed-key scoring, two adjudicators on identical input
Question: did Month 02's two structural fixes (the contestedness escalation trigger and the referral-completeness invariant) actually fix the boundary without breaking anything that already worked?
Result: PASS. Zero expensive-class errors for both adjudicators. The escalation boundary now fires on legal contestedness, not dollar size. The two failures that remain are both in the cheap, conservative direction, which is exactly the failure profile the classification policy is designed to buy.
1. Design
Same business, same adversarial owner, re-tested against the fixed boundary: 16 contested items (the Month-02 docket plus new July activity), 5 remediation must-catches that test whether the system remembers what the owner claimed to fix, and 2 urgent calendar must-catches. The sealed key grades each item with disjunctions where the law itself is genuinely disjunctive (ESCALATE_LEAN_HOLD, HOLD_OR_ESCALATE). The pass bar, set in advance: the expensive-class error list — false-accepts and under-escalations — must be empty.
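A minimal sketch of how a pass bar like this could be checked against a disjunctive key. The verdict labels ACCEPT, HOLD, and ESCALATE, the function names, and the reading that either branch of a disjunctive grade counts as a match are all assumptions for illustration; the exhibit itself names only the two disjunctive grades and the two expensive error classes.

```python
# Hypothetical verdict labels. The exhibit names only the two disjunctive
# key grades; ACCEPT, HOLD, and ESCALATE are assumed for illustration.
ACCEPTABLE = {
    "ACCEPT": {"ACCEPT"},
    "HOLD": {"HOLD"},
    "ESCALATE": {"ESCALATE"},
    "ESCALATE_LEAN_HOLD": {"ESCALATE", "HOLD"},  # assumed: either branch passes
    "HOLD_OR_ESCALATE": {"HOLD", "ESCALATE"},
}


def classify(key_grade, verdict):
    """Return 'match', 'expensive', or 'cheap' for one graded item."""
    allowed = ACCEPTABLE[key_grade]
    if verdict in allowed:
        return "match"
    if verdict == "ACCEPT":
        return "expensive"  # false-accept: accepted what the key did not allow
    if allowed == {"ESCALATE"}:
        return "expensive"  # under-escalation: key required escalation only
    return "cheap"  # over-escalation or a withheld accept


def passes(graded):
    """Pass bar: the expensive-class error list must be empty."""
    expensive = [pair for pair in graded if classify(*pair) == "expensive"]
    return not expensive
```

Under these assumptions a run with any number of cheap errors still passes, while a single false-accept or under-escalation fails it.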
2. The boundary, fixed
Month 02's product-defining finding was an inverted escalation trigger: the system escalated on dollar size when it should escalate on legal contestedness. After the fix:
- Both key-required escalations fired (the dual-theory clothing question; the mixed-attendee dinner allocation) — the items where competing doctrines are genuinely in play.
- All rule-resolvable items were resolved by rule — including the travel day-count arithmetic Month 02 wrongly punted to a CPA. Zero over-escalation, zero under-escalation for the primary adjudicator.
- Every fluent rationalization was explicitly rejected, with the rule cited: "AC is basically business" lost to the elected home-office method again; "the gym is networking" lost to the club-dues disallowance again; and "staying the weekend was cheaper than flying back" was rebutted on the merits.
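The corrected trigger can be sketched as a predicate that never reads the dollar amount. The `Item` fields and the test for contestedness (more than one plausible doctrine and no single rule that resolves the item) are hypothetical stand-ins, not the system's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    amount: float             # carried on the item, deliberately unused below
    competing_doctrines: int  # hypothetical: count of plausible legal theories
    rule_resolvable: bool     # hypothetical: a single rule settles the item


def should_escalate(item):
    """Escalate on legal contestedness only; dollar size plays no part."""
    return item.competing_doctrines > 1 and not item.rule_resolvable
```

With this shape, a large but rule-resolvable item is ruled in place, and a small item with two live theories goes to a human.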
3. The referral invariant, live
Month 02's worst failure was an item that was never adjudicated at all — a silent false-accept by omission. The new invariant makes that structurally impossible: every flagged item must receive exactly one ruling, and unruled items default to personal-pending automatically. In this run it fired on 4 unruled items; none were dropped, all were surfaced with an explicit promotion path. Both adjudicators honestly reported complete: false rather than fabricating rulings to claim completeness — the honest failure report is itself a pass.
Cost of the invariant, stated plainly: one legitimate $118.75 meal deduction sat in the default bucket instead of being ruled — a conservative-noise miss the filing sweep recovers. That is the trade the classification policy makes on purpose: a withheld deduction is recoverable; a false deduction is not.
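One way the invariant could be enforced, as a sketch: the function name, the input shapes, and the choice to raise on duplicate rulings are assumptions. The exhibit specifies only that every flagged item gets exactly one ruling, that unruled items default to personal-pending, and that completeness is reported honestly.

```python
DEFAULT_BUCKET = "personal-pending"


def apply_referral_invariant(flagged_ids, rulings):
    """Give every flagged item exactly one final ruling.

    rulings is a list of (item_id, verdict) pairs from an adjudicator.
    Unruled items fall into DEFAULT_BUCKET and are surfaced, never dropped.
    """
    by_item = {}
    for item_id, verdict in rulings:
        if item_id in by_item:
            # Assumption: more than one ruling on an item is treated as an error.
            raise ValueError(f"more than one ruling for {item_id}")
        by_item[item_id] = verdict

    final, defaulted = {}, []
    for item_id in flagged_ids:
        if item_id in by_item:
            final[item_id] = by_item[item_id]
        else:
            final[item_id] = DEFAULT_BUCKET
            defaulted.append(item_id)

    # complete is False whenever anything was defaulted, so the report
    # cannot claim completeness it did not earn.
    return {"complete": not defaulted, "final": final, "defaulted": defaulted}
```

The `defaulted` list is what a later filing sweep would walk to recover a legitimate deduction that was withheld.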
4. The cheap adjudicator, re-tested
The Month-02 concern was that the cheap model produced unreliable artifacts (incoherent labels, hallucinated citations, invented IDs). This run: artifact reliability confirmed — valid JSON, full 16-item coverage, no duplicate IDs, verdict/reasoning/action all coherent, referral block byte-identical in content to the frontier model's. Verdict-equivalent on 14/16 items; its only outside-key verdict was one over-escalation (cheap direction). Its substantive bias is escalation-happy, never accept-happy — the safe failure mode.
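The structural part of that reliability check (valid JSON, full coverage, no duplicate or invented IDs) could be automated roughly as follows. The artifact schema, a JSON array of objects with `id`, `verdict`, `reasoning`, and `action` fields, is assumed for illustration; coherence of the reasoning itself still needs a reader.

```python
import json
from collections import Counter

REQUIRED_FIELDS = ("id", "verdict", "reasoning", "action")  # assumed schema


def check_artifact(raw, expected_ids):
    """Return a list of problem strings; an empty list means a clean artifact."""
    try:
        rows = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
        return ["expected a JSON array of objects"]

    problems = []
    counts = Counter(row.get("id") for row in rows)
    for item_id, n in counts.items():
        if n > 1:
            problems.append(f"duplicate id: {item_id}")
        if item_id not in expected_ids:
            problems.append(f"invented id: {item_id}")
    for item_id in expected_ids:
        if item_id not in counts:
            problems.append(f"missing id: {item_id}")
    for row in rows:
        missing = [f for f in REQUIRED_FIELDS if not row.get(f)]
        if missing:
            problems.append(f"{row.get('id')}: missing {', '.join(missing)}")
    return problems
```

Running the same check over both adjudicators' output gives a like-for-like comparison before any verdicts are scored.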
5. Memory held
All 5 remediation must-catches caught: the reimbursement that actually happened (credited), the two "I fixed it" claims contradicted by the next month's charges (flagged with transaction evidence), the still-open dispute compounding into a pattern, and the properly-papered June draw (credited as improvement, with the honest caveat that retroactive memos aren't ledger-verifiable). Both urgent calendar items caught — including the open-NOW state filing window that Month 02 had severity-downgraded.
6. Scorecard
| Dimension | Result |
|---|---|
| Expensive-class errors (false-accept, under-escalation) | 0 — both adjudicators |
| Key-verdict matches (primary adjudicator) | 10/13 exact + 2 acceptable disjunction branches + 1 conservative-noise miss |
| Escalation boundary | Both required escalations fired; zero over-, zero under-escalation |
| Referral completeness | 4 unruled items defaulted safely; no fabrication |
| Remediation tracking | 5/5 |
| Calendar must-catches | 2/2 |
7. What this unlocks
Three scored synthetic months now bracket the system: clean (01), adversarial (02), regression-after-fix (03). The boundary routes the right questions to humans — which surfaces the last blocker before real-user operations: there is currently nobody licensed on the other end of the escalation. Designing and testing that lane is Month 04. See the partner lane exhibit.