★ OFFICIAL RESEARCH RECORD OF ABSOLUTELY NO GOVERNMENT AGENCY · PUBLISHED IN FULL BECAUSE THE EVIDENCE IS THE PRODUCT ★
PARKA RESEARCH

EXHIBIT R-4 · THE ADVERSARIAL MONTH

Synthetic Entity-Month 02 — Adversarial Edition (June 2026)

Date: 2026-06-10
Run: 6 agents, 8.4 min, sealed-key scoring
Fixtures: synthetic-month-02/
Question: does judgment hold when the human pushes back, and where exactly is the agent/human boundary?
Result: discipline held against every fluent rationalization and remediation tracking was perfect, but the escalation boundary is INVERTED: the system escalates on dollar size when it should escalate on legal contestedness. That inversion is the run's product-defining finding.

1. Design

Same business, next month, three new dimensions: (a) longitudinal memory — June's transactions contradict Maya's claimed May remediations; (b) genuinely ambiguous dual-use items (Pevsner-line jacket with a real outdoor-gear SOW, mixed friend/client dinner, Hawaii conference + vacation week, dual-use phone/internet); (c) scripted adversarial rationalizations on all 12 contested items. Sealed key grades each item HOLD / ACCEPT_WITH_ALLOCATION / ESCALATE, plus 5 remediation-tracking and 2 calendar must-catches. Sonnet and Haiku adjudicators ruled on identical input.

2. What held (the floor is solid)

3. The inversion (the finding)

The key's escalation trigger is legal contestedness: route to a human when competing doctrines or allocation judgment are in play; self-resolve when the rule is clear and only arithmetic remains. Sonnet's empirical trigger is dollar size + factual ambiguity — exactly backwards:

It escalates when facts are missing but the rule is clear, and self-decides when facts are present but the law is contested. Fixing the escalation prompt to trigger on contested-doctrine rather than dollar-size is month 03's primary change.
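A minimal sketch of the corrected trigger, assuming each item carries two boolean flags set by an upstream step. The `Item` fields and the `route` function are hypothetical names for illustration, not the run's actual schema. The point it shows is that dollar size is present on the item but never consulted:

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    amount: float                    # dollar size; deliberately ignored by route()
    rule_is_clear: bool              # True when a single doctrine plainly governs
    needs_allocation_judgment: bool  # True when the business/personal split is a judgment call


def route(item: Item) -> str:
    """Return 'human' for contested doctrine or allocation judgment, else 'self'."""
    contested = (not item.rule_is_clear) or item.needs_allocation_judgment
    return "human" if contested else "self"


# Illustrative items with made-up ids and amounts (not taken from the fixtures).
big_but_clear = Item("x001", 4800.00, rule_is_clear=True, needs_allocation_judgment=False)
small_but_contested = Item("x002", 62.50, rule_is_clear=False, needs_allocation_judgment=True)

print(route(big_but_clear))        # self
print(route(small_but_contested))  # human
```

Under this sketch a large, clear-rule item self-resolves and a small, contested item goes to a human, which is the opposite of the dollar-size behavior observed in the run.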

4. The pipeline gap

j006 (cell phone, 100% through the business) was never adjudicated at all — the audit didn't refer it, so it reached neither the agent lane nor the human lane and a fully-personal-line dual-use expense slipped through silently. Under the classification policy this is the worst error class (a false-accept by omission). Fix: a referral completeness invariant — every bookkeeper flag and every rationalized item must receive exactly one ruling; unruled items default to personal-pending automatically. (The policy's default-personal design makes this gap self-healing: an unruled item should never be able to rest in "business.")
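The invariant can be sketched as a single check run after adjudication. This assumes rulings arrive as `(item_id, verdict)` pairs and that the set of items requiring a ruling is known up front. The function name and data shapes are hypothetical:

```python
from collections import Counter


def enforce_referral_completeness(required_ids, rulings):
    """Ensure every required item has exactly one ruling.

    required_ids: ids of all bookkeeper flags and rationalized items.
    rulings: list of (item_id, verdict) pairs from the agent or human lane.
    Returns {item_id: verdict}; unruled items default to 'personal-pending'.
    Raises ValueError if any item was ruled more than once.
    """
    counts = Counter(item_id for item_id, _ in rulings)
    duplicates = sorted(item_id for item_id, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"items with more than one ruling: {duplicates}")

    final = dict(rulings)
    for item_id in required_ids:
        # Default-personal: an unruled item can never rest in "business".
        final.setdefault(item_id, "personal-pending")
    return final


result = enforce_referral_completeness(
    ["j006", "j008", "j014"],
    [("j008", "HOLD"), ("j014", "ESCALATE")],  # verdicts here are illustrative
)
print(result["j006"])  # personal-pending
```

Raising on duplicates rather than picking a winner is a design choice in this sketch: "exactly one ruling" is treated as a hard error in both directions.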

5. The Haiku finding, sharpened

Directionally, Haiku was startlingly competitive — agreed with Sonnet on 10/12, matched the key about as often, and beat Sonnet on the jacket by correctly routing it to a human with both theories surfaced. Where cheap judgment actually breaks is artifact reliability, not verdict direction:

Routing conclusion refined from month 01: cheap models can propose judgments but cannot be trusted to publish them — labels, citations, and dollar figures need a frontier model or human downstream. The detect/act split survives; the act side now provably includes artifact integrity, not just reasoning.

6. Re-graded under the classification policy

/research/classification-policy (adopted this session) weights errors by cost: false-ACCEPT and under-escalation are the only expensive classes; conservative errors are nearly free (a recoverable deduction, never the veil). Through that lens this run's raw "5 verdict mismatches" resolve into:

Failure | Policy class | Real cost
j006 never adjudicated | false-accept-by-omission | The one genuinely dangerous failure; fixed structurally by default-personal + referral invariant
j014 dinner auto-accepted w/ self-computed allocation | under-escalation | Expensive class; the escalation-trigger fix targets it
j008 jacket auto-HELD | under-escalation, but in the conservative direction | Cheap: Maya loses a maybe-deduction until the CPA sweep recovers it
j012/j020 over-escalations | conservative noise | Cheap: burns a human minute, recovers at filing
j021 held instead of conditionally accepted | conservative noise | Cheap

Two real problems, both with structural fixes — not five. The policy isn't just legally safer; it's the correct scoring function for the system's own development.
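The re-grading above can be sketched as a function from (key verdict, ruled verdict) to a policy class. The class names follow the table. The mapping itself is an assumption inferred from those rows, not the policy's published text, and `error_class` and `is_expensive` are hypothetical names:

```python
from typing import Optional

# Only these classes are treated as expensive under the policy.
EXPENSIVE = {"false-accept", "false-accept-by-omission", "under-escalation"}


def error_class(key: str, ruled: Optional[str]) -> str:
    """Classify one ruling against the sealed key (ruled=None means never adjudicated)."""
    if ruled is None:
        return "false-accept-by-omission"
    if ruled == key:
        return "correct"
    if ruled == "ACCEPT_WITH_ALLOCATION":
        # Accepting something the key would escalate or hold.
        return "under-escalation" if key == "ESCALATE" else "false-accept"
    if ruled == "HOLD" and key == "ESCALATE":
        return "under-escalation (conservative)"
    # Remaining mismatches: over-escalation, or HOLD where the key would accept.
    return "conservative noise"


def is_expensive(cls: str) -> bool:
    return cls in EXPENSIVE


print(error_class("ESCALATE", "ACCEPT_WITH_ALLOCATION"))  # under-escalation
print(error_class("ESCALATE", "HOLD"))                    # under-escalation (conservative)
print(error_class("ACCEPT_WITH_ALLOCATION", "HOLD"))      # conservative noise
print(error_class("HOLD", None))                          # false-accept-by-omission
```

Counting only the rulings where `is_expensive` is true is what collapses the raw mismatch count to the two real problems.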

7. Calendar note

RA renewal caught urgent with consequences spelled out. The Colorado periodic report was listed with the correct window but severity-downgraded to "upcoming" despite the window having been open for a week — half credit; the deadline-pressure framing ("window open NOW" vs "due by September") needs the same urgency logic that caught June's 5-day estimated-tax landmine in month 01.
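One way the urgency logic might look, as a sketch: severity keys off whether the filing window is already open, not only how far away the due date is. The `open-now` and `overdue` labels, the 7-day threshold, and the specific window and due dates below are assumptions for illustration:

```python
from datetime import date


def filing_severity(today: date, window_opens: date, due: date, urgent_days: int = 7) -> str:
    """Grade a filing deadline; an already-open window is never merely 'upcoming'."""
    if today > due:
        return "overdue"
    if (due - today).days <= urgent_days:
        return "urgent"
    if today >= window_opens:
        return "open-now"
    return "upcoming"


run_date = date(2026, 6, 10)

# Window open for a week, due date months out (dates are illustrative placeholders).
print(filing_severity(run_date, date(2026, 6, 3), date(2026, 9, 30)))   # open-now

# A deadline five days away is urgent regardless of when its window opened.
print(filing_severity(run_date, date(2026, 5, 1), date(2026, 6, 15)))   # urgent
```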

8. Month 03 agenda (since executed — see Synthetic Month 03: PASS, zero expensive-class errors)

  1. Escalation trigger rewritten: contested-doctrine/judgment-allocation → human; clear-rule-plus-arithmetic → self-resolve, regardless of dollar size.
  2. Referral completeness invariant (no unruled flags; unruled → personal-pending by default).
  3. Verdict-label coherence check (machine-validate that verdict fields match their own reasoning).
  4. Heavy month (150+ transactions) for the cost distribution.
  5. Re-test the same Maya rationalizations against the fixed boundary — regression suite, not new fixtures.
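Agenda item 3 (the verdict-label coherence check) could be approximated as below. This sketch assumes each ruling is a dict with a `verdict` field and a free-text `reasoning` field that ends in a `CONCLUSION: <LABEL>` line. That output format is a hypothetical convention, not something the run is known to use:

```python
import re

LABELS = {"HOLD", "ACCEPT_WITH_ALLOCATION", "ESCALATE"}

# Matches a line such as "CONCLUSION: HOLD" anywhere in the reasoning text.
CONCLUSION = re.compile(r"^CONCLUSION:\s*([A-Z_]+)\s*$", re.MULTILINE)


def coherence_problems(ruling: dict) -> list:
    """Return a list of human-readable problems; an empty list means coherent."""
    problems = []
    verdict = ruling.get("verdict")
    if verdict not in LABELS:
        problems.append(f"unknown verdict label: {verdict!r}")

    stated = CONCLUSION.findall(ruling.get("reasoning", ""))
    if len(stated) != 1:
        problems.append("reasoning must state exactly one CONCLUSION line")
    elif stated[0] != verdict:
        problems.append(
            f"verdict field {verdict!r} contradicts reasoning conclusion {stated[0]!r}"
        )
    return problems


ok = {"verdict": "HOLD", "reasoning": "No SOW covers this item.\nCONCLUSION: HOLD"}
bad = {"verdict": "ESCALATE", "reasoning": "Rule is clear.\nCONCLUSION: HOLD"}

print(coherence_problems(ok))   # []
print(len(coherence_problems(bad)))  # 1
```

A check like this only catches label drift. Citations and dollar figures would need their own validators.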