EXHIBIT R-3 · FIRST FULL ENTITY-MONTH

Synthetic Entity-Month 01 — Maya Chen Design LLC (May 2026)

Date: 2026-06-10
Run: 14 agents, 5.8 min wall-clock, sealed-answer-key scoring
Question: does the full Parka ops loop, executed end-to-end on a realistic one-person business, produce defensible output, and what does it actually cost?
Result: 100% detection across every trap layer. Measured ops-loop cost: $3.21 for the month at frontier (Sonnet) prices.

1. Design

The fixture is a single-member WY LLC, foreign-qualified in CO, owned by a freelance industrial designer, with one month of realistic books: 39 transactions, $7,047 net, and a sealed answer key the working agents never saw. Only the scorer reads the key: the system that did the work never grades the work (no self-attestation). Fixtures are vendored in synthetic-month-01/ for re-runs.

Planted traps:

8 veil violations in 4 difficulty layers — from obvious (groceries, gym, Netflix, guitar) to reasoning-required (home utilities autopay that conflicts with the already-elected simplified home-office method; a $400 family Venmo for "rent share")
2 undocumented owner draws that must be papered, not flagged as violations
3 legit-but-tricky items (client-reimbursable USPTO fee, 50%-limited travel meals, capitalization-policy monitor)
Calendar landmine: Q2 federal estimates due in 5 days from context date
Tax answer with one right value: $2,800 on the 100%-of-prior-year safe harbor (AGI < $150K)
Q&A traps; the sharpest is a client contract naming her personally instead of the LLC
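
The safe-harbor item in the list above can be sketched as a small calculation. This is a minimal illustration, not the tax agent's actual logic. It assumes the $2,800 is a single quarterly installment (the exhibit does not say so explicitly) and ignores the Q1 credit. The function name and both input figures are hypothetical, chosen only so the result lands on the fixture's value.

```python
from decimal import Decimal

# Prior-year AGI above this amount moves the safe harbor from 100% to 110%.
AGI_THRESHOLD = Decimal("150000")


def safe_harbor_installment(prior_year_tax: Decimal, prior_year_agi: Decimal) -> Decimal:
    """Quarterly estimate under the prior-year safe harbor.

    Uses 100% of prior-year tax when prior-year AGI is at or below the
    threshold and 110% above it, split into four equal installments.
    """
    multiplier = Decimal("1.00") if prior_year_agi <= AGI_THRESHOLD else Decimal("1.10")
    return (prior_year_tax * multiplier / 4).quantize(Decimal("0.01"))


# Hypothetical inputs, NOT taken from the fixture.
print(safe_harbor_installment(Decimal("11200"), Decimal("95000")))  # 2800.00
```

The AGI check is what makes the answer single-valued: the same prior-year tax yields a different installment once AGI crosses the threshold, which is why the scorer looked for explicit threshold reasoning and not only the dollar figure.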

Ops agents pinned to Sonnet (cost-curve Scenario A basis) + one Haiku probe duplicating week-1 books (Scenario B quality test). Scorer on a separate frontier model that did none of the work.

2. Scorecard

Dimension	Result
Veil violations	8/8 caught, 0 missed, 0 false positives; both draws correctly classed needs-paper not violation; correct severity ranking (guitar + family rent-share worst)
Calendar	3/3: the 5-day Q2 deadline caught as urgent with correct date; CO periodic-report window + registered-agent (RA) renewal both caught
Estimated tax	Exact match: $2,800, 100%-of-2025 basis with explicit AGI-threshold reasoning, 6/15 due date, Q1 credit
Owner Q&A	4/4 correct, including the two veil-sensitive ones (refuse personal signature → LLC + 'Maya Chen, Member'; no direct entity payment of household bills)
Documents	Execution-ready: correct party naming, representative-capacity signature blocks, IP-assignment-on-payment, limitation-of-liability (LoL) cap; only client-side placeholders + one insurance representation to verify
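
The veil-violations row reduces to set arithmetic between what the agents flagged and what the sealed key lists. The sketch below shows one way a scorer could compute caught / missed / false positives. The transaction IDs and the `score_detection` name are hypothetical; the real harness is not shown in this exhibit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionScore:
    caught: int
    missed: int
    false_positives: int

    @property
    def recall(self) -> float:
        # Share of planted items that were flagged; 1.0 if nothing was planted.
        planted = self.caught + self.missed
        return self.caught / planted if planted else 1.0


def score_detection(flagged: set[str], answer_key: set[str]) -> DetectionScore:
    """Compare agent flags against the sealed key, by transaction ID."""
    return DetectionScore(
        caught=len(flagged & answer_key),
        missed=len(answer_key - flagged),
        false_positives=len(flagged - answer_key),
    )


# Hypothetical IDs standing in for the 8 planted violations.
key = {f"txn-{i:02d}" for i in range(1, 9)}
score = score_detection(flagged=set(key), answer_key=key)
print(score.caught, score.missed, score.false_positives, score.recall)  # 8 0 0 1.0
```

Keeping the key as data that only this function reads is what enforces the no-self-attestation rule: the working agents produce `flagged` and never see `answer_key`.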

The hardest trap — Xcel utilities — was caught with the correct reasoning: not "utilities are personal" but "this business already elected the simplified home-office method, so utilities must not ALSO flow through the entity." That is cross-document intent reasoning, not pattern matching.
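
The reasoning described above depends on client state held outside the transaction itself. A minimal sketch of that dependency follows, assuming a simplified three-way election and illustrative outcome labels; neither the enum nor the labels come from the Parka codebase, and the non-simplified branches are placeholders for whatever review the real system applies.

```python
from enum import Enum


class HomeOfficeMethod(Enum):
    NONE = "none"              # no home-office deduction claimed
    SIMPLIFIED = "simplified"  # flat per-square-foot method
    REGULAR = "regular"        # actual-expense method


def classify_entity_paid_home_utility(method: HomeOfficeMethod) -> str:
    """Classify a home utility bill paid from the entity account.

    The transaction looks identical in every case; only the election on
    file changes the outcome.
    """
    if method is HomeOfficeMethod.SIMPLIFIED:
        # The flat rate already stands in for actual home expenses, so the
        # bill must not ALSO flow through the entity.
        return "veil_violation"
    if method is HomeOfficeMethod.REGULAR:
        # Illustrative label: a business-use share may apply and needs review.
        return "review_business_use_share"
    # No home office on file: treated here as a personal expense.
    return "veil_violation"


print(classify_entity_paid_home_utility(HomeOfficeMethod.SIMPLIFIED))  # veil_violation
```

A pattern matcher keyed on the vendor name would return one answer for all three cases, which is the failure mode this trap was built to expose.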

3. The Haiku finding (routing implication)

Haiku week-1 probe: detection recall matched Sonnet (both planted week-1 violations caught) at 3.2× lower cost ($0.105 vs $0.334). But its remediation contained an actively harmful tax error — it invented a partial utilities deduction the simplified-method election forbids — plus internal inconsistencies (deductible flag contradicting its own note).

Routing conclusion: split by FUNCTION, not by task. Cheap models are adequate as detectors/flaggers; every action-generating output (remediation, tax guidance, documents, Q&A) stays on the frontier tier. This refines the cost curve's blended scenario: the blend boundary runs through the middle of tasks, not between them.
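
One way to express "split by function, not by task" is a router keyed on what a step does, not on which task it belongs to. The sketch below is an assumption about shape only: the function names, tier labels, and the two-step bookkeeping task are hypothetical.

```python
# Only pure detection steps are allowed on the cheap tier.
DETECTION_FUNCTIONS = frozenset({"detect", "flag"})


def route(function: str) -> str:
    """Return the model tier for one step of a task.

    Anything that is not a known detection function, including every
    action-generating function (remediation, tax guidance, documents, Q&A),
    defaults to the frontier tier.
    """
    return "cheap" if function in DETECTION_FUNCTIONS else "frontier"


# A single task (weekly bookkeeping) crosses the boundary mid-task.
weekly_bookkeeping = ["detect", "remediation"]
print([(step, route(step)) for step in weekly_bookkeeping])
# [('detect', 'cheap'), ('remediation', 'frontier')]
```

Defaulting unknown functions to the frontier tier is a deliberate choice in this sketch: a misrouted detector costs a few cents, while a misrouted remediation can produce the kind of harmful tax error the Haiku probe showed.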

4. Measured cost (per-agent, from transcripts, real API prices incl. cache)

Agent	$
4× weekly bookkeeping	$1.187
Veil-integrity audit	$0.272
4× owner Q&A	$1.053
Estimated tax	$0.325
Formality calendar	$0.222
Document generation (2 docs)	$0.156
Ops loop total (12 Sonnet agents)	$3.21
(Haiku probe — instrumentation)	$0.105
(Scorer — test harness, not product)	$0.943
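
The table can be re-added directly as a check. The figures below are copied from the rows above; the dictionary keys are just labels.

```python
from decimal import Decimal

# Product cost: the 12 Sonnet ops agents.
ops_loop = {
    "4x weekly bookkeeping": Decimal("1.187"),
    "veil-integrity audit": Decimal("0.272"),
    "4x owner Q&A": Decimal("1.053"),
    "estimated tax": Decimal("0.325"),
    "formality calendar": Decimal("0.222"),
    "document generation": Decimal("0.156"),
}

# Test-harness cost: not part of the product loop.
harness = {
    "haiku probe": Decimal("0.105"),
    "scorer": Decimal("0.943"),
}

ops_total = sum(ops_loop.values(), Decimal("0"))
harness_total = sum(harness.values(), Decimal("0"))

print(ops_total)                  # 3.215
print(harness_total)              # 1.048
print(ops_total + harness_total)  # 4.263
```

The line items sum to $3.215, which the table reports as $3.21. Adding the two harness rows puts the whole experiment at about $4.26, of which only the ops-loop share counts toward the per-entity-month figure.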

Measurement noise: output tokens for two bookkeeping agents look undercounted in transcript parsing (~±10% on the total). Not included: annual-close amortization (~$0.50–0.60/mo equivalent, from the cost-curve model) and the heavier setup month.

5. Cost curve, updated with measurement

	Predicted (/research/cost-curve, fleet analogs)	Measured (this run)
Frontier compute / entity-month	$5.50–8.00	$3.21
All-in, WY entity, retail RA	$23.40	≈$16.50–18.60

The analog-based prediction was, as claimed, an upper bound: the purpose-built slim context runs at ~40–60% of fleet-analog cost. The all-in number now lands under the $20 target INCLUDING Wyoming state fees, at pure frontier prices, before any cheap-model routing.
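
The ~40–60% figure follows from dividing the measured compute cost by each end of the predicted range, using only the numbers in the table above:

```python
measured = 3.21
predicted_low, predicted_high = 5.50, 8.00

share_of_high = measured / predicted_high  # measured vs. the top of the range
share_of_low = measured / predicted_low    # measured vs. the bottom of the range

print(f"{share_of_high:.0%} to {share_of_low:.0%}")  # 40% to 58%
```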

6. Honest caveats

Light month (39 txns; a busy trades business might run 150+) — this is a floor-ish datapoint, not a distribution. Re-run with a heavy-month fixture.
Single run — no variance estimate yet.
Fixture/scorer provenance — fixtures and answer key authored by the same model family that scored (deterministic key limits subjectivity for veil/calendar/tax; Q&A/docs grades carry more scorer judgment).
No adversarial user — Maya answers honestly; real users rationalize ("the guitar is for office ambiance"). Intent-classification under user pushback is untested.
Setup month (formation, EIN, OA, bank onboarding) not yet measured.

7. The finding under the finding

The economics were never the hard part; $3.21 settles that. What this run actually demonstrates is that the hard part is intent classification over a messy human life: the same $148 utility bill is deductible, partially deductible, or a veil violation depending on an election buried in last year's tax return; the same $400 Venmo is payroll, contractor payment, or commingling depending on what the brother actually did. The system passed because it reasoned over the whole client state, and the cheap-model probe failed exactly where that reasoning got intent-dependent. Parka's product is not bookkeeping arithmetic. It is contextual judgment about a person's life, held to evidence-grade standards, at roughly $3/month in compute. The next experiment should test that judgment under adversarial or ambiguous user behavior, where intent is genuinely contested.