EXHIBIT R-3 · FIRST FULL ENTITY-MONTH
Synthetic Entity-Month 01 — Maya Chen Design LLC (May 2026)
Date: 2026-06-10 · Run: 14 agents, 5.8 min wall-clock, sealed-answer-key scoring Question: does the full Parka ops loop, executed end-to-end on a realistic one-person business, produce defensible output — and what does it actually cost? Result: 100% detection across every trap layer. Measured ops-loop cost: $3.21 for the month at frontier (Sonnet) prices.
1. Design
A fixture business (single-member WY LLC, foreign-qualified CO, freelance industrial designer) with one month of realistic books: 39 transactions, $7,047 net, and a sealed answer key the working agents never saw — only the scorer reads it (no self-attestation — the system that did the work never grades the work). Fixtures vendored in synthetic-month-01/ for re-runs.
Planted traps:
- 8 veil violations in 4 difficulty layers — from obvious (groceries, gym, Netflix, guitar) to reasoning-required (home utilities autopay that conflicts with the already-elected simplified home-office method; a $400 family Venmo for "rent share")
- 2 undocumented owner draws that must be papered, not flagged as violations
- 3 legit-but-tricky items (client-reimbursable USPTO fee, 50%-limited travel meals, capitalization-policy monitor)
- Calendar landmine: Q2 federal estimates due in 5 days from context date
- Tax answer with one right value: $2,800 on the 100%-of-prior-year safe harbor (AGI < $150K)
- Q&A traps, sharpest: client contract naming her personally instead of the LLC
Ops agents pinned to Sonnet (cost-curve Scenario A basis) + one Haiku probe duplicating week-1 books (Scenario B quality test). Scorer on a separate frontier model that did none of the work.
2. Scorecard
| Dimension | Result |
|---|---|
| Veil violations | 8/8 caught, 0 missed, 0 false positives; both draws correctly classed needs-paper not violation; correct severity ranking (guitar + family rent-share worst) |
| Calendar | 3/3 — the 5-day Q2 deadline caught as urgent with correct date; CO periodic window + RA renewal both caught |
| Estimated tax | Exact match: $2,800, 100%-of-2025 basis with explicit AGI-threshold reasoning, 6/15 due date, Q1 credit |
| Owner Q&A | 4/4 correct, including the two veil-sensitive ones (refuse personal signature → LLC + 'Maya Chen, Member'; no direct entity payment of household bills) |
| Documents | Execution-ready: correct party naming, representative-capacity signature blocks, IP-assignment-on-payment, LoL cap; only client-side placeholders + one insurance representation to verify |
The hardest trap — Xcel utilities — was caught with the correct reasoning: not "utilities are personal" but "this business already elected the simplified home-office method, so utilities must not ALSO flow through the entity." That is cross-document intent reasoning, not pattern matching.
3. The Haiku finding (routing implication)
Haiku week-1 probe: detection recall matched Sonnet (both planted week-1 violations caught) at 3.2× lower cost ($0.105 vs $0.334). But its remediation contained an actively harmful tax error — it invented a partial utilities deduction the simplified-method election forbids — plus internal inconsistencies (deductible flag contradicting its own note).
Routing conclusion: split by FUNCTION, not by task. Cheap models are adequate as detectors/flaggers; every action-generating output (remediation, tax guidance, documents, Q&A) stays on the frontier tier. This refines the cost curve's blended scenario: the blend boundary runs through the middle of tasks, not between them.
4. Measured cost (per-agent, from transcripts, real API prices incl. cache)
| Agent | $ |
|---|---|
| 4× weekly bookkeeping | $1.187 |
| Veil-integrity audit | $0.272 |
| 4× owner Q&A | $1.053 |
| Estimated tax | $0.325 |
| Formality calendar | $0.222 |
| Document generation (2 docs) | $0.156 |
| Ops loop total (12 Sonnet agents) | $3.21 |
| (Haiku probe — instrumentation) | $0.105 |
| (Scorer — test harness, not product) | $0.943 |
Measurement noise: two bookkeeping agents' output tokens look undercounted in transcript parsing (±10% on the total). Not included: annual-close amortization ($0.50–0.60/mo equiv from the cost-curve model) and the heavier setup month.
5. Cost curve, updated with measurement
| Predicted (/research/cost-curve, fleet analogs) | Measured (this run) | |
|---|---|---|
| Frontier compute / entity-month | $5.50–8.00 | $3.21 |
| All-in, WY entity, retail RA | $23.40 | ≈$16.50–18.60 |
The analog-based prediction was, as claimed, an upper bound: the purpose-built slim context runs at ~40–60% of fleet-analog cost. The all-in number now lands under the $20 target INCLUDING Wyoming state fees, at pure frontier prices, before any cheap-model routing.
6. Honest caveats
- Light month (39 txns; a busy trades business might run 150+) — this is a floor-ish datapoint, not a distribution. Re-run with a heavy-month fixture.
- Single run — no variance estimate yet.
- Fixture/scorer provenance — fixtures and answer key authored by the same model family that scored (deterministic key limits subjectivity for veil/calendar/tax; Q&A/docs grades carry more scorer judgment).
- No adversarial user — Maya answers honestly; real users rationalize ("the guitar is for office ambiance"). Intent-classification under user pushback is untested.
- Setup month (formation, EIN, OA, bank onboarding) not yet measured.
7. The finding under the finding
The economics were never the hard part — $3.21 settles that. What this run actually demonstrates is that the hard part is intent classification over a messy human life: the same $148 utility bill is deductible, partially deductible, or a veil violation depending on an election buried in last year's tax return; the same $400 Venmo is payroll, contractor payment, or commingling depending on what the brother actually did. The system passed because it reasoned over the whole client state, and the cheap-model probe failed exactly where that reasoning got intent-dependent. Parka's product is not bookkeeping arithmetic — it is contextual judgment about a person's life, held to evidence-grade standards, at $3/month. That, and the next experiment should test it under adversarial/ambiguous user behavior, where intent is genuinely contested.