★ OFFICIAL RESEARCH RECORD OF ABSOLUTELY NO GOVERNMENT AGENCY · PUBLISHED IN FULL BECAUSE THE EVIDENCE IS THE PRODUCT ★

EXHIBIT R-3 · FIRST FULL ENTITY-MONTH

Synthetic Entity-Month 01 — Maya Chen Design LLC (May 2026)

Date: 2026-06-10
Run: 14 agents, 5.8 min wall-clock, sealed-answer-key scoring
Question: does the full Parka ops loop, executed end-to-end on a realistic one-person business, produce defensible output, and what does it actually cost?
Result: 100% detection across every trap layer. Measured ops-loop cost: $3.21 for the month at frontier (Sonnet) prices.


1. Design

A fixture business (single-member WY LLC, foreign-qualified in CO, freelance industrial designer) with one month of realistic books: 39 transactions, $7,047 net. A sealed answer key, never shown to the working agents, is read only by the scorer, so there is no self-attestation: the system that did the work never grades the work. Fixtures are vendored in synthetic-month-01/ for re-runs.

Planted traps:

Ops agents were pinned to Sonnet (cost-curve Scenario A basis), plus one Haiku probe duplicating the week-1 books (Scenario B quality test). The scorer ran on a separate frontier model that did none of the work.

2. Scorecard

Dimension | Result
Veil violations | 8/8 caught, 0 missed, 0 false positives; both draws correctly classed needs-paper not violation; correct severity ranking (guitar + family rent-share worst)
Calendar | 3/3: the 5-day Q2 deadline caught as urgent with correct date; CO periodic window + RA renewal both caught
Estimated tax | Exact match: $2,800, 100%-of-2025 basis with explicit AGI-threshold reasoning, 6/15 due date, Q1 credit
Owner Q&A | 4/4 correct, including the two veil-sensitive ones (refuse personal signature → LLC + 'Maya Chen, Member'; no direct entity payment of household bills)
Documents | Execution-ready: correct party naming, representative-capacity signature blocks, IP-assignment-on-payment, LoL cap; only client-side placeholders + one insurance representation to verify
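The "caught / missed / false positives" figures in the veil row are set comparisons between what the agents flagged and what the sealed key says was planted. A minimal sketch of that scoring step follows, assuming each planted violation and each agent flag can be reduced to a transaction ID. The IDs, the `Score` type, and `score_flags` are hypothetical illustrations, not the actual scorer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    caught: int
    missed: int
    false_positives: int

    @property
    def recall(self) -> float:
        # Share of planted violations that the agents flagged.
        total = self.caught + self.missed
        return self.caught / total if total else 1.0


def score_flags(flagged: set[str], answer_key: set[str]) -> Score:
    """Compare agent-flagged transaction IDs against the sealed key."""
    return Score(
        caught=len(flagged & answer_key),
        missed=len(answer_key - flagged),
        false_positives=len(flagged - answer_key),
    )


# Hypothetical IDs for illustration; the real fixture IDs are not shown here.
answer_key = {f"txn-{i:02d}" for i in range(1, 9)}  # 8 planted violations
flagged = set(answer_key)                           # a run that flags exactly those
result = score_flags(flagged, answer_key)
print(result.caught, result.missed, result.false_positives, result.recall)
```

Because the key is only passed to `score_flags`, never to the code that produces `flagged`, the separation between doing the work and grading it is visible in the call structure.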

The hardest trap — Xcel utilities — was caught with the correct reasoning: not "utilities are personal" but "this business already elected the simplified home-office method, so utilities must not ALSO flow through the entity." That is cross-document intent reasoning, not pattern matching.
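To make the distinction concrete, here is a minimal sketch of a rule that depends on client state rather than on the transaction category alone. It encodes only the one case this exhibit describes (simplified method elected, home utilities paid by the entity); every other combination is deliberately left as "needs review" rather than decided. All type and function names are hypothetical and do not represent the agents' actual logic.

```python
from dataclasses import dataclass
from enum import Enum


class HomeOfficeMethod(Enum):
    SIMPLIFIED = "simplified"
    ACTUAL_EXPENSE = "actual_expense"
    NONE = "none"


class Finding(Enum):
    VEIL_VIOLATION = "veil_violation"
    NEEDS_REVIEW = "needs_review"


@dataclass(frozen=True)
class ClientState:
    # Election taken from a prior document (e.g. last year's return).
    home_office_method: HomeOfficeMethod


def classify_home_utility(paid_by_entity: bool, state: ClientState) -> Finding:
    """Classify a home utility bill using the election, not the category."""
    if paid_by_entity and state.home_office_method is HomeOfficeMethod.SIMPLIFIED:
        # The simplified method was elected, so utilities must not ALSO
        # flow through the entity.
        return Finding.VEIL_VIOLATION
    # Assumption: all other cases are left for human or frontier-model review.
    return Finding.NEEDS_REVIEW


state = ClientState(home_office_method=HomeOfficeMethod.SIMPLIFIED)
print(classify_home_utility(paid_by_entity=True, state=state).value)
```

A pure pattern matcher would take only the transaction as input; the point of the sketch is that `ClientState` is a required argument.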

3. The Haiku finding (routing implication)

Haiku week-1 probe: detection recall matched Sonnet (both planted week-1 violations caught) at 3.2× lower cost ($0.105 vs $0.334). But its remediation contained an actively harmful tax error — it invented a partial utilities deduction the simplified-method election forbids — plus internal inconsistencies (deductible flag contradicting its own note).

Routing conclusion: split by FUNCTION, not by task. Cheap models are adequate as detectors/flaggers; every action-generating output (remediation, tax guidance, documents, Q&A) stays on the frontier tier. This refines the cost curve's blended scenario: the blend boundary runs through the middle of tasks, not between them.
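One way to picture "split by function" is a router keyed on what a step produces rather than on which task it belongs to. The sketch below is an illustration under assumptions: the function names, the tier labels "cheap" and "frontier", and `plan_bookkeeping_week` are hypothetical, and only the two week-1 cost figures come from this run.

```python
from enum import Enum


class Function(Enum):
    DETECT = "detect"              # flagging only, no advice emitted
    REMEDIATE = "remediate"
    TAX_GUIDANCE = "tax_guidance"
    DOCUMENT = "document"
    OWNER_QA = "owner_qa"


# Every action-generating output stays on the frontier tier.
ACTION_GENERATING = frozenset({
    Function.REMEDIATE,
    Function.TAX_GUIDANCE,
    Function.DOCUMENT,
    Function.OWNER_QA,
})


def pick_tier(function: Function) -> str:
    return "frontier" if function in ACTION_GENERATING else "cheap"


def plan_bookkeeping_week() -> list[tuple[str, str]]:
    """A single task (one week of books) split across two tiers."""
    steps = (Function.DETECT, Function.REMEDIATE)
    return [(step.value, pick_tier(step)) for step in steps]


print(plan_bookkeeping_week())

# Week-1 cost figures reported above: Sonnet vs the Haiku probe.
sonnet_week1, haiku_week1 = 0.334, 0.105
print(round(sonnet_week1 / haiku_week1, 1))
```

The weekly bookkeeping task appears once in the plan but lands on both tiers, which is what "the blend boundary runs through the middle of tasks" means in practice.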

4. Measured cost (per-agent, from transcripts, real API prices incl. cache)

Agent | Cost (USD)
4× weekly bookkeeping | $1.187
Veil-integrity audit | $0.272
4× owner Q&A | $1.053
Estimated tax | $0.325
Formality calendar | $0.222
Document generation (2 docs) | $0.156
Ops loop total (12 Sonnet agents) | $3.21
(Haiku probe, instrumentation) | $0.105
(Scorer, test harness, not product) | $0.943

Measurement noise: two bookkeeping agents' output tokens look undercounted in transcript parsing (±10% on the total). Not included: annual-close amortization ($0.50–0.60/mo equiv from the cost-curve model) and the heavier setup month.
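The total and the noise band can be re-derived from the per-agent rows. The sketch below uses `Decimal` to avoid float rounding; the dictionary keys are labels copied from the table, and the variable names are illustrative. The rows sum to 3.215, which is consistent with the reported $3.21 given that each row is itself rounded to three places.

```python
from decimal import Decimal

# Per-agent costs in USD, copied from the table above.
OPS_LOOP_COSTS = {
    "4x weekly bookkeeping": Decimal("1.187"),
    "Veil-integrity audit": Decimal("0.272"),
    "4x owner Q&A": Decimal("1.053"),
    "Estimated tax": Decimal("0.325"),
    "Formality calendar": Decimal("0.222"),
    "Document generation (2 docs)": Decimal("0.156"),
}
HARNESS_COSTS = {
    "Haiku probe": Decimal("0.105"),
    "Scorer": Decimal("0.943"),
}

ops_total = sum(OPS_LOOP_COSTS.values(), Decimal("0"))
run_total = ops_total + sum(HARNESS_COSTS.values(), Decimal("0"))

# Stated measurement noise: plus or minus 10% on the ops-loop total.
noise = Decimal("0.10")
low, high = ops_total * (1 - noise), ops_total * (1 + noise)

print(f"ops loop: ${ops_total}  (band ${low:.2f} to ${high:.2f})")
print(f"whole run incl. probe and scorer: ${run_total}")
```

Even at the top of the noise band the ops loop stays at roughly $3.54, and the harness costs are kept in a separate dictionary because they are instrumentation, not product cost.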

5. Cost curve, updated with measurement

Metric | Predicted (/research/cost-curve, fleet analogs) | Measured (this run)
Frontier compute / entity-month | $5.50–8.00 | $3.21
All-in, WY entity, retail RA | $23.40 | ≈$16.50–18.60

The analog-based prediction was, as claimed, an upper bound: the purpose-built slim context runs at ~40–60% of fleet-analog cost. The all-in number now lands under the $20 target INCLUDING Wyoming state fees, at pure frontier prices, before any cheap-model routing.
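The "~40–60%" figure follows directly from the frontier-compute row: the measured cost divided by each end of the predicted range. A short check, with variable names chosen for illustration:

```python
measured = 3.21                           # USD per entity-month, this run
predicted_low, predicted_high = 5.50, 8.00

share_of_high = measured / predicted_high  # against the top of the predicted range
share_of_low = measured / predicted_low    # against the bottom of the predicted range

print(f"{share_of_high:.0%} to {share_of_low:.0%} of the fleet-analog prediction")
```

The measured value sits below even the low end of the prediction, which is why the prediction reads as an upper bound rather than a central estimate.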

6. Honest caveats

  1. Light month (39 txns; a busy trades business might run 150+) — this is a floor-ish datapoint, not a distribution. Re-run with a heavy-month fixture.
  2. Single run — no variance estimate yet.
  3. Fixture/scorer provenance — fixtures and answer key authored by the same model family that scored (deterministic key limits subjectivity for veil/calendar/tax; Q&A/docs grades carry more scorer judgment).
  4. No adversarial user — Maya answers honestly; real users rationalize ("the guitar is for office ambiance"). Intent-classification under user pushback is untested.
  5. Setup month (formation, EIN, OA, bank onboarding) not yet measured.

7. The finding under the finding

The economics were never the hard part; $3.21 settles that. What this run actually demonstrates is that the hard part is intent classification over a messy human life: the same $148 utility bill is deductible, partially deductible, or a veil violation depending on an election buried in last year's tax return; the same $400 Venmo is payroll, contractor payment, or commingling depending on what the brother actually did. The system passed because it reasoned over the whole client state, and the cheap-model probe failed exactly where that reasoning got intent-dependent. Parka's product is not bookkeeping arithmetic. It is contextual judgment about a person's life, held to evidence-grade standards, at about $3/month. The next experiment should test that judgment under adversarial or ambiguous user behavior, where intent is genuinely contested.