Is the most expensive model worth it? I was watching the token counts at the end of Claude Code runs and had no idea what those numbers actually meant. Was this expensive? Did it matter?

That question has been running underneath everything since. Three articles and four experiments later, I have more data than I started with. What I didn't expect was that each new round of data would make the question harder to answer, not easier.

Every routing decision is an economic decision. The question is whether you are making it consciously.

The routing argument isn't new. It just hasn't been framed in economic terms for most practitioners. Chris Hutchins put it plainly on a recent podcast[1]: "If I needed someone to wash my car, I'm not going to call my neurosurgeon to come in and wash my car." Right model for the right task. The question is what it costs when you get that wrong.

Also on that podcast, Anish Acharya observed: "On a per-token basis, consumers are going to need to get more efficient at some point." The subsidies that make this feel optional today won't last. When pricing normalises, the teams that have been defaulting to the most expensive model for every task will feel it.

This article is about what happens when you measure cost and quality together.

The cheap model doesn't cost you quality. It costs you something else.

One of the experiments earlier in this series tested four models on three DevOps incident types: a PostgreSQL storage exhaustion, a Kubernetes node eviction, and a fictionalised S3-style outage. Two judge models scored each response independently.

| Model | Simple (S / G) | Medium (S / G) | Complex (S / G) | Cost per call | Latency |
|---|---|---|---|---|---|
| Haiku | 8 / 8 | 8 / 6 | 10 / 8 | ~$0.003 | 9-11s |
| Sonnet | 10 / 8 | 9 / 8 | 10 / 9 | ~$0.016 | 22-24s |
| GPT-5.4-mini | 9 / 8 | 8 / 9 | 10 / 9 | ~$0.001 | 5-6s |
| GPT-5.5 | 10 / 9 | 10 / 8 | 10 / 10 | ~$0.031 | 22-24s |

Three incidents. Directional, not definitive. S = Sonnet judge, G = GPT-5.5 judge.
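
To put the two spreads side by side, here is a minimal sketch that averages each model's six scores from the table and compares the result with cost per call. The figures are copied from the table above. The variable names, and the choice to average across both judges and all three incidents, are illustrative assumptions rather than part of the original analysis.

```python
from statistics import mean

# (Sonnet judge, GPT-5.5 judge) scores per incident, copied from the table above.
scores = {
    "Haiku":        [(8, 8), (8, 6), (10, 8)],
    "Sonnet":       [(10, 8), (9, 8), (10, 9)],
    "GPT-5.4-mini": [(9, 8), (8, 9), (10, 9)],
    "GPT-5.5":      [(10, 9), (10, 8), (10, 10)],
}

# Approximate cost per call in USD, from the same table.
cost_per_call = {
    "Haiku": 0.003,
    "Sonnet": 0.016,
    "GPT-5.4-mini": 0.001,
    "GPT-5.5": 0.031,
}


def mean_score(model):
    """Average all six scores (three incidents, two judges) for one model."""
    return mean(s for pair in scores[model] for s in pair)


cheapest = min(cost_per_call.values())

# model -> (mean score, cost as a multiple of the cheapest model)
summary = {
    m: (round(mean_score(m), 2), round(cost_per_call[m] / cheapest, 1))
    for m in scores
}

for model, (avg, multiple) in summary.items():
    print(f"{model:<13} mean score {avg:>5}   cost x{multiple}")
```

On these figures the mean score runs from 8.0 (Haiku) to 9.5 (GPT-5.5), while cost per call spans a factor of roughly 31.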

The cost spread is real. So is what it obscures. GPT-5.4-mini at $0.001 matched Sonnet at $0.016 on quality scores across most scenarios, but the models' responses were structurally different. Haiku produced tight, actionable responses with prioritised steps and owner assignments. GPT-5.5 produced comprehensive responses with architectural depth, edge cases, and specific operational warnings. That depth came with a tradeoff: harder to act on quickly, but more likely to catch something specific. On the PostgreSQL incident, that tradeoff mattered.

The VACUUM FULL finding from the PostgreSQL incident is the clearest example. From the Sonnet judge, Haiku scored 8/10 and GPT-5.5 scored 10/10. The two-point gap contained a specific warning: do not run VACUUM FULL without careful planning, as it requires an exclusive lock on the table and additional disk space. On a storage-constrained system, that warning could prevent someone from making the situation significantly worse.

GPT-5.5 — Simple Incident, PostgreSQL

Warning: Do not immediately run VACUUM FULL on primary tables unless planned carefully, because it requires heavy locks and additional disk space. Prefer regular VACUUM, autovacuum tuning, and index cleanup where appropriate.

Haiku — Simple Incident, PostgreSQL

No equivalent warning. Haiku's response covered severity, root cause, and recommended actions without flagging the VACUUM FULL risk.

Finding

Cost optimisation doesn't cost you the diagnosis. In this case it cost the warning that came after it. The cheap model and the expensive model both identified the problem. Only one of them told you what not to do.

That finding cuts against a bias most of us carry. Chris put it plainly: "I never want to use Haiku. So I'm like, oh god, that's the dumb model."[1] But the data doesn't hold that consistently in either direction. On the most complex incident, Haiku scored 10/10 from the Sonnet judge. GPT-5.5 failed silently on the first run: the token limit that worked for every other model wasn't enough for it. It produced no output and billed for it. "The expensive model must be better" is a feeling, not a finding. Neither model won every scenario.
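
A silent failure like that is cheap to detect after the fact. The sketch below shows one way to flag a call that billed output tokens but returned no usable text. The `CallResult` fields and both helper functions are hypothetical names for illustration; they would need to be mapped onto whatever usage metadata your API client actually returns.

```python
from dataclasses import dataclass


@dataclass
class CallResult:
    """Hypothetical record of one model call; field names are illustrative."""
    model: str
    text: str            # visible response text
    output_tokens: int   # tokens billed as output
    max_tokens: int      # output limit passed with the request


def is_silent_failure(result: CallResult) -> bool:
    """True when the call billed output tokens but produced no visible text."""
    return result.output_tokens > 0 and not result.text.strip()


def hit_token_limit(result: CallResult) -> bool:
    """True when billed output reached the configured output limit."""
    return result.output_tokens >= result.max_tokens


# Illustrative values only, not figures from the experiment.
run = CallResult(model="example-model", text="", output_tokens=1000, max_tokens=1000)
if is_silent_failure(run):
    reason = "token limit reached" if hit_token_limit(run) else "unknown cause"
    print(f"{run.model}: empty response, {run.output_tokens} tokens billed ({reason})")
```

A check of this kind would let a harness retry with a higher limit instead of recording an empty response as a result.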

The incident triage experiment answered one question: does routing to a cheaper model cost you quality? The answer was mostly no, with the caveat that what you lose isn't always visible in the score. That raised a harder question: what happens when the task isn't just operational, but compliance-sensitive? Does cost optimisation create liability when the stakes are an audit finding rather than a missed operational warning?

For compliance tasks, cost optimisation doesn't create liability. But it changes urgency.

The SOC2 compliance experiment tested the same four models on three readiness scenarios: an IAM role review (CC6), change management (CC8), and monitoring and alerting (CC7).

The question was direct: does routing to the cheapest model create compliance risk?

| Model | CC6 IAM (S / G) | CC8 Change (S / G) | CC7 Monitoring (S / G) | Total cost | vs cheapest |
|---|---|---|---|---|---|
| GPT-5.4-mini | 10 / 9 | 10 / 10 | 10 / 10 | $0.003 | baseline |
| Haiku | 9 / 9 | 9 / 9 | 9 / 8 | $0.013 | 3.9x |
| Sonnet | 10 / 9 | 10 / 9 | 10 / 9 | $0.049 | 14.5x |
| GPT-5.5 | 10 / 9 | 10 / 10 | 10 / 10 | $0.144 | 42.3x |

Three scenarios. No critical findings were missed by any model. S = Sonnet judge, G = GPT-5.5 judge.

Every model caught every audit-critical gap. The cheapest model identified the correct SOC2 controls, correctly classified the gaps, and accurately assessed the compliance consequence. GPT-5.5 at 42x the cost caught nothing additional.
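
The cost multiples follow directly from the unrounded per-model totals listed in the full results at the end of this article. A quick check of the arithmetic, with illustrative variable names:

```python
# Total cost in USD across the three SOC2 scenarios, as listed in the full results.
total_cost = {
    "GPT-5.4-mini": 0.0034,
    "Haiku": 0.0131,
    "Sonnet": 0.0493,
    "GPT-5.5": 0.1439,
}

# Use the cheapest model as the baseline and express the rest as multiples of it.
baseline = min(total_cost.values())
multiples = {model: round(cost / baseline, 1) for model, cost in total_cost.items()}

for model, multiple in multiples.items():
    print(f"{model:<13} ${total_cost[model]:.4f}  {multiple}x")
```

The multiples come out at 3.9x, 14.5x, and 42.3x only when computed from the four-decimal totals; the three-decimal figures in the table are too coarse to reproduce them.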

But there was one finding that didn't show up in the scores.

Same scenario. Same rubric. Different urgency.

Sonnet gave the only unambiguous Adverse classification across both the IAM role and change management scenarios. Haiku flagged Adverse in its response headers but hedged in the body text, leaving the overall classification ambiguous. GPT-5.4-mini and GPT-5.5 were consistently Qualified.

That distinction matters. Qualified Opinion Risk means an auditor would flag this as a finding, but you can still achieve SOC2 certification if you remediate. Adverse Opinion Risk means this is significant enough that it could prevent certification or cause customers to question your security posture.

| Model | CC6 IAM Role | CC8 Change Mgmt | CC7 Monitoring |
|---|---|---|---|
| Haiku | Qualified | Qualified | Qualified |
| Sonnet | Adverse | Adverse | Qualified |
| GPT-5.4-mini | Qualified | Qualified | Qualified |
| GPT-5.5 | Qualified | Qualified | Qualified |

Sonnet's reasoning was specific. For the IAM role, multiple compounding Design Gaps (unrestricted S3, RDS, and IAM enumeration permissions on a data pipeline processing customer transaction data, with no access review in 18 months) reach Adverse rather than Qualified. For the change management scenario: 14 bypass events in one quarter is not an isolated deviation. It's a systemic pattern that an auditor would treat as the control not operating at all. The full responses from all four models are in the results dropdown at the end of this article.

The Finding

Sonnet gave the only internally consistent Adverse classification. This is not a cost pattern. It is a model behaviour pattern.

One possible explanation: the other models may be implicitly accounting for the SOC2 remediation window. Qualified Opinion Risk says: this is a finding but fixable before the audit report is issued. Adverse Opinion Risk says: the audit period already contains evidence of systemic non-compliance, and remediation alone may not satisfy an auditor. That's one interpretation. It hasn't been tested.

Both positions are defensible. Neither model is wrong. The difference is in the assumption each one makes about auditor discretion. The team reading the response doesn't know that. They just see the classification and act on it.

A team routed to any model other than Sonnet gets Qualified Opinion Risk and prepares accordingly. A team routed to Sonnet gets Adverse Opinion Risk and may treat it as a certification blocker. The routing decision changed the urgency of the response, not the accuracy of the diagnosis.

The data didn't answer the question I started with. It replaced it with a better one.

Both experiments pointed in the same direction: cost optimisation doesn't cost you the diagnosis. The cheap model caught the same gaps as the expensive one. What the data also showed, and what the scores didn't capture, is that models make different assumptions about what you need from a response.

Haiku gives you something you can act on immediately. GPT-5.5 gives you something you can learn from. Sonnet gives you a more conservative risk assessment that may or may not reflect how an auditor would actually rule. None of those is wrong. They're just different, in ways that matter depending on the moment you're in.

The VACUUM FULL warning appeared on the simplest incident. The audit opinion risk divergence appeared on scenarios every model scored identically. Neither showed up in the scores. A routing layer optimising on quality scores alone has no signal for either.
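
One signal that scores alone don't provide is disagreement between models. The sketch below uses the audit opinion labels from the classification table to flag scenarios where the models did not return a unanimous label. The function name and data layout are assumptions for illustration; the experiments did not implement anything like this.

```python
# Audit opinion labels per scenario, copied from the classification table.
labels = {
    "CC6 IAM Role": {
        "Haiku": "Qualified", "Sonnet": "Adverse",
        "GPT-5.4-mini": "Qualified", "GPT-5.5": "Qualified",
    },
    "CC8 Change Mgmt": {
        "Haiku": "Qualified", "Sonnet": "Adverse",
        "GPT-5.4-mini": "Qualified", "GPT-5.5": "Qualified",
    },
    "CC7 Monitoring": {
        "Haiku": "Qualified", "Sonnet": "Qualified",
        "GPT-5.4-mini": "Qualified", "GPT-5.5": "Qualified",
    },
}


def divergent_scenarios(labels_by_scenario):
    """Return {scenario: {label: [models]}} for scenarios without a unanimous label."""
    flagged = {}
    for scenario, by_model in labels_by_scenario.items():
        groups = {}
        for model, label in by_model.items():
            groups.setdefault(label, []).append(model)
        if len(groups) > 1:  # more than one distinct label means the models disagree
            flagged[scenario] = groups
    return flagged


for scenario, groups in divergent_scenarios(labels).items():
    print(scenario, groups)
```

Run against the table, this flags the IAM role and change management scenarios and leaves monitoring alone. It says nothing about which label is right, only that a human should look.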

That's where this series runs out of road. Not because the question is unanswerable. Because answering it properly requires a different kind of experiment.

Where the Series Lands

Three articles. Four experiments. The question that started this series was whether inference layer architecture would become an everyday DevOps concern regardless of company size. The data says yes. Not because the infrastructure is complicated, but because the decisions inside it are.

The abstraction layer solves the dependency problem. Routing intelligence asks which model handles which task. Token economics tells you what those decisions cost, but not whether the cost was worth it. What none of it solves yet is the visibility problem: knowing when a routing decision was wrong before it causes a problem.

That's the question the next piece in this work is trying to answer. Not as part of this series. The series ends here. But the work doesn't.

[1] Chris Hutchins and Anish Acharya, All the Hacks podcast, May 13 2026. Cited twice in this article. youtube.com/watch?v=nBPpwz4H85Y


This is the third and final piece in the Infinite Drive Field Notes series on LLM inference layer architecture. The work was run using the Anthropic and OpenAI APIs in May 2026.

Methodology, Experiment 4: SOC2 Compliance Assessment Routing

Four models were run against three SOC2 readiness scenarios mapped to specific Common Criteria controls. Two judge models scored each assessment response.

A note on routing vs model selection: These experiments test fixed model selection, not a routing layer. Each model ran each scenario independently. The argument the article makes is that cost-based signals are insufficient for making model selection decisions, which is the problem a routing layer would need to solve.

Triage models: claude-haiku-4-5, claude-sonnet-4-6, gpt-5.4-mini (snapshot: gpt-5.4-mini-2026-03-17), gpt-5.5

Judge models: claude-sonnet-4-6, gpt-5.5

System prompt (identical across all four triage models):

You are a senior security and compliance engineer conducting
a SOC2 Type II readiness assessment. When given a description
of an infrastructure configuration, deployment process, or
operational setup, respond with a structured assessment covering:
1. The SOC2 Common Criteria control at risk (e.g. CC6.1, CC8.1)
   with a brief explanation of why it applies.
2. The specific gap: classify as Control Deficiency (control
   exists but does not operate effectively), Deviation (process
   exists but is not being followed), or Design Gap (no control
   exists for this criteria at all).
3. Audit opinion risk: Unqualified (no risk), Qualified Opinion
   Risk (auditor would flag this; you can still pass), or
   Adverse Opinion Risk (could prevent certification or cause
   customers to question your security posture).
4. Immediate remediation steps required before an audit.
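
Because the prompt asks for one of three named audit opinion labels, the label can be pulled out of a response with a simple pattern match. The sketch below is a hypothetical helper, not the method used in the experiment. It returns "ambiguous" when a response mentions more than one label, which is the situation described for Haiku, where the header and body text pointed in different directions. The sample strings are invented for illustration.

```python
import re

# Word boundaries keep "qualified" from matching inside "unqualified".
LABEL_PATTERN = re.compile(r"\b(unqualified|qualified|adverse)\b", re.IGNORECASE)


def extract_opinion_risk(response_text: str) -> str:
    """Return the single label found, 'ambiguous' if several, 'missing' if none."""
    found = {match.group(1).lower() for match in LABEL_PATTERN.finditer(response_text)}
    if not found:
        return "missing"
    if len(found) > 1:
        return "ambiguous"
    return found.pop()


# Invented sample responses, not output from any of the four models.
clear = "3. Audit opinion risk: Adverse Opinion Risk."
hedged = "Adverse Opinion Risk\nIn practice this may be treated as Qualified if remediated."

print(extract_opinion_risk(clear))
print(extract_opinion_risk(hedged))
```

A response that restates all three options before choosing one would also come back as "ambiguous", so a real harness would likely need a stricter output format than free text.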

Scoring rubric:

Control Identification (1-5):
5 = correctly identified the specific SOC2 control and sub-criteria
4 = correct control area, missed specific sub-criteria
3 = related control but not the primary one at risk
2 = wrong control area
1 = no meaningful identification

Gap Accuracy (1-5):
5 = correctly characterised gap, type, and audit consequence
4 = correct gap and type, understated or overstated consequence
3 = identified gap, missed compliance consequence or misclassified type
2 = partial identification, missing material elements
1 = missed the gap or characterised it incorrectly
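
The rubric has two dimensions, each scored 1 to 5. Assuming the score out of 10 shown in the results tables is the sum of the two (the methodology does not state this explicitly), a judge's output could be combined as in the sketch below. The function and parameter names are hypothetical.

```python
def rubric_total(control_identification: int, gap_accuracy: int) -> int:
    """Sum the two rubric dimensions (1-5 each) into a score out of 10."""
    for name, value in (
        ("control_identification", control_identification),
        ("gap_accuracy", gap_accuracy),
    ):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return control_identification + gap_accuracy


def format_pair(sonnet_total: int, gpt_total: int) -> str:
    """Render two judges' totals in the 'S / G' form used in the tables."""
    return f"{sonnet_total} / {gpt_total}"


# Example: one judge gives 5 and 5, the other gives 5 and 4.
print(format_pair(rubric_total(5, 5), rubric_total(5, 4)))
```

Under this assumption, a 9 means the judge marked one dimension down by a single point, for example the right control area without the specific sub-criteria.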

Judge conflict of interest: both judge models, claude-sonnet-4-6 and gpt-5.5, were also triage models, so each scored its own responses. LLM-as-judge bias is well documented. Scores should be read with that caveat in mind.

Temperature: API default (1.0) for all models. Results are directional, not precise.

The three scenarios:

CC6: IAM Role Review: A startup data pipeline service account with unrestricted s3:*, rds:*, ec2:Describe*, iam:ListRoles, and iam:GetRole permissions. Processing customer transaction data. No access review in 18 months.

CC8: Change Management: An engineering team with branch protection requiring one approving review. CTO has administrator access and bypassed branch protection 14 times in the last quarter during incidents. No documented emergency change procedure. No post-deployment review for bypass events.

CC7: Monitoring and Alerting: EKS application with Datadog dashboards configured but no alert policies. Dashboards reviewed weekly. Last month's database connection pool exhaustion discovered by manual testing, not automated signal.

Full Results: SOC2 Compliance Experiment, All Responses and Scores

| Tier | Model | Total cost (3 scenarios) | vs GPT-5.4-mini |
|---|---|---|---|
| Cheapest | GPT-5.4-mini | $0.0034 | baseline |
| 2nd cheapest | Haiku | $0.0131 | 3.9x |
| Mid-tier | Sonnet | $0.0493 | 14.5x |
| Most expensive | GPT-5.5 | $0.1439 | 42.3x |