As I used LLMs for my own work (reviewing code, evaluating tools, thinking through infrastructure problems), the same question kept surfacing. Not whether the API responded. Whether what came back was actually right.
That question (can I actually trust this output?) was present from the start. I just didn't have the vocabulary to connect it to infrastructure yet.
The first article in this series made the case for thinking about inference layer architecture before you need to. It was written for teams building LLM-powered products. But the experiments that followed surfaced something I hadn't fully articulated: the infrastructure concern doesn't start when you ship something to a customer. It starts the moment you start relying on the output. Internal tooling (code review, incident triage, documentation) carries exactly the same exposure. The routing decisions, the provider dependencies, the output variance: all apply whether or not there's a user on the other end.
This piece is about what building the abstraction layer and running the routing experiment actually revealed. Some of it confirmed what I expected. Some of it didn't.
The abstraction layer is the foundation. It is not the answer.
The first article documented the blast radius of a provider switch. The numbers tell the story better than prose does.
I'm showing this again because it matters for what comes next. Routing decisions only make sense once the dependency is in one place. If it isn't, the migration problem comes first.
Test app: 3 LLM call sites. Numbers scale with codebase size.
What the experiment also showed, and what became clearer after running more experiments, is that the blast radius extends beyond provider switching. When GPT-5.5 changed its API from max_tokens to max_completion_tokens, the abstracted version required a four-line fix in one file. Without abstraction, that change would have surfaced as an error in every call site across the application. Model version changes within a single provider follow the same pattern as provider switches. The abstraction doesn't eliminate maintenance. It centralises it.
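A minimal sketch of what that centralisation might look like. The function name `build_request_kwargs` and the set `USES_MAX_COMPLETION_TOKENS` are hypothetical and are not the code used in the experiment; only the two parameter names and the model identifiers come from this article. Which models belong in the set is an assumption.

```python
# Hypothetical sketch: one place that knows which token-limit parameter a model expects.
# Assumption: membership of this set is illustrative, based on the GPT-5.5 change described above.
USES_MAX_COMPLETION_TOKENS = {"gpt-5.5"}


def build_request_kwargs(model: str, messages: list[dict], token_limit: int) -> dict:
    """Build call arguments so that call sites never name the token parameter themselves."""
    kwargs = {"model": model, "messages": messages}
    if model in USES_MAX_COMPLETION_TOKENS:
        kwargs["max_completion_tokens"] = token_limit
    else:
        kwargs["max_tokens"] = token_limit
    return kwargs
```

With something like this in place, a parameter rename is an edit to one set and one branch, rather than a change at every call site.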
The abstraction layer is necessary but it only solves the dependency problem. It doesn't solve the routing problem. Once you have a single place where provider decisions are made, the next question is: what goes in that place? Which model handles which task, and on what basis?
It's worth noting that tools for this already exist. OpenRouter has been routing LLM requests since 2023 and recently shipped a feature that automatically selects a model based on a quality threshold you set. The problem is solved enough that a well-funded company built a product around it. What those tools don't surface is whether a specific routing decision was right for a specific moment. That's what I wanted to understand.
The routing experiment: incident triage across four models.
To test routing decisions with real data, I built an incident triage assistant and ran three realistic DevOps incidents through four models:
- Claude Haiku (claude-haiku-4-5)
- Claude Sonnet (claude-sonnet-4-6)
- GPT-5.4-mini
- GPT-5.5
The incidents were designed at three complexity levels:
| Complexity | Incident | Key signal |
|---|---|---|
| Simple | PostgreSQL storage exhaustion | Storage at 87%, memory and CPU climbing alongside storage pressure |
| Medium | Kubernetes ephemeral storage, node-wide pod evictions | Weekly eviction pattern, blast radius wider than expected |
| Complex | Fictionalised S3-style outage | Customer 404s against a healthy application stack, static assets as subtle signal |
These incidents were designed to be discoverable — each one had enough signal to reason from. Real incidents are messier, and model performance under genuinely ambiguous conditions is a question this experiment doesn't answer.
Two judge models scored each triage response independently on diagnosis accuracy and response quality, each on a 1-5 scale, for a total out of 10.
| Model | Simple | Medium | Complex | Cost per call | Latency |
|---|---|---|---|---|---|
| Haiku | 8/10 | 8/10 | 10/10 | ~$0.003 | 9-11s |
| Sonnet | 10/10 | 9/10 | 10/10 | ~$0.016 | 22-24s |
| GPT-5.4-mini | 9/10 | 8/10 | 10/10 | ~$0.001 | 5-6s |
| GPT-5.5 | 10/10 | 10/10 | 10/10 | ~$0.031 | 22-24s |
Three incidents. Directional, not definitive.
The surface reading of this data is that cheap models are good enough. GPT-5.4-mini at $0.001 per call scored within one point of Sonnet at $0.016 on every incident and matched it on the complex one. Haiku correctly identified the S3 outage from the subtle static asset signal and scored 10/10 on the most complex incident. The more expensive models did not consistently justify their cost premium in quality scores.
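To make the cost side of that comparison concrete, here is a small sketch that recomputes mean score and cost multiple from the results table. The numbers are copied from the table; the `summarise` helper and the choice of "cost relative to the cheapest model" as the metric are my own illustration, and with three incidents the means are as directional as the scores themselves.

```python
# Scores (simple, medium, complex) and approximate cost per call, copied from the results table.
RESULTS = {
    "Haiku":        {"scores": (8, 8, 10),   "cost_usd": 0.003},
    "Sonnet":       {"scores": (10, 9, 10),  "cost_usd": 0.016},
    "GPT-5.4-mini": {"scores": (9, 8, 10),   "cost_usd": 0.001},
    "GPT-5.5":      {"scores": (10, 10, 10), "cost_usd": 0.031},
}


def summarise(results: dict) -> dict:
    """Return each model's mean score and its cost as a multiple of the cheapest model."""
    cheapest = min(row["cost_usd"] for row in results.values())
    return {
        name: {
            "mean_score": round(sum(row["scores"]) / len(row["scores"]), 2),
            "cost_multiple": round(row["cost_usd"] / cheapest, 1),
        }
        for name, row in results.items()
    }


for name, row in summarise(RESULTS).items():
    print(f"{name:13s} mean {row['mean_score']:.2f}  cost x{row['cost_multiple']}")
```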
The medium incident was the most differentiated. Sonnet's 9/10 came from identifying the specific mechanism (Prometheus TSDB write-ahead log accumulation) where Haiku stayed at the general level. Same score range, different depth.
GPT-5.5 scored 10/10 across all three incidents once the token limit was fixed. Before that fix, it produced empty responses across all three incidents: billed, slow, and silent. No error. Just an empty string and a charge. The most expensive model in the experiment failed silently, and there was nothing in the response to tell us why.
The scores told one story. The responses told another.
The scores captured what happened. They didn't capture what mattered.
Actionable beats comprehensive at 2am.
The scores didn't show the full picture. The responses did. Here is how Haiku opened its response to the complex S3-style outage:
Severity: P2. Customer-facing impact with active user complaints. Intermittent nature suggests partial outage, not complete service failure. P2 rather than P1 because users have partial access, no revenue-critical service explicitly mentioned, infrastructure is stable.
Primary suspect: The correlation between app 404s and marketing site image asset loading failures is NOT coincidental. This is the key insight. Both point to a shared external resource dependency (likely CDN, DNS, or external asset service).
| Priority | Action | Owner | Timeline |
|---|---|---|---|
| 1 | Check CDN/DNS provider status, SSL certificate expiry, external API health dashboards | On-call Eng | Immediate |
| 2 | Examine application logs for errors after load balancer stage | On-call Eng | 5 min |
| 3 | Test DNS resolution from app containers | On-call Eng | 5 min |
| 4 | Check external service integrations for recent changes | Junior Eng | 5 min |
| 5 | Validate certificate chain for external dependencies | On-call Eng | 5 min |
| 6 | Implement targeted monitoring: segment metrics by request type to confirm hypothesis | On-call Eng | 10 min |
Key investigation question: "What external services did we add, update, or change in the last 24 hours?" Start there before vendor escalations.
Six actions, each with an owner and a timeline. The kind of response you can act on in thirty seconds. GPT-5.5 handled the same incident differently.
Severity: P2. Customer-facing application is experiencing increased latency and intermittent 404 errors for the past hour. If the affected user percentage is high or core workflows are blocked, this should be upgraded to P1.
Most likely issue is in the edge/static delivery layer, not the core application backend. The marketing asset clue is likely related. This suggests a shared dependency such as: CDN issue or misconfiguration, CDN cache corruption, object storage or origin issue for static assets, DNS or regional edge routing issue, WAF or routing rule change at the edge...
Immediate recommended actions:
1. Treat the marketing asset failure as related. Identify whether the app and marketing site share the same CDN, asset bucket, hostname, DNS zone, WAF, or edge configuration.
2. Check CDN/edge provider health and logs. Look for spikes in 404s by edge location, region, hostname, path pattern, user agent, cache status: HIT/MISS/ERROR.
3. Compare affected vs unaffected users. Gather examples from support: URLs returning 404, customer region, timestamp, request IDs if available.
4. Validate static asset/origin availability. Confirm application bundles, images, JS/CSS, and route assets exist in the backing storage. Check bucket permissions, lifecycle jobs, replication...
(8 action items total, each with sub-bullets. No owner assignments. No timeline.)
The diagnosis is correct. GPT-5.5 identified the CDN and object storage dependency, included mitigation options, and flagged communications. But at 2am during an active incident, the question is not what might be wrong. It is what to do first, and who does it.
A fair counterargument: the format difference might be a prompting problem, not a routing problem. A more specific system prompt might collapse the gap entirely. I haven't tested that yet, but I intend to. What I do know is that under pressure, the response that tells you what to do and who does it is the one you reach for. Whether prompt engineering closes that gap is still an open question.
A score of 8/10 doesn't tell you what was missing.
That contrast (comprehensive versus actionable) is one way the scores fall short. There is another.
On the simple PostgreSQL incident, Haiku scored 8/10 and GPT-5.5 scored 10/10. Two points apart. This is one thing the more expensive model added:
Warning: Do not run VACUUM FULL without careful planning. Unlike regular VACUUM, VACUUM FULL requires an exclusive lock on the table, blocking all reads and writes, and needs additional disk space equal to the table size. On a storage-constrained system this could make the situation worse, not better.
Haiku did not surface this. That is operationally specific knowledge that could prevent a bad action under pressure. The score captures that something was different. It does not explain what, or why it matters.
The routing problem is not just about which model is most accurate. It is about what kind of response you need in the moment.
Severity is a business judgment. Without business context, the models couldn't make one.
The severity classifications tell the same story. Every model classified every incident as P2, across all three complexity levels. The simple PostgreSQL incident at 87% storage capacity with memory and CPU climbing is arguably P1 if left unattended. The complex incident with active customer 404s for an hour is also arguably P1 in most organisations.
The models stayed entirely within the technical frame, which is exactly what a DevOps practitioner asked to triage an incident would do. But severity in a real organisation is not purely a technical judgment. It is a business judgment. P1 versus P2 often comes down to revenue impact, customer count affected, SLA breach risk. None of that context was in the incident descriptions, so none of it appeared in the responses.
That is not a model failure. It is a prompt design and context problem. A routing layer optimising on technical quality scores would have no signal that a P2 classification might be understating the actual business impact. The routing decision that looks correct from the technical side may be the wrong decision from the business side. The system has no way to know the difference.
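One way the missing context could be supplied is sketched below. This is an assumption about a possible fix, not something the experiment tested: the `BusinessContext` fields and the `build_triage_message` helper are hypothetical names, chosen to mirror the three factors mentioned above (revenue impact, customers affected, SLA breach risk).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BusinessContext:
    """Hypothetical fields; the experiment's incident descriptions contained none of these."""
    revenue_impact: Optional[str] = None
    customers_affected: Optional[str] = None
    sla_breach_risk: Optional[str] = None


def build_triage_message(incident: str, context: Optional[BusinessContext] = None) -> str:
    """Append whatever business context is known, so severity is not judged on technical signal alone."""
    if context is None:
        return incident
    labelled = [
        ("Revenue impact", context.revenue_impact),
        ("Customers affected", context.customers_affected),
        ("SLA breach risk", context.sla_breach_risk),
    ]
    known = [f"- {label}: {value}" for label, value in labelled if value]
    if not known:
        return incident
    return incident + "\n\nBusiness context:\n" + "\n".join(known)
```

Whether the models would then move any of these incidents from P2 to P1 is untested.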
Routing by cost and complexity isn't enough.
The conventional routing argument is straightforward: cheap models for simple tasks, expensive models for complex ones. The data supports that, as far as it goes.
What it doesn't account for is the moment. The same incident might need a completely different kind of response depending on whether you are actively fighting it or reviewing it afterward. When I looked at the Haiku response, my instinct was to act on it immediately. When I looked at the GPT-5.5 response, my instinct was to read it carefully and then decide. The experiment didn't test that difference directly. Whether routing can account for operational state is still an open question and one I intend to explore.
No current routing tool makes that distinction. Routing decisions are made at configuration time, based on task type and cost. What is happening operationally when the call is made is not part of the signal.
Before you route, know what the response needs to do. A triage response and an architectural analysis are different outputs. Treating them as the same routing decision is how you end up with the wrong answer at the wrong moment.
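As a sketch of what "know what the response needs to do" could mean in code: the router below takes the purpose of the response and whether an incident is active as explicit inputs. Everything here is hypothetical. The `Purpose` enum, the `choose_model` function, and the mapping itself reflect only how the two responses read to me in this experiment, not a tested routing policy.

```python
from enum import Enum


class Purpose(Enum):
    TRIAGE = "triage"      # act now: short, owned, sequenced steps
    ANALYSIS = "analysis"  # read later: depth, caveats, alternatives


def choose_model(purpose: Purpose, incident_active: bool) -> str:
    """Pick a model from what the response is for, not only from task complexity.

    Assumption: the mapping below is illustrative and untested.
    """
    if purpose is Purpose.TRIAGE and incident_active:
        return "claude-haiku-4-5"
    return "gpt-5.5"
```

The point of the sketch is the function signature rather than the mapping: operational state becomes part of the routing signal instead of being fixed at configuration time.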
The VACUUM FULL finding is the clearest example of why scores aren't enough. Haiku scored 8/10. The two missing points contained a warning that could have prevented someone from making the situation worse. For tasks where a specific gap has real consequences, you need to know what the model tends to miss, not just how it scores overall.
The first run of GPT-5.5 produced empty responses, consumed tokens, and charged the account. No error surfaced. The routing layer had no signal that anything had gone wrong. Visibility into routing outcomes is not a nice-to-have. Without it, a bad routing decision is invisible until it causes a problem.
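A minimal guard for this failure mode might look like the following. The `EmptyCompletionError` class and the `require_content` function are hypothetical names; the sketch assumes the caller has already extracted the response text and the completion token count from whatever the provider SDK returned.

```python
class EmptyCompletionError(RuntimeError):
    """Raised when a call consumed tokens but returned no usable text."""


def require_content(text, completion_tokens: int, model: str) -> str:
    """Turn a silent empty response into a loud, attributable error."""
    if text is None or not text.strip():
        raise EmptyCompletionError(
            f"{model} returned an empty response after consuming "
            f"{completion_tokens} completion tokens; check the token limit"
        )
    return text
```

Wrapping every routed call in a check like this gives the routing layer at least one signal that a billed request produced nothing.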
Every routing decision is an economic decision. The question is whether you are making it deliberately or by default.
The abstraction layer solves the dependency problem. It centralises the maintenance burden and makes provider switches tractable. That is worth doing and the data shows exactly what it costs to not do it.
The routing problem is harder. Across three incidents, cheap models scored nearly as well as expensive ones, and the actual responses then revealed that the scores were missing something. The routing decision that optimises for cost may not optimise for operational utility. The routing decision that optimises for accuracy may not optimise for the moment.
The next article in this series addresses cost directly. The deeper question of how you build a routing layer that knows when its decisions are wrong is one this work is pointing toward, but hasn't answered yet.
This is the second in a series of Field Notes pieces on LLM inference layer architecture. The incident triage routing experiment was run using the Anthropic and OpenAI APIs in May 2026.
Methodology, Experiment 3: Incident Triage Routing
Four models were run against three incident descriptions at increasing complexity levels. Two judge models scored each response independently.
Triage models: claude-haiku-4-5, claude-sonnet-4-6, gpt-5.4-mini, gpt-5.5
Judge models: claude-sonnet-4-6, gpt-5.5
System prompt (identical across all four triage models):
You are a senior DevOps engineer conducting incident triage.
When given an incident description, respond with a structured
assessment covering:
1. Severity (P1/P2/P3/P4 with brief justification)
2. Likely root cause
3. Immediate recommended actions
4. Escalation decision (escalate now / monitor / no escalation needed)
Temperature: API default (1.0) for all models. Runs were not done at temperature 0, which means scores have variance that has not been quantified. A reader rerunning the same incidents would likely see different numbers. The findings are directional, not precise.
Judge conflict of interest: Both judge models (claude-sonnet-4-6 and gpt-5.5) were also among the four triage models. LLM-as-judge bias is well documented: models tend to score outputs stylistically similar to their own higher. This is a structural limitation of the experiment design. A third judge from a different provider family (such as Gemini) would have provided a cleaner comparison. The judge scores should be read with that caveat in mind.
Judge disagreement: When the two judges disagreed on a score, both scores are preserved in the raw results. The table in the article reflects the Sonnet judge scores. The GPT-5.5 judge scores were largely consistent with them, but diverged on Haiku for the medium incident (Sonnet judge: 8/10, GPT judge: 6/10) and on GPT-5.4-mini for the medium incident (Sonnet judge: 8/10, GPT judge: 9/10).
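A small sketch of how such divergences could be flagged automatically. The scores are the two medium-incident disagreements reported above; the `flag_disagreements` helper and the two-point default threshold are assumptions for illustration, not part of the experiment's tooling.

```python
# Medium-incident totals (out of 10) from each judge, as reported in the paragraph above.
JUDGE_SCORES = {
    "claude-haiku-4-5": {"sonnet_judge": 8, "gpt_judge": 6},
    "gpt-5.4-mini":     {"sonnet_judge": 8, "gpt_judge": 9},
}


def flag_disagreements(scores: dict, threshold: int = 2) -> list[str]:
    """Return the models whose two judge totals differ by at least `threshold` points."""
    return [
        model
        for model, s in scores.items()
        if abs(s["sonnet_judge"] - s["gpt_judge"]) >= threshold
    ]
```

At the default threshold only the Haiku response is flagged, which is the case where the choice of judge would most change the published table.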
Diagnosis Accuracy (1-5):
5 = identified the root cause correctly, nothing misleading
4 = directionally right, minor omission that would not send the team the wrong way
3 = partially correct, missing something meaningful that would slow resolution
2 = significant misdiagnosis or confused reasoning
1 = wrong or would send the team in the wrong direction
Response Quality (1-5):
5 = recommended actions are specific, sequenced, and immediately actionable
4 = good recommendations, minor gaps a senior engineer would catch
3 = reasonable starting point but missing critical steps
2 = vague or incomplete to the point of not being useful under pressure
1 = would mislead someone into making the situation worse
The three incidents:
Simple: PostgreSQL on db-primary-01 hitting storage capacity warnings for three days. Manual storage expansion as repeated temporary fix. Connection count stable. Memory and CPU climbing alongside storage pressure. Storage at 87% and climbing.
Medium: Prometheus monitoring pod evicting weekly in a Web3 application namespace. Eviction reason: node was low on resource: ephemeral-storage. Blast radius extends to other pods on the same node, interrupting stateful transaction calls and in-flight data ingestion. No resource limit or traffic changes. No storage-related alerts outside eviction events.
Complex: Increased latency and rising error rates across a customer-facing application for one hour. Customer 404 reports. No deployments in the affected window. EKS, database, and Redis all healthy. Internal health checks passing. Load balancer logs show requests reaching the application tier. Errors intermittent. A junior engineer notes that marketing site image assets stopped loading at the same time but flagged it as unrelated.
Technical note on GPT-5.5: In these runs, GPT-5.5 needed max_completion_tokens=4096. At 1024 it produced empty responses across all three incidents: tokens were consumed and latency was real, but the response field was empty. No API error was raised. The fix required identifying the token limit as the cause, updating the parameter, and rerunning all three incidents.