I left a talk recently with a question I couldn't shake: at what point does reliable inference architecture become a DevOps problem? Not an AI problem. Not a product problem. A DevOps problem.
The speaker was an Infrastructure Engineer at Sierra, presenting on how they built a resilient inference layer that dynamically routes across providers and reduces P99 latency by 70% or more, even during outages. She said something that stopped me: this kind of routing infrastructure used to be the domain of companies like Amazon Web Services and Google Cloud Platform. But with LLMs, providers can vary widely in latency, availability, and output quality, so you can't assume a single endpoint will be consistently fast or reliable. As a result, even small startups increasingly need routing layers to choose providers based on real-time health, performance, and whether the model is actually producing coherent output.
I've been a DevOps practitioner for long enough to recognize that feeling: the one where someone describes a problem you haven't fully named yet, where you realize the ground is shifting under an assumption you didn't know you were making. We've solved this for cloud providers. We've solved it for databases. We've solved it for REST APIs. The patterns exist. The tooling exists. Some of it even applies. Hedging, dynamic routing, and tail latency management are all borrowed from distributed systems thinking. The talk made clear that teams like Sierra have already assembled this into practice. What hasn't been written is a playbook for DevOps and infrastructure engineers who are encountering this problem for the first time. That's the question driving this series.
So I ran some experiments. Not to prove a point, but because I genuinely wanted to understand what we're dealing with.
We've been here before. Sort of.
In Web3, you could run your own node. Full control, no dependency on a third party. But the operational overhead and cost meant most teams ended up on managed node providers anyway. And when those providers had outages (which they did, more than once, across the major names in the space), the blast radius extended across every team that had made the same pragmatic choice. Convergence on a small number of providers is efficient right up until it isn't.
LLMs have the same structure. You could run your own models on your own GPU infrastructure. Full control, no provider dependency. But running your own models at scale requires GPU infrastructure and operational overhead that most teams aren't set up for, so most end up on one of a handful of managed providers, like Anthropic or OpenAI. The same convergence dynamic plays out. The same shared dependency risk accumulates quietly.
The difference is that LLM providers introduce a class of uncertainty that most infrastructure engineers haven't had to account for before. Availability is only part of it. There's also latency that varies unpredictably inside shared infrastructure you have no visibility into, model deprecation cycles that can turn a routine update into an unplanned sprint, and something I didn't expect before running these experiments: switching providers can change what your application actually does, not just how it's wired.
I didn't expect that last one. The experiments made it concrete.
The variance is structural, not load-driven.
The first question I wanted to answer was basic: how much does LLM latency actually vary, and does time of day matter? I ran 100 identical requests each to Claude Haiku and Claude Sonnet via the API in each of five time windows, covering peak and off-peak hours across Thursday, Friday, and Saturday: 500 requests per model, 1,000 total.
I expected to find a meaningful difference between peak and off-peak. I didn't.
Haiku median latency barely moved across all five windows: 1,574ms to 1,678ms, a 104ms spread across three days and multiple load conditions. Sonnet told the same story: 2,626ms to 2,829ms. Whatever is driving latency variance in Anthropic's infrastructure, it isn't the time of day.
Cost followed the same pattern for a simpler reason: Anthropic charges by token, not by load. No surge pricing, no peak hour premium. The bill is the same at 10pm Thursday as it is at 9:30am Friday when both coasts are online.
The gap that did show up consistently was between models. Haiku averaged 1,620ms median latency across all runs. Sonnet averaged 2,728ms. That 68% difference held across every time window, every day, every load condition. If you're optimising for latency, the conversation starts with which model you're calling, not when.
The more revealing finding was hiding in the outliers. Across 1,000 requests, 59 exceeded 3,000ms. You might expect those slow requests to have produced more tokens: more output, more processing time. They did not. Requests over 3,000ms averaged 94 output tokens. Requests under 2,000ms averaged 90. A difference of 4 tokens across nearly a thousand requests. The slowest call in the entire dataset was a Sonnet response at 11,579ms that produced 86 tokens, completely average output. Something happened inside Anthropic's infrastructure that had nothing to do with the task, the model, or the time of day.
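If you want to run the same check against your own logs, the comparison is a few lines over the per-request results described in the methodology below. The field names match the result dict shown there; the filename is illustrative.

```python
import json
from statistics import mean

# Per-request results accumulated across runs (filename is illustrative;
# fields match the result dict in the methodology below).
with open("results.json") as f:
    results = json.load(f)

slow = [r for r in results if r["latency_ms"] > 3000]
fast = [r for r in results if r["latency_ms"] < 2000]

print(f"over 3,000ms: {len(slow)} requests, "
      f"avg {mean(r['output_tokens'] for r in slow):.0f} output tokens")
print(f"under 2,000ms: {len(fast)} requests, "
      f"avg {mean(r['output_tokens'] for r in fast):.0f} output tokens")
```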
This is the problem the talk was pointing at. One of the patterns the speaker described was tail latency hedging: if a request hasn't responded within a threshold, fire a duplicate to a fallback and take whichever comes back first. I understood the concept when she described it. After running this experiment, I understand why it exists. That 11-second response wasn't slow because the model had more to say. Something happened inside the provider's infrastructure and there was no way to see it coming, route around it, or predict when it would happen again. The unpredictability is structural. The only honest response to it is architectural.
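To make the pattern concrete, here is a minimal sketch of tail latency hedging in Python. The threshold, model choices, and wrapper functions are mine, not Sierra's implementation, and a production version would need retry budgets and cost caps on top.

```python
import asyncio

from anthropic import AsyncAnthropic
from openai import AsyncOpenAI

HEDGE_AFTER_S = 3.0  # fire the duplicate if no answer within this threshold

async def call_primary(prompt: str) -> str:
    client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

async def call_fallback(prompt: str) -> str:
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    resp = await client.chat.completions.create(
        model="gpt-4o",  # illustrative fallback choice
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def hedged_completion(prompt: str) -> str:
    """Take whichever of the primary or the hedged fallback answers first."""
    primary = asyncio.create_task(call_primary(prompt))
    try:
        # shield() keeps the primary running if the timeout fires
        return await asyncio.wait_for(asyncio.shield(primary), HEDGE_AFTER_S)
    except asyncio.TimeoutError:
        fallback = asyncio.create_task(call_fallback(prompt))
        done, pending = await asyncio.wait(
            {primary, fallback}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()  # drop the slower duplicate
        return done.pop().result()
```

The trade-off is explicit: hedged requests cost a second API call on the slow tail, in exchange for capping the latency your users actually see.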
One scope note worth making: this experiment ran against Anthropic's API only. Whether other providers show the same stability across time of day is an open question and one worth testing before drawing broader conclusions.
The blast radius of a provider switch.
The second experiment measured the real cost of switching providers: not in dollars, but in time, files, and debugging sessions. I built a small but realistic code review assistant in Python with three stages and three structurally different LLM call sites: a single-turn completion for an overall assessment, a streaming response for detailed line-by-line feedback, and a structured JSON output for a summary of findings. Then I built two versions of it: one wired directly to the Anthropic API, one with a thin abstraction layer sitting between the application and the provider.
Then I migrated both from Anthropic to OpenAI and timed myself.
| Metric | Direct (no abstraction) | Abstracted |
|---|---|---|
| Migration time | 15 minutes 57 seconds | 35 seconds |
| Files changed | 1 | 0 |
| Lines changed | 8 | 0 |
| Errors to debug | 3 | 0 |
Migrating the abstracted version to OpenAI took 35 seconds. One environment variable: `export LLM_PROVIDER=openai`. Nothing in the application changed. The only file that ever needs to know about providers is the abstraction layer, and it was already written to handle both.
The 16 minutes was something different. The Anthropic and OpenAI SDKs differ in ways that aren't obvious until you're in the middle of a migration. The client initialisation is different. The response shape is different: response.content[0].text versus response.choices[0].message.content. The streaming implementation is meaningfully different: Anthropic uses a context manager, OpenAI doesn't. Each difference is a potential error, and I hit three of them. Each required understanding both APIs well enough to diagnose under time pressure. That's a skills dependency hiding in plain sight. Writing the abstraction layer is the work that removes it.
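To make the streaming difference concrete, here is the same call in both SDKs as I understand their current shapes; verify against each SDK's docs before leaning on the details. The prompt and the OpenAI model choice are placeholders.

```python
import anthropic
import openai

prompt = "Review this function for bugs."  # placeholder prompt

# Anthropic: streaming is a context manager exposing a text iterator
with anthropic.Anthropic().messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": prompt}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# OpenAI: streaming is a flag, and content arrives as deltas on chunks
chunks = openai.OpenAI().chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    max_tokens=512,
    stream=True,
    messages=[{"role": "user", "content": prompt}],
)
for chunk in chunks:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)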
For someone comfortable with both SDKs it is roughly 45 minutes of work: a thin wrapper, a provider switch, a handful of functions. For someone less familiar with the APIs it could take considerably longer. That is a one-time cost. The 35-second migration is what every subsequent provider switch costs, for anyone on the team, regardless of whether they have ever read either SDK's documentation.
The providers didn't just differ in how they were wired. They differed in what they said. Running identical code and identical prompts against Anthropic versus OpenAI produced meaningfully different outputs. Anthropic scored the sample code 3/10 and surfaced 9 issues. OpenAI scored the same code 5/10 and surfaced 4 issues. Some variation was expected. A gap this size was not.
This is not a quality benchmark, and a single experiment is not a conclusion. But for a team running a code review tool in production, a provider switch isn't just an infrastructure event; it's a behaviour change. If a gap this size holds across more samples and more task types, the downstream effects on what users see, what metrics report, and what decisions get made could be significant. That's a question this series intends to pursue.
You can't control the provider. You can control your exposure to it.
Running these experiments surfaced four questions I think are worth sitting with. Not as a checklist, just as a starting point for thinking about where you actually stand.
**Where does provider-specific code live?** If it's scattered across every file that makes an LLM call, a provider switch is a search-and-replace operation across your entire application, with debugging sessions baked in. If it's in one place, it's 35 seconds and an environment variable. Experiment two made that difference concrete: 16 minutes versus 35 seconds, on a small app with three call sites. Tools like LiteLLM and Portkey offer a managed version of this abstraction; a sketch of what that looks like follows these questions. A comparison of the build-versus-use-existing decision is worth its own piece. What matters here is that the abstraction exists, wherever it lives.
**What can you actually observe about your LLM calls?** Not in aggregate from a dashboard, but per call, per model, per task type. The 11-second outlier in experiment one was invisible from the outside. There was no signal before it happened and no explanation after. You can only hedge against what you can see.
**Are you paying for more model than the task needs?** Sonnet costs roughly 3.6x more than Haiku per request, and Haiku responds 68% faster. That gap is stable across time of day and load conditions. Whether the quality difference justifies the cost depends on the task. That's the question the next article in this series is going to try to answer.
**How long would a forced provider switch take your team?** Not hypothetically. Concretely, in your current codebase, today. The answer tells you a lot about your current exposure. If you don't know, that's worth finding out before you need to answer it under pressure.
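For comparison, the managed version of the abstraction from the first question looks roughly like this with LiteLLM. This is my reading of its API, not a recommendation, and the model strings are illustrative.

```python
from litellm import completion

# LiteLLM routes on the model string, so the provider switch is a
# one-string change instead of a code migration.
response = completion(
    model="claude-3-5-haiku-20241022",  # or e.g. "gpt-4o-mini" to switch
    messages=[{"role": "user", "content": "Review this function for bugs."}],
)
print(response.choices[0].message.content)  # OpenAI-shaped response either way
```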
I started with a question from a talk: at what point does reliable inference architecture become a DevOps problem? After running these experiments, I have a clearer sense of the shape of the answer, but I don't have the full answer yet.
What I know is this: the variance is real and structural. The blast radius of ignoring it is measurable. The behaviour change from a provider switch can be invisible to the team making it. And the infrastructure parallel of companies converging on a small number of providers without thinking through what that convergence means has played out before in ways that weren't obvious until they were painful.
Whether this becomes an everyday DevOps concern for every team regardless of size, or whether it stays a specialist concern for teams building inference-heavy products: I'm still working on that. The next article goes deeper on the architecture. The question stays open until the evidence closes it.
This is the first in a series of Field Notes pieces on LLM inference layer architecture. The experiments described here were run using the Anthropic and OpenAI APIs in April and May 2026.
Methodology, Experiment 1: Provider Variance Under Identical Conditions
100 identical requests per model per session, across 5 time windows and 2 models (Haiku and Sonnet), for 1,000 total API calls. Each request used the same prompt:
"Explain in 2-3 sentences why infrastructure reliability matters for software companies."
Latency was measured using Python's time.perf_counter() from just before the API call to just after the response completed. Token counts and costs were pulled directly from the API response usage object.
Summary results across all 10 runs:
| Run | Model | Median (ms) | p95 (ms) | Max (ms) | p95/median |
|---|---|---|---|---|---|
| Thursday off-peak 10pm | Haiku | 1,600 | 1,923 | 6,656 | 1.20x |
| Thursday off-peak 10pm | Sonnet | 2,676 | 3,302 | 11,579 | 1.23x |
| Friday peak morning 9:30am | Haiku | 1,678 | 2,781 | 6,155 | 1.66x |
| Friday peak morning 9:30am | Sonnet | 2,829 | 3,538 | 7,148 | 1.25x |
| Friday peak afternoon 1pm | Haiku | 1,608 | 2,303 | 2,958 | 1.43x |
| Friday peak afternoon 1pm | Sonnet | 2,691 | 3,963 | 4,828 | 1.47x |
| Saturday off-peak morning 9:30am | Haiku | 1,640 | 2,167 | 4,509 | 1.32x |
| Saturday off-peak morning 9:30am | Sonnet | 2,818 | 4,134 | 5,938 | 1.47x |
| Saturday off-peak evening 6pm | Haiku | 1,574 | 2,439 | 6,180 | 1.55x |
| Saturday off-peak evening 6pm | Sonnet | 2,626 | 3,254 | 3,693 | 1.24x |
Core measurement loop (Python):
```python
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

start = time.perf_counter()
response = client.messages.create(
    model=MODEL,
    max_tokens=256,
    messages=[{"role": "user", "content": PROMPT}],
)
elapsed_ms = (time.perf_counter() - start) * 1000

result = {
    "request_number": i,  # loop index within the session
    "latency_ms": round(elapsed_ms, 1),
    "input_tokens": response.usage.input_tokens,
    "output_tokens": response.usage.output_tokens,
    "cost_usd": estimate_cost(
        MODEL, response.usage.input_tokens, response.usage.output_tokens
    ),
    "response_text": response.content[0].text,
}
```
A 0.3 second pause was added between requests to avoid rate limiting. Results were appended to a JSON file after each run to allow cross-session comparison.
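The estimate_cost helper isn't shown in the loop above. A minimal version looks like the sketch below; the model keys and per-million-token rates are illustrative placeholders, not current published pricing.

```python
# Illustrative per-million-token rates in USD; check current provider
# pricing before reusing these numbers.
PRICING = {
    "claude-haiku-4-5": {"input": 1.00, "output": 5.00},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from the usage object's token counts."""
    rates = PRICING[model_id]
    return (
        input_tokens * rates["input"] + output_tokens * rates["output"]
    ) / 1_000_000
```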
Methodology, Experiment 2: Abstraction Cost of Switching Providers
A small code review assistant was built in Python with three structurally different LLM call sites, then implemented in two versions: one wired directly to the Anthropic API, one with a thin provider abstraction layer. Both were migrated to OpenAI and the migration was timed.
The three call site types:
- Single-turn completion: overall code assessment. Simplest call pattern, cleanest to migrate.
- Streaming response: detailed line-by-line feedback streamed to the terminal. Provider implementations differ meaningfully here.
- Structured JSON output: summary of findings with severity levels and recommendations. Response parsing varies by provider.
The direct version (no abstraction), key sections:
```python
# Provider-specific imports and client
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
MODEL = "claude-sonnet-4-6"

# Single-turn: Anthropic response shape
response = client.messages.create(model=MODEL, ...)
text = response.content[0].text

# Streaming: Anthropic uses a context manager
with client.messages.stream(model=MODEL, ...) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
The abstracted version, application code:
```python
# Zero provider imports in the application layer
import llm  # the only import that matters

# Single-turn: provider-agnostic
response = llm.complete(prompt=prompt, max_tokens=512)
text = response.text

# Streaming: provider-agnostic
for chunk in llm.stream(prompt=prompt, max_tokens=1024):
    print(chunk, end="", flush=True)
```
Switching providers in the abstracted version:
```bash
export LLM_PROVIDER=openai
python3 review.py sample_code.py
```
That is the entire migration. The abstraction layer handles provider-specific client initialisation, response parsing, and streaming differences internally. The application code is unchanged.
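The abstraction layer itself isn't reproduced here, but a minimal sketch of its shape, with module and function names that are mine rather than the actual implementation, looks like this:

```python
# llm.py: the only file that knows about providers (names are illustrative)
import os
from dataclasses import dataclass

PROVIDER = os.environ.get("LLM_PROVIDER", "anthropic")

@dataclass
class Response:
    text: str

def complete(prompt: str, max_tokens: int = 512) -> Response:
    """Single-turn completion, normalised to one response shape."""
    if PROVIDER == "anthropic":
        import anthropic
        resp = anthropic.Anthropic().messages.create(
            model="claude-sonnet-4-6",
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return Response(text=resp.content[0].text)
    if PROVIDER == "openai":
        import openai
        resp = openai.OpenAI().chat.completions.create(
            model="gpt-4o",  # illustrative model choice
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return Response(text=resp.choices[0].message.content)
    raise ValueError(f"Unknown LLM_PROVIDER: {PROVIDER}")
```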
Migration results:
| Metric | Direct | Abstracted |
|---|---|---|
| Migration time | 15 min 57 sec | 35 sec |
| Files changed | 1 | 0 |
| Lines changed | 8 | 0 |
| Errors to debug | 3 | 0 |
The sample code reviewed contained 9 deliberate issues across security, error handling, and Python best practices. Both providers identified the critical issues. Output consistency findings are discussed in the article body.