The Trust Boundary Problem in AI-Assisted Operations

A reader commented on an earlier Field Notes piece about Terraform MCP with a challenge I had been wanting to dig into:

"The real bottleneck at enterprise scale often isn't generation quality, it's trust boundaries. How are you thinking about validation and policy guardrails, drift detection, auditability, and human approval workflows before deployment? That's the gap between 'AI can generate Terraform' and 'AI-generated infrastructure is actually safe to operationalize in regulated or production environments.'"

That question is more urgent than it was when it was asked. At Uber, an internal background coding agent called Minions is used by 95% of the engineering organization and produces roughly 1,800 code changes a week. Uber reached that scale because it built three governance layers before scaling adoption: an LLM gateway handling PII redaction and audit logging, an MCP gateway governing every agent-to-tool connection, and an agent identity system extending Zero Trust to multi-agent workflows. From what I've seen, most teams are scaling first and asking governance questions after.

The problem

The layer that's missing

The tools that exist today handle governance as process and compliance. What they don't address is whether the AI output those processes govern is reliable enough to act on. That gap is what motivated this project. Not to replace what exists, but to build toward something that addresses the layer that's missing.

While researching the governance problem, I came across a piece that put into words what the data had already shown me.

"You cannot govern what you cannot see. Many organizations struggle with AI governance because they begin at the policy layer while ignoring the inventory layer."

The inventory layer is the foundation: knowing what AI activity exists, where it ran, and what it found. But knowing a review ran tells you nothing about whether the finding was reliable. Without both, policy is just aspiration.

A commenter on that piece added a sixth pillar the author missed: the consensus layer. Without it, you never reach a trusted shared state.

Having worked in Web3, that framing landed immediately. Consensus is the mechanism by which distributed nodes agree on a shared truth without trusting any single participant. We could apply the same principle to AI-assisted review: instead of trusting a single run, measure whether findings converge across multiple runs and use that convergence as the confidence signal. Would this make sense as part of governance? Let's see.

What we needed to know

The questions that shaped the design

Three questions needed answering before I built anything.

First: does detection rate vary by finding type, or do models catch everything consistently? If they're consistent, one review is enough. If not, you need a different approach.

Second: when a model misses a critical finding, does it score the configuration higher? Because if it does, a score isn't a proxy for quality. It's noise.

Third: does switching models because of latency change what gets found? If yes, latency isn't an infrastructure problem. It's a governance variable.

To test whether those questions had answers worth acting on, I ran a security review of a sanitized Databricks Terraform configuration 30 times across three models. Real production infrastructure. Three target issues with known ground truth:

A wildcard S3 policy granting every cluster node full access to every bucket in the account
An unbounded firewall variable that accepts any URL with no validation
A subnet flag that looks like a misconfiguration without knowing the network topology behind it

The kind of findings where getting it wrong has real consequences.

What the data told us

The answers, and what they meant

Detection rates across 90 runs

Not all findings are created equal.

30 runs per model. Detection rate = percentage of runs where the finding was identified.

Finding 1: Detection rate varies by finding type. The wildcard S3 policy was caught in 100% of runs across all three models. The unbounded firewall variable was missed by GPT-5.5 in 7% of runs. The map_public_ip_on_launch finding barely registered: under 7% across all models, and never with correct architectural context. One review can't tell you which category you're in.

Finding 2: A missed finding raises the score. When GPT-5.5 missed the firewall variable, its score went up from 3/10 to 4/10. The model found fewer issues and rated the configuration as more secure. A score isn't a signal when you don't know what the model missed.

Finding 3: Latency variance changes what gets found. GPT-5.5's median latency was 50 seconds and its p99 was 469 seconds. Beyond the speed difference, the models surfaced different issues: GPT-5.5 found the S3 VPC endpoint policy issue that Claude missed, while Claude Opus found the cross-account IAM concern that GPT-5.5 didn't flag. Switching models because of latency changes more than speed. It changes what gets found.

The solution

What I built and why

The data made the design decisions for me. And it confirmed the consensus principle: a single review is one node's opinion. The answer to all three questions was the same. Run the same review multiple times, measure whether findings converge, and return a verdict per finding based on that rate. Not a score. A detection rate.

This is what implementing that consensus principle returns:

{
  "verdict": "HUMAN_REVIEW_REQUIRED",
  "reasons": [
    "unbounded_firewall detected in 93% of runs, below 95% threshold"
  ],
  "finding_verdicts": {
    "wildcard_s3": { "rate": 1.0, "confidence": "HIGH" },
    "unbounded_firewall": { "rate": 0.93, "confidence": "REVIEW" }
  },
  "latency_budget_exceeded": false,
  "runs_completed": 30,
  "runs_failed": 0
}

The verdict is per-finding, not per-review

Detection rates vary by finding type. The wildcard S3 policy and map_public_ip_on_launch are not the same kind of problem. The governance layer doesn't produce one verdict for the whole configuration. It produces one per finding. A configuration can be HIGH confidence on one finding and REJECT on another at the same time.

A timeout counts as a non-detection

A run that didn't complete didn't catch anything. Timeouts, parse failures, truncated responses: all count as non-detections. Detection rate is calculated over everything attempted, not just what succeeded. If a model times out in two of five runs, its detection rate ceiling is 60%.
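A minimal sketch of that counting rule, assuming each attempt is recorded as a small record. The names Run and detection_rate are illustrative, not taken from the unpublished script:

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One review attempt. completed is False for timeouts, parse failures, or truncation."""
    completed: bool
    findings: set[str] = field(default_factory=set)


def detection_rate(runs: list[Run], finding: str) -> float:
    """Share of ALL attempted runs that reported the finding."""
    if not runs:
        return 0.0
    hits = sum(1 for r in runs if r.completed and finding in r.findings)
    # Divide by every attempt, not only the completed ones.
    return hits / len(runs)


# Five attempts, two of which timed out: the best possible rate is 3/5.
runs = [
    Run(True, {"wildcard_s3"}),
    Run(True, {"wildcard_s3"}),
    Run(True, {"wildcard_s3"}),
    Run(False),
    Run(False),
]
print(detection_rate(runs, "wildcard_s3"))  # 0.6
```

Dividing by completed runs instead would report 100% here and hide the two failures.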

Latency budget is a hard parameter

Switching models changes what gets found. The governance layer doesn't silently fall back to a different model when latency is high. If a run exceeds the latency budget, it counts as a non-detection. If enough runs time out, the detection rate falls below the threshold and human review is triggered. The governance layer doesn't route around latency. It governs it.
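One way to express that rule in code, as a sketch under stated assumptions: run_with_budget is a hypothetical helper, and review stands in for whatever function calls the model and returns the set of findings it reported. This version checks the budget after the call returns rather than interrupting it, which keeps the example simple but means a slow call still blocks until it finishes.

```python
import time
from typing import Callable, Optional


def run_with_budget(review: Callable[[], set], budget_s: float) -> Optional[set]:
    """Call review once; return its findings, or None for a non-detection.

    None is returned when the call raises or when it takes longer than
    budget_s seconds. There is no retry and no fallback to another model.
    """
    start = time.monotonic()
    try:
        findings = review()
    except Exception:
        return None  # failed run: counts as a non-detection
    elapsed = time.monotonic() - start
    if elapsed > budget_s:
        return None  # over budget: result is discarded, not trusted
    return findings


def fast_review() -> set:
    return {"wildcard_s3"}


def slow_review() -> set:
    time.sleep(0.05)  # simulate a run that blows the latency budget
    return {"wildcard_s3"}


print(run_with_budget(fast_review, budget_s=1.0))
print(run_with_budget(slow_review, budget_s=0.01))
```

A production version would more likely pass the budget to the HTTP client as a request timeout so the slow call is actually cut off.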

Three confidence tiers

The thresholds are derived from this experiment but configurable for your own infrastructure:

High (95%+): reliable, act on it.
Review (80% to below 95%): human review required.
Reject (below 80%): do not trust, requires investigation.

At 30 runs, the boundary between tiers is statistically fragile. A two-run difference separates 93% from 100%. The thresholds are a starting point for calibration, not a precision instrument. Treat them as policy levers, not measured cutoffs, until you have more runs behind them.
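One way to put a number on that fragility is a Wilson score interval around each detection rate. This was not part of the experiment; wilson_interval is an illustrative helper, z = 1.96 assumes a 95% confidence level, and the formula treats runs as independent trials, which repeated calls to the same model only approximate.

```python
import math


def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a detection rate of hits out of runs."""
    if runs <= 0 or not 0 <= hits <= runs:
        raise ValueError("need runs > 0 and 0 <= hits <= runs")
    p = hits / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


for hits in (28, 30):
    low, high = wilson_interval(hits, 30)
    print(f"{hits}/30: {low:.2f} to {high:.2f}")
```

For 28 of 30 detections the interval runs from roughly 0.79 to 0.98, so a single 93% reading is compatible with all three tiers.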

OpenRouter recently published benchmark data showing that panels of models consistently outperform individual models on complex tasks. Not just from model diversity, but from the synthesis step itself. Running the same prompt twice produces different reasoning paths and different outputs. Detection rate across multiple runs is a more reliable signal than any single verdict.

An IBM piece on enterprise AI puts it directly: governance only feels like a bottleneck when decision rights are unclear. Building it into the review workflow rather than treating it as a post-deployment audit is what makes it useful rather than friction.

The verdict

What the governance layer returned

Every governance check against the Databricks configuration returned at least one REJECT verdict.

The wildcard S3 policy and unbounded firewall variable were caught reliably across runs. The map_public_ip_on_launch finding never cleared 10% detection for any model, returning REJECT on every check.

That REJECT is the right call for that finding. But the reason matters. "REJECT because of a critical security gap" and "REJECT because of a context-dependent finding we don't have context for" require different responses from a human reviewer.

The governance layer can't tell you which one you're looking at. That's what makes context-dependent findings the harder problem to solve.

What the governance layer can't solve

No model correctly contextualized the map_public_ip_on_launch finding across 90 runs. Not once. The finding is invisible to automated review without explicit architectural context in the prompt. When models did flag it, they rated it high or critical without understanding that the Network Firewall in front of the subnet changes the risk profile entirely.

This is not a failure of any specific model. It's a structural limit. The reviewer doesn't know what it doesn't know. It can't ask "what sits in front of this subnet?" It can only work with what's in the files.

The governance layer handles the clear cases well. It surfaces the context-dependent cases as low-confidence findings that need human judgment. It's a triage tool, not a replacement for review.

Where this leaves you

What you can do today

Tools like Checkov, tfsec, and OPA/Conftest already handle a large part of this. They catch misconfigurations deterministically, without variance, at near-zero cost per run. What I'm building isn't trying to replace them. What they can't do is reason about context-dependent findings or tell you whether that reasoning is consistent enough to trust across runs. This doesn't solve that problem either. What it does is make the gap visible: a low-confidence verdict tells you where human judgment is required.

Based on what the data showed, here's what I'd recommend:

Run the same review multiple times. A single pass gives you a finding. Multiple passes give you a detection rate. Those are different things.
Treat a timeout as a non-detection, not a retry. A run that didn't complete didn't catch anything.
Identify the context-dependent findings in your infrastructure before you build a governance policy. Those findings need a different approach: not more runs, but richer context in the prompt.
Don't use latency as a reason to switch models silently. A different model is a different reviewer with different blind spots.

What's next

The context problem

The deeper question is what to do about the context-dependent findings. Three approaches worth exploring: an exceptions registry for known intentional architectural decisions, context injection that passes topology descriptions alongside the Terraform, and a two-pass review that asks targeted questions about low-confidence findings. Let's keep building on this in the next article.

Methodology

How data was collected to inform the design

I needed data before I could make design decisions. Specifically: does detection rate vary enough to warrant multiple runs? Does latency affect what gets found? The answers shaped every design decision in the governance layer.

Models: Claude Sonnet (claude-sonnet-4-6), Claude Opus (claude-opus-4-8), GPT-5.5 (gpt-5.5-2026-04-23). Runs: 30 per model, 90 total. Max tokens: 1,500 for Anthropic models, 4,096 for GPT-5.5 (GPT-5.5 is significantly more verbose and requires more output budget).

Review target: Sanitized Databricks E2 deployment on AWS. Client name, AWS account ID, Databricks account ID, and RDS endpoint replaced with placeholders. The security findings are real; the identifying details are not.

Files reviewed per run: e2/vpc.tf, e2/firewall.tf, e2/iam_cross_account.tf, e2/S3.tf, mws/s3_access_policy.tf. Total input: approximately 5,600 tokens per run for Anthropic models, 4,300 for OpenAI.

Detection methodology: Each run output was scanned for keyword patterns associated with each target issue. Detection rate was calculated over total runs attempted, not valid runs. Timeouts and parse failures counted as non-detections. False positive rate was not measured in this experiment. The focus was detection consistency for known true positives. False positive analysis is a necessary next step before applying this approach to configurations without known ground truth.
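The scanning step might look something like the sketch below. The patterns are hypothetical placeholders written for this example; the calibrated keyword lists used in the experiment are not published, and detect is an illustrative name.

```python
import re

# Hypothetical patterns for illustration only, not the article's real lists.
PATTERNS: dict[str, list[str]] = {
    "wildcard_s3": [r"s3:\*", r"wildcard.{0,40}s3", r"s3.{0,40}wildcard"],
    "unbounded_firewall": [r"firewall.{0,60}validation", r"validation.{0,60}firewall"],
    "map_public_ip_on_launch": [r"map_public_ip_on_launch"],
}


def detect(output: str, patterns: dict[str, list[str]] = PATTERNS) -> set[str]:
    """Return the target issues whose patterns appear in one run's output."""
    flags = re.IGNORECASE | re.DOTALL
    return {
        name
        for name, regexes in patterns.items()
        if any(re.search(rx, output, flags) for rx in regexes)
    }


sample = (
    'The IAM policy allows "s3:*" on every bucket. '
    "The firewall allow-list variable has no validation block."
)
print(sorted(detect(sample)))
```

Keyword matching only shows that a run mentioned an issue. A response that says a setting is fine would still match, which is one reason the lists need calibration against real outputs.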

Three issues worth noting from setup:

GPT-5.5 hit the token limit at 1,500 output tokens on the first test run and produced no parseable output. Raising the ceiling to 4,096 resolved this. GPT-5.5 used between 1,600 and 3,400 tokens per review depending on verbosity.
Claude Opus had one parse failure in 30 runs (Run 12). The model completed successfully but the JSON formatting was malformed. This counted as a non-detection for that run.
The keyword detection lists used to scan for target issues are not published. They represent accumulated calibration work that improves with each infrastructure type covered.

Raw results

Score and latency distributions across all three models

Score distribution:

Model	Min	Median	Mean	Max	p95
Claude Sonnet	3	3	3.0	3	3
Claude Opus	3	3	3.31	4	4
GPT-5.5	3	3	3.33	4	4

Latency distribution (seconds):

Model	Median	p95	p99
Claude Sonnet	21s	26s	151s
Claude Opus	16s	19s	46s
GPT-5.5	50s	79s	469s

Claude Sonnet was the most consistent model across all metrics: perfect score consistency, 100% detection on both primary target issues, lowest output token variance. Claude Opus was fastest. GPT-5.5 latency variance is significant enough that its p99 would exceed most CI/CD time budgets.

The governance layer

How the verdict engine works

The governance layer is a Python script that wraps the review loop. It accepts a model, a run count, and a latency budget. It returns a structured JSON verdict.

Key design decisions:

Timeout is treated as non-detection, not retry. If a run exceeds the latency budget, it counts against the detection rate.
Detection rate is calculated over all attempted runs, including failures. A model that times out in 2 of 5 runs has a detection rate ceiling of 60% regardless of what the other 3 runs found.
The verdict is per-finding, not per-review. A configuration can have some findings at HIGH confidence and others at REJECT simultaneously.
Thresholds are configurable. The defaults (95% HIGH, 80% REVIEW) are a starting point, not a standard. Calibrate against your own infrastructure.
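Since the script itself is not published, here is a rough sketch of how those decisions could combine into the verdict structure shown earlier. build_verdict and its parameters are illustrative names, the "PASS" verdict for the all-HIGH case is an assumption (the article only shows HUMAN_REVIEW_REQUIRED), and the latency_budget_exceeded field is left out for brevity.

```python
import json


def build_verdict(
    hits: dict[str, int],
    runs_completed: int,
    runs_failed: int,
    high: float = 0.95,
    review: float = 0.80,
) -> dict:
    """Turn per-finding hit counts into a per-finding verdict structure."""
    attempted = runs_completed + runs_failed  # failures stay in the denominator
    if attempted <= 0:
        raise ValueError("at least one run must have been attempted")

    finding_verdicts: dict[str, dict] = {}
    reasons: list[str] = []
    for name, count in hits.items():
        rate = count / attempted
        tier = "HIGH" if rate >= high else "REVIEW" if rate >= review else "REJECT"
        finding_verdicts[name] = {"rate": round(rate, 2), "confidence": tier}
        if tier != "HIGH":
            reasons.append(
                f"{name} detected in {rate:.0%} of runs, below {high:.0%} threshold"
            )

    return {
        # "PASS" is a hypothetical label for the case where every finding is HIGH.
        "verdict": "HUMAN_REVIEW_REQUIRED" if reasons else "PASS",
        "reasons": reasons,
        "finding_verdicts": finding_verdicts,
        "runs_completed": runs_completed,
        "runs_failed": runs_failed,
    }


verdict = build_verdict(
    {"wildcard_s3": 30, "unbounded_firewall": 28}, runs_completed=30, runs_failed=0
)
print(json.dumps(verdict, indent=2))
```

Keeping the thresholds as arguments rather than constants matches the point above: they are policy levers to calibrate, not fixed cutoffs.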

What the governance layer does not do: It does not tell you whether a finding is correct. It tells you how consistently the model finds it. Those are different things. A finding detected in 100% of runs might still be wrong if the model lacks the architectural context to interpret it correctly.

Code availability: The governance layer is a Python script built for this experiment. It is not published. If you are interested in the approach, the design decisions and verdict output are documented in full in this article.