
Caching LLM Extractions Without Lying: Conformal Gates + a Reasoning Budget Allocator

Daniel Anthony Romitelli Jr. · March 11, 2026


The extraction pipeline processed 2,400 documents overnight. Cost: $380. The next morning I diffed the inputs against the previous batch—87% were near-duplicates with trivial whitespace changes. I’d burned $330 re-extracting answers I already had.

Not because the cache missed.

Because my cache had no right to hit.

A TTL can tell you when something is old. It cannot tell you when something is wrong. And for an AI extraction pipeline, “wrong” is the only thing that matters.

So I rebuilt the caching layer around a different idea: caching is a statistical validity problem, not an expiry problem. Then I paired it with a second idea that sounds obvious until you implement it: reasoning depth is a budget allocation problem, not a model selection problem.

What I ended up with in production is a two-stage system:

  1. Confidence-gated cache: per-selector reuse vs partial rebuild using a multi-signal score and conformal thresholds.
  2. Reasoning budget allocator: per-span compute decisions under a fixed budget using a value-of-insight objective.

Together, they cut API costs by 90% and took batch processing from hours to minutes.

Key insight (the part that changed everything)

The naive approach to caching an AI extraction pipeline is:

  • hash the input
  • store the output
  • add a TTL

That works for pure functions. Extraction isn’t a pure function.

Even with identical text, the “right” output can change because:

  • your feature set changes (new fields, different normalization)
  • your template changes (versioned prompt / schema)
  • your downstream expectations change (what counts as acceptable)
  • your similarity assumptions were wrong (two texts look close but differ on a critical constraint)

So instead of asking “is this cached value fresh?” I ask:

“is this cached value still valid for the specific selectors I’m about to use?”

Selectors are the trick: I don’t treat the extraction artifact as one blob. I treat it as a set of spans grouped by selectors (field groups). The cache gate returns either:

  • ("reuse", entry.artifact)
  • ("rebuild", dirty_spans)

That second path is the whole point: partial rebuilds.

The budget allocator then gets the spans and spends compute only where quality is below the target.

One gate answers “is reuse statistically justified?” The other answers “if not, what’s the cheapest way to fix it?”
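To make the data model concrete, here's a sketch of the cache entry shape those two paths operate on. The field names mirror the snippets later in the post (entry.meta, entry.dc.spans, entry.probes, entry.tau_delta, entry.selector_ttl); the exact types are my assumption, not the production schema:

```python
from dataclasses import dataclass, field

@dataclass
class DirtyControl:
    spans: dict          # selector -> list of span dicts for partial rebuilds
    selector_tau: dict   # selector -> per-selector conformal threshold

@dataclass
class CacheEntry:
    meta: dict                 # embed, fields, created_at, ...
    artifact: dict             # the full cached extraction
    dc: DirtyControl
    probes: dict = field(default_factory=dict)        # selector -> observed probe deltas
    selector_ttl: dict = field(default_factory=dict)  # selector -> adaptive TTL params
    tau_delta: float = 1e9                            # fallback threshold when a selector has none
```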

How it works

Stage 1 — Confidence-gated cache: score similarity like you mean it

I compute a single similarity score s between the new request and cached metadata. It’s not one signal; it’s a blend of four.

Here’s the exact scoring logic I run:

import time
import numpy as np

def score(req: dict, meta: dict) -> float:
    α, β, γ, η = 0.6, 0.3, 0.08, 0.02
    s  = α * _cosine(np.array(req.get("embed", [])), np.array(meta.get("embed", [])))
    s += β * _feature_drift(req.get("fields", {}), meta.get("fields", {}))
    s += γ * min(72, (time.time() - meta.get("created_at", time.time())) / 3600.0) / 72
    s += η * (0 if req.get("fields", {}).get("template_version") == meta.get("fields", {}).get("template_version") else 1)
    return float(s)

This surprised me the first time I tuned it: the score isn’t “semantic similarity” with a little seasoning. It’s a weighted argument about why cached output might be invalid.

The four signals are:

  • Embedding cosine (weight α = 0.6)
  • Feature drift across key fields (weight β = 0.3)
  • Age decay capped at 72 hours (weight γ = 0.08)
  • Template version mismatch (weight η = 0.02)

That 72-hour cap matters: I don’t want “very old” to dominate the score forever. Age is a weak prior, not a verdict.

My one analogy for this whole post: this score is a four-sensor smoke detector. One sensor (embeddings) can be fooled by “similar enough.” Another (feature drift) catches the quiet but deadly changes. Age is the battery that slowly drains your trust. Template mismatch is the “someone swapped the wiring” alarm.

Stage 1.5 — Conformal prediction: thresholds that come from reality

A fixed threshold is where these systems go to die.

If you pick a global constant and ship it, you’ll either:

  • reuse too aggressively and serve stale extractions, or
  • rebuild too often and defeat the point of caching

So I compute a conformal threshold tau from calibration history. The gate then reflects empirical error rates rather than a hand-tuned constant.

The threshold is computed from historical scores where realized span error exceeded eps. I sort those “bad” scores and pick a quantile controlled by delta.

Here’s the exact logic:

def conformal_tau(calib_scores, eps, delta):
    # scores whose realized span error exceeded eps
    over = sorted(s for s, e in calib_scores if e > eps)
    if not over:
        return 1e9  # no evidence yet: start permissive
    idx = int(max(0, (1 - delta) * (len(over) - 1)))
    return float(over[idx])

Two details I like about this:

  1. If there are no “over-epsilon” examples yet, I return 1e9. That’s intentionally permissive: the system starts by reusing and learns its way into being stricter.
  2. delta controls which quantile I take. I’m not guessing a threshold; I’m choosing a risk tolerance.
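To see the quantile behavior concretely, here's a toy run. The function repeats the quantile logic (wrapped so this snippet stands alone), and the calibration pairs are invented (score, realized_error) tuples:

```python
def conformal_tau(calib_scores, eps, delta):
    # scores whose realized span error exceeded eps
    over = sorted(s for s, e in calib_scores if e > eps)
    if not over:
        return 1e9  # no evidence yet: start permissive
    idx = int(max(0, (1 - delta) * (len(over) - 1)))
    return float(over[idx])

calib = [(0.1, 0.0), (0.3, 0.02), (0.5, 0.12), (0.7, 0.2), (0.9, 0.3)]
tau = conformal_tau(calib, eps=0.05, delta=0.1)
# over = [0.5, 0.7, 0.9]; idx = int(0.9 * 2) = 1, so tau == 0.7
```

Shrink delta and the index climbs toward the largest "bad" score, so tau rises and the gate reuses more; grow delta and tau drops, so the gate rebuilds more often.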

Stage 1.75 — Reuse vs partial rebuild: decide per selector, return dirty spans

Now the part that makes this operationally useful: I don’t decide “cache hit” globally.

I decide per selector, and I return the spans that need work.

def cache_gate(req: dict, entry, eps: float):
    s = score(req, entry.meta)
    dirty = []
    for sel in req.get("touched_selectors", []):
        selector_tau = entry.dc.selector_tau.get(sel, entry.tau_delta)
        if _worst_probe_delta(entry.probes.get(sel, [])) > eps or s > selector_tau:
            dirty.extend(entry.dc.spans.get(sel, []))
    if not dirty:
        return ("reuse", entry.artifact)
    return ("rebuild", dirty)

The non-obvious engineering win is that dirty is a list of spans, not a boolean.

That turns caching from a blunt instrument into a scalpel:

  • If only one selector looks risky, I rebuild only its spans.
  • If everything looks safe, I reuse the full artifact.

Also note the two independent failure modes that mark a selector dirty:

  • _worst_probe_delta(...) > eps (probe-based evidence of staleness)
  • s > selector_tau (similarity score exceeds the selector’s conformal threshold)

I like having both. Similarity is predictive; probes are forensic.

A side mechanism I still use: adaptive TTL sampling (BDAT)

The conformal gate handles the validity axis—whether reuse is statistically defensible right now. But there’s a second axis it deliberately ignores: time. BDAT handles that. I maintain TTL parameters per selector and update them based on staleness observations, so the system learns how quickly each selector’s reality drifts.

The update logic looks like this:

params = entry.selector_ttl[selector]
if was_stale:
    params['beta'] = max(1, params['beta'] - 0.5)
    params['alpha'] = min(10, params['alpha'] + 0.5)
else:
    if actual_ttl > params['last_sampled_ttl'] * 1.5:
        params['alpha'] = max(1, params['alpha'] - 0.2)
        params['beta'] = min(10, params['beta'] + 0.2)

This is one of those pieces that looks “small” but changes behavior over time. I’m not freezing TTL policy; I’m letting selectors drift toward what production traffic teaches me.

What surprised me here is how asymmetric the update is: when something is stale, I move the parameters more aggressively (±0.5) than when it’s not stale (±0.2 and only under a condition). That matches the real pain: stale reuse is more expensive than an unnecessary rebuild.

The relationship to conformal tau is direct: BDAT adjusts when to re-evaluate, and tau decides whether to rebuild when you do. A selector whose TTL keeps shrinking is one whose conformal threshold will tighten too, because more frequent checks mean more calibration data, which means tau converges faster. They're two feedback loops on the same signal: one temporal, one statistical.
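The post shows the update rule but not the sampling step. Given the alpha/beta parameters, one plausible reading (my assumption, including the `base_ttl_hours` knob) is to draw a staleness probability from a Beta distribution and shrink the TTL accordingly:

```python
import random

def sample_ttl(params: dict, base_ttl_hours: float = 24.0) -> float:
    # Draw a staleness probability from Beta(alpha, beta); more observed
    # staleness (higher alpha) pulls the sampled TTL down.
    p_stale = random.betavariate(params['alpha'], params['beta'])
    ttl = base_ttl_hours * (1.0 - p_stale)
    params['last_sampled_ttl'] = ttl  # the update rule above reads this back
    return ttl
```

Under this reading the asymmetry lines up: a stale observation bumps alpha by 0.5, the sampled staleness probability rises, and the next TTL shrinks.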

Stage 2 — Reasoning budget allocator: spend compute like it’s cash

Once the cache gate returns either a reused artifact or dirty spans, I still have a second problem:

Even inside a rebuild, not all spans deserve the same attention.

The naive approach is “pick a model tier for the whole extraction.” That’s just a different kind of blunt instrument.

Instead, I treat each span like a line item in a budget.

Step 1: sort spans by uncertainty

Spans arrive with context. I sort them by a combined uncertainty score:

spans = artifact_ctx.get("spans", [])
spans.sort(key=lambda s: (
    s.get("ctx", {}).get("retrieval_dispersion", 0) +
    s.get("ctx", {}).get("rule_conflicts", 0) +
    s.get("ctx", {}).get("cache_margin", 0)
), reverse=True)

This ordering is where the allocator gets its teeth.

  • retrieval_dispersion: when retrieval is scattered, the span is uncertain.
  • rule_conflicts: when rules disagree, the span is uncertain.
  • cache_margin: when the cache gate barely passed, the span is uncertain.

I push the weirdest spans to the front so they get first claim on the budget.

Step 2: choose an action by value-of-insight

For each span, I evaluate action candidates and pick the one with the highest value-of-insight (VOI):

  • qgain is how much quality I expect to gain
  • cost is the compute cost
  • latency is the latency cost
  • lam and mu trade off cost vs latency

Then I pick the max.

Here’s the exact loop I run:

def allocate(spans, total_budget, target_quality, lam, mu):
    for s in spans:
        if total_budget <= 0 or s.get("quality", 0) >= target_quality:
            continue
        # candidate tuples: (name, text, qgain, cost, latency)
        candidates = [
            ("reuse", cached_text, 0.01, 0.0, 0.0),
            ("small", llm_mini_result, 0.15, 1.0, 1.0),
            ("tool",  tool_result, 0.22, 1.8, 1.2),
            ("deep",  llm_full_result, 0.30, 3.5, 2.0),
        ]
        name, text, qgain, cost, lat = max(candidates, key=lambda c: _voi(c[2], c[3], c[4], lam, mu))
        if cost <= total_budget:
            total_budget -= cost
            s["text"] = text
            s["quality"] = min(1.0, s.get("quality", 0) + qgain)
            s["action_taken"] = name
    return assemble(spans)

Two things make this work in production:

  1. The early exit: if a span already meets target_quality, I don’t touch it.
  2. The reuse candidate: ("reuse", ..., 0.01, 0.0, 0.0).

That “reuse gives 0.01 quality gain” is a very opinionated line in the sand. It encodes a truth I learned the hard way: even when you reuse, you’re not getting perfect certainty—just a small nudge in confidence because the span existed and passed the cache gate.

And because reuse costs 0.0, most spans clear the bar without spending anything.

How the two systems snap together

The confidence-gated cache is the first gate. It answers:

  • “Is this selector safe to reuse?”
  • “If not, which spans are dirty?”

The reasoning budget allocator is the second gate. It answers:

  • “Given a fixed budget, which spans deserve compute?”
  • “What action maximizes quality per unit cost and latency?”

Here’s the architecture as it exists conceptually in my pipeline: cache gate → reasoning budget allocator → assemble(spans).

The important part isn’t the boxes. It’s the contract between them:

  • cache gate outputs spans with enough metadata (quality, ctx) for the allocator to make sane decisions
  • allocator respects target_quality and total_budget so it can’t run away
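Here's what that contract looks like as a concrete span payload. Field names follow the snippets above; the selector name and numbers are illustrative:

```python
# One span as it crosses from the cache gate to the allocator.
span = {
    "selector": "invoice.totals",    # hypothetical selector name
    "text": None,                    # filled in by whichever action the allocator picks
    "quality": 0.0,                  # current confidence; compared against target_quality
    "ctx": {
        "retrieval_dispersion": 0.4, # how scattered retrieval was for this span
        "rule_conflicts": 1,         # count of disagreeing extraction rules
        "cache_margin": 0.1,         # how close the cache gate's call was
    },
}

# The allocator's sort key is just the sum of the ctx signals.
uncertainty = sum(span["ctx"].values())
```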

What went wrong (and what I changed)

The failure mode that pushed me to this design was simple: I was caching like a web server.

A TTL-based cache for AI extraction looks comforting because it’s familiar. But it gives you the wrong safety guarantee.

  • A long TTL saves money but increases the chance of serving stale extractions.
  • A short TTL reduces staleness but rebuilds too often.

That’s not a tuning problem. That’s the wrong axis.

The axis that matters is: how similar is this request to the one that produced the cached artifact, in the ways that affect correctness?

So I replaced “time since write” as the primary decision variable with:

  • embedding similarity
  • feature drift
  • capped age decay
  • template version mismatch

Then I stopped pretending the whole artifact is one unit of work and made the gate return dirty spans.

The second failure mode was compute allocation.

Even after partial rebuilds, I was still overspending by treating a rebuild as “run the expensive path.” The allocator fixed that by making every span compete for budget.

Nuances and tradeoffs

1) The score is a blend, not a model

I like that the score is explicit weights (α, β, γ, η). It’s debuggable.

The tradeoff is you’re committing to a worldview. If you overweight embeddings, you’ll miss structural drift. If you overweight feature drift, you’ll rebuild too often on harmless changes.

I chose weights that keep embeddings dominant (0.6) but let drift be loud (0.3). Age and template mismatch are present but intentionally small.

2) Conformal thresholds require calibration data

The conformal tau computation depends on calib_scores with observed errors. Early on, you may have none—hence the 1e9 default.

That’s a trade: you start permissive and tighten as reality arrives.

3) Partial rebuilds are only as good as your span mapping

Returning dirty spans is only useful if entry.dc.spans[sel] is accurate.

If you mis-assign spans to selectors, you’ll either:

  • rebuild too much (safe but expensive), or
  • rebuild too little (cheap but wrong)

4) The allocator is greedy

The budget controller iterates spans in sorted order and spends budget if it can.

That’s pragmatic and fast.

The tradeoff is it’s not globally optimal. It’s a greedy knapsack with a VOI heuristic. In practice, the uncertainty sorting makes it behave like I want: fix the sketchiest spans first.

5) VOI weights (lam, mu) encode product priorities

The allocator’s behavior changes dramatically depending on how you set the cost and latency penalties.

That’s not a bug. It’s the point: the same pipeline can run in a “cheap batch” mode or a “fast interactive” mode by changing what you punish.

The takeaway I wish I’d internalized earlier

Caching AI extractions isn’t about time. It’s about whether reuse is defensible.

And “how much reasoning to do” isn’t about picking a model. It’s about spending a fixed budget where it buys the most certainty.

Once I treated both as gating problems—first statistical validity, then cost-optimal depth—the pipeline stopped paying full price for answers it already had.