I hit the wall when a single noisy signal kept dragging candidate selection around like a shopping cart with one bad wheel. The composite score looked stable on paper, but the moment I watched different weight sets compete on real samples, it became obvious that I was tuning a judgment function by instinct and hoping the surface was kind.
The idea behind the calibrator
The part that changed my approach was treating the weights as a search space, not a set of sacred constants. The reward mixer already exposed five signals — visualDrift, colorHarmony, motionContinuity, compositionStability, and narrativeCoherence — and the calibrator sits above that layer to ask a different question: which combination of weights best matches the benchmark data?
That is a very different job from adjusting thresholds. A threshold says, “accept or reject at this line.” A calibrator says, “given these signals, what should the system believe more or less strongly?” That distinction matters because the calibrator is searching the parameter surface of the judgment function itself. It is not just moving a gate; it is reshaping the scorer that the gate depends on.
The reward mixer starts with explicit signal types, and that structure is what makes calibration possible in the first place.
export interface RewardSignals {
  /** CLIP embedding cosine similarity to source frame (0-1) */
  visualDrift: number | null
  /** Color palette consistency score (0-1) */
  colorHarmony: number | null
  /** Motion direction alignment score (0-1) */
  motionContinuity: number | null
  /** Spatial composition stability score (0-1) */
  compositionStability: number | null
  /** LLM-derived narrative coherence score (0-1) */
  narrativeCoherence: number | null
}
What I like about this shape is that every signal is visible, named, and separable. That makes the failure mode obvious too: if one signal is noisy, it does not deserve to dominate the entire score just because it arrived with confidence.
How I combine signals before calibration
The non-obvious part is that the composite score cannot assume every signal is present. In the calibrator, null values are skipped and the weights are renormalized over the remaining signals. That guardrail keeps a missing or unavailable signal from pretending it should still count.
Here is the core scoring helper from the calibrator.
function computeComposite(
  weights: RewardWeights,
  signals: RewardSignals
): number {
  let totalWeight = 0
  let weightedSum = 0
  for (const key of SIGNAL_KEYS) {
    const signal = signals[key]
    if (signal !== null && signal !== undefined) {
      const w = weights[key]
      weightedSum += w * signal
      totalWeight += w
    }
  }
  return totalWeight > 0 ? weightedSum / totalWeight : 0
}
That renormalization is the quiet guardrail in the whole system. If I had simply multiplied every signal by its weight and summed the result, a missing signal would have behaved like a silent penalty. Instead, the score is computed from the signals that actually exist, and the denominator follows the evidence.
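To make that concrete, here is a small worked example with the production weights and made-up signal values. The numbers are hypothetical; the point is that a missing narrativeCoherence drops its 0.15 weight out of the denominator instead of acting as a hidden penalty.

const weights: RewardWeights = {
  visualDrift: 0.30,
  colorHarmony: 0.25,
  motionContinuity: 0.15,
  compositionStability: 0.15,
  narrativeCoherence: 0.15,
}

// Hypothetical candidate: four signals present, one unavailable.
const signals: RewardSignals = {
  visualDrift: 0.6,
  colorHarmony: 0.8,
  motionContinuity: 0.5,
  compositionStability: 0.7,
  narrativeCoherence: null,
}

// weightedSum = 0.30*0.6 + 0.25*0.8 + 0.15*0.5 + 0.15*0.7 = 0.56
// totalWeight = 0.30 + 0.25 + 0.15 + 0.15 = 0.85
// composite   = 0.56 / 0.85 ≈ 0.659, rather than a silently penalized 0.56
const composite = computeComposite(weights, signals)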
The calibrator's own config spells out the bounds the search runs under.
const DEFAULT_CONFIG: Required<CalibratorConfig> = {
  gridSteps: 5,
  refinementIterations: 10,
  refinementStep: 0.02,
  minWeight: 0.05,
  maxWeight: 0.50,
  normalizeWeights: true,
  holdoutFraction: 0.2,
}
Those bounds matter because they keep one signal from swallowing the rest during the search. The calibrator is allowed to explore, but not to wander into a regime where a single weight becomes the whole story.
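To show what those bounds do mechanically, here is a clamp-then-normalize sketch. It is my reconstruction from the config shape above, not an excerpt from the calibrator itself.

// Sketch (assumed shape): clamp each weight to [minWeight, maxWeight], then
// rescale the vector to sum to 1 when normalizeWeights is enabled.
function constrainWeights(
  weights: RewardWeights,
  config: Required<CalibratorConfig>
): RewardWeights {
  const constrained = { ...weights }
  for (const key of SIGNAL_KEYS) {
    constrained[key] = Math.min(
      config.maxWeight,
      Math.max(config.minWeight, constrained[key])
    )
  }
  if (!config.normalizeWeights) return constrained
  const total = SIGNAL_KEYS.reduce((sum, key) => sum + constrained[key], 0)
  for (const key of SIGNAL_KEYS) {
    constrained[key] = constrained[key] / total
  }
  return constrained
}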
What the search loop actually does
The calibrator uses two optimization strategies: a grid search followed by coordinate descent. The first is broad and exhaustive over a discretized weight space. The second starts from the best grid point and nudges one dimension at a time until the score stops improving.
That structure is easier to trust than a single clever optimizer because it gives me a baseline and a refinement pass. I can see whether the coarse search found a decent region before I ask the local search to polish it.
I wanted the calibrator to feel more like a lab bench than a black box: try a weight set, score it against the samples, keep what wins, move on. That makes the search legible when I inspect it later, which is exactly when these systems usually try to become mysterious.
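Here is roughly how I picture those two phases fitting together. This is a compressed sketch under assumptions, reusing the constrainWeights sketch above and taking the scoring objective as a parameter already bound to the benchmark samples; the production code differs in the details.

// Walk every combination of grid levels across the five signals (5^5 = 3,125
// candidates at gridSteps: 5), then refine the winner one weight at a time.
function* gridCandidates(
  levels: number[],
  partial: number[] = []
): Generator<RewardWeights> {
  if (partial.length === SIGNAL_KEYS.length) {
    const weights = {} as RewardWeights
    SIGNAL_KEYS.forEach((key, i) => { weights[key] = partial[i] })
    yield weights
    return
  }
  for (const level of levels) {
    yield* gridCandidates(levels, [...partial, level])
  }
}

function searchWeights(
  config: Required<CalibratorConfig>,
  objective: (weights: RewardWeights) => number
): RewardWeights {
  const levels = Array.from({ length: config.gridSteps }, (_, i) =>
    config.minWeight +
    (i * (config.maxWeight - config.minWeight)) / (config.gridSteps - 1)
  )

  // Phase 1: coarse, exhaustive grid search.
  let best: RewardWeights | undefined
  let bestScore = -Infinity
  for (const candidate of gridCandidates(levels)) {
    const constrained = constrainWeights(candidate, config)
    const score = objective(constrained)
    if (score > bestScore) {
      best = constrained
      bestScore = score
    }
  }
  if (!best) throw new Error('grid search produced no candidates')

  // Phase 2: coordinate descent around the best grid point. Nudge one weight
  // at a time by refinementStep until no single move improves the objective.
  for (let i = 0; i < config.refinementIterations; i++) {
    let improved = false
    for (const key of SIGNAL_KEYS) {
      for (const delta of [config.refinementStep, -config.refinementStep]) {
        const nudged = constrainWeights(
          { ...best, [key]: best[key] + delta },
          config
        )
        const score = objective(nudged)
        if (score > bestScore) {
          best = nudged
          bestScore = score
          improved = true
        }
      }
    }
    if (!improved) break
  }
  return best
}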
The calibrator's result type shows that I care about more than just the final weights.
export interface CalibrationResult {
  /** Optimized weights */
  weights: RewardWeights
  /** Accuracy on holdout set (if k-fold used) */
  accuracy: number
  /** Correlation between predicted composite and actual quality */
  correlation: number
  /** Number of samples used */
  sampleCount: number
  /** Number of weight configurations evaluated */
  configurationsEvaluated: number
  /** Phase 1 vs Phase 2 improvement */
  gridBestAccuracy: number
  refinedAccuracy: number
}
gridBestAccuracy and refinedAccuracy are the before-and-after story: both hold the combined objective the calibrator actually optimizes, which weights accuracy at 70% and Pearson correlation with the quality score at 30% (I sketch that blend below, once the sample type is on the table). If the two numbers are close, refinement did not earn its keep; if they diverge, coordinate descent found a ridge the grid stepped past. accuracy and correlation on their own are the holdout-set read of the final weight set, computed after the train/validation split.

configurationsEvaluated is less glamorous but load-bearing for debugging: at gridSteps: 5, Phase 1 alone walks 3,125 weight vectors, so the count tells me whether Phase 2 contributed fifty more candidates or five hundred, which is the difference between a run that bounced off its bounds immediately and one that actually polished the grid point. sampleCount is the denominator everything else is calibrated against, and the calibrator short-circuits to defaults when it drops below ten.
Why this is not hand-tuning
Hand-tuning starts with a belief and tries to make the numbers agree with it. Calibration starts with data and asks which belief survives contact with the benchmark set.
The calibrator operates on logged samples from real pipeline runs. Each sample carries the raw reward signals, a quality score, and whether the candidate was accepted. That means the search is not guessing in a vacuum; it is optimizing against examples where the pipeline already revealed something about what "good" looked like.
export interface CalibrationSample {
  /** Raw reward signals for this candidate */
  signals: RewardSignals
  /** Human or automated quality assessment (0-1) */
  qualityScore: number
  /** Was this candidate selected/accepted? */
  accepted: boolean
  /** Optional metadata for filtering */
  sceneType?: string
  timestamp?: number
}
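Given those samples, the combined objective from the result discussion is small enough to sketch: 70% agreement with the accepted flag, 30% Pearson correlation with qualityScore. The 0.5 accept cut and the helper names here are my assumptions, and pearson() is written out in the next section.

// Sketch of the combined objective the search maximizes. The 0.5 threshold on
// the composite is an assumed stand-in for however acceptance is actually cut.
function objective(weights: RewardWeights, samples: CalibrationSample[]): number {
  if (samples.length === 0) return 0
  const composites = samples.map(s => computeComposite(weights, s.signals))
  const correct = samples.filter(
    (s, i) => (composites[i] >= 0.5) === s.accepted
  ).length
  const accuracy = correct / samples.length
  const correlation = pearson(composites, samples.map(s => s.qualityScore))
  return 0.7 * accuracy + 0.3 * correlation
}

In the search sketch earlier, this is what would be handed in, partially applied to the training split so the optimizer only ever sees a function of the weights.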
The important limitation here is that calibration is only as honest as the samples feeding it. If the benchmark data is narrow, the search can still find a polished lie. So the calibrator is not a replacement for judgment; it is a way to make the judgment function answer to evidence instead of habit.
What Phase 1 already told me
The weights in production today — visualDrift: 0.30, colorHarmony: 0.25, motionContinuity: 0.15, compositionStability: 0.15, narrativeCoherence: 0.15 — came from the Phase 1 baseline run on 2026-02-21 across 100 contracts, and formal hyperparameter search was deferred in that phase while the HITL gate was calibrated first. But Phase 1 surfaced exactly the kind of signal evidence the calibrator is built to reconcile. On the 69 GPU-rendered contracts, visualDrift averaged 0.529 and drove zero HITL flags, while narrativeCoherence averaged 0.317 and drove 65.2% of them. motionContinuity came in at 0.387 with another 17.4%. The two signals sitting at weight 0.15 together accounted for over 80% of the failure signal, while the signal carrying the largest weight at 0.30 silently passed everything.
That is the distribution the calibrator exists to rebalance. A weight vector that puts its largest share on the signal doing no discrimination is not a tuned instrument — it is a ritual. The correlation field is exactly the statistic that would catch it: Pearson correlation between the composite score and the logged quality score, which punishes weight vectors that let a flat signal dominate.
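For reference, the statistic itself is short; this is the textbook form rather than a line-for-line copy of the calibrator's version.

// Pearson correlation between composite scores and logged quality scores.
// A composite dominated by a flat signal carries little information about
// quality, so its correlation stays low and the combined objective drops.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length
  if (n === 0) return 0
  const meanX = xs.reduce((a, b) => a + b, 0) / n
  const meanY = ys.reduce((a, b) => a + b, 0) / n
  let cov = 0
  let varX = 0
  let varY = 0
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX
    const dy = ys[i] - meanY
    cov += dx * dy
    varX += dx * dx
    varY += dy * dy
  }
  const denom = Math.sqrt(varX * varY)
  // Zero variance makes the statistic undefined; returning 0 is one convention.
  return denom > 0 ? cov / denom : 0
}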
How the baseline comparison keeps me honest
The sharpest part of the design is the before-and-after comparison. The calibrator does not just report a weight vector and call it done. It compares the refined result against the grid-search baseline, which makes the improvement legible instead of assumed.
That matters because search algorithms can easily produce movement without progress. A local refinement step can feel productive while actually circling the same hill. By keeping the baseline in view, I can tell whether the search surface rewarded the new weights or merely entertained the optimizer.
The reward calibrator is also explicit about its two phases in the code comments: coarse grid search first, then coordinate descent around the best grid point. That sequence is not an accident. The grid gives me coverage; the refinement gives me precision. If I skipped the grid, I would be polishing a guess. If I skipped the refinement, I would be leaving useful accuracy on the table.
The guardrails are doing real work here. The minimum and maximum weight bounds keep the search from overcommitting to one signal. The normalization option keeps the total scale stable. And the null-skipping composite score means missing evidence does not masquerade as a negative vote.
Where the calibrator fits in the larger scoring stack
The reward calibrator extends the evaluation layer that already handles quality-predictor weights, but it targets the reward weights used by the multi-signal reward mixer. That separation is what makes the architecture clean: the mixer defines how scores are combined, and the calibrator searches for better parameters for that combination.
In practice, that means I can improve the judgment function without rewriting the signals themselves. If the system learns that composition stability should matter a little more than motion continuity on the benchmark corpus, the calibrator can discover that. If the opposite is true, it can discover that too. I do not need to hard-code a philosophical stance into the scorer.
That is the real trick. The system is not merely deciding what threshold to cross. It is learning how to weigh its own senses.
The result is a scoring layer that feels less like a fixed rulebook and more like a tuned instrument: bounded, testable, and willing to admit when one note has been playing too loudly for too long.
