Small in Code. Large in Behavior.

Neuroloq's attachment pipeline hangs on one boundary, and it sits before transcription. Should this upload be treated as document-like at all? Only after that does the second question come up: which OCR mode should read it. The shape of that decision is the product.

What the pipeline optimizes for is cheap, deterministic text recovery that survives search, not maximum accuracy on every messy upload. If the goal were maximum accuracy on hostile inputs, the right move would be a vision-language model handling routing and transcription in one pass. Neuroloq doesn't do that. Predictable transcripts that still work a week later matter more here than the marginal accuracy you buy with unpredictable latency and a per-attachment bill. That tradeoff runs through everything below.

The move is also smaller than it sounds. Picking an OCR mode ahead of transcription is standard in any OCR pipeline. The earlier branch is the interesting one, the branch that decides whether an attachment deserves OCR at all, and that's where both the wins and the failures live.

What broke first

The earliest version trusted the filename, sent every image through a single vision pass, and stored the description back as though it were a transcript. Generic image analysis is good at describing what a frame contains. It is not good at preserving the text surface a learner expects to search later.

Two failures fell out of that. Screenshots of code and terminal output got read as visual scenes rather than text-bearing material. Handwritten material went through the same assumptions as typed text, which produced transcripts that looked plausible enough to ship and were noisy enough to be useless later.

The fix: classify the attachment first, then pick an OCR mode only when the upload behaves like a document.

The split that mattered

The split lives in classifyAttachment and detectOcrMode. The classifier answers whether the attachment deserves OCR. The mode detector answers how to OCR it.

export type AttachmentKind = 'document_like' | 'freeform';
export type OcrMode = 'printed' | 'handwritten';

export interface AttachmentInput {
  filename: string;
  userText: string;
}

const DOCUMENT_KEYWORDS = [
  'screenshot', 'code', 'doc', 'scan', 'pdf', 'note', 'text',
  'terminal', 'log', 'error', 'config', 'output',
  'whiteboard', 'handwritten', 'notebook', 'marker', 'written', 'board',
];

const OCR_MODE_HINTS = {
  handwritten: ['whiteboard', 'board', 'marker', 'handwritten', 'my writing', 'my notes'],
  printed: ['screenshot', 'code', 'pdf', 'scan', 'terminal', 'log', 'error', 'config', 'output'],
};

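// First question: does the filename or caption hint that this upload is text-bearing?
// Plain substring match, so short keywords such as 'log' or 'doc' also fire inside longer words.
export function classifyAttachment(input: AttachmentInput): AttachmentKind {
  const haystack = `${input.filename} ${input.userText}`.toLowerCase();
  const hit = DOCUMENT_KEYWORDS.some((keyword) => haystack.includes(keyword));
  return hit ? 'document_like' : 'freeform';
}

// Second question, asked only for document_like inputs: which OCR model should read it?
// Any handwritten hint wins; printed is the fallback, so OCR_MODE_HINTS.printed is never consulted here.
export function detectOcrMode(input: AttachmentInput): OcrMode {
  const haystack = `${input.filename} ${input.userText}`.toLowerCase();
  const handwrittenHit = OCR_MODE_HINTS.handwritten.some((keyword) => haystack.includes(keyword));
  return handwrittenHit ? 'handwritten' : 'printed';
}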

This is a v1 heuristic. The classifier only sees the filename and the user's caption, so a learner who uploads IMG_2847.png of a terminal pane and types "help with this" falls straight through to freeform vision. The next iteration needs at least one clue taken from the image itself to close that hole: aspect ratio, EXIF, a cheap zero-shot pass on the pixels. Until then, what the classifier actually reads is "did the user or the OS hint that this is text-bearing?", not "is this actually text-bearing?"
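A rough sketch of what that next iteration could look like, assuming some cheap pixel-level check exists and reports a score between 0 and 1. PixelHint, textLikelihood, and the 0.7 threshold are all hypothetical names and values, not part of the current code. The keyword path stays as it is; the pixel hint only gets a vote when the keywords stay silent.

```typescript
type AttachmentKind = 'document_like' | 'freeform';

interface AttachmentInput {
  filename: string;
  userText: string;
}

// Hypothetical: a score in [0, 1] from a cheap check on the pixels themselves.
interface PixelHint {
  textLikelihood: number;
}

const DOCUMENT_KEYWORDS = [
  'screenshot', 'code', 'doc', 'scan', 'pdf', 'note', 'text',
  'terminal', 'log', 'error', 'config', 'output',
  'whiteboard', 'handwritten', 'notebook', 'marker', 'written', 'board',
];

// Assumed cutoff; it would need tuning against real uploads.
const TEXT_LIKELIHOOD_THRESHOLD = 0.7;

// Same substring check as the v1 classifier.
function classifyByKeywords(input: AttachmentInput): AttachmentKind {
  const haystack = `${input.filename} ${input.userText}`.toLowerCase();
  return DOCUMENT_KEYWORDS.some((keyword) => haystack.includes(keyword))
    ? 'document_like'
    : 'freeform';
}

// Keywords decide first; the pixel hint only rescues uploads the keywords missed.
function classifyWithPixelHint(input: AttachmentInput, hint?: PixelHint): AttachmentKind {
  if (classifyByKeywords(input) === 'document_like') return 'document_like';
  if (hint && hint.textLikelihood >= TEXT_LIKELIHOOD_THRESHOLD) return 'document_like';
  return 'freeform';
}

const terminalPhoto = { filename: 'IMG_2847.png', userText: 'help with this' };
const keywordOnly = classifyWithPixelHint(terminalPhoto);
const withHint = classifyWithPixelHint(terminalPhoto, { textLikelihood: 0.9 });
```

With no hint, the terminal photo still lands in freeform, which is today's behavior. With a strong hint it gets routed to OCR even though neither the filename nor the caption said anything useful.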

The two functions aren't really two questions either. Every keyword in OCR_MODE_HINTS.printed already appears in DOCUMENT_KEYWORDS, and detectOcrMode never reads that list. By the time detectOcrMode runs the input is known to be document-like, so the only thing left to ask is whether a handwritten keyword fired. Printed is the residual. Handwritten wins ties on purpose, because misrouting a handwritten note through the printed TrOCR model produces confident nonsense, while the reverse produces messy text you can still recover.

The decision tree is the product. Everything else is code arranged around that tree.
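One way to see the tree in a single place is to fold both checks into one function. This is a sketch, not the shipped code: AttachmentRoute and routeAttachment are names invented here, and the keyword lists are copied from the listing above.

```typescript
type OcrMode = 'printed' | 'handwritten';

// Hypothetical route type: a freeform upload never carries an OCR mode.
type AttachmentRoute =
  | { kind: 'freeform' }
  | { kind: 'document_like'; ocrMode: OcrMode };

interface AttachmentInput {
  filename: string;
  userText: string;
}

const DOCUMENT_KEYWORDS = [
  'screenshot', 'code', 'doc', 'scan', 'pdf', 'note', 'text',
  'terminal', 'log', 'error', 'config', 'output',
  'whiteboard', 'handwritten', 'notebook', 'marker', 'written', 'board',
];

const HANDWRITTEN_HINTS = ['whiteboard', 'board', 'marker', 'handwritten', 'my writing', 'my notes'];

function routeAttachment(input: AttachmentInput): AttachmentRoute {
  const haystack = `${input.filename} ${input.userText}`.toLowerCase();
  const matches = (keywords: string[]) => keywords.some((keyword) => haystack.includes(keyword));

  // Branch 1: does this upload deserve OCR at all?
  if (!matches(DOCUMENT_KEYWORDS)) return { kind: 'freeform' };

  // Branch 2: any handwritten hint wins; printed is what is left over.
  return { kind: 'document_like', ocrMode: matches(HANDWRITTEN_HINTS) ? 'handwritten' : 'printed' };
}

// 'scan' is a printed hint and 'whiteboard' a handwritten one: handwritten wins the tie.
const tie = routeAttachment({ filename: 'scan_001.png', userText: 'whiteboard from class' });

// 'my writing' is a handwritten hint, but nothing in DOCUMENT_KEYWORDS matches it.
const missed = routeAttachment({ filename: 'IMG_0001.png', userText: 'my writing' });
```

The second example shows a gap in the lists as written. A caption of "my writing" alone never reaches the handwritten branch, because the first branch has already sent the upload to freeform.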

The OCR client is thin on purpose

The client invents no policy. It takes an image URL and an OCR mode, fetches the image, runs the matching Hugging Face TrOCR model, and hands back the extracted text with metadata.

import { HfInference } from '@huggingface/inference';

export type OcrMode = 'printed' | 'handwritten';

const OCR_MODELS: Record<OcrMode, string> = {
  printed: 'microsoft/trocr-base-printed',
  handwritten: 'microsoft/trocr-large-handwritten',
};

const OCR_PROVIDER = 'huggingface';

export interface OcrResult {
  extractedText: string;
  ocrProvider: string;
  ocrModel: string;
  ocrMode: OcrMode;
  confidence?: number;
}

export async function runOcr(imageUrl: string, mode: OcrMode = 'printed'): Promise<OcrResult> {
  const apiKey = process.env.HUGGING_FACE_API_KEY;
  if (!apiKey) throw new Error('HUGGING_FACE_API_KEY is not set — cannot run OCR');

  const response = await fetch(imageUrl);
  if (!response.ok) throw new Error(`Failed to fetch image for OCR (HTTP ${response.status})`);

  const imageBlob = await response.blob();
  const hf = new HfInference(apiKey);
  const model = OCR_MODELS[mode];

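  // The image-to-text response is only documented to carry generated_text.
  // score is read defensively and may be undefined, which is why confidence is optional.
  const result = await hf.imageToText({ model, data: imageBlob }) as { generated_text?: string; score?: number };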

  return {
    extractedText: (result.generated_text ?? '').trim(),
    ocrProvider: OCR_PROVIDER,
    ocrModel: model,
    ocrMode: mode,
    confidence: result.score,
  };
}

Provider, model, and mode all ride along together in the result shape, which is what keeps transcripts traceable. When a transcript looks odd, the stored mode tells me right away whether the wrong model ran or the source image was the problem.

The model choice deserves a caveat. microsoft/trocr-base-printed is trained on images of a single line of text, so it struggles with multi-column code, deep indentation, and anything else where spatial structure needs preserving. Tesseract with layout analysis, PaddleOCR, or a vision-language model driven by a structured prompt would probably do better on that class of input. I picked TrOCR on its latency and cost profile rather than a benchmark. That benchmark is on the list. Of everything in this pipeline, the model is the piece with the least evidence behind it.
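Whatever form that benchmark takes, it needs a metric. Character error rate is a common choice for OCR: the edit distance between the reference text and the model output, divided by the reference length. The sketch below is one plain way to compute it. The function names are mine, and nothing here reflects results from an actual run.

```typescript
// Levenshtein distance: insertions, deletions, and substitutions each cost 1.
// Compares UTF-16 code units, which is adequate for ASCII-heavy code and logs.
function editDistance(a: string, b: string): number {
  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + substitution));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Edit distance normalized by reference length. Can exceed 1 when the output is much longer.
function characterErrorRate(reference: string, hypothesis: string): number {
  if (reference.length === 0) return hypothesis.length === 0 ? 0 : 1;
  return editDistance(reference, hypothesis) / reference.length;
}

// Two substituted characters (0 read as O, twice) against a 7-character reference.
const sampleCer = characterErrorRate('x0 = 10', 'xO = 1O');
```

Running the same labeled set through each candidate and comparing the averages would replace the latency-and-cost guess with something measurable.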

Why GPT-5.2 sits after OCR

When OCR succeeds, the raw transcript goes to GPT-5.2 in text-only mode. That pass cleans line breaks, trims noise, and reshapes the transcript into something stable enough to store and search. The image doesn't get passed in again.

This is a tradeoff rather than a virtue. Text-only synthesis is cheaper, more deterministic, and easier to debug than a multimodal cleanup pass carrying the original image. It also makes TrOCR's errors permanent. If printed-mode TrOCR confuses a 0 for an O, or drops the indentation on a Python block, the synthesis pass has no pixels left to recover from. A multimodal cleanup pass would have strictly more information to work with. It just costs more and produces less predictable output. Reproducibility and cost are why this stays text-only. Discipline has nothing to do with it.
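To make the text-only constraint concrete, here is a sketch of how the cleanup request could be assembled. The prompt wording, TextOnlyMessage, and buildCleanupMessages are assumptions for illustration; the real prompt and the client call are not shown in this post.

```typescript
type OcrMode = 'printed' | 'handwritten';

// content is a plain string on purpose: there is no field an image could ride in on.
interface TextOnlyMessage {
  role: 'system' | 'user';
  content: string;
}

// Assumed prompt wording, for illustration only.
const CLEANUP_INSTRUCTIONS =
  'You are cleaning up raw OCR output. Fix line breaks and strip obvious noise. ' +
  'Do not guess at characters you cannot verify from the text alone.';

// Builds the message list for the synthesis pass from the raw transcript and its OCR mode.
function buildCleanupMessages(rawTranscript: string, ocrMode: OcrMode): TextOnlyMessage[] {
  return [
    { role: 'system', content: CLEANUP_INSTRUCTIONS },
    { role: 'user', content: `OCR mode: ${ocrMode}\n\n${rawTranscript.trim()}` },
  ];
}

const cleanupMessages = buildCleanupMessages('def f(x):\nreturn x0\n', 'printed');
```

The lost indentation and the ambiguous x0 in the sample input arrive at the model exactly as TrOCR produced them. Nothing in the request lets the cleanup pass check them against the image.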

Why the route has to stay explicit

The hybrid case is the one the heuristic handles worst. Picture a notebook page: a printed derivation across the top half, a learner's handwritten step-by-step underneath. Handwritten-first sends the whole upload to handwritten TrOCR. That model reads the handwriting cleanly and turns the printed formula into characters that look almost right. x² becomes x2. The radical sign becomes a slash. Line breaks land wherever the model felt strongest. The reverse default would mirror the failure on the bottom half: clean printed transcription, garbled handwriting.

Neither default is correct.

Until the pipeline segments the page before OCR and runs each region through the right model, the system commits to one failure mode per attachment. Handwritten-first picks the mode that fails less catastrophically when it's wrong.
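For contrast, this is roughly what the routing could look like once a segmentation step exists. Everything here is hypothetical: PageRegion stands in for the output of a segmenter the pipeline does not have, and planOcrJobs only plans the OCR calls without making them.

```typescript
type OcrMode = 'printed' | 'handwritten';

// Hypothetical output of a segmentation step the pipeline does not have yet.
interface PageRegion {
  top: number;      // pixel offset of the region's upper edge
  bottom: number;   // pixel offset of the region's lower edge
  script: OcrMode;  // which kind of text the segmenter believes the region holds
}

// One OCR call to make: either a cropped band of the page or the whole page.
interface OcrJob {
  crop: { top: number; bottom: number } | 'full_page';
  mode: OcrMode;
}

function planOcrJobs(regions: PageRegion[], fallbackMode: OcrMode): OcrJob[] {
  // No regions: keep today's behavior, one mode for the whole upload.
  if (regions.length === 0) return [{ crop: 'full_page', mode: fallbackMode }];

  // Sort a copy top to bottom so the joined transcript follows reading order.
  return [...regions]
    .sort((a, b) => a.top - b.top)
    .map((region) => ({ crop: { top: region.top, bottom: region.bottom }, mode: region.script }));
}

// The notebook page from above: printed derivation on top, handwriting underneath.
const notebookPage: PageRegion[] = [
  { top: 600, bottom: 1200, script: 'handwritten' },
  { top: 0, bottom: 600, script: 'printed' },
];
const segmentedJobs = planOcrJobs(notebookPage, 'handwritten');
const unsegmentedJobs = planOcrJobs([], 'handwritten');
```

Each planned job would then go to the OCR client with its own mode, so the printed half and the handwritten half each reach the right model.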

The persistence layer follows the decision

Storage stays downstream of the analysis decision. Extracted text, provider, model, and OCR mode all travel with the message record, which is what keeps the tutor thread searchable and explainable. If GPT-5.2 normalized the OCR output, that version gets indexed. If vision handled the attachment, the descriptive text from that path gets indexed instead. The interesting work happened earlier in the pipeline; persistence is straightforward once the route has decided.
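The indexing rule is small enough to sketch. AnalysisResult and textToIndex are hypothetical names and shapes rather than the actual schema. Falling back to the raw transcript when no normalized version exists is my assumption; the post does not say what happens in that case.

```typescript
type OcrMode = 'printed' | 'handwritten';

// Hypothetical record shape: the route taken decides which fields exist.
type AnalysisResult =
  | {
      route: 'ocr';
      rawTranscript: string;
      normalizedTranscript?: string; // present when the text-only cleanup pass ran
      ocrProvider: string;
      ocrModel: string;
      ocrMode: OcrMode;
    }
  | { route: 'vision'; description: string };

// Picks the text that gets indexed for search.
function textToIndex(result: AnalysisResult): string {
  if (result.route === 'vision') return result.description;
  // Assumption: index the raw transcript if normalization did not run.
  return result.normalizedTranscript ?? result.rawTranscript;
}

const ocrRecord: AnalysisResult = {
  route: 'ocr',
  rawTranscript: 'pri nt("hi")',
  normalizedTranscript: 'print("hi")',
  ocrProvider: 'huggingface',
  ocrModel: 'microsoft/trocr-base-printed',
  ocrMode: 'printed',
};
const visionRecord: AnalysisResult = { route: 'vision', description: 'A photo of a desk with a laptop.' };

const indexedFromOcr = textToIndex(ocrRecord);
const indexedFromVision = textToIndex(visionRecord);
```

Because the route is a discriminated field, a vision record cannot accidentally carry an OCR mode, and an OCR record cannot be stored without one.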

What holds up, and what is still soft

The improvement here was narrow: a tighter boundary around the route that handles document-like attachments, plus an honest accounting of where that boundary is still soft. It was not a broader claim about OCR. The classifier is a substring check, blind to anything that shows up only in the pixels. The synthesis pass trades accuracy for determinism. TrOCR is a constraint here, not a verdict. The hybrid case picks its failure mode instead of solving it.

What holds up through all of that is the decision the route enforces: not whether an image can be analyzed, but whether it can be read correctly enough to survive search, review, and a return visit days later.

Small in code. Large in behavior.