Neuroloq's attachment pipeline turns on one boundary that comes before transcription: whether to treat an upload as document-like at all, and only then which OCR mode should read it. The shape of that decision is the product.
Neuroloq optimizes for cheap, deterministic, searchable text recovery, not for maximum accuracy on every messy upload. If the goal were maximum accuracy on hostile inputs, the right move would be a vision-language model handling routing and transcription in one pass. Neuroloq does not do that, because predictable transcripts that survive search a week later matter more than the marginal accuracy bought with unpredictable latency and per-attachment bills. That tradeoff shapes every decision below.
The interesting move is also smaller than it sounds. Selecting an OCR mode ahead of transcription is standard in any OCR pipeline. The earlier branch — deciding whether an attachment deserves OCR at all — is where the wins and the failures both live.
What broke first
The earliest version trusted the filename, sent every image to a single vision pass, and stored the description back as if it were a transcript. Generic image analysis is good at describing what a frame contains. It is not good at preserving the text surface a learner expects to search later.
Two failures came out of this. Screenshots of code and terminal output were interpreted as visual scenes instead of text-bearing material. Handwritten material was fed through the same assumptions as typed text, producing transcripts that looked plausible enough to ship and noisy enough to be useless later.
The fix was to classify the attachment first, then pick an OCR mode only when the upload behaves like a document.
The split that mattered
The split lives in classifyAttachment and detectOcrMode. The classifier answers whether the attachment deserves OCR. The mode detector answers how to OCR it.
```typescript
export type AttachmentKind = 'document_like' | 'freeform';
export type OcrMode = 'printed' | 'handwritten';

export interface AttachmentInput {
  filename: string;
  userText: string;
}

// Substrings that suggest the upload is text-bearing.
const DOCUMENT_KEYWORDS = [
  'screenshot', 'code', 'doc', 'scan', 'pdf', 'note', 'text',
  'terminal', 'log', 'error', 'config', 'output',
  'whiteboard', 'handwritten', 'notebook', 'marker', 'written', 'board',
];

// Hints for choosing the OCR model. Only the handwritten list is consulted;
// printed is the default when no handwritten hint fires.
const OCR_MODE_HINTS: Record<OcrMode, string[]> = {
  handwritten: ['whiteboard', 'board', 'marker', 'handwritten', 'my writing', 'my notes'],
  printed: ['screenshot', 'code', 'pdf', 'scan', 'terminal', 'log', 'error', 'config', 'output'],
};

// Should this attachment go through OCR at all?
export function classifyAttachment(input: AttachmentInput): AttachmentKind {
  const haystack = `${input.filename} ${input.userText}`.toLowerCase();
  const hit = DOCUMENT_KEYWORDS.some((keyword) => haystack.includes(keyword));
  return hit ? 'document_like' : 'freeform';
}

// Which OCR model should read it? Call only after classifyAttachment
// has returned 'document_like'.
export function detectOcrMode(input: AttachmentInput): OcrMode {
  const haystack = `${input.filename} ${input.userText}`.toLowerCase();
  const handwrittenHit = OCR_MODE_HINTS.handwritten.some((keyword) => haystack.includes(keyword));
  return handwrittenHit ? 'handwritten' : 'printed';
}
```
This is a v1 heuristic. The classifier only sees the filename and the user's caption, so a learner who uploads IMG_2847.png of a terminal pane and types "help with this" falls straight through to freeform vision. The next iteration needs at least one image-side signal (aspect ratio, EXIF, a cheap zero-shot pass on the pixels) to close that hole. Until then, the classifier answers "did the user or the OS hint that this is text-bearing?", not "is this actually text-bearing?"
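To make the gap concrete, here is a condensed copy of the classifier run against three uploads. The keyword list is repeated so the snippet runs on its own. Apart from the IMG_2847.png case, the filenames and captions are made up for illustration.

```typescript
type AttachmentKind = 'document_like' | 'freeform';

interface AttachmentInput {
  filename: string;
  userText: string;
}

// Same keyword list as the classifier, repeated so this snippet is self-contained.
const DOCUMENT_KEYWORDS = [
  'screenshot', 'code', 'doc', 'scan', 'pdf', 'note', 'text',
  'terminal', 'log', 'error', 'config', 'output',
  'whiteboard', 'handwritten', 'notebook', 'marker', 'written', 'board',
];

function classifyAttachment(input: AttachmentInput): AttachmentKind {
  const haystack = `${input.filename} ${input.userText}`.toLowerCase();
  return DOCUMENT_KEYWORDS.some((k) => haystack.includes(k)) ? 'document_like' : 'freeform';
}

// The OS names the file "Screenshot ...", so the hint is present.
const hinted = classifyAttachment({
  filename: 'Screenshot 2024-05-01.png',
  userText: 'why does this fail?',
});

// A terminal pane with a camera-roll name and a vague caption: no hint at all.
const unhinted = classifyAttachment({
  filename: 'IMG_2847.png',
  userText: 'help with this',
});

// Substring matching also errs the other way: "snowboard" contains "board".
const falseHit = classifyAttachment({
  filename: 'snowboard_trip.jpg',
  userText: 'nice view',
});

console.log({ hinted, unhinted, falseHit });
```

The first upload is classified as document_like, the second falls through to freeform, and the third is a false positive. An image-side signal would help with the second case. The third suggests that matching on word boundaries instead of raw substrings might be worth trying as well.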
The two functions are also not really two questions. Every keyword in OCR_MODE_HINTS.printed already appears in DOCUMENT_KEYWORDS. By the time detectOcrMode runs, the input is known to be document-like, so the only question being asked is whether any handwritten keyword fired. Printed is the residual. Handwritten wins ties on purpose: misrouting a handwritten note through printed TrOCR produces confident nonsense, while the reverse produces messy but recoverable text.
The decision tree is the product. The rest of the implementation is just code around that tree.
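One way to keep that tree visible in the types is to collapse both functions into a single router that returns a discriminated union. This is a sketch, not the shipped code: `Route` and `routeAttachment` are hypothetical names, and the keyword lists are copied from above so the snippet stands alone.

```typescript
type OcrMode = 'printed' | 'handwritten';

// Hypothetical result type: either freeform vision, or OCR with a chosen mode.
type Route = { kind: 'vision' } | { kind: 'ocr'; mode: OcrMode };

interface AttachmentInput {
  filename: string;
  userText: string;
}

const DOCUMENT_KEYWORDS = [
  'screenshot', 'code', 'doc', 'scan', 'pdf', 'note', 'text',
  'terminal', 'log', 'error', 'config', 'output',
  'whiteboard', 'handwritten', 'notebook', 'marker', 'written', 'board',
];

const HANDWRITTEN_HINTS = ['whiteboard', 'board', 'marker', 'handwritten', 'my writing', 'my notes'];

function routeAttachment(input: AttachmentInput): Route {
  const haystack = `${input.filename} ${input.userText}`.toLowerCase();

  // First branch: does the upload deserve OCR at all?
  if (!DOCUMENT_KEYWORDS.some((k) => haystack.includes(k))) {
    return { kind: 'vision' };
  }

  // Second branch: handwritten wins whenever any handwritten hint fires.
  const handwritten = HANDWRITTEN_HINTS.some((k) => haystack.includes(k));
  return { kind: 'ocr', mode: handwritten ? 'handwritten' : 'printed' };
}

console.log(routeAttachment({ filename: 'IMG_0412.jpg', userText: 'photo of the whiteboard from class' }));
console.log(routeAttachment({ filename: 'build.png', userText: 'terminal output' }));
console.log(routeAttachment({ filename: 'IMG_0001.jpg', userText: 'my writing' }));
```

The last call is worth a look. A caption of only "my writing" matches a handwritten hint but no document keyword, so with these lists it routes to vision and the hint never gets a chance to fire.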
The OCR client is thin on purpose
The OCR client does not invent policy. It receives an image URL and an OCR mode, fetches the image, runs the matching Hugging Face TrOCR model, and returns the extracted text with metadata.
```typescript
import { HfInference } from '@huggingface/inference';

export type OcrMode = 'printed' | 'handwritten';

// One TrOCR checkpoint per mode.
const OCR_MODELS: Record<OcrMode, string> = {
  printed: 'microsoft/trocr-base-printed',
  handwritten: 'microsoft/trocr-large-handwritten',
};

const OCR_PROVIDER = 'huggingface';

export interface OcrResult {
  extractedText: string;
  ocrProvider: string;
  ocrModel: string;
  ocrMode: OcrMode;
  confidence?: number;
}

export async function runOcr(imageUrl: string, mode: OcrMode = 'printed'): Promise<OcrResult> {
  const apiKey = process.env.HUGGING_FACE_API_KEY;
  if (!apiKey) throw new Error('HUGGING_FACE_API_KEY is not set; cannot run OCR');

  // Fetch the image bytes ourselves so the client only ever sees a Blob.
  const response = await fetch(imageUrl);
  if (!response.ok) throw new Error(`Failed to fetch image for OCR (HTTP ${response.status})`);
  const imageBlob = await response.blob();

  const hf = new HfInference(apiKey);
  const model = OCR_MODELS[mode];

  // The cast widens the response type. The image-to-text response may not
  // include a score, in which case confidence stays undefined.
  const result = await hf.imageToText({ model, data: imageBlob }) as { generated_text?: string; score?: number };

  return {
    extractedText: (result.generated_text ?? '').trim(),
    ocrProvider: OCR_PROVIDER,
    ocrModel: model,
    ocrMode: mode,
    confidence: result.score,
  };
}
```
The result shape carries provider, model, and mode together so transcripts stay traceable. If a transcript looks odd, the stored mode tells me immediately whether the wrong model ran or the source image was the problem.
The model choice deserves a caveat. microsoft/trocr-base-printed is a single-line model: it expects an image of one line of text, so it struggles with multi-column code, deep indentation, and any spatial structure that needs preserving. Tesseract with layout analysis, PaddleOCR, or a vision-language model with a structured prompt would probably do better on that class. I picked TrOCR for its latency and cost profile, not on the strength of a benchmark. That benchmark is on the list, and the model is the part of this pipeline with the least evidence behind it.
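That benchmark does not exist yet, so nothing below is measured data. As a rough sketch of the metric it would most likely report, character error rate (CER) is the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The function names are hypothetical.

```typescript
// Levenshtein distance over UTF-16 code units, using two rolling rows.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution or match
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// CER = edits needed to turn the OCR output into the reference, per reference character.
function characterErrorRate(reference: string, ocrOutput: string): number {
  if (reference.length === 0) return ocrOutput.length === 0 ? 0 : 1;
  return editDistance(reference, ocrOutput) / reference.length;
}

// A single 0-for-O confusion in a five-character line.
console.log(characterErrorRate('x = 0', 'x = O')); // 0.2
```

Plain CER treats a dropped indent the same as any other missing character. A benchmark aimed at code screenshots would probably want to track whitespace errors separately.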
Why GPT-5.2 sits after OCR
When OCR succeeds, Neuroloq passes the raw transcript to GPT-5.2 in text-only mode. The pass cleans line breaks, trims noise, and reshapes the transcript into something stable enough to store and search. The image is not passed in again.
This is a tradeoff, not a virtue. Text-only synthesis is cheaper, more deterministic, and easier to debug than a multimodal cleanup pass with the original image attached — but it also means TrOCR errors are permanent. If printed-mode TrOCR confuses a 0 for an O or drops the indentation on a Python block, the synthesis pass has no pixels left to recover from. A multimodal cleanup is strictly more informative; it just costs more and produces less predictable output. Reproducibility and cost are the reasons to keep this text-only, not discipline.
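A minimal sketch of what a text-only pass can look like. The model call is injected as a plain function so the sketch does not assume any particular SDK. `TextCompleter`, `buildCleanupPrompt`, `cleanTranscript`, and the prompt wording are all illustrative assumptions, not Neuroloq's actual code.

```typescript
// Hypothetical: any function that sends a prompt to a text model and returns its reply.
type TextCompleter = (prompt: string) => Promise<string>;

// Only the transcript goes into the prompt. The image is never attached.
function buildCleanupPrompt(rawTranscript: string): string {
  return [
    'Clean up this OCR transcript for storage and search.',
    'Fix line breaks and remove obvious noise.',
    'Do not add content that is not in the transcript.',
    '',
    'TRANSCRIPT:',
    rawTranscript,
  ].join('\n');
}

async function cleanTranscript(rawTranscript: string, complete: TextCompleter): Promise<string> {
  const trimmed = rawTranscript.trim();
  // Nothing to clean: skip the model call entirely.
  if (trimmed === '') return '';

  const cleaned = (await complete(buildCleanupPrompt(trimmed))).trim();
  // If the model returns nothing, keep the raw OCR text rather than storing an empty transcript.
  return cleaned === '' ? trimmed : cleaned;
}
```

Because the function only ever sees text, a 0 that TrOCR read as an O reaches the model as an O. The prompt can ask for cleanup, but it cannot ask for a second look at the pixels.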
Why the route has to stay explicit
The hybrid case is the one the heuristic handles worst. Picture a notebook page with a printed derivation on the top half and a learner's handwritten step-by-step on the bottom. Handwritten-first routes the whole upload to handwritten TrOCR. That model reads the handwriting cleanly and turns the printed formula into characters that look almost right — x² becomes x2, the radical sign becomes a slash, line breaks land wherever the model felt strongest. The reverse default would mirror the failure on the bottom half: clean printed transcription, garbled handwriting.
Neither default is correct.
Until segmentation happens before OCR — split the page, OCR each region with the right model — the system commits to one failure mode per attachment. Handwritten-first picks the mode that fails less catastrophically when wrong.
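For reference, the segmented version would look roughly like this. The segmentation step itself does not exist yet, so `PageRegion` is a hypothetical shape for its output, and the OCR call is injected so the sketch runs without a network.

```typescript
type OcrMode = 'printed' | 'handwritten';

// Hypothetical output of a layout step: one cropped image per region,
// tagged with the mode that should read it and its vertical position.
interface PageRegion {
  imageUrl: string;
  mode: OcrMode;
  top: number; // distance from the top of the page, used for reading order
}

// Injected OCR function; in the real pipeline this would wrap the OCR client.
type RegionOcr = (imageUrl: string, mode: OcrMode) => Promise<string>;

async function ocrByRegion(regions: PageRegion[], ocr: RegionOcr): Promise<string> {
  // Read regions top to bottom without mutating the caller's array.
  const ordered = [...regions].sort((a, b) => a.top - b.top);

  const parts: string[] = [];
  for (const region of ordered) {
    // Each region gets the model that matches it, not one mode for the whole page.
    parts.push((await ocr(region.imageUrl, region.mode)).trim());
  }

  // Drop empty regions and keep a blank line between the rest.
  return parts.filter((part) => part !== '').join('\n\n');
}
```

For the notebook page above, the printed derivation and the handwritten steps would arrive as two regions, and neither model would be asked to read the other's half.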
The persistence layer follows the decision
Storage stays downstream of the analysis decision. Extracted text, provider, model, and OCR mode travel with the message record so the tutor thread stays searchable and explainable. If GPT-5.2 normalized the OCR output, that version gets indexed. If vision handled the attachment, the descriptive text from that path gets indexed. The interesting work happens earlier in the pipeline; persistence is straightforward once the route has decided.
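A sketch of how that rule might look in types. The record shape and the names `AttachmentAnalysis` and `indexedText` are assumptions for illustration. Falling back to the raw OCR text when no normalized version exists is also an assumption; the rule above only covers the normalized and vision cases.

```typescript
type OcrMode = 'printed' | 'handwritten';

// What travels with the message record when OCR handled the attachment.
interface OcrTrace {
  ocrProvider: string;
  ocrModel: string;
  ocrMode: OcrMode;
  extractedText: string;   // raw OCR output, kept for debugging
  normalizedText?: string; // present only if the cleanup pass ran
}

// Hypothetical union of the two routes an attachment can take.
type AttachmentAnalysis =
  | { route: 'ocr'; trace: OcrTrace }
  | { route: 'vision'; description: string };

// Pick the one text that gets indexed for search.
function indexedText(analysis: AttachmentAnalysis): string {
  if (analysis.route === 'vision') return analysis.description;
  // Assumption: prefer the normalized transcript, fall back to raw OCR text.
  return analysis.trace.normalizedText ?? analysis.trace.extractedText;
}
```

Keeping the raw text next to the normalized one costs a little storage. In exchange, a bad search hit can be traced to either the OCR model or the cleanup pass.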
What stays right, and what is still soft
The improvement was not a broader claim about OCR. It was a tighter boundary around the route that handles document-like attachments — and an honest accounting of where that boundary is still soft. The classifier is a substring check that misses pixel-only signals. The synthesis pass trades accuracy for determinism. TrOCR is a constraint, not a verdict. The hybrid case still picks its failure mode rather than solving it.
What stays right, even with all of that, is the decision the route enforces: not whether an image can be analyzed, but whether it can be read correctly enough to survive search, review, and a return visit days later.
Small in code. Large in behavior.
