Blog

Technical deep-dives into AI engineering, full-stack architecture, and lessons learned.

Per‑Region PBR From One Photo: The Cropping Trick That Stops RGB‑X From Bleeding Materials Across Boundaries
pbr · diffusion · segmentation · threejs · rendering

I watched a wall’s “roughness” spill into a roof the first time I ran RGB‑X on a full-frame photo—and it made the whole result feel fake. The fix wasn’t a new model; it was an engineering decision: run diffusion-based PBR decomposition per architectural region, cropped from SAM3 masks with padding and U‑Net-friendly sizing, plus a procedural fallback so the viewer never goes dark when the GPU pipeline isn’t available.
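The crop-and-snap step can be sketched like this (a rough Python sketch; the helper name, padding default, and the multiple-of-64 snap are illustrative assumptions, not the post's actual code):

```python
import numpy as np

def crop_for_unet(image: np.ndarray, mask: np.ndarray,
                  pad: int = 32, multiple: int = 64) -> tuple[np.ndarray, tuple]:
    """Crop a masked region with padding, snapping the crop to
    U-Net-friendly dimensions (multiples of `multiple`).
    Assumes a non-empty boolean mask."""
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    y0, y1 = max(int(ys.min()) - pad, 0), min(int(ys.max()) + 1 + pad, h)
    x0, x1 = max(int(xs.min()) - pad, 0), min(int(xs.max()) + 1 + pad, w)

    def snap(lo: int, hi: int, limit: int) -> tuple[int, int]:
        # Grow each side up to the next multiple, clamped to the frame.
        size = -(-(hi - lo) // multiple) * multiple  # ceil to multiple
        size = min(size, limit)
        lo = max(min(lo, limit - size), 0)
        return lo, lo + size

    y0, y1 = snap(y0, y1, h)
    x0, x1 = snap(x0, x1, w)
    return image[y0:y1, x0:x1], (y0, y1, x0, x1)
```

The returned box lets you paste the per-region PBR maps back into the full-frame result later.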

Daniel Anthony Romitelli Jr. · March 11, 2026

My Three‑Phase Parallel Orchestrator: Typed Results, Exception‑Proof Phases, and a Rollout That Never Flaps
python · orchestration · voice · latency · reliability

I replaced a ~3,500ms linear voice pipeline with a parallel, three‑phase orchestrator that targets <600ms P95 by treating “agents” like a compilation pipeline: Phase 1 produces typed intermediate results, Phase 2 consumes them as precomputed search inputs, and Phase 3 formats the response. The trick isn’t asyncio.gather—it’s typed contracts plus a fallback cascade so every phase keeps moving even when a component fails.
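A minimal sketch of the three-phase shape, assuming hypothetical phase names and typed dataclass contracts (the real pipeline's types and fallbacks differ):

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Phase1Result:
    intent: str
    entities: dict[str, str] = field(default_factory=dict)

@dataclass
class Phase2Result:
    hits: list[str] = field(default_factory=list)

async def classify(text: str) -> str:
    return "lookup"                       # stand-in for a real classifier

async def extract(text: str) -> dict[str, str]:
    return {"query": text}                # stand-in for entity extraction

async def phase1(text: str) -> Phase1Result:
    try:
        # Independent sub-tasks run concurrently inside the phase.
        intent, entities = await asyncio.gather(classify(text), extract(text))
        return Phase1Result(intent, entities)
    except Exception:
        return Phase1Result("unknown")    # fallback: a phase never raises upward

async def phase2(p1: Phase1Result) -> Phase2Result:
    if p1.intent == "unknown":
        return Phase2Result()             # degraded but still typed
    return Phase2Result([f"match for {p1.entities['query']}"])

async def phase3(p2: Phase2Result) -> str:
    return p2.hits[0] if p2.hits else "Sorry, I couldn't find that."

async def run(text: str) -> str:
    return await phase3(await phase2(await phase1(text)))
```

The point of the typed results is that every phase has a valid (possibly degraded) output to hand downstream, so a failed component never stalls the chain.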

Daniel Anthony Romitelli Jr. · March 11, 2026

Caching LLM Extractions Without Lying: Conformal Gates + a Reasoning Budget Allocator
caching · conformal-prediction · cost-engineering · llm-systems · python

I watched my email extraction pipeline burn money re-deriving the same fields over and over—then I realized the bug wasn’t “missing cache,” it was caching without a validity model. The fix was a two-stage system: a confidence-gated cache that decides reuse vs partial rebuild using conformal prediction, and a reasoning budget allocator that spends compute per span only when quality is genuinely below target.
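The gate can be sketched as a split-conformal threshold plus a per-field reuse decision; the calibration scores and field names here are illustrative, not the post's actual data:

```python
import math

def conformal_threshold(cal_scores: list[float], alpha: float = 0.1) -> float:
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest
    calibration nonconformity score."""
    s = sorted(cal_scores)
    k = min(len(s) - 1, math.ceil((len(s) + 1) * (1 - alpha)) - 1)
    return s[k]

def gate_cached_fields(cached: dict[str, tuple[str, float]],
                       threshold: float) -> tuple[dict[str, str], list[str]]:
    """Split cached fields into (reusable, must_rebuild) by their
    nonconformity score: reuse only what the calibration backs up."""
    reuse, rebuild = {}, []
    for field_name, (value, score) in cached.items():
        if score <= threshold:
            reuse[field_name] = value
        else:
            rebuild.append(field_name)
    return reuse, rebuild
```

Fields in `rebuild` go back through the extractor; everything else is served from cache with a coverage guarantee inherited from the calibration set.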

Daniel Anthony Romitelli Jr. · March 11, 2026

The Closed‑Loop Consistency Trick: Keeping Scene 12 Faithful to Scene 1 Without Global Memory
ml-systems · video · control-systems · typescript · pipelines

I used to assume long‑form visual consistency required hauling a growing “story so far” memory through every generation step. Scenematic works the opposite way: I propagate what must persist, measure what actually persisted, then locally correct what didn’t—plus a periodic re‑anchor that prevents slow drift from compounding across a 20‑scene run.

Daniel Anthony Romitelli Jr. · March 11, 2026

Search That Refuses to Think: The Pattern‑First Query Parser I Use for Fast Intent + Entity Extraction
search · nlp · voice-assistants · engineering · python · latency · information-retrieval

In my voice-first operations product, I stopped treating “search” as retrieval and started treating it as compilation: speech → intent → entities → an executable query plan. This post documents the query parser I built after an LLM-first attempt produced latency spikes and inconsistent structure. I’ll show the intent contract, the rule engine (compiled regex + token maps), the entity extractors (locations, titles, numeric limits), caching strategy, ambiguity detection, and the benchmark harness I used to validate latency at scale.
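A toy version of the pattern-first parser, with a deliberately tiny rule set (the intent patterns, token maps, and limit regex are illustrative, not the production rules):

```python
import re

# Compiled once at import time, so per-query parsing is just matching.
INTENT_PATTERNS = [
    ("count_items", re.compile(r"\bhow many\b", re.I)),
    ("find_person", re.compile(r"\b(find|show me)\b.*\b(candidates?|people)\b", re.I)),
]
LOCATION_TOKENS = {"austin": "Austin, TX", "nyc": "New York, NY"}
LIMIT_RE = re.compile(r"\btop (\d+)\b", re.I)

def parse(query: str) -> dict:
    """Compile speech-ish text into {intent, entities} without an LLM call."""
    intent = next((name for name, pat in INTENT_PATTERNS if pat.search(query)),
                  "unknown")
    entities: dict[str, object] = {}
    lowered = query.lower()
    for token, canonical in LOCATION_TOKENS.items():
        if token in lowered:
            entities["location"] = canonical
    if m := LIMIT_RE.search(query):
        entities["limit"] = int(m.group(1))
    return {"intent": intent, "entities": entities}
```

Anything that falls through to `"unknown"` is where an LLM fallback (or an ambiguity prompt) would take over.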

Daniel Anthony Romitelli Jr. · March 5, 2026

Multi‑Vector Embeddings in Production: Typed Vectors, Cache Keys, and a Generator That Refuses Poison Records
python · embeddings · caching · search · data-pipelines

I built an embedding pipeline for our recruitment platform that represents each record as four typed vectors instead of one pooled blob: profile, experience, skills, and general. The interesting part isn’t the model call—it’s everything around it: cache key design that includes model IDs, idempotent storage of multi-vector blobs, retry logic with backoff, and a DLQ path that keeps bad records from stalling reindex/backfill.
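The cache-key idea, sketched minimally (the key layout and truncated text digest are assumptions, not the actual format):

```python
import hashlib

def embedding_cache_key(record_id: str, vector_type: str,
                        text: str, model_id: str) -> str:
    """Cache key that changes whenever the model or the input text changes,
    so a model upgrade can never serve stale vectors for any of the four
    typed vectors (profile / experience / skills / general)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"emb:{model_id}:{vector_type}:{record_id}:{digest}"
```

Because the model ID is part of the key, swapping embedding models invalidates the cache implicitly instead of requiring a manual flush.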

Daniel Anthony Romitelli Jr. · March 5, 2026

MR‑GRPO in Practice: The Reward Mixer That Stops CLIP From Lying to Your Scene Compiler
ml-systems · ranking · normalization · typescript · prompt-engineering

I replaced CLIP-only candidate ranking with a multi-signal reward mixer that scores each candidate across independent signals, normalizes them group-relatively (GRPO-style), and then composes a final score with null-signal skipping and weight re-normalization. The result is a scoring pipeline that’s harder to game, easier to debug, and robust to missing signals—because it treats “unknown” as a first-class state instead of faking certainty.

Daniel Anthony Romitelli Jr. · March 4, 2026

My Voice Router That Refuses to Think: Pattern‑First Multi‑Agent Orchestration for Sub‑Second Latency
voice · multi-agent · routing · latency · python · architecture · observability

I rebuilt my voice agent’s orchestration around a stubborn rule—don’t burn an LLM call on a problem a regex can solve. The turning point was a real latency incident caused by model-first routing: obvious intents still waited on classification, pushing p95 past the “this feels broken” threshold. The fix was a pattern-first RouterAgent with an LLM fallback only for ambiguity, an orchestrator that coordinates (not improvises), and a voice integration layer that enforces formatting, timeouts, and stable contracts. This post walks through the architecture, the post-mortem, the measurement methodology behind the latency numbers, and a runnable reference implementation of the router/orchestrator/voice processor loop.

Daniel Anthony Romitelli Jr. · March 4, 2026

Multi‑Agent Firecrawl Research: My Fallback Chain That Refuses to Pretend It Knows the Company
python · web-research · firecrawl · bing-search · data-quality

I built a research pipeline that treats company enrichment like an investigation: start with the best source, log every step, and fall back gracefully when the web lies or goes quiet. The core idea is simple—separate “fetching” from “deciding”—then let specialized components (Firecrawl + Bing fallback, plus tracing) do their jobs without smearing uncertainty across the output.

Daniel Anthony Romitelli Jr. · March 3, 2026

Cache-First Geocoding with Azure Maps: Key Topology, TTL Heuristics, and Quota Smoothing
python · fastapi · azure-maps · caching · geocoding · httpx · rate-limiting · observability · langgraph

I built our Azure Maps integration as a cache-first geocoder because the first real failures weren’t “bad results” — they were wasted calls. In our LangGraph enrichment flow we shipped fixes like address-vs-POI routing and a hard short-circuit when Firecrawl already produced city/state. That experience drove a design that treats geocoding like a budgeted, deduplicated service: stable cache keys (geohash + query normalization), TTLs that reflect spatial volatility, and quota smoothing/backoff so bursts don’t turn into 429 storms. I also tie it back to the advisor-enrichment worker patterns (credits_used, max_credits, feature flags) because the platform’s core discipline is the same: cost and reliability are explicit outputs of the pipeline.

Daniel Anthony Romitelli Jr. · March 3, 2026

Adaptive Keyframe Sampling: How I Spend a Frame Budget Like It’s Cash
video-understanding · computer-vision · multimodal · cost-engineering · cloud-run · nextjs · webhooks · hmac

Uniform frame sampling made my screen-workflow analyzer both expensive and blind to short UI bursts. I replaced it with an adaptive sampler that scores cheap visual change, segments the timeline, and allocates a fixed keyframe budget with guardrails. This post covers the failure that triggered the rewrite, the production integration seam (Cloud Run → webhook → Next.js), and complete runnable code for scoring, segmentation, allocation, and frame extraction—plus proper HMAC verification on the webhook receiver.
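The budget-allocation step can be sketched as a proportional split with a one-frame-per-segment guardrail (a largest-remainder variant; the real allocator's guardrails may differ):

```python
def allocate_frames(segment_scores: list[float], budget: int) -> list[int]:
    """Split a fixed keyframe budget across timeline segments in proportion
    to their visual-change scores, guaranteeing each segment one frame."""
    n = len(segment_scores)
    assert budget >= n, "budget must cover at least one frame per segment"
    alloc = [1] * n                      # guardrail: never go blind on a segment
    remaining = budget - n
    total = sum(segment_scores) or 1.0
    shares = [remaining * s / total for s in segment_scores]
    for i, share in enumerate(shares):
        alloc[i] += int(share)           # integer part first
    # Hand out leftover frames by largest fractional remainder.
    by_remainder = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                          reverse=True)
    leftover = remaining - sum(int(share) for share in shares)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc
```

High-change bursts get most of the budget, but a static segment still gets its one representative frame.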

Daniel Anthony Romitelli Jr. · March 2, 2026

Notification Adjudication in My Ops Intelligence Agent: Canonical Events, Cheap Arbitration, and a Sender That Refuses to Spam
ml-systems · observability · python · anomaly-detection · incident-response

I built an Ops Intelligence Agent alongside a recruitment platform Operations Dashboard to turn a noisy real-time event stream into a small number of high-signal Microsoft Teams notifications. The interesting part isn’t “sending a webhook”—it’s adjudication: normalizing heterogeneous telemetry, scoring it fast, collapsing duplicates, and dispatching defensively so the first alert lands quickly without triggering an alert storm.

Daniel Anthony Romitelli Jr. · March 2, 2026

Phase 2 Calibration: Per‑Category OOD Thresholds + Group‑Relative Reward Normalization in My Scene Compiler
ml-systems · calibration · ood-detection · typescript · observability

Phase 2 is where I stopped treating out‑of‑distribution detection as a single global knob and started calibrating it per prompt category, using a baseline sweep to derive category-specific thresholds and logging the exact threshold source at runtime. In the same calibration pass, I fixed reward fusion instability by adding per-head z‑score normalization and GRPO‑style group-relative normalization so multi-signal scoring doesn’t collapse when one signal’s scale drifts.
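Both halves can be sketched briefly: an empirical-quantile threshold per category, and within-group z-scoring (the sweep data and quantile choice are placeholders for the actual calibration):

```python
import statistics

def per_category_thresholds(baseline: dict[str, list[float]],
                            quantile: float = 0.95) -> dict[str, float]:
    """Derive one OOD threshold per prompt category from a baseline sweep,
    here as a simple empirical quantile of the category's scores."""
    out = {}
    for category, scores in baseline.items():
        s = sorted(scores)
        out[category] = s[min(len(s) - 1, int(quantile * len(s)))]
    return out

def group_znorm(scores: list[float]) -> list[float]:
    """GRPO-style group-relative normalization: z-score within the candidate
    group, so one reward head's scale drift can't dominate the fusion."""
    mu = statistics.fmean(scores)
    sd = statistics.pstdev(scores) or 1.0  # guard against zero-variance groups
    return [(x - mu) / sd for x in scores]
```

Logging which threshold source fired at runtime (as the post describes) is what makes a per-category scheme debuggable rather than just more knobs.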

Daniel Anthony Romitelli Jr. · March 1, 2026

My RAG Stack for Code Retrieval: pgvector HNSW + Metadata Filters + Reranking (and the Parts I Refuse to Guess About)
rag · pgvector · embeddings · retrieval · typescript

In my portfolio repo I reference a RAG-powered vector search stack built on OpenAI embeddings stored in pgvector with HNSW indexing, plus a cross-encoder reranking step via BGE and metadata-filtered search. The honest version: the components exist, and some of the glue code around retrieval, deduping, and formatting is visible, but the actual SQL query shapes, BGE batching code, and latency-budget enforcement logic are not. So this post documents exactly what's implemented and exactly what's still unresolved in the evidence.

Daniel Anthony Romitelli Jr. · March 1, 2026

I got SAM3 video tracking wrong: the session wasn’t the problem—my reprojection was
computer-vision · video · inference · tracking · debugging

I built a GPU server that streams SAM3 masks frame-by-frame, and I initially blamed “model instability” for flicker and churn. The real culprit was how I handled session reuse, mixed-resolution frames, and reprojecting outputs back to each frame’s original size. This post drills into the inference session lifecycle, the streaming output shape assumptions (batch size 1), and the debug hooks I added (like /segment/debug-model) to make flicker diagnosable instead of mystical.

Daniel Anthony Romitelli Jr. · February 28, 2026

Turning CRM Audit Noise into a Transition Graph: Normalizing Events, Sessionizing Creation Bursts, and Extracting Time‑Weighted State Edges
data-engineering · process-mining · crm · observability · python

A practical pipeline for reconstructing deal timelines from messy webhook/API audit trails: normalize heterogeneous events, split them into deal-centric sessions, compress them into canonical state paths, then extract a transition graph whose edges carry both counts and observed durations—while handling missing and out-of-order events without pretending the data is perfect.
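The edge-extraction step can be sketched like this, assuming events have already been normalized to a hypothetical {'deal', 'state', 'ts'} shape and sorted by timestamp:

```python
from collections import defaultdict

def transition_graph(events: list[dict]) -> dict[tuple[str, str], dict]:
    """Fold an ordered, per-deal event stream into transition edges that
    carry both a count and the observed durations between states."""
    edges: dict[tuple[str, str], dict] = defaultdict(
        lambda: {"count": 0, "durations": []})
    last: dict[str, tuple[str, float]] = {}   # deal -> (state, ts)
    for e in events:
        deal, state, ts = e["deal"], e["state"], e["ts"]
        if deal in last:
            prev_state, prev_ts = last[deal]
            if prev_state != state:           # drop self-loops from duplicate events
                edge = edges[(prev_state, state)]
                edge["count"] += 1
                edge["durations"].append(ts - prev_ts)
        last[deal] = (state, ts)
    return dict(edges)
```

Keeping raw durations on each edge (rather than just a mean) is what lets you later split fast transitions from stuck ones.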

Daniel Anthony Romitelli Jr. · February 28, 2026

Diversification After Scoring: The Step That Stops My Scene Compiler From Picking Five Paraphrases
ranking · diversity · reranking · selection · prompting · pipelines

My scene compiler was returning “top 5” candidates that were basically the same idea. The fix wasn't in generation; it was in the post-scoring selection layer: add a diversification pass after scoring (and after any routing/gating decisions), then re-rank and select.
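A greedy sketch of the diversification pass, using token-Jaccard as a stand-in for whatever similarity measure the pipeline actually uses:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two candidate texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def diversify(candidates: list[tuple[str, float]], k: int,
              max_sim: float = 0.6) -> list[str]:
    """Walk (text, score) candidates best-first and keep one only if it
    isn't too similar to anything already kept: diversification after
    scoring, not during generation."""
    kept: list[str] = []
    for text, _score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(jaccard(text, other) <= max_sim for other in kept):
            kept.append(text)
        if len(kept) == k:
            break
    return kept
```

Running this after scoring means the highest-scoring representative of each "idea cluster" survives, instead of five paraphrases of the top one.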

Daniel Anthony Romitelli Jr. · February 27, 2026

Defensive Multi‑Agent Scoring: How I Made LLM Reviews Clamp, Stream, and Fail Loudly
LLMs · TypeScript · Multi-agent systems · Reliability engineering · Evaluation

Multi‑agent scoring isn't “math,” it's input validation under adversarial-ish conditions: truncated JSON, missing fields, and silent no‑responses. This post focuses on making failure states explicit (a sentinel review), keeping partial streams observable for debugging, and ensuring aggregation can't be poisoned by malformed outputs.

Daniel Anthony Romitelli Jr. · February 27, 2026