Blog

Technical deep-dives into AI engineering, full-stack architecture, and lessons learned.

Series

1 series

13-Part Series

How to Architect an Enterprise AI System (And Why the Engineer Still Matters)

Every post in this series is a decision I made that no model would have made on its own. Not because the model is bad — because the model doesn't know what it doesn't know.

13 of 13 parts published

Read the series
0. The Day My AI Forgot Everything (So I Built a Context-Continuity Inference Stack)
1. I Stopped Letting Emails Poison My Extractor: The Pre-LLM Gate That Made the Rest of the Pipeline Reliable
2. I Turned Temperature Up to Save My Extractions: The 3‑Node LangGraph That Trades Variance for Truth
10 more parts planned

Posts

32 posts

Why I Kept Search Scope Inside a Single Supabase RPC
Supabase · Postgres · pgvector · RAG · TypeScript

I rewrote search so the vector, the scope filter, and the candidate count travel together through one `search_embeddings` RPC. The database applies the metadata predicate inside SQL, and the same contract drives the pgvector HNSW index on the embedding column.

Daniel Anthony Romitelli Jr. · April 16, 2026

The AgentGroupChat Pattern That Keeps the Mapper from Drifting
semantic-kernel · azure-foundry · promptflow · agent-orchestration · state-management · workflow-automation

I rebuilt the orchestration around a narrow AgentGroupChat loop in the workflow analyzer SaaS: analyzer, mapper, generator, validator, with state rehydration before the run and persistence after it. The result is a system that can reject bad structure locally, retry from the right step, and resume from prior history instead of starting blind.

Daniel Anthony Romitelli Jr. · April 16, 2026

Coverage Before Creativity: The RAG Gate That Keeps My Blog Pipeline Honest
RAG · Supabase · Next.js · TypeScript · blog pipeline · retrieval · codebase indexing

Before the generator writes a paragraph, a three-lane query fan-out, file-path-aware dedupe, a sufficiency threshold, and pinned excerpts decide whether the retrieval actually covered enough of the repository to deserve a draft. The most useful thing this pipeline does is reject topics whose evidence set is too thin.

Daniel Anthony Romitelli Jr. · April 14, 2026

The Reward Calibrator That Learns the Shape of Its Own Judgment
typescript · ml-systems · scoring · optimization · video

I built a reward calibrator that sits above the scoring signals and searches for better weights instead of hand-tuning them by feel. It normalizes each signal, combines them into a composite score, and evaluates candidate weight sets against benchmark data so the system can compare itself to its own baseline.

Daniel Anthony Romitelli Jr. · April 14, 2026

The Startup Gate That Makes a Python App Feel Native
python · startup · configuration · desktop-app · openai

I rewrote the post around the real startup path in `app/yapper.py`: the dependency check, the import order, and the shared settings path in `app/core/config.py`. The result stays on one concrete system behavior instead of speculative packaging details.

Daniel Anthony Romitelli Jr. · March 29, 2026

How I Carve Objects Out of Depth Instead of Texture
computer-vision · depth · geometry · nextjs · gpu

I built a depth segmentation path that treats shape as the signal and texture as noise. The interesting part is the sequence: depth arrives, gets validated, gets carved into regions, and still returns usable output when the scene is nearly dark.

Daniel Anthony Romitelli Jr. · March 29, 2026

The Signal-Processing Boundary That Keeps Coaching Useful in Real Time
realtime-systems · audio-processing · websockets · signalr · python

How I turned a recruiting platform’s voice path into a deterministic streaming pipeline: ACS media frames enter the backend as JSON metadata plus binary PCM, the server waits for `session.updated`, resamples 16 kHz audio to 24 kHz with `resample_poly`, and streams partial transcripts plus coaching back through SignalR without blocking the live call.

Daniel Anthony Romitelli Jr. · March 29, 2026

Fresh Enough to Render: How I Encode Market-Data Trust in the Cache Layer
typescript · caching · financial-dashboard · ttl · architecture

I built the financial dashboard cache so freshness is not a UI guess; it’s a rule baked into the data layer. The interesting part is the TTL split: quotes, intraday history, news, and search each age differently, and the cache decides when a read is still trustworthy enough to show or when it should fall through to upstream data.

Daniel Anthony Romitelli Jr. · March 29, 2026
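
A minimal sketch of the TTL split described in the cache-layer teaser above, written in Python for brevity rather than the post's TypeScript. The TTL values, the `FreshnessCache` class, and its method names are assumptions for illustration; the teaser does not state the real numbers or API.

```python
import time
from typing import Any, Callable, Dict, Tuple

# Hypothetical TTLs in seconds, one per data kind named in the teaser.
TTL_SECONDS: Dict[str, float] = {
    "quote": 15.0,
    "intraday": 60.0,
    "news": 300.0,
    "search": 3600.0,
}

_MISS = object()  # sentinel so a cached None is still a hit


class FreshnessCache:
    """Cache whose freshness rule depends on the kind of data stored."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def put(self, kind: str, key: str, value: Any) -> None:
        self._entries[(kind, key)] = (self._clock(), value)

    def get(self, kind: str, key: str) -> Any:
        """Return the cached value, or _MISS if absent or older than its TTL."""
        entry = self._entries.get((kind, key))
        if entry is None:
            return _MISS
        stored_at, value = entry
        if self._clock() - stored_at > TTL_SECONDS[kind]:
            return _MISS
        return value

    def read(self, kind: str, key: str, fetch: Callable[[], Any]) -> Any:
        """Serve a fresh cached value, else fall through to upstream."""
        cached = self.get(kind, key)
        if cached is not _MISS:
            return cached
        value = fetch()
        self.put(kind, key, value)
        return value
```

Injecting the clock keeps the freshness rule testable without sleeping: a test can advance a fake clock past a kind's TTL and check that the next read falls through to `fetch`.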

Text in a Frame Is Contamination, Not Decoration
typescript · nextjs · ocr · video-processing · scene-compiler · scoring

A deep dive into `lib/scene-compiler/text-detector.ts`, showing how I normalize fal.ai and Florence-2 OCR output, classify subtitle and watermark regions, and convert text contamination into a normalized router score that the scene pipeline can actually act on.

Daniel Anthony Romitelli Jr. · March 29, 2026

I Gave My Video Generator Scratch Paper — How Think Frames Saved My GPU Budget
ai · video · generation · scoring · typescript

I built a pre-generation pass that lets my video pipeline explore cheap sketches before it commits to full-quality keyframes. The trick is to score a small cohort of “think frames” with the same multi-signal mixer I use elsewhere, then pick the best path by group-relative rank instead of a brittle absolute cutoff.

Daniel Anthony Romitelli Jr. · March 26, 2026

The Boundary That Makes iOS Capture Safe on the Web
nextjs · computer-vision · calibration · persistence · geometry

I built the handoff so the web app never has to guess what a segment, bounding box, or measurement means. The trick is to normalize geometry before it crosses the boundary, then persist only the version the viewer can trust.

Daniel Anthony Romitelli Jr. · March 24, 2026

How I Built a Patient Check-In Kiosk for a Specialty Medical Practice
react-native · expo · supabase · healthcare · queue-management

I built a multilingual patient check-in kiosk for a specialty medical practice after watching the front desk break down under real-world pressure. The interesting part isn’t the iPad UI — it’s the queue engine, real-time synchronization, and fallback-heavy notification flow that keep the room moving when patients, staff, and Wi‑Fi are all imperfect at the same time.

Daniel Anthony Romitelli Jr. · March 23, 2026

Firecrawl Part 2: The Confidence Gate That Decides When Bing Gets a Vote
python · data-quality · pipelines · feature-flags · recruiting-tech

In Part 1, I showed the shape of my company research chain: Firecrawl first, Bing second, and a tracer so the whole thing stays legible when the web lies. Part 2 is about the decision boundary—the exact “needs improvement” gate that determines when I augment Firecrawl with Bing, and why I made that gate intentionally blunt.

Daniel Anthony Romitelli Jr. · March 13, 2026

Tracing an Extraction Pipeline Like a Ledger: Trace Nodes, DLQ Boundaries, and Replayable Failures
observability · tracing · python · reliability · workflows

I built a step-by-step extraction tracer so I can answer one question under pressure: “what exactly happened inside this run?” In this post I document the tracer’s node schema, where the hooks live in the extraction pipeline, how retries/DLQ boundaries show up in the trace, and how I replay a trace locally to reproduce and fix failures—without guessing.

Daniel Anthony Romitelli Jr. · March 12, 2026

Per‑Region PBR From One Photo: The Cropping Trick That Stops RGB‑X From Bleeding Materials Across Boundaries
pbr · diffusion · segmentation · threejs · rendering

I watched a wall’s “roughness” spill into a roof the first time I ran RGB‑X on a full-frame photo—and it made the whole result feel fake. The fix wasn’t a new model; it was an engineering decision: run diffusion-based PBR decomposition per architectural region, cropped from SAM3 masks with padding and U‑Net-friendly sizing, plus a procedural fallback so the viewer never goes dark when the GPU pipeline isn’t available.

Daniel Anthony Romitelli Jr. · March 11, 2026

My Three‑Phase Parallel Orchestrator: Typed Results, Exception‑Proof Phases, and a Rollout That Never Flaps
python · orchestration · voice · latency · reliability

I replaced a ~3,500ms linear voice pipeline with a parallel, three‑phase orchestrator that targets <600ms P95 by treating “agents” like a compilation pipeline: Phase 1 produces typed intermediate results, Phase 2 consumes them as precomputed search inputs, and Phase 3 formats the response. The trick isn’t asyncio.gather—it’s typed contracts plus a fallback cascade so every phase keeps moving even when a component fails.

Daniel Anthony Romitelli Jr. · March 11, 2026

Caching LLM Extractions Without Lying: Conformal Gates + a Reasoning Budget Allocator
caching · conformal-prediction · cost-engineering · llm-systems · python

I watched my email extraction pipeline burn money re-deriving the same fields over and over—then I realized the bug wasn’t “missing cache,” it was caching without a validity model. The fix was a two-stage system: a confidence-gated cache that decides reuse vs partial rebuild using conformal prediction, and a reasoning budget allocator that spends compute per span only when quality is genuinely below target.

Daniel Anthony Romitelli Jr. · March 11, 2026

The Closed‑Loop Consistency Trick: Keeping Scene 12 Faithful to Scene 1 Without Global Memory
ml-systems · video · control-systems · typescript · pipelines

I used to assume long‑form visual consistency required hauling a growing “story so far” memory through every generation step. Scenematic works the opposite way: I propagate what must persist, measure what actually persisted, then locally correct what didn’t—plus a periodic re‑anchor that prevents slow drift from compounding across a 20‑scene run.

Daniel Anthony Romitelli Jr. · March 11, 2026

Search That Refuses to Think: The Pattern‑First Query Parser I Use for Fast Intent + Entity Extraction
search · nlp · voice-assistants · engineering · python · latency · information-retrieval

In my voice-first operations product, I stopped treating “search” as retrieval and started treating it as compilation: speech → intent → entities → an executable query plan. This post documents the query parser I built after an LLM-first attempt produced latency spikes and inconsistent structure. I’ll show the intent contract, the rule engine (compiled regex + token maps), the entity extractors (locations, titles, numeric limits), caching strategy, ambiguity detection, and the benchmark harness I used to validate latency at scale.

Daniel Anthony Romitelli Jr. · March 5, 2026

Multi‑Vector Embeddings in Production: Typed Vectors, Cache Keys, and a Generator That Refuses Poison Records
python · embeddings · caching · search · data-pipelines

I built an embedding pipeline for our recruitment platform that represents each record as four typed vectors instead of one pooled blob: profile, experience, skills, and general. The interesting part isn’t the model call—it’s everything around it: cache key design that includes model IDs, idempotent storage of multi-vector blobs, retry logic with backoff, and a DLQ path that keeps bad records from stalling reindex/backfill.

Daniel Anthony Romitelli Jr. · March 5, 2026

MR‑GRPO in Practice: The Reward Mixer That Stops CLIP From Lying to Your Scene Compiler
ml-systems · ranking · normalization · typescript · prompt-engineering

I replaced CLIP-only candidate ranking with a multi-signal reward mixer that scores each candidate across independent signals, normalizes them group-relatively (GRPO-style), and then composes a final score with null-signal skipping and weight re-normalization. The result is a scoring pipeline that’s harder to game, easier to debug, and robust to missing signals—because it treats “unknown” as a first-class state instead of faking certainty.

Daniel Anthony Romitelli Jr. · March 4, 2026
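
The reward-mixer teaser above names three steps: group-relative normalization, null-signal skipping, and weight re-normalization. One way they could fit together is sketched below in Python. The function names, the signal names, and the choice of a population z-score as the group-relative transform are assumptions, not details confirmed by the post.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional

Signals = Dict[str, Optional[float]]  # None means "signal unavailable"


def group_relative(values: List[Optional[float]]) -> List[Optional[float]]:
    """Z-score each known value against its group; unknowns stay None."""
    known = [v for v in values if v is not None]
    if not known:
        return [None] * len(values)
    mu = mean(known)
    sigma = pstdev(known)
    if sigma == 0:
        # No spread in the group, so no candidate gets an advantage.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sigma for v in values]


def mix(candidates: List[Signals], weights: Dict[str, float]) -> List[float]:
    """Weighted mean of normalized signals, skipping nulls per candidate."""
    normalized = {
        name: group_relative([c.get(name) for c in candidates])
        for name in weights
    }
    scores: List[float] = []
    for i in range(len(candidates)):
        present = {
            name: normalized[name][i]
            for name in weights
            if normalized[name][i] is not None
        }
        total_weight = sum(weights[name] for name in present)
        if total_weight == 0:
            scores.append(0.0)  # nothing known about this candidate
            continue
        # Dividing by the surviving weight re-normalizes the weights.
        weighted = sum(weights[name] * z for name, z in present.items())
        scores.append(weighted / total_weight)
    return scores
```

Because a missing signal is dropped rather than imputed, a candidate with one unknown signal is scored only on what was actually measured, which matches the teaser's point about treating "unknown" as a state of its own.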

My Voice Router That Refuses to Think: Pattern‑First Multi‑Agent Orchestration for Sub‑Second Latency
voice · multi-agent · routing · latency · python · architecture · observability

I rebuilt my voice agent’s orchestration around a stubborn rule—don’t burn an LLM call on a problem a regex can solve. The turning point was a real latency incident caused by model-first routing: obvious intents still waited on classification, pushing p95 past the “this feels broken” threshold. The fix was a pattern-first RouterAgent with an LLM fallback only for ambiguity, an orchestrator that coordinates (not improvises), and a voice integration layer that enforces formatting, timeouts, and stable contracts. This post walks through the architecture, the post-mortem, the measurement methodology behind the latency numbers, and a runnable reference implementation of the router/orchestrator/voice processor loop.

Daniel Anthony Romitelli Jr. · March 4, 2026
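
The pattern-first rule from the voice-router teaser above can be sketched in a few lines: compiled regexes decide obvious intents, and the model is consulted only when zero or several patterns match. The intents, the patterns, and the `llm_fallback` callable are hypothetical stand-ins, not the post's actual `RouterAgent`.

```python
import re
from typing import Callable, List, Pattern, Tuple

# Hypothetical intents and patterns, compiled once at import time.
RULES: List[Tuple[str, Pattern[str]]] = [
    ("search_candidates",
     re.compile(r"\b(find|search|show)\b.*\bcandidates?\b", re.I)),
    ("schedule_interview",
     re.compile(r"\b(schedule|book)\b.*\binterview\b", re.I)),
]


def route(utterance: str,
          llm_fallback: Callable[[str], str]) -> Tuple[str, str]:
    """Return (intent, source), where source is 'pattern' or 'llm'."""
    matches = [intent for intent, pattern in RULES
               if pattern.search(utterance)]
    if len(matches) == 1:
        # Exactly one rule fired: no model call needed.
        return matches[0], "pattern"
    # Zero matches or an ambiguous overlap: defer to the fallback.
    return llm_fallback(utterance), "llm"
```

Returning the source alongside the intent makes it easy to log how often the fallback fires, which is the number that tells you whether the rule set is still covering the obvious traffic.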

Multi‑Agent Firecrawl Research: My Fallback Chain That Refuses to Pretend It Knows the Company
python · web-research · firecrawl · bing-search · data-quality

I built a research pipeline that treats company enrichment like an investigation: start with the best source, log every step, and fall back gracefully when the web lies or goes quiet. The core idea is simple—separate “fetching” from “deciding”—then let specialized components (Firecrawl + Bing fallback, plus tracing) do their jobs without smearing uncertainty across the output.

Daniel Anthony Romitelli Jr. · March 3, 2026

Cache-First Geocoding with Azure Maps: Key Topology, TTL Heuristics, and Quota Smoothing
python · fastapi · azure-maps · caching · geocoding · httpx · rate-limiting · observability · langgraph

I built our Azure Maps integration as a cache-first geocoder because the first real failures weren’t “bad results” — they were wasted calls. In our LangGraph enrichment flow we shipped fixes like address-vs-POI routing and a hard short-circuit when Firecrawl already produced city/state. That experience drove a design that treats geocoding like a budgeted, deduplicated service: stable cache keys (geohash + query normalization), TTLs that reflect spatial volatility, and quota smoothing/backoff so bursts don’t turn into 429 storms. I also tie it back to the advisor-enrichment worker patterns (credits_used, max_credits, feature flags) because the platform’s core discipline is the same: cost and reliability are explicit outputs of the pipeline.

Daniel Anthony Romitelli Jr. · March 3, 2026

Adaptive Keyframe Sampling: How I Spend a Frame Budget Like It’s Cash
video-understanding · computer-vision · multimodal · cost-engineering · cloud-run · nextjs · webhooks · hmac

Uniform frame sampling made my screen-workflow analyzer both expensive and blind to short UI bursts. I replaced it with an adaptive sampler that scores cheap visual change, segments the timeline, and allocates a fixed keyframe budget with guardrails. This post covers the failure that triggered the rewrite, the production integration seam (Cloud Run → webhook → Next.js), and complete runnable code for scoring, segmentation, allocation, and frame extraction—plus proper HMAC verification on the webhook receiver.

Daniel Anthony Romitelli Jr. · March 2, 2026
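
The keyframe-sampling teaser above mentions HMAC verification on the webhook receiver. The receiver in the post is a Next.js route; the sketch below shows the same check in Python using only the standard library. The SHA-256 digest and the hex-encoded signature are assumptions about the wire format, and `sign` and `verify` are illustrative names.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body)
    # Compare as bytes so a non-ASCII header value cannot raise TypeError.
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature_hex.encode("utf-8"))
```

Two details matter in practice: sign the raw bytes exactly as received, before any JSON parsing, and compare with `hmac.compare_digest` rather than `==` so the check does not leak timing information.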

Notification Adjudication in My Ops Intelligence Agent: Canonical Events, Cheap Arbitration, and a Sender That Refuses to Spam
ml-systems · observability · python · anomaly-detection · incident-response

I built an Ops Intelligence Agent alongside a recruitment platform Operations Dashboard to turn a noisy real-time event stream into a small number of high-signal Microsoft Teams notifications. The interesting part isn’t “sending a webhook”—it’s adjudication: normalizing heterogeneous telemetry, scoring it fast, collapsing duplicates, and dispatching defensively so the first alert lands quickly without triggering an alert storm.

Daniel Anthony Romitelli Jr. · March 2, 2026

Phase 2 Calibration: Per‑Category OOD Thresholds + Group‑Relative Reward Normalization in My Scene Compiler
ml-systems · calibration · ood-detection · typescript · observability

Phase 2 is where I stopped treating out‑of‑distribution detection as a single global knob and started calibrating it per prompt category, using a baseline sweep to derive category-specific thresholds and logging the exact threshold source at runtime. In the same calibration pass, I fixed reward fusion instability by adding per-head z‑score normalization and GRPO‑style group-relative normalization so multi-signal scoring doesn’t collapse when one signal’s scale drifts.

Daniel Anthony Romitelli Jr. · March 1, 2026

My RAG Stack for Code Retrieval: pgvector HNSW + Metadata Filters + Reranking (and the Parts I Refuse to Guess About)
rag · pgvector · embeddings · retrieval · typescript

In my portfolio repo I reference a RAG-powered vector search stack built on OpenAI embeddings stored in pgvector with HNSW indexing, plus a cross-encoder reranking step via BGE and metadata-filtered search. The honest version: the retrieved repo context proves the components exist and shows some of the glue code around retrieval, deduping, and formatting, but it does not include the actual SQL query shapes, BGE batching code, or latency-budget enforcement logic, so I’m going to document exactly what’s implemented and exactly what’s still unresolved in the evidence.

Daniel Anthony Romitelli Jr. · March 1, 2026

I got SAM3 video tracking wrong: the session wasn’t the problem—my reprojection was
computer-vision · video · inference · tracking · debugging

I built a GPU server that streams SAM3 masks frame-by-frame, and I initially blamed “model instability” for flicker and churn. The real culprit was how I handled session reuse, mixed-resolution frames, and reprojecting outputs back to each frame’s original size. This post drills into the inference session lifecycle, the streaming output shape assumptions (batch size 1), and the debug hooks I added (like /segment/debug-model) to make flicker diagnosable instead of mystical.

Daniel Anthony Romitelli Jr. · February 28, 2026

Turning CRM Audit Noise into a Transition Graph: Normalizing Events, Sessionizing Creation Bursts, and Extracting Time‑Weighted State Edges
data-engineering · process-mining · crm · observability · python

A practical pipeline for reconstructing deal timelines from messy webhook/API audit trails: normalize heterogeneous events, split them into deal-centric sessions, compress them into canonical state paths, then extract a transition graph whose edges carry both counts and observed durations—while handling missing and out-of-order events without pretending the data is perfect.

Daniel Anthony Romitelli Jr. · February 28, 2026
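
The last two steps of the CRM pipeline teaser above, compressing each deal into a canonical state path and extracting edges that carry counts and durations, might look roughly like this. The `(deal_id, state, timestamp)` event shape and the function name are assumptions; normalization and sessionizing are taken as already done.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str, float]  # (deal_id, state, unix_timestamp)
Edge = Tuple[str, str]          # (from_state, to_state)


def transition_graph(events: Iterable[Event]) -> Dict[Edge, dict]:
    """Build edges with a traversal count and the observed durations."""
    by_deal: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for deal_id, state, ts in events:
        by_deal[deal_id].append((ts, state))

    edges: Dict[Edge, dict] = defaultdict(
        lambda: {"count": 0, "durations": []})
    for timeline in by_deal.values():
        timeline.sort()  # repair out-of-order arrival
        # Collapse consecutive repeats into a canonical state path.
        path: List[Tuple[float, str]] = []
        for ts, state in timeline:
            if not path or path[-1][1] != state:
                path.append((ts, state))
        # Each adjacent pair in the path is one observed transition.
        for (t0, a), (t1, b) in zip(path, path[1:]):
            edge = edges[(a, b)]
            edge["count"] += 1
            edge["durations"].append(t1 - t0)
    return dict(edges)
```

Keeping the raw durations on each edge, instead of a running average, leaves the choice of summary statistic (median, percentile) to whoever reads the graph.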

Diversification After Scoring: The Step That Stops My Scene Compiler From Picking Five Paraphrases
ranking · diversity · reranking · selection · prompting · pipelines

My scene compiler was returning “top 5” candidates that were basically the same idea. The fix wasn’t in generation—it was in the post-scoring selection layer: add a diversification pass after scoring (and after any routing/gating decisions), then re-rank and select. This rewrite removes repo-identifying details, deletes truncated snippets, and clearly marks what I can’t verify from the retrieved source context.

Daniel Anthony Romitelli Jr. · February 27, 2026

Defensive Multi‑Agent Scoring: How I Made LLM Reviews Clamp, Stream, and Fail Loudly
LLMs · TypeScript · Multi-agent systems · Reliability engineering · Evaluation

Multi‑agent scoring isn’t “math,” it’s input validation under adversarial-ish conditions: truncated JSON, missing fields, and silent no‑responses. This rewrite focuses on making failure states explicit (a sentinel review), keeping partial streams observable for debugging, and ensuring aggregation can’t be poisoned by malformed outputs—without leaking repo structure, domains, or config keys.

Daniel Anthony Romitelli Jr. · February 27, 2026

Audiobook

New

Enterprise AI Architecture

How to Architect an Enterprise AI System (And Why the Engineer Still Matters)

2h 16m · 13 chapters

The full series narrated as a 2h 16m audiobook. Listen on your preferred platform.