13-Part Series

How to Architect an Enterprise AI System (And Why the Engineer Still Matters)

Every post in this series is a decision I made that no model would have made on its own. Not because the model is bad, but because the model doesn't know what it doesn't know.

Built from a production email intake pipeline processing thousands of inbound emails daily. LangGraph orchestration, GPT extraction at temperature=1, Docker container apps, CRM integration. 13 engineering decisions, each one a place where the AI raised the floor and the engineer decided what to build on top of it.

13 of 13 parts published

The Floor and the Ceiling

What the AI handles vs. what the engineer decides — the thesis of the entire series in one table.

The Floor (AI handles this)	The Ceiling (Engineer decides this)
Extract structured fields from email text	Run extraction at temperature=1, then validate downstream
Generate embeddings for semantic search	Build 4 parallel embedding vectors per person record, not 1
Parse intent from a voice query	Route through a <100ms classifier before the LLM touches it
Summarize text and suggest next steps	Build a context-continuity stack so sessions never lose state
Match person records to job descriptions	Weight recency, compensation, and relocation independently
Cache responses to reduce latency	Use $16/mo Redis with a circuit breaker that makes it optional

The Series

The Day My AI Forgot Everything (So I Built a Context-Continuity Inference Stack)

A production assistant that can’t resume state can’t own a workflow. After watching real teams lose hours to “AI amnesia” across resets, I built a multi-layer context-continuity stack: a Postgres-backed Context API with stable keys and typed artifacts, hybrid search, progress snapshots, and a mandatory session boot protocol. The failure mode, the architecture that answered it, and a runnable FastAPI slice you can adapt without leaking secrets.

I Stopped Letting Emails Poison My Extractor: The Pre-LLM Gate That Made the Rest of the Pipeline Reliable

Part 1 of my series “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)”: real email is adversarial input, so my intake path sanitizes it, isolates forwarded threads, and detects scheduler reschedules before any model call. That ordering eliminated the most expensive class of production mistake: confidently writing the wrong person into the system of record.

I Turned Temperature Up to Save My Extractions: The 3‑Node LangGraph That Trades Variance for Truth

My production extractor runs at temperature=1, on purpose. Part 2 of “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)” covers the sequential extract→research→validate LangGraph that contains that variance, plus the concrete sanitization and job-title rejection logic that keeps silent data loss and string-null pollution out of downstream systems.
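A minimal sketch of what cleaning up "string-null pollution" could look like, assuming the extractor sometimes returns placeholder strings such as "null" or "N/A" where a value is missing. The function names and the placeholder set are hypothetical and not taken from the series.

```python
from typing import Any, Dict, Optional

# Hypothetical set of placeholder strings a model may emit instead of a real null.
NULL_LIKE = {"", "null", "none", "n/a", "na", "unknown", "undefined"}


def clean_field(value: Any) -> Optional[Any]:
    """Return None for null-like strings, otherwise the (stripped) value."""
    if value is None:
        return None
    if isinstance(value, str):
        stripped = value.strip()
        if stripped.lower() in NULL_LIKE:
            return None
        return stripped
    return value


def clean_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Apply clean_field to every field of an extracted record."""
    return {key: clean_field(val) for key, val in record.items()}
```

Running this before validation means downstream systems see a real null rather than the literal text "null".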

The Six‑Tier Enrichment Cascade: How I Stop “Helpful” Data From Overwriting True Data

One enrichment API sounds clean until production hands you 30–40% missing fields. Worse, relying on the one source teaches your system to accept gaps as normal. Part 3 of my series walks through the six-tier enrichment chain I built (Firecrawl, Bing, domain-variation retry, Serper, Azure Maps POI, phone area code approximation) and the per-field provenance rules that stop low-confidence guesses from overwriting high-quality facts.
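The per-field provenance idea can be sketched as follows, assuming each enrichment tier tags its output with a source name and a confidence score. The `FieldValue` type, the numeric confidence scale, and the "strictly higher confidence wins" rule are illustrative assumptions, not the series' actual rules.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FieldValue:
    value: str
    source: str        # which tier produced the value (hypothetical labels)
    confidence: float  # 0.0 to 1.0, an assumed per-source trust score


def merge_field(current: Optional[FieldValue], candidate: FieldValue) -> FieldValue:
    """Keep the existing value unless the candidate is strictly more trusted."""
    if current is None or candidate.confidence > current.confidence:
        return candidate
    return current


def apply_tier(
    record: Dict[str, FieldValue], tier_output: Dict[str, FieldValue]
) -> Dict[str, FieldValue]:
    """Merge one tier's output into the record, field by field."""
    merged = dict(record)
    for name, candidate in tier_output.items():
        merged[name] = merge_field(merged.get(name), candidate)
    return merged
```

With this rule, a later low-confidence tier can fill a gap but cannot replace a value that came from a more trusted source.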

User Corrections Always Win: The Streaming Email Client Add‑in UI That Turns Human Edits Into Training Signal (Series Part 4)

The email client add-in never lets the AI write straight into the CRM. It streams a draft into a form a human reviews field by field, with a confidence badge on each one, and then treats the user's edits as the authoritative version. The part that matters most: the override payload keeps both the AI's original extraction and the human's final edits, so every correction becomes durable training data instead of a one-off fix.
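A minimal sketch of an override payload that keeps both versions, assuming flat string fields. The class name, attribute names, and the shape of the correction record are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OverridePayload:
    ai_extraction: Dict[str, Optional[str]]  # what the model proposed
    user_final: Dict[str, Optional[str]]     # what the human approved

    def corrections(self) -> Dict[str, Dict[str, Optional[str]]]:
        """Fields where the human's final value differs from the AI draft."""
        keys = set(self.ai_extraction) | set(self.user_final)
        return {
            key: {"ai": self.ai_extraction.get(key), "user": self.user_final.get(key)}
            for key in sorted(keys)
            if self.ai_extraction.get(key) != self.user_final.get(key)
        }

    def record_to_write(self) -> Dict[str, Optional[str]]:
        """The user's version is authoritative for the system of record."""
        return dict(self.user_final)
```

Storing the pair, rather than only the final value, is what lets each edit be replayed later as a labeled example.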

I Hardcoded the Kill Switch: Feature Flags as AI Guardrails (Series Part 5)

Part 5 of "How to Architect an Enterprise AI System (And Why the Engineer Still Matters)" is about treating feature flags as a safety control plane for AI behavior. Some belong in runtime config, where ops can ramp an experiment without a deploy. Others belong in code, so a behavior that hurt users can't come back without an intentional deploy. I walk through the patterns I use: env-driven booleans and ints, stable percentage rollouts, a code-level kill switch that environment variables cannot override, and the routing that keeps chat working by falling back from orchestration to a legacy flow.
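The patterns listed above can be sketched in a few functions. The flag names (`ORCHESTRATION_ENABLED`, `ORCHESTRATION_PERCENT`), the `ORCHESTRATION_KILLED` constant, and the hashing scheme are assumptions for illustration, not the series' actual code.

```python
import hashlib
import os

# Code-level kill switch (hypothetical). It is a constant, so no environment
# variable can turn the behavior back on; that takes an intentional deploy.
ORCHESTRATION_KILLED = True


def env_bool(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name: str, default: int) -> int:
    """Read an integer flag, falling back to the default on bad input."""
    try:
        return int(os.environ.get(name, default))
    except ValueError:
        return default


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Stable percentage rollout: the same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def use_orchestration(user_id: str) -> bool:
    """False means the caller should fall back to the legacy flow."""
    if ORCHESTRATION_KILLED:
        return False
    if not env_bool("ORCHESTRATION_ENABLED"):
        return False
    return in_rollout(user_id, "orchestration", env_int("ORCHESTRATION_PERCENT", 0))
```

Hashing the flag name together with the user ID keeps rollouts for different flags independent of each other.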

The Queue Was a Table: How I Built Claim/Unclaim Workers with SKIP LOCKED, Stale Recovery, and Retry Caps

A worker fleet that retries safely is mostly about being explicit: atomic claiming with FOR UPDATE SKIP LOCKED, leases that can go stale and be recovered, bounded retries that turn poison pills into triage, and metrics at the claim boundary so you can see failure modes before users do.

The CRM Sync Engine I Had to Reverse‑Engineer: Two‑Step Fetches, 50‑Field Limits, and a Mapper That Refuses to Drift

The CRM's API behaved differently in production than its documentation implied: custom views ignored the fields parameter, requests hit an undocumented field-count limit, and a record cap made naive pagination a trap. So the sync engine generates its own field schema from the API, fetches IDs before records, pages across date windows, and fingerprints the normalized record with SHA‑256 so only changed rows get processed. The CRM's inconsistent types and messy freetext locations get normalized into something my downstream systems can trust.
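A minimal sketch of the fingerprinting step, assuming records are flat dictionaries keyed by an `id` field. The normalization rules here (trim strings, drop empty values) are a stand-in; the real mapper presumably does more.

```python
import hashlib
import json
from typing import Any, Dict, List


def normalize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical normalization: trim strings and drop empty values."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value is None or value == "":
            continue
        out[key] = value
    return out


def fingerprint(record: Dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(
        normalize(record), sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_records(
    records: List[Dict[str, Any]], known: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Return only records whose fingerprint differs from the stored one."""
    return [r for r in records if known.get(r["id"]) != fingerprint(r)]
```

Fingerprinting the normalized form, not the raw payload, means cosmetic differences such as trailing whitespace do not trigger reprocessing.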

The Search Router That Saved Me From One Index to Rule Them All: Azure AI Search for CRM, pgvector for Transcripts, the CRM as the Live Escape Hatch (Series Part 8)

Part 8 of "How to Architect an Enterprise AI System (And Why the Engineer Still Matters)" is about refusing to build one index for everything. One query wants faceting and synonym maps, the next wants vector similarity over Zoom transcripts that already sit in Postgres, and a third needs the CRM because the index has not caught up yet. I walk through UnifiedSearchService: keyword-based dispatch, multi-vector scoring across four embedding fields fused with RRF, pgvector for transcripts, and the CRM API as the real-time escape hatch, all behind one result shape, so the caller doesn't have to care which backend answered.
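Reciprocal Rank Fusion (RRF) itself is small enough to sketch. This version assumes each embedding field yields its own ranked list of document IDs; the function name and the tie-breaking rule are illustrative choices, and `k=60` is the constant commonly used with RRF rather than a value taken from the series.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF uses only rank positions, the four vector fields do not need comparable similarity scores to be combined.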

I Hid the AI’s “Thinking” in Plain Sight: Dual-Channel Streaming for an AI Search Chatbot That Works Mid-Call (Series Part 9)

Streaming a chatbot well takes more than sending tokens faster. In my atlas chatbot I split the stream into two channels, THINKING and QUERY_RESULT, so client-call users see clean, structured candidate cards while power users can expand a collapsible panel and inspect why candidates were ranked the way they were.

I Almost Built a Second Search Index, Then I Realized Privacy Was a Runtime Toggle

I needed recruiters to search candidates live while sharing their screen on client calls, without leaking identities. The tempting answer was “build an anonymized index.” The better answer was a per-conversation privacy mode that keeps the same search, ranking, and data, and only redacts identity fields on the way out, backed by a partial index so normal mode pays essentially nothing.
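Redacting "on the way out" can be sketched as a thin layer over unchanged search results. The field names in `IDENTITY_FIELDS` and the `[redacted]` placeholder are assumptions; the partial index mentioned above is a database detail and is not shown here.

```python
from typing import Any, Dict, List

# Hypothetical set of fields that identify a candidate.
IDENTITY_FIELDS = frozenset({"name", "email", "phone", "linkedin_url"})


def redact(record: Dict[str, Any], privacy_mode: bool) -> Dict[str, Any]:
    """Return a copy of the record, masking identity fields in privacy mode."""
    if not privacy_mode:
        return dict(record)
    return {
        key: ("[redacted]" if key in IDENTITY_FIELDS else value)
        for key, value in record.items()
    }


def render_results(
    results: List[Dict[str, Any]], privacy_mode: bool
) -> List[Dict[str, Any]]:
    """Same search, same ranking; only the outgoing payload changes."""
    return [redact(r, privacy_mode) for r in results]
```

Search and ranking never see the flag, which is why a second, anonymized index is unnecessary in this design.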

I Bought the Cheapest Redis and Dared It to Fail: The Circuit Breaker That Made Cache Optional (Series Part 11)

My processing pipeline slowed to a crawl the first time Redis blinked, and the bug turned out to be my own assumption: that the cache had to be there. Here’s the pattern I shipped instead: a tiny circuit breaker state machine plus a wrapper that returns None on failure, which makes every Redis call safe enough to run on the cheapest tier and treat caching as an optimization rather than a dependency.
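A minimal sketch of that pattern, assuming a three-state breaker (closed, open, half-open) with a failure threshold and a cooldown. The class, the thresholds, and the `safe_call` wrapper are illustrative, and the wrapped function stands in for any Redis call.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after:
            return "half_open"  # cooldown elapsed: allow a trial call
        return "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial


def safe_call(breaker: CircuitBreaker, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run a cache call; return None instead of raising, and skip it while open."""
    if breaker.state == "open":
        return None
    try:
        result = fn(*args, **kwargs)
    except Exception:
        breaker.record_failure()
        return None
    breaker.record_success()
    return result
```

Callers treat `None` as a cache miss and fall through to the source of truth, so an outage costs latency rather than correctness.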

I Lost Three Hours to a Blank Slate, So I Made “Forgetting” Structurally Impossible (Series Part 12)

Part 0 was the diagnosis: a stateless assistant that doesn’t know what it doesn’t know. This is the treatment. A database-backed task ledger, a mandatory boot protocol, and a session handoff format that force full project recovery before any new work starts. Continuity stops depending on willpower and becomes infrastructure.