13-Part Series

How to Architect an Enterprise AI System (And Why the Engineer Still Matters)

Every post in this series is a decision I made that no model would have made on its own. Not because the model is bad — because the model doesn't know what it doesn't know.

Built from a production email intake pipeline processing thousands of recruitment emails daily. LangGraph orchestration, GPT extraction at temperature=1, Docker container apps, CRM integration. 13 engineering decisions, each one a place where the AI raised the floor and the engineer decided what to build on top of it.

13 of 13 parts published


The Floor and the Ceiling

What the AI handles vs. what the engineer decides — the thesis of the entire series in one table.

Floor: Extract structured fields from email text
Ceiling: Run extraction at temperature=1, then validate downstream

Floor: Generate embeddings for semantic search
Ceiling: Build 4 parallel embedding vectors per candidate, not 1

Floor: Parse intent from a voice query
Ceiling: Route through a <100ms classifier before the LLM touches it

Floor: Summarize text and suggest next steps
Ceiling: Build a context-continuity stack so sessions never lose state

Floor: Match candidates to job descriptions
Ceiling: Weight recency, compensation, and relocation independently

Floor: Cache responses to reduce latency
Ceiling: Use $16/mo Redis with a circuit breaker that makes it optional


The Series

0
The Day My AI Forgot Everything (So I Built a Context-Continuity Inference Stack)
Published

A production assistant that can’t resume state can’t own a workflow. After watching real teams lose hours to “AI amnesia” across resets, I built a multi-layer context-continuity stack: a Postgres-backed Context API with stable keys + typed artifacts, hybrid search, progress snapshots, and a mandatory session boot protocol. This post explains the failure mode, the architecture, and a runnable FastAPI slice you can adapt without leaking secrets.

1
I Stopped Letting Emails Poison My Extractor: The Pre-LLM Gate That Made the Rest of the Pipeline Reliable
Published

Part 1 of my “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)” series: why I rebuilt my email intake path around deterministic sanitization + forwarded-thread isolation + scheduler-reschedule detection—before any model call—and how that one decision eliminated the most expensive class of production mistakes: confidently writing the wrong person into the system of record.

2
I Turned Temperature Up to Save My Extractions: The 3‑Node LangGraph That Trades Variance for Truth
Published

In Part 2 of my “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)” series, I explain why my production extractor runs at temperature=1, how I contain that variance with a sequential extract→research→validate LangGraph, and the concrete sanitization + job-title rejection logic that prevents silent data loss and string-null pollution in downstream systems.
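As a rough illustration of the "string-null pollution" problem mentioned above, here is a minimal sketch of a sanitizer that turns placeholder strings such as "null" or "N/A" back into real nulls before anything is written downstream. The `NULL_LIKE` set and the function names are assumptions made for this example, not the post's actual code.

```python
from typing import Any, Optional

# Hypothetical set of placeholder strings a model may emit instead of a real null.
NULL_LIKE = {"", "null", "none", "n/a", "na", "unknown", "undefined"}


def clean_value(value: Any) -> Optional[Any]:
    """Return None for null-like strings; otherwise return the value, trimming strings."""
    if value is None:
        return None
    if isinstance(value, str):
        stripped = value.strip()
        if stripped.lower() in NULL_LIKE:
            return None
        return stripped
    return value


def clean_record(record: dict) -> dict:
    """Apply clean_value to every field of a flat extraction record."""
    return {key: clean_value(val) for key, val in record.items()}
```

Running `clean_record` on a record before validation means a downstream "is this field present?" check sees a true null rather than the literal text "null".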

3
The Six‑Tier Enrichment Cascade: How I Stop “Helpful” Data From Overwriting True Data
Published

I learned the hard way that “just call one enrichment API” turns into 30–40% missing fields in production—and worse, it quietly teaches your system to accept gaps as normal. In Part 3 of my series, I’ll walk through the six-tier enrichment cascade I built (Firecrawl → Bing → domain-variation retry → Serper → Azure Maps POI → phone area code approximation) and the per-field provenance rules that prevent low-confidence guesses from overwriting high-quality facts.
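One way to picture a per-field provenance rule is a merge that only lets a value replace an existing one when it comes from a more trusted source. The sketch below assumes that trust follows the cascade order listed above; the numeric ranks, source keys, and class names are illustrative assumptions rather than the post's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed trust ranks (higher = more trusted), mirroring the cascade order in the text.
SOURCE_RANK = {
    "firecrawl": 6,
    "bing": 5,
    "domain_retry": 4,
    "serper": 3,
    "azure_maps_poi": 2,
    "area_code": 1,
}


@dataclass(frozen=True)
class FieldValue:
    value: str
    source: str  # provenance: which tier produced this value


def merge_field(current: Optional[FieldValue], candidate: FieldValue) -> FieldValue:
    """Keep the current value unless the candidate comes from a strictly more trusted source."""
    if current is None:
        return candidate
    if SOURCE_RANK[candidate.source] > SOURCE_RANK[current.source]:
        return candidate
    return current


def merge_record(
    current: Dict[str, FieldValue], incoming: Dict[str, FieldValue]
) -> Dict[str, FieldValue]:
    """Merge field by field so a low-confidence guess can fill a gap but never overwrite a fact."""
    merged = dict(current)
    for name, candidate in incoming.items():
        merged[name] = merge_field(merged.get(name), candidate)
    return merged
```

With this shape, an area-code approximation can populate a missing city but cannot replace one that was scraped from the company's own site.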

4
User Corrections Always Win: The Streaming Outlook Add‑in UI That Turns Human Edits Into Training Signal (Series Part 4)
Published

I designed the Outlook Add-in so the AI never writes straight into the CRM. It streams a draft into a human-reviewed form—field by field, with confidence badges—then treats the user’s edits as the authoritative truth. The punchline is that the override payload preserves both the AI’s original extraction and the human’s final version, so every correction becomes a durable learning signal instead of a one-off fix.
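To make the override payload concrete, here is a minimal sketch that keeps both the AI's extraction and the user's final values and derives the per-field corrections from them. The key names (`ai_extracted`, `user_final`, `corrections`) are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict


def build_override_payload(
    ai_extracted: Dict[str, Any], user_final: Dict[str, Any]
) -> Dict[str, Any]:
    """Preserve both versions and list every field where the user changed the AI's value."""
    corrections = {
        field: {"ai": ai_extracted.get(field), "user": value}
        for field, value in user_final.items()
        if ai_extracted.get(field) != value
    }
    return {
        "ai_extracted": dict(ai_extracted),  # what the model proposed
        "user_final": dict(user_final),      # what the human approved (authoritative)
        "corrections": corrections,          # the durable learning signal
    }
```

Because the original extraction is stored alongside the final version, the corrections can be recomputed or re-analyzed later without replaying the email.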

5
I Hardcoded the Kill Switch: Feature Flags as AI Guardrails (Series Part 5)
Published

In Part 5 of “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)”, I explain why I treat feature flags as a safety control plane for AI behavior: some flags belong in runtime config for experiments, and some belong in code so they can’t be re-enabled without an intentional deploy. I show the exact flag patterns I use (env-driven booleans/ints, stable percentage rollouts, and a code-level kill switch), plus the routing logic that keeps chat functional by falling back from orchestration to a legacy flow.
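The three flag patterns named above can be sketched in a few lines. This is an illustration under assumptions: the environment variable names, the constant `AUTONOMOUS_CRM_WRITES_ENABLED`, and the flow names are hypothetical, and the hashing scheme is one common way to get stable percentage buckets, not necessarily the one used in the post.

```python
import hashlib
import os

# Code-level kill switch (hypothetical name): flipping it requires an intentional deploy.
AUTONOMOUS_CRM_WRITES_ENABLED = False


def env_bool(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name: str, default: int) -> int:
    """Read an integer flag from the environment, falling back on bad input."""
    try:
        return int(os.environ.get(name, default))
    except ValueError:
        return default


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Stable bucketing: the same user and flag always map to the same bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def may_write_to_crm() -> bool:
    """The runtime flag can only narrow behavior; the code constant has the final say."""
    return AUTONOMOUS_CRM_WRITES_ENABLED and env_bool("CRM_WRITES", default=False)


def choose_chat_flow(user_id: str) -> str:
    """Fall back to the legacy flow unless orchestration is enabled and the user is in the rollout."""
    if not env_bool("ORCHESTRATION_ENABLED", default=False):
        return "legacy"
    percent = env_int("ORCHESTRATION_ROLLOUT_PERCENT", 0)
    return "orchestration" if in_rollout(user_id, "orchestration", percent) else "legacy"
```

Hashing the flag name together with the user ID keeps each user's experience stable across requests while decorrelating rollouts of different flags.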

6
The Queue Was a Table: How I Built Claim/Unclaim Workers with SKIP LOCKED, Stale Recovery, and Retry Caps
Published

A worker fleet that retries safely is mostly about being explicit: atomic claiming with FOR UPDATE SKIP LOCKED, leases that can go stale and be recovered, bounded retries that turn poison pills into triage, and metrics at the claim boundary so you can see failure modes before users do.
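For readers who have not used the pattern, atomic claiming with FOR UPDATE SKIP LOCKED might look roughly like the sketch below. It assumes a hypothetical Postgres table `jobs(id, status, claimed_by, claimed_at, attempts)` and a DB-API cursor that accepts `%(name)s` parameters (as psycopg does); none of these names come from the post.

```python
from typing import Any, Optional

# Claim one job atomically. SKIP LOCKED lets concurrent workers pass over rows
# another transaction is already claiming instead of blocking on them.
# A 'claimed' row whose lease has expired counts as stale and is claimable again;
# rows at the retry cap are left alone for triage.
CLAIM_SQL = """
UPDATE jobs
   SET status = 'claimed',
       claimed_by = %(worker)s,
       claimed_at = now(),
       attempts = attempts + 1
 WHERE id = (
        SELECT id
          FROM jobs
         WHERE attempts < %(max_attempts)s
           AND (status = 'pending'
                OR (status = 'claimed'
                    AND claimed_at < now() - %(lease_seconds)s * interval '1 second'))
         ORDER BY id
         LIMIT 1
           FOR UPDATE SKIP LOCKED
       )
RETURNING id
"""


def claim_next(
    cursor: Any, worker: str, max_attempts: int = 5, lease_seconds: int = 300
) -> Optional[int]:
    """Return the id of the claimed job, or None when nothing is claimable."""
    cursor.execute(
        CLAIM_SQL,
        {"worker": worker, "max_attempts": max_attempts, "lease_seconds": lease_seconds},
    )
    row = cursor.fetchone()
    return row[0] if row else None
```

In this sketch a job that hits the retry cap simply stops being claimable; a separate sweep (not shown) would mark such rows as failed and emit the metric.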

7
The CRM Sync Engine I Had to Reverse‑Engineer: Two‑Step Fetches, 50‑Field Limits, and a Mapper That Refuses to Drift
Published

I built a sync engine against a CRM API that behaved differently in production than its documentation implied. The result is a pipeline that auto-generates its own field schema, paginates past record caps with date chunking, and processes only records whose SHA-256 hash has changed, while normalizing the CRM’s inconsistent types and messy free-text locations into something my downstream systems can trust.
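Hash-based change detection can be sketched as follows: serialize each record canonically, hash it with SHA-256, and skip any record whose hash matches the one stored from the previous sync. The function names, the `id` field, and the in-memory `seen_hashes` dictionary are assumptions for illustration; a real pipeline would persist the hashes.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def record_fingerprint(record: dict) -> str:
    """Hash a canonical JSON form so key order and whitespace do not affect the result."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_records(
    records: Iterable[dict], seen_hashes: Dict[str, str], id_field: str = "id"
) -> List[dict]:
    """Return only new or modified records, updating seen_hashes in place."""
    changed = []
    for record in records:
        digest = record_fingerprint(record)
        key = str(record[id_field])
        if seen_hashes.get(key) != digest:
            seen_hashes[key] = digest
            changed.append(record)
    return changed
```

Sorting keys before hashing matters here: if the API returns the same fields in a different order, the record should still count as unchanged.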

8
The Search Router That Saved Me From One Index to Rule Them All: Azure AI Search for CRM, pgvector for Transcripts, the CRM as the Live Escape Hatch (Series Part 8)
Published

I hit the moment every enterprise AI system eventually hits: one query needed faceting and synonym maps, another needed raw vector similarity over transcripts already sitting in Postgres, and a third needed the CRM as the only source of truth. This post is the routing engine I built to make those choices automatically—Azure AI Search for structured CRM indexes, pgvector for Zoom transcripts, and the CRM API as a real-time fallback—without making the user care which backend answered them.

9
I Hid the AI’s “Thinking” in Plain Sight: Dual-Channel Streaming for an AI Search Chatbot That Works Mid-Call (Series Part 9)
Published

I learned the hard way that streaming a chatbot isn’t just “send tokens faster.” In my vault chatbot, I split the stream into two channels—THINKING and QUERY_RESULT—so client-call users see clean, structured candidate cards while power users can expand a collapsible panel to inspect why candidates were ranked the way they were.
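A simple way to think about the two channels is a stream of events where each one is tagged with its channel, so the client can render results immediately and tuck reasoning into a collapsible panel. The sketch below assumes newline-delimited JSON events with `channel` and `payload` keys; that wire format is an assumption for illustration, not the post's protocol.

```python
import json
from typing import Any, Iterable, List, Tuple


def encode_event(channel: str, payload: Any) -> str:
    """Serialize one stream event tagged with its channel."""
    return json.dumps({"channel": channel, "payload": payload})


def split_channels(lines: Iterable[str]) -> Tuple[List[Any], List[Any]]:
    """Client side: route THINKING events to the collapsible panel, QUERY_RESULT to the cards."""
    thinking: List[Any] = []
    results: List[Any] = []
    for line in lines:
        event = json.loads(line)
        if event["channel"] == "THINKING":
            thinking.append(event["payload"])
        elif event["channel"] == "QUERY_RESULT":
            results.append(event["payload"])
    return thinking, results
```

Keeping the channel tag on every event means the client never has to parse free text to decide what is safe to show during a call.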

10
I Almost Built a Second Search Index—Then I Realized Privacy Was a Runtime Toggle
Published

I needed recruiters to search candidates live while sharing their screen on client calls—without leaking identities. The tempting answer was “build an anonymized index.” The better answer was a per-conversation privacy mode that keeps the same search, ranking, and data, and only redacts identity fields on the way out—backed by a partial index so normal mode pays essentially nothing.
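Redacting "on the way out" can be as small as a presentation-layer function applied to each result after search and ranking have already run on the full data. The field names in `IDENTITY_FIELDS` and the `[redacted]` placeholder below are hypothetical; the sketch does not cover the partial index mentioned above.

```python
from typing import Any, Dict

# Hypothetical set of fields that would identify a candidate on a shared screen.
IDENTITY_FIELDS = frozenset({"name", "email", "phone", "linkedin_url"})


def present(candidate: Dict[str, Any], privacy_mode: bool) -> Dict[str, Any]:
    """Return a copy for display; in privacy mode, mask identity fields and leave the rest intact."""
    if not privacy_mode:
        return dict(candidate)
    return {
        key: ("[redacted]" if key in IDENTITY_FIELDS else value)
        for key, value in candidate.items()
    }
```

Because the function returns a copy, the ranked result set itself is untouched, and toggling privacy mode mid-conversation needs no re-query.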

11
I Bought the Cheapest Redis and Dared It to Fail: The Circuit Breaker That Made Cache Optional (Series Part 11)
Published

I watched my processing pipeline slow to a crawl the first time Redis blinked—and realized the real bug wasn’t downtime, it was my assumption that cache was required. This post is the exact pattern I shipped: a tiny circuit breaker state machine plus a “return None on failure” wrapper that makes every Redis call safe, so I can run on the cheapest tier and treat cache as an optimization, not a dependency.
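The pattern described above, a small circuit breaker plus a wrapper that returns None on failure, can be sketched without any Redis dependency. The thresholds, class name, and `safe_call` helper are assumptions for illustration; the wrapped function stands in for any cache call.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Closed by default; opens after repeated failures; allows a trial call after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True when closed, or when open long enough to permit a half-open trial."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or restart the cooldown after a failed trial


def safe_call(breaker: CircuitBreaker, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run a cache operation; on an open breaker or any error, return None so callers see a miss."""
    if not breaker.allow():
        return None
    try:
        result = fn(*args, **kwargs)
    except Exception:
        breaker.record_failure()
        return None
    breaker.record_success()
    return result
```

Callers treat None as a cache miss and fall through to the slow path, which is what makes the cache an optimization rather than a dependency. One trade-off of this sketch is that a genuinely cached None is indistinguishable from a failure.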

12
I Lost Three Hours to a Blank Slate—So I Made “Forgetting” Structurally Impossible (Series Part 12)
Published

Part 0 was the diagnosis: a stateless assistant that doesn’t know what it doesn’t know. This is the treatment: a database-backed task ledger, a mandatory boot protocol, and a session handoff format that forces full project recovery before any new work starts. The result is boring in the best way—continuity as infrastructure, not willpower.