13-Part Series
How to Architect an Enterprise AI System (And Why the Engineer Still Matters)
Every post in this series is a decision I made that no model would have made on its own. Not because the model is bad — because the model doesn't know what it doesn't know.
Built from a production email intake pipeline processing thousands of recruitment emails daily. LangGraph orchestration, GPT extraction at temperature=1, Docker container apps, CRM integration. 13 engineering decisions, each one a place where the AI raised the floor and the engineer decided what to build on top of it.
Start with Part 0
13 of 13 parts published
The Floor and the Ceiling
What the AI handles vs. what the engineer decides — the thesis of the entire series in one table.
| The Floor (AI handles this) | The Ceiling (Engineer decides this) |
|---|---|
| Extract structured fields from email text | Run extraction at temperature=1, then validate downstream |
| Generate embeddings for semantic search | Build 4 parallel embedding vectors per candidate, not 1 |
| Parse intent from a voice query | Route through a <100ms classifier before the LLM touches it |
| Summarize text and suggest next steps | Build a context-continuity stack so sessions never lose state |
| Match candidates to job descriptions | Weight recency, compensation, and relocation independently |
| Cache responses to reduce latency | Use $16/mo Redis with a circuit breaker that makes it optional |
The Series
A production assistant that can’t resume state can’t own a workflow. After watching real teams lose hours to “AI amnesia” across resets, I built a multi-layer context-continuity stack: a Postgres-backed Context API with stable keys + typed artifacts, hybrid search, progress snapshots, and a mandatory session boot protocol. This post explains the failure mode, the architecture, and a runnable FastAPI slice you can adapt without leaking secrets.
Part 1 of my “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)” series: why I rebuilt my email intake path around deterministic sanitization + forwarded-thread isolation + scheduler-reschedule detection—before any model call—and how that one decision eliminated the most expensive class of production mistakes: confidently writing the wrong person into the system of record.
In Part 2 of my “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)” series, I explain why my production extractor runs at temperature=1, how I contain that variance with a sequential extract→research→validate LangGraph, and the concrete sanitization + job-title rejection logic that prevents silent data loss and string-null pollution in downstream systems.
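To make "string-null pollution" concrete, here is a minimal sketch of the kind of sanitization that blurb describes. The `NULL_LIKE` and `TITLE_WORDS` lists and both function names are hypothetical, chosen for illustration; the production rules are in the post itself.

```python
from typing import Optional

# Hypothetical lists for illustration; tune them against real extractor output.
NULL_LIKE = {"", "null", "none", "n/a", "na", "unknown", "undefined"}
TITLE_WORDS = {"engineer", "manager", "director", "advisor", "recruiter", "analyst"}


def clean_field(value: object) -> Optional[str]:
    """Return a real string or a real None, never the literal text 'null'."""
    if value is None:
        return None
    text = str(value).strip()
    if text.lower() in NULL_LIKE:
        return None
    return text


def clean_person_name(value: object) -> Optional[str]:
    """Reject a job title that the model placed in a name field."""
    text = clean_field(value)
    if text is None:
        return None
    words = {w.strip(".,").lower() for w in text.split()}
    if words & TITLE_WORDS:
        return None
    return text
```

Returning `None` instead of a placeholder string lets downstream code tell "missing" apart from "present", which is the property a validate step can rely on.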
I learned the hard way that “just call one enrichment API” turns into 30–40% missing fields in production—and worse, it quietly teaches your system to accept gaps as normal. In Part 3 of my series, I’ll walk through the six-tier enrichment cascade I built (Firecrawl → Bing → domain-variation retry → Serper → Azure Maps POI → phone area code approximation) and the per-field provenance rules that prevent low-confidence guesses from overwriting high-quality facts.
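A cascade with per-field provenance can be sketched roughly as follows. The tier names would come from the list above, but the confidence numbers, the `FieldValue` type, and the lookup callables are assumptions for illustration, not the real integrations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FieldValue:
    value: str
    source: str        # which tier produced the value
    confidence: float  # hypothetical per-tier score


# (source name, confidence, lookup function); lookups are stand-ins for real APIs.
Tier = Tuple[str, float, Callable[[str], Dict[str, Optional[str]]]]


def run_cascade(company: str, required: List[str], tiers: List[Tier]) -> Dict[str, FieldValue]:
    """Walk tiers in order, stopping once every required field is filled."""
    result: Dict[str, FieldValue] = {}
    for source, confidence, lookup in tiers:
        if all(name in result for name in required):
            break
        try:
            found = lookup(company)
        except Exception:
            continue  # a failing tier must not abort the whole cascade
        for name, value in found.items():
            if not value:
                continue
            current = result.get(name)
            # Provenance rule: a lower-confidence tier never overwrites a better fact.
            if current is None or confidence > current.confidence:
                result[name] = FieldValue(value, source, confidence)
    return result
```

Keeping the source next to each value is what makes it possible to audit later which tier a given field came from.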
I designed the Outlook Add-in so the AI never writes straight into the CRM. It streams a draft into a human-reviewed form—field by field, with confidence badges—then treats the user’s edits as the authoritative truth. The punchline is that the override payload preserves both the AI’s original extraction and the human’s final version, so every correction becomes a durable learning signal instead of a one-off fix.
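One possible shape for such an override payload is sketched below. The key names (`ai_extraction`, `user_final`, `corrections`) are assumptions for illustration; the actual payload format is described in the post.

```python
from typing import Any, Dict


def build_override_payload(ai_extraction: Dict[str, Any], user_final: Dict[str, Any]) -> Dict[str, Any]:
    """Keep both versions and list every field where the human disagreed."""
    corrections = {
        field: {"ai": ai_extraction.get(field), "human": final}
        for field, final in user_final.items()
        if ai_extraction.get(field) != final
    }
    return {
        "ai_extraction": dict(ai_extraction),  # what the model proposed
        "user_final": dict(user_final),        # what the reviewer approved (authoritative)
        "corrections": corrections,            # the per-field learning signal
    }
```

Because the original extraction is never discarded, each correction can be replayed later as an evaluation case.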
In Part 5 of “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)”, I explain why I treat feature flags as a safety control plane for AI behavior: some flags belong in runtime config for experiments, and some belong in code so they can’t be re-enabled without an intentional deploy. I show the exact flag patterns I use (env-driven booleans/ints, stable percentage rollouts, and a code-level kill switch), plus the routing logic that keeps chat functional by falling back from orchestration to a legacy flow.
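The three flag patterns named above might look roughly like this. The environment variable names, the `ORCHESTRATION_HARD_DISABLED` constant, and the `answer` helper are hypothetical; only the patterns themselves come from the description.

```python
import hashlib
import os

# Code-level kill switch: changing it takes an intentional deploy, not a config edit.
ORCHESTRATION_HARD_DISABLED = False


def env_bool(name: str, default: bool = False) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name: str, default: int) -> int:
    try:
        return int(os.environ.get(name, default))
    except ValueError:
        return default


def in_rollout(user_id: str, percent: int) -> bool:
    """Stable bucketing: the same user always lands in the same bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def choose_flow(user_id: str) -> str:
    if ORCHESTRATION_HARD_DISABLED or not env_bool("ORCHESTRATION_ENABLED"):
        return "legacy"
    if in_rollout(user_id, env_int("ORCHESTRATION_ROLLOUT_PERCENT", 0)):
        return "orchestration"
    return "legacy"


def answer(user_id, message, orchestrate, legacy):
    """Try the new path when flagged in; fall back so chat stays functional."""
    if choose_flow(user_id) == "orchestration":
        try:
            return orchestrate(message)
        except Exception:
            pass  # fall through to the legacy flow
    return legacy(message)
```

Hashing the user id rather than calling a random number generator is what keeps a percentage rollout stable across requests and restarts.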
A worker fleet that retries safely is mostly about being explicit: atomic claiming with FOR UPDATE SKIP LOCKED, leases that can go stale and be recovered, bounded retries that turn poison pills into triage, and metrics at the claim boundary so you can see failure modes before users do.
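As a rough illustration of atomic claiming with a recoverable lease, the sketch below assumes a hypothetical `jobs` table with `id`, `status`, `attempts`, `locked_by`, and `lease_expires_at` columns, and a psycopg-style cursor that accepts named `%(name)s` parameters. None of those names come from the post.

```python
from typing import Any, Optional, Tuple

# Claim one job atomically. Rows locked by another worker are skipped, not waited on.
# Jobs whose lease has expired are eligible again; jobs that reached max_attempts
# are never claimed and can be routed to triage by a separate query.
CLAIM_SQL = """
UPDATE jobs
SET status = 'running',
    attempts = attempts + 1,
    locked_by = %(worker)s,
    lease_expires_at = now() + %(lease_seconds)s * interval '1 second'
WHERE id = (
    SELECT id
    FROM jobs
    WHERE attempts < %(max_attempts)s
      AND (status = 'queued'
           OR (status = 'running' AND lease_expires_at < now()))
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, attempts
"""


def claim_job(cur: Any, worker: str, lease_seconds: int = 300,
              max_attempts: int = 5) -> Optional[Tuple[Any, int]]:
    """Return (job_id, attempts) for the claimed job, or None if nothing is claimable."""
    cur.execute(CLAIM_SQL, {
        "worker": worker,
        "lease_seconds": lease_seconds,
        "max_attempts": max_attempts,
    })
    return cur.fetchone()
```

The return value of `claim_job` is a natural place to emit claim metrics, since both "nothing to claim" and a rising `attempts` count are visible there.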
I built a sync engine against a CRM API that behaved differently in production than its documentation implied. The result is a pipeline that auto-generates its own field schema, paginates past record caps with date chunking, and only processes changed records via SHA‑256—while normalizing the CRM’s inconsistent types and messy freetext locations into something my downstream systems can trust.
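Change detection by SHA-256 can be sketched like this, assuming records arrive as JSON-serializable dicts with an `id` key and that previously stored hashes are available as a plain dict. Both are assumptions for illustration.

```python
import hashlib
import json
from typing import Dict, Iterable, List, Tuple


def record_fingerprint(record: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_records(records: Iterable[dict], seen_hashes: Dict[str, str],
                    key: str = "id") -> List[Tuple[dict, str]]:
    """Return (record, digest) pairs for records that are new or modified.

    The caller stores the digest only after processing succeeds, so a failed
    record is picked up again on the next sync.
    """
    changed = []
    for record in records:
        digest = record_fingerprint(record)
        if seen_hashes.get(record[key]) != digest:
            changed.append((record, digest))
    return changed
```

Normalizing inconsistent types before hashing matters here: if the CRM returns `"5"` on one sync and `5` on the next, the digests differ even though nothing meaningful changed.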
I hit the moment every enterprise AI system eventually hits: one query needed faceting and synonym maps, another needed raw vector similarity over transcripts already sitting in Postgres, and a third needed the CRM as the only source of truth. This post is the routing engine I built to make those choices automatically—Azure AI Search for structured CRM indexes, pgvector for Zoom transcripts, and the CRM API as a real-time fallback—without making the user care which backend answered them.
I learned the hard way that streaming a chatbot isn’t just “send tokens faster.” In my vault chatbot, I split the stream into two channels—THINKING and QUERY_RESULT—so client-call users see clean, structured candidate cards while power users can expand a collapsible panel to inspect why candidates were ranked the way they were.
I needed recruiters to search candidates live while sharing their screen on client calls—without leaking identities. The tempting answer was “build an anonymized index.” The better answer was a per-conversation privacy mode that keeps the same search, ranking, and data, and only redacts identity fields on the way out—backed by a partial index so normal mode pays essentially nothing.
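Redacting "on the way out" can be as small as the sketch below: search and ranking run unchanged, and identity fields are dropped only when results are presented. The field names in `IDENTITY_FIELDS` and the `Candidate N` label are hypothetical.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical identity fields; the real list depends on the candidate schema.
IDENTITY_FIELDS = ("name", "email", "phone", "linkedin_url")


def present(results: Sequence[Dict[str, Any]], privacy_mode: bool) -> List[Dict[str, Any]]:
    """Return results as-is in normal mode, or with identity fields removed."""
    if not privacy_mode:
        return [dict(row) for row in results]
    redacted = []
    for position, row in enumerate(results, start=1):
        safe = {k: v for k, v in row.items() if k not in IDENTITY_FIELDS}
        safe["label"] = f"Candidate {position}"  # stable handle for the call
        redacted.append(safe)
    return redacted
```

Because ranking happens before redaction, both modes return the same candidates in the same order.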
I watched my processing pipeline slow to a crawl the first time Redis blinked—and realized the real bug wasn’t downtime, it was my assumption that cache was required. This post is the exact pattern I shipped: a tiny circuit breaker state machine plus a “return None on failure” wrapper that makes every Redis call safe, so I can run on the cheapest tier and treat cache as an optimization, not a dependency.
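A minimal sketch of that pattern, under assumed thresholds: a three-state breaker (closed, open, half-open) whose `call` wrapper returns `None` on any failure. The class name, the default values, and the injectable `clock` are illustrative choices, not the shipped code.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half_open"  # allow one trial call through
        return "open"

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        """Run fn, or return None if the breaker is open or fn raises."""
        if self.state == "open":
            return None  # skip the cache entirely; the caller falls back to the source
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return None
        self.failures = 0
        self.opened_at = None
        return result
```

With redis-py this would be used as `cached = breaker.call(client.get, key)`. A `None` result then means either a cache miss or an unavailable cache, and the caller handles both the same way by computing the value from the source.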
Part 0 was the diagnosis: a stateless assistant that doesn’t know what it doesn’t know. This is the treatment: a database-backed task ledger, a mandatory boot protocol, and a session handoff format that forces full project recovery before any new work starts. The result is boring in the best way—continuity as infrastructure, not willpower.