
Four Vectors, One Record: How I Split Embeddings Before They Hit Search

Daniel Anthony Romitelli Jr. · May 29, 2026


The failure that pushed me into this design was not subtle. I had a blended embedding that kept returning candidate matches that looked reasonable at a glance and wrong on inspection. A query about a very specific certification would drag in people with the right industry background but no credential. A query about deep experience in a narrow domain would surface candidates whose skills summary happened to mention the right keywords but whose work history did not support the match. The single vector was doing exactly what I asked it to do: compressing the entire record into one semantic point. The problem was that a candidate record is not one thing.

This is the same failure dense-retrieval people call representational collapse: when you ask one embedding to summarize a heterogeneous object, the model gravitates to the dominant mode and discards the long tail. For a document, that tail is a paragraph. For a candidate record, the tail is an entire field. Credentials get smoothed by industry text. Career trajectory gets smoothed by a polished summary. A single vector turns a structured record into a compromise the search layer has to live with.

A profile has several different centers of meaning — multiple semantic modes that do not collapse cleanly. Work history answers one class of search intent. Skills and designations answer another. A broad profile summary is useful for discovery, but it is a poor substitute for the details that actually separate one candidate from another. The fix is to stop forcing one embedding to carry all of them.

That is what app/agents/embedding_agent.py is for in this system. It generates four parallel embeddings for the same logical record: profile_vector, experience_vector, skills_vector, and general_vector. This is late fusion in the retrieval sense: embed each view independently, defer the "which view matched" decision until query time, and let the search layer pick the best-matching vector per query. Then app/jobs/embedding_generator.py takes over the production concerns: it validates the input, applies retry logic, and routes terminal failures to the dead-letter queue. The separation is deliberate. The agent handles semantic decomposition and caching. The job handles durability and failure control. That two-layer split borrows the idea behind the bulkhead pattern: keep the failure modes of one concern from contaminating the other, the way watertight compartments work on a ship.
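To make the decomposition concrete, here is a minimal sketch of splitting one record into four texts before embedding. The CandidateRecord fields and the build_views helper are hypothetical; the article names the four vectors but does not show which record fields feed each one.

```python
from dataclasses import dataclass, field

VECTOR_TYPES = (
    "profile_vector",
    "experience_vector",
    "skills_vector",
    "general_vector",
)


@dataclass
class CandidateRecord:
    # Hypothetical fields: the real record schema is not shown in the article.
    summary: str = ""
    work_history: list[str] = field(default_factory=list)
    skills: list[str] = field(default_factory=list)
    designations: list[str] = field(default_factory=list)


def build_views(record: CandidateRecord) -> dict[str, str]:
    """Return one text per vector type so each view is embedded on its own."""
    experience = "\n".join(record.work_history)
    skills = ", ".join(record.skills + record.designations)
    # The general view is the only one that blends every field together.
    general = "\n".join(p for p in (record.summary, experience, skills) if p)
    return {
        "profile_vector": record.summary,
        "experience_vector": experience,
        "skills_vector": skills,
        "general_vector": general,
    }
```

The point of the sketch is that credentials never appear in the experience text and work history never appears in the skills text, so neither view can smooth the other.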

The shape of the pipeline

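I like systems that are honest about their steps. This one has five: split the record into four views, check the cache for each view, embed the misses in parallel, validate what comes back, and either index the result or route the failure to a retry or to the dead-letter queue.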

The agent, its constants, and the cache key

The embedding agent is specialized on purpose. Its job is not "make vectors." Its job is "make four specific vectors, keep them cached, and do it fast enough that repeated calls do not punish the API."

The contract is declared as constants at the top of the file:

EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_DIMENSIONS = 3072  # text-embedding-3-large native dimensions
EMBEDDING_TTL_SECONDS = 86400  # 24 hours
CACHE_KEY_PREFIX = "emb:"

Those four lines are the boundary conditions for the whole embedding pipeline, what a working systems engineer would call a contract module. The model choice defines the embedding family. The dimension count defines what downstream systems must expect, and what the HNSW or IVF index was sized for; if it ever drifts, the mismatch is immediate. The TTL is the upper bound on how stale a cached embedding can get, which makes it a direct knob on the freshness/cost tradeoff. The prefix scopes the embedding namespace inside Redis, because a multi-tenant cache without prefixes becomes archaeology the moment anything goes wrong.
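The TTL behavior can be illustrated without a running Redis. The sketch below is an in-memory stand-in, not the agent's actual cache code; the TTLCache class and its injectable clock are assumptions made so the expiry boundary can be checked deterministically.

```python
import time
from typing import Callable, Optional

EMBEDDING_TTL_SECONDS = 86400  # 24 hours, as in the article's constants


class TTLCache:
    """Hypothetical in-memory stand-in for a Redis SET with an expiry."""

    def __init__(self, ttl_seconds: int, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, list[float]]] = {}

    def set(self, key: str, value: list[float]) -> None:
        # Each entry remembers when it stops being valid.
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[list[float]]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop it so the caller regenerates the embedding.
            del self._store[key]
            return None
        return value
```

An entry written at time zero is served until just before the 24-hour mark and is gone at it, which is what "upper bound on staleness" means in practice.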

The cache key itself is where multi-vector output stays safe:

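import hashlib  # module-level import the method relies on

def _generate_cache_key(self, query: str, vector_type: str = "all") -> str:
    """
    Generate a deterministic cache key from query text.

    Returns: emb:{md5_hash}:{vector_type}
    """
    # Normalize case and surrounding whitespace so trivially different inputs share a key.
    normalized = query.lower().strip()
    # MD5 serves only as a fast, stable fingerprint here, not as a security measure.
    query_hash = hashlib.md5(normalized.encode()).hexdigest()
    return f"{CACHE_KEY_PREFIX}{query_hash}:{vector_type}"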

The vector_type token is the part that makes this cache safe for a four-vector bundle. This is key-space scoping in the small: each vector view gets its own deterministic slot, so a lookup for profile_vector cannot accidentally return the value cached for skills_vector. Without the token, the four keys would collide and a lookup would return whichever vector was written last; with it, each view reads back only what it wrote. The hash makes the key deterministic, and the lower().strip() normalization makes it forgiving: inputs that differ only in letter case or in leading and trailing whitespace hit the same slot. Internal whitespace is not collapsed, so two inputs that differ by a doubled space in the middle still get different keys. Repeated content returns quickly. New content pays the model cost once. The cache stays legible because nothing else in Redis uses the emb: prefix.
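A short usage sketch shows those properties directly. It restates the key function as a standalone generate_cache_key (a renamed copy without self, for illustration only), and the sample query strings are made up.

```python
import hashlib

CACHE_KEY_PREFIX = "emb:"


def generate_cache_key(query: str, vector_type: str = "all") -> str:
    """Standalone copy of the article's key function, for demonstration."""
    normalized = query.lower().strip()
    query_hash = hashlib.md5(normalized.encode()).hexdigest()
    return f"{CACHE_KEY_PREFIX}{query_hash}:{vector_type}"


# Same text up to case and surrounding whitespace: same slot.
padded = generate_cache_key("  Certified Financial Planner ", "skills_vector")
plain = generate_cache_key("certified financial planner", "skills_vector")

# Same text, different view: different slot.
other_view = generate_cache_key("certified financial planner", "profile_vector")

# A doubled internal space is not normalized away: different slot.
inner_space = generate_cache_key("certified  financial planner", "skills_vector")
```

Here padded and plain are equal, while other_view and inner_space each differ from them.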

The agent then runs the four model calls in parallel on a cache miss. The vectors are independent semantic views, so there is no reason to serialize them: asyncio.gather runs the four embedding requests concurrently, so latency is bounded by the slowest of the four calls rather than by their sum. In practice the four calls complete in roughly the wall-clock time of a single call. The agent also tracks cache hits, cache misses, embeddings generated, errors, tokens used, and total latency. Those metrics are what separate a slow cache miss from a model issue from a bad keying strategy when the system gets noisy. Observability has to be richer than the failure mode you are trying to debug, or you end up guessing.
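The fan-out can be sketched with asyncio.gather and a fake embedding call. The fake_embed coroutine, the stats counter, and the sample view texts are all hypothetical stand-ins; the real agent calls the embedding API instead of sleeping.

```python
import asyncio

# Counts overlapping calls, to show the four requests really run concurrently.
stats = {"in_flight": 0, "peak_in_flight": 0}


async def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for the real embedding API call."""
    stats["in_flight"] += 1
    stats["peak_in_flight"] = max(stats["peak_in_flight"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulated network round trip
    stats["in_flight"] -= 1
    return [float(len(text))]  # toy one-dimensional "embedding"


async def embed_all_views(views: dict[str, str]) -> dict[str, list[float]]:
    names = list(views)
    # gather returns results in argument order, so they zip back onto the names.
    vectors = await asyncio.gather(*(fake_embed(views[n]) for n in names))
    return dict(zip(names, vectors))


views = {
    "profile_vector": "summary",
    "experience_vector": "work history",
    "skills_vector": "skills",
    "general_vector": "everything",
}
result = asyncio.run(embed_all_views(views))
```

All four calls are in flight before any of them finishes, which is why the total wait tracks the slowest call rather than the sum.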

The job that makes the system durable

The agent solves the semantic problem. The job in app/jobs/embedding_generator.py solves the production problem.

Its responsibility is not to invent vectors. Its responsibility is to make sure the vectors that enter the index are valid, and to make sure failures are classified correctly when something goes wrong. The job handles three things: content-length validation, embedding-dimension validation, and retry/DLQ routing.
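The two validation checks might look like the following sketch. The function names and the MAX_CONTENT_CHARS limit are assumptions, since the article does not state the actual length threshold or how the job expresses these checks; only EMBEDDING_DIMENSIONS and the two failure names come from the source.

```python
from typing import Optional, Sequence

EMBEDDING_DIMENSIONS = 3072  # from the article's constants
MAX_CONTENT_CHARS = 32_000   # hypothetical limit; the real threshold is not stated


def check_content(text: str, max_chars: int = MAX_CONTENT_CHARS) -> Optional[str]:
    """Return a failure reason if the input is too long to embed, else None."""
    return "content_too_long" if len(text) > max_chars else None


def check_embedding(
    vector: Sequence[float], expected: int = EMBEDDING_DIMENSIONS
) -> Optional[str]:
    """Return a failure reason if the vector does not match the index shape."""
    return "dimension_mismatch" if len(vector) != expected else None
```

Returning a reason string rather than a bare boolean keeps the failure classifiable later, when the job has to decide between a retry and the dead-letter queue.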

The retry policy is exponential-style backoff with capped intervals for transient failures: 30 seconds, then 2 minutes, then 10 minutes. That growth pattern (4× and then 5× between steps) gives temporary upstream issues time to clear without flooding the pipeline with immediate retries, and the cap keeps the worst-case retry budget bounded: three retries add up to 12.5 minutes of waiting. The pattern is an old one. TCP backs off its retransmission timeout in the same spirit, doubling it after each loss, and it works because most transient failures recover on a timescale that does not require sub-second polling.
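The schedule is small enough to write down as data. This is a sketch under the assumption that the three intervals are the complete schedule and that attempts are counted from one; the names RETRY_DELAYS_SECONDS and next_retry_delay are illustrative, not taken from the job's code.

```python
from typing import Optional

# The three intervals named in the article, in seconds.
RETRY_DELAYS_SECONDS = (30, 120, 600)

# Worst-case total waiting time if every retry is used.
TOTAL_RETRY_BUDGET_SECONDS = sum(RETRY_DELAYS_SECONDS)


def next_retry_delay(failed_attempts: int) -> Optional[int]:
    """Delay before the next try, given how many attempts have failed so far.

    Returns None once the schedule is exhausted (assumed to mean "stop retrying").
    """
    if 1 <= failed_attempts <= len(RETRY_DELAYS_SECONDS):
        return RETRY_DELAYS_SECONDS[failed_attempts - 1]
    return None
```

The budget works out to 750 seconds, or 12.5 minutes, which is the number to compare against how long upstream incidents typically last.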

But not every failure deserves another attempt. Some failures are terminal by definition. If the content is too long, the input needs to change. If the dimension count does not match the expected shape, the embedding is not fit for indexing. Those cases go straight to the DLQ as poison messages — input that no amount of retrying will fix because the retry policy operates on time, not on the cause of the failure.

That is the right line to draw. A retryable failure says "try again later." A terminal failure says "this record needs intervention or upstream correction." Mixing those two up is how queues become retry-storm amplifiers — the same bad input cycling through the worker pool, burning capacity that healthy traffic needs.

Why terminal failures go straight to the DLQ

I treat the dead-letter queue as an operational control surface, not as a waste bin. If a record lands there because the content is too long or the dimensions are wrong, that is useful signal. It tells me exactly what kind of correction is needed.

That approach keeps the normal path clean. Retryable failures get another chance. Invalid records get isolated. The DLQ becomes the place where malformed input, schema drift, or upstream data quality issues are surfaced — instead of being hidden under repeated attempts that look like noise on a dashboard. That is also why the terminal failure list is explicit: content_too_long and dimension_mismatch. Those are not transient conditions. Retrying them would only burn time and make the failure harder to interpret.
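The retry-versus-DLQ decision can be sketched as a single routing function. The two terminal reasons come from the article; the Decision type, the route_failure name, the "rate_limited" example reason, and the choice to dead-letter a transient failure once retries run out are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# The explicit terminal list from the article.
TERMINAL_FAILURES = frozenset({"content_too_long", "dimension_mismatch"})
RETRY_DELAYS_SECONDS = (30, 120, 600)


@dataclass(frozen=True)
class Decision:
    action: str  # "retry" or "dead_letter"
    delay_seconds: Optional[int] = None


def route_failure(reason: str, failed_attempts: int) -> Decision:
    """Decide what happens to a failed record."""
    if reason in TERMINAL_FAILURES:
        # Retrying cannot change the input, so skip the schedule entirely.
        return Decision("dead_letter")
    if 1 <= failed_attempts <= len(RETRY_DELAYS_SECONDS):
        return Decision("retry", RETRY_DELAYS_SECONDS[failed_attempts - 1])
    # Assumption: a transient failure that outlives the schedule is also isolated.
    return Decision("dead_letter")
```

For example, route_failure("dimension_mismatch", 1) dead-letters immediately, while a transient reason such as "rate_limited" gets a 30-second retry on its first failure.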

The job's logic protects downstream search from poison data. That is a boring sentence, and it is the whole point of the job.

The real benefit of four vectors

The payoff is not an abstract claim that multi-vector is better. It is that a candidate record can answer multiple search intents without being flattened into a single compromise representation. This is exactly the problem classical information retrieval solved with BM25F — field-weighted BM25 that lets a search over "title, body, anchor text" weight each field separately rather than concatenating them into a single bag of words. The four-view embedding is the dense-retrieval analog: each field gets its own representation, and the search layer composes them at query time instead of relying on a pre-flattened average.

The work-history view preserves sequence and career motion. The skills view preserves credentials, designations, and specific capabilities. The profile view keeps the higher-level picture intact. The general view gives broad coverage when the query is not narrowly about one field — the safety net for queries that do not map cleanly to any one mode.

The record keeps its internal structure while still becoming searchable. If I had stayed with one blended embedding, queries about certification would have kept drifting toward domain-adjacent candidates, and queries about career history would have kept overvaluing polished summary language. Splitting the record before embedding makes those tradeoffs explicit instead of accidental — the search layer can now choose which view drives the score, and which one only gets to break ties.
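One way to picture "which view drives the score" is a weighted sum of per-view cosine similarities. This is a toy sketch, not the article's retrieval code and not how any particular search service fuses vector results; the score_record function, the weights, and the two-dimensional vectors are all made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_record(
    query_vec: list[float],
    record_vecs: dict[str, list[float]],
    weights: dict[str, float],
) -> float:
    """Weighted sum of per-view similarities; views without a weight are ignored."""
    return sum(
        weights.get(name, 0.0) * cosine(query_vec, vec)
        for name, vec in record_vecs.items()
    )


# A credential-style query: skills drives the score, profile only breaks ties.
weights = {"skills_vector": 1.0, "profile_vector": 0.1}
query = [1.0, 0.0]
has_credential = {"skills_vector": [1.0, 0.0], "profile_vector": [0.0, 1.0]}
adjacent_only = {"skills_vector": [0.0, 1.0], "profile_vector": [1.0, 0.0]}
```

With these weights the candidate whose skills view matches outranks the one whose profile merely sounds right, which is the behavior a single blended vector could not guarantee.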

How this changes the way I debug search

One of the best side effects of the split is that debugging got a lot cleaner. When the embedding layer is monolithic, every retrieval complaint feels like one problem. With four vectors and a separate job layer, each failure mode has a coordinate: I can point at which vector, which stage, and which layer owns the problem instead of triangulating.

If latency is high, I look at cache misses and generation time. If the system is burning API calls, I look at cache hit rate. If records are failing to persist, I check validation and DLQ counts. If the embedding shapes are wrong, I know the issue is in the contract, not in the ranking layer. That mapping from symptom to layer is what makes the system observable in the precise sense — not "we have dashboards," but "every reasonable question has a corresponding metric to read."
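The counters the agent tracks map onto those questions fairly directly. The sketch below is an assumed shape for such a metrics object; the EmbeddingMetrics class, its field names, and the millisecond unit are illustrative, while the list of tracked quantities comes from the article.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingMetrics:
    # The six quantities the article says the agent tracks.
    cache_hits: int = 0
    cache_misses: int = 0
    embeddings_generated: int = 0
    errors: int = 0
    tokens_used: int = 0
    total_latency_ms: float = 0.0  # unit is an assumption

    @property
    def cache_hit_rate(self) -> float:
        """Answers: is the system burning API calls it could have cached?"""
        lookups = self.cache_hits + self.cache_misses
        return self.cache_hits / lookups if lookups else 0.0

    @property
    def avg_generation_latency_ms(self) -> float:
        """Answers: is generation itself slow, independent of cache behavior?"""
        if not self.embeddings_generated:
            return 0.0
        return self.total_latency_ms / self.embeddings_generated
```

A low hit rate with normal generation latency points at the keying strategy; a normal hit rate with high generation latency points at the model or the network.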

That split in responsibilities is what makes the system tractable. The embedding agent is about generating the right semantic inputs. The job is about making sure those inputs survive production constraints. Each layer is narrow enough to inspect without guessing.

Why I kept the implementation narrow

I did not want the embedding path to become a catch-all orchestration layer. A lot of systems become hard to reason about because one component starts doing semantic prep, retry control, persistence, error handling, and cache management all at once — what designers call god-object accretion, where convenience eats coherence one feature at a time.

This one stays focused. The agent prepares and caches the vectors. The job validates and routes outcomes. The search layer consumes the resulting embeddings through the normal indexing path. That separation keeps each piece easier to test, easier to replace, and free of accidental coupling — changing retry behavior never forces a rewrite of semantic preparation, and adjusting the vector split never touches dead-letter routing. It is the same logic that drives single-responsibility design at the class level, applied at the service-boundary level.

What the split preserved

The thing I wanted to preserve was not just accuracy — it was the shape of the candidate record itself. A blended embedding makes a record searchable but smooths away the distinctions that separate one candidate from the next: experience is not the same as skills, and a polished title can hide a work history that tells a different story. Four semantic views keep those distinctions where search can still use them.

That is the real win. The record remains one record, but it no longer has to pretend it only means one thing.

The next step is to connect this multi-vector representation to the retrieval side in a way that keeps the same discipline: explicit field intent, explicit score handling, and no hidden magic between the index and the ranking layer. Late-interaction models like ColBERT push this idea even further — preserving per-token vectors and aggregating at query time, which gives precision without committing to a field schema up front. But that is a much bigger lift, and the four-view split is the right step from where I started. Once the record is no longer flattened, the search layer has to earn its keep.