Four Vectors, One Record: How I Split Embeddings Before They Hit Search

The failure that pushed me into this design wasn't subtle. My blended embedding kept returning candidates who looked fine at a glance and wrong on inspection. Ask about a very specific certification and you'd get people with the right industry background and no credential. Ask about deep experience in a narrow domain and you'd get candidates whose skills summary happened to mention the right keywords, with a work history that didn't support the match. The vector was doing exactly what I'd asked it to do: compressing an entire record into one semantic point. A candidate record just isn't one thing.

Dense-retrieval people have a name for this. Representational collapse: ask one embedding to boil down a heterogeneous object and the model gravitates to the dominant mode, discarding the long tail. For a document, that tail is a paragraph. For a candidate record, the tail is an entire field. Credentials get smoothed over by industry text. Career trajectory gets smoothed over by a polished summary. One vector turns a structured record into a compromise the search layer then has to live with.

A profile has several different centers of meaning, several semantic modes that don't collapse cleanly into one. Work history answers one class of search intent. Skills and designations answer another. A broad profile summary is useful for discovery, but it's a poor substitute for the details that actually separate one candidate from the next. So stop asking one embedding to carry all of them.

That's what app/agents/embedding_agent.py is for in this system. It generates four parallel embeddings for the same logical record: profile_vector, experience_vector, skills_vector, and general_vector. In retrieval terms that's late fusion, meaning you embed each view independently, defer the "which view matched" decision until query time, and let the search layer pick the best-matching vector per query. Then app/jobs/embedding_generator.py takes over the production concerns. It validates the input, applies retry logic, and routes terminal failures to the dead-letter queue. The split is deliberate. The agent handles semantic decomposition and caching; the job handles durability and failure control. And the reason for two layers is the bulkhead pattern, keeping the failure modes of one concern from contaminating the other, the way watertight compartments work on a ship.

The shape of the pipeline

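I like a system that's honest about its steps. Here there are three. The agent takes one record, checks the cache for each of the four views, and embeds whatever is missing in parallel. The job validates content length and embedding dimensions, retries transient failures on a backoff schedule, and sends terminal failures to the dead-letter queue. Whatever survives goes to the index.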

The agent, its constants, and the cache key

The embedding agent is specialized on purpose. Its whole assignment: produce four specific vectors, keep them cached, and run fast enough that repeated calls don't punish the API.

The contract sits in four constants at the top of the file.

EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_DIMENSIONS = 3072  # text-embedding-3-large native dimensions
EMBEDDING_TTL_SECONDS = 86400  # 24 hours
CACHE_KEY_PREFIX = "emb:"

Those four lines are the boundary conditions for the whole embedding pipeline, what a working systems engineer would call a contract module. The model choice defines the embedding family. The dimension count defines what the rest of the system has to expect, and what the HNSW or IVF index was sized for; if the two ever fall out of step, the mismatch shows up immediately as a dimension validation failure instead of as quietly wrong search results. The TTL is the ceiling on how stale a cached vector can get, which makes it a direct knob on the freshness/cost tradeoff. And the prefix scopes the embedding namespace inside Redis, because a multi-tenant cache without prefixes becomes archaeology the moment anything goes wrong.

The cache key itself is where the multi-vector output stays safe.

def _generate_cache_key(self, query: str, vector_type: str = "all") -> str:
    """
    Generate a deterministic cache key from query text.

    Returns: emb:{md5_hash}:{vector_type}
    """
    normalized = query.lower().strip()
    query_hash = hashlib.md5(normalized.encode()).hexdigest()
    return f"{CACHE_KEY_PREFIX}{query_hash}:{vector_type}"

The vector_type token is the piece that makes this cache safe for a four-vector bundle. Call it key-space scoping in the small. Each view gets its own deterministic slot, so a lookup for profile_vector can't come back holding the value cached for skills_vector. Without the token the keys would collide and a lookup would return whichever vector wrote last. With it, the cache stays correct under concurrent writes from all four views. The lower().strip() normalization canonicalizes the input before hashing: queries that differ only in case or in leading and trailing whitespace land in the same slot. Differences in internal whitespace still produce different keys. Repeated content returns quickly. New content pays the model cost once. And the cache stays legible because nothing else in Redis uses the emb: prefix.
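To make the key behavior concrete, here is a minimal sketch that lifts the method's body into a standalone function. The function name generate_cache_key and the sample query strings are my own for illustration; the prefix and view names are the ones from this post.

```python
import hashlib

CACHE_KEY_PREFIX = "emb:"


def generate_cache_key(query: str, vector_type: str = "all") -> str:
    """Standalone copy of the key logic, for illustration only."""
    normalized = query.lower().strip()
    query_hash = hashlib.md5(normalized.encode()).hexdigest()
    return f"{CACHE_KEY_PREFIX}{query_hash}:{vector_type}"


# Case and leading/trailing whitespace collapse into the same slot.
key_a = generate_cache_key("  CFA Charterholder ", "skills_vector")
key_b = generate_cache_key("cfa charterholder", "skills_vector")

# Same text, different view: the vector_type suffix keeps the keys apart.
key_c = generate_cache_key("cfa charterholder", "profile_vector")

# Internal whitespace is not normalized, so this is a different key.
key_d = generate_cache_key("cfa  charterholder", "skills_vector")

print(key_a == key_b, key_b != key_c, key_b != key_d)
```

MD5 is used here only as a fast, stable fingerprint for a cache slot, not for anything security-sensitive.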

On a miss, the agent fires all four model calls in parallel. The vectors are independent semantic views, so there's no reason to serialize them. asyncio.gather runs the four embedding requests concurrently, so the latency of a miss is bounded by the slowest call rather than by the sum of all four. In practice the bundle finishes in roughly the wall-clock time of its slowest request. The agent also tracks cache hits, cache misses, embeddings generated, errors, tokens used, and total latency. Those metrics are what let me tell a slow cache miss from a model issue from a bad keying strategy once the system gets noisy. Observability has to be richer than the failure mode you're trying to debug, or you end up guessing.
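The fan-out is easy to see in isolation. This is a sketch, not the agent's code: fake_embed is a hypothetical stand-in that sleeps instead of calling the embedding API, and embed_all_views is a name I made up for the example.

```python
import asyncio
import time

VECTOR_TYPES = (
    "profile_vector",
    "experience_vector",
    "skills_vector",
    "general_vector",
)


async def fake_embed(text: str, vector_type: str, delay: float) -> list[float]:
    """Hypothetical stand-in for one embedding API call."""
    await asyncio.sleep(delay)  # simulates network latency
    return [float(len(text)), float(len(vector_type))]


async def embed_all_views(views: dict[str, str], delay: float = 0.1) -> dict[str, list[float]]:
    """Run one embedding request per view concurrently."""
    names = list(views)
    # gather returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(
        *(fake_embed(views[name], name, delay) for name in names)
    )
    return dict(zip(names, results))


views = {name: f"text for {name}" for name in VECTOR_TYPES}
start = time.perf_counter()
vectors = asyncio.run(embed_all_views(views))
elapsed = time.perf_counter() - start
print(sorted(vectors), f"{elapsed:.2f}s")
```

Four sequential calls at 0.1 seconds each would take about 0.4 seconds; the gathered version should land close to 0.1.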

The job that makes the system durable

The agent solves the semantic problem. The job in app/jobs/embedding_generator.py solves the production problem.

It doesn't invent vectors. It makes sure the vectors entering the index are valid, and that failures get classified correctly when something goes wrong. Three responsibilities: content-length validation, embedding-dimension validation, and retry/DLQ routing.
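The two validation checks are small enough to sketch. Treat this as an assumption-laden outline: the function names are mine, and MAX_CONTENT_CHARS is a hypothetical limit, since I haven't stated the real one here. The dimension count and the two failure names are the ones used in this post.

```python
from typing import Optional, Sequence

EMBEDDING_DIMENSIONS = 3072  # matches the agent's constant
MAX_CONTENT_CHARS = 8000     # hypothetical limit, chosen for illustration


def validate_content(text: str) -> Optional[str]:
    """Return a failure reason, or None if the content is acceptable."""
    if len(text) > MAX_CONTENT_CHARS:
        return "content_too_long"
    return None


def validate_embedding(vector: Sequence[float]) -> Optional[str]:
    """Return a failure reason, or None if the vector has the expected shape."""
    if len(vector) != EMBEDDING_DIMENSIONS:
        return "dimension_mismatch"
    return None


print(validate_content("short profile text"))           # None
print(validate_content("x" * (MAX_CONTENT_CHARS + 1)))  # content_too_long
print(validate_embedding([0.0] * 1536))                 # dimension_mismatch
```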

The retry policy for transient failures is a capped backoff schedule: 30 seconds, then 2 minutes, then 10 minutes. That growth pattern (4x, then 5x) gives a temporary upstream issue room to clear without flooding the pipeline with immediate retries, and the cap keeps the worst-case retry budget bounded at 12.5 minutes of waiting. The shape is an old one. TCP backs off its retransmission timer the same way, doubling it after each timeout, and the approach works because most transient failures recover on a timescale that doesn't require sub-second polling.
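As a sketch, the schedule is just a lookup table. The helper name next_retry_delay is hypothetical, and returning None once the table is exhausted is my assumption about how a caller would learn the budget is spent.

```python
from typing import Optional

RETRY_DELAYS_SECONDS = (30, 120, 600)  # 30 seconds, 2 minutes, 10 minutes


def next_retry_delay(failed_attempts: int) -> Optional[int]:
    """Delay before the next retry, given how many attempts have failed so far.

    Returns None when the schedule is exhausted (an assumption of this sketch).
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts > len(RETRY_DELAYS_SECONDS):
        return None
    return RETRY_DELAYS_SECONDS[failed_attempts - 1]


# Worst-case time spent waiting across the whole schedule.
total_wait_seconds = sum(RETRY_DELAYS_SECONDS)

# Growth factor between consecutive steps.
growth = [
    later / earlier
    for earlier, later in zip(RETRY_DELAYS_SECONDS, RETRY_DELAYS_SECONDS[1:])
]

print(total_wait_seconds, growth)
```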

But not every failure deserves another attempt. Some are terminal by definition. Content too long? Then the input has to change. Dimension count doesn't match the expected shape? Then the embedding isn't fit for indexing. Those cases go straight to the DLQ as poison messages, input that no amount of retrying will fix, because the retry policy operates on time and not on the cause of the failure.

That's the right line to draw. A retryable failure says "try again later." A terminal one says "this record needs intervention or upstream correction." Confusing the two is how queues turn into retry-storm amplifiers, with the same bad input cycling through the worker pool, burning capacity that healthy traffic needs.

Why terminal failures go straight to the DLQ

I treat the dead-letter queue as an operational control surface, not a waste bin. When a record lands there because its content is too long or its dimensions are wrong, that's worth knowing. It tells me exactly what kind of correction is needed.

It also keeps the normal path clean. Retryable failures get another chance. Invalid records get isolated. The DLQ becomes the place where malformed input, schema changes upstream, or data quality problems surface, instead of hiding under repeated attempts that read as noise on a dashboard. Which is why the terminal list is explicit: content_too_long and dimension_mismatch. Neither one is a transient condition. Retrying them would only burn time and make the failure harder to interpret.
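The routing rule can be sketched in a few lines. The two terminal reasons are the ones listed above; the Route enum, the route_failure name, the "upstream_timeout" reason, and the choice to dead-letter a record once its retries run out are all assumptions of this sketch.

```python
from enum import Enum

TERMINAL_FAILURES = frozenset({"content_too_long", "dimension_mismatch"})
MAX_RETRIES = 3  # one per step in the 30s / 2m / 10m schedule


class Route(Enum):
    RETRY = "retry"
    DLQ = "dlq"


def route_failure(reason: str, retries_used: int) -> Route:
    """Decide where a failed record goes next."""
    if reason in TERMINAL_FAILURES:
        # Waiting cannot fix the input, so skip the retry schedule entirely.
        return Route.DLQ
    if retries_used >= MAX_RETRIES:
        # Assumption: a transient failure that outlives its budget is also dead-lettered.
        return Route.DLQ
    return Route.RETRY


print(route_failure("content_too_long", 0))   # Route.DLQ
print(route_failure("upstream_timeout", 1))   # Route.RETRY
print(route_failure("upstream_timeout", 3))   # Route.DLQ
```

The check on the failure reason comes first on purpose: a terminal failure never consumes a retry slot.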

The job's logic protects downstream search from poison data. Unglamorous sentence. It's also the entire point of the job.

The real benefit of four vectors

The payoff here has nothing to do with multi-vector being better in the abstract. What it buys is a candidate record that can answer multiple search intents without being flattened into a single compromise representation. Classical information retrieval solved this exact problem with BM25F, field-weighted BM25, which lets a search over "title, body, anchor text" weight each field separately rather than concatenating them into one bag of words. The four-view embedding is the dense-retrieval analog. Each field gets its own representation, and the search layer composes them at query time instead of leaning on a pre-flattened average.
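One way to picture composing views at query time is a weighted sum of per-view cosine similarities. This is a sketch of the general idea, not BM25F and not necessarily what the search layer here does: the weights, the fused_score name, and the three-dimensional toy vectors are all made up for illustration.

```python
import math
from typing import Mapping, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fused_score(
    query_vec: Sequence[float],
    record_views: Mapping[str, Sequence[float]],
    weights: Mapping[str, float],
) -> float:
    """Weighted average of per-view similarities for one record."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(
        weight * cosine(query_vec, record_views[view])
        for view, weight in weights.items()
    )
    return weighted / total_weight


query = [1.0, 0.0, 0.0]
# A certification query, so the skills view gets most of the weight.
weights = {"skills_vector": 3.0, "general_vector": 1.0}

candidate_a = {"skills_vector": [1.0, 0.0, 0.0], "general_vector": [0.0, 1.0, 0.0]}
candidate_b = {"skills_vector": [0.0, 1.0, 0.0], "general_vector": [1.0, 0.0, 0.0]}

score_a = fused_score(query, candidate_a, weights)
score_b = fused_score(query, candidate_b, weights)
print(score_a, score_b)
```

Candidate A matches on the view the query cares about and wins, even though candidate B matches just as strongly on the general view.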

The work-history view preserves sequence and career motion. The skills view preserves credentials, designations, and specific capabilities. The profile view keeps the higher-level picture intact. And the general view gives broad coverage when a query isn't narrowly about one field, the safety net for anything that doesn't map cleanly onto a single mode.

The record keeps its internal structure and still becomes searchable. Had I stayed with one blended embedding, queries about certification would have kept sliding toward domain-adjacent candidates, and queries about career history would have kept overvaluing polished summary language. Splitting the record before embedding makes those tradeoffs explicit instead of accidental. The search layer can now choose which view drives the score, and which one only gets to break ties.

How this changes the way I debug search

One of the best side effects of the split is that debugging got a lot cleaner. When the embedding layer is monolithic, every retrieval complaint feels like one problem. With four vectors and a separate job layer, each failure mode has a coordinate, so I can point at which vector, which stage, and which layer owns the problem instead of triangulating.

Latency high? I look at cache misses and generation time. Burning API calls? Cache hit rate. Records failing to persist? Validation and DLQ counts. Embedding shapes wrong? Then I know the issue lives in the contract, not in the ranking layer. That mapping from symptom to layer is what makes a system observable in the precise sense. It doesn't mean "we have dashboards." It means every reasonable question has a corresponding metric to read.
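A minimal sketch of the counters behind those questions, assuming a plain dataclass. The field names mirror the metrics the agent tracks, but the class itself, the derived properties, and the choice to average latency over cache lookups are my own illustration.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingMetrics:
    cache_hits: int = 0
    cache_misses: int = 0
    embeddings_generated: int = 0
    errors: int = 0
    tokens_used: int = 0
    total_latency_ms: float = 0.0

    @property
    def lookups(self) -> int:
        return self.cache_hits + self.cache_misses

    @property
    def cache_hit_rate(self) -> float:
        """Answers "are we burning API calls?". 0.0 when nothing was looked up."""
        return self.cache_hits / self.lookups if self.lookups else 0.0

    @property
    def avg_latency_ms(self) -> float:
        """Answers "is latency high?". Averaged per lookup (an assumption)."""
        return self.total_latency_ms / self.lookups if self.lookups else 0.0


metrics = EmbeddingMetrics(
    cache_hits=3,
    cache_misses=1,
    embeddings_generated=4,
    total_latency_ms=200.0,
)
print(metrics.cache_hit_rate, metrics.avg_latency_ms)
```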

That division of responsibility is what makes the whole thing tractable. The embedding agent is about generating the right semantic inputs. The job is about making sure those inputs survive production constraints. Each layer stays narrow enough to inspect without guessing.

Why I kept the implementation narrow

I didn't want the embedding path turning into a catch-all orchestration layer. Plenty of systems get hard to reason about because one component starts doing semantic prep, retry control, persistence, error handling, and cache management all at once. Designers call that god-object accretion, where convenience eats coherence one feature at a time.

This one stays focused. The agent prepares and caches the vectors. The job validates and routes outcomes. The search layer consumes the resulting embeddings through the normal indexing path. That separation keeps each piece easier to test, easier to replace, and free of accidental coupling. Change the retry behavior and semantic preparation needs no rewrite. Adjust the vector split and dead-letter routing never moves. Same logic as single-responsibility design at the class level, applied at the service boundary.

What the split preserved

Accuracy was part of what I wanted to protect. The bigger thing was the shape of the candidate record itself. A blended embedding makes a record searchable while smoothing away the distinctions that separate one candidate from the next: experience isn't the same as skills, and a polished title can hide a work history that tells a different story. Four semantic views keep those distinctions somewhere search can still reach them.

That's the real win. The record stays one record, but it no longer has to pretend it only means one thing.

Next is connecting this multi-vector representation to the retrieval side with the same discipline: explicit field intent, explicit score handling, and no hidden magic between the index and the ranking layer. Late-interaction models like ColBERT push the idea further still, preserving per-token vectors and aggregating at query time, which buys precision without committing to a field schema up front. That's a much bigger lift, though, and the four-view split is the right step from where I started. Once the record isn't flattened anymore, the search layer has to earn its keep.
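For a sense of what late interaction means mechanically, here is a minimal sketch of ColBERT-style MaxSim scoring in plain Python: each query token takes its best match among the document's token vectors, and those maxima are summed. The two-dimensional token vectors are made up for illustration; a real model produces learned, normalized token embeddings.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy query with two tokens pointing in different directions.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]

doc_covers_both = [[1.0, 0.0], [0.0, 1.0]]
doc_covers_one = [[1.0, 0.0], [1.0, 0.0]]

print(maxsim(query_tokens, doc_covers_both))  # 2.0
print(maxsim(query_tokens, doc_covers_one))   # 1.0
```

The document that covers both query tokens scores higher, with no field schema involved. The cost is storing one vector per token instead of four per record.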