Vector Split by Chunk: Why My Retrieval Stops at the Boundary I Drew

A draft of mine missed the exact file span it needed. The vector was "close." But the chunk I wanted was buried inside a larger blob, and the miss was clean enough to be embarrassing: the sort of thing that passes in a demo and grates in production, because the answer is almost right the way a blurry photo is almost a portrait.

Dense retrieval people have a name for the mechanic underneath it. Representational collapse. Ask one embedding to summarize an entire document and the model picks the dominant mode, then discards the long tail. The precise sentence you wanted lives in that tail. So you get back the right neighborhood and the wrong house, topically coherent and operationally useless.

The fix is to stop asking one vector to do that work. Split the text before embedding, embed each slice on its own, and let retrieval search the slices instead of the document. Everything below falls out of that one decision.

The boundary I chose on purpose

I split embeddings by chunk because the retrieval layer needed a smaller unit than a document. A whole-file vector is too coarse when the system has to answer from a specific path, a specific code block, or a specific rewrite. A chunk sits in between: small enough to isolate one semantic mode, large enough to keep the local context wrapped around it.

The implementation starts with a windowing function in lib/supabaseVectorStorage.ts that does one thing: slice text into overlapping fixed-size chunks. It's the dumbest viable splitter, and that's deliberate. Smarter ones exist: sentence-aware splitters, recursive structural splitters, semantic-similarity splitters like LangChain's SemanticChunker. Each brings its own failure modes and its own tuning surface. A naive windowed splitter has exactly two knobs and no hidden behavior, which is what you want from a baseline before you let cleverness in.

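/**
 * Split text into overlapping fixed-size chunks for embedding.
 * The window advances by (chunkSize - overlap) characters each step.
 */
export function splitTextIntoChunks(text: string, chunkSize: number = 1000, overlap: number = 200): string[] {
  // Guard: with overlap >= chunkSize the window would never advance (infinite loop).
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('splitTextIntoChunks requires chunkSize > 0 and 0 <= overlap < chunkSize');
  }

  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = start + chunkSize;
    chunks.push(text.slice(start, end));

    // Stop once the window reaches the end of the text, so we never emit a
    // trailing chunk that is already fully contained in the previous one.
    if (end >= text.length) break;
    start = end - overlap;
  }

  return chunks;
}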

The overlap does more work than it looks like it does. It encodes a locality assumption: an important sentence often straddles a chunk boundary, and when it does, the embedding for either neighbor still anchors near that sentence in vector space. More concretely, any span no longer than the overlap appears intact in at least one chunk, so a 200-character sentence never ends up cut in half in both of its neighbors. Drop the overlap and coherent thoughts become disjoint halves with weaker mutual similarity. Common starting values in production run 10 to 20% of the chunk size. This one runs at 20% (200 of 1000), the upper end, trading a bit of storage for resilience at the boundaries. The Pinecone team's chunking strategies guide is the canonical writeup if you want the longer comparison.
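To make the window arithmetic concrete, here is a minimal sketch that lists the character offsets a fixed-size window with overlap produces. The helper name chunkOffsets is hypothetical and not part of lib/supabaseVectorStorage.ts. It assumes the window stops as soon as it reaches the end of the text, so it never emits a trailing chunk that the previous one already covers.

```typescript
// Sketch under stated assumptions: chunkOffsets is an illustrative helper,
// not a function from the original codebase.
function chunkOffsets(length: number, chunkSize = 1000, overlap = 200): Array<[number, number]> {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('need chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  const stride = chunkSize - overlap; // each window starts this far after the previous one
  const offsets: Array<[number, number]> = [];
  for (let start = 0; start < length; start += stride) {
    const end = Math.min(start + chunkSize, length);
    offsets.push([start, end]); // half-open range [start, end)
    if (end === length) break; // this window already reached the end of the text
  }
  return offsets;
}

// A 2,500-character document with the defaults (1000 / 200):
const exampleOffsets = chunkOffsets(2500);
// [[0, 1000], [800, 1800], [1600, 2500]]
```

Characters 800 to 999 land in both the first and second chunk, which is the 200-character safety margin the overlap buys.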

Why the naive version fails

Embedding the whole document and calling it done works right up until retrieval needs a narrow answer. Then the vector starts behaving like a resume summary. It remembers the broad shape and forgets the sentence where the detail actually lives. That's the precision side of the precision/recall tradeoff giving way: you keep finding related documents and missing the exact one.

Chunk-level embedding changes the unit of truth. Instead of one embedding representing everything, many embeddings represent adjacent slices, and cosine similarity decides which slice the query sits closest to. More embeddings per document means more candidate matches, which means a much better chance of landing on the specific span that answers the question rather than the document that happens to contain it.
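As a toy illustration of "cosine similarity decides which slice the query sits closest to," the sketch below ranks a handful of chunk vectors against a query vector in memory. In the real system the database does this work; cosineSimilarity and rankChunks are hypothetical helpers, and the two-dimensional vectors are made up for readability.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface ScoredChunk {
  chunk_index: number;
  score: number;
}

// Rank chunk vectors against the query, highest similarity first.
function rankChunks(query: number[], chunkVectors: number[][], topK = 3): ScoredChunk[] {
  return chunkVectors
    .map((vec, chunk_index) => ({ chunk_index, score: cosineSimilarity(query, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy data: chunk 1 points almost exactly where the query does.
const ranked = rankChunks([1, 0], [[0, 1], [1, 0.1], [0.7, 0.7]]);
// ranked order by chunk_index: 1, 2, 0
```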

How the chunk becomes a vector

Once the text is split, each chunk gets embedded independently through OpenAI's embeddings API. The call is tied to the chunk text, not the parent document.

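import OpenAI from 'openai';

// Assumes OPENAI_API_KEY is set in the environment; the client reads it by default.
const openai = new OpenAI();

/**
 * Generate an embedding for one chunk of text using OpenAI.
 * Returns a 2000-dimensional vector, shortened from the model's native 3072.
 */
export async function generateEmbedding(text: string): Promise<number[]> {
  try {
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-large',
      input: text,
      dimensions: 2000,
    });

    return response.data[0].embedding;
  } catch (error) {
    // Log for visibility, then rethrow so the caller can decide how to recover.
    console.error('Error generating embedding:', error);
    throw error;
  }
}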

Two things in that call are worth naming. input: text receives the chunk and not the document, which is the whole purpose of the split. And dimensions: 2000 leans on OpenAI's Matryoshka representation learning: text-embedding-3-large is natively 3072-dimensional, but it's trained so that a leading prefix of the vector is itself a usable embedding. Shortening to 2000 dims gives up little retrieval quality on OpenAI's published benchmarks, and it cuts both storage and the arithmetic per cosine comparison to about 65% of the full-size cost (2000 / 3072 ≈ 0.65). Across tens of thousands of chunks behind an HNSW index, that compounds fast.
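A rough sketch of what "a leading prefix is a usable embedding" means in practice: keep the first N components, then rescale to unit length so cosine and dot-product comparisons stay well behaved. The dimensions parameter does the shortening on OpenAI's side, so this is only an illustration. truncateEmbedding and bytesPerVector are hypothetical helpers, and the 4 bytes per dimension figure is an assumption about how the vectors are stored.

```typescript
// Keep the first `dims` components of a vector and rescale to unit length.
function truncateEmbedding(full: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims <= 0 || dims > full.length) {
    throw new RangeError('dims must be an integer between 1 and the vector length');
  }
  const prefix = full.slice(0, dims);
  const norm = Math.sqrt(prefix.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? prefix : prefix.map((x) => x / norm);
}

// Storage per vector, assuming 4-byte floats per dimension (an assumption).
function bytesPerVector(dims: number, bytesPerFloat = 4): number {
  return dims * bytesPerFloat;
}

const shortened = truncateEmbedding([3, 4, 12], 2);
// [0.6, 0.8]

const storageRatio = bytesPerVector(2000) / bytesPerVector(3072);
// 8000 / 12288, roughly 0.65
```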

The storage path follows the same logic when chunks get inserted. Each one becomes its own record with its own index and metadata.

const chunkPromises = chunks.map(async (chunkText, index) => {
  const embedding = await generateEmbedding(chunkText);

  return {
    document_id: documentId,
    chunk_index: index,
    chunk_text: chunkText,
    embedding: embedding, // Pass as array, Supabase will handle vector conversion
    metadata: {
      title: document.title,
      chunk_index: index,
      total_chunks: chunks.length,
    },
  };
});

const chunksWithEmbeddings = await Promise.all(chunkPromises);

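// Insert chunks with embeddings
const { error: embError } = await supabase
  .from('embeddings')
  .insert(chunksWithEmbeddings);

// Surface insert failures instead of silently dropping the chunks.
if (embError) {
  throw new Error(`Failed to insert chunk embeddings: ${embError.message}`);
}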

The non-obvious part is the metadata. It carries chunk_index and total_chunks alongside the vector, so a chunk is usable on its own and still locatable inside the original document. That's what enables a downstream pattern called context expansion: when a chunk matches, you can also fetch its neighbors by index and widen the window the LLM sees without polluting the similarity score. Retrieval stays narrow. Presentation gets to be generous.
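Context expansion can be sketched with two small pure functions: one that picks which neighboring chunk_index values to fetch, and one that stitches the fetched chunks back into continuous text by dropping the duplicated overlap. Both names are hypothetical, and the stitching assumes the chunks came from a fixed-size window with a known overlap (200 by default here).

```typescript
// Indices of the chunks to fetch around a match, clamped to the document bounds.
function neighborIndices(matchIndex: number, totalChunks: number, radius = 1): number[] {
  if (matchIndex < 0 || matchIndex >= totalChunks) {
    throw new RangeError('matchIndex out of range');
  }
  const first = Math.max(0, matchIndex - radius);
  const last = Math.min(totalChunks - 1, matchIndex + radius);
  const indices: number[] = [];
  for (let i = first; i <= last; i++) indices.push(i);
  return indices;
}

// Join consecutive chunks (already sorted by chunk_index), skipping the
// `overlap` characters that each later chunk repeats from its predecessor.
function stitchChunks(chunksInOrder: string[], overlap = 200): string {
  return chunksInOrder
    .map((text, i) => (i === 0 ? text : text.slice(overlap)))
    .join('');
}

// With supabase-js, the fetch itself might look something like this
// (an assumption about the query shape, not code from the original project):
//   supabase.from('embeddings').select('chunk_index, chunk_text')
//     .eq('document_id', documentId)
//     .in('chunk_index', neighborIndices(match.chunk_index, match.metadata.total_chunks))
//     .order('chunk_index');

const around = neighborIndices(2, 5);
// [1, 2, 3]
const stitched = stitchChunks(['abcd', 'defg', 'ghij'], 1);
// 'abcdefghij'
```

The similarity score still comes from the single matched chunk; the neighbors only widen what gets shown to the LLM.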

Why this pattern beats one vector per file

A single file vector is cheap to reason about and expensive to trust, because it compresses too much. Split first, embed second, and you preserve the local neighborhood around each idea, which is what lets the search layer return the exact span instead of the general topic. The honest cost is more rows, more embeddings, more storage. But in retrieval, precision is usually the thing that keeps the rest of the system from sounding vaguely confident and slightly wrong.

The retrieval path that makes the split worth it

The strongest evidence that the split matters is a retrieval helper that bypasses vector similarity entirely when I already know the file path. In supabase/functions/_shared/blog-utils.ts, it queries embeddings by metadata->>'file_path' LIKE '%<path>%' so I can guarantee the accuracy agent sees chunks from the file the draft explicitly mentions.
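One detail worth sketching for a LIKE-based lookup: in Postgres, % and _ are wildcards inside a LIKE pattern, and file paths such as _shared/blog-utils.ts contain underscores. The helper below escapes those characters before wrapping the path in %...%. The function names are hypothetical, and the sketch assumes rows carry a file_path key in metadata (the insert code shown earlier does not set one, so that key presumably comes from a different ingestion path).

```typescript
// Escape the characters that LIKE treats specially (backslash, %, _),
// using backslash, which is Postgres's default LIKE escape character.
function escapeLikePattern(value: string): string {
  return value.replace(/[\\%_]/g, (ch) => `\\${ch}`);
}

// Build a "contains this path" pattern for a LIKE filter.
function filePathPattern(path: string): string {
  return `%${escapeLikePattern(path)}%`;
}

// With supabase-js the lookup might then be written roughly as follows
// (an assumption about the query shape, not the code in blog-utils.ts):
//   supabase.from('embeddings')
//     .select('chunk_text, metadata')
//     .like('metadata->>file_path', filePathPattern(path));

const pattern = filePathPattern('supabase/functions/_shared/blog-utils.ts');
// %supabase/functions/\_shared/blog-utils.ts%
```

Without the escaping, the underscore would match any single character, which is usually harmless but can pull in chunks from a near-identical path.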

That's hybrid retrieval in its simplest form. Lexical matching (a substring LIKE on the file path) and dense vector search (the cosine query) live side by side and compose by union. Vector search is what I want for semantic recall, "find me the chunk that talks about this idea, even if it uses different words." Path-based retrieval is what I reach for when the system needs to stop being poetic and start being literal, "find me chunks from this exact file." Production retrieval stacks at scale usually go further and combine BM25 lexical scores with dense vectors via reciprocal rank fusion. This is the minimal version of that pattern, sized to the problem.
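Both composition styles fit in a few lines. The sketch below shows a union that deduplicates by chunk id, and reciprocal rank fusion, which scores each id as the sum of 1 / (k + rank) across ranked lists. The function names, the RetrievedChunk shape, the choice to list path hits first, and k = 60 (a commonly used constant) are all assumptions for illustration.

```typescript
interface RetrievedChunk {
  id: string;
  chunk_text: string;
}

// Union: keep path-matched chunks first, then add vector hits not already present.
function unionById(pathHits: RetrievedChunk[], vectorHits: RetrievedChunk[]): RetrievedChunk[] {
  const seen = new Set<string>();
  const merged: RetrievedChunk[] = [];
  for (const chunk of [...pathHits, ...vectorHits]) {
    if (!seen.has(chunk.id)) {
      seen.add(chunk.id);
      merged.push(chunk);
    }
  }
  return merged;
}

// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank), rank starting at 1.
function reciprocalRankFusion(rankedLists: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// 'b' and 'c' appear in both lists, so they outrank ids found by only one retriever.
const fused = reciprocalRankFusion([['a', 'b', 'c'], ['b', 'c', 'd']]);
// fused ids in order: b, c, a, d
```

Union is enough when one retriever is a hard filter, as the path lookup is here. Rank fusion earns its keep when both retrievers return soft, ranked opinions.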

And the chunk-level split is what makes the path-based lookup useful, because a file can now return multiple precise spans instead of one oversized blob. If I'm asking about a function, I want the function's neighborhood, not the whole neighborhood's autobiography.

The shape of the storage model

The storage row carries the document_id, chunk_index, chunk_text, embedding, and metadata. That shape is what keeps downstream retrieval and debugging sane. When I inspect a result, I can tell where it came from, where it sits in the source, and how many sibling chunks exist around it. Simple enough to survive maintenance, specific enough to survive scrutiny, which is a rarer combination than it should be.

export interface DocumentChunk {
  id?: string;
  document_id: string;
  chunk_index: number;
  chunk_text: string;
  embedding?: number[];
  metadata?: Record<string, unknown>;
}

The practical tradeoff: granularity buys control, not magic

Chunking doesn't make retrieval smart by itself. It gives the search layer a better surface to work with, and that surface has a Goldilocks zone. Too large and chunks blur together into mid-density topical clouds. Too small and they lose enough context to become syntactic fragments whose embeddings collapse into the average. The 1000-character window here is roughly two paragraphs of prose, or one tight function: large enough to carry meaning, small enough to admit only one dominant idea per row.

The overlap is the design's safety net. It encodes the assumption that meaning is local but not aligned, since important sentences refuse to respect window boundaries, and it pays a storage tax to keep adjacent chunks neighbors in vector space. The same tradeoff shows up in sliding-window attention, which preserves locality at the cost of duplicated work, and in n-gram lexical indexes, which overlap at the boundary to avoid losing cross-token matches.

The other cost is operational. More chunks, more embeddings. More embeddings, more insert work and more bytes in the HNSW graph. I accept that, because the alternative is a retrieval system that keeps returning the right topic and the wrong answer, which is a very expensive way to be unhelpful.
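One way to keep that operational cost bounded is to process chunks in batches rather than firing one request per chunk all at once. The sketch below is a generic batching helper, not code from the original project. toBatches and mapInBatches are hypothetical names, and the worker is whatever async call you plug in. The OpenAI embeddings endpoint accepts an array of strings as input, so one plausible worker sends a whole batch in a single request.

```typescript
// Split an array into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Run an async worker over the batches one at a time, preserving input order.
async function mapInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (batch: T[]) => Promise<R[]>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of toBatches(items, batchSize)) {
    results.push(...(await worker(batch)));
  }
  return results;
}

const batches = toBatches([1, 2, 3, 4, 5], 2);
// [[1, 2], [3, 4], [5]]
```

Sequential batches trade some wall-clock time for predictable load, which tends to matter more once a single document produces hundreds of chunks.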

Why I kept the split at the chunk boundary

I could have pushed more logic into the embedding step. Late-interaction models like ColBERT preserve per-token embeddings and aggregate at query time, which gives you precision without choosing a chunk size up front. I could have tried to recover precision later with cross-encoder reranking. But both of those move the problem instead of solving it. They paper over a representation that's wrong at the unit of storage.
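For readers who haven't met late interaction, the scoring rule ColBERT popularized is often called MaxSim: for each query token vector, take its best match among the document's token vectors, then sum those maxima. The sketch below is a toy version of that rule with made-up two-dimensional vectors. It is only meant to show the alternative being weighed here, not anything this system runs.

```typescript
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// MaxSim: sum over query tokens of the best dot product against any document token.
// Assumes token vectors are already unit-normalized, so dot product equals cosine.
function maxSimScore(queryTokens: number[][], docTokens: number[][]): number {
  if (docTokens.length === 0) return 0;
  let total = 0;
  for (const q of queryTokens) {
    let best = -Infinity;
    for (const d of docTokens) {
      best = Math.max(best, dotProduct(q, d));
    }
    total += best;
  }
  return total;
}

const lateInteractionScore = maxSimScore(
  [[1, 0], [0, 1]],     // two query token vectors
  [[1, 0], [0.6, 0.8]], // two document token vectors
);
// first query token's best match is 1, second's is 0.8, so the total is about 1.8
```

The price of this approach is visible in the shapes alone: one vector per token instead of one per chunk.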

So I split by chunk, embed by chunk, store by chunk, and retrieve by chunk. Every stage speaks the same language, which makes the system easier to inspect. Better recall is part of the payoff. The bigger part is a retrieval stack whose granularity matches the way I actually ask questions.

When the answer lives inside a file, I want the search layer to arrive with a scalpel, not a shovel.