Vector Split by Chunk: Why My Retrieval Stops at the Boundary I Drew

Daniel Anthony Romitelli Jr. · May 28, 2026

I watched a draft miss the exact file span I needed, and the failure was embarrassingly clean: the vector was "close," but the chunk I wanted was buried inside a larger blob. That is the kind of miss that looks acceptable in a demo and is annoying in production, because the answer is almost right in the way a blurry photo is almost a portrait.

The mechanic underneath that failure is a kind of representational collapse: when you ask one embedding to summarize an entire document, the vector is dominated by the document's main theme and the long tail gets washed out. That tail is where the precise sentence lives. So the retrieval system returns the right neighborhood and the wrong house: coherent topically, useless operationally.

The fix is to stop asking one vector to do that work. Split the text before embedding, embed each slice independently, and let retrieval search the slices instead of the document. The rest of this post is what falls out of that decision.

The boundary I chose on purpose

I split embeddings by chunk because the retrieval layer needed a smaller unit than a document. A whole-file vector is too coarse when the system has to answer from a specific path, a specific code block, or a specific rewrite. The chunk is the right compromise: small enough to isolate one semantic mode, large enough to preserve the local context around it.

The implementation starts with a simple windowing function in lib/supabaseVectorStorage.ts. It does one thing — slice text into overlapping fixed-size chunks. That is the dumbest viable splitter, and it is on purpose. Smarter chunkers exist (sentence-aware splitters, recursive structural splitters, semantic-similarity splitters like LangChain's SemanticChunker), but they all bring their own failure modes and tuning surfaces. A naïve windowed splitter has exactly two knobs and zero hidden behavior, which is the right baseline before you let cleverness in.

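/**
 * Split text into overlapping fixed-size chunks for embedding.
 * Consecutive chunks share `overlap` characters.
 */
export function splitTextIntoChunks(text: string, chunkSize: number = 1000, overlap: number = 200): string[] {
  // The window must advance on every pass, otherwise the loop never ends.
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('Expected chunkSize > 0 and 0 <= overlap < chunkSize');
  }

  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = start + chunkSize;
    chunks.push(text.slice(start, end));
    // Stop once a window reaches the end, so the tail is not emitted twice.
    if (end >= text.length) break;
    start = end - overlap;
  }

  return chunks;
}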

The overlap is doing more work than it looks like. It is encoding a locality assumption — that an important sentence often straddles a chunk boundary, and that the embedding for either neighbor will still anchor near that sentence in vector space. Drop the overlap and you turn coherent thoughts into disjoint halves with weaker mutual similarity. Common starting values in production are 10–20% overlap; this one runs at 20% (200 of 1000), which sits at the upper end and trades a bit of storage for resilience at boundaries. The Pinecone team's chunking strategies guide is the canonical writeup if you want the longer comparison.
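To make the boundary argument concrete, here is a minimal sketch that works on offsets instead of text. The helper names `chunkOffsets` and `coveringChunk` are hypothetical (they are not in the original file), and this version stops as soon as a window reaches the end of the text. It shows that a sentence shorter than the overlap always lands whole inside some window, while the same sentence is split when the overlap is zero.

```typescript
// Hypothetical helper: the [start, end) offsets produced by a fixed-size
// window that advances by (chunkSize - overlap) characters each step.
function chunkOffsets(
  length: number,
  chunkSize = 1000,
  overlap = 200,
): Array<[number, number]> {
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must satisfy 0 <= overlap < chunkSize");
  }
  const stride = chunkSize - overlap; // 800 with the defaults above
  const offsets: Array<[number, number]> = [];
  for (let start = 0; start < length; start += stride) {
    const end = Math.min(start + chunkSize, length);
    offsets.push([start, end]);
    if (end === length) break; // the last window already reaches the end
  }
  return offsets;
}

// Index of the first window that fully contains the span [from, to), or -1.
function coveringChunk(
  offsets: Array<[number, number]>,
  from: number,
  to: number,
): number {
  return offsets.findIndex(([start, end]) => start <= from && to <= end);
}

const withOverlap = chunkOffsets(2500); // windows start at 0, 800, 1600
const withoutOverlap = chunkOffsets(2500, 1000, 0); // windows start at 0, 1000, 2000

// A 150-character sentence at [950, 1100) crosses the 1000-character boundary.
console.log(coveringChunk(withOverlap, 950, 1100)); // 1
console.log(coveringChunk(withoutOverlap, 950, 1100)); // -1
```

With 200 characters of overlap the sentence sits entirely inside the second window; with none, no single window holds it, so no single embedding ever sees the whole thought.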

Why the naive version fails

Embedding the whole document and calling it done works until retrieval needs a narrow answer. Then the vector starts behaving like a résumé summary: it remembers the broad shape, but not the sentence where the detail actually lives. That is the precision side of the precision/recall tradeoff failing: you keep finding related documents and missing the exact passage.

Chunk-level embedding changes the unit of truth. Instead of asking one embedding to represent everything, I ask many embeddings to represent adjacent slices, then let cosine similarity decide which slice the query is closest to. More embeddings means more candidate matches per document, which means the retrieval layer has a much better chance of finding the specific span that answers the question instead of the document that contains it.
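The ranking step can be sketched in a few lines. This is an in-memory illustration only: in a Supabase setup the comparison would normally run inside the database, and the names `cosineSimilarity` and `rankChunks` are hypothetical. The toy three-dimensional vectors stand in for real embeddings.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new RangeError("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat zero vectors as unrelated
}

interface ScoredChunk {
  chunkIndex: number;
  score: number;
}

// Score every chunk embedding against the query and keep the best topK.
function rankChunks(
  query: number[],
  chunkEmbeddings: number[][],
  topK = 3,
): ScoredChunk[] {
  return chunkEmbeddings
    .map((embedding, chunkIndex) => ({
      chunkIndex,
      score: cosineSimilarity(query, embedding),
    }))
    .sort((left, right) => right.score - left.score)
    .slice(0, topK);
}

// Toy vectors: chunk 1 points the same way as the query, chunk 2 is close.
const toyChunks = [
  [1, 0, 0],
  [0, 1, 0],
  [0.6, 0.8, 0],
];
const ranked = rankChunks([0, 1, 0], toyChunks, 2);
console.log(ranked.map((hit) => hit.chunkIndex)); // chunk 1 first, then chunk 2
```

One document now contributes several candidates to the ranking, so the slice that actually answers the query can win on its own score.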

How the chunk becomes a vector

Once the text is split, each chunk is embedded independently with OpenAI. The embedding call is tied to the chunk text, not the parent document.

/**
 * Generate embedding for a given text using OpenAI
 */
export async function generateEmbedding(text: string): Promise<number[]> {
  try {
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-large',
      input: text,
      dimensions: 2000,
    });

    return response.data[0].embedding;
  } catch (error) {
    console.error('Error generating embedding:', error);
    throw error;
  }
}

Two things in that call are worth naming. First, input: text receives the chunk, not the document, which is the whole purpose of the split. Second, dimensions: 2000 relies on the Matryoshka-style training behind OpenAI's text-embedding-3 models: text-embedding-3-large is natively 3072-dimensional, but it is trained so that a leading prefix of the vector, renormalized, is itself a usable embedding. Shortening to 2000 dimensions keeps most of the retrieval quality while cutting storage and per-comparison cosine work to about 65% of the full vector (2000 / 3072). It also happens to be the largest dimension count pgvector can index for its standard vector type, which matters when the HNSW index lives in Supabase. For a system that stores tens of thousands of chunks behind an HNSW index, that compounds fast.
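The dimensions parameter does the shortening on OpenAI's side. For intuition, here is a sketch of the equivalent manual step as I understand it: keep the leading prefix, then rescale it to unit length so cosine scores stay comparable. The helper name `truncateEmbedding` is hypothetical, and the three-element vector is a toy stand-in for a 3072-dimensional one.

```typescript
// Keep the first `dims` components of an embedding and L2-normalize the result.
function truncateEmbedding(embedding: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims <= 0 || dims > embedding.length) {
    throw new RangeError("dims must be an integer between 1 and embedding.length");
  }
  const prefix = embedding.slice(0, dims);
  const norm = Math.sqrt(prefix.reduce((sum, value) => sum + value * value, 0));
  // An all-zero prefix cannot be rescaled; return it unchanged.
  return norm === 0 ? prefix : prefix.map((value) => value / norm);
}

// Toy example: the prefix [3, 4] has length 5, so it rescales to [0.6, 0.8].
const shortened = truncateEmbedding([3, 4, 12], 2);
console.log(shortened);

// Fraction of the native vector that is stored and compared per chunk.
const storageRatio = 2000 / 3072;
console.log(storageRatio.toFixed(3)); // "0.651"
```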

The storage path follows the same logic when chunks are inserted. Each chunk gets its own record with its own index and metadata.

const chunkPromises = chunks.map(async (chunkText, index) => {
  const embedding = await generateEmbedding(chunkText);

  return {
    document_id: documentId,
    chunk_index: index,
    chunk_text: chunkText,
    embedding: embedding, // Pass as array, Supabase will handle vector conversion
    metadata: {
      title: document.title,
      chunk_index: index,
      total_chunks: chunks.length,
    },
  };
});

const chunksWithEmbeddings = await Promise.all(chunkPromises);

// Insert chunks with embeddings
const { error: embError } = await supabase
  .from('embeddings')
  .insert(chunksWithEmbeddings);

// Surface insert failures instead of silently dropping them
if (embError) {
  throw embError;
}

The non-obvious part is that the metadata carries chunk_index and total_chunks alongside the vector. That makes each chunk usable on its own and still locatable in the original document — which is what enables a downstream pattern called context expansion: when a chunk matches, you can also fetch its neighbors by index to widen the window the LLM sees without polluting the similarity score. Retrieval stays narrow; presentation gets to be generous.
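Context expansion can be sketched as pure index arithmetic. The helpers `neighborIndices` and `expandContext` are hypothetical and operate on an in-memory array; in a real setup the same indices would presumably drive a second query filtered by document_id and chunk_index.

```typescript
interface StoredChunk {
  chunk_index: number;
  chunk_text: string;
}

// Indices of the matched chunk plus `radius` neighbors on each side,
// clamped to the valid range [0, totalChunks - 1].
function neighborIndices(
  matchIndex: number,
  totalChunks: number,
  radius = 1,
): number[] {
  const first = Math.max(0, matchIndex - radius);
  const last = Math.min(totalChunks - 1, matchIndex + radius);
  const indices: number[] = [];
  for (let i = first; i <= last; i++) indices.push(i);
  return indices;
}

// Select the match and its neighbors, returned in document order.
function expandContext(
  chunks: StoredChunk[],
  matchIndex: number,
  totalChunks: number,
  radius = 1,
): StoredChunk[] {
  const wanted = new Set(neighborIndices(matchIndex, totalChunks, radius));
  return chunks
    .filter((chunk) => wanted.has(chunk.chunk_index))
    .sort((a, b) => a.chunk_index - b.chunk_index);
}

const sampleDoc: StoredChunk[] = ["intro", "setup", "the match", "follow-up", "outro"]
  .map((chunk_text, chunk_index) => ({ chunk_index, chunk_text }));

// Chunk 2 matched; widen the window by one chunk on each side.
const widened = expandContext(sampleDoc, 2, sampleDoc.length);
console.log(widened.map((chunk) => chunk.chunk_text)); // setup, the match, follow-up
```

Because adjacent chunks share their overlap, stitching the expanded texts together repeats those shared characters unless the overlap is trimmed first.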

Why this pattern beats one vector per file

A single file vector is cheap to reason about and expensive to trust. It compresses too much. Splitting first and embedding second preserves the local neighborhood around each idea, which is what lets the search layer return the exact span instead of the general topic. The honest cost is more rows, more embeddings, and more storage — but in retrieval, precision is usually the thing that keeps the rest of the system from sounding vaguely confident and slightly wrong.

The retrieval path that makes the split worth it

The strongest evidence that the split matters is the retrieval helper that bypasses vector similarity entirely when I already know the file path. In supabase/functions/_shared/blog-utils.ts, the helper queries embeddings by metadata->>'file_path' LIKE '%<path>%' so I can guarantee the accuracy agent sees chunks from the file the draft explicitly mentions.
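One detail worth handling in a LIKE-based lookup is that `%` and `_` are wildcards, and underscores are common in file paths. Below is a sketch of building the contains-pattern safely, assuming PostgreSQL's default backslash escape character and assuming each row's metadata actually carries a file_path key. Whether the original helper escapes its input is not shown in the post; `escapeLikePattern` and `containsPattern` are hypothetical names.

```typescript
// Escape the characters that LIKE treats specially: backslash, % and _.
function escapeLikePattern(value: string): string {
  return value.replace(/[\\%_]/g, (ch) => `\\${ch}`);
}

// Build a "contains this path" pattern for use with LIKE.
function containsPattern(path: string): string {
  return `%${escapeLikePattern(path)}%`;
}

// The leading underscore is escaped, so it only matches a literal "_".
console.log(containsPattern("_shared/blog-utils.ts")); // %\_shared/blog-utils.ts%
```

Without the escape, `_shared` would also match `xshared`, which is harmless most of the time and confusing the one time it is not.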

This is hybrid retrieval in its simplest form: a lexical substring match (the LIKE on file path) and dense vector search (the cosine query) live side by side and compose by union. Vector search is great when I need semantic recall: "find me the chunk that talks about this idea, even if it uses different words." Path-based retrieval is what I reach for when I need the system to stop being poetic and start being literal: "find me chunks from this exact file." Production retrieval stacks at scale usually go further and combine BM25 lexical scores with dense vectors via reciprocal rank fusion; this is the minimal version of that pattern, sized to the problem.
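Composing by union can be as small as a keyed merge. The sketch below assumes every hit carries a stable row id and that path hits should come first; the types and the `unionHits` name are hypothetical, not the author's helper.

```typescript
interface ChunkHit {
  id: string;
  chunk_text: string;
}

interface MergedHit extends ChunkHit {
  source: "path" | "vector";
}

// Merge path-based and vector-based hits, dropping duplicates by id.
// Path hits keep priority: they are listed first and win on conflicts.
function unionHits(pathHits: ChunkHit[], vectorHits: ChunkHit[]): MergedHit[] {
  const merged = new Map<string, MergedHit>();
  for (const hit of pathHits) {
    merged.set(hit.id, { ...hit, source: "path" });
  }
  for (const hit of vectorHits) {
    if (!merged.has(hit.id)) {
      merged.set(hit.id, { ...hit, source: "vector" });
    }
  }
  return Array.from(merged.values()); // Map preserves insertion order
}

const fromPath: ChunkHit[] = [
  { id: "c1", chunk_text: "export function splitTextIntoChunks" },
  { id: "c2", chunk_text: "while (start < text.length)" },
];
const fromVector: ChunkHit[] = [
  { id: "c2", chunk_text: "while (start < text.length)" },
  { id: "c9", chunk_text: "overlap keeps neighbors close" },
];

// c2 appears in both lists but is returned once, tagged as a path hit.
console.log(unionHits(fromPath, fromVector).map((hit) => `${hit.id}:${hit.source}`));
```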

The chunk-level split is what makes the path-based route useful, because the file can now return multiple precise spans instead of one oversized blob. If I am asking about a function, I want the function's neighborhood, not the whole neighborhood's autobiography.

The shape of the storage model

The storage row carries the document_id, chunk_index, chunk_text, embedding, and metadata. That shape is what makes downstream retrieval and debugging sane: when I inspect a result, I can tell where it came from, where it sits in the source, and how many sibling chunks exist around it. The schema is simple enough to survive maintenance and specific enough to survive scrutiny — a combination that is rarer than it should be.

export interface DocumentChunk {
  id?: string;
  document_id: string;
  chunk_index: number;
  chunk_text: string;
  embedding?: number[];
  metadata?: Record<string, unknown>;
}

The practical tradeoff: granularity buys control, not magic

Chunking does not make retrieval smart by itself. It gives the search layer a better surface to work with — and that surface has a goldilocks zone. Too large and chunks blur together into mid-density topical clouds; too small and they lose enough context to become syntactic fragments whose embeddings collapse into the average. The 1000-character window here is roughly two paragraphs of prose or one tight function — large enough to carry meaning, small enough to admit only one dominant idea per row.

The overlap is the design's safety net. It encodes the assumption that meaning is local but not aligned — important sentences refuse to respect window boundaries — and pays a storage tax to keep adjacent chunks neighbors in vector space. This is the same tradeoff that shows up in sliding-window attention (preserve locality at the cost of duplicated work) and in n-gram lexical indexes (overlap at the boundary to avoid losing cross-token matches).
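The storage tax is easy to put a number on. This sketch assumes the vectors are stored in pgvector's standard vector type, which takes 4 bytes per dimension plus 8 bytes of overhead, and it counts only the vector column for a single text. The function name `estimateStorage` is hypothetical.

```typescript
interface StorageEstimate {
  maxRows: number; // upper bound on windows for one text
  rowsWithoutOverlap: number; // same text with overlap set to 0
  vectorBytes: number; // vector column only, at maxRows
}

function estimateStorage(
  totalChars: number,
  chunkSize = 1000,
  overlap = 200,
  dims = 2000,
): StorageEstimate {
  const stride = chunkSize - overlap; // characters of new text per window
  const maxRows = Math.ceil(totalChars / stride);
  const rowsWithoutOverlap = Math.ceil(totalChars / chunkSize);
  const bytesPerVector = 4 * dims + 8; // assumption: pgvector `vector` layout
  return { maxRows, rowsWithoutOverlap, vectorBytes: maxRows * bytesPerVector };
}

// One million characters: at most 1250 rows instead of 1000,
// and about 10 MB of vector data at 2000 dimensions.
console.log(estimateStorage(1000000));
```

So 20% overlap costs up to 25% more rows, because each window contributes only 800 characters of new text.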

The other cost is operational. More chunks mean more embeddings, and more embeddings mean more insert work and more bytes in the HNSW graph. I accept that because the alternative is a retrieval system that keeps returning the right topic and the wrong answer, which is a very expensive way to be unhelpful.
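One way to keep the insert work manageable is to embed in batches instead of issuing one request per chunk. The OpenAI embeddings endpoint accepts an array of inputs, so a batch maps naturally onto a single call. In the sketch below, `embedBatch` is an injected, hypothetical function standing in for that call, and the batch size of 64 is an arbitrary starting point, not a recommendation.

```typescript
// Hypothetical signature: one call that embeds several texts at once.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Split a list into consecutive groups of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed all texts, one batch at a time, preserving input order.
async function embedInBatches(
  texts: string[],
  embedBatch: EmbedBatch,
  batchSize = 64,
): Promise<number[][]> {
  const embeddings: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    const vectors = await embedBatch(batch); // sequential on purpose
    if (vectors.length !== batch.length) {
      throw new Error("embedder returned a different number of vectors than inputs");
    }
    embeddings.push(...vectors);
  }
  return embeddings;
}

// Fake embedder for illustration: each "embedding" is just the text length.
const fakeEmbed: EmbedBatch = async (texts) => texts.map((text) => [text.length]);
embedInBatches(["a", "bb", "ccc"], fakeEmbed, 2).then((vectors) => {
  console.log(vectors); // one vector per input, in input order
});
```

Running batches sequentially trades some latency for predictable request volume, which tends to matter more once a single document produces dozens of chunks.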

The retrieval story, drawn plainly

Why I kept the split at the chunk boundary

I could have pushed more logic into the embedding step: late-interaction models like ColBERT keep per-token embeddings and score them against the query at search time, which buys precision inside each stored passage without leaning so hard on the chunk size. I could have tried to recover precision later with cross-encoder reranking. But both move the problem instead of solving it; they paper over a representation that is wrong at the unit of storage.

So I split by chunk, embed by chunk, store by chunk, and retrieve by chunk. The system is easier to inspect because every stage speaks the same language. That is the real payoff: not just better precision, but a retrieval stack whose granularity matches the way I actually ask questions.

When the answer lives inside a file, I want the search layer to arrive with a scalpel, not a shovel.