Standard RAG pipelines use 'Early Chunking': they slice a PDF into arbitrary 500-word blocks and embed each block separately. This destroys context: a chunk from the middle of chapter 3 may lose every reference to the main subject established in chapter 1, making it effectively invisible to a vector search for that subject. Late Chunking (pioneered by Jina AI alongside its long-context jina-embeddings-v2/v3 models) solves this by passing the *entire* document, up to the model's full context window (8,192 tokens for the Jina models), through the Transformer. The model computes a contextualized representation for every token based on the whole document. Only after this global contextualization occurs are the token embeddings pooled into smaller chunk-level vectors for storage.
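For contrast, here is a minimal sketch of the early-chunking baseline just described, where each block is embedded in isolation. The sentence-transformers library and the model name are illustrative choices, not part of the technique:

```python
from sentence_transformers import SentenceTransformer

# Illustrative short-context model; any embedding model works the same way here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def early_chunk_embed(text: str, words_per_chunk: int = 500):
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)]
    # Each block is encoded separately, so a chunk from chapter 3 carries
    # no signal from the subject established in chapter 1.
    return chunks, model.encode(chunks)
```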
How It Works
- Global Processing: A long-context embedding model reads the entire document in one pass.
- Token-Level Embedding: The model generates an embedding for every single token, enriched by the context of the entire text.
- Boundary Definition: Chunk boundaries are defined over the text (e.g., by paragraph or sentence) and mapped to the corresponding token positions.
- Mean Pooling: The token embeddings within each boundary are averaged to produce the final chunk vector, which inherently 'knows' the broader context of the document it came from (see the sketch after this list).
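A minimal sketch of these four steps, assuming a HuggingFace long-context embedding model with a fast tokenizer (required for offset mapping). The model name is one concrete choice and `late_chunk` is a hypothetical helper, not a library function:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "jinaai/jina-embeddings-v2-small-en"  # assumption: 8,192-token context
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True).eval()

def late_chunk(text: str, boundaries: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Embed the whole document once, then mean-pool the token embeddings
    that fall inside each (char_start, char_end) chunk boundary."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True,
                    truncation=True, max_length=8192)
    offsets = enc.pop("offset_mapping")[0]  # (seq_len, 2) character spans
    with torch.no_grad():
        # One global pass: every token sees the whole document.
        token_embs = model(**enc).last_hidden_state[0]  # (seq_len, dim)

    chunk_vectors = []
    for start, end in boundaries:
        # Tokens whose character span overlaps this chunk; the last clause
        # drops special tokens, whose offsets are (0, 0).
        mask = ((offsets[:, 0] < end) & (offsets[:, 1] > start)
                & (offsets[:, 1] > offsets[:, 0]))
        chunk_vectors.append(token_embs[mask].mean(dim=0))
    return chunk_vectors

# Usage: derive boundaries from paragraph breaks, then embed.
doc = open("report.txt").read()           # any long document
bounds, pos = [], 0
for para in doc.split("\n\n"):
    bounds.append((pos, pos + len(para)))
    pos += len(para) + 2                  # account for the "\n\n" separator
vectors = late_chunk(doc, bounds)
```

Any splitter can supply the boundaries; the only hard requirement is that they map cleanly back to token positions, which the tokenizer's offset mapping provides.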
Common Use Cases
- Processing complex financial reports or legal contracts where individual clauses depend heavily on the overarching document theme.
- Mitigating the context-loss problem in traditional RAG retrieval, where a chunk from the middle of a document loses the references ('it', 'the company') that tie it to the document's subject and make it findable.