Traditional 'naive' chunking splits documents at a fixed size (e.g., every 500 tokens), regardless of content. This often slices a complex paragraph in half, destroying context and degrading vector retrieval. Semantic chunking addresses this by using a lightweight embedding model to measure the 'distance' between consecutive sentences. If the distance suddenly spikes, the system treats it as a topic shift and draws the chunk boundary there. The result is dynamically sized chunks that align closely with the author's logical boundaries, which can substantially reduce hallucinations during the generation phase of RAG.
How It Works
- Sentence Splitting: The document is broken down into individual sentences.
- Embedding: Each sentence is individually embedded into a vector.
- Distance Calculation: The system calculates the cosine distance between each consecutive pair of sentences (A and B, then B and C, and so on).
- Boundary Creation: When the distance between two consecutive sentences exceeds a predefined threshold (e.g., the 95th percentile of all observed distances), a chunk boundary is placed between them.
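The four steps above can be sketched in a few dozen lines of Python. This is a minimal illustration, not a production implementation: the `embed` function here is a toy bag-of-words counter standing in for a real embedding model (e.g., a sentence-transformer), and the sentence splitter is a naive regex. The percentile-based threshold follows the heuristic described above.

```python
import re
import math
from collections import Counter

def embed(sentence):
    # Toy bag-of-words "embedding" — a stand-in for a real lightweight
    # embedding model. Swap in actual sentence embeddings in practice.
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine_distance(a, b):
    # 1 - cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def semantic_chunks(text, percentile=95):
    # Step 1: sentence splitting (naive split after ., !, or ?).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < 2:
        return [text]

    # Step 2: embed each sentence individually.
    vectors = [embed(s) for s in sentences]

    # Step 3: cosine distance between each consecutive pair.
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]

    # Step 4: place a boundary wherever the distance reaches the
    # chosen percentile of all observed distances.
    idx = min(len(distances) - 1, int(len(distances) * percentile / 100))
    threshold = sorted(distances)[idx]

    chunks, current = [], [sentences[0]]
    for i, d in enumerate(distances):
        if d >= threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks

text = ("Cats are small mammals. Cats like to sleep all day. "
        "Quantum computers use qubits. Qubits can be entangled.")
print(semantic_chunks(text))
# Splits into two chunks, with the boundary at the cat -> quantum topic shift.
```

On a real corpus you would compute the distances once over the whole document and set the threshold from that distribution; libraries such as LangChain expose this same idea as a ready-made semantic splitter.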
Common Use Cases
- Processing highly unstructured documents like transcripts or chat logs where formatting cues are missing.
- Improving recall accuracy in RAG systems built on dense academic papers.