Foundations
Chunking and Units of Retrieval
Retrieval systems do not operate on raw documents. Before any search or ranking happens, data is split into smaller pieces. These pieces are referred to as the units of retrieval or chunks. How those units are defined has a direct impact on the quality, relevance, and reliability of information retrieval.
This article builds on the basics of information retrieval by focusing on what is actually retrieved before similarity or ranking is applied.
What is chunking?
Chunking is the process of splitting source data into discrete units that can be indexed and retrieved.
Once chunking is applied, the original document structure is no longer directly accessible to the retrieval system. All downstream components, including embedding, indexing, search, and ranking, operate on chunks rather than full documents. Because language models only see retrieved chunks, chunking defines what information the model can consider when answering a query.
Approaches
The simplest approach is fixed size chunking.
Fixed size chunking is a chunking method where text is split by token count, with optional overlap between adjacent chunks.
This method is straightforward to implement but ignores structure and semantics. It often results in:
broken sentences or code blocks
mixed topics within a single chunk
loss of logical boundaries such as sections or paragraphs
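Fixed size chunking with overlap can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `fixed_size_chunks` is hypothetical, and whitespace splitting stands in for a real tokenizer, whose token counts would differ.

```python
def fixed_size_chunks(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


text = "Retrieval systems do not operate on raw documents. Data is split first."
tokens = text.split()  # assumption: whitespace tokens instead of model tokens
chunks = fixed_size_chunks(tokens, chunk_size=5, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

The third chunk printed here starts with "raw documents." and continues into the next sentence, which shows how a purely positional split cuts across sentence boundaries.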
A more reliable approach is semantic chunking.
Semantic chunking is a chunking method that attempts to identify natural topic boundaries rather than splitting at arbitrary positions.
Semantic chunking works by computing representations for smaller units such as sentences and measuring similarity between adjacent units. When similarity drops significantly, the system identifies a topic shift and creates a boundary. This helps keep related content grouped together within a single chunk.
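The boundary detection described above can be sketched as follows. Two simplifications are assumptions made for illustration: word-count vectors stand in for the embeddings a real system would compute, and a fixed `threshold` of 0.2 stands in for whatever rule decides that similarity has dropped "significantly". The function names are hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(sentence):
    """Toy representation: lowercase word counts (a stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    vectors = [bow_vector(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # similarity dropped: topic boundary
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return chunks


sentences = [
    "Embedding models map text to vectors.",
    "Vectors from embedding models capture meaning.",
    "Bread dough needs time to rise.",
    "Warm rooms help bread dough rise faster.",
]
chunks = semantic_chunks(sentences)
```

With these four sentences the only low-similarity pair is the second and third, so the result is two chunks of two sentences each.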
Some content types have inherent structure that chunking should respect. Code, for example, can be parsed into an abstract syntax tree and split at logical boundaries such as functions, classes, or methods. Documents with clear headings or markup can be split at section boundaries.
Structure aware chunking is a chunking method that preserves meaningful units that would otherwise be fragmented by fixed size or semantic approaches.
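For Python source, the standard library's `ast` module is enough to illustrate the idea. The sketch below assumes one chunk per top-level function or class is the desired granularity; the helper name `chunk_python_source` is hypothetical, and module-level statements outside definitions are simply skipped here.

```python
import ast


def chunk_python_source(source):
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    definition_types = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    chunks = []
    for node in tree.body:
        if isinstance(node, definition_types):
            # get_source_segment recovers the exact text spanned by the node
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


source = '''
def add(a, b):
    return a + b

class Greeter:
    def greet(self, name):
        return "Hello, " + name
'''
chunks = chunk_python_source(source)
for name, text in chunks:
    print(f"--- {name} ---")
    print(text)
```

Each chunk is a complete definition, so a function body is never separated from its signature the way it could be with a positional split.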
Token limits
Embedding models impose maximum input sizes. A chunk that exceeds the embedding model’s token limit cannot be embedded in full: depending on the system, it is either rejected or truncated. This makes chunk size constraints a system requirement, not just a quality preference.
When a chunking strategy produces content that exceeds these limits, a fallback mechanism is necessary to split it further, typically at token boundaries. This ensures all chunks meet system constraints while preserving structure wherever possible.
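A fallback of this kind might look like the sketch below. It assumes chunks are plain strings, counts whitespace-separated words as a stand-in for the embedding model's real tokenizer, and uses a hypothetical function name `enforce_token_limit`. Chunks that already fit are passed through untouched, so structure is only broken where the limit forces it.

```python
def enforce_token_limit(chunks, max_tokens):
    """Keep chunks within the limit as-is; split oversized ones at token boundaries."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    result = []
    for chunk in chunks:
        tokens = chunk.split()  # assumption: whitespace tokens, not model tokens
        if len(tokens) <= max_tokens:
            result.append(chunk)  # fits: preserve the original chunk unchanged
            continue
        # Too large: fall back to consecutive windows of max_tokens tokens.
        for start in range(0, len(tokens), max_tokens):
            result.append(" ".join(tokens[start:start + max_tokens]))
    return result


structured = [
    "def add(a, b): return a + b",
    "one two three four five six seven eight nine ten",
]
limited = enforce_token_limit(structured, max_tokens=8)
```

In a real pipeline the count would come from the same tokenizer the embedding model uses, since word counts and token counts can differ substantially.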
Within those limits, chunk size introduces an inherent tradeoff:
Smaller chunks improve precision because retrieved units are tightly focused. However, they may omit context required to fully answer a query.
Larger chunks preserve more context but increase the risk of irrelevant information being retrieved alongside relevant content. They also consume more of the model’s limited context window.
There is no universally correct chunk size. The optimal size depends on the specific data source, the query patterns, and how much context the downstream models actually need in practice.
Effect on retrieval
Once data is chunked and indexed, a retrieval system cannot recover information that was lost or fragmented during chunking. This is why poorly defined chunking can lead to:
partial or incomplete answers
irrelevant context being passed to the model
retrieval results that appear noisy or inconsistent
These issues are often misattributed to later stages in the retrieval pipeline. In practice, however, many retrieval failures originate from how chunks are defined. Chunking should therefore be treated as a core system design decision (a first-class citizen) rather than a preprocessing detail.
Looking ahead
Chunking defines the units of retrieval that a system operates on. These units determine what information is eligible for retrieval and how useful retrieved context will be.
In the next article, we will explore how these chunks are transformed into embeddings, the numerical representations that actually enable semantic search.
