Foundations

Basics of Information Retrieval

In the previous article, we defined Context Engineering as the systematic management of the information an AI model sees. However, before we can manage that information, we must first find it.

This is the role of Information Retrieval (or IR). In the context of AI agents and Retrieval-Augmented Generation (RAG), IR acts as the foundational layer that identifies relevant facts before the model ever generates a response.

Data Retrieval vs Information Retrieval

To build effective AI systems, we first have to distinguish between finding raw data and finding meaningful information.

Standard Data Retrieval is deterministic and binary. When you execute a simple SELECT x FROM y SQL query for a specific ID, the system either finds that exact match or returns nothing. It is a precise operation based on explicit parameters.

Information Retrieval, by contrast, is probabilistic. Because IR deals with unstructured text where exact matches are often rare, the goal is not a simple yes or no answer. Instead, the system provides a relevance ranking. It returns a list of results sorted by the mathematical likelihood that they satisfy a specific "information need".

Keyword vs Semantic

While IR has traditionally relied on keyword matching, modern AI has introduced search based on semantic meaning. Most production-grade systems now utilize a hybrid of these two methods.

Keyword Search, or lexical retrieval, looks for literal character matches between a query and a document. This is typically achieved through an Inverted Index, which functions as a map of terms and the documents they inhabit. Algorithms like BM25 calculate scores based on term frequency and document frequency.
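To make this concrete, here is a minimal sketch of an inverted index paired with a simplified BM25-style scorer. The toy corpus, whitespace tokenizer, and parameter values are illustrative assumptions rather than a production implementation.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; a real system would tokenize, lowercase, and normalize far more carefully.
docs = {
    "d1": "heart attack symptoms and first aid",
    "d2": "cardiac arrest treatment guidelines",
    "d3": "the heart pumps blood through the body",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: term -> set of documents that contain it.
index = defaultdict(set)
for doc_id, tokens in tokenized.items():
    for term in tokens:
        index[term].add(doc_id)

avg_len = sum(len(t) for t in tokenized.values()) / len(docs)

def bm25(query, doc_id, k1=1.5, b=0.75):
    """Simplified BM25: term frequency, inverse document frequency, length normalization."""
    tokens = tokenized[doc_id]
    tf = Counter(tokens)
    score = 0.0
    for term in query.split():
        df = len(index[term])  # document frequency: how many docs contain the term
        if df == 0:
            continue
        idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(tokens) / avg_len))
        score += idf * norm
    return score

# Rank documents for a query by lexical overlap only.
print(sorted(docs, key=lambda d: bm25("heart attack", d), reverse=True))
```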

The primary limitation of keyword search is the "vocabulary mismatch" problem. If a user searches for "cardiac arrest" but the source text uses "heart attack", a keyword system will fail to bridge that gap.

Semantic Search addresses this by focusing on intent rather than spelling. By utilizing machine learning models, text is converted into Embeddings, which are high-dimensional numerical vectors. The system then calculates the distance between the query vector and document vectors, often using Cosine Similarity.

This allows the system to understand that "how to cool a room" is conceptually related to "air conditioning" regardless of the specific words used.

Precision and Recall

Engineering context for AI systems requires managing a constant balance between two primary metrics: Recall and Precision.

Recall measures the system's ability to find all relevant documents, ensuring that vital facts are not missed.

Precision measures the accuracy of those results, ensuring the system is precise and does not include irrelevant noise.
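A quick worked example with made-up numbers shows how the two metrics diverge:

```python
# Hypothetical retrieval run: 10 chunks returned, 4 of them actually relevant,
# out of 8 relevant chunks that exist in the corpus.
retrieved = 10
relevant_retrieved = 4
relevant_total = 8

precision = relevant_retrieved / retrieved       # 0.4: 60% of what we return is noise
recall = relevant_retrieved / relevant_total     # 0.5: half of the relevant facts were missed

print(f"precision={precision:.2f}, recall={recall:.2f}")
```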

The balance between the two is critical because AI models operate within a limited context window. If a retrieval system provides high recall but low precision, the window becomes cluttered with "context rot".

This irrelevant information forces the model to expend its limited attention budget on noise, which often leads to inconsistent reasoning or hallucinations. Effective IR is not about finding the most data, but rather finding the most grounded and pertinent information.

Looking Ahead

In the next article, we will explore the "math of meaning" by diving deeper into the concept of Embeddings, explaining how text is actually transformed into searchable vector space.

Chunking and Units of Retrieval

Retrieval systems do not operate on raw documents. Before any search or ranking happens, data is split into smaller pieces. These pieces are referred to as the units of retrieval or chunks. How those units are defined has a direct impact on the quality, relevance, and reliability of information retrieval.

This article builds on the basics of information retrieval by focusing on what is actually retrieved before similarity or ranking is applied.

What is chunking?

Chunking is the process of splitting source data into discrete units that can be indexed and retrieved.

Once chunking is applied, the original document structure is no longer directly accessible to the retrieval system. All downstream components, including embedding, indexing, search, and ranking, operate on chunks rather than full documents. Because language models only see retrieved chunks, chunking defines what information the model can consider when answering a query.

Approaches

The simplest approach is fixed size chunking.

Fixed size chunking is a chunking method where text is split by token count with optional overlap between adjacent chunks.
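As a rough sketch, using whitespace-separated words as a stand-in for model tokens and illustrative size and overlap values:

```python
def fixed_size_chunks(text, chunk_size=200, overlap=40):
    """Split text into chunks of roughly chunk_size tokens, sharing `overlap`
    tokens between adjacent chunks. Words stand in for model tokens here."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```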

This method is straightforward to implement but ignores structure and semantics. It often results in:

  • broken sentences or code blocks

  • mixed topics within a single chunk

  • loss of logical boundaries such as sections or paragraphs

A more reliable approach is semantic chunking.

Semantic chunking is a chunking method that attempts to identify natural topic boundaries rather than splitting at arbitrary positions.

Semantic chunking works by computing representations for smaller units such as sentences and measuring similarity between adjacent sections. When similarity drops significantly, the system identifies a topic shift and creates a boundary. This helps keep related content grouped together within a single chunk.
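A minimal version of the idea might look like the sketch below, where embed is a placeholder for any sentence embedding function and the threshold is an assumption you would tune per corpus:

```python
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.6):
    """Group consecutive sentences into chunks; start a new chunk whenever the
    cosine similarity between adjacent sentence embeddings drops below threshold."""
    vectors = [np.asarray(embed(s)) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        sim = np.dot(prev, vec) / (np.linalg.norm(prev) * np.linalg.norm(vec))
        if sim < threshold:  # similarity drop suggests a topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```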

Some content types have inherent structure that chunking should respect. Code, for example, can be parsed into an abstract syntax tree and split at logical boundaries such as functions, classes, or methods. Documents with clear headings or markup can be split at section boundaries.

Structure aware chunking is a chunking method that preserves meaningful units that would otherwise be fragmented by fixed size or semantic approaches.
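For code, one way to sketch this in Python is to split a module at top-level function and class boundaries using the standard ast module:

```python
import ast

def python_code_chunks(source: str):
    """Split Python source into one chunk per top-level function or class,
    keeping each logical unit intact instead of cutting at arbitrary lines."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks
```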

Token limits

Embedding models impose maximum input sizes. A chunk that exceeds the embedding model’s token limit cannot be embedded at all. This makes chunk size constraints a system requirement, not just a quality preference.

When a chunking strategy produces content that exceeds these limits, a fallback mechanism is necessary to split it further, typically at token boundaries. This ensures all chunks meet system constraints while preserving structure wherever possible.
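A sketch of such a fallback, where count_tokens and the limit are placeholders for whatever tokenizer and embedding model the system actually uses:

```python
def enforce_token_limit(chunks, count_tokens, max_tokens=8192):
    """Re-split any chunk whose token count exceeds the embedding model's limit."""
    safe = []
    for chunk in chunks:
        if count_tokens(chunk) <= max_tokens:
            safe.append(chunk)
            continue
        words = chunk.split()
        if len(words) < 2:
            safe.append(chunk)  # cannot split further at word level
            continue
        # Fallback: split roughly in half at a word boundary and re-check each half.
        mid = len(words) // 2
        halves = [" ".join(words[:mid]), " ".join(words[mid:])]
        safe.extend(enforce_token_limit(halves, count_tokens, max_tokens))
    return safe
```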

In that sense, chunk size introduces an inherent tradeoff:

Smaller chunks improve precision because retrieved units are tightly focused. However, they may omit context required to fully answer a query.

Larger chunks preserve more context but increase the risk of irrelevant information being retrieved alongside relevant content. They also consume more of the model’s limited context window.

That being said, there is not really a universally correct chunk size. What's optimal depends on the specific data source, query patterns, and how much context the downstream models actually need in practice.

Effect on retrieval

It's important to note that once data is chunked and indexed, a retrieval system cannot recover information that was lost or fragmented during chunking. This is why poorly defined chunking can lead to:

  • partial or incomplete answers

  • irrelevant context being passed to the model

  • retrieval results that appear noisy or inconsistent

These issues are often misattributed to later stages in the retrieval pipeline. In practice, however, many retrieval failures originate from how chunks are defined. Chunking should therefore be treated as a core system design decision (a first class citizen) rather than a preprocessing detail.

Looking ahead

Chunking defines the units of retrieval that a system operates on. These units determine what information is eligible for retrieval and how useful retrieved context will be.

In the next article, we will explore how these chunks are transformed into embeddings, the numerical representations that actually enable semantic search.

Embeddings and Semantic Search

If Information Retrieval is the process of finding relevant facts, then Embeddings are the language that makes that process possible for a machine. While humans perceive language through syntax and definitions, machine learning models require a mathematical representation to understand the relationship between different pieces of data.

Representation

At its most basic level, an embedding is a numerical representation of an object, such as a word, a sentence, or an entire document. This object is transformed into a fixed-length array of numbers called a vector. These are not simple binary values. They are continuous floating-point numbers that act as coordinates in a high-dimensional space.

When we say this space is high-dimensional, we mean it may have hundreds or even thousands of axes. For example, modern embedding models (for example, the OpenAI embedders) frequently generate vectors with 1,536 dimensions. Each dimension conceptually represents a specific feature or trait of the data, such as its topic, tone, or relationship to other concepts.

Geometric Properties

The power of embeddings lies in their ability to preserve semantic relationships. In a well-trained embedding space, items with similar meanings are positioned closer to one another than items that are unrelated.

This spatial arrangement allows machines to perform semantic arithmetic. A classic example in natural language processing is the relationship between gender and royalty. The vector for "King" minus the vector for "Man" plus the vector for "Woman" will result in a coordinate very close to the vector for "Queen". Because the model has mapped royalty and gender as distinct dimensions in its mathematical world, it can navigate these concepts without needing a dictionary.
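The effect is easy to demonstrate with toy vectors. The 2-dimensional values below are invented purely for illustration; real embeddings have hundreds of dimensions and come from a trained model:

```python
import numpy as np

king  = np.array([0.9, 0.8])   # high "royalty", high "maleness" (made-up axes)
man   = np.array([0.1, 0.8])
woman = np.array([0.1, 0.2])
queen = np.array([0.9, 0.2])

result = king - man + woman
print(np.allclose(result, queen))  # True: the arithmetic lands on "queen"
```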

Similarity

Once data is converted into vectors, the task of finding relevant information becomes a geometry problem. We use distance metrics to quantify how similar two embeddings are.

Cosine Similarity: This is the most common metric for text embeddings. Instead of measuring the raw distance between two points, it measures the angle between two vectors. If two vectors point in exactly the same direction, their similarity score is 1. If they are perpendicular, it is 0.

This is particularly useful for text because it focuses on the orientation of the meaning rather than the length of the document.

Euclidean Distance (L2): This measures the straight-line distance between two points in the vector space. While intuitive, it can be sensitive to the magnitude of the vectors, meaning it might struggle if your documents vary greatly in length.
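Both metrics are a few lines of NumPy. The 3-dimensional vectors below are made up; the second deliberately points in the same direction as the first but with twice the magnitude, which is exactly the case where the two metrics disagree:

```python
import numpy as np

def cosine_similarity(a, b):
    """Angle-based similarity: 1 means same direction, 0 means perpendicular."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two points in the vector space."""
    return np.linalg.norm(a - b)

a = np.array([0.2, 0.9, 0.4])
b = 2 * a  # same orientation, larger magnitude

print(cosine_similarity(a, b))   # 1.0: identical orientation, maximal similarity
print(euclidean_distance(a, b))  # ~1.0: the magnitude difference still registers as distance
```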

Purpose

Without embeddings, an AI agent's information retrieval would be limited to exact keyword matching. This approach is brittle and easily confused by synonyms or varied phrasing. By using embeddings, a developer can ensure that if a user asks about "reducing household energy costs", the system can retrieve documents about "insulation" or "solar panels" even if the original query did not use those specific terms.

This capability is the engine behind semantic search and RAG, and is what allows the system to bridge the gap between messy, natural-language human queries and the structured or unstructured data stored in your apps and databases.

Vector Databases for AI

In the previous article we introduced what embeddings are and how semantic search uses similarity between vectors to find relevant content. Embeddings give us a way to measure meaning in numeric form. What comes next is how we store and search those embeddings efficiently.

A vector database is a database that stores embeddings and makes similarity search practical at scale.

A vector database stores embeddings together with metadata. It answers queries like “which stored items are closest in meaning to this input”. Instead of looking for exact text matches, it compares vectors and returns the most similar ones.

This makes vector databases a core component of retrieval systems used in AI applications such as semantic search, knowledge lookup, and retrieval-augmented generation.

Traditional databases

Traditional databases excel at exact matches and structured queries. They are not designed for high-dimensional vector data where the goal is to measure closeness in meaning rather than equality.

Semantic retrieval requires:

  • comparing many vectors quickly

  • using distance or similarity metrics

  • finding nearest neighbors among millions of items

Vector databases use specialized indexes and algorithms to make this fast.

How vector databases work

The basic loop looks like this:

1. Ingest and embed
Turn your text or other data into embeddings using a model. Store the vectors and any metadata.

2. Index for similarity
Build an index optimized for nearest neighbor queries in many dimensions. This avoids comparing every vector on every query.

3. Query and compare
Convert the user query into an embedding with the same model. Search for the stored vectors that are most similar by a distance measure such as cosine similarity.

4. Return results
Fetch the content linked to the best matching vectors so your application can use them.

This flow lets you find the most relevant content by meaning rather than by exact text.
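Here is the loop end to end as a brute-force NumPy sketch. The embed function is a stand-in for a real embedding model, and the exhaustive scan in search is what a real vector database replaces with an approximate nearest neighbor index:

```python
import numpy as np

def embed(text):
    """Placeholder so the sketch runs; swap in a real embedding model.
    This toy version does NOT capture meaning."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(8)

# 1. Ingest and embed: store vectors next to the content (and metadata) they came from.
corpus = ["how to reduce energy bills", "installing solar panels", "baking sourdough bread"]
vectors = np.array([embed(text) for text in corpus])

# 2. Index for similarity: a real vector database builds an ANN index (e.g., HNSW) here;
#    this sketch just keeps the raw matrix and scans it on every query.

def search(query, top_k=2):
    # 3. Query and compare: embed the query with the same model, score by cosine similarity.
    q = embed(query)
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    # 4. Return results: map the best-scoring vectors back to their content.
    best = np.argsort(scores)[::-1][:top_k]
    return [(corpus[i], float(scores[i])) for i in best]

print(search("cutting household electricity costs"))
```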

Use cases

Vector databases start to matter whenever you rely on embedding similarity for retrieval. Common use cases include:

  • semantic document search

  • RAG (retrieval-augmented generation) workflows

  • long-term memory storage for agents

  • similarity-based recommendations

If your system only needs exact matches or structured fields, a traditional database may still be the right choice. Vector databases become important once embeddings are the primary retrieval signal.

Looking Ahead

Even though vector similarity is a powerful concept, it's unfortunately not perfect. Pure vector search can (and in practice often does) miss exact textual matches such as:

  • matching specific codes or identifiers

  • finding proper nouns

  • matching on exact phrases that are critical for some queries

Because of this, production retrieval systems often combine multiple methods of search and retrieval to improve relevance and recall.

This leads us directly into the topics of the next article: hybrid search and reranking, where we will discuss how to blend vector and keyword methods and reorder results based on deeper evaluation.

Hybrid Search and Reranking

Finding the right vector is only half the battle. While embeddings provide a powerful way to navigate the "meaning" of data, they are not a universal solution for every retrieval challenge. In production environments, relying solely on semantic similarity often leads to surprising failures, especially when queries involve technical jargon, product codes, or specific names.

To solve this, modern search architecture often applies a two-stage process: Hybrid Search followed by Reranking.

Hybrid Search

Hybrid search is the practice of running two different search methodologies in parallel and merging their results into a single list. It combines the semantic depth of vector search with the literal precision of keyword search.

Even the most advanced embedding models can struggle with "out-of-vocabulary" terms. For example, if a user searches for a specific error code like ERR_90210, a vector model might retrieve documents about "general system errors" because it recognizes the concept of a failure.

A keyword search, however, will find the exact manual entry for that specific code instantly. By using both, the system ensures that it captures both the broad intent of the original query and its specific details.

When you run two searches, you end up with two different lists of results, each with its own scoring system. Keyword search uses BM25 scores, while vector search uses Cosine Similarity. Because these scales are mathematically different, you cannot simply add them together.

The industry standard for merging these lists is Reciprocal Rank Fusion (RRF).

Reciprocal Rank Fusion is a ranking algorithm that merges multiple result lists by using document rank positions instead of raw scores, rewarding items that rank highly across multiple retrieval methods.

Instead of looking at the raw scores, RRF looks at the rank of each document in both lists. A document that appears near the top of both the keyword list and the vector list receives a significantly higher final score than a document that only appears in one.

This approach is favored in production because it is robust, requires no manual tuning, and effectively balances the strengths of both retrieval methods.
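RRF itself is only a few lines. The k constant of 60 is the value commonly used in the literature, and the document IDs below are made up:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists by position: each document earns 1 / (k + rank) from
    every list it appears in, so items ranked highly in both lists rise to the top."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["err_90210_manual", "error_code_index", "setup_guide"]
vector_results  = ["error_code_index", "troubleshooting_overview", "err_90210_manual"]

print(reciprocal_rank_fusion([keyword_results, vector_results]))
```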

Reranking

The final step in a high-performance retrieval pipeline is the Reranker, also known as a Cross-Encoder.

A Reranker (or Cross-Encoder) is a computationally expensive model that jointly evaluates a query and each candidate result to produce a high-precision relevance score. It is used to reorder a small set of retrieved results so the most contextually correct information appears at the top.

While the initial retrieval stage (Bi-Encoders) is designed for speed, it often sacrifices some nuance in order to scan millions of documents quickly. A Reranker, by contrast, examines the query and each candidate document together at the same time, recovering that nuance at a much higher computational cost.

Because it is slow, we do not use it to search the whole database. Instead, we take the top 50 or 100 results from our hybrid search and pass them to the Reranker.

The Reranker then performs a deep analysis of the relationship between the user's question and the content of each document, reordering them to ensure the most relevant information is at the very top of the list.
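A rough sketch of this stage, assuming the sentence-transformers library is available (the checkpoint name is just one publicly available example of a cross-encoder):

```python
from sentence_transformers import CrossEncoder

# Example cross-encoder checkpoint; any query-document relevance model can be used.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=5):
    """Jointly score each (query, document) pair, then keep the best top_k documents."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# `candidates` would typically be the top 50-100 results from the hybrid search stage.
```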

Context Window Optimization

In the context of an AI agent, every piece of information we retrieve occupies space in the model's limited attention span. Using hybrid search and reranking serves as a high-fidelity filter. By the time the information reaches the LLM, the "noise" has been stripped away, leaving only the high-signal facts required to generate an accurate response.
