Airweave
Getting Started with Airweave
Building AI agents that need access to real-world data requires solving a fundamental problem: how do you give your agent reliable, up-to-date context from dozens of different sources without building custom integrations for each one?
This article walks through the core workflow of using Airweave to turn scattered data sources into a unified retrieval layer that AI agents can query in a single request. In essence, using Airweave follows a straightforward pattern:
Create a collection (your searchable knowledge base)
Add source connections (link your apps and databases)
Wait for sync (Airweave pulls and indexes your data)
Search and retrieve (query from your agent or application)
Each step builds on the last, and once configured, Airweave handles continuous synchronization automatically.
Collections
A Collection is a searchable knowledge base composed of entities from one or more source connections. Collections are what your AI agents actually query.
Think of a collection as a unified index across multiple data sources. You might create a collection called "Engineering Context" that includes:
GitHub issues and pull requests
Slack messages from your engineering channel
Notion documentation
Linear tickets
When your agent searches this collection, it retrieves relevant results from all connected sources in a single query, ranked by relevance regardless of where the data originated.
Collections are created through the SDK or API.
Once created, a collection has a unique readable_id that you'll use for all subsequent operations.
Source Connections
A Source Connection is a configured, authenticated instance of a connector linked to your specific account or workspace. It represents the actual live connection to your data using your credentials.
While Airweave supports many source types (Slack, GitHub, Notion, Google Drive, databases, and more), each source connection is specific to your account. You can have multiple connections to the same source type, for example three different Slack workspaces or two separate GitHub organizations.
Creating a source connection requires:
Selecting a connector: The source type you want to connect (e.g., "slack", "github", "notion")
Authenticating: Providing credentials via OAuth or API keys
Assigning to a collection: Linking the connection to an existing collection
For OAuth-based sources like Slack, Google Drive, or GitHub, Airweave handles the OAuth flow through the UI. For API-key-based sources like Stripe or custom databases, you provide the credentials directly.
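To make the relationship between the two concepts concrete, here is a minimal in-memory sketch. It is not the Airweave SDK: the class names, the `AuthMethod` enum, and the `add_connection` helper are hypothetical, and only `name`, `readable_id`, and the connector strings come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum


class AuthMethod(Enum):
    OAUTH = "oauth"      # e.g. Slack, Google Drive, GitHub
    API_KEY = "api_key"  # e.g. Stripe or a custom database


@dataclass
class Collection:
    # Hypothetical model: only `name` and `readable_id` appear in the article.
    name: str
    readable_id: str
    connections: list = field(default_factory=list)


@dataclass
class SourceConnection:
    connector: str           # source type, e.g. "slack", "github", "notion"
    auth_method: AuthMethod  # how the credentials were provided
    collection_id: str       # readable_id of the collection this feeds


def add_connection(collection, connector, auth_method):
    """Create a connection and attach it to an existing collection."""
    conn = SourceConnection(connector, auth_method, collection.readable_id)
    collection.connections.append(conn)
    return conn


engineering = Collection(name="Engineering Context", readable_id="engineering-context")
# Several connections of the same source type can feed one collection.
add_connection(engineering, "slack", AuthMethod.OAUTH)
add_connection(engineering, "slack", AuthMethod.OAUTH)
add_connection(engineering, "stripe", AuthMethod.API_KEY)
```

The point of the sketch is the shape of the data: a connection always names its connector, its authentication method, and exactly one target collection.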
Syncing
Once a source connection is created, Airweave immediately triggers an initial sync. This process:
Pulls all accessible data from the source
Transforms it into searchable entities
Chunks long content for better retrieval
Generates embeddings for semantic search
Indexes everything in Vespa
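The steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not Airweave's implementation: the chunk sizes are arbitrary, `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and a plain dictionary stands in for Vespa.

```python
import hashlib
import math


def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping word windows (sizes are arbitrary here)."""
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)] if words else []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks


def toy_embed(text, dims=16):
    """Stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sync(raw_items, index):
    """Pull -> transform -> chunk -> embed -> index, for a list of raw dicts."""
    for item in raw_items:                            # records pulled from a source
        entity_id = f"{item['source']}:{item['id']}"  # normalized entity identifier
        for n, chunk in enumerate(chunk_text(item["text"])):
            index[f"{entity_id}#{n}"] = {             # dict stands in for the search index
                "source": item["source"],
                "text": chunk,
                "vector": toy_embed(chunk),
            }
    return index


index = sync(
    [
        {"source": "slack", "id": "m1", "text": "deploy failed on staging"},
        {"source": "notion", "id": "p1", "text": "word " * 120},
    ],
    index={},
)
```

The short Slack message yields one chunk, while the 120-word Notion page is split into three overlapping chunks, so long documents remain retrievable at paragraph granularity.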
The initial sync can take time depending on data volume. A Slack workspace with years of messages might take several minutes. A large Google Drive with thousands of large documents could take longer.
After the initial sync completes, Airweave continues syncing on a schedule (configurable per connection) or can be triggered programmatically via the API. Incremental syncs are fast because Airweave only processes new or modified data.
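One common way to process only new or modified data is to compare a content hash against the value recorded at the previous sync. The article does not say how Airweave detects changes, so treat the following as a generic sketch of the idea; `incremental_sync` and its arguments are hypothetical names.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of an item's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def incremental_sync(source_items, seen_hashes):
    """Return ids needing (re)processing and update seen_hashes in place.

    source_items: dict of item id -> current text
    seen_hashes:  dict of item id -> hash recorded at the previous sync
    """
    changed = []
    for item_id, text in source_items.items():
        digest = content_hash(text)
        if seen_hashes.get(item_id) != digest:
            changed.append(item_id)
            seen_hashes[item_id] = digest
    return changed


state = {}
first = incremental_sync({"a": "hello", "b": "world"}, state)
second = incremental_sync({"a": "hello", "b": "world!", "c": "new"}, state)
```

The first run processes everything; the second run touches only the modified item `b` and the new item `c`, which is why later syncs are much cheaper than the initial one.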
You can monitor sync status through the dashboard or by checking the source connection object.
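Programmatic monitoring usually comes down to a polling loop. The sketch below assumes a caller-supplied `get_status` function that fetches the source connection and returns its status as a string; the status names (`"pending"`, `"in_progress"`, `"completed"`, `"failed"`) are hypothetical placeholders, not values confirmed by the Airweave API.

```python
import time


def wait_for_sync(get_status, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_status() until it reports a terminal state or the timeout passes."""
    terminal = {"completed", "failed"}  # hypothetical status names
    waited = 0.0
    while True:
        status = get_status()
        if status in terminal:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"sync still '{status}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for a call that fetches the source connection and reads its status.
fake_statuses = iter(["pending", "in_progress", "in_progress", "completed"])
result = wait_for_sync(lambda: next(fake_statuses), sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in an application you would leave the default and replace the lambda with a real lookup.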
Searching
Search is where Airweave delivers value. Your agent sends a natural language query, the query runs across all entities from all connected sources, and Airweave returns the most relevant context regardless of where the data originally came from.
Behind the scenes, Airweave runs a hybrid search combining:
Semantic search: Vector similarity using embeddings
Keyword search: BM25 for exact term matching
Reranking: LLM-based reranking for precision
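The sketch below shows one way keyword and semantic signals can be combined. The article does not specify how Airweave fuses the two, so reciprocal rank fusion is used here purely as an assumption; the document vectors are hand-written stand-ins for real embeddings, and the LLM reranking step is omitted.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with the standard BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def ranks(scores):
    """Map doc index -> 1-based rank, best score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {doc: rank for rank, doc in enumerate(order, start=1)}


def hybrid_search(query, query_vec, docs, doc_vecs, k=60):
    """Fuse keyword and vector rankings with reciprocal rank fusion."""
    kw = ranks(bm25_scores(query, docs))
    sem = ranks([cosine(query_vec, v) for v in doc_vecs])
    fused = {i: 1 / (k + kw[i]) + 1 / (k + sem[i]) for i in range(len(docs))}
    return sorted(fused, key=fused.get, reverse=True)


docs = [
    "deploy failed on staging",
    "lunch menu for friday",
    "rollback the staging release",
]
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]  # made-up 2-d "embeddings"
order = hybrid_search("staging deploy", [1.0, 0.0], docs, doc_vecs)
```

Fusing ranks rather than raw scores avoids having to normalize BM25 scores and cosine similarities onto a common scale, which is one reason rank fusion is a popular default for hybrid retrieval.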
Results include source attribution, so your agent knows exactly where each piece of information came from. This enables citation-backed responses and helps users verify facts.
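As an illustration of how attribution supports citations, the following sketch turns a list of results into a numbered context block and a matching source list. The result fields (`text`, `source`, `url`) and the example URLs are hypothetical; the actual response shape may differ.

```python
def build_context(results):
    """Turn search results into a numbered context block plus a source list."""
    lines, sources = [], []
    for n, r in enumerate(results, start=1):
        lines.append(f"[{n}] {r['text']}")
        sources.append(f"[{n}] {r['source']}: {r['url']}")
    return "\n".join(lines), "\n".join(sources)


results = [
    {"text": "Deploy failed on staging at 14:02.", "source": "slack",
     "url": "https://example.com/slack/m1"},
    {"text": "Rollback procedure for staging.", "source": "notion",
     "url": "https://example.com/notion/p1"},
]
context, citations = build_context(results)
```

The numbered `context` string goes into the agent's prompt, and `citations` can be appended to the answer so a reader can follow each `[n]` marker back to the original item.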
Entities
An Entity is a single, searchable item extracted from a source. Entities are the atomic units of data that get indexed and returned in search results.
You don't interact with entities directly in most cases, but understanding them helps explain how Airweave works. When Airweave syncs a source connection, it extracts entities:
A Slack message becomes an entity
A GitHub code file becomes an entity
A Notion page becomes an entity
A database row becomes an entity
Each entity carries metadata like timestamps, author information, source type, and links back to the original content. This metadata enables filtering and source attribution in search results.
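A rough picture of what such an entity might look like, and how its metadata can drive filtering, is sketched below. The field names and the `filter_entities` helper are hypothetical; they mirror the metadata listed above rather than Airweave's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Entity:
    entity_id: str
    source_type: str      # e.g. "slack", "github", "notion"
    author: str
    created_at: datetime
    url: str              # link back to the original content
    text: str


def filter_entities(entities, source_type=None, since=None):
    """Keep entities matching an optional source type and minimum timestamp."""
    kept = []
    for e in entities:
        if source_type is not None and e.source_type != source_type:
            continue
        if since is not None and e.created_at < since:
            continue
        kept.append(e)
    return kept


entities = [
    Entity("e1", "slack", "alice", datetime(2024, 1, 5, tzinfo=timezone.utc),
           "https://example.com/slack/m1", "deploy failed on staging"),
    Entity("e2", "github", "bob", datetime(2024, 3, 1, tzinfo=timezone.utc),
           "https://example.com/github/f1", "def rollback(): ..."),
    Entity("e3", "slack", "carol", datetime(2024, 4, 9, tzinfo=timezone.utc),
           "https://example.com/slack/m2", "staging is green again"),
]
recent_slack = filter_entities(
    entities, source_type="slack", since=datetime(2024, 2, 1, tzinfo=timezone.utc)
)
```

Here only `e3` survives both filters: it is a Slack entity and it was created after the cutoff date.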
Permission Awareness
One critical aspect of Airweave's design: it respects source-level permissions. When you authenticate a source connection, Airweave only syncs data your credentials can access.
For example:
A Slack connection only syncs channels the authenticated user can see
A GitHub connection only syncs repositories the token has access to
A Google Drive connection only syncs files the user can read
This means different users can have different collections with different source connections, each seeing only the data they're authorized to access.
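In practice the source system enforces access when Airweave calls it with your credentials. The toy model below only illustrates the effect; the `allowed` sets and the `accessible_items` helper are hypothetical.

```python
def accessible_items(items, principal):
    """Keep only items whose access list includes the authenticated principal."""
    return [item for item in items if principal in item["allowed"]]


channels = [
    {"name": "#engineering", "allowed": {"alice", "bob"}},
    {"name": "#exec-private", "allowed": {"carol"}},
    {"name": "#general", "allowed": {"alice", "bob", "carol"}},
]

# Two connections authenticated as different users see different data.
alice_sync = [c["name"] for c in accessible_items(channels, "alice")]
carol_sync = [c["name"] for c in accessible_items(channels, "carol")]
```

A connection authenticated as `alice` never receives `#exec-private`, so nothing from that channel can end up in a collection built from her connection.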
Integration Patterns
Airweave integrates into AI applications through several interfaces:
SDK (Python/Node.js): Best for custom agents and applications. Full programmatic control over collections, source connections, and search.
REST API: Direct HTTP access for any language or framework. Useful for integrations beyond the SDK languages.
MCP Server: Model Context Protocol integration for tools like Claude Desktop. Enables agents to search Airweave collections as a native capability.
Framework Integrations: Native support for popular agent frameworks like Vercel and LlamaIndex, enabling drop-in retrieval without custom code.
The choice depends on your stack, but all interfaces provide the same core functionality: create collections, add sources, search for context.
Looking Ahead
Airweave handles the infrastructure of context retrieval so you can focus on building capable agents. Once collections are configured and syncing, your agent has reliable access to up-to-date context without worrying about API quirks, rate limits, or keeping data fresh.
The patterns described here (collections, source connections, continuous sync, unified search) form the foundation for building agents that operate on real-world data rather than static snapshots. Whether you're building internal tools, customer-facing assistants, or autonomous agents, Airweave provides the retrieval layer that connects intelligence to information.
