Getting Started with Airweave

5 min read

Building AI agents that need access to real-world data requires solving a fundamental problem: how do you give your agent reliable, up-to-date context from dozens of different sources without building custom integrations for each one?

This article walks through the core workflow of using Airweave to turn scattered data sources into a unified retrieval layer that AI agents can query in a single request. In essence, using Airweave follows a straightforward pattern:

  1. Create a collection (your searchable knowledge base)

  2. Add source connections (link your apps and databases)

  3. Wait for sync (Airweave pulls and indexes your data)

  4. Search and retrieve (query from your agent or application)

Each step builds on the last, and once configured, Airweave handles continuous synchronization automatically.

Collections

A Collection is a searchable knowledge base composed of entities from one or more source connections. Collections are what your AI agents actually query.

Think of a collection as a unified index across multiple data sources. You might create a collection called "Engineering Context" that includes:

  • GitHub issues and pull requests

  • Slack messages from your engineering channel

  • Notion documentation

  • Linear tickets

When your agent searches this collection, it retrieves relevant results from all connected sources in a single query, ranked by relevance regardless of where the data originated.
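
To make the "unified index" idea concrete, here is a minimal sketch that does not use Airweave at all. The `Item` class, the term-overlap `score` function, and the sample data are all hypothetical stand-ins; the only point is that items from several sources sit in one list and are ranked together.

```python
from dataclasses import dataclass


@dataclass
class Item:
    source: str  # label only, e.g. "github" or "slack"; no real connection
    text: str


def score(query: str, text: str) -> float:
    """Toy relevance: fraction of query terms that appear in the text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def search_collection(items: list[Item], query: str, limit: int = 3) -> list[tuple[float, Item]]:
    """Rank every item in one list, ignoring which source it came from."""
    scored = [(score(query, item.text), item) for item in items]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


engineering_context = [
    Item("github", "login bug in oauth callback"),
    Item("slack", "lunch plans for friday"),
    Item("linear", "fix login timeout bug"),
    Item("notion", "oauth setup guide"),
]

hits = search_collection(engineering_context, "login bug")
for relevance, item in hits:
    print(f"{relevance:.2f} [{item.source}] {item.text}")
```

A real retrieval layer would replace the toy scoring with proper ranking, but the shape is the same: one query, one ranked list, many sources.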

Collections are created through the SDK or API:

# `airweave` is assumed to be an already-initialized Airweave SDK client.
collection = airweave.collections.create(
    name="Engineering Context"
)

Once created, a collection has a unique readable_id that you'll use for all subsequent operations.

Source Connections

A Source Connection is a configured, authenticated instance of a connector linked to your specific account or workspace. It represents the actual live connection to your data using your credentials.

While Airweave supports many source types (Slack, GitHub, Notion, Google Drive, databases, and more), each source connection is specific to your account. You can have multiple connections to the same source type, for example three different Slack workspaces or two separate GitHub organizations.

Creating a source connection requires:

  1. Selecting a connector: The source type you want to connect (e.g., "slack", "github", "notion")

  2. Authenticating: Providing credentials via OAuth or API keys

  3. Assigning to a collection: Linking the connection to an existing collection

source_connection = airweave.source_connections.create(
    name="My Stripe Connection",
    short_name="stripe",
    readable_collection_id=collection.readable_id,
    authentication={
        "credentials": {
            "api_key": "your_stripe_api_key"
        }
    }
)

For OAuth-based sources like Slack, Google Drive, or GitHub, Airweave handles the OAuth flow through the UI. For API-key-based sources like Stripe or custom databases, you provide the credentials directly.

Syncing

Once a source connection is created, Airweave immediately triggers an initial sync. This process:

  • Pulls all accessible data from the source

  • Transforms it into searchable entities

  • Chunks long content for better retrieval

  • Generates embeddings for semantic search

  • Indexes everything in Vespa
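
The chunking step in the list above can be illustrated with a small sketch. This is not Airweave's chunker, whose strategy and sizes are not described here; it assumes a simple word-window approach with overlap, and `chunk_text`, `max_words`, and `overlap` are names chosen for illustration.

```python
def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 120-word document: w0 w1 ... w119
document = " ".join(f"w{i}" for i in range(120))
chunks = chunk_text(document, max_words=50, overlap=10)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is the usual motivation for overlapping windows.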

The initial sync can take time depending on data volume. A Slack workspace with years of messages might take several minutes. A Google Drive with thousands of large documents could take longer.

After the initial sync completes, Airweave continues syncing on a schedule (configurable per connection) or can be triggered programmatically via the API. Incremental syncs are fast because Airweave only processes new or modified data.
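
One common way to process only new or modified data is to store a content fingerprint per item and compare it on the next run. The sketch below assumes that approach purely for illustration; the article does not say how Airweave detects changes, and `fingerprint` and `plan_incremental_sync` are hypothetical helpers.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Stable hash of an item's content, used to detect modifications."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan_incremental_sync(seen: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored fingerprints against the source's current items.

    seen:    item id -> fingerprint recorded by the previous sync
    current: item id -> content as the source returns it now
    """
    plan: dict[str, list[str]] = {"new": [], "modified": [], "unchanged": []}
    for item_id, content in current.items():
        if item_id not in seen:
            plan["new"].append(item_id)
        elif seen[item_id] != fingerprint(content):
            plan["modified"].append(item_id)
        else:
            plan["unchanged"].append(item_id)
    return plan


previous = {"a": fingerprint("hello"), "b": fingerprint("old text")}
latest = {"a": "hello", "b": "new text", "d": "brand new"}
plan = plan_incremental_sync(previous, latest)
print(plan)
```

Only the ids under `new` and `modified` would need to be chunked, embedded, and indexed again, which is why an incremental run is cheaper than the initial one.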

You can monitor sync status through the dashboard or by checking the source connection object:

status = airweave.source_connections.get(
    source_connection_id=source_connection.id
)
print(status.status)
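
If you want a script to block until a sync finishes, a small polling helper is one option. The sketch below is deliberately independent of the SDK: `wait_for_sync` takes any zero-argument function that returns a status string, and the terminal values `"active"` and `"failed"` are assumptions, not documented Airweave statuses.

```python
import time
from typing import Callable


def wait_for_sync(
    fetch_status: Callable[[], str],
    done_states: frozenset = frozenset({"active", "failed"}),  # hypothetical values
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status until it returns a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"sync still '{status}' after {timeout_seconds}s")
        sleep(poll_seconds)


# Stand-in for a real status lookup: reports "syncing" twice, then "active".
fake_statuses = iter(["syncing", "syncing", "active"])
final = wait_for_sync(lambda: next(fake_statuses), poll_seconds=0.0)
print(final)
```

In practice, `fetch_status` could wrap the `source_connections.get` call shown above and return its `status` field; check the status values your SDK version actually reports before relying on the defaults here.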

Searching

Search is where Airweave delivers value. Your agent sends a natural language query, and Airweave runs it across all entities from all connected sources, returning the most relevant context regardless of where the data originally came from.

results = airweave.collections.search(
    readable_id=collection.readable_id,
    query="What are the open bugs related to authentication?",
    limit=10
)

for result in results.results:
    print(f"Source: {result.source_name}")
    print(f"Content: {result.md_content}")
    print(f"Score: {result.score}")

Behind the scenes, Airweave runs a hybrid search combining:

  • Semantic search: Vector similarity using embeddings

  • Keyword search: BM25 for exact term matching

  • Reranking: LLM-based reranking for precision
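
How the semantic and keyword rankings are merged is not specified here. One widely used technique for combining ranked lists is reciprocal rank fusion (RRF), sketched below as an illustration rather than as Airweave's actual method. The document ids and the constant `k = 60` are assumptions for the example.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of ids: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


semantic = ["doc_a", "doc_b", "doc_c"]  # order from vector similarity
keyword = ["doc_b", "doc_d", "doc_a"]   # order from BM25-style term matching
fused = reciprocal_rank_fusion([semantic, keyword])
for doc_id, fused_score in fused:
    print(doc_id, round(fused_score, 4))
```

Documents that rank well in both lists rise to the top, while a document found by only one method still gets a chance. A reranking stage would then reorder the top of this fused list.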

Results include source attribution, so your agent knows exactly where each piece of information came from. This enables citation-backed responses and helps users verify facts.
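
As a rough sketch of how source attribution can feed citation-backed answers, the helper below turns results into a numbered context block for a prompt. `Hit` is a hypothetical stand-in for a search result; its `source_name` and `score` fields mirror the names used in the search example above, while `content` and the sample data are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    # Hypothetical stand-in for one search result; not an SDK type.
    source_name: str
    content: str
    score: float


def build_cited_context(hits: list[Hit], max_chars: int = 200) -> str:
    """Number each hit and tag it with its source so an answer can cite [n]."""
    lines = []
    for number, hit in enumerate(hits, start=1):
        snippet = hit.content[:max_chars]
        lines.append(f"[{number}] ({hit.source_name}) {snippet}")
    return "\n".join(lines)


hits = [
    Hit("GitHub", "Issue 12: login fails after token refresh", 0.91),
    Hit("Slack", "We traced the auth bug to clock skew", 0.84),
]
context = build_cited_context(hits)
print(context)
```

An agent prompt can then instruct the model to reference `[1]`, `[2]`, and so on, which lets users trace each claim back to its source.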

Entities

An Entity is a single, searchable item extracted from a source. Entities are the atomic units of data that get indexed and returned in search results.

You don't interact with entities directly in most cases, but understanding them helps explain how Airweave works. When Airweave syncs a source connection, it extracts entities:

  • A Slack message becomes an entity

  • A GitHub code file becomes an entity

  • A Notion page becomes an entity

  • A database row becomes an entity

Each entity carries metadata like timestamps, author information, source type, and links back to the original content. This metadata enables filtering and source attribution in search results.
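
To picture what such a record might look like, here is a hypothetical sketch. The `Entity` fields below are illustrative only and do not reflect Airweave's real schema; `filter_entities` shows how metadata like source type and timestamps can narrow a result set.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Entity:
    # Illustrative fields only; the real schema is not given in this article.
    entity_id: str
    source_type: str
    content: str
    author: str
    updated_at: datetime
    url: str


def filter_entities(entities, source_type=None, updated_since=None):
    """Keep entities matching the optional source type and recency cutoff."""
    kept = []
    for entity in entities:
        if source_type is not None and entity.source_type != source_type:
            continue
        if updated_since is not None and entity.updated_at < updated_since:
            continue
        kept.append(entity)
    return kept


def utc(year: int, month: int, day: int) -> datetime:
    """Small helper to build timezone-aware dates for the sample data."""
    return datetime(year, month, day, tzinfo=timezone.utc)


entities = [
    Entity("1", "slack", "deploy is done", "ana", utc(2024, 1, 5), "https://example.com/1"),
    Entity("2", "github", "def login(): ...", "ben", utc(2024, 3, 1), "https://example.com/2"),
    Entity("3", "slack", "auth bug found", "cy", utc(2024, 3, 9), "https://example.com/3"),
]
recent_slack = filter_entities(entities, source_type="slack", updated_since=utc(2024, 2, 1))
print([e.entity_id for e in recent_slack])
```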

Permission Awareness

One critical aspect of Airweave's design: it respects source-level permissions. When you authenticate a source connection, Airweave only syncs data your credentials can access.

For example:

  • A Slack connection only syncs channels the authenticated user can see

  • A GitHub connection only syncs repositories the token has access to

  • A Google Drive connection only syncs files the user can read

This means different users can have different collections with different source connections, each seeing only the data they're authorized to access.
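
The effect can be modeled in a few lines. In a real connection the source's own API enforces access, so nothing like this filter needs to be written by hand; `select_syncable`, the `container` key, and the channel names are hypothetical and only show the outcome.

```python
def select_syncable(items: list[dict], accessible: set[str]) -> list[dict]:
    """Keep only items whose container the credential can read."""
    return [item for item in items if item["container"] in accessible]


all_messages = [
    {"container": "#engineering", "text": "release at 3pm"},
    {"container": "#exec-private", "text": "budget draft"},
    {"container": "#general", "text": "welcome!"},
]
# Channels visible to the authenticated user (hypothetical values).
visible_channels = {"#engineering", "#general"}
synced = select_syncable(all_messages, visible_channels)
print([message["container"] for message in synced])
```

Because the private channel never enters the index, no later search against that collection can surface it.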

Integration Patterns

Airweave integrates into AI applications through several interfaces:

SDK (Python/Node.js): Best for custom agents and applications. Full programmatic control over collections, source connections, and search.

REST API: Direct HTTP access for any language or framework. Useful for integrations beyond the SDK languages.

MCP Server: Model Context Protocol integration for tools like Claude Desktop. Enables agents to search Airweave collections as a native capability.

Framework Integrations: Native support for popular agent frameworks like Vercel and LlamaIndex, enabling drop-in retrieval without custom code.

The choice depends on your stack, but all interfaces provide the same core functionality: create collections, add sources, search for context.

Looking Ahead

Airweave handles the infrastructure of context retrieval so you can focus on building capable agents. Once collections are configured and syncing, your agent has reliable access to up-to-date context without worrying about API quirks, rate limits, or keeping data fresh.

The patterns described here (collections, source connections, continuous sync, unified search) form the foundation for building agents that operate on real-world data rather than static snapshots. Whether you're building internal tools, customer-facing assistants, or autonomous agents, Airweave provides the retrieval layer that connects intelligence to information.
