Error monitoring tools send alerts. What engineering teams actually need is context: Which code is involved? Has anyone worked on this yet? Is this a new issue or a known regression?
This article walks through building an intelligent error monitoring agent that uses Airweave to transform raw error logs into enriched, actionable alerts. We'll cover the architecture, implementation patterns, and lessons learned from processing 40,000+ queries per month in production.
The full implementation is available at github.com/airweave-ai/error-monitoring-agent.
Problem Setting
Traditional error monitoring follows a simple pattern: error occurs, alert fires, engineer investigates. This breaks down at scale for several reasons:
Alert fatigue: A single underlying issue can generate hundreds of individual alerts. Engineers learn to ignore notifications or spend hours triaging duplicates.
Missing context: Error logs contain stack traces but lack the surrounding context engineers need. Which code is affected? Has this happened before? Is there already a ticket?
Manual correlation: Engineers manually search GitHub for relevant code, check Linear for existing tickets, and scan Slack for related discussions. This takes 10-15 minutes per error.
Reactive posture: By the time an alert reaches someone, customers have often already experienced the issue. There's no opportunity for proactive fixes.
For small teams maintaining complex systems, this overhead becomes unsustainable.
Architecture Overview
The error monitoring agent runs as a scheduled pipeline (every 5 minutes in production) with five core stages:
Fetch and cluster errors from monitoring systems
Search for context using Airweave across GitHub, Linear, and Slack
Analyze severity and determine if this is new, ongoing, or a regression
Determine suppression - should this trigger an alert or be silenced?
Create alerts in Slack and Linear with full context
Each stage feeds into the next, progressively enriching raw errors with the context engineers need to act quickly.
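The five stages above can be sketched as a single scheduled run. This skeleton is illustrative only — the function signatures and dict keys are assumptions, not the project's actual module layout:

```python
import asyncio

async def run_once(fetch, cluster, search, analyze, should_alert, send_alert):
    """One scheduled run of the five-stage pipeline (illustrative signatures)."""
    errors = await fetch()                             # 1. fetch raw errors
    clusters = await cluster(errors)                   # 2. group by root cause
    for c in clusters:
        c["context"] = await search(c)                 # 3. gather context via Airweave
        c["severity"], c["status"] = await analyze(c)  # 4. severity + status
        if should_alert(c):                            # 5. suppress or notify
            await send_alert(c)
    return clusters
```

Because each stage is just an awaitable passed in, the stages can be swapped independently — for example, replacing the LLM-backed analyzer with a rule-based one in tests.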
Stage 1: Semantic Error Clustering
Raw error logs are noisy. A database timeout might generate 50 identical stack traces within minutes. The first step is grouping errors by root cause rather than treating each occurrence as distinct.
Multi-Stage Clustering
The agent uses a four-stage clustering approach:
Stage 1: Strict Clustering - Group by exact module + function + line number match. This catches identical stack traces immediately.
Stage 2: Regex Pattern Clustering - Group by error type extracted via regex patterns. For example, "429", "rate limit", and "too many requests" all map to a "RateLimit" error type. Errors matching the same pattern type with 2+ occurrences form a cluster.
Stage 3: LLM Semantic Clustering - (Optional) Use Claude or GPT-4 to identify remaining unclustered errors with similar root causes but different surface presentations. The LLM returns groupings like [[0, 1, 3], [2], [4, 5]] and then a second LLM call generates a human-readable signature (50-150 chars) for each multi-error group.
Stage 4: Cluster Merging - Only runs when there are 3+ clusters. Uses the LLM to decide which clusters to merge. Falls back to merging clusters with the same extracted error type if no LLM is available.
This reduces 500 raw logs to approximately 10-15 distinct clusters worth investigating.
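Stage 2 (regex pattern clustering) can be sketched in a few lines. The pattern table and the 2-occurrence threshold here are illustrative assumptions, not the project's actual rules:

```python
import re
from collections import defaultdict

# Hypothetical pattern table mapping regexes to coarse error types.
ERROR_PATTERNS = {
    "RateLimit": re.compile(r"429|rate limit|too many requests", re.I),
    "Timeout": re.compile(r"timed? ?out|deadline exceeded", re.I),
    "ConnectionError": re.compile(r"connection (refused|reset)|ECONNREFUSED", re.I),
}

def extract_error_type(message: str) -> str:
    """Map a raw error message to an error type via regex patterns."""
    for error_type, pattern in ERROR_PATTERNS.items():
        if pattern.search(message):
            return error_type
    return "Unknown"

def cluster_by_pattern(messages: list[str], min_size: int = 2) -> dict[str, list[str]]:
    """Group messages by extracted type; keep groups with min_size+ occurrences."""
    groups: dict[str, list[str]] = defaultdict(list)
    for msg in messages:
        groups[extract_error_type(msg)].append(msg)
    return {t: msgs for t, msgs in groups.items() if len(msgs) >= min_size}
```

Singleton matches fall through to the later LLM stages, so a threshold of 2 keeps regex clustering from over-grouping.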
Implementation Pattern
```python
from pipeline.clustering import ErrorClusterer

raw_errors = await data_source.fetch_errors(
    window_minutes=30,
    limit=100
)

clusterer = ErrorClusterer()
clusters = await clusterer.cluster_errors(errors=raw_errors)

for cluster in clusters:
    print(f"Cluster: {cluster['signature']}")
    print(f"Count: {cluster['error_count']}")
    print(f"First seen: {cluster['first_occurrence']}")
```

The clustering logic maintains state between runs to track whether a cluster is new, ongoing, or a regression of a previously fixed issue.
Stage 2: Context Search with Airweave
Once errors are clustered, the agent needs context. This is where Airweave transforms the workflow.
Multi-Source Search Strategy
For each error cluster, the agent performs three parallel searches:
GitHub search - Find code files and functions related to the error. Returns file paths with line numbers and relevant code snippets.
Linear search - Check for existing tickets about this issue. If found, link to the ticket instead of creating a duplicate.
Slack search - Surface past discussions, incident threads, or solutions from previous occurrences.
GitHub and Linear sync continuously into the Airweave collection. Slack uses federated search, querying the Slack API at search time and merging results via Reciprocal Rank Fusion. All three searches run through the same unified interface.
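Reciprocal Rank Fusion itself is simple. The sketch below is a generic implementation with the standard k=60 constant — not Airweave's internal code — that merges several ranked lists of document IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists, scoring each document by the sum of
    1 / (k + rank) over every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher combined score means stronger consensus across sources
    return sorted(scores, key=scores.get, reverse=True)
```

The k constant dampens the advantage of a single rank-1 hit, so documents that appear consistently across lists beat documents that appear once near the top.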
Implementation Pattern
from pipeline.search import ContextSearcher
searcher = ContextSearcher()
context_results = await searcher.search_context(clusters)
Under the hood, search_context calls a private method three times per cluster:
```python
async def _search_source(self, query, source_filter=None, limit=5):
    if source_filter:
        response = await self.client.collections.search_advanced(
            readable_id=self.collection_readable_id,
            query=query,
            filter={
                "must": [
                    {"key": "source_name", "match": {"value": source_filter}}
                ]
            },
            limit=limit
        )
    else:
        response = await self.client.collections.search(
            readable_id=self.collection_readable_id,
            query=query,
            limit=limit
        )
    return response
```

For each cluster, the searcher performs three parallel searches:
```python
query = f"{cluster['signature']} {cluster['sample_message']}"[:500]

github_results = await self._search_source(
    query=query, source_filter="GitHub", limit=5
)
linear_results = await self._search_source(
    query=query, source_filter="Linear", limit=3
)
docs_results = await self._search_source(
    query=query, source_filter=None, limit=3
)
```

The search results include full metadata: file paths, Linear ticket IDs, Slack thread URLs. This context gets attached to each cluster for the next stage.
Why This Works
Without Airweave, this context gathering would require:
Custom GitHub API integration to search code
Linear API client to query tickets semantically
Slack API wrapper to search message history
Manual correlation logic to rank results
Airweave handles all of this through a single unified interface. The agent sends three search queries and receives ranked, relevant results from each source, regardless of whether the data is synced (GitHub, Linear) or federated (Slack).
More importantly, Airweave provides semantic search rather than keyword matching. A keyword search across GitHub and Linear APIs would miss results where the wording differs. Airweave's vector search can match "database pool exhausted" to a Linear ticket titled "DB connection limits under load" - the kind of connection engineers make intuitively but keyword search cannot.
Stage 3: Severity Analysis and Status Determination
With context attached, the agent now determines severity and whether to alert.
Severity Classification
The agent uses Claude to analyze each cluster and assign a severity level:
S1 - Critical: Complete service outage, data loss/corruption, security breach, ALL users affected
S2 - High: Major feature broken, affecting multiple users
S3 - Medium: Minor feature degraded, workaround available
S4 - Low: Cosmetic issue, no user impact
The prompt is explicitly calibrated to be conservative - most errors should land at S3 or S4. Only genuine outages or data loss scenarios warrant S1.
The LLM receives the error details, stack trace, and Airweave context to make this determination.
```python
severity_prompt = f"""
Analyze this error cluster and assign severity (S1-S4):

Error: {cluster['signature']}
Message: {cluster['sample_message']}
Occurrences: {cluster['error_count']} in last 30 min
Stack trace: {cluster['stack_trace']}

Context from GitHub:
{github_results.summary}

Context from Linear:
{linear_results.summary}

Provide severity (S1-S4) and reasoning.
"""

analysis = await llm.complete(severity_prompt)
cluster['severity'] = analysis.severity
cluster['reasoning'] = analysis.reasoning
```

Status Tracking
The agent maintains state to track error signatures across runs:
NEW - First time this error signature has been seen. Always creates an alert and Linear ticket.
ONGOING - Error signature exists with an open Linear ticket. Suppresses alerts but adds a comment to the existing ticket with updated context.
REGRESSION - Error signature was previously resolved (ticket closed) but has returned. Reopens the ticket and sends a high-priority alert.
This status logic prevents alert spam while ensuring critical issues never get missed.
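The NEW/ONGOING/REGRESSION decision reduces to a small lookup against persisted state. A minimal sketch, assuming a state record that tracks whether the linked Linear ticket is still open (the record shape is an assumption, not the project's actual schema):

```python
from enum import Enum

class ErrorStatus(Enum):
    NEW = "new"
    ONGOING = "ongoing"
    REGRESSION = "regression"

def determine_status(signature: str, state: dict) -> ErrorStatus:
    """Classify a cluster against state from previous runs.

    `state` maps signatures to {"ticket_open": bool} records (illustrative).
    """
    record = state.get(signature)
    if record is None:
        return ErrorStatus.NEW          # never seen before
    if record.get("ticket_open"):
        return ErrorStatus.ONGOING      # open ticket already tracks it
    return ErrorStatus.REGRESSION       # seen before, ticket was closed
```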
Stage 4: Suppression Logic
With severity and status determined, the agent now decides whether to alert. Not every error cluster triggers a notification.
Smart Suppression
The agent applies suppression rules in priority order (first match wins):
Muted? If the error signature is muted (manually by an engineer), suppress - regardless of severity.
S1/S2 severity? Always alert, overriding every remaining rule below.
NEW status? First occurrence of this error signature - always alert.
REGRESSION? Previously fixed issue has returned - always alert.
ONGOING with open ticket? Suppress to avoid spam. The existing ticket tracks it.
Alerted within 24 hours? Suppress if we already notified about this signature recently.
Default: Alert.
The ordering is deliberate. Mutes are respected first (engineers made an explicit choice); below that, S1/S2 severity and regressions punch through every other suppression rule. This ensures critical issues are never silently dropped unless someone explicitly muted them.
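The first-match-wins chain maps directly onto an early-return function. A minimal sketch — the field names on the cluster dict are assumptions, not the project's actual schema:

```python
from datetime import datetime, timedelta

def should_alert(cluster: dict, now: datetime) -> bool:
    """Apply suppression rules in priority order; the first match wins."""
    if cluster.get("muted"):
        return False                                  # 1. explicit mute wins
    if cluster["severity"] in ("S1", "S2"):
        return True                                   # 2. critical/high always alerts
    if cluster["status"] == "new":
        return True                                   # 3. first occurrence
    if cluster["status"] == "regression":
        return True                                   # 4. previously fixed, returned
    if cluster["status"] == "ongoing" and cluster.get("ticket_open"):
        return False                                  # 5. open ticket tracks it
    last = cluster.get("last_alerted")
    if last and now - last < timedelta(hours=24):
        return False                                  # 6. alerted recently
    return True                                       # 7. default: alert
```

Keeping the rules in one flat function makes the precedence auditable — reordering two `if` statements is the entire change when the policy shifts.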
Mute matching goes beyond exact strings. The agent uses a SemanticMatcher that compares new error signatures against active mutes using LLM-based semantic comparison. If an engineer mutes "database connection timeout," the agent will also suppress "DB pool exhausted" if the LLM judges them similar enough. The same semantic matching applies to finding existing Linear tickets - the agent can link a new error to a ticket even when the wording differs.
Stage 5: Enriched Alerts
The final stage creates alerts in Slack and Linear with all context attached.
Slack Notification Format
Each Slack message includes:
Error type and message
Severity level with color coding
Affected organizations (if multi-tenant)
Code context with clickable GitHub links
Linear ticket status (new, existing, reopened)
Mute controls (inline buttons to suppress)
```python
await slack.send_alert(
    channel=SLACK_CHANNEL_ID,
    severity=cluster['severity'],
    error_type=cluster['signature'],
    message=cluster['sample_message'],
    github_context=github_results,
    linear_ticket=linear_ticket,
    mute_signature=cluster['signature']
)
```

Linear Ticket Creation
For new errors, the agent creates a Linear ticket with:
Title: Error type and brief description
Description: Full error details, stack trace, affected organizations
Priority: Mapped from severity (S1→Urgent, S2→High, S3→Medium, S4→Low)
Attachments: Links to relevant GitHub files and Slack threads
For existing tickets, it adds a comment with new occurrences and updated context.
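The severity-to-priority mapping is a one-liner, assuming Linear's numeric priority scale (1 = Urgent through 4 = Low); the lookup table below is an illustrative sketch, not the project's code:

```python
# Linear's API represents priority numerically: 1=Urgent, 2=High, 3=Medium/Normal, 4=Low.
SEVERITY_TO_LINEAR_PRIORITY = {"S1": 1, "S2": 2, "S3": 3, "S4": 4}

def linear_priority(severity: str) -> int:
    """Map an agent severity level to a Linear priority, defaulting to Low."""
    return SEVERITY_TO_LINEAR_PRIORITY.get(severity, 4)
```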
Production Deployment
The agent supports two deployment modes: as a cron-triggered script for simple setups, or as a FastAPI server with REST and WebSocket endpoints for real-time visualization.
Scheduling Pattern
```python
import asyncio
from main import run_pipeline, PipelineConfig

async def main():
    config = PipelineConfig(use_sample_data=False)
    result = await run_pipeline(config)

if __name__ == "__main__":
    asyncio.run(main())
```

In production, the agent runs as a FastAPI server with REST and WebSocket endpoints. The script above is a simplified standalone entrypoint for cron-based scheduling. The server-based architecture also powers a real-time pipeline visualization UI via WebSocket.
Run via cron every 5 minutes:
*/5 * * * * cd /path/to/agent && source
State Management
The agent maintains JSON-based state files to track:
Error signatures and their status (new/ongoing/regression)
Last alert timestamps for suppression logic
Muted error patterns
Linear ticket IDs mapped to error signatures
This state persists between runs, enabling the status tracking described earlier.
Results and Impact
Deploying this agent in production delivered measurable improvements:
Volume: Handles 40,000+ Airweave queries per month across GitHub, Linear, and Slack searches.
Alert reduction: 500 raw errors per day reduced to 15-20 actionable alerts (depending on error distribution), cutting noise by 95%.
Response time: Average time from error occurrence to engineer awareness dropped from hours to minutes.
Proactive fixes: Team often resolves issues before customers report them, then proactively notifies affected users.
Context efficiency: Engineers jump directly to relevant code and existing tickets instead of spending 10-15 minutes searching manually.
Key Implementation Lessons
Use Airweave Source Filtering
When searching for context, filtering by source type dramatically improves relevance:
```python
github_results = await client.collections.search_advanced(
    readable_id=collection_readable_id,
    query=error_context,
    filter={
        "must": [
            {"key": "source_name", "match": {"value": "GitHub"}}
        ]
    },
    limit=5
)

all_results = await client.collections.search(
    readable_id=collection_readable_id,
    query=error_context,
    limit=15
)
```

Cluster Before Searching
Running Airweave searches on individual errors is inefficient. Cluster first, then search once per cluster:
❌ Bad: 500 errors × 3 searches = 1,500 Airweave queries
✅ Good: 500 errors → 10 clusters × 3 searches = 30 Airweave queries
The actual compression ratio depends on your error distribution. Homogeneous failures (e.g., a single endpoint timing out) compress dramatically, while diverse errors across unrelated systems compress less.
LLM Analysis After Context Gathering
Don't use LLMs to determine severity from error logs alone. First gather context via Airweave, then pass everything to the LLM:
```python
analysis = await llm.analyze(
    error=cluster,
    github_context=github_results,
    linear_context=linear_results,
    slack_context=slack_results
)
```

This produces far more accurate severity assessments than analyzing errors in isolation.
Maintain Clear State
Error monitoring without state creates duplicate tickets and repeated alerts. Track signatures, statuses, and alert timestamps persistently:
```python
class StateManager:
    def get_signature_status(self, signature: str) -> str:
        """Returns: 'new', 'ongoing', or 'regression'"""

    def record_alert(self, signature: str):
        """Track when we last alerted for this signature"""

    def is_muted(self, signature: str) -> bool:
        """Check if engineers muted this error"""
```

Graceful Degradation
The agent works at every configuration level. Without an LLM key, clustering falls back to regex patterns and severity uses rule-based heuristics. Without Airweave, the pipeline still clusters and analyzes errors - it just lacks external context. Without Slack or Linear configured, alerts render as previews. This means teams can adopt the agent incrementally: start with clustering alone, add Airweave when ready, enable Slack/Linear when the output is trusted.
Looking Ahead
Building an error monitoring agent demonstrates how Airweave enables a new class of autonomous tools. Rather than building custom integrations for GitHub, Linear, and Slack, the agent queries a single unified interface.
This pattern extends beyond error monitoring. Any workflow that requires context from multiple sources (customer support, incident response, code review, documentation generation) can use the same approach: connect sources to Airweave, then query for context as needed.
The key insight is that context retrieval should be infrastructure, not custom code. When you treat it as infrastructure, building intelligent agents becomes straightforward: focus on the logic (clustering, analysis, alerting) rather than the plumbing (API integrations, authentication, data sync).
The complete error monitoring agent implementation, including all code examples from this article, is available as an open-source project at github.com/airweave-ai/error-monitoring-agent.