The first time I set up alerts for an AI agent fleet, I made the same mistake most people make: I set alerts on everything. Task duration over threshold. Token count over threshold. Error rate over threshold. Quality rejection rate over threshold.
Within a week, I had 40 alerts per day. Most were noise. The real issue was buried in alert #37 on a Tuesday afternoon when nobody was checking.
Alert fatigue is real. Here's how to build alerting that actually helps.
Why Agent Alerts Are Different From Service Alerts
Service alerts are mostly binary: the service is up or it's down. The signal is high-confidence. When your API returns 500 errors, that's an alert worth paging someone over.
Agent alerts are inherently noisy. Agents deal with variable inputs. An agent that normally takes 45 seconds might take 3 minutes on a large input — that's not necessarily a problem. An error rate that spikes and then recovers in 10 minutes might be a transient provider issue that the agent already retried and handled.
The challenge is building alerts that catch real drift without firing on normal variation.
The Hierarchy of Agent Alerts
Think of agent alerts in three tiers:
Tier 1 — Page someone now:
- Agent has been in "blocked" state for more than 30 minutes
- Single task cost exceeded $10 (runaway cost)
- Agent has been offline for 15+ minutes during expected working hours
- Error rate spiked above 50% in the last 30 minutes
Tier 2 — Investigate within the hour:
- Task duration consistently 2x baseline over the last 10 tasks
- Quality rejection rate above 30% over the last 24 hours
- Cost per task increased 50% compared to 7-day average
- Agent hasn't submitted a deliverable in 2 hours during active hours
Tier 3 — Daily digest:
- Week-over-week quality rejection rate trend
- Monthly cost per agent compared to prior month
- Agent throughput (tasks per day) trending down
Tier 1 fires as a PagerDuty page or an urgent Slack message. Tier 2 fires as a normal Slack message. Tier 3 rolls up into the daily digest. Only Tier 1 wakes anyone up.
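In practice the routing can be a simple lookup table. Here's a minimal sketch; the channel names and the notify() helper are placeholders for your own integrations, not a specific AgentCenter API:

```python
# Map each alert tier to a delivery channel. The names here are
# illustrative -- swap them for your PagerDuty service, Slack webhook,
# and digest queue.
TIER_ROUTES = {
    1: "pagerduty",   # page someone now
    2: "slack",       # investigate within the hour
    3: "digest",      # roll up into the daily digest
}

def notify(channel: str, message: str) -> None:
    # Placeholder sender: wire this to your real integrations.
    print(f"[{channel}] {message}")

def route_alert(tier: int, message: str) -> None:
    # Unknown tiers fall back to the digest so nothing pages by accident.
    channel = TIER_ROUTES.get(tier, "digest")
    notify(channel, message)
```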
The Key: Use Trend Alerts, Not Threshold Alerts
Most monitoring systems default to threshold alerts: "alert when metric X exceeds value Y." For agents, this produces noise.
Trend alerts are better: "alert when metric X has been above normal for the last N observations." This smooths out the single-event spikes that are often transient.
Examples:
- Not "alert if task duration exceeds 180 seconds" but "alert if average task duration over the last 5 tasks exceeds 2x the 7-day baseline"
- Not "alert if rejection rate exceeds 20%" but "alert if rejection rate over the last 24 hours is more than 2x the prior 7-day average"
- Not "alert if cost exceeds $1.00 per task" but "alert if cost per task is trending upward for 3 consecutive days"
Trend alerts require calculating rolling averages and baselines, but they dramatically reduce noise.
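Here is roughly what the first example looks like in code. It's a sketch that assumes you can pull the recent task durations and the 7-day baseline durations out of your metrics store as plain lists of seconds; the function name and thresholds are illustrative:

```python
from statistics import mean

def duration_trend_alert(recent_durations, baseline_durations,
                         window=5, multiplier=2.0):
    """Fire when the average duration of the last `window` tasks exceeds
    `multiplier` times the 7-day baseline average.

    Both arguments are lists of task durations in seconds; the shape is
    an assumption about your own metrics export, not a fixed API.
    """
    if len(recent_durations) < window or not baseline_durations:
        return False  # not enough data to judge a trend
    rolling_avg = mean(recent_durations[-window:])
    baseline_avg = mean(baseline_durations)
    return rolling_avg > multiplier * baseline_avg
```

A single 3-minute task never fires this alert on its own; five of them in a row do.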
Setting Baselines Per Agent
The baseline for a summarization agent is different from the baseline for a research agent. A research agent routinely takes 5-10 minutes. A summarization agent should complete in under 90 seconds. Alerting on duration requires knowing which normal you're comparing to.
Set baselines per agent from production data: run for 2 weeks, then compute the P50 and P95 for duration, cost, and rejection rate for each agent. Those become your baseline. Alerts fire when current metrics deviate from that baseline, not from a global threshold.
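Computing the baselines is a few lines once you have the task history. A sketch, assuming your history export maps each agent to lists of per-task metric values (that shape is an assumption, not a specific format):

```python
from statistics import quantiles

def compute_baseline(values):
    """Return P50/P95 for one metric from ~2 weeks of per-task values."""
    qs = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return {"p50": qs[49], "p95": qs[94]}

def baselines_per_agent(history):
    """`history` maps agent name -> {"duration": [...], "cost": [...],
    "rejection_rate": [...]}; each list holds one value per task."""
    return {
        agent: {metric: compute_baseline(vals) for metric, vals in metrics.items()}
        for agent, metrics in history.items()
    }
```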
AgentCenter's agent monitoring tracks task history per agent. After 2 weeks, you have the data you need to set per-agent baselines.
The Forbidden Alert
The most common noise-generating alert: "alert on any error."
Don't do this. Transient errors that resolve on retry are normal. Your agent handles them without human intervention. Alerting on them adds noise without value.
Alert on persistent errors instead. "If the same error occurs 5 times in 30 minutes without a successful task completing between failures, alert." That's a pattern indicating a real problem, not a transient one.
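Detecting that pattern only needs a time-ordered event stream. A sketch, assuming events arrive as (timestamp, kind, error_key) tuples from your own logging; the tuple shape and thresholds are assumptions you'd adapt:

```python
from datetime import timedelta

def persistent_error(events, error_key, count=5, window=timedelta(minutes=30)):
    """Return True if `error_key` occurred `count` times within `window`
    with no successful task completing between the failures.

    `events` is a time-ordered list of (timestamp, kind, key) tuples,
    where kind is "error" or "success" -- an assumed log format.
    """
    streak = []  # timestamps of matching errors with no success in between
    for ts, kind, key in events:
        if kind == "success":
            streak = []  # a successful task breaks the streak
        elif kind == "error" and key == error_key:
            streak.append(ts)
            # keep only errors still inside the rolling window
            streak = [t for t in streak if ts - t <= window]
            if len(streak) >= count:
                return True
    return False
```

A retry storm that resolves itself never trips this; five identical failures with nothing succeeding in between does.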
Testing Your Alert Setup
After setting up alerts, test them deliberately:
- Force a task into "blocked" state and confirm the Tier 1 alert fires within 5 minutes
- Run a task with unusually large input to see if the duration alert fires or is filtered as noise
- Check your alert volume after one week — if you're getting more than 5 non-Tier-1 alerts per week, something is too sensitive
The goal is an alert setup where every alert that fires gets investigated within the target response time. If alerts are routinely ignored, they're noise.
Bottom Line
Start with 3 Tier 1 alerts (blocked, offline, runaway cost). After a month of stable operation, add 2-3 Tier 2 trend alerts. Review and adjust quarterly. Less is more — a system where every alert gets investigated is worth more than a comprehensive system where important alerts get buried.
The best time to set this up is before your agents start failing. Try AgentCenter free for 7 days — cancel anytime.