
Prometheus Alertmanager: Complete Setup & Configuration Guide 2026

Alertmanager is where Prometheus alerts go to either save your night or get lost in noise. This guide covers installation, routing trees, grouping, silences, inhibition, and receiver config for Slack, PagerDuty, and webhooks — everything you need to go from zero to production-grade alerting.


How Prometheus + Alertmanager Work Together

The alerting pipeline has two distinct stages that engineers often conflate:

Stage 1: Prometheus (Alert Evaluation)

  • Evaluates PromQL expressions every 15–60s (evaluation_interval)
  • Alert fires when the expression has been true for the entire for duration
  • Sends firing alerts to Alertmanager via HTTP push
  • Re-sends still-firing alerts on each evaluation cycle (throttled by --rules.alert.resend-delay, default 1m) until they resolve
  • Config: alerting_rules.yml, prometheus.yml alerting section

Stage 2: Alertmanager (Notification Pipeline)

  • Receives alerts via POST /api/v2/alerts
  • Deduplicates identical alerts (same labels)
  • Groups related alerts into one notification (group_by)
  • Applies silences and inhibition rules
  • Routes to receivers: Slack, PagerDuty, email, webhook

Key insight: Alertmanager never queries Prometheus. It only receives alerts pushed to it. If Prometheus can't reach Alertmanager, alerts are queued in Prometheus memory and retried. Configure alertmanagers in prometheus.yml with the Alertmanager address.
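
To see the push model in action, you can fire a synthetic alert at Alertmanager yourself. A minimal sketch using curl against the v2 API (the alert name and labels here are made up purely for the test):

# POST a hand-crafted alert to Alertmanager
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
        "labels": {"alertname": "ManualTest", "severity": "warning", "instance": "test-01"},
        "annotations": {"summary": "Synthetic alert fired by hand"}
      }]'

The alert shows up in the web UI and is routed exactly like a Prometheus-originated one; because nothing keeps re-sending it, it auto-resolves after resolve_timeout.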

Installing Alertmanager

Alertmanager ships as a single binary. Download from the Prometheus GitHub releases page, or use Docker:

# Binary install (Linux amd64)
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xvfz alertmanager-0.27.0.linux-amd64.tar.gz
cd alertmanager-0.27.0.linux-amd64
./alertmanager --config.file=alertmanager.yml

# Docker
docker run -d -p 9093:9093 \
  -v /path/to/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager

# Kubernetes (Helm)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack

Alertmanager runs on port 9093 by default. The web UI is available at http://localhost:9093 and shows current alerts, silences, and firing state.
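
On bare-metal installs you will usually want systemd to supervise the binary. A minimal unit sketch, assuming you copied the binary to /usr/local/bin, the config to /etc/alertmanager/, and created an alertmanager user (adjust paths to your layout):

# /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alertmanager
After=network-online.target

[Service]
User=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable it with sudo systemctl enable --now alertmanager.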

Wiring Prometheus to Alertmanager

In prometheus.yml, add the alerting block and point it to your Alertmanager instance:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093  # or localhost:9093

rule_files:
  - "alerting_rules/*.yml"   # glob — load all rule files
  - "recording_rules/*.yml"

scrape_configs:
  # ... your scrape targets

Writing Prometheus Alerting Rules

Alerting rules live in separate .yml files referenced by rule_files. Each rule file has a list of groups, and each group has a list of rules:

# alerting_rules/infrastructure.yml
groups:
  - name: infrastructure
    interval: 1m  # optional override, defaults to evaluation_interval
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been unreachable for more than 1 minute."
          runbook_url: "https://runbooks.example.com/node-down"

      - alert: HighCPU
        expr: avg by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100 > 85
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}."

Common Alerting Rule Examples

High CPU Usage

severity: warning
expr: avg by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100 > 85
for: 10m
annotation: CPU {{ $value | humanize }}% on {{ $labels.instance }}

Node Down

severity: critical
expr: up{job="node"} == 0
for: 1m
annotation: Node {{ $labels.instance }} has been unreachable for 1 minute

High Memory Usage

severity: warning
expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
for: 5m
annotation: Only {{ $value | humanize }}% memory available on {{ $labels.instance }}

High Error Rate (HTTP)

severity: critical
expr: sum by(service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(service) (rate(http_requests_total[5m])) * 100 > 5
for: 5m
annotation: {{ $value | humanize }}% error rate on {{ $labels.service }}

Disk Space Low

severity: warning
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
for: 15m
annotation: Only {{ $value | humanize }}% disk space on {{ $labels.instance }}

Alertmanager Configuration Reference

The alertmanager.yml file has four core top-level sections (plus optional extras such as templates and time_intervals, the latter used in the example below):

global:

Default settings — SMTP host, Slack API URL, PagerDuty URL, resolve_timeout

route:

The routing tree — defines which receiver handles which alerts based on label matchers

receivers:

Notification integrations — Slack, PagerDuty, email, OpsGenie, webhook

inhibit_rules:

Silence lower-priority alerts when a higher-priority alert is already firing

Complete alertmanager.yml Example

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  # Default receiver for unmatched alerts
  receiver: slack-warnings
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # Wait before sending first notification
  group_interval: 5m     # Wait before sending updated notification for same group
  repeat_interval: 12h   # Re-notify if alert is still firing

  routes:
    # Critical alerts → PagerDuty (immediate)
    - match:
        severity: critical
      receiver: pagerduty-critical
      group_wait: 10s
      repeat_interval: 1h
      continue: false  # Stop routing after match

    # Platform team alerts → dedicated Slack channel
    - match:
        team: platform
      receiver: slack-platform
      continue: true  # Also check further routes

    # Silence noisy storage alerts during off-hours (optional)
    - match_re:
        alertname: ^(StorageNearCapacity|DiskLatencyHigh)$
      receiver: slack-warnings
      active_time_intervals:
        - business_hours

receivers:
  - name: slack-warnings
    slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
        send_resolved: true

  - name: slack-platform
    slack_configs:
      - channel: '#alerts-platform'
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}*{{ .Labels.instance }}*: {{ .Annotations.description }}{{ "\n" }}{{ end }}'
        send_resolved: true

  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
        send_resolved: true

inhibit_rules:
  # If a node is down, suppress all other alerts from that node
  - source_match:
      alertname: NodeDown
      severity: critical
    target_match_re:
      severity: ^(warning|info)$
    equal: ['instance']

  # If cluster is down, suppress all services in that cluster
  - source_match:
      alertname: ClusterDown
    target_match:
      team: platform
    equal: ['cluster']

time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

Understanding the Routing Tree

The route tree is evaluated top-down. Each incoming alert starts at the root route and descends into the first child route whose matchers it satisfies, repeating until no deeper match exists; the alert is then handled by that route's receiver (the root receiver if nothing else matched).

group_by

Labels used to group alerts into a single notification. Alerts with the same values for the group_by labels are batched together. Use ["alertname", "cluster"] to avoid 100 separate notifications for the same outage.

group_wait

Time to wait before sending the first notification for a new group. Allows related alerts to arrive before notifying (30s default). Lower for critical, higher for warnings.

group_interval

Minimum time between notifications for the same group after the first one. Prevents notification spam for flapping alerts (5m default).

repeat_interval

How often to re-send if an alert is still firing. Use 1h for critical (maintain urgency), 12h for warnings (avoid fatigue).

continue

If true, routing continues to the next sibling route after a match. If false (default), routing stops on first match.
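
Putting the timing knobs together, here is an illustrative root route (the values are examples, not recommendations) annotated with how a burst of related alerts gets delivered:

route:
  receiver: slack-warnings
  group_by: ['alertname', 'cluster']
  group_wait: 30s      # t=0: first alert arrives; one notification goes out at t=30s with everything that joined the group
  group_interval: 5m   # alerts that join the same group later are batched into updates at most every 5m
  repeat_interval: 12h # a group that keeps firing with no changes is re-sent every 12h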


Receiver Types & When to Use Each

Receiver | Best For | Setup
Slack | Dev teams — low-severity alerts, chat-first culture | Incoming Webhook URL from slack.com/apps
PagerDuty | On-call escalation — SEV1/SEV2 critical alerts | Integration key from PagerDuty service settings
Email | Low-urgency notifications, audit trails | SMTP host, auth credentials, from/to addresses
OpsGenie | On-call with advanced scheduling, mobile push | API key from OpsGenie integration
Webhook | Custom integrations, automation, ChatOps bots | Any HTTP endpoint accepting POST JSON
VictorOps / Splunk On-Call | Enterprise teams already using VictorOps | Routing key from Splunk On-Call service
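
Slack and PagerDuty receivers are shown in the examples above and below; for the generic webhook case, a minimal sketch looks like this (the URL is a placeholder for your own endpoint, which receives a JSON POST containing the grouped alerts with their labels, annotations, and status):

receivers:
  - name: webhook-automation
    webhook_configs:
      - url: 'http://automation.internal:8080/alertmanager'  # placeholder endpoint
        send_resolved: true
        max_alerts: 0  # 0 = no cap on alerts per payload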

Slack Receiver: Full Config

receivers:
  - name: slack-oncall
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T.../B.../...'
        channel: '#oncall-alerts'
        username: 'Alertmanager'
        icon_emoji: ':fire:'
        title: |-
          [{{ .Status | toUpper }}{{ if eq .Status "firing" }}: {{ .Alerts.Firing | len }}{{ end }}]
          {{ .CommonLabels.alertname }}
        title_link: 'https://grafana.example.com/d/...'
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Instance:* {{ .Labels.instance }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        send_resolved: true

amtool CLI: Essential Commands

amtool ships with Alertmanager and lets you manage silences, check routing, and query the API from the command line:

# Point amtool at your Alertmanager: pass --alertmanager.url per command,
# or set "alertmanager.url: http://localhost:9093" in ~/.config/amtool/config.yml
amtool --alertmanager.url=http://localhost:9093 alert

# List all firing alerts (assumes alertmanager.url is set in the config file)
amtool alert query

# List alerts filtered by label matchers
amtool alert query alertname=HighCPU severity=critical

# Create a silence (maintenance window)
amtool silence add --duration=2h \
  --comment="Planned maintenance 2026-05-03" \
  alertname="HighCPU" instance="prod-server-01"

# List active silences
amtool silence query

# Expire a silence immediately
amtool silence expire <silence-id>

# Validate config file (critical before deploy!)
amtool check-config alertmanager.yml

# Test routing: which receiver handles this alert?
amtool config routes test \
  --config.file=alertmanager.yml \
  alertname=NodeDown severity=critical team=platform

# Trigger Alertmanager to reload config (hot reload)
curl -X POST http://localhost:9093/-/reload

Avoiding Alert Fatigue: Best Practices

✅ Alert on symptoms, not causes

Alert on "users are experiencing errors" (high 5xx rate) not "CPU is 70%". Causes are for dashboards.

✅ Set meaningful for durations

A well-chosen for duration prevents false positives from transient spikes. Use a shorter duration on critical alerts (1-2m) and a longer one on warnings (10-15m).

✅ Every alert needs a runbook

Add runbook_url to every alert annotation. An alert without a runbook is an alert without a fix path.

✅ Group aggressively

Use group_by: [alertname, cluster] so 50 pods failing fires ONE notification, not 50.

✅ Use inhibition for parent/child relationships

If a node is down, inhibit disk/CPU/memory alerts from that node. One root cause = one page.

✅ Review silence usage monthly

If an alert is silenced more than it fires, fix the alert rule. Permanent silences are tech debt.

High Availability Alertmanager

Alertmanager supports native clustering with the gossip protocol (using --cluster.peer). A cluster of 3 nodes ensures alerting continues if one node fails, with automatic deduplication across instances:

# Node 1
./alertmanager \
  --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-2:9094 \
  --cluster.peer=alertmanager-3:9094

# Node 2
./alertmanager \
  --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-3:9094

# In prometheus.yml, list ALL Alertmanager nodes
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093
            - alertmanager-3:9093

Prometheus sends alerts to ALL Alertmanager instances, and the cluster deduplicates notifications via gossip: under normal operation you receive each notification only once even with three Alertmanager nodes. Deduplication is best-effort, so a brief network partition can occasionally produce a duplicate; the design deliberately favors a duplicate page over a missed one.


Alertmanager Limitations & When to Supplement It

No built-in on-call scheduling

Alertmanager routes to receivers but has no rotation schedules. Use PagerDuty or OpsGenie for on-call management, not raw Alertmanager webhooks.

Config file changes require reload

While reload is hot (SIGHUP or /-/reload), it still requires a manual step. Tools like Grafana Alerting or Better Stack have UI-based rule management.

No built-in incident grouping/timeline

Alertmanager shows firing alerts but doesn't build incident timelines. For post-mortems and incident tracking, you need a dedicated incident management tool.

Alert state is only partially persisted

Alertmanager keeps firing-alert state in memory and periodically snapshots silences and the notification log to --storage.path (data/ by default). If that directory is ephemeral, for example a container without a mounted volume, silences and deduplication state are lost on restart; and there is no long-term alert history in either case.
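
If you run the prom/alertmanager container, mount a volume at the storage path so silences and the notification log survive restarts. A sketch, with placeholder paths:

docker run -d -p 9093:9093 \
  -v /path/to/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  -v alertmanager-data:/alertmanager \
  prom/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/alertmanager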

Frequently Asked Questions

What is Prometheus Alertmanager?

Alertmanager is the dedicated alerting component of the Prometheus ecosystem. Prometheus evaluates alerting rules and sends firing alerts to Alertmanager, which then handles deduplication, grouping, silencing, inhibition, and routing to receivers like Slack, PagerDuty, or email. It runs as a separate binary (alertmanager) and is typically deployed alongside Prometheus.

What is the difference between Prometheus alerting rules and Alertmanager?

Prometheus alerting rules (in .rules.yml files) define WHEN an alert fires — using PromQL expressions and a for duration. Alertmanager handles WHAT HAPPENS after an alert fires — routing it to the right team, grouping related alerts into one notification, silencing during maintenance, and deduplicating repeated firings. Both are required for a functional alerting pipeline.

How do I silence alerts in Alertmanager?

Silences can be created via the Alertmanager web UI (default port 9093), the amtool CLI (amtool silence add alertname=HighCPU --duration=2h --comment="Planned maintenance"), or the Alertmanager HTTP API (POST /api/v2/silences). Silences match alerts by label matchers and expire after the specified duration. They do not affect alert evaluation in Prometheus — only notification delivery.
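
Under the hood a silence is just a JSON document with label matchers and a time window. A sketch of the API call (the times, labels, and author are illustrative):

# Create a maintenance silence via the HTTP API
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
        "matchers": [
          {"name": "alertname", "value": "HighCPU", "isRegex": false},
          {"name": "instance", "value": "prod-server-01", "isRegex": false}
        ],
        "startsAt": "2026-05-03T22:00:00Z",
        "endsAt": "2026-05-04T00:00:00Z",
        "createdBy": "ops@example.com",
        "comment": "Planned maintenance"
      }'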

What is alert inhibition in Alertmanager?

Inhibition rules suppress lower-priority alerts when a higher-priority alert is already firing. Example: if a NodeDown alert fires, inhibit all alerts from that node so you get one notification instead of dozens. Configured under inhibit_rules in alertmanager.yml. Source labels define the inhibiting alert, target labels define which alerts to suppress, and equal labels must match on both.

How do I test Alertmanager configuration without restarting?

Use amtool check-config alertmanager.yml to validate syntax. To reload without restart, send a SIGHUP (kill -HUP <pid>) or POST to /-/reload endpoint. To test routing, use amtool config routes test --config.file=alertmanager.yml alertname=HighCPU severity=critical to see which receiver would handle that alert without firing anything.
