Your CI/CD pipeline is the heartbeat of your engineering team. When it's healthy, developers ship confidently and frequently. When it's broken — tests flake, builds queue for 45 minutes, deployments silently fail — velocity collapses and trust erodes. Yet most teams monitor their production services far more carefully than they monitor their pipelines.
This guide covers how to instrument, measure, and alert on CI/CD pipeline health — from GitHub Actions workflow monitoring to DORA metric tracking to deployment health checks.
What to Monitor in a CI/CD Pipeline
CI/CD monitoring operates at three levels:
- Pipeline health — are builds succeeding? How long do they take? Are there flaky tests?
- Deployment health — did the deploy succeed? Did error rates change? Are endpoints responding?
- Delivery performance (DORA) — are we shipping faster and more reliably than last quarter?
DORA Metrics: The North Star for Delivery Teams
The DORA (DevOps Research and Assessment) research program defines four metrics that consistently correlate with software delivery and organizational performance. Measure these, and you have an objective view of your delivery health.
| DORA Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | Multiple/day | 1/day–1/week | 1/week–1/month | < once/month |
| Lead Time for Changes | < 1 hour | 1 day–1 week | 1 week–1 month | > 1 month |
| Change Failure Rate | 0–5% | 5–10% | 10–15% | > 15% |
| Mean Time to Recovery | < 1 hour | < 1 day | 1–7 days | > 6 months |
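You don't need a dedicated platform to start measuring these. As a rough sketch, deployment frequency can be read straight out of your deploy workflow's run history; OWNER/REPO, the deploy.yml workflow file, and the assumption that every successful run equals one production deploy are all placeholders here:

```bash
# Count successful production deploys in the last 30 days via the GitHub API.
# Requires jq and GNU date; OWNER/REPO and deploy.yml are placeholders.
SINCE=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)

curl -s -H "Authorization: Bearer $GH_TOKEN" \
  "https://api.github.com/repos/OWNER/REPO/actions/workflows/deploy.yml/runs?per_page=100&status=success" \
  | jq -r --arg since "$SINCE" \
      '[.workflow_runs[] | select(.created_at >= $since)] | length
       | "Deployments in the last 30 days: \(.)"'
```

Change Failure Rate and MTTR need incident or rollback data on top of this, which is why they usually come from correlating deploy events with your incident tracker rather than from CI history alone.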
Monitoring GitHub Actions Workflows
Native GitHub Monitoring
GitHub provides built-in workflow analytics under Repository → Actions → Insights. You get run history, success/failure rates, and duration trends per workflow. For most teams, this is the starting point before adding third-party tools.
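The same numbers are easy to pull programmatically. For example, a quick success-rate check with the GitHub CLI (assuming gh is installed and authenticated; ci.yml is a placeholder workflow file):

```bash
# Success rate over the last 50 runs of a workflow, via the GitHub CLI and jq.
gh run list --workflow ci.yml --limit 50 --json conclusion \
  | jq '{
      runs: length,
      success_rate_pct:
        (100 * ([.[] | select(.conclusion == "success")] | length) / length)
    }'
```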
Alerting on Workflow Failures
Use the workflow_run trigger to notify when a critical pipeline fails:
# .github/workflows/notify-on-failure.yml
name: Notify on Pipeline Failure
on:
workflow_run:
workflows: ["CI", "Deploy to Production"]
types: [completed]
jobs:
notify:
if: ${{ github.event.workflow_run.conclusion == 'failure' }}
runs-on: ubuntu-latest
steps:
- name: Notify Slack
uses: slackapi/slack-github-action@v1.26.0
with:
payload: |
{
"text": "❌ Pipeline failed: ${{ github.event.workflow_run.name }}",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Pipeline Failed*\n*Workflow:* ${{ github.event.workflow_run.name }}\n*Branch:* ${{ github.event.workflow_run.head_branch }}\n*Commit:* ${{ github.event.workflow_run.head_sha }}\n<${{ github.event.workflow_run.html_url }}|View Run>"
}
}
]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
Tracking Pipeline Duration
# GitHub API: Get recent workflow run durations
curl -H "Authorization: Bearer $GH_TOKEN" \
"https://api.github.com/repos/OWNER/REPO/actions/workflows/ci.yml/runs?per_page=20" \
| jq '[.workflow_runs[] | {
id: .id,
status: .status,
conclusion: .conclusion,
created_at: .created_at,
updated_at: .updated_at,
run_started_at: .run_started_at
}]'
# Calculate P50/P95 duration from the output to detect slowdowns
# Alert when P95 duration exceeds your SLO (e.g., 20 minutes for CI)
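To turn that output into actual percentiles, here is a rough sketch; it assumes the JSON array from the command above has been saved to runs.json and that GNU date is available:

```bash
# Convert each completed run into a duration in minutes, then pick rough P50/P95.
jq -r '.[] | select(.conclusion != null) | "\(.run_started_at) \(.updated_at)"' runs.json \
  | while read -r start end; do
      echo $(( ( $(date -d "$end" +%s) - $(date -d "$start" +%s) ) / 60 ))
    done | sort -n > durations.txt

COUNT=$(wc -l < durations.txt)
P50=$(sed -n "$(( (COUNT * 50 + 99) / 100 ))p" durations.txt)
P95=$(sed -n "$(( (COUNT * 95 + 99) / 100 ))p" durations.txt)
echo "P50: ${P50}m  P95: ${P95}m over the last $COUNT runs"
```

A scheduled job can run this periodically and alert when P95 drifts past your SLO.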
Post-Deploy Monitoring: The Missing Last Mile
A successful pipeline run doesn't guarantee a successful deployment. The most dangerous failure mode is a deploy that passes all tests but breaks in production due to configuration drift, infrastructure differences, or downstream dependencies.
Post-deploy monitoring closes this gap. After every deploy:
- Run smoke tests against the production endpoint immediately after deploy
- Watch error rates for 10–15 minutes post-deploy (spikes indicate a bad release)
- Monitor latency at P95 and P99 — performance regressions appear in tail latency first
- Check health endpoints on all services that were updated
- Watch rollout progress (Kubernetes): ensure all pods reach the Running state (a minimal check is sketched just after this list)
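For that last Kubernetes check, a minimal sketch (the web Deployment, production namespace, and app=web label are all placeholders for your own resources):

```bash
# Fail the pipeline and roll back if the rollout doesn't complete in 5 minutes.
if ! kubectl rollout status deployment/web -n production --timeout=300s; then
  echo "❌ Rollout did not complete in time, rolling back"
  kubectl rollout undo deployment/web -n production
  exit 1
fi

# Surface any pods that still aren't Running after the rollout reports success.
kubectl get pods -n production -l app=web --field-selector=status.phase!=Running
```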
# Post-deploy smoke test in GitHub Actions
- name: Run post-deploy smoke tests
run: |
# Wait for deployment to propagate
sleep 30
# Test critical endpoints
ENDPOINTS=(
"https://api.yourapp.com/health"
"https://api.yourapp.com/v1/status"
"https://yourapp.com"
)
for endpoint in "${ENDPOINTS[@]}"; do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$endpoint")
if [ "$STATUS" != "200" ]; then
echo "❌ Smoke test failed: $endpoint returned $STATUS"
# Trigger rollback
kubectl rollout undo deployment/web
exit 1
fi
echo "✅ $endpoint: $STATUS"
done
Flaky Test Detection and Management
Flaky tests — tests that pass and fail non-deterministically — are the #1 cause of CI/CD trust erosion. When developers routinely re-run pipelines hoping for a pass, your pipeline is sending noise, not signal.
| Flakiness Rate | Impact | Action |
|---|---|---|
| < 1% | Acceptable noise | Monitor, investigate if rising |
| 1–5% | Noticeable developer friction | Quarantine and fix within sprint |
| 5–20% | Pipeline trust breaking down | Immediate: quarantine and fix |
| > 20% | CI/CD effectively broken | Engineering escalation required |
Tools like BuildPulse, Trunk Flaky Tests, and Datadog CI Visibility automatically detect flaky tests by analyzing historical run data and flagging tests with inconsistent results. Most can auto-quarantine flaky tests so they don't block pipelines while the team investigates.
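If you're not ready for a dedicated tool, run history alone gives a rough signal: a commit whose CI both failed and succeeded without any code change is a strong flaky-test candidate. A sketch against the GitHub API (OWNER/REPO and ci.yml are placeholders):

```bash
# Group recent CI runs by commit SHA and flag commits with mixed pass/fail results.
curl -s -H "Authorization: Bearer $GH_TOKEN" \
  "https://api.github.com/repos/OWNER/REPO/actions/workflows/ci.yml/runs?per_page=100" \
  | jq -r '.workflow_runs
      | group_by(.head_sha)
      | map(select(
          (map(.conclusion) | index("success")) and
          (map(.conclusion) | index("failure"))
        ))
      | .[] | "Suspected flaky commit: \(.[0].head_sha) (\(length) runs)"'
```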
CI/CD Monitoring Tool Comparison
| Tool | Best For | DORA Metrics | Flaky Test Detection | Pricing |
|---|---|---|---|---|
| GitHub Insights | GitHub Actions baseline | Partial | ❌ | Free (with GitHub) |
| Datadog CI Visibility | Enterprise full-stack | ✅ Full | ✅ | $26/committer/mo |
| LinearB | Engineering metrics + DORA | ✅ Full | ❌ | $17/developer/mo |
| BuildPulse | Flaky test elimination | Partial | ✅ (primary feature) | $125/mo |
| Trunk | Code quality + CI health | ✅ | ✅ | Free tier / $12/user |
| New Relic | Full observability stack | ✅ | Partial | Free 100GB/mo |
Deployment Safety: Feature Flags and Canary Deploys
Even the most thorough CI/CD monitoring works best when paired with deployment strategies that reduce blast radius. Two techniques every team should adopt:
- Feature flags. Ship code continuously but activate features for a percentage of users. If error rates spike, disable the flag instantly — no rollback required. Tools: LaunchDarkly, Flagsmith, Unleash, GrowthBook.
- Canary deployments. Route 5–10% of traffic to the new version, monitor error rates and latency for 30 minutes, then promote or roll back. On Kubernetes, tools like Argo Rollouts or Flagger automate this and can auto-roll back based on Prometheus metrics.
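The metric analysis that Argo Rollouts and Flagger automate boils down to a PromQL comparison. A hand-rolled sketch of the same check (it assumes a Prometheus server at $PROM_URL, a conventional http_requests_total counter, and the Argo Rollouts kubectl plugin; adjust the query to whatever your ingress or mesh actually exposes):

```bash
# Compare the canary's 5xx rate over the last 5 minutes against a 5% budget.
ERROR_RATE=$(curl -s "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{service="web",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="web"}[5m]))' \
  | jq -r '.data.result[0].value[1] // "0"')

if awk -v rate="$ERROR_RATE" 'BEGIN { exit !(rate > 0.05) }'; then
  echo "❌ Canary error rate ${ERROR_RATE} exceeds 5%, aborting rollout"
  kubectl argo rollouts abort web   # Argo Rollouts kubectl plugin
  exit 1
fi
echo "✅ Canary error rate ${ERROR_RATE} is within budget"
```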
# Argo Rollouts: automated canary with metric-based promotion
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: web
spec:
strategy:
canary:
steps:
- setWeight: 10 # Route 10% of traffic to new version
- pause: { duration: 10m }
- setWeight: 50
- pause: { duration: 10m }
analysis:
templates:
- templateName: error-rate-check
startingStep: 1
args:
- name: service-name
value: web
# analysis template checks error rate stays below 5%
# auto-rollback if threshold exceeded
FAQ
How do I calculate Lead Time for Changes?
Lead Time for Changes = time from first commit to production deployment. Most CI/CD tools don't track this natively. Tools like LinearB, Faros, and Sleuth calculate it by correlating commit timestamps (from your VCS) with deployment events (from your CD platform). At minimum, you need a commit or pull-request timestamp for each change plus the timestamp of the deploy that shipped it.
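As a very rough approximation from GitHub data alone, you can compare the deployed commit's timestamp with the deploy workflow's finish time. This measures the last commit rather than the first, and assumes deploys go through a "Deploy to Production" workflow (OWNER/REPO is a placeholder):

```bash
# Lead time (commit to production) for the most recent successful deploy.
RUN=$(gh run list --workflow "Deploy to Production" --status success \
  --limit 1 --json headSha,updatedAt --jq '.[0]')
SHA=$(echo "$RUN" | jq -r '.headSha')
DEPLOYED_AT=$(echo "$RUN" | jq -r '.updatedAt')

# Timestamp of the deployed commit (GNU date; use gdate on macOS).
COMMITTED_AT=$(gh api "repos/OWNER/REPO/commits/$SHA" --jq '.commit.committer.date')
HOURS=$(( ( $(date -d "$DEPLOYED_AT" +%s) - $(date -d "$COMMITTED_AT" +%s) ) / 3600 ))
echo "Lead time for $SHA: ${HOURS}h"
```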
What's the difference between Change Failure Rate and pipeline failure rate?
Pipeline failure rate is how often your CI pipeline fails (tests failing, build errors). Change Failure Rate is how often deployments that reach production cause incidents or require rollback. Pipeline failures are caught before production; change failures reach users. Both matter but for different reasons — high pipeline failure rates slow developers, high change failure rates break production.
Should I alert on every failed pipeline run?
No — per-failure alerts for high-volume pipelines create noise and are ignored. Instead, alert on: (1) failure rate exceeding your threshold (e.g., > 20% of runs failing in the last hour), (2) any failure on your main branch, (3) deployment pipelines specifically failing (they have direct production impact). Feature branch CI failures should create a Slack notification, not a page.
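Rule (1) fits in a small scheduled job. A sketch (OWNER/REPO is a placeholder; assumes GH_TOKEN and SLACK_WEBHOOK_URL are set and GNU date is available):

```bash
# Post to Slack if more than 20% of workflow runs failed in the last hour.
SINCE=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)

read -r TOTAL FAILED < <(curl -s -G -H "Authorization: Bearer $GH_TOKEN" \
  --data-urlencode "created=>$SINCE" --data-urlencode "per_page=100" \
  "https://api.github.com/repos/OWNER/REPO/actions/runs" \
  | jq -r '.workflow_runs
      | [length, ([.[] | select(.conclusion == "failure")] | length)]
      | @tsv')

if [ "$TOTAL" -gt 0 ] && [ $(( FAILED * 100 / TOTAL )) -gt 20 ]; then
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"⚠️ CI failure rate: ${FAILED}/${TOTAL} runs failed in the last hour\"}" \
    "$SLACK_WEBHOOK_URL"
fi
```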