Your CI/CD pipeline is the heartbeat of your engineering team. When it's healthy, developers ship confidently and frequently. When it's broken — tests flake, builds queue for 45 minutes, deployments silently fail — velocity collapses and trust erodes. Yet most teams monitor their production services far more carefully than they monitor their pipelines.
This guide covers how to instrument, measure, and alert on CI/CD pipeline health — from GitHub Actions workflow monitoring to DORA metric tracking to deployment health checks.
What to Monitor in a CI/CD Pipeline
CI/CD monitoring operates at three levels:
- Pipeline health — are builds succeeding? How long do they take? Are there flaky tests?
- Deployment health — did the deploy succeed? Did error rates change? Are endpoints responding?
- Delivery performance (DORA) — are we shipping faster and more reliably than last quarter?
DORA Metrics: The North Star for Delivery Teams
The DORA (DevOps Research and Assessment) research program defines four metrics that consistently correlate with software delivery and organizational performance. Measure these, and you have an objective view of your delivery health.
| DORA Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | Multiple/day | 1/day–1/week | 1/week–1/month | < once/month |
| Lead Time for Changes | < 1 hour | 1 day–1 week | 1 week–1 month | > 1 month |
| Change Failure Rate | 0–5% | 5–10% | 10–15% | > 15% |
| Mean Time to Recovery | < 1 hour | < 1 day | 1–7 days | > 6 months |
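You don't need a dedicated platform to start measuring these. As a rough sketch, deployment frequency can be read straight out of your deploy workflow's run history; OWNER/REPO, the deploy.yml workflow file, and the assumption that every successful run equals one production deploy are all placeholders here:

```bash
# Count successful production deploys in the last 30 days via the GitHub API.
# Requires jq and GNU date; OWNER/REPO and deploy.yml are placeholders.
SINCE=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)

curl -s -H "Authorization: Bearer $GH_TOKEN" \
  "https://api.github.com/repos/OWNER/REPO/actions/workflows/deploy.yml/runs?per_page=100&status=success" \
  | jq -r --arg since "$SINCE" \
      '[.workflow_runs[] | select(.created_at >= $since)] | length
       | "Deployments in the last 30 days: \(.)"'
```

Change Failure Rate and MTTR need incident or rollback data on top of this, which is why they usually come from correlating deploy events with your incident tracker rather than from CI history alone.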
Monitoring GitHub Actions Workflows
Native GitHub Monitoring
GitHub provides built-in workflow analytics under Repository → Actions → Insights. You get run history, success/failure rates, and duration trends per workflow. For most teams, this is the starting point before adding third-party tools.
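The same numbers are easy to pull programmatically. For example, a quick success-rate check with the GitHub CLI (assuming gh is installed and authenticated; ci.yml is a placeholder workflow file):

```bash
# Success rate over the last 50 runs of a workflow, via the GitHub CLI and jq.
gh run list --workflow ci.yml --limit 50 --json conclusion \
  | jq '{
      runs: length,
      success_rate_pct:
        (100 * ([.[] | select(.conclusion == "success")] | length) / length)
    }'
```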
Alerting on Workflow Failures
Use the workflow_run trigger to notify when a critical pipeline fails:
# .github/workflows/notify-on-failure.yml
name: Notify on Pipeline Failure
on:
workflow_run:
workflows: ["CI", "Deploy to Production"]
types: [completed]
jobs:
notify:
if: ${{ github.event.workflow_run.conclusion == 'failure' }}
runs-on: ubuntu-latest
steps:
- name: Notify Slack
uses: slackapi/slack-github-action@v1.26.0
with:
payload: |
{
"text": "❌ Pipeline failed: ${{ github.event.workflow_run.name }}",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Pipeline Failed*\n*Workflow:* ${{ github.event.workflow_run.name }}\n*Branch:* ${{ github.event.workflow_run.head_branch }}\n*Commit:* ${{ github.event.workflow_run.head_sha }}\n<${{ github.event.workflow_run.html_url }}|View Run>"
}
}
]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
Tracking Pipeline Duration
# GitHub API: Get recent workflow run durations
curl -H "Authorization: Bearer $GH_TOKEN" \
"https://api.github.com/repos/OWNER/REPO/actions/workflows/ci.yml/runs?per_page=20" \
| jq '[.workflow_runs[] | {
id: .id,
status: .status,
conclusion: .conclusion,
created_at: .created_at,
updated_at: .updated_at,
run_started_at: .run_started_at
}]'
# Calculate P50/P95 duration from the output to detect slowdowns
# Alert when P95 duration exceeds your SLO (e.g., 20 minutes for CI)
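To turn that output into actual percentiles, here is a rough sketch; it assumes the JSON array from the command above has been saved to runs.json and that GNU date is available:

```bash
# Convert each completed run into a duration in minutes, then pick rough P50/P95.
jq -r '.[] | select(.conclusion != null) | "\(.run_started_at) \(.updated_at)"' runs.json \
  | while read -r start end; do
      echo $(( ( $(date -d "$end" +%s) - $(date -d "$start" +%s) ) / 60 ))
    done | sort -n > durations.txt

COUNT=$(wc -l < durations.txt)
P50=$(sed -n "$(( (COUNT * 50 + 99) / 100 ))p" durations.txt)
P95=$(sed -n "$(( (COUNT * 95 + 99) / 100 ))p" durations.txt)
echo "P50: ${P50}m  P95: ${P95}m over the last $COUNT runs"
```

A scheduled job can run this periodically and alert when P95 drifts past your SLO.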
Post-Deploy Monitoring: The Missing Last Mile
A successful pipeline run doesn't guarantee a successful deployment. The most dangerous failure mode is a deploy that passes all tests but breaks in production due to configuration drift, infrastructure differences, or downstream dependencies.
Post-deploy monitoring closes this gap. After every deploy:
- Run smoke tests against the production endpoint immediately after deploy
- Watch error rates for 10–15 minutes post-deploy (spikes indicate a bad release)
- Monitor latency at P95 and P99 — performance regressions appear in tail latency first
- Check health endpoints on all services that were updated
- Watch rollout progress (Kubernetes): ensure all pods reach the Running state (a minimal check is sketched just after this list)
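For that last Kubernetes check, a minimal sketch (the web Deployment, production namespace, and app=web label are all placeholders for your own resources):

```bash
# Fail the pipeline and roll back if the rollout doesn't complete in 5 minutes.
if ! kubectl rollout status deployment/web -n production --timeout=300s; then
  echo "❌ Rollout did not complete in time, rolling back"
  kubectl rollout undo deployment/web -n production
  exit 1
fi

# Surface any pods that still aren't Running after the rollout reports success.
kubectl get pods -n production -l app=web --field-selector=status.phase!=Running
```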
# Post-deploy smoke test in GitHub Actions
- name: Run post-deploy smoke tests
run: |
# Wait for deployment to propagate
sleep 30
# Test critical endpoints
ENDPOINTS=(
"https://api.yourapp.com/health"
"https://api.yourapp.com/v1/status"
"https://yourapp.com"
)
for endpoint in "${ENDPOINTS[@]}"; do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$endpoint")
if [ "$STATUS" != "200" ]; then
echo "❌ Smoke test failed: $endpoint returned $STATUS"
# Trigger rollback
kubectl rollout undo deployment/web
exit 1
fi
echo "✅ $endpoint: $STATUS"
done
Flaky Test Detection and Management
Flaky tests — tests that pass and fail non-deterministically — are the #1 cause of CI/CD trust erosion. When developers routinely re-run pipelines hoping for a pass, your pipeline is sending noise, not signal.
| Flakiness Rate | Impact | Action |
|---|---|---|
| < 1% | Acceptable noise | Monitor, investigate if rising |
| 1–5% | Noticeable developer friction | Quarantine and fix within sprint |
| 5–20% | Pipeline trust breaking down | Immediate: quarantine and fix |
| > 20% | CI/CD effectively broken | Engineering escalation required |
Tools like BuildPulse, Trunk Flaky Tests, and Datadog CI Visibility automatically detect flaky tests by analyzing historical run data and flagging tests with inconsistent results. Most can auto-quarantine flaky tests so they don't block pipelines while the team investigates.
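If you're not ready for a dedicated tool, run history alone gives a rough signal: a commit whose CI both failed and succeeded without any code change is a strong flaky-test candidate. A sketch against the GitHub API (OWNER/REPO and ci.yml are placeholders):

```bash
# Group recent CI runs by commit SHA and flag commits with mixed pass/fail results.
curl -s -H "Authorization: Bearer $GH_TOKEN" \
  "https://api.github.com/repos/OWNER/REPO/actions/workflows/ci.yml/runs?per_page=100" \
  | jq -r '.workflow_runs
      | group_by(.head_sha)
      | map(select(
          (map(.conclusion) | index("success")) and
          (map(.conclusion) | index("failure"))
        ))
      | .[] | "Suspected flaky commit: \(.[0].head_sha) (\(length) runs)"'
```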
CI/CD Monitoring Tool Comparison
| Tool | Best For | DORA Metrics | Flaky Test Detection | Pricing |
|---|---|---|---|---|
| GitHub Insights | GitHub Actions baseline | Partial | ❌ | Free (with GitHub) |
| Datadog CI Visibility | Enterprise full-stack | ✅ Full | ✅ | $26/committer/mo |
| LinearB | Engineering metrics + DORA | ✅ Full | ❌ | $17/developer/mo |
| BuildPulse | Flaky test elimination | Partial | ✅ (primary feature) | $125/mo |
| Trunk | Code quality + CI health | ✅ | ✅ | Free tier / $12/user |
| New Relic | Full observability stack | ✅ | Partial | Free 100GB/mo |
Deployment Safety: Feature Flags and Canary Deploys
Even the most thorough CI/CD monitoring works best when paired with deployment strategies that reduce blast radius. Two techniques every team should adopt:
- Feature flags. Ship code continuously but activate features for a percentage of users. If error rates spike, disable the flag instantly — no rollback required. Tools: LaunchDarkly, Flagsmith, Unleash, GrowthBook.
- Canary deployments. Route 5–10% of traffic to the new version, monitor error rates and latency for 30 minutes, then promote or roll back. On Kubernetes, tools like Argo Rollouts or Flagger automate this and can auto-roll back based on Prometheus metrics.
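The metric analysis that Argo Rollouts and Flagger automate boils down to a PromQL comparison. A hand-rolled sketch of the same check (it assumes a Prometheus server at $PROM_URL, a conventional http_requests_total counter, and the Argo Rollouts kubectl plugin; adjust the query to whatever your ingress or mesh actually exposes):

```bash
# Compare the canary's 5xx rate over the last 5 minutes against a 5% budget.
ERROR_RATE=$(curl -s "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{service="web",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="web"}[5m]))' \
  | jq -r '.data.result[0].value[1] // "0"')

if awk -v rate="$ERROR_RATE" 'BEGIN { exit !(rate > 0.05) }'; then
  echo "❌ Canary error rate ${ERROR_RATE} exceeds 5%, aborting rollout"
  kubectl argo rollouts abort web   # Argo Rollouts kubectl plugin
  exit 1
fi
echo "✅ Canary error rate ${ERROR_RATE} is within budget"
```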
# Argo Rollouts: automated canary with metric-based promotion
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: web
spec:
strategy:
canary:
steps:
- setWeight: 10 # Route 10% of traffic to new version
- pause: { duration: 10m }
- setWeight: 50
- pause: { duration: 10m }
analysis:
templates:
- templateName: error-rate-check
startingStep: 1
args:
- name: service-name
value: web
# analysis template checks error rate stays below 5%
# auto-rollback if threshold exceeded
FAQ
How do I calculate Lead Time for Changes?
Lead Time for Changes = time from first commit to production deployment. Most CI/CD tools don't track this natively. Tools like LinearB, Faros, and Sleuth calculate it by correlating commit timestamps (from your VCS) with deployment events (from your CD platform). At minimum, you need a commit or pull-request timestamp for each change plus the timestamp of the deploy that shipped it.
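As a very rough approximation from GitHub data alone, you can compare the deployed commit's timestamp with the deploy workflow's finish time. This measures the last commit rather than the first, and assumes deploys go through a "Deploy to Production" workflow (OWNER/REPO is a placeholder):

```bash
# Lead time (commit to production) for the most recent successful deploy.
RUN=$(gh run list --workflow "Deploy to Production" --status success \
  --limit 1 --json headSha,updatedAt --jq '.[0]')
SHA=$(echo "$RUN" | jq -r '.headSha')
DEPLOYED_AT=$(echo "$RUN" | jq -r '.updatedAt')

# Timestamp of the deployed commit (GNU date; use gdate on macOS).
COMMITTED_AT=$(gh api "repos/OWNER/REPO/commits/$SHA" --jq '.commit.committer.date')
HOURS=$(( ( $(date -d "$DEPLOYED_AT" +%s) - $(date -d "$COMMITTED_AT" +%s) ) / 3600 ))
echo "Lead time for $SHA: ${HOURS}h"
```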
What's the difference between Change Failure Rate and pipeline failure rate?
Pipeline failure rate is how often your CI pipeline fails (tests failing, build errors). Change Failure Rate is how often deployments that reach production cause incidents or require rollback. Pipeline failures are caught before production; change failures reach users. Both matter but for different reasons — high pipeline failure rates slow developers, high change failure rates break production.
Should I alert on every failed pipeline run?
No — per-failure alerts for high-volume pipelines create noise and are ignored. Instead, alert on: (1) failure rate exceeding your threshold (e.g., > 20% of runs failing in the last hour), (2) any failure on your main branch, (3) deployment pipelines specifically failing (they have direct production impact). Feature branch CI failures should create a Slack notification, not a page.
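Rule (1) fits in a small scheduled job. A sketch (OWNER/REPO is a placeholder; assumes GH_TOKEN and SLACK_WEBHOOK_URL are set and GNU date is available):

```bash
# Post to Slack if more than 20% of workflow runs failed in the last hour.
SINCE=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)

read -r TOTAL FAILED < <(curl -s -G -H "Authorization: Bearer $GH_TOKEN" \
  --data-urlencode "created=>$SINCE" --data-urlencode "per_page=100" \
  "https://api.github.com/repos/OWNER/REPO/actions/runs" \
  | jq -r '.workflow_runs
      | [length, ([.[] | select(.conclusion == "failure")] | length)]
      | @tsv')

if [ "$TOTAL" -gt 0 ] && [ $(( FAILED * 100 / TOTAL )) -gt 20 ]; then
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"⚠️ CI failure rate: ${FAILED}/${TOTAL} runs failed in the last hour\"}" \
    "$SLACK_WEBHOOK_URL"
fi
```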