Is Grafana Cloud Down? How to Check Grafana Status in Real-Time

Quick Answer: To check if Grafana Cloud is down, visit apistatuscheck.com/api/grafana for real-time monitoring, or check the official status.grafana.com page. Common signs include dashboard loading failures, data source connection timeouts, alerting delays, Prometheus/Loki ingestion problems, panel rendering issues, and SSO authentication failures.

When your monitoring and observability platform goes dark, you lose the eyes and ears of your infrastructure. Grafana Cloud serves as the central nervous system for thousands of engineering teams worldwide, aggregating metrics, logs, and traces from countless data sources. Any disruption doesn't just mean you can't view dashboards—it means you're flying blind during incidents, missing critical alerts, and potentially letting production issues escalate unnoticed. Understanding how to quickly verify Grafana's status and implement fallback strategies can mean the difference between a 5-minute incident and a multi-hour outage.

How to Check Grafana Cloud Status in Real-Time

1. API Status Check (Fastest Method)

The quickest way to verify Grafana Cloud's operational status is through apistatuscheck.com/api/grafana. This real-time monitoring service:

  • Tests actual API endpoints every 60 seconds across all regions
  • Monitors dashboard loading times and rendering performance
  • Tracks data source connectivity (Prometheus, Loki, Tempo)
  • Validates alerting pipeline health and delivery
  • Provides instant alerts when degradation is detected
  • Shows historical uptime over 30/60/90 day periods

Unlike status pages that depend on manual updates, API Status Check performs active health checks against Grafana Cloud's production infrastructure, giving you the most accurate real-time picture of service availability across all critical components.

2. Official Grafana Status Page

Grafana maintains status.grafana.com as their official communication channel for service incidents. The page displays:

  • Current operational status for all Grafana Cloud services
  • Active incidents and ongoing investigations
  • Scheduled maintenance windows and upgrade notifications
  • Historical incident reports with root cause analysis
  • Component-specific status breakdown:
    • Grafana UI & Dashboards
    • Prometheus (metrics ingestion & querying)
    • Loki (log aggregation & search)
    • Tempo (distributed tracing)
    • Grafana Alerting
    • Synthetic Monitoring
    • OnCall rotation service

Pro tip: Subscribe to status updates via email, SMS, Slack, or webhook on the status page to receive immediate notifications when incidents occur. Each component can be subscribed to independently.
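
Status pages built on the common hosted status-page platforms usually expose a machine-readable summary at /api/v2/status.json. Assuming status.grafana.com follows that convention (an assumption worth verifying for your stack), a minimal poller might look like:

```python
# poll_status_page.py - Poll a Statuspage-style JSON endpoint.
# Assumption: the page follows the common /api/v2/status.json convention.
import json
import urllib.request

def fetch_status(base_url: str) -> str:
    """Return the overall status indicator, e.g. 'none', 'minor', 'major'."""
    with urllib.request.urlopen(f"{base_url}/api/v2/status.json", timeout=10) as resp:
        payload = json.load(resp)
    return payload["status"]["indicator"]

def is_healthy(indicator: str) -> bool:
    """In the Statuspage convention, 'none' means no active incident."""
    return indicator == "none"

# usage (requires network):
#   indicator = fetch_status("https://status.grafana.com")
```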

3. Check Your Dashboard Access

If your Grafana Cloud instance at your-stack.grafana.net is showing issues, this often indicates broader infrastructure problems. Key indicators include:

  • Login page timeouts or 502/503 errors
  • Dashboard list failing to load
  • "Data source unavailable" errors across multiple panels
  • Infinite loading spinners on panel refresh
  • Query editor interface unresponsive
  • Settings pages failing to save changes

4. Test Data Source Connectivity

For operators managing critical monitoring infrastructure, direct data source testing confirms backend health:

Prometheus query test:

# Query Grafana Cloud Prometheus directly
curl -G "https://prometheus-prod-01-eu-west-0.grafana.net/api/v1/query" \
  --data-urlencode "query=up" \
  -u "YOUR_INSTANCE_ID:YOUR_API_KEY"

Loki logs query test:

# Test Loki log ingestion and query
curl -G "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/query" \
  --data-urlencode 'query={job="varlogs"}' \
  -u "YOUR_USER_ID:YOUR_API_KEY"

Look for HTTP response codes outside the 2xx range, connection timeouts exceeding 30 seconds, or authentication failures that weren't occurring previously.
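
Those rules of thumb can be encoded in a small helper for scripting repeated checks. `classify_response` is a hypothetical name, not a Grafana API; the thresholds mirror the ones above:

```python
# classify_health.py - Turn an HTTP status code (or timeout) into a verdict,
# following the rules of thumb above. classify_response is a hypothetical
# helper, not part of any Grafana client library.
from typing import Optional

def classify_response(status_code: Optional[int], elapsed_seconds: float) -> str:
    # None status or >30s elapsed means the request never completed.
    if status_code is None or elapsed_seconds > 30:
        return "timeout: backend likely unreachable or overloaded"
    if 200 <= status_code < 300:
        return "healthy"
    if status_code in (401, 403):
        return "auth failure: check credentials before assuming an outage"
    if status_code in (502, 503, 504):
        return "gateway error: platform-side problem likely"
    return f"unexpected status {status_code}: investigate further"
```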

5. Monitor Query Inspector for Backend Issues

Within Grafana itself, the Query Inspector (available on any panel) reveals backend health:

  • Stats tab: Shows query execution time and data source latency
  • Query tab: Displays actual queries sent to backend
  • JSON tab: Reveals API response structure and errors

Sudden spikes in query execution time (5+ seconds for simple queries) or consistent timeout errors indicate backend problems even when the UI appears functional.
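
The 5-second rule of thumb can also be checked outside the UI by timing a query yourself. A sketch, where the base URL and auth header are placeholders for your own stack:

```python
# query_latency.py - Time a Prometheus instant query and flag slow responses,
# mirroring the 5-second rule of thumb above. URL and credentials are
# placeholders for your own stack.
import time
import urllib.parse
import urllib.request

SLOW_THRESHOLD_SECONDS = 5.0

def timed_query(base_url: str, promql: str, auth_header: str = ""):
    """Run an instant query; return (elapsed_seconds, http_status)."""
    params = urllib.parse.urlencode({"query": promql})
    req = urllib.request.Request(f"{base_url}/api/v1/query?{params}")
    if auth_header:
        req.add_header("Authorization", auth_header)
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=30) as resp:
        resp.read()
        return time.monotonic() - start, resp.status

def is_slow(elapsed_seconds: float) -> bool:
    return elapsed_seconds > SLOW_THRESHOLD_SECONDS
```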

Common Grafana Cloud Issues and How to Identify Them

Dashboard Loading Failures

Symptoms:

  • Dashboard list showing empty or timing out
  • Individual dashboards stuck on loading screen
  • Panels rendering with "No data" despite active metrics
  • Browser console showing 502 Bad Gateway errors
  • "Failed to fetch" network errors in developer tools

What it means: Dashboard loading failures typically indicate issues with Grafana's frontend API servers or the metadata database storing dashboard definitions. This differs from data source issues—the dashboard structure itself cannot be retrieved or rendered.

Diagnostic approach:

// Check browser console for specific errors
// Common patterns during outages:
// "GET https://your-stack.grafana.net/api/dashboards/uid/abc123 502"
// "TypeError: Cannot read property 'panels' of undefined"
// "Error: Timeout waiting for dashboard data"

Data Source Connection Timeouts

Common error messages:

  • Data source connection error
  • Query timeout exceeded (30s)
  • Connection refused to Prometheus backend
  • Loki gateway unreachable
  • HTTP 503 Service Temporarily Unavailable

Distinguishing characteristics:

  • Multiple data sources failing simultaneously (indicates platform issue)
  • Single data source failing (may be your metrics pipeline)
  • Intermittent failures with successful retries (rate limiting or capacity)
  • Consistent failures across all queries (backend outage)
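
That decision table can be sketched as a small classifier. `infer_scope` is a hypothetical helper operating on a map of data source name to test result:

```python
# failure_pattern.py - Infer outage scope from per-data-source test results,
# per the distinguishing characteristics above. Input is a hypothetical
# mapping of data source name -> bool (True = test passed).
def infer_scope(results: dict) -> str:
    failures = [name for name, ok in results.items() if not ok]
    if not failures:
        return "all healthy"
    if len(failures) == len(results):
        return "platform issue likely: all data sources failing"
    if len(failures) == 1:
        return f"isolated failure ({failures[0]}): check your own pipeline first"
    return "partial failure: possible regional or component outage"
```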

Testing data source health:

# Test from Grafana UI
# Navigate to: Configuration → Data Sources → [Your Data Source] → Save & Test

# Expected response when healthy:
# ✓ Data source is working

# During outages you'll see:
# ✗ HTTP Error 502: Bad Gateway
# ✗ Timeout: request took longer than 30s

Alerting Delays and Delivery Failures

Grafana Cloud's alerting system is often the first component affected during partial outages:

  • Alert evaluation delays: Rules not being evaluated on schedule (default: every 60s)
  • Notification failures: Alerts triggered but not delivered to Slack/PagerDuty/email
  • State flapping: Alerts rapidly switching between OK/Alerting without actual metric changes
  • Missing alert history: Alert state changes not recorded in alert history panel

Critical impact: During a production incident, if your alerts aren't firing, your team may not know there's a problem until customers report it.

Verification steps:

  1. Check Alert Rules page: Do rules show "Last Evaluated" timestamp updating?
  2. Test notification channel: Send test alert to verify delivery path
  3. Review alert state history: Are there gaps in evaluation timeline?
  4. Check Prometheus query performance: Slow queries delay alert evaluation
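
Step 1 can be automated: Grafana's unified alerting serves a Prometheus-compatible rules payload (including lastEvaluation timestamps) via its HTTP API. Assuming that response shape, a staleness check might look like:

```python
# alert_freshness.py - Flag alert rules whose lastEvaluation timestamp is
# stale. The parsing assumes the Prometheus-compatible rules payload shape
# served by Grafana unified alerting (an assumption; verify for your version).
from datetime import datetime, timedelta, timezone

def stale_rules(rules_payload: dict, max_age_seconds: int = 120) -> list:
    """Return names of rules not evaluated within max_age_seconds."""
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=max_age_seconds)
    stale = []
    for group in rules_payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            last = rule.get("lastEvaluation")
            if not last:
                stale.append(rule.get("name", "<unnamed>"))
                continue
            # Timestamps look like 2024-01-01T12:00:00.123456789Z; trim
            # sub-microsecond digits before parsing.
            ts = last.rstrip("Z")[:26]
            evaluated = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
            if evaluated < cutoff:
                stale.append(rule.get("name", "<unnamed>"))
    return stale
```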

Prometheus/Loki Ingestion Problems

Ingestion failures prevent new data from entering Grafana Cloud:

Symptoms for Prometheus metrics:

  • Scrape targets showing "Down" status without infrastructure changes
  • Gaps in time series data (flatlines in graphs)
  • Error logs from Prometheus remote_write:
    level=warn component=remote msg="Failed to send batch" err="server returned HTTP status 503"
    
  • Active series count dropping unexpectedly

Symptoms for Loki logs:

  • Log panels showing stale data (last log entry is minutes/hours old)
  • Promtail/Fluentd/Grafana Agent showing push failures:
    level=error msg="failed to push logs" status=429 error="Too Many Requests"
    
  • Missing log lines during specific time windows

Testing ingestion:

# Grafana Cloud's /api/v1/push endpoint expects snappy-compressed protobuf
# (the Prometheus remote_write format), so a plain-text curl push won't work.
# Verify metric ingestion instead by checking when a metric was last scraped:
curl -G "https://prometheus-prod-01-eu-west-0.grafana.net/api/v1/query" \
  --data-urlencode "query=timestamp(up)" \
  -u "YOUR_INSTANCE_ID:YOUR_API_KEY"

# Send test log to Grafana Cloud Loki
curl -X POST "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/push" \
  -u "YOUR_USER_ID:YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"streams":[{"stream":{"job":"test"},"values":[["'$(date +%s000000000)'","test log line"]]}]}'

Panel Rendering Issues

Visual indicators:

  • Panels showing "Panel plugin not found" errors
  • Graph panels rendering empty despite data being present
  • Time series displaying with incorrect formatting
  • Heatmap or histogram panels showing corrupted visualizations
  • Annotation queries timing out

Backend vs frontend issues:

  • Backend problem: Query tab in inspector shows data, but panel is blank
  • Frontend problem: Query tab shows "No data" but data source test succeeds
  • Plugin issue: Specific panel types fail while others work

SSO/Auth Failures

Authentication and authorization issues:

  • SSO login redirect loops or timeouts
  • "Invalid session" errors after successful login
  • OAuth callback failures with SAML/OIDC providers
  • API key authentication returning 401 Unauthorized
  • Service account tokens suddenly invalid

Organizational impact:

  • Entire teams locked out of Grafana during incidents
  • Automation using API keys fails
  • Terraform provider unable to manage Grafana resources
  • Grafana OnCall unable to authenticate responders

Diagnostic commands:

# Test API key authentication
curl -H "Authorization: Bearer YOUR_API_KEY" \
  "https://your-stack.grafana.net/api/org"

# Expected when healthy: {"id":1,"name":"Your Org"}
# During auth issues: {"message":"Invalid API key"}

The Real Impact When Grafana Cloud Goes Down

Visibility Gaps During Critical Incidents

The worst time for your monitoring to fail is during an active production incident:

  • Blind troubleshooting: Engineers can't see metrics to diagnose root cause
  • Unknown blast radius: Cannot determine which services are affected
  • Remediation delays: Can't verify if fixes are working without metric visibility
  • Cascading failures: Secondary issues go undetected without monitoring

Real scenario: A database outage triggers application errors. Without Grafana, your team can't see:

  • Database connection pool exhaustion metrics
  • Application error rate spike
  • API latency degradation
  • Queue backlog growth

What should be a 15-minute fix becomes a 2-hour investigation because you're troubleshooting blind, checking logs manually across dozens of servers.

Dashboard-Driven Decision Paralysis

Modern engineering teams rely on dashboards for data-driven decision making:

Incident response paralysis:

  • Should we trigger failover? (Can't see replica lag metrics)
  • Is the deployment safe to continue? (Can't see error rates)
  • Which customers are affected? (Can't see per-tenant metrics)
  • Should we scale up capacity? (Can't see resource utilization)

Executive visibility loss:

  • C-level dashboards showing business KPIs go dark
  • Real-time revenue tracking unavailable
  • Customer experience metrics invisible
  • SLA compliance monitoring disabled

Recovery impact: Even after infrastructure is fixed, if Grafana is down, you can't confidently declare "all clear" because you lack metric visibility to verify health.

Alert Blindness and Silent Failures

When Grafana's alerting system fails, your safety net disappears:

Critical alerts never fire:

  • Disk space filling to 100% without notification
  • Memory leaks going unnoticed until OOM kills
  • SSL certificates expiring without renewal alerts
  • Backup jobs failing silently for days

On-call impact:

  • Engineers unaware of ongoing incidents
  • PagerDuty/Slack/email integrations failing
  • Alert fatigue from false positives during recovery
  • Loss of trust in alerting system reliability

Compliance consequences:

  • SLA violations without detection
  • Security events unmonitored
  • Audit trail gaps for incident response
  • Failure to meet regulatory reporting requirements

NOC/SOC Operational Impact

For Network Operations Centers and Security Operations Centers relying on Grafana:

NOC consequences:

  • No real-time network topology visibility
  • Bandwidth utilization trending unavailable
  • Infrastructure capacity planning data missing
  • Vendor SLA tracking impossible

SOC consequences:

  • Security event correlation disrupted
  • Threat detection dashboards offline
  • Incident investigation timelines incomplete
  • Log aggregation for forensics unavailable

Operational workarounds:

  • Manual log grepping across hundreds of servers
  • SSH-ing into individual nodes to check metrics
  • Writing throwaway scripts to query data sources directly
  • Communicating status updates without data backing

Productivity Tax Across Engineering

Developer workflow disruption:

  • Cannot validate performance of new deployments
  • A/B test metrics unavailable for evaluation
  • Feature flag rollout monitoring blind
  • Debugging production issues without telemetry

DevOps/SRE impact:

  • Kubernetes cluster metrics invisible
  • Infrastructure-as-code changes deployed without verification
  • Cost optimization initiatives paused (no usage metrics)
  • Capacity planning decisions delayed

Estimated productivity cost: For a team of 20 engineers at $150k average salary ($75/hour), a 4-hour Grafana outage where engineers spend 25% of time working around missing monitoring = $1,500 in lost productivity, plus immeasurable impact on incident response quality.
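
That back-of-envelope estimate reduces to a simple formula, sketched here so you can plug in your own team's numbers:

```python
# outage_cost.py - Back-of-envelope productivity cost from the example above.
def outage_productivity_cost(engineers: int, hourly_rate: float,
                             outage_hours: float, overhead_fraction: float) -> float:
    """Cost = headcount x hourly rate x outage duration x fraction of time lost."""
    return engineers * hourly_rate * outage_hours * overhead_fraction

# The worked example: 20 engineers at $75/hour, 4-hour outage, 25% overhead.
cost = outage_productivity_cost(20, 75, 4, 0.25)
# cost == 1500.0
```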

Diagnostic Steps and Troubleshooting Workflows

Step 1: Validate Status Page vs Reality

Don't trust status pages blindly. Perform your own verification:

#!/bin/bash
# grafana-health-check.sh

STACK_NAME="your-stack"
INSTANCE_ID="your_instance_id"
API_KEY="your_api_key"

# Check Grafana UI health
UI_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://${STACK_NAME}.grafana.net/api/health")

# Check Prometheus query endpoint
PROM_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  -u "${INSTANCE_ID}:${API_KEY}" \
  "https://prometheus-prod-01-eu-west-0.grafana.net/api/v1/query?query=up")

echo "Grafana UI: $UI_STATUS (expected: 200)"
echo "Prometheus: $PROM_STATUS (expected: 200)"

if [ "$UI_STATUS" != "200" ] || [ "$PROM_STATUS" != "200" ]; then
  echo "❌ Grafana health check FAILED"
  exit 1
else
  echo "✅ Grafana health check PASSED"
fi

Step 2: Isolate Component Failures

Systematic testing to identify affected components:

  1. Test dashboard loading:

    • Try accessing dashboard list: https://your-stack.grafana.net/dashboards
    • Load a simple dashboard with few panels
    • Load a complex dashboard with many data sources
  2. Test each data source independently:

    • Configuration → Data Sources → Select source → "Save & Test"
    • Document which sources fail vs succeed
    • Pattern of failures indicates scope (all sources = platform issue)
  3. Test alerting pipeline:

    • Create test alert rule with simple condition
    • Manually trigger alert by adjusting threshold
    • Verify notification delivery to test channel
    • Check alert evaluation logs for errors
  4. Test query performance:

    • Open any panel's Query Inspector
    • Note query execution time
    • Compare to baseline (healthy state: <1s for simple queries)
    • Identify slow queries that may be overloading backend

Step 3: Use Query Inspector for Backend Diagnosis

Query Inspector reveals backend health even when UI works:

Access via: Panel menu (three dots) → Inspect → Query

Key metrics to check:

  • Query time: >5s indicates backend overload
  • Response size: Verify data is actually returning
  • Errors: Look for timeout messages or HTTP errors

Example problematic inspector output:

{
  "error": "timeout: context deadline exceeded",
  "status": "error",
  "errorType": "timeout",
  "message": "query took longer than 30s to execute"
}

Step 4: Check Data Source Health Programmatically

Automated health monitoring script:

#!/usr/bin/env python3
import requests
import sys
from datetime import datetime

GRAFANA_URL = "https://your-stack.grafana.net"
API_KEY = "your_api_key"

def check_datasource_health(datasource_name):
    """Test data source connectivity"""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    
    # Get datasource ID
    resp = requests.get(
        f"{GRAFANA_URL}/api/datasources/name/{datasource_name}",
        headers=headers
    )
    
    if resp.status_code != 200:
        return False, f"Failed to find datasource: {resp.status_code}"
    
    ds_id = resp.json()["id"]
    
    # Test datasource
    resp = requests.get(
        f"{GRAFANA_URL}/api/datasources/{ds_id}/health",
        headers=headers
    )
    
    if resp.status_code == 200:
        return True, "OK"
    else:
        return False, f"Health check failed: {resp.text}"

# Test critical data sources
datasources = ["Prometheus", "Loki", "Tempo"]
all_healthy = True

print(f"=== Grafana Health Check: {datetime.now()} ===\n")

for ds in datasources:
    healthy, message = check_datasource_health(ds)
    status = "✅" if healthy else "❌"
    print(f"{status} {ds}: {message}")
    if not healthy:
        all_healthy = False

sys.exit(0 if all_healthy else 1)

Step 5: Test Ingestion Pipeline

Verify metrics and logs are flowing:

# Check if recent metrics exist
# Query for a metric that's constantly updating (like timestamp)
curl -G "https://prometheus-prod-01-eu-west-0.grafana.net/api/v1/query" \
  --data-urlencode "query=up" \
  -u "INSTANCE:KEY" | jq '.data.result[] | .value[0]' \
  | xargs -I {} date -d @{} '+%Y-%m-%d %H:%M:%S'

# Output shows timestamp of most recent metric
# If older than 2-3 minutes, ingestion may be delayed

# Check Loki log recency
curl -G "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/query" \
  --data-urlencode 'query={job=~".+"}' \
  --data-urlencode "limit=1" \
  -u "USER:KEY" | jq '.data.result[0].values[0][0]'

# Returns nanosecond timestamp of most recent log
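
The recency thresholds above (roughly 2-3 minutes for metrics) can be wrapped in a small helper. Note that Loki returns nanosecond timestamps while Prometheus sample timestamps are in seconds:

```python
# ingestion_lag.py - Apply the recency rules above: a sample timestamp more
# than ~3 minutes old suggests delayed ingestion. Loki timestamps are
# nanoseconds; Prometheus sample timestamps are seconds.
import time

def metric_lag_seconds(sample_timestamp_seconds: float) -> float:
    """Lag for a Prometheus sample timestamp (seconds since epoch)."""
    return time.time() - sample_timestamp_seconds

def loki_lag_seconds(nanosecond_timestamp: str) -> float:
    """Lag for a Loki log timestamp (nanoseconds since epoch, as a string)."""
    return time.time() - int(nanosecond_timestamp) / 1e9

def ingestion_delayed(lag_seconds: float, threshold_seconds: float = 180.0) -> bool:
    return lag_seconds > threshold_seconds
```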

Resilience Strategies and Code Examples

1. Multi-Backend Data Source Routing

Implement automatic failover between Grafana Cloud and local Prometheus:

// datasource-router.js - Intelligent data source failover
const axios = require('axios');

class DataSourceRouter {
  constructor() {
    this.backends = [
      {
        name: 'grafana-cloud',
        url: 'https://prometheus-prod-01-eu-west-0.grafana.net',
        auth: { username: process.env.GRAFANA_INSTANCE_ID, password: process.env.GRAFANA_API_KEY },
        priority: 1,
        healthy: true
      },
      {
        name: 'local-prometheus',
        url: 'http://prometheus.local:9090',
        auth: null,
        priority: 2,
        healthy: true
      },
      {
        name: 'backup-victoria-metrics',
        url: 'http://victoria-metrics.local:8428',
        auth: null,
        priority: 3,
        healthy: true
      }
    ];
    
    this.healthCheckInterval = 30000; // 30 seconds
    this.startHealthChecks();
  }

  startHealthChecks() {
    setInterval(() => this.checkHealth(), this.healthCheckInterval);
  }

  async checkHealth() {
    for (const backend of this.backends) {
      try {
        const config = backend.auth 
          ? { auth: backend.auth, timeout: 5000 }
          : { timeout: 5000 };
          
        await axios.get(`${backend.url}/api/v1/query?query=up`, config);
        backend.healthy = true;
      } catch (error) {
        console.error(`Backend ${backend.name} health check failed:`, error.message);
        backend.healthy = false;
      }
    }
  }

  getHealthyBackend() {
    // Return highest priority healthy backend
    const sorted = this.backends
      .filter(b => b.healthy)
      .sort((a, b) => a.priority - b.priority);
    
    return sorted[0] || null;
  }

  async query(promql) {
    const backend = this.getHealthyBackend();
    
    if (!backend) {
      throw new Error('No healthy data source backends available');
    }

    console.log(`Routing query to ${backend.name}`);
    
    const config = backend.auth 
      ? { auth: backend.auth, timeout: 30000 }
      : { timeout: 30000 };

    try {
      const response = await axios.get(
        `${backend.url}/api/v1/query`,
        { 
          ...config,
          params: { query: promql }
        }
      );
      
      return response.data;
    } catch (error) {
      // Mark backend as unhealthy and retry with next
      backend.healthy = false;
      console.error(`Query failed on ${backend.name}, retrying...`);
      
      const nextBackend = this.getHealthyBackend();
      if (nextBackend && nextBackend.name !== backend.name) {
        return this.query(promql); // Recursive retry
      }
      
      throw error;
    }
  }
}

// Usage example
const router = new DataSourceRouter();

async function getMetrics() {
  try {
    const result = await router.query('rate(http_requests_total[5m])');
    return result.data.result;
  } catch (error) {
    console.error('All backends failed:', error);
    return [];
  }
}

module.exports = { DataSourceRouter };

2. Local Grafana Fallback Setup

Deploy a local Grafana instance with read-only federation from Grafana Cloud:

# docker-compose.yml - Local Grafana fallback
version: '3.8'

services:
  grafana-fallback:
    image: grafana/grafana:latest
    container_name: grafana-fallback
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_secure_password
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://grafana-fallback.local:3000
    volumes:
      - ./grafana-data:/var/lib/grafana
      - ./dashboards:/etc/grafana/provisioning/dashboards
      - ./datasources:/etc/grafana/provisioning/datasources
    restart: unless-stopped

  prometheus-local:
    image: prom/prometheus:latest
    container_name: prometheus-local
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=7d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

volumes:
  prometheus-data:

# prometheus.yml - Configure federation from Grafana Cloud
global:
  scrape_interval: 60s
  evaluation_interval: 60s

scrape_configs:
  # Federate critical metrics from Grafana Cloud
  - job_name: 'grafana-cloud-federation'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"up|node_.*|container_.*|http_.*"}'  # Critical metrics
    static_configs:
      - targets:
          - 'prometheus-prod-01-eu-west-0.grafana.net'
    basic_auth:
      username: 'YOUR_INSTANCE_ID'
      password: 'YOUR_API_KEY'
    scheme: https

  # Local scrape targets (always available)
  - job_name: 'local-node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'local-cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

Datasource provisioning for fallback Grafana:

# datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus-Local
    type: prometheus
    access: proxy
    url: http://prometheus-local:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "60s"
  
  - name: Prometheus-GrafanaCloud
    type: prometheus
    access: proxy
    url: https://prometheus-prod-01-eu-west-0.grafana.net
    isDefault: false
    editable: false
    basicAuth: true
    basicAuthUser: YOUR_INSTANCE_ID
    secureJsonData:
      basicAuthPassword: YOUR_API_KEY
    jsonData:
      timeInterval: "60s"

3. Dashboard Backup and Sync

Back up dashboard definitions regularly so a local fallback instance can take over during an outage:

#!/bin/bash
# sync-critical-dashboards.sh - Backup Grafana dashboards locally

GRAFANA_CLOUD_URL="https://your-stack.grafana.net"
GRAFANA_API_KEY="your_api_key"
BACKUP_DIR="./dashboard-backups"

mkdir -p "$BACKUP_DIR"

# Get all dashboard UIDs
DASHBOARDS=$(curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
  "$GRAFANA_CLOUD_URL/api/search?type=dash-db" | jq -r '.[].uid')

for uid in $DASHBOARDS; do
  echo "Backing up dashboard: $uid"
  
  curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
    "$GRAFANA_CLOUD_URL/api/dashboards/uid/$uid" \
    > "$BACKUP_DIR/${uid}.json"
done

echo "✅ Backed up $(ls -1 $BACKUP_DIR/*.json | wc -l) dashboards"

# Import to local Grafana (wrap each export in the payload /api/dashboards/db
# expects, and null the id so the import doesn't collide with existing IDs)
for dashboard in "$BACKUP_DIR"/*.json; do
  jq '{dashboard: (.dashboard + {id: null}), overwrite: true}' "$dashboard" | \
    curl -X POST -H "Content-Type: application/json" \
      -H "Authorization: Bearer $LOCAL_GRAFANA_API_KEY" \
      -d @- \
      "http://localhost:3000/api/dashboards/db"
done

4. Alertmanager Redundancy

Configure dual alerting: Grafana Cloud + local Alertmanager:

# alertmanager.yml - Local Alertmanager configuration
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: '⚠️ {{ .GroupLabels.alertname }}'

# prometheus-alerts.yml - Critical alerts that must always work
groups:
  - name: grafana_cloud_health
    interval: 60s
    rules:
      - alert: GrafanaCloudDown
        expr: up{job="grafana-cloud-health-check"} == 0
        for: 5m
        labels:
          severity: critical
          component: grafana-cloud
        annotations:
          summary: "Grafana Cloud is unreachable"
          description: "Grafana Cloud has been down for 5 minutes. Failing over to local monitoring."
      
      - alert: GrafanaCloudHighLatency
        expr: probe_duration_seconds{job="grafana-cloud-health-check"} > 5
        for: 10m
        labels:
          severity: warning
          component: grafana-cloud
        annotations:
          summary: "Grafana Cloud experiencing high latency"
          description: "Query latency is {{ $value }}s (threshold: 5s)"
      
      - alert: PrometheusIngestionDelay
        expr: time() - timestamp(up{job="your-app"}) > 300
        for: 5m
        labels:
          severity: critical
          component: metrics-ingestion
        annotations:
          summary: "Prometheus metrics ingestion delayed"
          description: "Last metric received {{ $value }}s ago (threshold: 300s)"

5. Intelligent Alert Routing with Fallback

Route alerts through multiple channels with automatic failover:

// alert-router.go - Multi-channel alert delivery with fallback
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

type AlertChannel struct {
    Name     string
    URL      string
    Priority int
    Timeout  time.Duration
}

type AlertRouter struct {
    Channels []AlertChannel
}

func NewAlertRouter() *AlertRouter {
    return &AlertRouter{
        Channels: []AlertChannel{
            {
                Name:     "grafana-cloud-alerting",
                URL:      "https://your-stack.grafana.net/api/alertmanager/grafana/api/v1/alerts",
                Priority: 1,
                Timeout:  5 * time.Second,
            },
            {
                Name:     "local-alertmanager",
                URL:      "http://localhost:9093/api/v1/alerts",
                Priority: 2,
                Timeout:  5 * time.Second,
            },
            {
                Name:     "pagerduty-direct",
                URL:      "https://events.pagerduty.com/v2/enqueue",
                Priority: 3,
                Timeout:  10 * time.Second,
            },
        },
    }
}

func (r *AlertRouter) SendAlert(alert map[string]interface{}) error {
    var lastErr error
    
    for _, channel := range r.Channels {
        err := r.sendToChannel(channel, alert)
        if err == nil {
            fmt.Printf("✅ Alert delivered via %s\n", channel.Name)
            return nil
        }
        
        fmt.Printf("❌ Failed to deliver via %s: %v\n", channel.Name, err)
        lastErr = err
        
        // Try next channel
        continue
    }
    
    return fmt.Errorf("all alert channels failed, last error: %w", lastErr)
}

func (r *AlertRouter) sendToChannel(channel AlertChannel, alert map[string]interface{}) error {
    payload, err := json.Marshal(alert)
    if err != nil {
        return err
    }
    
    client := &http.Client{Timeout: channel.Timeout}
    resp, err := client.Post(channel.URL, "application/json", bytes.NewBuffer(payload))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    
    if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusAccepted {
        return fmt.Errorf("unexpected status code: %d", resp.StatusCode)
    }
    
    return nil
}

func main() {
    router := NewAlertRouter()
    
    alert := map[string]interface{}{
        "labels": map[string]string{
            "alertname": "HighCPUUsage",
            "severity":  "warning",
        },
        "annotations": map[string]string{
            "summary":     "CPU usage above 80%",
            "description": "Host server-01 CPU usage is 85%",
        },
    }
    
    err := router.SendAlert(alert)
    if err != nil {
        fmt.Printf("⚠️ Alert delivery failed: %v\n", err)
    }
}

What to Do When Grafana Cloud Goes Down

1. Activate Your Fallback Grafana Instance

If you've implemented the local Grafana setup described earlier:

# Start local Grafana and Prometheus
cd ~/grafana-fallback
docker-compose up -d

# Verify services are running
docker-compose ps

# Access local Grafana
open http://localhost:3000

# Update DNS/load balancer to point to fallback
# Or update bookmarks to use local instance during outage

2. Query Data Sources Directly

Bypass Grafana entirely for critical queries:

# Get current CPU usage across all hosts
curl -G "https://prometheus-prod-01-eu-west-0.grafana.net/api/v1/query" \
  --data-urlencode 'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)' \
  -u "INSTANCE:KEY" | jq '.data.result[] | {host: .metric.instance, cpu_usage: .value[1]}'

# Search logs for errors in last hour
curl -G "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/query_range" \
  --data-urlencode 'query={job="app"} |= "ERROR"' \
  --data-urlencode "start=$(date -u -d '1 hour ago' '+%s')000000000" \
  --data-urlencode "end=$(date -u '+%s')000000000" \
  -u "USER:KEY" | jq '.data.result[] | .values[]'

3. Implement Read-Only Mode Messaging

Communicate status to your team:

// status-banner.js - Add banner to internal tools
function checkGrafanaHealth() {
  fetch('https://your-stack.grafana.net/api/health')
    .then(response => {
      if (!response.ok) {
        showBanner('⚠️ Grafana Cloud is experiencing issues. Using fallback monitoring.');
      }
    })
    .catch(() => {
      showBanner('❌ Grafana Cloud is down. Switch to http://grafana-fallback.local:3000');
    });
}

function showBanner(message) {
  const banner = document.createElement('div');
  banner.style.cssText = 'position:fixed;top:0;width:100%;background:#f44336;color:white;padding:10px;text-align:center;z-index:9999;';
  banner.textContent = message;
  document.body.prepend(banner);
}

// Check every 5 minutes
setInterval(checkGrafanaHealth, 300000);
checkGrafanaHealth();

4. Export Critical Dashboards for Manual Review

Create static HTML snapshots of key dashboards:

#!/usr/bin/env python3
# export-dashboard-snapshot.py
import requests
import json
from datetime import datetime

GRAFANA_URL = "https://your-stack.grafana.net"
API_KEY = "your_api_key"
CRITICAL_DASHBOARDS = [  # dashboard UIDs (Dashboard settings → JSON Model), not titles
    "production-overview",
    "kubernetes-cluster-health",
    "application-performance"
]

def create_snapshot(dashboard_uid):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    
    # Get dashboard (raise on HTTP errors so the caller's except reports them)
    resp = requests.get(
        f"{GRAFANA_URL}/api/dashboards/uid/{dashboard_uid}",
        headers=headers
    )
    resp.raise_for_status()
    dashboard = resp.json()['dashboard']
    
    # Create snapshot
    snapshot_data = {
        "dashboard": dashboard,
        "expires": 86400,  # 24 hours
        "external": True
    }
    
    resp = requests.post(
        f"{GRAFANA_URL}/api/snapshots",
        headers=headers,
        json=snapshot_data
    )
    resp.raise_for_status()
    
    return resp.json()['url']

print(f"Creating snapshots at {datetime.now()}")
for dashboard in CRITICAL_DASHBOARDS:
    try:
        url = create_snapshot(dashboard)
        print(f"✅ {dashboard}: {url}")
    except Exception as e:
        print(f"❌ {dashboard}: {e}")

5. Set Up Automated Failover Monitoring

Continuously monitor Grafana health and take action:

#!/bin/bash
# grafana-watchdog.sh - Auto-failover on Grafana outage

GRAFANA_URL="https://your-stack.grafana.net"
FALLBACK_URL="http://localhost:3000"
CHECK_INTERVAL=60

while true; do
  HTTP_CODE=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$GRAFANA_URL/api/health")
  
  if [ "$HTTP_CODE" != "200" ]; then
    echo "❌ Grafana Cloud down (HTTP $HTTP_CODE) - Starting failover..."
    
    # Start local Grafana if not running
    docker-compose -f ~/grafana-fallback/docker-compose.yml up -d
    
    # Send alert
    curl -X POST "YOUR_SLACK_WEBHOOK" \
      -H "Content-Type: application/json" \
      -d "{\"text\":\"🚨 Grafana Cloud is down. Failover activated: $FALLBACK_URL\"}"
    
    # Wait longer before next check during outage
    sleep 300
  else
    echo "✅ Grafana Cloud healthy"
    sleep $CHECK_INTERVAL
  fi
done

6. Maintain Local Metrics During Outage

Ensure you don't lose data during Grafana Cloud downtime:

# prometheus.yml - Dual remote write (cloud + local)
remote_write:
  # Primary: Grafana Cloud
  - url: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
    basic_auth:
      username: YOUR_INSTANCE_ID
      password: YOUR_API_KEY
    queue_config:
      capacity: 10000
      max_shards: 10
      max_samples_per_send: 1000
    # Each remote_write target has an independent queue, so an outage at
    # this endpoint does not block the local backup below; samples buffer
    # up to `capacity` and are lost only if the outage outlasts the
    # WAL/queue retention.

  # Backup: Local VictoriaMetrics or Prometheus
  - url: http://victoria-metrics:8428/api/v1/write
    queue_config:
      capacity: 5000
      max_shards: 5

This ensures metrics continue to be stored locally even when Grafana Cloud is unavailable, preventing data loss.

Frequently Asked Questions

How often does Grafana Cloud go down?

Grafana Cloud maintains high availability with typical uptime exceeding 99.9% annually. Major platform-wide outages affecting all customers are rare (1-3 per year), though individual components like specific regions or services may experience brief degradation more frequently. Most teams experience zero impactful downtime in a typical quarter. However, degraded performance (slow queries, dashboard loading delays) can occur during peak usage or when handling very large datasets.

What's the difference between Grafana status page and API Status Check?

The official Grafana status page (status.grafana.com) is manually updated by Grafana Labs' operations team during incidents, which can lag behind actual issues by 5-15 minutes. API Status Check performs automated health checks every 60 seconds against live Grafana Cloud API endpoints, dashboard loading, and data source connectivity, often detecting degradation before it's officially reported. Use both: API Status Check for immediate detection, and the status page for official incident communications and estimated resolution times.

Can I get SLA credits for Grafana Cloud downtime?

Grafana Cloud's SLA varies by plan tier. Free tier has no SLA guarantees. Pro and Advanced plans typically include 99.9% uptime SLAs with monthly credits for downtime exceeding thresholds (usually 5-10% credit per 0.1% below SLA). Enterprise plans often negotiate custom SLAs up to 99.99% with more generous credit terms. Check your specific plan details at grafana.com/legal or contact Grafana support to file an SLA credit claim after significant outages.

Should I use Grafana alerts or a separate alerting system?

For production-critical alerts, implement a defense-in-depth strategy: use Grafana Cloud alerting as primary (integrated with your dashboards and data sources), but maintain a secondary independent alerting system (local Alertmanager, Prometheus, or direct integration with PagerDuty/Opsgenie). This ensures alerts fire even if Grafana Cloud is down. Critical infrastructure alerts (disk space, memory, core services) should never rely solely on a cloud SaaS platform for delivery.
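
One way to make the secondary path self-verifying is an always-firing "watchdog" alert evaluated by the local Prometheus/Alertmanager, a pattern popularized by kube-prometheus: the heartbeat is routed to a dead-man's-switch service (PagerDuty and Opsgenie both support these), and if the heartbeat stops arriving, the alerting path itself is broken. The rule below is an illustrative sketch; the alert name and labels are placeholders:

```yaml
# Illustrative Prometheus rule for the LOCAL stack (names are placeholders)
groups:
  - name: meta
    rules:
      - alert: AlertingPipelineHeartbeat
        expr: vector(1)   # always true, so the alert always fires
        labels:
          severity: none  # routed to a dead-man's-switch, not on-call
        annotations:
          summary: "Heartbeat proving the local alerting path is alive"
```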

How do I prevent duplicate alerts during Grafana Cloud outages and recovery?

Configure alert grouping with group_wait and group_interval in Alertmanager to batch related alerts. Use repeat_interval to prevent alert spam. During recovery, implement exponential backoff in your alert routing logic. Example: Set group_wait: 30s to allow alerts to batch, group_interval: 5m to limit notification frequency, and repeat_interval: 4h to avoid re-sending the same alert. For webhook receivers, implement idempotency using alert fingerprints to deduplicate at the receiving end.
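
The values in that answer translate to an Alertmanager route like the following; the receiver name and group_by labels are illustrative and should match your own setup:

```yaml
# alertmanager.yml fragment - illustrative values from the answer above
route:
  receiver: oncall                    # placeholder receiver name
  group_by: ['alertname', 'cluster']  # batch related alerts together
  group_wait: 30s                     # wait for alerts to batch before first notify
  group_interval: 5m                  # limit how often a group re-notifies
  repeat_interval: 4h                 # don't re-send an unchanged alert sooner
```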

What regions is Grafana Cloud available in?

Grafana Cloud operates in multiple regions globally: US (us-east, us-west), Europe (eu-west, eu-central), Asia Pacific (ap-southeast), and South America (sa-east). When signing up, you select a primary region for your stack. Data residency and compliance requirements may dictate region choice (e.g., GDPR requires EU region for EU customer data). Outages typically affect single regions rather than all of Grafana Cloud globally, though severe incidents can have multi-region impact.

How do I monitor Grafana Cloud availability from outside Grafana?

Implement external synthetic monitoring using tools like API Status Check, Pingdom, Datadog Synthetics, or custom scripts. Test these endpoints: (1) Dashboard loading: https://your-stack.grafana.net/api/health, (2) Prometheus queries: /api/v1/query, (3) Loki log searches: /loki/api/v1/query, (4) Alert rule evaluation. Deploy monitors from multiple geographic locations to detect regional issues. Set up alerts to PagerDuty or Slack independent of Grafana's notification system.
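
A custom synthetic check along those lines can be sketched in a few lines of Python. The stack URL and token below are placeholders; `/api/health` is Grafana's unauthenticated liveness endpoint, while `/api/datasources` exercises the authenticated API layer:

```python
import urllib.request
import urllib.error

STACK_URL = "https://your-stack.grafana.net"  # placeholder: your stack hostname
API_KEY = "your_api_key"                      # placeholder: a viewer-scope token

# Endpoints from the answer above.
ENDPOINTS = {
    "health": "/api/health",
    "datasources": "/api/datasources",
}

def classify(status_code):
    """Map an HTTP status code to a coarse health verdict."""
    if status_code == 200:
        return "up"
    if status_code in (401, 403):
        return "auth-error"  # service is up, credentials are wrong
    if 500 <= status_code < 600:
        return "down"
    return "degraded"

def run_checks():
    """Probe each endpoint and return {name: verdict}."""
    results = {}
    for name, path in ENDPOINTS.items():
        req = urllib.request.Request(
            STACK_URL + path,
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                results[name] = classify(resp.status)
        except urllib.error.HTTPError as exc:
            results[name] = classify(exc.code)
        except urllib.error.URLError:
            results[name] = "unreachable"  # DNS/TCP failure: likely a hard outage
    return results
```

Run `run_checks()` from a cron job or Lambda in each region you care about and push any non-"up" verdict to Slack or PagerDuty through a path that does not depend on Grafana.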

Can I migrate dashboards between Grafana Cloud and self-hosted?

Yes, dashboards are fully portable. Export from Grafana Cloud: Dashboard settings → JSON Model → Copy. Import to self-hosted: Create Dashboard → Import → Paste JSON. For bulk migration, use Grafana API or tools like grafana-backup-tool. Note: Data source references must be updated to match target environment. For IaC approach, use Terraform's Grafana provider to manage dashboards as code, enabling deployment to multiple Grafana instances.
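
For bulk export via the API, a minimal sketch looks like this. The `/api/dashboards/uid/<uid>` endpoint is Grafana's documented dashboard-by-UID route; the host and token are placeholders, and `strip_for_import` drops only the instance-specific numeric `id` so the target instance assigns its own:

```python
import json
import urllib.request

GRAFANA_URL = "https://your-stack.grafana.net"  # placeholder source instance
API_KEY = "your_api_key"                        # placeholder service-account token

def strip_for_import(dashboard):
    """Prepare exported dashboard JSON for import into another instance:
    remove the numeric instance-specific `id`, keep `uid` so links and
    alert references continue to resolve."""
    cleaned = dict(dashboard)
    cleaned.pop("id", None)
    return cleaned

def fetch_dashboard(uid):
    """Fetch one dashboard's JSON model by UID via the HTTP API."""
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req, timeout=15) as resp:
        return json.load(resp)["dashboard"]

def export_all(uids):
    """Write each dashboard to <uid>.json, ready for re-import."""
    for uid in uids:
        with open(f"{uid}.json", "w") as fh:
            json.dump(strip_for_import(fetch_dashboard(uid)), fh, indent=2)
```

Remember to remap data source references after import, since UIDs of data sources differ between instances.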

What happens to my metrics data during a Grafana Cloud outage?

Metrics collection continues unaffected: your Prometheus exporters, agents, and remote_write configurations keep pushing data to Grafana Cloud's ingestion endpoints. During outages, ingestion may be delayed or queued, but data is typically not lost unless the outage is catastrophic (extremely rare) or outlasts your agents' local buffering. Query access is what's impacted: you cannot view, alert on, or analyze metrics until service is restored. Configure dual remote_write to local storage (as in section 6 above) to minimize the risk of data loss during an extended outage.

Should I use Grafana Cloud for business-critical monitoring?

Grafana Cloud is suitable for business-critical monitoring when combined with proper resilience strategies: (1) Implement local fallback Grafana with federated metrics, (2) Use dual alerting paths (cloud + local), (3) Deploy synthetic monitoring of Grafana Cloud itself, (4) Maintain dashboard backups, (5) Configure dual remote_write for metrics. For ultra-high-availability requirements (99.99%+), consider hybrid architecture: Grafana Cloud for primary with on-prem Prometheus cluster for guaranteed local access. Many Fortune 500 companies successfully use Grafana Cloud for critical systems with these patterns.

Stay Ahead of Grafana Cloud Outages

Don't let monitoring downtime leave you blind during critical incidents. Subscribe to real-time Grafana Cloud alerts and get notified instantly when issues are detected—before your dashboards stop loading.

API Status Check monitors Grafana Cloud 24/7 with:

  • 60-second health checks across dashboard API, Prometheus, Loki, and Tempo
  • Instant alerts via email, Slack, Discord, or webhook when degradation is detected
  • Historical uptime tracking and detailed incident reports
  • Multi-region monitoring to detect regional vs global outages
  • Ingestion pipeline monitoring to catch delays before data gaps appear

Start monitoring Grafana Cloud now →


Last updated: February 4, 2026. Grafana Cloud status information is provided in real-time based on active monitoring. For official incident reports, always refer to status.grafana.com.
