Is Grafana Cloud Down? How to Check Grafana Status in Real-Time
Quick Answer: To check if Grafana Cloud is down, visit apistatuscheck.com/api/grafana for real-time monitoring, or check the official status.grafana.com page. Common signs include dashboard loading failures, data source connection timeouts, alerting delays, Prometheus/Loki ingestion problems, panel rendering issues, and SSO authentication failures.
When your monitoring and observability platform goes dark, you lose the eyes and ears of your infrastructure. Grafana Cloud serves as the central nervous system for thousands of engineering teams worldwide, aggregating metrics, logs, and traces from countless data sources. Any disruption doesn't just mean you can't view dashboards—it means you're flying blind during incidents, missing critical alerts, and potentially letting production issues escalate unnoticed. Understanding how to quickly verify Grafana's status and implement fallback strategies can mean the difference between a 5-minute incident and a multi-hour outage.
How to Check Grafana Cloud Status in Real-Time
1. API Status Check (Fastest Method)
The quickest way to verify Grafana Cloud's operational status is through apistatuscheck.com/api/grafana. This real-time monitoring service:
- Tests actual API endpoints every 60 seconds across all regions
- Monitors dashboard loading times and rendering performance
- Tracks data source connectivity (Prometheus, Loki, Tempo)
- Validates alerting pipeline health and delivery
- Provides instant alerts when degradation is detected
- Shows historical uptime over 30/60/90 day periods
Unlike status pages that depend on manual updates, API Status Check performs active health checks against Grafana Cloud's production infrastructure, giving you the most accurate real-time picture of service availability across all critical components.
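If you want to fold an automated check like this into your own tooling, the polling logic is simple to sketch in Python. The response shape (`{"status": "up"}`) is an assumption for illustration only; adapt the URL and parsing to whichever status API you actually use:

```python
import json
import urllib.request

def is_degraded(payload: dict) -> bool:
    """Treat anything other than an explicit up/operational state as degraded."""
    return payload.get("status", "").lower() not in ("up", "operational")

def poll_status(url: str, timeout: int = 10) -> bool:
    """Fetch a status JSON document and report whether the service looks degraded.

    The URL and JSON shape are placeholders, not a documented API contract.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return is_degraded(json.load(resp))
```

Wire `poll_status` into a cron job or a sidecar loop and page yourself when it flips to degraded.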
2. Official Grafana Status Page
Grafana maintains status.grafana.com as their official communication channel for service incidents. The page displays:
- Current operational status for all Grafana Cloud services
- Active incidents and ongoing investigations
- Scheduled maintenance windows and upgrade notifications
- Historical incident reports with root cause analysis
- Component-specific status breakdown:
- Grafana UI & Dashboards
- Prometheus (metrics ingestion & querying)
- Loki (log aggregation & search)
- Tempo (distributed tracing)
- Grafana Alerting
- Synthetic Monitoring
- OnCall rotation service
Pro tip: Subscribe to status updates via email, SMS, Slack, or webhook on the status page to receive immediate notifications when incidents occur. Each component can be subscribed to independently.
3. Check Your Dashboard Access
If your Grafana Cloud instance at your-stack.grafana.net is showing issues, this often indicates broader infrastructure problems. Key indicators include:
- Login page timeouts or 502/503 errors
- Dashboard list failing to load
- "Data source unavailable" errors across multiple panels
- Infinite loading spinners on panel refresh
- Query editor interface unresponsive
- Settings pages failing to save changes
4. Test Data Source Connectivity
For operators managing critical monitoring infrastructure, direct data source testing confirms backend health:
Prometheus query test:
# Query Grafana Cloud Prometheus directly
curl -G "https://prometheus-prod-01-eu-west-0.grafana.net/api/v1/query" \
--data-urlencode "query=up" \
-u "YOUR_INSTANCE_ID:YOUR_API_KEY"
Loki logs query test:
# Test Loki log ingestion and query
curl -G "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/query" \
--data-urlencode 'query={job="varlogs"}' \
-u "YOUR_USER_ID:YOUR_API_KEY"
Look for HTTP response codes outside the 2xx range, connection timeouts exceeding 30 seconds, or authentication failures that weren't occurring previously.
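Those failure signatures can be scripted instead of eyeballed. A minimal Python sketch, assuming basic-auth credentials like the curl examples above; the `classify_response` helper encodes the same triage rules:

```python
import base64
import socket
import urllib.error
import urllib.request

def classify_response(status_code=None, timed_out=False):
    """Map a probe result onto the failure classes described above."""
    if timed_out:
        return "timeout"            # request exceeded its deadline
    if status_code is None:
        return "unreachable"        # no HTTP response at all
    if status_code in (401, 403):
        return "auth-failure"       # credentials rejected that used to work
    if 200 <= status_code < 300:
        return "healthy"
    return "backend-error"          # 5xx and other unexpected codes

def probe(url, user=None, password=None, timeout=30):
    """Probe one data source endpoint and classify the outcome."""
    req = urllib.request.Request(url)
    if user:
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        req.add_header("Authorization", f"Basic {token}")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return classify_response(resp.status)
    except urllib.error.HTTPError as e:
        return classify_response(e.code)
    except (urllib.error.URLError, socket.timeout):
        return classify_response(timed_out=True)
```

Run `probe` against each endpoint from a cron job and alert on anything other than "healthy".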
5. Monitor Query Inspector for Backend Issues
Within Grafana itself, the Query Inspector (available on any panel) reveals backend health:
- Stats tab: Shows query execution time and data source latency
- Query tab: Displays actual queries sent to backend
- JSON tab: Reveals API response structure and errors
Sudden spikes in query execution time (5+ seconds for simple queries) or consistent timeout errors indicate backend problems even when the UI appears functional.
Common Grafana Cloud Issues and How to Identify Them
Dashboard Loading Failures
Symptoms:
- Dashboard list showing empty or timing out
- Individual dashboards stuck on loading screen
- Panels rendering with "No data" despite active metrics
- Browser console showing 502 Bad Gateway errors
- "Failed to fetch" network errors in developer tools
What it means: Dashboard loading failures typically indicate issues with Grafana's frontend API servers or the metadata database storing dashboard definitions. This differs from data source issues—the dashboard structure itself cannot be retrieved or rendered.
Diagnostic approach:
// Check browser console for specific errors
// Common patterns during outages:
// "GET https://your-stack.grafana.net/api/dashboards/uid/abc123 502"
// "TypeError: Cannot read property 'panels' of undefined"
// "Error: Timeout waiting for dashboard data"
Data Source Connection Timeouts
Common error messages:
- Data source connection error
- Query timeout exceeded (30s)
- Connection refused to Prometheus backend
- Loki gateway unreachable
- HTTP 503 Service Temporarily Unavailable
Distinguishing characteristics:
- Multiple data sources failing simultaneously (indicates platform issue)
- Single data source failing (may be your metrics pipeline)
- Intermittent failures with successful retries (rate limiting or capacity)
- Consistent failures across all queries (backend outage)
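The distinguishing rules above reduce to a few lines of triage logic. A sketch, assuming you have already probed each data source and recorded pass/fail results:

```python
def triage(results: dict) -> str:
    """Classify an outage from per-data-source probe results.

    results maps data source name -> True (healthy) / False (failing).
    All failing -> platform issue; some failing -> likely your pipeline;
    none failing -> healthy.
    """
    failures = sorted(name for name, ok in results.items() if not ok)
    if not failures:
        return "healthy"
    if len(failures) == len(results):
        return "platform-issue: all data sources failing"
    return f"pipeline-issue: only {', '.join(failures)} failing"
```

For example, `triage({"Prometheus": False, "Loki": True})` points at your own metrics pipeline rather than a Grafana Cloud outage.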
Testing data source health:
# Test from Grafana UI
# Navigate to: Configuration → Data Sources → [Your Data Source] → Save & Test
# Expected response when healthy:
# ✓ Data source is working
# During outages you'll see:
# ✗ HTTP Error 502: Bad Gateway
# ✗ Timeout: request took longer than 30s
Alerting Delays and Delivery Failures
Grafana Cloud's alerting system is often the first component affected during partial outages:
- Alert evaluation delays: Rules not being evaluated on schedule (default: every 60s)
- Notification failures: Alerts triggered but not delivered to Slack/PagerDuty/email
- State flapping: Alerts rapidly switching between OK/Alerting without actual metric changes
- Missing alert history: Alert state changes not recorded in alert history panel
Critical impact: During a production incident, if your alerts aren't firing, your team may not know there's a problem until customers report it.
Verification steps:
- Check Alert Rules page: Do rules show "Last Evaluated" timestamp updating?
- Test notification channel: Send test alert to verify delivery path
- Review alert state history: Are there gaps in evaluation timeline?
- Check Prometheus query performance: Slow queries delay alert evaluation
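Step 1 above (is "Last Evaluated" still updating?) can be automated once you know each rule's evaluation interval. A minimal staleness check in Python; the three-interval grace factor is an arbitrary assumption for illustration, not a Grafana default:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def evaluation_is_stale(last_evaluated: datetime,
                        interval_seconds: int = 60,
                        grace_factor: float = 3.0,
                        now: Optional[datetime] = None) -> bool:
    """Flag an alert rule whose last evaluation is older than a few intervals.

    With the default 60s evaluation interval, anything older than
    grace_factor * interval (here 180s) suggests evaluation is delayed.
    """
    now = now or datetime.now(timezone.utc)
    return (now - last_evaluated) > timedelta(seconds=interval_seconds * grace_factor)
```

Feed it the last-evaluation timestamps you scrape from the Alert Rules page or API, and page a human when any critical rule goes stale.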
Prometheus/Loki Ingestion Problems
Ingestion failures prevent new data from entering Grafana Cloud:
Symptoms for Prometheus metrics:
- Scrape targets showing "Down" status without infrastructure changes
- Gaps in time series data (flatlines in graphs)
- Error logs from Prometheus remote_write:
  level=warn component=remote msg="Failed to send batch" err="server returned HTTP status 503"
- Active series count dropping unexpectedly
Symptoms for Loki logs:
- Log panels showing stale data (last log entry is minutes/hours old)
- Promtail/Fluentd/Grafana Agent showing push failures:
  level=error msg="failed to push logs" status=429 error="Too Many Requests"
- Missing log lines during specific time windows
Testing ingestion:
# Prometheus remote_write expects snappy-compressed protobuf payloads,
# so you can't push a test metric with a plain-text curl. Instead, verify
# ingestion by checking the timestamp of a continuously updating series:
curl -G "https://prometheus-prod-01-eu-west-0.grafana.net/api/v1/query" \
--data-urlencode "query=timestamp(up)" \
-u "YOUR_INSTANCE_ID:YOUR_API_KEY"
# Healthy ingestion returns timestamps within the last few scrape intervals
# Send test log to Grafana Cloud Loki
curl -X POST "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/push" \
-u "YOUR_USER_ID:YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"streams":[{"stream":{"job":"test"},"values":[["'$(date +%s000000000)'","test log line"]]}]}'
Panel Rendering Issues
Visual indicators:
- Panels showing "Panel plugin not found" errors
- Graph panels rendering empty despite data being present
- Time series displaying with incorrect formatting
- Heatmap or histogram panels showing corrupted visualizations
- Annotation queries timing out
Backend vs frontend issues:
- Backend problem: Query tab in inspector shows data, but panel is blank
- Frontend problem: Query tab shows "No data" but data source test succeeds
- Plugin issue: Specific panel types fail while others work
SSO/Auth Failures
Authentication and authorization issues:
- SSO login redirect loops or timeouts
- "Invalid session" errors after successful login
- OAuth callback failures with SAML/OIDC providers
- API key authentication returning 401 Unauthorized
- Service account tokens suddenly invalid
Organizational impact:
- Entire teams locked out of Grafana during incidents
- Automation using API keys fails
- Terraform provider unable to manage Grafana resources
- Grafana OnCall unable to authenticate responders
Diagnostic commands:
# Test API key authentication
curl -H "Authorization: Bearer YOUR_API_KEY" \
"https://your-stack.grafana.net/api/org"
# Expected when healthy: {"id":1,"name":"Your Org"}
# During auth issues: {"message":"Invalid API key"}
The Real Impact When Grafana Cloud Goes Down
Visibility Gaps During Critical Incidents
The worst time for your monitoring to fail is during an active production incident:
- Blind troubleshooting: Engineers can't see metrics to diagnose root cause
- Unknown blast radius: Cannot determine which services are affected
- Remediation delays: Can't verify if fixes are working without metric visibility
- Cascading failures: Secondary issues go undetected without monitoring
Real scenario: A database outage triggers application errors. Without Grafana, your team can't see:
- Database connection pool exhaustion metrics
- Application error rate spike
- API latency degradation
- Queue backlog growth
What should be a 15-minute fix becomes a 2-hour investigation because you're troubleshooting blind, checking logs manually across dozens of servers.
Dashboard-Driven Decision Paralysis
Modern engineering teams rely on dashboards for data-driven decision making:
Incident response paralysis:
- Should we trigger failover? (Can't see replica lag metrics)
- Is the deployment safe to continue? (Can't see error rates)
- Which customers are affected? (Can't see per-tenant metrics)
- Should we scale up capacity? (Can't see resource utilization)
Executive visibility loss:
- C-level dashboards showing business KPIs go dark
- Real-time revenue tracking unavailable
- Customer experience metrics invisible
- SLA compliance monitoring disabled
Recovery impact: Even after infrastructure is fixed, if Grafana is down, you can't confidently declare "all clear" because you lack metric visibility to verify health.
Alert Blindness and Silent Failures
When Grafana's alerting system fails, your safety net disappears:
Critical alerts never fire:
- Disk space filling to 100% without notification
- Memory leaks going unnoticed until OOM kills
- SSL certificates expiring without renewal alerts
- Backup jobs failing silently for days
On-call impact:
- Engineers unaware of ongoing incidents
- PagerDuty/Slack/email integrations failing
- Alert fatigue from false positives during recovery
- Loss of trust in alerting system reliability
Compliance consequences:
- SLA violations without detection
- Security events unmonitored
- Audit trail gaps for incident response
- Failure to meet regulatory reporting requirements
NOC/SOC Operational Impact
For Network Operations Centers and Security Operations Centers relying on Grafana:
NOC consequences:
- No real-time network topology visibility
- Bandwidth utilization trending unavailable
- Infrastructure capacity planning data missing
- Vendor SLA tracking impossible
SOC consequences:
- Security event correlation disrupted
- Threat detection dashboards offline
- Incident investigation timelines incomplete
- Log aggregation for forensics unavailable
Operational workarounds:
- Manual log grepping across hundreds of servers
- SSH-ing into individual nodes to check metrics
- Writing throwaway scripts to query data sources directly
- Communicating status updates without data backing
Productivity Tax Across Engineering
Developer workflow disruption:
- Cannot validate performance of new deployments
- A/B test metrics unavailable for evaluation
- Feature flag rollout monitoring blind
- Debugging production issues without telemetry
DevOps/SRE impact:
- Kubernetes cluster metrics invisible
- Infrastructure-as-code changes deployed without verification
- Cost optimization initiatives paused (no usage metrics)
- Capacity planning decisions delayed
Estimated productivity cost: For a team of 20 engineers at $150k average salary ($75/hour), a 4-hour Grafana outage where engineers spend 25% of time working around missing monitoring = $1,500 in lost productivity, plus immeasurable impact on incident response quality.
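The arithmetic behind that estimate, for plugging in your own team size and outage length:

```python
def outage_productivity_cost(engineers, hourly_rate, outage_hours, workaround_fraction):
    """Direct cost of engineers spending a fraction of the outage on workarounds."""
    return engineers * hourly_rate * outage_hours * workaround_fraction

# The scenario above: 20 engineers at $75/hour, a 4-hour outage,
# with 25% of their time spent working around missing monitoring
cost = outage_productivity_cost(20, 75, 4, 0.25)
print(f"${cost:,.0f}")  # $1,500
```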
Diagnostic Steps and Troubleshooting Workflows
Step 1: Validate Status Page vs Reality
Don't trust status pages blindly. Perform your own verification:
#!/bin/bash
# grafana-health-check.sh
STACK_NAME="your-stack"
INSTANCE_ID="your_instance_id"
API_KEY="your_api_key"
# Check Grafana UI health
UI_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
"https://${STACK_NAME}.grafana.net/api/health")
# Check Prometheus query endpoint
PROM_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
-u "${INSTANCE_ID}:${API_KEY}" \
"https://prometheus-prod-01-eu-west-0.grafana.net/api/v1/query?query=up")
echo "Grafana UI: $UI_STATUS (expected: 200)"
echo "Prometheus: $PROM_STATUS (expected: 200)"
if [ "$UI_STATUS" != "200" ] || [ "$PROM_STATUS" != "200" ]; then
echo "❌ Grafana health check FAILED"
exit 1
else
echo "✅ Grafana health check PASSED"
fi
Step 2: Isolate Component Failures
Systematic testing to identify affected components:
Test dashboard loading:
- Try accessing the dashboard list: https://your-stack.grafana.net/dashboards
- Load a simple dashboard with few panels
- Load a complex dashboard with many data sources
Test each data source independently:
- Configuration → Data Sources → Select source → "Save & Test"
- Document which sources fail vs succeed
- Pattern of failures indicates scope (all sources = platform issue)
Test alerting pipeline:
- Create test alert rule with simple condition
- Manually trigger alert by adjusting threshold
- Verify notification delivery to test channel
- Check alert evaluation logs for errors
Test query performance:
- Open any panel's Query Inspector
- Note query execution time
- Compare to baseline (healthy state: <1s for simple queries)
- Identify slow queries that may be overloading backend
Step 3: Use Query Inspector for Backend Diagnosis
Query Inspector reveals backend health even when UI works:
Access via: Panel menu (three dots) → Inspect → Query
Key metrics to check:
- Query time: >5s indicates backend overload
- Response size: Verify data is actually returning
- Errors: Look for timeout messages or HTTP errors
Example problematic inspector output:
{
"error": "timeout: context deadline exceeded",
"status": "error",
"errorType": "timeout",
"message": "query took longer than 30s to execute"
}
Step 4: Check Data Source Health Programmatically
Automated health monitoring script:
#!/usr/bin/env python3
import requests
import sys
from datetime import datetime

GRAFANA_URL = "https://your-stack.grafana.net"
API_KEY = "your_api_key"

def check_datasource_health(datasource_name):
    """Test data source connectivity"""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    # Get datasource ID
    resp = requests.get(
        f"{GRAFANA_URL}/api/datasources/name/{datasource_name}",
        headers=headers
    )
    if resp.status_code != 200:
        return False, f"Failed to find datasource: {resp.status_code}"
    ds_id = resp.json()["id"]
    # Test datasource
    resp = requests.get(
        f"{GRAFANA_URL}/api/datasources/{ds_id}/health",
        headers=headers
    )
    if resp.status_code == 200:
        return True, "OK"
    else:
        return False, f"Health check failed: {resp.text}"

# Test critical data sources
datasources = ["Prometheus", "Loki", "Tempo"]
all_healthy = True
print(f"=== Grafana Health Check: {datetime.now()} ===\n")
for ds in datasources:
    healthy, message = check_datasource_health(ds)
    status = "✅" if healthy else "❌"
    print(f"{status} {ds}: {message}")
    if not healthy:
        all_healthy = False
sys.exit(0 if all_healthy else 1)
Step 5: Test Ingestion Pipeline
Verify metrics and logs are flowing:
# Check if recent metrics exist
# Query for a metric that's constantly updating (like timestamp)
curl -G "https://prometheus-prod-01-eu-west-0.grafana.net/api/v1/query" \
--data-urlencode "query=up" \
-u "INSTANCE:KEY" | jq '.data.result[] | .value[0]' \
| xargs -I {} date -d @{} '+%Y-%m-%d %H:%M:%S'
# Output shows timestamp of most recent metric
# If older than 2-3 minutes, ingestion may be delayed
# Check Loki log recency
curl -G "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/query" \
--data-urlencode 'query={job=~".+"}' \
--data-urlencode "limit=1" \
-u "USER:KEY" | jq '.data.result[0].values[0][0]'
# Returns nanosecond timestamp of most recent log
Resilience Strategies and Code Examples
1. Multi-Backend Data Source Routing
Implement automatic failover between Grafana Cloud and local Prometheus:
// datasource-router.js - Intelligent data source failover
const axios = require('axios');

class DataSourceRouter {
  constructor() {
    this.backends = [
      {
        name: 'grafana-cloud',
        url: 'https://prometheus-prod-01-eu-west-0.grafana.net',
        auth: { username: process.env.GRAFANA_INSTANCE_ID, password: process.env.GRAFANA_API_KEY },
        priority: 1,
        healthy: true
      },
      {
        name: 'local-prometheus',
        url: 'http://prometheus.local:9090',
        auth: null,
        priority: 2,
        healthy: true
      },
      {
        name: 'backup-victoria-metrics',
        url: 'http://victoria-metrics.local:8428',
        auth: null,
        priority: 3,
        healthy: true
      }
    ];
    this.healthCheckInterval = 30000; // 30 seconds
    this.startHealthChecks();
  }

  startHealthChecks() {
    setInterval(() => this.checkHealth(), this.healthCheckInterval);
  }

  async checkHealth() {
    for (const backend of this.backends) {
      try {
        const config = backend.auth
          ? { auth: backend.auth, timeout: 5000 }
          : { timeout: 5000 };
        await axios.get(`${backend.url}/api/v1/query?query=up`, config);
        backend.healthy = true;
      } catch (error) {
        console.error(`Backend ${backend.name} health check failed:`, error.message);
        backend.healthy = false;
      }
    }
  }

  getHealthyBackend() {
    // Return highest priority healthy backend
    const sorted = this.backends
      .filter(b => b.healthy)
      .sort((a, b) => a.priority - b.priority);
    return sorted[0] || null;
  }

  async query(promql) {
    const backend = this.getHealthyBackend();
    if (!backend) {
      throw new Error('No healthy data source backends available');
    }
    console.log(`Routing query to ${backend.name}`);
    const config = backend.auth
      ? { auth: backend.auth, timeout: 30000 }
      : { timeout: 30000 };
    try {
      const response = await axios.get(`${backend.url}/api/v1/query`, {
        ...config,
        params: { query: promql }
      });
      return response.data;
    } catch (error) {
      // Mark backend as unhealthy and retry with next
      backend.healthy = false;
      console.error(`Query failed on ${backend.name}, retrying...`);
      const nextBackend = this.getHealthyBackend();
      if (nextBackend && nextBackend.name !== backend.name) {
        return this.query(promql); // Recursive retry
      }
      throw error;
    }
  }
}

// Usage example
const router = new DataSourceRouter();

async function getMetrics() {
  try {
    const result = await router.query('rate(http_requests_total[5m])');
    return result.data.result;
  } catch (error) {
    console.error('All backends failed:', error);
    return [];
  }
}

module.exports = { DataSourceRouter };
2. Local Grafana Fallback Setup
Deploy a local Grafana instance with read-only federation from Grafana Cloud:
# docker-compose.yml - Local Grafana fallback
version: '3.8'
services:
  grafana-fallback:
    image: grafana/grafana:latest
    container_name: grafana-fallback
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_secure_password
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://grafana-fallback.local:3000
    volumes:
      - ./grafana-data:/var/lib/grafana
      - ./dashboards:/etc/grafana/provisioning/dashboards
      - ./datasources:/etc/grafana/provisioning/datasources
    restart: unless-stopped
  prometheus-local:
    image: prom/prometheus:latest
    container_name: prometheus-local
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=7d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
volumes:
  prometheus-data:
# prometheus.yml - Configure federation from Grafana Cloud
global:
  scrape_interval: 60s
  evaluation_interval: 60s
scrape_configs:
  # Federate critical metrics from Grafana Cloud
  - job_name: 'grafana-cloud-federation'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"up|node_.*|container_.*|http_.*"}' # Critical metrics
    static_configs:
      - targets:
          - 'prometheus-prod-01-eu-west-0.grafana.net'
    basic_auth:
      username: 'YOUR_INSTANCE_ID'
      password: 'YOUR_API_KEY'
    scheme: https
  # Local scrape targets (always available)
  - job_name: 'local-node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'local-cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
Datasource provisioning for fallback Grafana:
# datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus-Local
    type: prometheus
    access: proxy
    url: http://prometheus-local:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "60s"
  - name: Prometheus-GrafanaCloud
    type: prometheus
    access: proxy
    url: https://prometheus-prod-01-eu-west-0.grafana.net
    isDefault: false
    editable: false
    basicAuth: true
    basicAuthUser: YOUR_INSTANCE_ID
    secureJsonData:
      basicAuthPassword: YOUR_API_KEY
    jsonData:
      timeInterval: "60s"
3. Dashboard Backup and Local Import
Back up your Grafana Cloud dashboard definitions on a schedule and import them into your local fallback instance:
#!/bin/bash
# sync-critical-dashboards.sh - Backup Grafana dashboards locally
GRAFANA_CLOUD_URL="https://your-stack.grafana.net"
GRAFANA_API_KEY="your_api_key"
LOCAL_GRAFANA_API_KEY="your_local_grafana_api_key"
BACKUP_DIR="./dashboard-backups"
mkdir -p "$BACKUP_DIR"
# Get all dashboard UIDs
DASHBOARDS=$(curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
"$GRAFANA_CLOUD_URL/api/search?type=dash-db" | jq -r '.[].uid')
for uid in $DASHBOARDS; do
echo "Backing up dashboard: $uid"
curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
"$GRAFANA_CLOUD_URL/api/dashboards/uid/$uid" \
> "$BACKUP_DIR/${uid}.json"
done
echo "✅ Backed up $(ls -1 $BACKUP_DIR/*.json | wc -l) dashboards"
# Import to local Grafana
for dashboard in $BACKUP_DIR/*.json; do
curl -X POST -H "Content-Type: application/json" \
-H "Authorization: Bearer $LOCAL_GRAFANA_API_KEY" \
-d @"$dashboard" \
"http://localhost:3000/api/dashboards/db"
done
4. Alertmanager Redundancy
Configure dual alerting: Grafana Cloud + local Alertmanager:
# alertmanager.yml - Local Alertmanager configuration
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: '⚠️ {{ .GroupLabels.alertname }}'
# prometheus-alerts.yml - Critical alerts that must always work
groups:
  - name: grafana_cloud_health
    interval: 60s
    rules:
      - alert: GrafanaCloudDown
        expr: up{job="grafana-cloud-health-check"} == 0
        for: 5m
        labels:
          severity: critical
          component: grafana-cloud
        annotations:
          summary: "Grafana Cloud is unreachable"
          description: "Grafana Cloud has been down for 5 minutes. Failing over to local monitoring."
      - alert: GrafanaCloudHighLatency
        expr: probe_duration_seconds{job="grafana-cloud-health-check"} > 5
        for: 10m
        labels:
          severity: warning
          component: grafana-cloud
        annotations:
          summary: "Grafana Cloud experiencing high latency"
          description: "Query latency is {{ $value }}s (threshold: 5s)"
      - alert: PrometheusIngestionDelay
        expr: time() - timestamp(up{job="your-app"}) > 300
        for: 5m
        labels:
          severity: critical
          component: metrics-ingestion
        annotations:
          summary: "Prometheus metrics ingestion delayed"
          description: "Last metric received {{ $value }}s ago (threshold: 300s)"
5. Intelligent Alert Routing with Fallback
Route alerts through multiple channels with automatic failover:
// alert-router.go - Multi-channel alert delivery with fallback
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type AlertChannel struct {
	Name     string
	URL      string
	Priority int
	Timeout  time.Duration
}

type AlertRouter struct {
	Channels []AlertChannel
}

func NewAlertRouter() *AlertRouter {
	return &AlertRouter{
		Channels: []AlertChannel{
			{
				Name:     "grafana-cloud-alerting",
				URL:      "https://your-stack.grafana.net/api/alertmanager/grafana/api/v1/alerts",
				Priority: 1,
				Timeout:  5 * time.Second,
			},
			{
				Name:     "local-alertmanager",
				URL:      "http://localhost:9093/api/v1/alerts",
				Priority: 2,
				Timeout:  5 * time.Second,
			},
			{
				Name:     "pagerduty-direct",
				URL:      "https://events.pagerduty.com/v2/enqueue",
				Priority: 3,
				Timeout:  10 * time.Second,
			},
		},
	}
}

func (r *AlertRouter) SendAlert(alert map[string]interface{}) error {
	var lastErr error
	for _, channel := range r.Channels {
		err := r.sendToChannel(channel, alert)
		if err == nil {
			fmt.Printf("✅ Alert delivered via %s\n", channel.Name)
			return nil
		}
		fmt.Printf("❌ Failed to deliver via %s: %v\n", channel.Name, err)
		lastErr = err
		// Try next channel
	}
	return fmt.Errorf("all alert channels failed, last error: %w", lastErr)
}

func (r *AlertRouter) sendToChannel(channel AlertChannel, alert map[string]interface{}) error {
	payload, err := json.Marshal(alert)
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: channel.Timeout}
	resp, err := client.Post(channel.URL, "application/json", bytes.NewBuffer(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusAccepted {
		return fmt.Errorf("unexpected status code: %d", resp.StatusCode)
	}
	return nil
}

func main() {
	router := NewAlertRouter()
	alert := map[string]interface{}{
		"labels": map[string]string{
			"alertname": "HighCPUUsage",
			"severity":  "warning",
		},
		"annotations": map[string]string{
			"summary":     "CPU usage above 80%",
			"description": "Host server-01 CPU usage is 85%",
		},
	}
	err := router.SendAlert(alert)
	if err != nil {
		fmt.Printf("⚠️ Alert delivery failed: %v\n", err)
	}
}
What to Do When Grafana Cloud Goes Down
1. Activate Your Fallback Grafana Instance
If you've implemented the local Grafana setup described earlier:
# Start local Grafana and Prometheus
cd ~/grafana-fallback
docker-compose up -d
# Verify services are running
docker-compose ps
# Access local Grafana
open http://localhost:3000
# Update DNS/load balancer to point to fallback
# Or update bookmarks to use local instance during outage
2. Query Data Sources Directly
Bypass Grafana entirely for critical queries:
# Get current CPU usage across all hosts
curl -G "https://prometheus-prod-01-eu-west-0.grafana.net/api/v1/query" \
--data-urlencode 'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)' \
-u "INSTANCE:KEY" | jq '.data.result[] | {host: .metric.instance, cpu_usage: .value[1]}'
# Search logs for errors in last hour
curl -G "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/query_range" \
--data-urlencode 'query={job="app"} |= "ERROR"' \
--data-urlencode "start=$(date -u -d '1 hour ago' '+%s')000000000" \
--data-urlencode "end=$(date -u '+%s')000000000" \
-u "USER:KEY" | jq '.data.result[] | .values[]'
3. Implement Read-Only Mode Messaging
Communicate status to your team:
// status-banner.js - Add banner to internal tools
function checkGrafanaHealth() {
  fetch('https://your-stack.grafana.net/api/health')
    .then(response => {
      if (!response.ok) {
        showBanner('⚠️ Grafana Cloud is experiencing issues. Using fallback monitoring.');
      }
    })
    .catch(() => {
      showBanner('❌ Grafana Cloud is down. Switch to http://grafana-fallback.local:3000');
    });
}

function showBanner(message) {
  const banner = document.createElement('div');
  banner.style.cssText = 'position:fixed;top:0;width:100%;background:#f44336;color:white;padding:10px;text-align:center;z-index:9999;';
  banner.textContent = message;
  document.body.prepend(banner);
}

// Check every 5 minutes
setInterval(checkGrafanaHealth, 300000);
checkGrafanaHealth();
4. Export Critical Dashboards for Manual Review
Create static HTML snapshots of key dashboards:
#!/usr/bin/env python3
# export-dashboard-snapshot.py
import requests
from datetime import datetime

GRAFANA_URL = "https://your-stack.grafana.net"
API_KEY = "your_api_key"
CRITICAL_DASHBOARDS = [
    "production-overview",
    "kubernetes-cluster-health",
    "application-performance"
]

def create_snapshot(dashboard_uid):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    # Get dashboard
    resp = requests.get(
        f"{GRAFANA_URL}/api/dashboards/uid/{dashboard_uid}",
        headers=headers
    )
    dashboard = resp.json()['dashboard']
    # Create snapshot
    snapshot_data = {
        "dashboard": dashboard,
        "expires": 86400,  # 24 hours
        "external": True
    }
    resp = requests.post(
        f"{GRAFANA_URL}/api/snapshots",
        headers=headers,
        json=snapshot_data
    )
    return resp.json()['url']

print(f"Creating snapshots at {datetime.now()}")
for dashboard in CRITICAL_DASHBOARDS:
    try:
        url = create_snapshot(dashboard)
        print(f"✅ {dashboard}: {url}")
    except Exception as e:
        print(f"❌ {dashboard}: {e}")
5. Set Up Automated Failover Monitoring
Continuously monitor Grafana health and take action:
#!/bin/bash
# grafana-watchdog.sh - Auto-failover on Grafana outage
GRAFANA_URL="https://your-stack.grafana.net"
FALLBACK_URL="http://localhost:3000"
CHECK_INTERVAL=60

while true; do
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GRAFANA_URL/api/health")
  if [ "$HTTP_CODE" != "200" ]; then
    echo "❌ Grafana Cloud down (HTTP $HTTP_CODE) - Starting failover..."
    # Start local Grafana if not running
    docker-compose -f ~/grafana-fallback/docker-compose.yml up -d
    # Send alert
    curl -X POST "YOUR_SLACK_WEBHOOK" \
      -H "Content-Type: application/json" \
      -d "{\"text\":\"🚨 Grafana Cloud is down. Failover activated: $FALLBACK_URL\"}"
    # Wait longer before next check during outage
    sleep 300
  else
    echo "✅ Grafana Cloud healthy"
    sleep $CHECK_INTERVAL
  fi
done
6. Maintain Local Metrics During Outage
Ensure you don't lose data during Grafana Cloud downtime:
# prometheus.yml - Dual remote write (cloud + local)
remote_write:
  # Primary: Grafana Cloud
  - url: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
    basic_auth:
      username: YOUR_INSTANCE_ID
      password: YOUR_API_KEY
    queue_config:
      capacity: 10000
      max_shards: 10
      max_samples_per_send: 1000
    # Keep all series; narrow this relabel config to filter what you send
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '.*'
        action: keep
  # Backup: Local VictoriaMetrics or Prometheus
  - url: http://victoria-metrics:8428/api/v1/write
    queue_config:
      capacity: 5000
      max_shards: 5
This ensures metrics continue to be stored locally even when Grafana Cloud is unavailable, preventing data loss.
Frequently Asked Questions
How often does Grafana Cloud go down?
Grafana Cloud maintains high availability with typical uptime exceeding 99.9% annually. Major platform-wide outages affecting all customers are rare (1-3 per year), though individual components like specific regions or services may experience brief degradation more frequently. Most teams experience zero impactful downtime in a typical quarter. However, degraded performance (slow queries, dashboard loading delays) can occur during peak usage or when handling very large datasets.
What's the difference between Grafana status page and API Status Check?
The official Grafana status page (status.grafana.com) is manually updated by Grafana Labs' operations team during incidents, which can lag behind actual issues by 5-15 minutes. API Status Check performs automated health checks every 60 seconds against live Grafana Cloud API endpoints, dashboard loading, and data source connectivity, often detecting degradation before it's officially reported. Use both: API Status Check for immediate detection, and the status page for official incident communications and estimated resolution times.
Can I get SLA credits for Grafana Cloud downtime?
Grafana Cloud's SLA varies by plan tier. Free tier has no SLA guarantees. Pro and Advanced plans typically include 99.9% uptime SLAs with monthly credits for downtime exceeding thresholds (usually 5-10% credit per 0.1% below SLA). Enterprise plans often negotiate custom SLAs up to 99.99% with more generous credit terms. Check your specific plan details at grafana.com/legal or contact Grafana support to file an SLA credit claim after significant outages.
Should I use Grafana alerts or a separate alerting system?
For production-critical alerts, implement a defense-in-depth strategy: use Grafana Cloud alerting as primary (integrated with your dashboards and data sources), but maintain a secondary independent alerting system (local Alertmanager, Prometheus, or direct integration with PagerDuty/Opsgenie). This ensures alerts fire even if Grafana Cloud is down. Critical infrastructure alerts (disk space, memory, core services) should never rely solely on a cloud SaaS platform for delivery.
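As a sketch of that secondary path, the rule below would be evaluated by an on-prem Prometheus and routed through a local Alertmanager, firing even if Grafana Cloud is completely down. The threshold, labels, and rule name are illustrative, not a prescribed standard:

```yaml
# local-critical-rules.yml - evaluated by a local Prometheus,
# delivered via a local Alertmanager, independent of Grafana Cloud
groups:
  - name: critical-infrastructure
    rules:
      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }}"
```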
How do I prevent duplicate alerts during Grafana Cloud outages and recovery?
Configure alert grouping with group_wait and group_interval in Alertmanager to batch related alerts. Use repeat_interval to prevent alert spam. During recovery, implement exponential backoff in your alert routing logic. Example: Set group_wait: 30s to allow alerts to batch, group_interval: 5m to limit notification frequency, and repeat_interval: 4h to avoid re-sending the same alert. For webhook receivers, implement idempotency using alert fingerprints to deduplicate at the receiving end.
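The grouping knobs described above map directly onto an Alertmanager routing tree. A minimal sketch (receiver name and webhook URL are placeholders):

```yaml
# alertmanager.yml - grouping and repeat settings to limit duplicate alerts
route:
  receiver: ops-webhook
  group_by: ['alertname', 'cluster']
  group_wait: 30s        # wait to batch related alerts that fire together
  group_interval: 5m     # minimum gap between notifications for one group
  repeat_interval: 4h    # don't re-send an unchanged alert sooner than this

receivers:
  - name: ops-webhook
    webhook_configs:
      - url: http://alert-receiver.internal/hooks/alerts
        # Deduplicate downstream using the fingerprint field in the payload
```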
What regions is Grafana Cloud available in?
Grafana Cloud operates in multiple regions globally: US (us-east, us-west), Europe (eu-west, eu-central), Asia Pacific (ap-southeast), and South America (sa-east). When signing up, you select a primary region for your stack. Data residency and compliance requirements may dictate region choice (e.g., GDPR requires EU region for EU customer data). Outages typically affect single regions rather than all of Grafana Cloud globally, though severe incidents can have multi-region impact.
How do I monitor Grafana Cloud availability from outside Grafana?
Implement external synthetic monitoring using tools like API Status Check, Pingdom, Datadog Synthetics, or custom scripts. Test these endpoints: (1) Dashboard loading: https://your-stack.grafana.net/api/health, (2) Prometheus queries: /api/v1/query, (3) Loki log searches: /loki/api/v1/query, (4) Alert rule evaluation. Deploy monitors from multiple geographic locations to detect regional issues. Set up alerts to PagerDuty or Slack independent of Grafana's notification system.
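A minimal external probe along these lines can be a plain shell script run from any location outside your Grafana stack. The stack URL and endpoint paths are placeholders, and the status classifier is a simple illustration:

```shell
#!/bin/sh
# Map a curl HTTP status code to a health verdict
classify_status() {
  case "$1" in
    200) echo "healthy" ;;
    000) echo "unreachable" ;;  # curl reports 000 when the connection fails
    5*)  echo "server-error" ;;
    *)   echo "degraded" ;;
  esac
}

# Probe key endpoints from an external location (hypothetical stack URL)
STACK="https://your-stack.grafana.net"
for path in /api/health; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$STACK$path")
  echo "$path -> $(classify_status "$code")"
done
```

Run it from cron on hosts in two or more regions and feed the verdicts to a notifier that does not depend on Grafana Cloud.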
Can I migrate dashboards between Grafana Cloud and self-hosted?
Yes, dashboards are fully portable. Export from Grafana Cloud: Dashboard settings → JSON Model → Copy. Import to self-hosted: Create Dashboard → Import → Paste JSON. For bulk migration, use Grafana API or tools like grafana-backup-tool. Note: Data source references must be updated to match target environment. For IaC approach, use Terraform's Grafana provider to manage dashboards as code, enabling deployment to multiple Grafana instances.
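The same export/import flow can be scripted against Grafana's HTTP API (`GET /api/dashboards/uid/{uid}` to export, `POST /api/dashboards/db` to import). A sketch with placeholder UID, tokens, and hostnames:

```shell
#!/bin/sh
# Wrap an exported dashboard's JSON model into the import API payload
wrap_payload() {
  printf '{"dashboard": %s, "overwrite": true, "folderId": 0}' "$1"
}

# Export from Grafana Cloud (UID and tokens are placeholders):
#   curl -s -H "Authorization: Bearer $CLOUD_TOKEN" \
#     "https://your-stack.grafana.net/api/dashboards/uid/abc123" \
#     | jq '.dashboard | .id = null' > dash.json
# Import into self-hosted (a null id lets Grafana create or match by UID):
#   curl -s -X POST -H "Content-Type: application/json" \
#     -H "Authorization: Bearer $LOCAL_TOKEN" \
#     -d "$(wrap_payload "$(cat dash.json)")" \
#     http://localhost:3000/api/dashboards/db
wrap_payload '{"title": "demo"}'
```

Remember to remap data source references in the JSON before importing, as noted above.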
What happens to my metrics data during a Grafana Cloud outage?
Metrics collection continues unaffected—your Prometheus exporters, agents, and remote_write configurations keep pushing data to Grafana Cloud's ingestion endpoints. During outages, ingestion may be delayed or queued, but data is typically not lost unless the outage is catastrophic (extremely rare). Query access is what's impacted: you cannot view, alert on, or analyze metrics until service is restored. Implement dual remote_write to local storage to ensure zero data loss during any outage.
Should I use Grafana Cloud for business-critical monitoring?
Grafana Cloud is suitable for business-critical monitoring when combined with proper resilience strategies: (1) Implement local fallback Grafana with federated metrics, (2) Use dual alerting paths (cloud + local), (3) Deploy synthetic monitoring of Grafana Cloud itself, (4) Maintain dashboard backups, (5) Configure dual remote_write for metrics. For ultra-high-availability requirements (99.99%+), consider hybrid architecture: Grafana Cloud for primary with on-prem Prometheus cluster for guaranteed local access. Many Fortune 500 companies successfully use Grafana Cloud for critical systems with these patterns.
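The "federated metrics" pattern in point (1) can be sketched as a local fallback Prometheus pulling a curated subset of series from your primary servers via the `/federate` endpoint. Job names, match rules, and the target hostname below are illustrative:

```yaml
# Local fallback Prometheus scraping a subset of series via federation
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'             # host-level metrics
        - '{__name__=~"up|job:.*"}'  # liveness and recording rules
    static_configs:
      - targets:
          - 'prometheus-primary:9090'
```

Federating only aggregated and critical series keeps the fallback instance small while preserving local query access during a cloud outage.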
Stay Ahead of Grafana Cloud Outages
Don't let monitoring downtime leave you blind during critical incidents. Subscribe to real-time Grafana Cloud alerts and get notified instantly when issues are detected—before your dashboards stop loading.
API Status Check monitors Grafana Cloud 24/7 with:
- 60-second health checks across dashboard API, Prometheus, Loki, and Tempo
- Instant alerts via email, Slack, Discord, or webhook when degradation is detected
- Historical uptime tracking and detailed incident reports
- Multi-region monitoring to detect regional vs global outages
- Ingestion pipeline monitoring to catch delays before data gaps appear
Related Monitoring Guides:
- Is Sentry Down? - Error tracking and performance monitoring status
- Is PagerDuty Down? - Incident response and on-call platform monitoring
Start monitoring Grafana Cloud now →
Last updated: February 4, 2026. Grafana Cloud status information is provided in real-time based on active monitoring. For official incident reports, always refer to status.grafana.com.