Grafana Cloud Outage History

Past incidents and downtime events

Complete history of Grafana Cloud outages, incidents, and service disruptions. Showing the 50 most recent incidents.

March 2026 (21 incidents)

major | resolved | Mar 20, 03:00 PM — Resolved Mar 20, 03:41 PM

Authentication API Database Down in prod-eu-west-2 and prod-eu-west-4

3 updates

resolved | Mar 20, 03:41 PM

This incident has been resolved.

investigating | Mar 20, 03:08 PM

We have observed impact in prod-eu-west-4 as well.

investigating | Mar 20, 03:00 PM

We are currently investigating an issue impacting the main database for Authentication APIs in the prod-eu-west-2 region. Writes are currently failing, but reads are operational.

major | resolved | Mar 19, 04:46 PM — Resolved Mar 19, 06:44 PM

Various Datasource Issues

5 updates

resolved | Mar 19, 06:44 PM

This incident has been resolved.

monitoring | Mar 19, 05:56 PM

We are continuing to monitor for any further issues.

monitoring | Mar 19, 05:56 PM

We have observed recovery for the CloudWatch Datasource. We are now seeing failures for the following datasources: Aurora, OpenSearch, X-Ray, Timestream, Redshift, and SiteWise. A fix for the above is being rolled out now, and we will monitor progress. We will also change the name of this incident from "Cloudwatch Datasource Issues" to "Various Datasource Issues" to more accurately reflect impact.

monitoring | Mar 19, 05:13 PM

We have identified the issue, and are rolling out the fix. We are already seeing improvements and will continue to monitor progress.

investigating | Mar 19, 04:46 PM

We are currently investigating an issue impacting the CloudWatch Datasource, causing failures.

major | resolved | Mar 19, 11:17 AM — Resolved Mar 19, 06:11 PM

Degraded performance of Grafana Cloud k6 test runs

2 updates

resolved | Mar 19, 06:11 PM

Our engineering team has deployed a fix, and we have observed a continued period of recovery. At this time, we are considering this issue resolved. No further updates.

investigating | Mar 19, 11:17 AM

Some customers are seeing degraded performance and errors from certain v6 API endpoints. We are investigating the issue.

minor | resolved | Mar 13, 10:28 AM — Resolved Mar 18, 07:13 AM

Grafana Cloud Logs - Write degradation in Azure Netherlands (eu-west-3)

3 updates

resolved | Mar 18, 07:13 AM

We have been observing stability for a period of time and will mark the incident as resolved at this time.

investigating | Mar 13, 09:22 PM

We are continuing to investigate this issue with our CSP, and will provide updates as they become available.

investigating | Mar 13, 10:28 AM

We are seeing issues on the write path for Loki in cluster Azure Netherlands (eu-west-3). The impact will appear as degraded log ingestion on that cluster. Our engineering team is already working on restoring the service.

major | resolved | Mar 11, 05:10 PM — Resolved Mar 13, 06:15 PM

Rule Evaluation Outage in prod-us-west-0

3 updates

resolved | Mar 13, 06:15 PM

This incident has been resolved.

monitoring | Mar 11, 06:02 PM

A fix has been implemented and we are monitoring the results.

investigating | Mar 11, 05:10 PM

We are currently investigating an issue impacting rule evaluation for a subset of customers in the prod-us-west-0 region. We will provide updates as they become available.

major | resolved | Mar 13, 07:41 AM — Resolved Mar 13, 06:11 PM

Increased number of Aborted-by-System runs with k6 binary build errors

4 updates

resolved | Mar 13, 06:11 PM

This incident has been resolved.

monitoring | Mar 13, 12:49 PM

A fix has been implemented and we are monitoring the results.

identified | Mar 13, 08:45 AM

The issue has been identified and a fix is being implemented.

investigating | Mar 13, 07:41 AM

We are seeing an increased number of Aborted-by-System runs with a k6 binary build error, and we are investigating the issue. The first occurrence of this happened back on March 9; it has now been identified as a blocking issue for some customers.

minor | resolved | Mar 11, 08:31 AM — Resolved Mar 12, 01:18 PM

Grafana Cloud Logs - Write degradation in Azure Netherlands (eu-west-3)

3 updates

resolved | Mar 12, 01:18 PM

This incident has been resolved.

investigating | Mar 11, 09:13 AM

We are also reporting impact to Faro performance in the same region. We are continuing to investigate this issue.

investigating | Mar 11, 08:31 AM

We are seeing issues on the write path for Loki in cluster Azure Netherlands (eu-west-3). The impact will appear as degraded log ingestion on that cluster. Our engineering team is already working on restoring the service.

major | resolved | Mar 10, 06:00 PM — Resolved Mar 11, 09:48 PM

Some Write Failures in prod-eu-west-3

5 updates

resolved | Mar 11, 09:48 PM

This incident has been resolved.

monitoring | Mar 11, 03:51 PM

Things have been stable, and we have a potential mitigation should this issue arise again. We are monitoring the issue in the meantime.

identified | Mar 11, 01:35 AM

We are still seeing intermittent, elevated transient write failures. We will continue to provide additional updates as more information becomes available.

monitoring | Mar 10, 06:42 PM

A fix has been implemented, and we are monitoring.

investigating | Mar 10, 06:00 PM

We are currently investigating an issue impacting a subset of users in the prod-eu-west-3 region. Impacted users are experiencing elevated transient write failures, with no degradation to the read path.

minor | resolved | Mar 9, 06:03 PM — Resolved Mar 10, 09:17 PM

Metrics write path outage in prod-us-central-0 and prod-us-central-5

2 updates

resolved | Mar 10, 09:17 PM

This incident has been resolved.

monitoring | Mar 9, 06:03 PM

From 15:30 to 15:45 UTC and from 16:53 to 17:03 UTC, the prod-us-central-0 and prod-us-central-5 regions saw elevated latency and error rates on the write path. We're monitoring now.

minor | resolved | Mar 9, 02:20 PM — Resolved Mar 10, 08:54 PM

Fleet Management Elevated Rate of Errors

3 updates

resolved | Mar 10, 08:54 PM

This incident has been resolved.

investigating | Mar 10, 06:11 PM

Our engineering team continues to work towards a resolution for this issue.

investigating | Mar 9, 02:20 PM

Some users in prod-us-central-0 may be seeing an elevated rate of errors when fetching configurations. Our engineers are currently investigating this issue.

minor | resolved | Mar 10, 03:26 PM — Resolved Mar 10, 08:39 PM

Service degradation on Logs Read path in AWS US West (us-west-0)

2 updates

resolved | Mar 10, 08:39 PM

This incident has been resolved.

identified | Mar 10, 03:26 PM

There has been a recurrence of the issues on the Read path of Loki services on AWS US West since yesterday, March 9, around 17:15 UTC. The issue has been identified, and resolution steps have been taken to restore full service. We are currently monitoring the service status. The impact includes timeouts and 5xx errors when querying logs for customers on this cluster.

major | resolved | Mar 10, 06:06 PM — Resolved Mar 10, 07:17 PM

Various Issues with HG Pages

2 updates

resolved | Mar 10, 07:17 PM

This incident has been resolved.

investigating | Mar 10, 06:06 PM

We are noticing issues with various HG pages. Our engineering team is actively looking into it.

none | resolved | Mar 7, 08:07 PM — Resolved Mar 9, 08:59 AM

Outage for prod-eu-central-0 due to AWS S3 outage

5 updates

resolved | Mar 9, 08:59 AM

This incident has been resolved.

monitoring | Mar 8, 11:30 AM

Since about 20:03 UTC we have seen AWS S3 recover, and our services are recovering as well; we are monitoring.

investigating | Mar 7, 08:10 PM

Since about 20:03 UTC we have seen AWS S3 recover, and our services are recovering as well; we are monitoring.

investigating | Mar 7, 08:10 PM

We are continuing to investigate this issue.

investigating | Mar 7, 08:07 PM

We are seeing elevated error rates and outages across many of our services in prod-eu-central-0, due to an ongoing AWS S3 outage in that region.

minor | resolved | Mar 8, 02:17 PM — Resolved Mar 8, 08:31 PM

Service degradation on Logs Read path in AWS US West (us-west-0)

3 updates

resolved | Mar 8, 08:31 PM

We have observed a continued period of stability since 19:40 UTC. At this time, we are considering this issue resolved.

monitoring | Mar 8, 06:29 PM

Since 16:35 UTC we have experienced stability and services are recovering. We are actively monitoring and working to fully stabilize.

investigating | Mar 8, 02:17 PM

Our engineering team is investigating issues on the read path of Loki services on AWS US West since today around 13:25 UTC. These issues can cause timeouts and 5xx errors when querying logs for customers on the cluster. The team is currently working to restore the service.

critical | resolved | Mar 6, 03:03 PM — Resolved Mar 6, 04:31 PM

Some Grafana Instances Unavailable

2 updates

resolved | Mar 6, 04:31 PM

This incident has been resolved.

identified | Mar 6, 03:03 PM

We have identified an issue which is causing some instances to become unavailable. Our engineering team is actively working on mitigating the issue. We will continue to share updates as they become available.

major | resolved | Mar 5, 10:27 PM — Resolved Mar 5, 11:36 PM

Write failures in prod-eu-west-0

3 updates

resolved | Mar 5, 11:36 PM

We have observed a continued period of recovery. At this time, we are considering this issue resolved. No further updates.

monitoring | Mar 5, 10:41 PM

Engineering has released a fix and as of 22:00 UTC, customers should no longer experience write failures and delays in rule evaluation. We will continue to monitor for recurrence and provide updates accordingly.

investigating | Mar 5, 10:27 PM

An incident affecting the data write path and rule execution within prod-eu-west-0 began at ~21:05 UTC on March 5, 2026. Customers with instances in this region may experience write failures and delays in rule evaluation. Engineering is actively engaged and assessing the issue. We will provide updates accordingly.

minor | resolved | Mar 3, 12:07 PM — Resolved Mar 5, 06:31 PM

Grafana Cloud Logs - Write degradation in Azure Netherlands (eu-west-3)

5 updates

resolved | Mar 5, 06:31 PM

This incident has been resolved.

identified | Mar 4, 10:41 PM

We continue to monitor mitigation efforts and work with our CSP.

identified | Mar 3, 10:19 PM

The impact has been reduced to slight intermittency. We continue to work with our CSP to reach a complete resolution.

investigating | Mar 3, 02:15 PM

Since 11:55 UTC today, we have been seeing issues on the write path for Loki in cluster Azure Netherlands (eu-west-3). The impact will appear as degraded log ingestion on that cluster. We are also reporting impact to Faro performance in the same region. Our engineering team is already working on restoring the service.

investigating | Mar 3, 12:07 PM

Since 11:55 UTC today, we have been seeing issues on the write path for Loki in cluster Azure Netherlands (eu-west-3). The impact will appear as degraded log ingestion on that cluster. Our engineering team is already working on restoring the service.

minor | resolved | Mar 4, 07:47 AM — Resolved Mar 4, 09:29 AM

Elevated rate of errors for Fleet Management in prod-us-central-0

3 updates

resolved | Mar 4, 09:29 AM

This incident has been resolved.

monitoring | Mar 4, 08:46 AM

A fix has been implemented and we are monitoring the results.

investigating | Mar 4, 07:47 AM

We are currently experiencing an issue with Fleet Management in prod-us-central-0. Users in prod-us-central-0 may observe an elevated rate of errors when fetching configurations.

none | resolved | Mar 3, 01:00 PM — Resolved Mar 3, 01:00 PM

Test Run Browser Screenshot Upload Failing

1 update

resolved | Mar 3, 06:35 PM

Test run browser screenshot upload experienced failures from 13:12 to 14:51 UTC. The issue has been resolved.

critical | resolved | Mar 2, 07:37 AM — Resolved Mar 2, 03:48 PM

Write outage for logs in prod-eu-west-3

3 updates

resolved | Mar 2, 03:48 PM

This incident has been resolved.

investigating | Mar 2, 08:08 AM

We are now experiencing a write outage for logs in prod-eu-west-3. Our Engineering team is aware and currently investigating this. We will provide further updates accordingly.

investigating | Mar 2, 07:37 AM

We are experiencing increased write latency for logs in prod-eu-west-3. Our Engineering team is aware and currently investigating this. We will provide further updates accordingly.

critical | investigating | Mar 2, 06:43 AM

Complete outage in prod-me-central-1

10 updates

investigating | Mar 19, 12:13 PM

We have not received any further updates from AWS at this time. However, we are actively monitoring the outage and will provide additional information as it becomes available. Please also continue to refer to the AWS status page (https://health.aws.amazon.com/health/status) for more detailed updates. All the guidance previously included about stack migration is still relevant. Please reach out to our Support team if you have any questions.

investigating | Mar 4, 10:22 PM

We are actively monitoring the situation, but at this time there are no new updates to share. We will provide the next update once we have more information. Please reach out to our Support team if you have any questions.

investigating | Mar 4, 10:28 AM

We are continuing to investigate this issue.

investigating | Mar 2, 10:18 PM

Please continue to refer to the AWS status page (https://health.aws.amazon.com/health/status) for more detailed updates specific to AWS. AWS are recommending that affected customers move workloads to alternate regions, and we are recommending the same. Customers who are impacted and who cannot wait for a restoration of service are asked to:

1. Create a Grafana Cloud stack in an alternate region.
2. Update clients to send telemetry to the new region; if you are using Grafana Alloy, you can use Fleet Management: https://grafana.com/docs/grafana-cloud/send-data/fleet-management/introduction/ (see the configuration sketch after this update).
3. If your instance remains available and you have not configured your dashboards as code, you may be able to use `grafanactl` to migrate dashboards: https://grafana.com/docs/grafana/latest/as-code/observability-as-code/grafana-cli/grafanacli-workflows/ and https://grafana.github.io/grafanactl/

We are continuing to work with our CSP at this time, and will provide updates as they are available.
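For step 2, the client-side change usually amounts to pointing the remote-write endpoint at the new stack. The snippet below is a minimal, illustrative Grafana Alloy configuration sketch, not part of the official guidance above: the push URL, the numeric instance ID, and the GCLOUD_RW_API_KEY environment variable are placeholders that would come from the new stack's connection details.

```alloy
// Minimal sketch: forward metrics from this Alloy instance to a new
// Grafana Cloud stack in an alternate region. All values are placeholders.
prometheus.remote_write "new_region" {
  endpoint {
    // Push URL shown in the new stack's Prometheus connection details.
    url = "https://prometheus-prod-XX-prod-us-east-0.grafana.net/api/prom/push"

    basic_auth {
      // Numeric instance ID of the new hosted-metrics instance.
      username = "123456"
      // Token with metrics write scope, read from the environment
      // rather than hard-coded.
      password = sys.env("GCLOUD_RW_API_KEY")
    }
  }
}

// Example scrape job forwarding into the remote write defined above.
prometheus.scrape "self" {
  targets    = [{"__address__" = "localhost:12345"}]
  forward_to = [prometheus.remote_write.new_region.receiver]
}
```

If the affected fleet is managed through Fleet Management, a change like this can be rolled out as a remote configuration pipeline rather than edited on each host.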

investigating | Mar 2, 10:31 AM

AWS are recommending that affected customers move workloads to alternate regions (https://health.aws.amazon.com/health/status), and we are recommending the same. Customers who are impacted and who cannot wait for a restoration of service are asked to:

1. Create a Grafana Cloud stack in an alternate region.
2. Update clients to send telemetry to the new region; if you are using Grafana Alloy, you can use Fleet Management: https://grafana.com/docs/grafana-cloud/send-data/fleet-management/introduction/
3. If your instance remains available and you have not configured your dashboards as code, you may be able to use `grafanactl` to migrate dashboards: https://grafana.com/docs/grafana/latest/as-code/observability-as-code/grafana-cli/grafanacli-workflows/ and https://grafana.github.io/grafanactl/

We will provide updates when we have them, but we do not have an expected resolution time at this point.

investigating | Mar 2, 10:04 AM

Customers are recommended to configure a new blank stack in an alternative Grafana Cloud region and to reconfigure their clients (such as Grafana Alloy) to send telemetry to that region. Fleet Management can be used for this purpose: https://grafana.com/docs/grafana-cloud/send-data/fleet-management/introduction/

investigating | Mar 2, 08:36 AM

We are updating this incident to reflect a complete outage in prod-me-central-1, due to an ongoing AWS UAE data center issue. We will provide further updates accordingly.

investigating | Mar 2, 08:21 AM

We are observing write and read outage errors across all databases (metrics, logs, traces) in prod-me-central-1, due to an ongoing AWS UAE data center issue. We will provide further updates accordingly.

investigating | Mar 2, 08:14 AM

We are observing write and read outage errors across all databases (metrics, logs, traces) in prod-me-central-1, due to an ongoing AWS UAE data center issue. We will provide further updates accordingly.

investigating | Mar 2, 06:43 AM

We are seeing elevated write and read path errors in prod-me-central-1, due to an ongoing AWS UAE data center issue. We will provide further updates accordingly.

February 2026 (29 incidents)

minor | resolved | Feb 25, 07:54 PM — Resolved Mar 17, 06:22 PM

Grafana Cloud Metrics - Intermittent Write Latency in prod-us-central-0, prod-us-central-5, and prod-eu-west-0

8 updates

resolved | Mar 17, 06:22 PM

This incident is now resolved. During the incident, the Cloud Metrics platform experienced intermittent latency spikes communicating with a backend cloud service in the prod-us-central-0 and prod-us-central-5 regions, and the internal CSP-facing issue was escalated to a P1. After determining that the scope of the latency spikes was limited to only one availability zone, the team mitigated the situation by migrating all write traffic to the single, nearly unaffected availability zone. As the CSP service team attempted to remedy the situation, it became worse and began affecting the previously unaffected zone, so another mitigation path was needed. A change switching Cloud Metrics to a different connection method was deployed to all environments, stabilizing the write path once again, as we found the new connection method to be more reliable and not affected by these increases in latency. We have migrated all tenants back to multi-zone write paths and are confident in the current method of connectivity to the backend cloud service, which is the one we migrated to during the course of the incident. We have no plans to return to the previous, problematic connectivity method for the foreseeable future.

monitoring | Mar 6, 09:44 PM

We are rolling out a mitigation across the environments in these regions, and preemptively where possible to ensure it doesn’t spread elsewhere.

monitoring | Mar 6, 08:53 PM

We have seen an increase in latency in our cloud provider's services, and are rolling out a change to mitigate the issue. We are monitoring.

monitoring | Mar 5, 10:22 PM

We are continuing to investigate this issue alongside the CSP, and have taken steps to escalate through the appropriate channels. The mitigation in place continues to work as expected, and any notable updates will continue to be shared here for tracking.

monitoring | Feb 27, 10:05 PM

We are continuing to investigate this issue alongside the CSP. Any notable updates will continue to be shared here for tracking.

monitoring | Feb 27, 02:55 PM

We've put a mitigation in place and are continuing to monitor and investigate this issue.

investigating | Feb 26, 04:23 PM

We have begun rolling out mitigation steps to reduce write latency in the prod-us-central-0 and prod-us-central-5 regions. While these measures are expected to improve performance, we are continuing to investigate the underlying root cause of the issue. We will provide additional updates as more information becomes available.

investigating | Feb 25, 07:54 PM

Since February 19, we have been investigating an intermittent issue causing increased write latency in the prod-us-central-0 and prod-us-central-5 regions. The issue does not affect all traffic but may result in delayed write operations for some customers. Our engineering team is actively working to identify the root cause and stabilize performance. We will share additional updates as progress is made.

minor | resolved | Feb 27, 01:46 PM — Resolved Feb 27, 11:38 PM

Trace querying issue in all Tempo clusters

3 updates

resolved | Feb 27, 11:38 PM

This incident has been resolved.

identified | Feb 27, 07:27 PM

Our team has identified the issue, and is in the process of testing a fix.

investigating | Feb 27, 01:46 PM

We're currently working on an issue where portions of data may be temporarily unretrievable, affecting a small percentage of tenants in all Tempo clusters.

minor | resolved | Feb 27, 04:25 PM — Resolved Feb 27, 04:25 PM

Increased Latency for Small Subset of Customers

1 update

resolved | Feb 27, 04:25 PM

A recent rollout caused the AuthZ (RBAC) service to perform many redundant folder-tree fetches for each authorization check. For a small number of tenants in the prod-us-east-0 and prod-eu-west-2 regions with very large folder trees, this added a few milliseconds to every check, which increased request latency. The approximate timeframe of the impact is 2026-02-26 17:24:43 UTC to 2026-02-27 14:33:53 UTC. This has now been resolved.

minor | resolved | Feb 27, 12:57 PM — Resolved Feb 27, 03:24 PM

Incorrect pipeline assignment after custom attributes are assigned

3 updates

resolved | Feb 27, 03:24 PM

This incident has been resolved.

identified | Feb 27, 01:39 PM

The issue has been identified and we are working on a fix.

investigating | Feb 27, 12:57 PM

We are investigating issues with incorrect pipeline assignment after custom attributes are assigned.

minor | resolved | Feb 26, 01:00 PM — Resolved Feb 27, 02:49 AM

Grafana Cloud Faro: slowness listing and uploading sourcemaps in all regions

3 updates

resolved | Feb 27, 02:49 AM

This incident has been resolved.

identified | Feb 26, 02:43 PM

Uploads should work without an issue now. However, listing might still result in occasional timeouts - we're actively addressing this problem.

identified | Feb 26, 01:00 PM

We're experiencing an issue in all Grafana Cloud regions, which manifests as slowness when uploading and listing sourcemaps. The issue most significantly affects users who have large sourcemap files. We've identified the issue and our team is currently working on a fix.

major | resolved | Feb 25, 05:44 PM — Resolved Feb 25, 07:51 PM

Issues Loading Dashboards and Alert Folders in Hosted Grafana

5 updates

resolved | Feb 25, 07:51 PM

This incident has been resolved.

monitoring | Feb 25, 06:46 PM

A fix has been implemented, and we are observing recovery across all impacted regions. We will continue to monitor progress.

identified | Feb 25, 06:31 PM

The issue has been identified, and we are in the process of rolling out a fix.

investigating | Feb 25, 05:49 PM

While we work on narrowing down the scope, we can confirm that deployments in the prod-us-east-0 region are impacted.

investigating | Feb 25, 05:44 PM

Some users may be experiencing issues loading dashboards and alert folders in Hosted Grafana. We will provide more information as it becomes available to us.

major | resolved | Feb 25, 03:05 PM — Resolved Feb 25, 05:20 PM

Partial Write & Rule Evaluation Outage in prod-eu-west-3

3 updates

resolved | Feb 25, 05:20 PM

This incident has been resolved.

monitoring | Feb 25, 03:55 PM

A fix has been implemented and we are monitoring the results.

investigating | Feb 25, 03:05 PM

We are currently investigating an issue which is causing a partial write and rule evaluation outage in the specified region. We will continue to provide updates as they are available.

major | resolved | Feb 25, 12:41 PM — Resolved Feb 25, 03:05 PM

Grafana Cloud Traces: wrong URL endpoint shown for traces ingestion in the prod-eu-west-6 region (AWS Ireland)

3 updates

resolved | Feb 25, 03:05 PM

This incident has been resolved.

monitoring | Feb 25, 12:53 PM

The fix was deployed to all affected existing tenants, and newly created tenants will not face the issue either. We're monitoring the incident, but it should now be resolved.

identified | Feb 25, 12:41 PM

We identified an issue with an incorrect URL endpoint being shown for traces ingestion in the prod-eu-west-6 region (AWS Ireland). Using the displayed URL will result in traces failing to be ingested; AWS private link ingestion, however, should work without issues. The issue affects all tenants in this region, and our team is in the process of deploying a fix to address it.

major | resolved | Feb 24, 02:31 PM — Resolved Feb 24, 05:09 PM

Some Alert Rule Evaluations Failing

3 updates

resolved | Feb 24, 05:09 PM

This incident has been resolved.

monitoring | Feb 24, 04:27 PM

A fix has been implemented, and we are monitoring results.

investigating | Feb 24, 02:31 PM

We are currently investigating an issue impacting a subset of users in the prod-us-east-0 region. Impacted customers will receive a "failed to execute query" error when evaluating alert rules.

minor | resolved | Feb 18, 08:27 AM — Resolved Feb 18, 09:17 PM

Degraded performance of Grafana Cloud k6 test runs

4 updates

resolved | Feb 18, 09:17 PM

This incident has been resolved.

monitoring | Feb 18, 12:57 PM

A fix has been implemented and we are monitoring the results.

investigating | Feb 18, 10:20 AM

We are continuing to investigate this issue.

investigating | Feb 18, 08:27 AM

We are seeing intermittent failures and slow start-up of test runs. We are currently investigating this issue.

minor | resolved | Feb 18, 02:00 PM — Resolved Feb 18, 02:00 PM

Brief Disruption in Azure prod-us-central-7

1 update

resolved | Feb 18, 02:56 PM

We experienced an issue impacting a cell within the Azure prod-us-central-7 region, which occurred between 14:26 and 14:36. Affected users may have noticed increased errors with rule evaluations, as well as some read/write errors. We have resolved this issue, and will continue to monitor.

minor | resolved | Feb 18, 03:43 AM — Resolved Feb 18, 05:31 AM

Grafana Cloud metrics degradation

3 updates

resolved | Feb 18, 05:31 AM

This incident has been resolved.

investigating | Feb 18, 03:47 AM

We are continuing to investigate this issue.

investigating | Feb 18, 03:43 AM

We've been alerted to issues with querying and are investigating.

minor | resolved | Feb 17, 02:53 PM — Resolved Feb 17, 04:27 PM

Maintenance task for Synthetic Monitoring ProbeFailedExecutionsTooHigh alert rule

2 updates

resolved | Feb 17, 04:27 PM

This incident has been resolved.

monitoring | Feb 17, 02:53 PM

Alert instances for the Synthetic Monitoring ProbeFailedExecutionsTooHigh provisioned alert rule that are firing during this maintenance might resolve and fire again in the next evaluation. Only the API is affected. The estimated time window is 15:00–16:00 UTC. Impacted clusters are: prod-eu-west-5, prod-us-east-4, prod-eu-west-6, prod-sa-east-0, prod-ap-south-0, prod-ap-southeast-0, prod-me-central-0, prod-au-southeast-0, and prod-ap-southeast-2.

none | resolved | Feb 17, 12:47 PM — Resolved Feb 17, 12:47 PM

Degradation of service on Synthetic Monitoring Public Probe AWS Canada (Calgary)

1 update

resolved | Feb 17, 12:50 PM

There was a service degradation today from ~12:09 UTC until ~12:35 UTC on the Calgary Public Probe for Synthetic Monitoring. Impact may include SM check failures where the probe was used.

major | resolved | Feb 13, 06:40 PM — Resolved Feb 13, 07:08 PM

Self-Serve Users Unable to Sign Up

2 updates

resolved | Feb 13, 07:08 PM

This incident has been resolved.

investigating | Feb 13, 06:40 PM

We are currently investigating an issue preventing users from signing up for self-serve Grafana. We will continue to share more information as our investigation progresses.

critical | resolved | Feb 12, 11:16 PM — Resolved Feb 13, 05:03 PM

Loki Delete Endpoint Bug

4 updates

resolved | Feb 13, 05:03 PM

This incident has been resolved.

identified | Feb 13, 07:56 AM

We are continuing to work on a fix for this issue.

identified | Feb 13, 07:56 AM

A fix is being made to mitigate the issue. We will provide further updates accordingly.

identified | Feb 12, 11:16 PM

As of 22:45 UTC, we have identified a serious bug affecting the delete endpoint for all Loki regions. As a precaution, the endpoint has been temporarily disabled. Engineering is actively engaged and assessing the issue. We will provide updates accordingly.

critical | resolved | Feb 13, 06:59 AM — Resolved Feb 13, 07:29 AM

Loki writes outage in prod-ca-east-0

3 updates

resolved | Feb 13, 07:29 AM

We have observed a continued period of recovery. At this time, we are considering this issue resolved.

monitoring | Feb 13, 07:09 AM

We have scaled up to handle the increased traffic and are seeing marked improvement. We will continue to monitor and provide updates.

investigating | Feb 13, 06:59 AM

We have been alerted to an ongoing Loki writes outage in the prod-ca-east-0 region. Our Engineering team is actively investigating this.

maintenance | resolved | Feb 12, 04:09 PM — Resolved Feb 12, 07:07 PM

Essential Maintenance for Faro Services

2 updates

resolved | Feb 12, 07:07 PM

This incident has been resolved.

monitoring | Feb 12, 04:09 PM

We are undergoing essential maintenance for Faro services. Users may experience a short service outage of <1 minute during this time. We expect this to be finished within an hour.

minor | resolved | Feb 12, 12:33 PM — Resolved Feb 12, 02:30 PM

Grafana Cloud Metrics elevated write and rule evaluation latency in the prod-eu-west-2 region

4 updates

resolved | Feb 12, 02:30 PM

We no longer observe any problems with our services; this incident has been resolved.

monitoring | Feb 12, 12:50 PM

The fix has been implemented and services are back to normal. We're currently monitoring the health of the services before resolving this incident.

identified | Feb 12, 12:40 PM

The issue has been identified and our team is currently working on a fix.

investigating | Feb 12, 12:33 PM

Since 12:17 UTC, we've been observing increased latency for data ingestion and rule evaluation in Grafana Cloud Metrics in the prod-eu-west-2 region. We're currently investigating the issue.

major | resolved | Feb 11, 02:21 PM — Resolved Feb 11, 09:47 PM

Unable to Install Slack Integration

4 updates

resolved | Feb 11, 09:47 PM

This incident has been resolved.

monitoring | Feb 11, 06:20 PM

We are in the process of rolling out the fix.

identified | Feb 11, 04:22 PM

We have identified the issue, and are working on a fix.

investigating | Feb 11, 02:21 PM

We are aware of an issue that is preventing the installation of the Slack integration. We are currently investigating this, and will provide updates as they become available.

minor | resolved | Feb 11, 06:51 AM — Resolved Feb 11, 07:25 AM

Loki error response rate spike on prod-ap-southeast-1

3 updates

resolved | Feb 11, 07:25 AM

This incident has been resolved.

monitoring | Feb 11, 06:54 AM

We have deployed temporary measures to mitigate the issue, but there was a write outage from 06:26 to 06:37 UTC.

investigating | Feb 11, 06:51 AM

Cloud logging is facing write issues in this region; our team is looking into this.

major | resolved | Feb 10, 12:39 AM — Resolved Feb 10, 01:45 AM

Write failures in prod-us-central-0

2 updates

resolved | Feb 10, 01:45 AM

We have observed a continued period of recovery. At this time, we are considering this issue resolved.

investigating | Feb 10, 12:39 AM

As of 00:10, we are experiencing write failures in a single cell, affecting customers in prod-us-central-0. Impacted customers may see failed or dropped writes. Engineering is actively engaged and assessing the issue. We will provide updates accordingly.

minor | resolved | Feb 9, 03:35 PM — Resolved Feb 9, 07:07 PM

Athena Queries Broken

4 updates

resolved | Feb 9, 07:07 PM

This incident has been resolved.

monitoring | Feb 9, 05:01 PM

We are seeing recovery in impacted environments. We will continue to monitor the progress.

investigating | Feb 9, 04:23 PM

Our engineering team is still investigating this issue.

investigating | Feb 9, 03:35 PM

We are currently investigating an issue resulting in broken queries for the Athena data source.

minor | resolved | Feb 9, 10:32 AM — Resolved Feb 9, 11:21 AM

Grafana Cloud Logs – Write Ingestion Degradation

3 updates

resolved | Feb 9, 11:21 AM

This incident has been resolved.

monitoring | Feb 9, 10:36 AM

We are continuing to monitor for any further issues.

monitoring | Feb 9, 10:32 AM

Between 09:47 and 10:14 UTC, Grafana Cloud Logs within a single cell residing in the prod-ap-southeast-1 region experienced an issue affecting write ingestion only. During this time, some log writes may have failed or been delayed. Log reads were not impacted and remained fully available throughout the incident. Our engineering team quickly identified the cause of the issue and is monitoring the service. The service has been operating normally since 10:14 UTC.

minor | resolved | Feb 6, 11:29 AM — Resolved Feb 6, 05:57 PM

Multiple free tier customers are getting "no fields to display" when viewing logs instead of labels and structured metadata

2 updates

resolved | Feb 6, 05:57 PM

This incident has been resolved.

investigating | Feb 6, 11:29 AM

We are currently investigating this issue.

none | resolved | Feb 5, 06:30 PM — Resolved Feb 5, 06:30 PM

Grafana Cloud Metrics – Write Ingestion Degradation

1 update

resolved | Feb 5, 09:10 PM

Between 18:32 and 18:46 UTC, Grafana Cloud Metrics within a single cell residing in the prod-us-west-0 region experienced an issue affecting write ingestion only. During this time, some metric writes may have failed or been delayed. Metric reads were not impacted and remained fully available throughout the incident. Our engineering team quickly identified the cause of the issue and implemented mitigation steps to restore normal write ingestion. The service has been operating normally since 18:46 UTC.

none | resolved | Feb 5, 06:00 PM — Resolved Feb 5, 06:00 PM

Tempo write path degradation in prod-us-west-0

1 update

resolved | Feb 10, 11:44 AM

From 17:43 UTC to 18:05 UTC, a subset of customers experienced elevated latency and a peak error rate of approximately 22% for trace ingestion.

major | resolved | Feb 5, 02:14 PM — Resolved Feb 5, 05:41 PM

Hosted Metrics partial outage of read path in us-central-0 region

3 updates

resolved | Feb 5, 05:41 PM

This incident has been resolved.

monitoring | Feb 5, 02:40 PM

Services have recovered and there is no active issue anymore. We're still monitoring overall health.

investigating | Feb 5, 02:14 PM

We're experiencing an issue in the us-central-0 region for the Hosted Metrics offering. The issue manifests as failing rule evaluations and the possibility of queries returning stale data. We're actively investigating the cause of the issue.

minor | resolved | Feb 4, 05:20 PM — Resolved Feb 5, 03:31 PM

Inconsistent threshold check results reported intermittently

4 updates

resolved | Feb 5, 03:31 PM

This incident has been resolved.

monitoring | Feb 5, 09:36 AM

The issue causing the incident has been identified, and the fix has been deployed. All new test runs work consistently.

identified | Feb 4, 07:57 PM

We are continuing to work on a fix for this issue.

identified | Feb 4, 05:20 PM

We encountered a subtle bug which caused our test-run finalization process to read stale threshold statuses because of a synchronization issue. We have since resolved the bug, and new test runs will work properly. Impacted test runs will need to be fixed via further correction on our end. We will continue to provide updates on the progress of the fix for impacted test runs.