
Fly.io Outage History

Past incidents and downtime events

Complete history of Fly.io outages, incidents, and service disruptions. Showing the 50 most recent incidents.

March 2026 (15 incidents)

[minor] [monitoring] Mar 20, 07:26 AM

Machines failing to start in DFW

3 updates
[monitoring] Mar 20, 12:45 PM

In addition to freeing up existing capacity, the team has provisioned new capacity in DFW and we are monitoring the results.

[monitoring] Mar 20, 08:08 AM

We freed up some capacity on our workers to allow for successful Machine starts.

[investigating] Mar 20, 07:26 AM

The Machines start failure rate is elevated in DFW.

[critical] [resolved] Mar 19, 06:28 AM — Resolved Mar 19, 10:37 AM

Metrics currently experiencing issues

3 updates
[resolved] Mar 19, 10:37 AM

This incident has been resolved. We're unable to recover the lost metrics from that one hour.

[monitoring] Mar 19, 07:12 AM

We have implemented a fix. There has been approximately 1h of lost metrics from 06:07 UTC. We're monitoring the cluster for further issues.

[investigating] Mar 19, 06:28 AM

We are currently investigating an issue with our metrics cluster.

[major] [resolved] Mar 18, 09:58 AM — Resolved Mar 18, 06:53 PM

Machines failing to start in DFW

4 updates
[resolved] Mar 18, 06:53 PM

This incident has been resolved. Machine creates in DFW continue to work normally.

[monitoring] Mar 18, 12:40 PM

A fix has been implemented and we are monitoring the results.

[identified] Mar 18, 11:44 AM

The team is currently rolling out additional capacity in DFW which should help ease Machine start failures across the region.

[investigating] Mar 18, 09:58 AM

We are investigating reports of machines failing to start in the DFW (Dallas) region with "insufficient memory" errors. This may cause deployment failures for applications running in DFW. Our team is actively working to restore full capacity in the region. If you are affected, deploying to an alternate region may serve as a temporary workaround. We will provide updates as the situation progresses.

[major] [resolved] Mar 18, 04:12 PM — Resolved Mar 18, 05:02 PM

IPv6 networking issues in SJC region

3 updates
[resolved] Mar 18, 05:02 PM

This incident has been resolved.

[monitoring] Mar 18, 04:31 PM

A fix has been implemented and we are monitoring the results.

[investigating] Mar 18, 04:12 PM

We are investigating intermittent network issues in the SJC region impacting outbound public IPv6 access from Machines. Connecting to IPv6 internet resources from apps hosted in the SJC region may be slow or fail at this time. IPv4 access, as well as 6PN private networking, is unaffected.

[minor] [resolved] Mar 18, 02:07 PM — Resolved Mar 18, 02:18 PM

Connection Issues in SJC

2 updates
[resolved] Mar 18, 02:18 PM

This incident has been resolved.

[monitoring] Mar 18, 02:07 PM

Between 13:55 and 14:03 UTC, machines and MPG clusters hosted in the SJC region saw elevated connection errors. Users may have seen errors connecting to or from most machines in the region, as well as with deployments or updates to machines in the region. Networking has returned to normal in the region, and we are continuing to monitor closely to ensure stable recovery.

[minor] [resolved] Mar 18, 02:12 PM — Resolved Mar 18, 02:18 PM

Fly ssh console command failing

3 updates
[resolved] Mar 18, 02:18 PM

This incident has been resolved.

[monitoring] Mar 18, 02:17 PM

A fix has been implemented and we are seeing `ssh console` commands succeed as normal.

[identified] Mar 18, 02:12 PM

We have identified an issue causing new `fly ssh console` connections to fail with 500 errors. A fix is in progress.

[none] [resolved] Mar 14, 04:20 AM — Resolved Mar 14, 02:05 PM

Sprites Operations: 401 errors for certain organizations

2 updates
[resolved] Mar 14, 02:05 PM

This incident has been resolved.

[monitoring] Mar 14, 01:55 PM

Organizations with names prefixed with numerical digits may experience 401 errors. Affected operations include actions such as Sprite creation, listing, etc. A fix has been in place since 2026-03-14 12:30 UTC, and we are monitoring the results.

[major] [resolved] Mar 11, 09:19 AM — Resolved Mar 11, 11:37 AM

Setting secrets and creating apps is degraded

4 updates
[resolved] Mar 11, 11:37 AM

This incident has been resolved.

[monitoring] Mar 11, 11:03 AM

While the secret storage service was in a read-only state, app creation requests queued up due to retry logic and insufficient request-concurrency limits in our GraphQL API. This prevented our GraphQL API from serving any other requests. We have scaled up the GraphQL API and are continuing to monitor the situation.

[monitoring] Mar 11, 10:14 AM

A fix has been implemented and we are monitoring the results.

[identified] Mar 11, 09:19 AM

An ongoing data migration in our secret storage service is causing degraded Machines API functionality.

[major] [resolved] Mar 7, 02:42 PM — Resolved Mar 7, 03:56 PM

Private networking issues in SYD region

3 updates
[resolved] Mar 7, 03:56 PM

This incident has been resolved.

[monitoring] Mar 7, 03:10 PM

A fix has been implemented and we are monitoring the results.

[investigating] Mar 7, 02:42 PM

We are investigating a private networking failure between SYD and other regions. Apps continue to run, and private networking within SYD is unaffected.

[none] [resolved] Mar 5, 07:24 PM — Resolved Mar 5, 07:50 PM

Routing issues in NA regions

3 updates
[resolved] Mar 5, 07:50 PM

This incident has been resolved. Due to a BGP issue, we saw some North American traffic routed to edges in Singapore (sin). Users in North America would have seen additional request latency during this period.

[monitoring] Mar 5, 07:38 PM

A fix has been implemented and we are monitoring the results.

[investigating] Mar 5, 07:24 PM

We're aware of routing issues affecting some customers in North America regions, and we're actively investigating.

[major] [resolved] Mar 3, 08:18 PM — Resolved Mar 3, 09:15 PM

Elevated GraphQL API errors

3 updates
[resolved] Mar 3, 09:15 PM

This incident was caused by a failed Redis node that powers our GraphQL API. We were able to recreate the Redis node and restore service. We are still investigating the root cause of the failure. In the meantime, all API endpoints appear to be stable and errors have dropped to baseline levels.

[monitoring] Mar 3, 08:36 PM

A fix has been implemented and we are monitoring the results.

[investigating] Mar 3, 08:18 PM

We're investigating elevated GraphQL errors that affect some API endpoints.

[minor] [resolved] Mar 3, 10:50 AM — Resolved Mar 3, 12:10 PM

Cost Explorer fails to load

2 updates
[resolved] Mar 3, 12:10 PM

This incident has been resolved.

[investigating] Mar 3, 10:50 AM

We are currently investigating this issue. The page currently displays: "We’re having trouble loading the cost breakdown."

[none] [resolved] Mar 3, 12:54 AM — Resolved Mar 3, 12:54 AM

Certificates issues affecting API and proxy

1 update
[resolved] Mar 3, 02:05 AM

Between 19:54 and 20:06 UTC, our Vault cluster serving app certificates was unavailable. This caused various API requests to fail, mainly operations on certificates but also app creates and IP assignments. As the failure mode was Vault requests hanging rather than failing immediately, TLS requests through fly-proxy for domains where the certificate was not cached on the local node remained open for a long time while the proxy attempted to fetch the certificate; this caused some connections to fail as too many connection slots were taken up by requests waiting on Vault. The root cause of this incident was a partially completed update to the Vault cluster. We will be implementing safeguards in the proxy for this failure mode, as well as improving certificate storage longer-term.

[major] [resolved] Mar 2, 05:42 PM — Resolved Mar 2, 10:49 PM

Machines failing to boot in EWR

4 updates
[resolved] Mar 2, 10:49 PM

This incident has been resolved.

[monitoring] Mar 2, 08:35 PM

A fix has been implemented and we are monitoring the results.

[identified] Mar 2, 06:21 PM

The issue has been identified and a fix is being implemented.

[investigating] Mar 2, 05:42 PM

We are currently investigating this issue.

[minor] [resolved] Mar 2, 09:19 PM — Resolved Mar 2, 09:50 PM

Issues with the Machines API

4 updates
[resolved] Mar 2, 09:50 PM

This incident has been resolved.

[monitoring] Mar 2, 09:47 PM

A fix has been implemented and we are monitoring the results.

[identified] Mar 2, 09:39 PM

The issue has been identified and a fix is being implemented.

[investigating] Mar 2, 09:19 PM

We're currently investigating issues with the Machines API. Customer deployments and the Fly dashboard may be affected.

February 2026 (28 incidents)

[major] [resolved] Feb 27, 06:50 PM — Resolved Feb 27, 08:21 PM

Slow API requests

9 updates
[resolved] Feb 27, 08:21 PM

This incident has been resolved. All platform and API operations are working normally.

[monitoring] Feb 27, 08:05 PM

API and platform operations have normalized. We are continuing to monitor to ensure full and stable recovery. Background jobs are almost fully caught up. Users may still see slightly slower requests creating new apps / orgs, but they should complete successfully. Sprite and MPG cluster creations are processing as normal.

[identified] Feb 27, 07:41 PM

A second fix has been deployed and database load has returned to normal, resulting in API response times beginning to normalize. Most Machines API requests should succeed as normal, and deploys to existing apps should also work. We are working through a backlog of background jobs. New app / organization creations and other operations that use these will continue to see increased latency or failures while we work through the backlog. New MPG cluster and new Sprite creation continues to be impacted.

[identified] Feb 27, 07:23 PM

An initial fix has been deployed and we are seeing improvements in load and API performance. Some operations that rely on the GraphQL API, such as new app creations and some deployments, will continue to fail at this time. We are continuing to work on restoring full availability.

[identified] Feb 27, 07:05 PM

We are currently seeing full API failures for requests to our GraphQL API and elevated failures for the Machines API. Direct calls to these APIs may fail, along with many flyctl commands. We have identified the cause of the issue and are continuing to work on a fix. Existing running machines and apps should continue to be reachable, but creates, deploys, or other features relying on platform API calls will fail at this time.

[identified] Feb 27, 06:59 PM

New Sprite creations are also timing out or failing at this time. We are continuing to work on a fix for this issue.

[identified] Feb 27, 06:53 PM

We are continuing to work on a fix for this issue.

[identified] Feb 27, 06:52 PM

We have identified the cause of the increased latency and are working on a fix. The most common errors we are seeing are timeouts when users attempt to perform an action against a newly created app or machine resource. Those actions may time out or fail with an `app|machine not found` error.

[investigating] Feb 27, 06:50 PM

We are investigating increased API request latency and timeouts with the main platform API. This is impacting multiple operations, including creating, querying, or performing actions against machines, as well as platform-level operations like adding payment methods.

[minor] [resolved] Feb 27, 03:34 PM — Resolved Feb 27, 05:54 PM

Capacity issues in iad and dfw

3 updates
[resolved] Feb 27, 05:54 PM

This incident has been resolved.

[monitoring] Feb 27, 05:31 PM

We have provisioned additional capacity in dfw and iad and are monitoring to ensure machine and builder starts are succeeding consistently.

[identified] Feb 27, 03:34 PM

These regions (Dallas, TX dfw and Ashburn, VA iad) are currently low on capacity. New machine creates in these regions might fail temporarily, and Depot builders may be unavailable, causing deploys to hang in "Waiting for Depot builder". If you are having issues with Depot builders, consider moving them to a different non-iad, non-dfw region in your fly.io dashboard's "Settings" page under "App builders", or try `--depot=false`.

[none] [resolved] Feb 26, 05:00 PM — Resolved Feb 26, 10:28 PM

Capacity issues in iad and dfw

6 updates
[resolved] Feb 26, 10:28 PM

This incident has been resolved.

[monitoring] Feb 26, 08:19 PM

We're continuing to monitor after having added more capacity to our DFW and IAD regions. Deploys or machine starts using existing volumes in these regions may still hit a capacity issue. Users should use `fly volume fork --vm-memory ` to fork the volume to a host with more capacity, then retry the deploy or start command using the new volume.

[identified] Feb 26, 06:57 PM

We have added additional capacity in DFW and IAD regions and are monitoring the impact. New machine creates and deploys without volumes are seeing improved success rates. Deploys using depot builders in those regions are also improving, with much quicker builder start times. Deploys or machine starts using existing volumes in these regions may still hit a capacity issue. Users should use `fly volume fork --vm-memory ` to fork the volume to a host with more capacity, then retry the deploy or start command using the new volume.

[identified] Feb 26, 05:18 PM

We've identified some newly created Managed Postgres clusters are failing to come up healthy in these regions.

[identified] Feb 26, 05:05 PM

New machine creates in these regions might fail temporarily, and Depot builders may be unavailable. If you are having issues with Depot builders, consider moving them to a different region, or try `--depot=false`.

[identified] Feb 26, 05:00 PM

We have identified the problem and are working on a fix.

[none] [resolved] Feb 24, 05:23 PM — Resolved Feb 24, 05:51 PM

Sprites API degradation

3 updates
[resolved] Feb 24, 05:51 PM

This incident has been resolved.

[identified] Feb 24, 05:24 PM

A slow deploy is causing Sprites API degradation. We are implementing a fix.

[identified] Feb 24, 05:23 PM

A slow deploy is causing Sprites API degradation. We are implementing a fix.

[minor] [resolved] Feb 24, 04:33 AM — Resolved Feb 24, 11:06 AM

Metrics are degraded

5 updates
[resolved] Feb 24, 11:06 AM

Metrics processing has caught up, and we don't see any data loss.

[monitoring] Feb 24, 09:35 AM

Delayed metrics are still being processed.

[monitoring] Feb 24, 06:46 AM

Metrics are coming back online, but it will take a little time to process what's backed up in the queues.

[identified] Feb 24, 05:49 AM

We're continuing to work with VictoriaMetrics support on a fix for this issue.

[identified] Feb 24, 04:33 AM

In some cases data is missing or lagging. We've identified the problem and are working on a fix.

[minor] [resolved] Feb 24, 09:39 AM — Resolved Feb 24, 10:44 AM

Sprite creations failing

3 updates
[resolved] Feb 24, 10:44 AM

This incident has been resolved.

[monitoring] Feb 24, 10:25 AM

A fix has been implemented and we are monitoring the results.

[investigating] Feb 24, 09:39 AM

We are currently investigating issues creating new Sprites.

[none] [resolved] Feb 23, 03:00 PM — Resolved Feb 23, 08:30 PM

Degraded Managed Postgres Control Plane

2 updates
[resolved] Feb 24, 12:31 AM

This incident has been resolved as of 20:30 UTC.

[investigating] Feb 23, 03:00 PM

We are currently investigating issues with the MPG control plane. Users may experience delays or hanging when creating or deleting databases via the dashboard or CLI.

[minor] [resolved] Feb 20, 04:14 PM — Resolved Feb 20, 08:49 PM

Deploys hanging at waiting for Depot Builder

5 updates
[resolved] Feb 20, 08:49 PM

This incident has been resolved.

[monitoring] Feb 20, 07:38 PM

The fix has been rolled out and we are seeing deploys using Depot builders succeeding normally. We continue to monitor to ensure full recovery. Depot builders have been re-enabled as the default option for new deploys.

[identified] Feb 20, 05:59 PM

A fix is being rolled out. Fly builders remain the default while it is deployed.

[identified] Feb 20, 04:39 PM

We are again seeing elevated latency provisioning Depot builders on new deploys. Users may see deploys using Depot builders hang or time out at the "Waiting for Depot Builder" step. We are working on a fix. In the meantime, we are switching all deploys to use the default Fly builders. If desired, users can manually switch back to Depot builders using `fly deploy --depot=true`, but may continue to see latency issues at this time.

[monitoring] Feb 20, 04:14 PM

We have seen elevated latency provisioning Depot builders during deployments over the past hour. This caused some deploys to hang or time out at the "Waiting for Depot Builder" step during this period. Latency has improved and builder provisioning times are back to normal. We're continuing to monitor to ensure latency remains normal.

[minor] [resolved] Feb 20, 10:52 AM — Resolved Feb 20, 11:57 AM

Networking issues for users connecting through lhr

3 updates
[resolved] Feb 20, 11:57 AM

Network traffic in LHR has been stable for some time now, we are not seeing any further issues.

[monitoring] Feb 20, 11:21 AM

A fix has been implemented and we are monitoring the results.

[investigating] Feb 20, 10:52 AM

We’re currently investigating this issue.

[minor] [resolved] Feb 19, 09:14 PM — Resolved Feb 20, 12:05 AM

Investigating registry issues affecting deploys

5 updates
[resolved] Feb 20, 12:05 AM

This incident has been resolved.

[identified] Feb 19, 10:24 PM

While we have seen some improvement from the previous fix, we are still seeing elevated rates of registry connection issues. Users may continue to see slower machine creates and deploys due to slow image pulls. Deploys may succeed on a retry. We are continuing to work on restoring normal registry performance.

[monitoring] Feb 19, 09:49 PM

A fix has been implemented and we are monitoring the results.

[identified] Feb 19, 09:43 PM

The issue has been identified and a fix is being implemented.

[investigating] Feb 19, 09:14 PM

We are currently investigating this issue.

[major] [resolved] Feb 18, 04:22 PM — Resolved Feb 18, 04:44 PM

Control plane state delayed on some hosts, possibly causing network or deployment disruption

4 updates
[resolved] Feb 18, 04:44 PM

This incident has been resolved.

[monitoring] Feb 18, 04:28 PM

A fix has been implemented and we are monitoring the results.

[identified] Feb 18, 04:23 PM

We are continuing to work on a fix for this issue.

[identified] Feb 18, 04:22 PM

The issue has been identified and a fix is being implemented.

[major] [resolved] Feb 17, 01:06 PM — Resolved Feb 17, 02:24 PM

flyctl deploy timeouts

3 updates
[resolved] Feb 17, 02:24 PM

Earlier today, an issue caused elevated rate limiting and some deployment timeouts. A fix is in place and deployments are back to normal.

[monitoring] Feb 17, 01:42 PM

A fix has been implemented and we are monitoring the results.

[identified] Feb 17, 01:06 PM

We’re investigating elevated 429 errors from flaps causing deployment timeouts. Affected deploys are failing with: ✖ Failed: error waiting for release_command machine XX to finish running: timeout reached waiting for machine's state to change Your machine never reached the state "destroyed".

[major] [resolved] Feb 14, 11:33 AM — Resolved Feb 14, 02:27 PM

Degraded Managed Postgres Control Plane in ORD

5 updates
[resolved] Feb 14, 02:27 PM

This incident has been resolved.

[monitoring] Feb 14, 02:07 PM

A fix has been implemented and we are seeing full recovery of the control plane in ORD. With that recovery we are seeing impacted replicas catching up and clusters returning to normal health. We're continuing to monitor for full recovery.

[identified] Feb 14, 01:47 PM

We are continuing to work on a fix for this issue.

[identified] Feb 14, 11:47 AM

The issue has been identified and we are working on a fix. The majority of MPG clusters in ORD continue to run normally, though some users may still see degraded replicas at this time. Some clusters in the region will have experienced a primary -> replica failover.

[investigating] Feb 14, 11:33 AM

We are currently investigating issues with the MPG control plane in ORD. A small number of clusters in the region may be seeing replication lag or PgBouncer connectivity issues at this time.

[minor] [resolved] Feb 11, 08:44 PM — Resolved Feb 11, 09:30 PM

Issues with deploying apps using Depot builders for new accounts

4 updates
[resolved] Feb 11, 09:30 PM

This incident has been resolved.

[monitoring] Feb 11, 09:24 PM

A fix has been implemented and we are monitoring the results.

[identified] Feb 11, 08:57 PM

The issue has been identified and a fix is being implemented.

[investigating] Feb 11, 08:44 PM

Some new Fly.io users may encounter an "upgrade your organization" error message when attempting to deploy apps for the first time. We're currently working with Depot to figure out what's causing the issue. In the meantime, you should be able to work around the issue by using Fly builders with `fly deploy --depot=false`.

[minor] [resolved] Feb 11, 06:07 AM — Resolved Feb 11, 07:22 AM

Creating new sprites is degraded

6 updates
[resolved] Feb 11, 07:22 AM

This incident has been resolved.

[monitoring] Feb 11, 06:57 AM

Sprite creation appears to be back to normal operation now.

[identified] Feb 11, 06:52 AM

We've identified the cause of the delay following creates and we're deploying a fix.

[investigating] Feb 11, 06:09 AM

We are continuing to investigate this issue.

[investigating] Feb 11, 06:08 AM

We are continuing to investigate this issue.

[investigating] Feb 11, 06:07 AM

Sprite creation generates an error that the sprite "is not assigned to compute." Eventually the sprite transitions from an unknown state to warm, so there is a delay before the sprite is usable.

[minor] [resolved] Feb 10, 07:00 PM — Resolved Feb 10, 08:44 PM

Degraded MPG clusters in IAD

5 updates
[resolved] Feb 10, 08:44 PM

This incident has been resolved.

[monitoring] Feb 10, 08:00 PM

We've rolled out a fix for the remaining impacted clusters, and we're now monitoring the results.

[identified] Feb 10, 07:53 PM

We've rolled out a fix for some additional impacted clusters, and we're continuing to work on the remaining clusters.

[identified] Feb 10, 07:15 PM

We've identified the issue - some MPG clusters in IAD should be seeing improvements, and we're working on rolling out a fix for the remaining impacted clusters.

[investigating] Feb 10, 07:00 PM

We're currently looking into an issue with MPG clusters in the IAD region.

[minor] [resolved] Feb 9, 08:29 PM — Resolved Feb 9, 09:38 PM

Issue creating new Sprites in IAD

4 updates
[resolved] Feb 9, 09:38 PM

This incident has been resolved.

[monitoring] Feb 9, 09:19 PM

A fix has been implemented and we are monitoring the results.

[identified] Feb 9, 08:45 PM

The issue has been identified and a fix is being implemented.

[investigating] Feb 9, 08:29 PM

We're currently looking into an issue that's preventing new Sprites from being created in IAD. Sprite creation in other regions is unaffected.

[major] [resolved] Feb 9, 07:17 AM — Resolved Feb 9, 10:55 AM

Degraded network in AMS

6 updates
[resolved] Feb 9, 10:55 AM

This incident has been resolved.

[monitoring] Feb 9, 09:47 AM

A fix has been implemented and we are monitoring the results.

[identified] Feb 9, 08:58 AM

We are still working on restoring the MPG clusters. Most of them should be operational already.

[identified] Feb 9, 07:42 AM

Affected hosts are starting to come back online. We are working on restoring affected MPG clusters.

[identified] Feb 9, 07:34 AM

One of our upstream providers is experiencing a major power issue in their AMS datacenter. Managed Postgres instances in AMS are experiencing an outage because the incident has taken down our control plane for Managed Postgres.

[investigating] Feb 9, 07:17 AM

One of our upstream providers is performing an emergency DC maintenance. You may see degraded connectivity on some of your apps in AMS. Most apps in AMS are not affected.

[major] [resolved] Feb 7, 04:23 PM — Resolved Feb 7, 06:13 PM

Machines API issues

4 updates
[resolved] Feb 7, 06:13 PM

This incident has been resolved.

[monitoring] Feb 7, 05:17 PM

A fix has been implemented and we are seeing Machines API connectivity improve in APAC regions. We continue monitoring for full recovery.

[identified] Feb 7, 04:40 PM

The issue has been identified and we are seeing Machines API performance improve in most regions since ~16:20 UTC. Machines API calls in the SYD, NRT, and SIN regions may continue to see 5xx errors or higher latency at this time. We are continuing to work on restoring full API performance in all regions.

[investigating] Feb 7, 04:23 PM

We are investigating widespread Machines API issues since 16:00 UTC. You may experience 5xx errors or higher latency at this time.

[major] [resolved] Feb 7, 03:19 PM — Resolved Feb 7, 06:12 PM

Private Networking and Certificate Resolution Issues in SYD

5 updates
[resolved] Feb 7, 06:12 PM

This incident has been resolved.

[monitoring] Feb 7, 04:24 PM

A fix has been implemented and we are seeing private networking / certificates in SYD improving. We are continuing to monitor for full recovery.

[investigating] Feb 7, 03:46 PM

Private Networking (6PN) is degraded in SYD region. Communication between Machines in SYD region and Machines in other regions may fail at this time. Newly created Machines in SYD may fail to sync to other regions (may not show up in Machines API List endpoint, or state may be incorrect). We are working with our upstream providers to resolve this issue.

[investigating] Feb 7, 03:21 PM

We are continuing to investigate this issue.

[investigating] Feb 7, 03:19 PM

We are investigating communication issues between some SYD (Sydney, Australia) region hosts and our Certificate vault. Requests hitting SYD region edges may see issues resolving certificates at this time, especially for newly issued or not recently used certificates.

[minor] [resolved] Feb 5, 09:52 PM — Resolved Feb 6, 07:07 AM

Network issues on newly-created machines

3 updates
[resolved] Feb 6, 07:07 AM

This issue is now resolved.

[monitoring] Feb 6, 02:49 AM

We have successfully run a fix to re-sync our global and regional state stores in order to bring machines back to a healthy state, and we're monitoring the situation to confirm that there are no more issues.

[identified] Feb 5, 09:52 PM

Machines created after the delayed machine registration incident (https://status.flyio.net/incidents/3npj6935byt4) may have incomplete networking configurations and could be unable to receive traffic. We've identified the issue and are deploying code fixes as well as updating created machines. This affects a small number of machines on all our regions.

[major] [resolved] Feb 5, 05:03 PM — Resolved Feb 6, 04:58 AM

MPG Degraded clusters in AMS, IAD and SIN regions

8 updates
[resolved] Feb 6, 04:58 AM

All MPG clusters are back to full, normal operations.

[monitoring] Feb 6, 04:22 AM

All MPG clusters are reachable.

[identified] Feb 6, 03:36 AM

We are still continuing cleanup on some clusters.

[identified] Feb 5, 11:55 PM

All cluster primary and pgBouncer machines are now healthy and operating normally. We are still continuing cleanup on some clusters with lagging or degraded replicas, but this should not impact writes or reads to clusters.

[identified] Feb 5, 08:58 PM

We are continuing to work on restoring all clusters to full health.

[identified] Feb 5, 08:16 PM

With the underlying incident stabilizing (https://status.flyio.net/incidents/3npj6935byt4) we are seeing improvements amongst impacted clusters. We continue to work on restoring all clusters to full health.

[identified] Feb 5, 07:01 PM

A number of clusters in IAD, AMS, and SIN regions continue to see degraded replicas and PGBouncers at this time. A smaller number of clusters in these regions are also seeing disruption to their primaries. We continue to work on restoring full cluster health in all regions.

[identified] Feb 5, 05:03 PM

A small number of MPG clusters in the AMS and IAD regions are currently in degraded states due to downstream impact from this Machines API issue: https://status.flyio.net/incidents/3npj6935byt4. Most of the impacted clusters may see a degraded replica or PgBouncer on their status page. A very small number may be unable to connect to their MPG primary node; the team is working to restore connectivity as the top priority. Users may also see delays registering new clusters in these regions at this time.

[major] [resolved] Feb 5, 04:43 PM — Resolved Feb 5, 08:53 PM

Delayed Machine Registration + Token Errors

7 updates
[resolved] Feb 5, 08:53 PM

This incident has been resolved.

[monitoring] Feb 5, 08:11 PM

A fix has been deployed across all impacted hosts. We are seeing a sharp reduction in token errors since 20:00 UTC, and other metrics are recovering as well. We are continuing to monitor closely.

[identified] Feb 5, 07:03 PM

We saw some improvement from the previous fix, however errors remained elevated on some hosts. We have identified the root cause of the remaining errors as a communication issue between the hosts and our Token database. We are preparing a fix that should resolve these.

[identified] Feb 5, 06:19 PM

We have rolled out an initial fix for the token issues and are monitoring for improvements.

[investigating] Feb 5, 05:42 PM

While Machine registration error rates have improved, we are now seeing elevated error rates verifying user tokens during some actions. Users may see errors like "failed to launch VM: permission_denied: bolt token: failed to verify service token: no verified tokens" when deploying or creating machines. We are investigating.

[identified] Feb 5, 04:59 PM

A fix has been rolled out and most hosts are registering machines as normal. A few hosts remain with elevated error rates; we are continuing to fix these. Users who experience an error creating or deploying a new machine should retry the operation.

[identified] Feb 5, 04:43 PM

We have identified elevated error rates registering new machines with our global state tracking service on some hosts. We have identified the issue and are deploying a fix. Users may have seen elevated machine create, start, or deployment failures over the past ~20 minutes.

[minor] [resolved] Feb 5, 05:46 AM — Resolved Feb 5, 09:22 AM

Network maintenance in YYZ

3 updates
[resolved] Feb 5, 09:22 AM

Network maintenance has concluded.

[monitoring] Feb 5, 09:01 AM

Managed Postgres clusters in YYZ should be operating normally.

[identified] Feb 5, 05:46 AM

An upstream network provider is performing an emergency network maintenance in the YYZ region. Machines in YYZ may see some packet loss. Managed Postgres clusters in YYZ are experiencing management plane issues. Clusters may see delayed fail-overs and changes in cluster size may not be possible during the maintenance period.

[major] [resolved] Feb 3, 03:33 PM — Resolved Feb 3, 03:53 PM

IPv6 Issues in YYZ

3 updates
[resolved] Feb 3, 03:53 PM

This incident has been resolved.

[monitoring] Feb 3, 03:44 PM

A fix has been implemented and we're seeing IPv6 networking return to normal in YYZ. We'll continue to monitor to ensure full recovery.

[investigating] Feb 3, 03:33 PM

We are currently investigating degraded IPv6 networking in the YYZ (Toronto) region. Users with machines in this region may see issues connecting to their machines over IPv6. Users with static egress IPs may see issues connecting outbound over IPv6 from this region at this time. IPv4 is not impacted and continues to work normally.

[minor] [resolved] Feb 3, 02:56 AM — Resolved Feb 3, 03:43 AM

Elevated latency and packet loss in North American regions

3 updates
[resolved] Feb 3, 03:43 AM

This incident has been resolved.

[monitoring] Feb 3, 03:26 AM

Network performance issues between North American regions have resolved and we're continuing to monitor.

[investigating] Feb 3, 02:56 AM

We are currently investigating intermittent spikes of increased latency and packet loss between North American regions over the past hour. Users may see degraded network performance on traffic in and out of the IAD and SJC regions at this time. We are working with our upstream networking providers to investigate and mitigate these issues.

[minor] [resolved] Feb 1, 08:15 PM — Resolved Feb 1, 09:37 PM

Congestion in CDG and FRA

2 updates
resolvedFeb 1, 09:37 PM

This incident has been resolved.

investigatingFeb 1, 08:15 PM

We are experiencing elevated weekend congestion in CDG (Paris) and FRA (Frankfurt).

noneresolvedFeb 1, 02:16 AM — Resolved Feb 1, 05:48 AM

Sprites are returning "not found" or "unauthorized" errors when they shouldn't be.

6 updates
resolvedFeb 1, 05:48 AM

This incident has been resolved.

monitoringFeb 1, 05:32 AM

We've been able to restore missing sprites and tokens. We're monitoring for any additional issues.

identifiedFeb 1, 04:52 AM

We're working on a fix to restore missing sprites and tokens.

identifiedFeb 1, 03:02 AM

We identified the source of the problem as an upstream DNS issue experienced by Tigris, which has since been resolved. We're currently assessing the impact on Sprites.

investigatingFeb 1, 02:51 AM

We are continuing to investigate this issue.

investigatingFeb 1, 02:16 AM

We're currently investigating this issue.

January 2026(7 incidents)

noneresolvedJan 31, 05:35 PM — Resolved Jan 31, 06:29 PM

Grafana Log Search Display Issue

3 updates
resolvedJan 31, 06:29 PM

This has been resolved. If you are still experiencing any issues, you may need to log out and then back in.

investigatingJan 31, 05:49 PM

No logs are displayed in Grafana Log Search when using the default `*` query. You can try the following workarounds: 1. Replace the default `*` query with `NOT ""`. 2. View logs from the "Fly App" tab or the "Explore" tab. Thank you for your patience as we work to resolve this!

investigatingJan 31, 05:35 PM

No logs are displayed in Grafana Log Search when using the default `*` query. As a temporary workaround, please replace the `*` query with `NOT ""`. Thank you for your patience as we work to resolve this!

minorresolvedJan 27, 07:40 PM — Resolved Jan 29, 08:04 PM

Delayed metric reporting in NRT and SIN regions

5 updates
resolvedJan 29, 08:04 PM

This incident has been resolved. All hosts in SIN and NRT are reporting up to date metrics.

identifiedJan 29, 03:15 PM

One host in SIN is still working through its metrics backlog and is reporting delayed metrics. All other hosts in NRT and SIN are reporting metrics correctly. If needed, users with impacted machines on the remaining host can use `fly machine clone` to create new machines in the region, which should land on a different host.

identifiedJan 28, 03:27 PM

Most hosts in NRT and SIN have completed backfilling their metrics and are up to date in fly-metrics.net. Four hosts are still working through the backlog; machines on those hosts are still reporting delayed metrics at this time.

identifiedJan 28, 02:49 AM

We are continuing to process the metrics backlog in NRT and SIN. Progress is being made, but due to the volume of metrics this may still take some time to fully complete. At this time users with machines on impacted hosts will see metrics beginning to backfill into fly-metrics.net; however, many will not be fully caught up yet. This impacts metrics only; the underlying machines continue to work normally.

identifiedJan 27, 07:40 PM

A small number of hosts in the NRT (Tokyo) and SIN (Singapore) regions are reporting delayed metrics to the hosted Grafana charts at fly-metrics.net. Users with machines on impacted hosts will see delayed or spotty metrics in their Grafana charts. Only metrics for these machines are impacted: the underlying machines continue to receive and serve traffic as usual, and all machine actions (stopping, starting, deploys, etc.) continue to work normally. We are processing the backlog of metrics on these hosts, but metrics will be delayed until this is complete.

minorresolvedJan 24, 04:00 PM — Resolved Jan 24, 04:00 PM

Congestion in CDG and FRA

1 update
resolvedJan 24, 11:06 PM

We are experiencing elevated weekend congestion in CDG (Paris) and FRA (Frankfurt).

minorresolvedJan 21, 03:21 AM — Resolved Jan 21, 05:24 AM

Delays issuing certificates

3 updates
resolvedJan 21, 05:24 AM

This incident has been resolved.

monitoringJan 21, 04:59 AM

We have identified the congestion and released a fix; we'll continue to monitor while the jobs catch up.

investigatingJan 21, 03:21 AM

We are currently investigating possible delays issuing ACME certificates for new hostnames.

minorresolvedJan 20, 10:08 PM — Resolved Jan 20, 10:15 PM

Errors creating new Sprites

3 updates
resolvedJan 20, 10:15 PM

This incident has been resolved.

investigatingJan 20, 10:10 PM

We are continuing to investigate this issue.

investigatingJan 20, 10:08 PM

We're currently investigating an issue that's preventing new Sprites from being created.

noneresolvedJan 19, 02:34 PM — Resolved Jan 19, 08:07 PM

MPG network instability in LAX

3 updates
resolvedJan 19, 08:07 PM

This incident has been resolved.

monitoringJan 19, 03:11 PM

Connections are back to normal. We'll keep monitoring the region.

investigatingJan 19, 02:34 PM

We identified network partitions in the LAX region. We are investigating the problem.

majorresolvedJan 19, 12:17 PM — Resolved Jan 19, 12:29 PM

Machine errors in JNB region

2 updates
resolvedJan 19, 12:29 PM

This incident has been resolved.

identifiedJan 19, 12:17 PM

A bad deploy of an internal service in the JNB (Johannesburg) region may cause Machines API requests for machines in that region to fail. At this time, it may not be possible to create or update machines in JNB, but running apps are unaffected. The deploy is being reverted.