Fly.io Outage History
Past incidents and downtime events
Complete history of Fly.io outages, incidents, and service disruptions. Showing 50 most recent incidents.
May 2026 (1 incident)
Log search unavailable
3 updates
This incident has been resolved.
We have a mitigation in place and are monitoring results.
Log search in Grafana is currently unavailable. You may see `failed to make http request: 502` errors when accessing logs from fly-metrics.net at this time. App logs continue to be available using the `fly logs` command and in the Fly.io dashboard.
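For scripted or local access while Grafana search is down, the CLI fallback mentioned above looks like this; the app name is a placeholder:
```sh
# Stream app logs from the CLI while Grafana log search is unavailable
fly logs -a my-app
```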
April 2026 (13 incidents)
flyctl deploy creating new app instances
4 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're investigating an issue where `fly deploy` creates new Fly Machine instances rather than updating existing ones, leaving apps in a mixed state. As a workaround, please try removing the `processes = [ "app" ]` line from your fly.toml configuration file and redeploying. Alternatively, downgrading flyctl to 0.4.40 should resolve the issue in the meantime.
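A hedged sketch of both workarounds; the sed pattern assumes the processes line appears exactly as written above, and passing a version argument to the install script is an assumption worth verifying against the flyctl docs:
```sh
# Workaround 1: drop the processes line from fly.toml, then redeploy
sed -i '/processes = \[ "app" \]/d' fly.toml
fly deploy

# Workaround 2: pin flyctl to 0.4.40 (version argument syntax is an assumption)
curl -L https://fly.io/install.sh | sh -s 0.4.40
```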
Slow machines operations in IAD region
5 updates
This incident has been resolved.
Network packet loss has returned to normal levels. We are monitoring the Machines API for stability.
We are continuing to investigate this issue.
We are deploying a partial mitigation while we continue investigating.
We are currently investigating the issue. Only a portion of machines within the region are impacted.
Errors when adding or editing GitHub integrations for deployments
5 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We're investigating reports of "500" errors when trying to add a new GitHub integration or edit an existing GitHub integration in the Fly.io dashboard. This only affects "Launch an app from GitHub" and changing settings for an app set up this way. Existing integrations continue to work normally. It does not affect deploys done with `flyctl` or existing, running apps.
Errors (5xx, timeouts) in Fly.io dashboard
4 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are investigating issues with the web dashboard.
Increased latency in SIN
2 updates
This incident has been resolved.
We are currently working on resolving increased latencies in our Singapore region.
TLS certificate issues
3 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating an issue with the Vault server that stores TLS certificates. Provisioning new TLS certificates may fail, and connecting to domains whose existing certificate has not yet been cached may fail.
Network issues in SYD
3 updates
This incident has been resolved.
We've identified the issue and applied a fix. All services should be working as normal.
We're currently investigating some networking issues in SYD. This is affecting a number of our central services.
Heightened latency in ORD
3 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating heightened network latency in ORD.
Managed Postgres control plane instability in NRT (Tokyo)
4 updates
This incident has been resolved.
A fix has been implemented and we are seeing MPG performance in NRT normalize. We are continuing to monitor to ensure a stable recovery.
The issue has been identified and a fix is being implemented. Users with clusters in NRT may continue to see instability at this time.
We are investigating instability in the MPG control plane in the NRT (Tokyo, Japan) region causing unexpected cluster failovers. Clusters return to health shortly after, but some users with clusters in NRT may see dropped connections or degraded performance at this time.
Unavailable hosts in ORD region
2 updates
This incident has been resolved.
Some hosts in our Chicago (ORD) region are currently inaccessible. We are working with our provider to resolve this issue. To see if you are affected, please visit your personalized status page: https://fly.io/status A small number of Managed Postgres clusters may also be inaccessible at this time.
Managed Postgres Control Plane Issues in SYD
4 updates
This incident has been resolved.
Control plane operations in SYD have returned to normal and all clusters are healthy at this time. We're continuing to monitor to ensure stable recovery.
We are seeing an improvement in control plane performance in the SYD region. Some clusters in the region currently are showing degraded standby nodes and we are working to bring those back to full health.
We are investigating elevated control plane issues for Managed Postgres clusters in SYD. The majority of clusters appear to be running fine, but new creates, backup restores, and upgrades may show errors or take longer than usual to complete. Some clusters will have seen a failover event from primary to standby.
Metrics currently experiencing issues
4 updates
This incident has been resolved.
We are continuing to monitor for any further issues.
We have implemented a fix. We're monitoring the cluster for further issues.
We are currently investigating an issue with our metrics cluster.
GraphQL API / Dashboard Issues
4 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have restored GraphQL and dashboard availability, but some actions (e.g. app state updates) may still be delayed.
We are investigating issues with our GraphQL API and web dashboard
March 2026 (20 incidents)
Low Capacity in SIN and AMS regions
6 updates
This incident has been resolved.
We've freed up additional room in the SIN and AMS regions and are monitoring capacity.
We've freed up additional room in the SIN and AMS regions and are monitoring capacity.
We are currently investigating capacity issues in the SIN and AMS regions that are affecting:
- Machine create and start events
- Deployments, due to degraded Remote Builders
- Sprite startup from a cold state
This may also affect:
- Remote builders in the AMS and SIN regions, which could currently be experiencing degraded performance or failures
- Sprites starting from a cold state, which may fail to start
We are currently investigating elevated errors when creating and starting machines in the SIN and AMS regions. Choosing other regions to create or deploy may help in the meantime
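Where an app is region-flexible, explicitly targeting another region is one way to route around the shortage; a minimal sketch, with the image, region, and app name as illustrative placeholders:
```sh
# Create the machine in LHR instead of SIN or AMS while capacity recovers
fly machine run nginx --region lhr -a my-app
```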
Low capacity in IAD
5 updates
This incident has been resolved.
With the additional capacity we've brought online, machine start failure rates in IAD have now recovered. We'll continue to monitor IAD capacity.
We've brought some additional capacity online in IAD and are seeing improvements, and we're continuing to work on adding more and freeing up additional room.
We're continuing to evaluate our options for increasing short-term capacity in the IAD region.
We're currently investigating capacity issues in IAD that are preventing machine starts (machine creates are currently unaffected). This may result in deploys failing to complete (even for apps outside of the IAD region). As a workaround, using legacy Fly builders explicitly located in another region (i.e., `FLY_REMOTE_BUILDER_REGION=lhr fly deploy --depot=false --recreate-builder`) may help in the meantime.
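Spelled out, the workaround from the update above; the LHR region choice is illustrative:
```sh
# Build on a legacy Fly builder outside IAD, then deploy as usual
export FLY_REMOTE_BUILDER_REGION=lhr
fly deploy --depot=false --recreate-builder
```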
Machine Creates Failing in ORD Region
5 updates
This incident has been resolved.
We've implemented a fix and have seen error rates for machine creates in ORD drop off. We're continuing to monitor the results.
We've identified the cause of this increased failure rate and a fix is in progress. We are seeing most creates in ORD succeed at this time, though failure rate is still above baseline.
We are continuing to investigate this issue. We are seeing 408 errors decreasing in ORD, though still above baseline.
We are currently investigating elevated errors creating machines in the ORD (Chicago, Illinois) region. Users may see `failed to launch VM: request returned non-2xx status: 408` errors when creating, updating, or scaling machines in ORD. Existing, already running machines in the ORD region continue to run as normal.
Network issues in FRA region
4 updates
This incident has been resolved.
Some Managed Postgres clusters in the FRA region are still unreachable; we are investigating this issue.
Apps and Managed Postgres clusters in FRA region should be back online at this time. We are monitoring for any further issues.
We are investigating network issues in FRA region. Apps and/or Managed Postgres clusters in the region may be inaccessible at this time.
Backend errors when trying to use Grafana to view logs
4 updates
This incident has been resolved; Grafana logs are now working properly.
We've deployed a fix and are monitoring the results. Logs are now visible in Grafana.
Using the Logs panel in Grafana at https://fly-metrics.net/ will show a 502 error from the backend and won't show any logs. You can use `fly logs` or the live log viewer directly on https://fly.io/dashboard to view streaming logs for the time being.
Using the Logs panel in Grafana at https://fly-metrics.net/ will show a 502 error from the backend and won't show any logs. You can use `fly logs` or the live log viewer directly on https://fly.io/dashboard to view streaming logs for the time being.
Machines failing to start in DFW
5 updates
This incident has been resolved.
Machine start success rates in DFW have improved but we are continuing to monitor and make further adjustments. We will provide updates as the situation progresses.
In addition to freeing up existing capacity, the team has provisioned new capacity in DFW and we are monitoring the results.
We freed up some capacity on our workers to allow for successful Machine starts.
The Machine start failure rate is elevated in DFW.
Metrics currently experiencing issues
3 updates
This incident has been resolved. We're unable to recover the lost metrics from that one hour.
We have implemented a fix. There has been approximately 1h of lost metrics from 06:07 UTC. We're monitoring the cluster for further issues.
We are currently investigating an issue with our metrics cluster.
Machines failing to start in DFW
4 updates
This incident has been resolved. Machine creates in DFW continue to work normally.
A fix has been implemented and we are monitoring the results.
The team is currently rolling out additional capacity in DFW which should help ease Machine start failures across the region.
We are investigating reports of machines failing to start in the DFW (Dallas) region with "insufficient memory" errors. This may cause deployment failures for applications running in DFW. Our team is actively working to restore full capacity in the region. If you are affected, deploying to an alternate region may serve as a temporary workaround. We will provide updates as the situation progresses.
IPv6 networking issues in SJC region
3 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating intermittent network issues in SJC region impacting outbound public IPv6 access from Machines. Connecting to IPv6 internet resources from apps hosted in SJC region may be slow or fail at this time. IPv4 access, as well as 6PN private networking, are unaffected.
Connection Issues in SJC
2 updates
This incident has been resolved.
Between 13:55 and 14:03 UTC, machines and MPG clusters hosted in the SJC region saw elevated connection errors. Users may have seen errors connecting to or from most machines in the region, as well as with deployments or updates to machines in the region. Networking has returned to normal in the region, and we are continuing to monitor closely to ensure stable recovery.
Fly ssh console command failing
3 updates
This incident has been resolved.
A fix has been implemented and we are seeing `ssh console` commands succeed as normal.
We have identified an issue causing new `fly ssh console` connections to fail with 500 errors. A fix is in progress.
Sprites Operations: 401 errors for certain organizations
2 updates
This incident has been resolved.
Organizations with names prefixed with numerical digits may experience 401 errors. Affected operations include actions such as Sprite creation, listing, and so on. A fix has been in place since 2026-03-14 12:30 UTC and we are monitoring the results.
Setting secrets and creating apps is degraded
4 updates
This incident has been resolved.
While the secret storage service was in a read-only state, app creation requests queued up, due to the retry logic and insufficient request concurrency limits in our GraphQL API. This prevented our GraphQL API from serving any other requests. We have scaled up the GraphQL API and are continuing to monitor the situation.
A fix has been implemented and we are monitoring the results.
An ongoing data migration in our secret storage service is causing degraded Machines API functionality.
Private networking issues in SYD region
3 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating a private networking failure between SYD and other regions. Apps continue to run, and private networking within SYD is unaffected.
Routing issues in NA regions
3 updates
This incident has been resolved. Due to a BGP issue, we saw some North American traffic routed to edges in Singapore (sin). Users in North America would have seen additional request latency during this period.
A fix has been implemented and we are monitoring the results.
We're aware of routing issues affecting some customers in North America regions, and we're actively investigating.
Elevated GraphQL API errors
3 updates
This incident was caused by a failed Redis node that powers our GraphQL API. We were able to recreate the Redis node and restore service. We are still investigating the root cause of the failure. In the meantime, all API endpoints now appear to be stable and errors have dropped to baseline levels.
A fix has been implemented and we are monitoring the results.
We're investigating elevated GraphQL errors that affect some API endpoints.
Cost Explorer fails to load
2 updates
This incident has been resolved.
We are currently investigating this issue. The page currently displays: "We’re having trouble loading the cost breakdown."
Certificates issues affecting API and proxy
1 update
Between 19:54 and 20:06 UTC, our Vault cluster serving app certificates was unavailable. This caused various API requests to fail, mainly operations on certificates but also app creates and IP assignments. As the failure mode was Vault requests hanging rather than failing immediately, TLS requests through fly-proxy for domains where the certificate was not cached on the local node remained open for a long time while proxy attempted to fetch the certificate; this caused some connections to fail as too many connection slots were taken up by requests waiting on Vault. The root cause of this incident was a partially completed update to the Vault cluster. We will be implementing safeguards in the proxy for this failure mode, as well as improving certificate storage longer-term.
Machines failing to boot in EWR
4 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Issues with the Machines API
4 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're currently investigating issues with the Machines API. Customer deployments and the Fly dashboard may be affected.
February 2026 (16 incidents)
Slow API requests
9 updates
This incident has been resolved. All platform and API operations are working normally.
API and platform operations have normalized. We are continuing to monitor to ensure full and stable recovery. Background jobs are almost fully caught up. Users may still see slightly slower requests creating new apps / orgs, but they should complete successfully. Sprite and MPG cluster creations are processing as normal.
A second fix has been deployed and database load has returned to normal, resulting in API response times beginning to normalize. Most Machines API requests should succeed as normal, and deploys to existing apps should also work. We are working through a backlog of background jobs; new app / organization creations and other operations that use these will continue to see increased latency or failures while we work through it. New MPG cluster and new Sprite creation continues to be impacted.
An initial fix has been deployed and we are seeing improvements in load and API performance. Some operations that rely on the GraphQL API, such as new app creations and some deployments, will continue to fail at this time. We are continuing to work on restoring full availability.
We are currently seeing full API failures for requests to our GraphQL API and elevated failures for the Machines API. Direct calls to these APIs may fail, along with many flyctl commands. We have identified the cause of the issue and are continuing to work on a fix. Existing running machines and apps should continue to be reachable, but creates, deploys, and other features relying on platform API calls will fail at this time.
New Sprite creations are also timing out or failing at this time. We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We have identified the cause of the increased latency and are working on a fix. The most common error we are seeing is a timeout when users attempt to perform an action against a newly created app or machine resource; those requests may time out or fail with an `app|machine not found` error.
We are investigating increased API request latency and timeouts with the main platform API. This is impacting multiple operations, including creating, querying, or performing actions against machines, as well as platform-level operations like adding payment methods.
Capacity issues in iad and dfw
3 updates
This incident has been resolved.
We have provisioned additional capacity in dfw and iad and are monitoring to ensure machine and builder starts are succeeding consistently.
These regions (Dallas, TX dfw and Ashburn, VA iad) are currently low on capacity. New machine creates in these regions might fail temporarily, and Depot builders may be unavailable, causing deploys to hang in "Waiting for Depot builder". If you are having issues with Depot builders, consider moving them to a different non-iad, non-dfw region in your fly.io dashboard's "Settings" page under "App builders", or try `--depot=false`.
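For the CLI side of that workaround, a single flag skips Depot builders for one deploy:
```sh
# Fall back to legacy Fly builders instead of Depot for this deploy
fly deploy --depot=false
```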
Capacity issues in iad and dfw
6 updates
This incident has been resolved.
We're continuing to monitor after having added more capacity to our DFW and IAD regions. Deploys or machine starts using existing volumes in these regions may still hit a capacity issue. Users should use `fly volume fork --vm-memory <size>` to fork the volume to a host with more capacity, then retry the deploy or start command using the new volume.
We have added additional capacity in the DFW and IAD regions and are monitoring the impact. New machine creates and deploys without volumes are seeing improved success rates. Deploys using Depot builders in those regions are also improving, with much quicker builder start times. Deploys or machine starts using existing volumes in these regions may still hit a capacity issue. Users should use `fly volume fork --vm-memory <size>` to fork the volume to a host with more capacity, then retry the deploy or start command using the new volume.
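An end-to-end sketch of that volume workaround; the volume ID, memory size, and app name are hypothetical placeholders, not values from the incident:
```sh
# Fork the affected volume onto a host with spare capacity, then redeploy
fly volume fork vol_0123456789 --vm-memory 2048 -a my-app
fly deploy -a my-app
```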
We've identified that some newly created Managed Postgres clusters are failing to come up healthy in these regions.
New machine creates in these regions might fail temporarily, and Depot builders may be unavailable. If you are having issues with Depot builders, consider moving them to a different region, or try `--depot=false`.
We have identified the problem and are working on a fix.
Sprites API degradation
3 updates
This incident has been resolved.
A slow deploy is causing Sprites API degradation. We are implementing a fix.
A slow deploy is causing Sprites API degradation. We are implementing a fix.
Metrics are degraded
5 updates
Metrics processing has caught up, and we don't see any data loss.
Delayed metrics are still being processed.
Metrics are coming back online, but it will take a little time to process what's backed up in the queues.
We're continuing to work with VictoriaMetrics support on a fix for this issue.
In some cases data is missing or lagging. We've identified the problem and are working on a fix.
Sprite creations failing
3 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating issues creating new Sprites.
Degraded Managed Postgres Control Plane
2 updates
This incident has been resolved as of 20:30 UTC.
We are currently investigating issues with the MPG control plane. Users may experience delays or hanging when creating or deleting databases via the dashboard or CLI.
Deploys hanging at waiting for Depot Builder
5 updates
This incident has been resolved.
The fix has been rolled out and we are seeing deploys using Depot builders succeed normally. We continue to monitor to ensure full recovery. Depot builders have been re-enabled as the default option for new deploys.
A fix is being rolled out. Fly builders continue to be the default while this is deployed.
We are again seeing elevated latency provisioning Depot builders on new deploys. Users may see deploys using Depot builders hang or time out at the "Waiting for Depot Builder" step. We are working on a fix, and we are switching all deploys to use the default Fly builders in the meantime. If desired, users can manually switch back to Depot builders using `fly deploy --depot=true`, but may continue to see latency issues at this time.
We have seen elevated latency provisioning Depot builders during deployments over the past hour. This caused some deploys to hang or time out at the "Waiting for Depot Builder" step in this period. Latency has improved and builder provision times are back to normal. We're continuing to monitor to ensure latency remains normal.
Networking issues for users connecting through lhr
3 updates
Network traffic in LHR has been stable for some time now; we are not seeing any further issues.
A fix has been implemented and we are monitoring the results.
We’re currently investigating this issue.
Investigating registry issues affecting deploys
5 updates
This incident has been resolved.
While we have seen some improvement from the previous fix, we are still seeing elevated rates of registry connection issues. Users may continue to see slower machine creates and deploys due to slow image pulls; deploys may succeed on a retry. We are continuing to work on restoring normal registry performance.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Control plane state delayed on some hosts possibly causing network or deployment disruption
4 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
flyctl deploy timeouts
3 updates
Earlier today, an issue caused elevated rate limiting and some deployment timeouts. A fix is in place and deployments are back to normal.
A fix has been implemented and we are monitoring the results.
We're investigating elevated 429 errors from flaps causing deployment timeouts. Affected deploys are failing with: `✖ Failed: error waiting for release_command machine XX to finish running: timeout reached waiting for machine's state to change. Your machine never reached the state "destroyed".`
Degraded Managed Postgres Control Plane in ORD
5 updates
This incident has been resolved.
A fix has been implemented and we are seeing full recovery of the control plane in ORD. With that recovery we are seeing impacted replicas catching up and clusters returning to normal health. We're continuing to monitor for full recovery.
We are continuing to work on a fix for this issue.
The issue has been identified and we are working on a fix. The majority of MPG clusters in ORD continue to run normally, though some users may still see degraded replicas at this time. Some clusters in the region will have experienced a primary -> replica failover.
We are currently investigating issues with the MPG control plane in ORD. A small number of clusters in the region may be seeing replication lag or PgBouncer connectivity issues at this time.
Issues with deploying apps using Depot builders for new accounts
4 updates
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Some new Fly.io users may encounter an "upgrade your organization" error message when attempting to deploy apps for the first time. We're currently working with Depot to figure out what's causing the issue. In the meantime, you should be able to work around the issue by using Fly builders with `fly deploy --depot=false`.
Creating new sprites is degraded
6 updates
This incident has been resolved.
Sprite creation appears to be back to normal operation now.
We've identified the cause of the delay following creates and we're deploying a fix.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
Sprite creation generates an error that the sprite "is not assigned to compute." Eventually the sprite transitions from an unknown state to warm, so there is a delay before the sprite is usable.
Degraded MPG clusters in IAD
5 updates
This incident has been resolved.
We've rolled out a fix for the remaining impacted clusters, and we're now monitoring the results.
We've rolled out a fix for some additional impacted clusters, and we're continuing to work on the remaining clusters.
We've identified the issue - some MPG clusters in IAD should be seeing improvements, and we're working on rolling out a fix for the remaining impacted clusters.
We're currently looking into an issue with MPG clusters in the IAD region.