Is Databricks Down? Complete Status Check Guide + Quick Fixes
Databricks workspace won't load?
Clusters stuck starting?
Jobs failing with connection errors?
Before panicking, verify whether Databricks is actually down, or whether the problem is your workspace, clusters, or network. Here's your complete guide to checking Databricks status and fixing common issues fast.
Quick Check: Is Databricks Actually Down?
Don't assume it's Databricks. In practice, most "Databricks down" reports turn out to be workspace configuration issues, cluster startup failures, cloud provider problems, or networking misconfigurations.
1. Check Official Sources
Databricks Status Page:
status.databricks.com
What to look for:
- β "All Systems Operational" = Databricks is fine
- β οΈ "Partial Service Disruption" = Some services affected
- π΄ "Service Disruption" = Databricks is down
Real-time updates:
- Control Plane status (workspace access, authentication)
- Data Plane status (clusters, jobs, notebooks)
- Regional outages (AWS, Azure, GCP)
- API availability
- SQL Warehouses status
- Unity Catalog status
Twitter/X Search:
Search "Databricks down" on Twitter/X
Why it works:
- Users report outages instantly
- See if others in your region affected
- Databricks team responds here
Pro tip: If 100+ tweets in the last hour mention "Databricks down," it's probably actually down.
2. Check Service-Specific Status
Databricks has multiple services that can fail independently:
| Service | What It Does | Status Check |
|---|---|---|
| Control Plane | Workspace UI, authentication, API | status.databricks.com |
| Data Plane | Clusters, jobs, notebooks, compute | Check status page under "Data Plane" |
| SQL Warehouses | SQL endpoints, queries, dashboards | Check status page under "SQL" |
| Unity Catalog | Data governance, metadata | Check status page under "Unity Catalog" |
| Delta Lake | Table reads/writes, transactions | Check status page under "Delta" |
| MLflow | Model tracking, registry | Check status page under "MLflow" |
| Jobs/Workflows | Scheduled jobs, orchestration | Check status page under "Jobs" |
Your service might be down while Databricks globally is up.
How to check which service is affected:
- Visit status.databricks.com
- Look for specific service status
- Check your cloud provider region (AWS us-east-1, Azure East US, etc.)
- Check "Incident History" for recent issues
- Subscribe to status updates (email/SMS)
3. Check Cloud Provider Status
Databricks runs on cloud providers, so their outages affect Databricks.
| Cloud Provider | Status Page | What to Check |
|---|---|---|
| AWS | health.aws.amazon.com | EC2, S3, IAM in your region |
| Azure | status.azure.com | Virtual Machines, Storage, Active Directory |
| GCP | status.cloud.google.com | Compute Engine, Cloud Storage |
Decision tree:
Cloud provider down + Databricks status OK → Cloud provider issue
Cloud provider OK + Databricks status down → Databricks issue
Both OK + Your workspace down → Workspace configuration issue
Specific region down → Regional cloud outage
4. Test Different Access Methods
If workspace UI fails but REST API works, it's likely a browser/network issue.
| Access Method | Test Method |
|---|---|
| Workspace UI | Try loading your workspace URL |
| REST API | Test API endpoint with curl |
| CLI | Run databricks workspace list |
| JDBC/ODBC | Try SQL Warehouse connection |
Quick API test:
# Test Databricks REST API
curl -H "Authorization: Bearer <your-token>" \
https://<workspace-url>/api/2.0/clusters/list
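If you prefer scripting this check, the same call works from Python's standard library. A minimal sketch; the host and token are placeholders for your workspace URL and personal access token:

```python
import json
import urllib.request

def databricks_api_up(host: str, token: str, timeout: float = 10.0) -> bool:
    """Return True if the clusters/list endpoint answers with valid JSON."""
    req = urllib.request.Request(
        f"https://{host}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            json.load(resp)  # a parseable body means the API answered
            return resp.status == 200
    except Exception:
        return False  # DNS failure, network error, bad auth, or outage
```

A `False` here doesn't distinguish between an outage and a bad token, so pair it with the status page check above.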
If API works but UI doesn't:
- Clear browser cache
- Try incognito/private mode
- Try different browser
- Check browser console for errors (F12)
Common Databricks Error Messages (And What They Mean)
Error: "Unable to Reach Workspace"
What it means: Can't connect to Databricks workspace.
Causes:
- Network connectivity issues
- DNS resolution failure
- Workspace suspended/deleted
- VPN/proxy interference
- Browser cache corruption
Quick fixes:
- Check if databricks.com loads in browser
- Verify workspace URL is correct
- Check workspace status in cloud provider console
- Disable VPN temporarily
- Clear browser cache and cookies
- Try different browser or incognito mode
Error: "RESOURCE_DOES_NOT_EXIST"
What it means: Cluster, job, or resource not found.
Causes:
- Cluster terminated
- Job deleted
- Incorrect cluster ID
- Workspace permissions changed
- Resource moved to different workspace
Quick fixes:
- Verify resource ID is correct
- Check if cluster was auto-terminated
- Start a new cluster if needed
- Check workspace permissions
- Verify you're in the correct workspace
Error: "Cluster Failed to Start: Cloud Provider Error"
What it means: Can't provision cloud resources for cluster.
Causes:
- Cloud provider capacity limits (no available VMs)
- Insufficient cloud account quotas
- Service limits exceeded
- Regional outage
- IAM/permissions issues
- Invalid instance type
Quick fixes:
- Check cloud provider quotas:
- AWS: EC2 vCPU limits
- Azure: VM core limits
- GCP: Compute Engine quotas
- Try different instance type:
- Use smaller instance size
- Switch to different instance family
- Try different availability zone:
- Edit cluster config → Availability → Change zone
- Request quota increase:
- AWS: Service Quotas console
- Azure: Subscription β Usage + quotas
- GCP: IAM & Admin β Quotas
- Retry in a few minutes:
- Transient capacity issues often resolve quickly
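When you start clusters from scripts, the "retry in a few minutes" step is easy to automate. A sketch of exponential backoff; `start_cluster` is a hypothetical stand-in for your actual API or CLI call, and capacity errors are assumed to surface as exceptions:

```python
import time

def retry_with_backoff(fn, attempts=4, base_delay=60.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 60s, 120s, 240s, ...

# Usage (hypothetical):
# retry_with_backoff(lambda: start_cluster("my-cluster-id"))
```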
Error: "Authentication Failed" / "Invalid Access Token"
What it means: Can't authenticate to Databricks.
Causes:
- Token expired
- Token revoked
- Wrong token for workspace
- SSO/SAML issues
- Permissions changed
Quick fixes:
- Generate new personal access token:
- Workspace → Settings → User Settings → Access Tokens
- Generate New Token
- Copy and save securely
- Check token permissions:
- Token must have appropriate scopes
- Check workspace admin didn't revoke access
- Re-authenticate CLI:
databricks auth login --host <workspace-url>
- Check SSO status:
- Try logging in via browser first
- SSO provider might be down
- Check with IT if corporate SSO
Error: "Notebook Execution Failed: Cluster Terminated"
What it means: Cluster stopped while notebook was running.
Causes:
- Auto-termination triggered (idle timeout)
- Cluster crashed (OOM, driver failure)
- Cloud provider spot instance preempted
- Manual termination
- Cost limits exceeded
Quick fixes:
- Check cluster event log:
- Compute → Click cluster → Event Log tab
- Look for termination reason
- Restart cluster:
- Click "Start" on terminated cluster
- Or create new cluster
- Adjust auto-termination:
- Edit cluster → Auto Termination
- Set longer timeout (60-120 minutes)
- Use on-demand instances:
- Edit cluster → AWS/Azure/GCP settings
- Disable Spot/Preemptible instances
- Increase cluster resources:
- OOM errors? Add more memory/nodes
Error: "Job Run Failed: Cannot Create Run"
What it means: Job scheduler can't start new job run.
Causes:
- Cluster pool exhausted
- Job concurrency limits
- Cluster policy restrictions
- Permissions issues
- Cluster configuration errors
Quick fixes:
- Check job run history:
- Workflows → Your Job → Run History
- Look for error details
- Check cluster availability:
- If using cluster pool, check pool capacity
- Try running manually first
- Check job concurrency:
- Edit job → Advanced → Max Concurrent Runs
- Increase if needed
- Verify cluster config:
- Job cluster configuration valid?
- Instance types available?
- Check permissions:
- User has "Can Manage Run" permission?
Error: "Delta Table Transaction Conflict"
What it means: Concurrent writes to same Delta table failed.
Causes:
- Multiple jobs writing simultaneously
- Optimistic concurrency conflict
- Incomplete transactions
- Table locked
Quick fixes:
- Retry transaction:
- Delta handles most conflicts automatically
- Retry usually succeeds
- Check concurrent jobs:
- Multiple jobs writing to same table?
- Add job dependencies or locks
- Run OPTIMIZE:
OPTIMIZE delta.`/path/to/table`
- Check table history:
DESCRIBE HISTORY delta.`/path/to/table`
- Increase retry settings:
spark.conf.set("spark.databricks.delta.retryWriteConflict.enabled", "true")
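Retries can also live in the write path itself. A sketch: the exact exception class for write conflicts varies by runtime, so this example matches on the message text, which is an assumption to adapt to your environment:

```python
import time

def write_with_conflict_retry(write_fn, retries=3, delay=5.0):
    """Retry a Delta write when a concurrent-transaction conflict occurs."""
    for attempt in range(retries):
        try:
            return write_fn()
        except Exception as e:
            is_conflict = "ConcurrentModification" in str(e)
            if not is_conflict or attempt == retries - 1:
                raise  # not a conflict, or out of retries
            time.sleep(delay)  # give the competing writer time to finish
```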
Error: "SQL Warehouse Connection Failed"
What it means: Can't connect to SQL Warehouse endpoint.
Causes:
- Warehouse stopped
- Warehouse starting up
- Network connectivity
- Authentication failure
- Warehouse configuration error
Quick fixes:
- Check warehouse status:
- SQL Warehouses → Your Warehouse → Status
- Start if stopped
- Wait for startup:
- Warehouses take 1-3 minutes to start
- Check status indicator
- Test connection string:
# Check the warehouse endpoint is reachable (not a full JDBC test)
curl https://<workspace>.cloud.databricks.com/sql/1.0/warehouses/<warehouse-id>
- Check network access:
- IP Access Lists blocking you?
- VPN required for workspace?
- Verify credentials:
- Token valid and not expired?
- User has warehouse access?
Quick Fixes: Databricks Not Working?
Fix #1: Restart Cluster (The Classic)
Why it works: Clears connection cache, restarts Spark driver, resets configurations.
How to do it right:
For interactive clusters:
- Compute → Select your cluster
- Click "Restart" (not "Terminate")
- Wait 2-5 minutes for startup
- Check cluster event log if restart fails
For job clusters:
- Workflows → Select job
- "Run Now" creates new cluster automatically
- Or edit job → Cluster → Change configuration
- Save and run
Pro tip: Use "Restart" not "Terminate" to keep cluster config and libraries installed.
Fix #2: Check Cloud Provider Quotas
Databricks needs cloud resources, and quotas limit what you can provision.
Common quota issues:
AWS:
- vCPU limits: new accounts often start with low default caps per instance family
- Spot instance limits: Lower than on-demand
- EBS volume limits: Storage quotas
Check AWS quotas:
- AWS Console → Service Quotas
- Search "EC2"
- Look for "Running On-Demand instances"
- Request increase if needed
Azure:
- VM core limits: Total vCPUs per region
- Spot VM limits: Separate quota
- Storage account limits: IOPS/throughput
Check Azure quotas:
- Azure Portal → Subscriptions
- Usage + quotas
- Search "Compute"
- Request increase if needed
GCP:
- Compute Engine quotas: CPUs, GPUs, IP addresses
- Preemptible VM quotas: Separate from regular VMs
- Persistent disk quotas: Storage limits
Check GCP quotas:
- GCP Console → IAM & Admin → Quotas
- Filter by "Compute Engine"
- Request increase if needed
Pro tip: Request quota increases before launching large clusters. Approval can take 24-48 hours.
Fix #3: Clear Browser Cache and Cookies
Workspace UI issues often caused by stale cache.
Chrome:
- Press Ctrl+Shift+Delete (Windows) or Cmd+Shift+Delete (Mac)
- Time range: "All time"
- Check: Cookies, Cached images and files
- Click "Clear data"
- Reload workspace
Firefox:
- Press Ctrl+Shift+Delete (Windows) or Cmd+Shift+Delete (Mac)
- Time range: "Everything"
- Check: Cookies, Cache
- Click "Clear Now"
- Reload workspace
Safari:
- Safari → Preferences → Privacy
- Click "Manage Website Data"
- Remove databricks.com entries
- Reload workspace
Quick test: Try incognito/private mode first. If the workspace loads there, the cache is the issue.
Fix #4: Check Cluster Logs
Cluster logs show what went wrong.
View cluster logs:
- Compute → Select cluster
- Click "Event Log" tab (for cluster lifecycle events)
- Click "Spark UI" β Executors (for Spark errors)
- Click "Driver Logs" (for detailed driver errors)
Common log messages:
"Driver not responding":
- Driver crashed (OOM, error)
- Network connectivity lost
- Fix: Increase driver memory, check network
"Executor lost":
- Executor node failed
- Cloud provider reclaimed spot instance
- Fix: Use on-demand instances, add retry logic
"Failed to bind to port":
- Port conflict (rare)
- Fix: Restart cluster, try different cluster
"Cannot connect to S3/ADLS/GCS":
- Storage credentials expired/invalid
- Fix: Update workspace storage credentials
Fix #5: Verify Network Configuration
Network issues prevent cluster communication.
Check VPC/VNet configuration:
AWS:
- VPC must allow outbound internet (for cluster communication)
- Security groups must allow internal cluster traffic
- Subnet must have NAT gateway or internet gateway
- Check: Databricks workspace → Settings → Network
Azure:
- VNet must allow outbound internet
- NSG rules must allow cluster communication
- Subnet delegation required for Databricks
- Check: Azure Portal → Virtual Networks
GCP:
- VPC must allow outbound internet
- Firewall rules must allow cluster traffic
- Subnet must have Private Google Access enabled
- Check: GCP Console → VPC Networks
Quick test:
# From cluster notebook, test outbound connectivity
%sh
curl -I https://pypi.org
If curl fails:
- Network configuration issue
- Check firewall/security groups
- Verify NAT gateway/internet gateway configured
Fix #6: Update Libraries and Dependencies
Outdated or conflicting libraries cause failures.
Check installed libraries:
- Compute → Select cluster
- Click "Libraries" tab
- Look for red "Failed" status
Common library issues:
"Library installation failed":
- PyPI/Maven package not found
- Network connectivity to package repository
- Conflicting dependencies
Fix:
- Remove failing library
- Restart cluster
- Install compatible version
- Check library logs for details
Best practices:
- Pin library versions (pandas==1.5.3, not just pandas)
- Test libraries on test cluster first
- Use init scripts for complex setups
- Avoid conflicting libraries (e.g., TensorFlow + PyTorch issues)
Fix #7: Check Workspace Storage Credentials
Databricks needs credentials to access your cloud storage.
AWS S3:
- IAM role attached to cluster
- Instance profile configured
- S3 bucket policy allows Databricks role
Check credentials:
# Test S3 access from notebook
dbutils.fs.ls("s3://your-bucket/")
If access denied:
- Workspace Admin → Settings → AWS Credentials
- Verify instance profile ARN correct
- Check S3 bucket policy
- Test with aws s3 ls from the cluster
Azure ADLS:
- Service principal credentials
- OAuth tokens
- Managed identity
Check credentials:
# Test ADLS access
dbutils.fs.ls("abfss://container@storage.dfs.core.windows.net/")
If access denied:
- Workspace Settings → Azure ADLS Gen2
- Verify service principal credentials
- Check storage account IAM roles
GCP GCS:
- Service account keys
- Workload identity
Check credentials:
# Test GCS access
dbutils.fs.ls("gs://your-bucket/")
If access denied:
- Workspace Settings → GCP Credentials
- Verify service account has Storage Object Admin role
Fix #8: Adjust Cluster Configuration
Wrong cluster config causes failures.
Common configuration issues:
1. Instance type not available:
- Try different instance type
- Check cloud provider availability
- Use instance pool for guaranteed capacity
2. Insufficient resources:
- Increase driver memory (Edit → Driver → Memory)
- Add more worker nodes
- Use larger instance types
3. Auto-scaling issues:
- Set min/max workers appropriately
- Don't set min = max (disables autoscaling)
- Allow 2-3x headroom for scaling
4. Spark configuration:
- Check advanced options for custom Spark configs
- Common settings:
spark.sql.shuffle.partitions 200
spark.executor.memory 4g
spark.driver.memory 8g
Test configuration:
- Create new cluster with default config
- If works, issue was custom config
- Add custom configs one by one to isolate problem
Databricks Workspace Not Loading?
Issue: "Workspace URL Not Responding"
Troubleshoot:
1. Check workspace status:
- Log into cloud provider console (AWS/Azure/GCP)
- Find Databricks workspace resource
- Check if workspace running/healthy
- Check for cost/budget alerts (workspace suspended?)
2. Check DNS resolution:
# Test DNS lookup
nslookup <workspace-url>.cloud.databricks.com
If DNS fails:
- DNS server issue
- Try Google DNS (8.8.8.8)
- Try different network
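The same lookup can be done portably from Python, which is handy inside scripts where nslookup isn't available. A small sketch:

```python
import socket

def dns_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        return False  # resolution failed: DNS server or name problem

# Example: dns_resolves("<workspace-url>.cloud.databricks.com")
```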
3. Check browser:
- Try incognito mode
- Try different browser
- Clear cache and cookies (see Fix #3)
- Check browser console for errors (F12)
4. Check network:
- Disable VPN temporarily
- Try mobile hotspot (bypass corporate network)
- Check firewall rules
- Try from different location
Issue: "403 Forbidden" or "Access Denied"
Troubleshoot:
1. Check workspace permissions:
- Workspace admin may have removed your access
- Contact workspace admin
- Check email for access revocation notice
2. Check IP Access Lists:
- Workspace → Settings → IP Access Lists
- Your IP might be blocked
- VPN might change your IP to blocked range
3. Check SSO/SAML:
- Corporate SSO might be down
- Re-authenticate via SSO portal
- Contact IT if persistent
4. Check user status:
- User account might be disabled
- Check with workspace admin
Issue: "Slow Workspace Performance"
Causes:
- Too many notebooks/jobs open
- Large result sets loading
- Browser memory exhaustion
- Network latency
Fixes:
1. Close unused notebooks:
- File → Close other notebooks
- Detach from clusters when not in use
2. Limit result display:
# Don't display huge DataFrames
# Instead of:
display(df)
# Use:
display(df.limit(100))
3. Clear output:
- Cell menu → Clear All Outputs
- Reduces page memory
4. Use dedicated browser:
- Use separate browser profile for Databricks
- Avoid 50+ tabs in same browser
Databricks Clusters Not Starting?
Issue: "Cluster Stuck on 'Pending'"
Troubleshoot:
1. Check cloud provider capacity:
- No available VMs in region/zone
- Try different instance type
- Try different availability zone
- Use on-demand instead of spot
2. Check cluster event log:
- Compute → Cluster → Event Log
- Look for error messages
- Common: "Cannot launch instances", "Insufficient capacity"
3. Check quotas:
- See Fix #2 (Check Cloud Provider Quotas)
- Request quota increase if needed
4. Wait and retry:
- Capacity issues often transient
- Wait 10-15 minutes
- Terminate and restart cluster
Issue: "Cluster Starts Then Immediately Terminates"
Troubleshoot:
1. Check init scripts:
- Init script failure causes cluster termination
- Edit cluster → Init Scripts → Remove temporarily
- Test if cluster starts without init scripts
- Fix init script errors
2. Check cluster policy:
- Policy restrictions preventing cluster launch?
- Contact workspace admin
- Try cluster without policy
3. Check driver logs:
- Compute → Cluster → Driver Logs
- Look for startup errors
- Common: Library conflicts, configuration errors
4. Check instance profile/service principal:
- Invalid credentials cause startup failure
- Test credentials separately
- Update workspace credentials if needed
Issue: "Cluster Running But Notebooks Won't Execute"
Troubleshoot:
1. Detach and reattach notebook:
- Notebook → Cluster dropdown → Detach
- Wait 10 seconds
- Reattach to cluster
2. Check cluster status:
- Green = Running
- Gray = Stopped
- Orange = Starting/Restarting
- Red = Failed
3. Check notebook language:
- Notebook language must match cluster
- SQL notebooks need SQL-compatible cluster
- Python notebooks work on all clusters
4. Test with simple command:
# Test if cluster responding
print("Hello from cluster!")
If timeout:
- Cluster may be overloaded
- Check Spark UI → Executors → Active tasks
- Restart cluster if needed
Databricks Jobs Not Running?
Issue: "Job Stuck in 'Pending' State"
Troubleshoot:
1. Check job queue:
- Workflows → Job runs
- Look for many pending runs
- Max concurrent runs limit reached?
2. Check cluster availability:
- If using cluster pool, pool might be empty
- If using existing cluster, cluster might be stopped
- Try "Run Now" manually to test
3. Check permissions:
- User must have "Can Manage Run" permission
- Check job → Permissions tab
- Contact job owner if needed
4. Check job schedule:
- Edit job → Schedule
- Verify schedule is enabled
- Check if manual pause enabled
Issue: "Job Runs But Fails Immediately"
Troubleshoot:
1. Check job run output:
- Workflows → Job → Latest run → View run
- Click failed task
- Check error message and stack trace
2. Check notebook/script:
- Syntax errors
- Missing parameters
- Broken dependencies
- Test manually in notebook first
3. Check job parameters:
- Edit job → Parameters
- Verify parameter values correct
- Especially file paths, credentials
4. Check cluster logs:
- Click failed run → Cluster Logs
- Look for startup or execution errors
Issue: "Job Runs Slower Than Expected"
Causes:
- Undersized cluster
- Data skew
- Inefficient queries
- Cold start (cluster creation time)
Fixes:
1. Use existing cluster:
- Edit job → Cluster → Use existing cluster
- Avoid cold start time
- But: cluster must be running when job triggers
2. Use cluster pools:
- Pre-warmed instances
- Faster startup (30-60 seconds vs 3-5 minutes)
- Edit job → Cluster → Pool
3. Optimize job:
- Check Spark UI for bottlenecks
- Reduce data shuffles
- Add partitioning
- Cache intermediate results
4. Scale up cluster:
- Increase worker nodes
- Use larger instance types
- Enable autoscaling
Databricks SQL Warehouses Issues?
Issue: "SQL Warehouse Won't Start"
Troubleshoot:
1. Check warehouse size:
- Larger warehouses take longer (1-3 minutes)
- Wait patiently
- Check status indicator
2. Check cloud quotas:
- Same quota issues as clusters
- See Fix #2 (Check Cloud Provider Quotas)
3. Check permissions:
- User must have "Can Use" permission
- SQL Warehouses → Warehouse → Permissions
4. Check workspace status:
- Control plane down affects warehouse startup
- Check status.databricks.com
Issue: "SQL Query Timeout"
Troubleshoot:
1. Check query complexity:
- Large joins, aggregations take time
- Break into smaller queries
- Add filters to reduce data scanned
2. Increase warehouse size:
- Edit warehouse → Cluster size
- Larger = more query slots, faster execution
- 2X-Large for heavy workloads
3. Check query queue:
- SQL Warehouses → Query History
- Too many concurrent queries?
- Increase warehouse cluster size or concurrency
4. Optimize query:
-- Add filters to reduce data
SELECT * FROM large_table
WHERE date >= '2026-01-01' -- Partition filter
LIMIT 1000
-- Use materialized views for common queries
CREATE MATERIALIZED VIEW AS ...
Issue: "SQL Dashboard Not Loading"
Troubleshoot:
1. Check warehouse status:
- Dashboard queries need running warehouse
- Start warehouse if stopped
- Auto-stop might have terminated it
2. Check query refresh:
- Dashboard → Refresh settings
- Manual refresh vs auto-refresh
- Long-running queries block dashboard
3. Check data permissions:
- Unity Catalog permissions required
- User must have SELECT on tables
- Check with data owner
4. Check widget queries:
- Dashboard → Edit → Check each widget
- Individual widget query might be failing
- Fix or disable problematic widgets
Unity Catalog Issues?
Issue: "Cannot Access Table: Unity Catalog Error"
Troubleshoot:
1. Check catalog permissions:
-- Check grants on catalog
SHOW GRANTS ON CATALOG your_catalog;
-- Check grants on schema
SHOW GRANTS ON SCHEMA your_catalog.your_schema;
-- Check grants on table
SHOW GRANTS ON TABLE your_catalog.your_schema.your_table;
2. Request access:
- Contact data owner
- Use "Request Access" button in Catalog Explorer
- Workspace admin can grant permissions
3. Check catalog exists:
-- List available catalogs
SHOW CATALOGS;
-- List schemas in catalog
SHOW SCHEMAS IN your_catalog;
4. Check table path:
- Unity Catalog uses three-level namespace
- Format: catalog.schema.table
- Check for typos
Issue: "Metastore Connection Failed"
Troubleshoot:
1. Check workspace metastore assignment:
- Workspace Settings → Unity Catalog
- Verify metastore assigned to workspace
- Contact workspace admin if not assigned
2. Check network connectivity:
- Metastore in different region?
- Network rules blocking connection?
- Check VPC/VNet peering if using private connectivity
3. Check metastore status:
- Account Console → Metastores
- Check if metastore healthy
- Look for error messages
Delta Lake Issues?
Issue: "Delta Table Not Found"
Troubleshoot:
1. Check table path:
# Verify path exists
dbutils.fs.ls("dbfs:/path/to/delta/table")
# Or for Unity Catalog
spark.sql("DESCRIBE TABLE your_catalog.your_schema.your_table")
2. Check table registration:
-- List tables in schema
SHOW TABLES IN your_schema;
-- Register external Delta table
CREATE TABLE your_table
USING DELTA
LOCATION '/path/to/delta/table';
3. Check permissions:
- Read permissions on storage location
- Unity Catalog permissions if using UC
- Check with workspace admin
Issue: "Delta Transaction Failed"
Troubleshoot:
1. Retry operation:
- Delta handles most conflicts automatically
- Simply retry the operation
2. Check concurrent writes:
- Multiple jobs writing same table?
- Use merge operations instead of inserts
- Add transaction isolation
3. Run table maintenance:
-- Optimize table
OPTIMIZE your_table;
-- Vacuum old files (default 7 day retention)
VACUUM your_table RETAIN 168 HOURS;
-- Check table history
DESCRIBE HISTORY your_table;
Regional Outages: Is It Just Me?
Databricks deploys across multiple cloud regions:
| Cloud Provider | Common Regions |
|---|---|
| AWS | us-east-1, us-west-2, eu-west-1, ap-southeast-1 |
| Azure | East US, West Europe, Southeast Asia, UK South |
| GCP | us-central1, europe-west1, asia-southeast1 |
How to check for regional issues:
1. Check DownDetector:
downdetector.com/status/databricks
Shows:
- Real-time outage reports
- Heatmap of affected regions
- Spike in reports = likely real outage
2. Check cloud provider status:
- AWS outage might affect only us-east-1
- Azure issue might affect only one region
- GCP regional issues isolated
3. Check Databricks status by region:
- status.databricks.com
- Filter by cloud provider and region
- Subscribe to your specific region alerts
4. Test from different region:
- If available, try workspace in different region
- Isolates if issue is regional vs global
When Databricks Actually Goes Down
What Happens
Recent major outages:
- October 2023: 4-hour AWS us-east-1 control plane outage
- July 2023: 2-hour authentication service disruption (all clouds)
- March 2023: 6-hour Azure East US regional outage
Typical causes:
- Cloud provider outages (AWS/Azure/GCP failures)
- Control plane authentication issues
- Network connectivity problems
- Database backend failures
- Deployment issues (rare)
How Databricks Responds
Communication channels:
- status.databricks.com - Primary source
- @databricks on Twitter/X
- Email alerts (if subscribed to status page)
- In-app notifications (if workspace accessible)
Timeline:
- 0-15 min: Users report issues on Twitter/DownDetector
- 15-30 min: Databricks acknowledges on status page
- 30-120 min: Updates posted every 30 min
- Resolution: Usually 1-4 hours for major outages
What to Do During Outages
1. Check if data plane still works:
- Running clusters may continue working
- Jobs may complete even if UI is down
- Check via CLI:
databricks clusters list
2. Use backup compute:
- AWS EMR for Spark workloads
- Azure HDInsight or Synapse
- GCP Dataproc
- Run critical jobs elsewhere temporarily
3. Monitor status page:
- status.databricks.com
- Subscribe to SMS/email updates
- Check estimated time to resolution
4. Document impact:
- Note affected jobs/workflows
- Capture error messages
- Will help with root cause analysis later
5. Prepare for recovery:
- Have restart procedures ready
- Check data consistency after outage
- Re-run failed jobs when service restored
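Step 5's "re-run failed jobs" can be scripted against the Jobs API. This sketch shows only the filtering step on an already-fetched list of runs; the field names follow the Jobs 2.1 list-runs response shape and should be verified against your API version:

```python
def failed_run_ids(runs: list) -> list:
    """Pick out run IDs whose result_state is FAILED (Jobs 2.1 shape)."""
    return [
        run["run_id"]
        for run in runs
        if run.get("state", {}).get("result_state") == "FAILED"
    ]

# Usage (hypothetical): feed in the "runs" array returned by
# GET /api/2.1/jobs/runs/list, then re-trigger each returned run ID.
```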
Databricks Down Checklist
Follow these steps in order:
Step 1: Verify it's actually down
- Check Databricks Status
- Check API Status Check
- Check cloud provider status (AWS/Azure/GCP)
- Search Twitter: "Databricks down"
- Test REST API: curl a workspace API endpoint
- Try different browser/incognito mode
Step 2: Quick fixes (if Databricks is up)
- Restart cluster
- Clear browser cache and cookies
- Check cluster event logs
- Verify network connectivity
- Check cloud provider quotas
- Update workspace credentials
Step 3: Cluster troubleshooting
- Check cluster configuration (instance type, size)
- Verify cloud provider capacity available
- Check init scripts (disable temporarily)
- Review driver logs for errors
- Test with default cluster config
- Check cluster policy restrictions
Step 4: Network troubleshooting
- Test outbound connectivity from cluster
- Verify VPC/VNet configuration
- Check security group / NSG rules
- Verify NAT gateway / internet gateway
- Check storage credentials (S3/ADLS/GCS)
- Test with different network/VPN
Step 5: Job/workflow troubleshooting
- Check job run history and error messages
- Test notebook manually first
- Verify job parameters correct
- Check job permissions
- Review cluster logs for failed run
- Test with simpler job configuration
Step 6: Nuclear option
- Create new cluster with default config
- Re-import notebook from revision history
- Contact Databricks support: databricks.com/support
- Open ticket with cloud provider if quota/capacity issue
Prevent Future Issues
1. Set Up Proactive Monitoring
Monitor Databricks status:
- Subscribe to status.databricks.com (email/SMS)
- Use API Status Check for automated monitoring
- Set up Slack/Discord/email alerts for outages
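Automated polling can be sketched as below. The JSON endpoint follows the common Statuspage convention; whether status.databricks.com exposes exactly this URL is an assumption to verify before relying on it:

```python
import json
import urllib.request

# Assumed Statuspage-style endpoint; confirm the URL for your deployment.
STATUS_URL = "https://status.databricks.com/api/v2/status.json"

def overall_indicator(payload: dict) -> str:
    """Extract the overall indicator ('none', 'minor', 'major', ...)."""
    return payload.get("status", {}).get("indicator", "unknown")

def fetch_indicator(url: str = STATUS_URL, timeout: float = 10.0) -> str:
    """Fetch the status page JSON and return the overall indicator."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return overall_indicator(json.load(resp))
```

Run this on a schedule and alert whenever the indicator is anything other than "none".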
Monitor your workloads:
# Add health checks to critical notebooks
try:
    # Your data pipeline code
    df = spark.read.table("my_table")
except Exception as e:
    # Alert on failure
    dbutils.notebook.exit(f"FAILED: {str(e)}")
Monitor cluster health:
- Set up alerts for cluster failures
- Monitor job success rates
- Track cluster startup times (increasing = potential issues)
2. Use Cluster Pools
Why cluster pools help:
- Pre-warmed instances
- Faster startup (30-60 seconds vs 3-5 minutes)
- Guaranteed capacity
- Consistent environment
Create cluster pool:
- Compute → Pools → Create Pool
- Set min/max idle instances
- Choose instance type
- Use pool for interactive clusters and jobs
Pro tip: Size pool based on peak demand. Keep 2-3 idle instances ready.
3. Build Redundancy
For critical pipelines:
Multi-region strategy:
- Deploy workspaces in multiple regions
- Failover to backup region during outage
- Use cross-region storage replication
Retry logic:
from retry import retry  # third-party helper: pip install retry

@retry(tries=3, delay=60)
def run_critical_job():
    # Your job code
    spark.sql("INSERT INTO target_table SELECT * FROM source_table")
Backup compute:
- Keep alternate compute ready (EMR, HDInsight, Dataproc)
- Document failover procedures
- Test failover quarterly
4. Optimize Cluster Configuration
Right-size clusters:
- Start small, scale up as needed
- Use autoscaling for variable workloads
- Don't over-provision (wastes cost)
Best practices:
- Use cluster pools for fast startup
- Set appropriate auto-termination (30-60 min idle)
- Use spot/preemptible for non-critical workloads
- Pin library versions for consistency
Test configurations:
- Test new configs on dev cluster first
- Gradually roll out to production
- Monitor performance metrics
5. Implement Job Orchestration Best Practices
Job dependencies:
- Use Databricks Workflows for orchestration
- Set proper task dependencies
- Add retry policies (3 retries with exponential backoff)
Job monitoring:
import json

# Send notifications on job completion
dbutils.notebook.exit(json.dumps({
    "status": "SUCCESS",
    "rows_processed": row_count,
    "duration_seconds": duration
}))
Failure handling:
- Set up alerts for job failures
- Use dead letter queues for failed records
- Log detailed error messages
6. Maintain Cloud Provider Health
Monitor quotas:
- Set up alerts for quota usage (80% threshold)
- Request quota increases proactively
- Keep buffer for burst capacity
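The 80% threshold check is trivial to fold into whatever monitoring script you already run. A minimal sketch:

```python
def quota_alert(used: float, limit: float, threshold: float = 0.80) -> bool:
    """True when quota usage has crossed the alert threshold."""
    return limit > 0 and (used / limit) >= threshold

# Example: 85 of 100 vCPUs in use crosses the default 80% threshold.
```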
Track cloud provider status:
- Subscribe to AWS/Azure/GCP status pages
- Monitor your specific regions
- Note cloud provider maintenance windows
Resource management:
- Clean up unused clusters/pools
- Delete old job runs (retention policy)
- Archive unused notebooks/data
7. Keep Credentials Updated
Regular credential rotation:
- Rotate personal access tokens quarterly
- Update service principal credentials before expiration
- Test credentials after rotation
Credential management:
- Use Databricks Secrets for sensitive data
- Avoid hardcoding credentials
- Use service principals for automation
# Use Databricks Secrets
secret = dbutils.secrets.get(scope="my_scope", key="api_key")
8. Document Your Setup
Critical documentation:
- Cluster configurations (save as JSON)
- Job configurations and dependencies
- Network architecture (VPC/VNet setup)
- Credential management procedures
- Incident response runbooks
Keep updated:
- Review docs quarterly
- Update after any config changes
- Share with team members
Key Takeaways
Before assuming Databricks is down:
- ✅ Check Databricks Status
- ✅ Check cloud provider status (AWS/Azure/GCP)
- ✅ Test REST API with curl
- ✅ Search Twitter for "Databricks down"
- ✅ Check cluster event logs and driver logs
Common fixes:
- Restart cluster (fixes a large share of issues)
- Check cloud provider quotas (capacity limits)
- Clear browser cache and cookies
- Verify network configuration (VPC/VNet)
- Update storage credentials
- Adjust cluster configuration (instance type, size)
Cluster issues:
- Check event logs for startup failures
- Verify cloud provider capacity available
- Use on-demand instances instead of spot
- Remove init scripts temporarily to test
- Check cluster policy restrictions
Job/workflow issues:
- Test notebooks manually first
- Check job run history for error details
- Verify job parameters and permissions
- Review cluster logs for failed runs
- Add retry logic and monitoring
SQL Warehouse issues:
- Wait for startup (1-3 minutes)
- Check warehouse permissions
- Increase warehouse size for heavy workloads
- Optimize slow queries
Unity Catalog issues:
- Check three-level namespace (catalog.schema.table)
- Verify permissions with SHOW GRANTS
- Request access from data owner
If Databricks is actually down:
- Monitor status.databricks.com
- Running clusters may continue working
- Use backup compute for critical jobs
- Usually resolved within 1-4 hours
Prevent future issues:
- Use cluster pools for fast, reliable startup
- Build retry logic into critical jobs
- Monitor proactively with alerts
- Keep cloud quotas sized appropriately
- Document configurations and procedures
- Test failover scenarios regularly
Remember: Most "Databricks down" issues are actually cluster configuration, cloud provider quotas, network setup, or permissions problems. Try the fixes in this guide before assuming Databricks is down.
Need real-time Databricks status monitoring? Track Databricks uptime with API Status Check - Get instant alerts when Databricks goes down.
Related Resources
- Is Databricks Down Right Now? – Live status check
- Databricks Outage History – Past incidents and timeline
- Databricks vs Snowflake Uptime – Which platform is more reliable?
- API Outage Response Plan – How to handle downtime like a pro
API Status Check
Stop checking API status pages manually
Get instant email alerts when OpenAI, Stripe, AWS, and 100+ APIs go down. Know before your users do.
Free dashboard available · 14-day trial on paid plans · Cancel anytime
Browse Free Dashboard →