Is Databricks Down? Complete Status Check Guide + Quick Fixes

Databricks workspace won't load?
Clusters stuck starting?
Jobs failing with connection errors?

Before panicking, verify whether Databricks is actually down or whether the problem lies in your workspace, clusters, or network. Here's your complete guide to checking Databricks status and fixing common issues fast.

Quick Check: Is Databricks Actually Down?

Don't assume it's Databricks. Most "Databricks down" reports turn out to be workspace configuration issues, cluster startup failures, cloud provider problems, or networking misconfigurations.

1. Check Official Sources

Databricks Status Page:
🔗 status.databricks.com

What to look for:

  • ✅ "All Systems Operational" = Databricks is fine
  • ⚠️ "Partial Service Disruption" = Some services affected
  • 🔴 "Service Disruption" = Databricks is down

Real-time updates:

  • Control Plane status (workspace access, authentication)
  • Data Plane status (clusters, jobs, notebooks)
  • Regional outages (AWS, Azure, GCP)
  • API availability
  • SQL Warehouses status
  • Unity Catalog status
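
Prefer scripting this check? status.databricks.com appears to be hosted on Atlassian Statuspage, which exposes a standard JSON endpoint; a minimal sketch, assuming that endpoint is available:

# Poll the status page's JSON endpoint (assumes an Atlassian Statuspage
# deployment exposing /api/v2/status.json)
import requests

resp = requests.get("https://status.databricks.com/api/v2/status.json", timeout=10)
resp.raise_for_status()
indicator = resp.json()["status"]["indicator"]  # "none", "minor", "major", or "critical"
print(f"Databricks status indicator: {indicator}")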

Twitter/X Search:
🔗 Search "Databricks down" on Twitter

Why it works:

  • Users report outages instantly
  • See whether others in your region are affected
  • The Databricks team responds here

Pro tip: If 100+ tweets in the last hour mention "Databricks down," it's probably actually down.


2. Check Service-Specific Status

Databricks has multiple services that can fail independently:

| Service | What It Does | Status Check |
| --- | --- | --- |
| Control Plane | Workspace UI, authentication, API | status.databricks.com |
| Data Plane | Clusters, jobs, notebooks, compute | Status page under "Data Plane" |
| SQL Warehouses | SQL endpoints, queries, dashboards | Status page under "SQL" |
| Unity Catalog | Data governance, metadata | Status page under "Unity Catalog" |
| Delta Lake | Table reads/writes, transactions | Status page under "Delta" |
| MLflow | Model tracking, registry | Status page under "MLflow" |
| Jobs/Workflows | Scheduled jobs, orchestration | Status page under "Jobs" |

Your service might be down while Databricks globally is up.

How to check which service is affected:

  1. Visit status.databricks.com
  2. Look for specific service status
  3. Check your cloud provider region (AWS us-east-1, Azure East US, etc.)
  4. Check "Incident History" for recent issues
  5. Subscribe to status updates (email/SMS)

3. Check Cloud Provider Status

Databricks runs on cloud providers, so their outages affect Databricks.

| Cloud Provider | Status Page | What to Check |
| --- | --- | --- |
| AWS | health.aws.amazon.com | EC2, S3, IAM in your region |
| Azure | status.azure.com | Virtual Machines, Storage, Active Directory |
| GCP | status.cloud.google.com | Compute Engine, Cloud Storage |

Decision tree:

Cloud provider down + Databricks status OK → Cloud provider issue
Cloud provider OK + Databricks status down → Databricks issue
Both OK + Your workspace down → Workspace configuration issue
Specific region down → Regional cloud outage
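
The same decision tree as a small Python helper, if you want to wire it into a monitoring script (the inputs are the results of the checks above):

def diagnose(cloud_ok: bool, databricks_ok: bool, workspace_ok: bool) -> str:
    # Encodes the decision tree above
    if not cloud_ok and databricks_ok:
        return "Cloud provider issue"
    if cloud_ok and not databricks_ok:
        return "Databricks issue"
    if cloud_ok and databricks_ok and not workspace_ok:
        return "Workspace configuration issue (or a regional outage)"
    return "All checks pass"

print(diagnose(cloud_ok=True, databricks_ok=True, workspace_ok=False))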

4. Test Different Access Methods

If workspace UI fails but REST API works, it's likely a browser/network issue.

| Access Method | Test Method |
| --- | --- |
| Workspace UI | Try loading your workspace URL |
| REST API | Test API endpoint with curl |
| CLI | Run databricks workspace list |
| JDBC/ODBC | Try SQL Warehouse connection |

Quick API test:

# Test Databricks REST API
curl -H "Authorization: Bearer <your-token>" \
  https://<workspace-url>/api/2.0/clusters/list

If API works but UI doesn't:

  • Clear browser cache
  • Try incognito/private mode
  • Try different browser
  • Check browser console for errors (F12)

Common Databricks Error Messages (And What They Mean)

Error: "Unable to Reach Workspace"

What it means: Can't connect to Databricks workspace.

Causes:

  • Network connectivity issues
  • DNS resolution failure
  • Workspace suspended/deleted
  • VPN/proxy interference
  • Browser cache corruption

Quick fixes:

  1. Check if databricks.com loads in browser
  2. Verify workspace URL is correct
  3. Check workspace status in cloud provider console
  4. Disable VPN temporarily
  5. Clear browser cache and cookies
  6. Try different browser or incognito mode

Error: "RESOURCE_DOES_NOT_EXIST"

What it means: Cluster, job, or resource not found.

Causes:

  • Cluster terminated
  • Job deleted
  • Incorrect cluster ID
  • Workspace permissions changed
  • Resource moved to different workspace

Quick fixes:

  1. Verify resource ID is correct
  2. Check if cluster was auto-terminated
  3. Start a new cluster if needed
  4. Check workspace permissions
  5. Verify you're in the correct workspace
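
To verify a cluster ID from code instead of the UI, you can query the Clusters API directly; a minimal sketch with placeholder workspace URL, token, and cluster ID:

# Look up a cluster by ID via the Clusters REST API
import requests

resp = requests.get(
    "https://<workspace-url>/api/2.0/clusters/get",
    headers={"Authorization": "Bearer <your-token>"},
    params={"cluster_id": "<cluster-id>"},
    timeout=30,
)
if resp.ok:
    print("Cluster exists, state:", resp.json().get("state"))
else:
    # A RESOURCE_DOES_NOT_EXIST error here means the ID is wrong or the resource was deleted
    print("Lookup failed:", resp.status_code, resp.text)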

Error: "Cluster Failed to Start: Cloud Provider Error"

What it means: Can't provision cloud resources for cluster.

Causes:

  • Cloud provider capacity limits (no available VMs)
  • Insufficient cloud account quotas
  • Service limits exceeded
  • Regional outage
  • IAM/permissions issues
  • Invalid instance type

Quick fixes:

  1. Check cloud provider quotas:
    • AWS: EC2 vCPU limits
    • Azure: VM core limits
    • GCP: Compute Engine quotas
  2. Try different instance type:
    • Use smaller instance size
    • Switch to different instance family
  3. Try different availability zone:
    • Edit cluster config → Availability → Change zone
  4. Request quota increase:
    • AWS: Service Quotas console
    • Azure: Subscription → Usage + quotas
    • GCP: IAM & Admin → Quotas
  5. Retry in a few minutes:
    • Transient capacity issues often resolve quickly

Error: "Authentication Failed" / "Invalid Access Token"

What it means: Can't authenticate to Databricks.

Causes:

  • Token expired
  • Token revoked
  • Wrong token for workspace
  • SSO/SAML issues
  • Permissions changed

Quick fixes:

  1. Generate new personal access token:
    • Workspace → Settings → User Settings → Access Tokens
    • Generate New Token
    • Copy and save securely
  2. Check token permissions:
    • Token must have appropriate scopes
    • Check workspace admin didn't revoke access
  3. Re-authenticate CLI:
    databricks auth login --host <workspace-url>
    
  4. Check SSO status:
    • Try logging in via browser first
    • SSO provider might be down
    • Check with IT if corporate SSO
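
To check whether a token is still valid, call any lightweight authenticated endpoint; a sketch using the SCIM "Me" endpoint, with placeholder URL and token:

# A 200 means the token authenticates; 401/403 means expired, revoked, or blocked
import requests

resp = requests.get(
    "https://<workspace-url>/api/2.0/preview/scim/v2/Me",
    headers={"Authorization": "Bearer <your-token>"},
    timeout=30,
)
if resp.status_code == 200:
    print("Token valid for:", resp.json().get("userName"))
elif resp.status_code in (401, 403):
    print("Token expired, revoked, or lacking permissions: generate a new one")
else:
    print("Unexpected response:", resp.status_code)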

Error: "Notebook Execution Failed: Cluster Terminated"

What it means: Cluster stopped while notebook was running.

Causes:

  • Auto-termination triggered (idle timeout)
  • Cluster crashed (OOM, driver failure)
  • Cloud provider spot instance preempted
  • Manual termination
  • Cost limits exceeded

Quick fixes:

  1. Check cluster event log:
    • Compute → Click cluster → Event Log tab
    • Look for termination reason
  2. Restart cluster:
    • Click "Start" on terminated cluster
    • Or create new cluster
  3. Adjust auto-termination:
    • Edit cluster → Auto Termination
    • Set longer timeout (60-120 minutes)
  4. Use on-demand instances:
    • Edit cluster → AWS/Azure/GCP settings
    • Disable Spot/Preemptible instances
  5. Increase cluster resources:
    • OOM errors? Add more memory/nodes
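
You can also pull the termination reason programmatically through the cluster events API; a sketch with placeholders for workspace URL, token, and cluster ID:

# Fetch the most recent termination events for a cluster
import requests

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/events",
    headers={"Authorization": "Bearer <your-token>"},
    json={"cluster_id": "<cluster-id>", "event_types": ["TERMINATING"], "limit": 5},
    timeout=30,
)
for event in resp.json().get("events", []):
    # details.reason carries a code such as USER_REQUEST or SPOT_INSTANCE_TERMINATION
    print(event.get("timestamp"), event.get("details", {}).get("reason"))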

Error: "Job Run Failed: Cannot Create Run"

What it means: Job scheduler can't start new job run.

Causes:

  • Cluster pool exhausted
  • Job concurrency limits
  • Cluster policy restrictions
  • Permissions issues
  • Cluster configuration errors

Quick fixes:

  1. Check job run history:
    • Workflows → Your Job → Run History
    • Look for error details
  2. Check cluster availability:
    • If using cluster pool, check pool capacity
    • Try running manually first
  3. Check job concurrency:
    • Edit job → Advanced → Max Concurrent Runs
    • Increase if needed
  4. Verify cluster config:
    • Job cluster configuration valid?
    • Instance types available?
  5. Check permissions:
    • User has "Can Manage Run" permission?

Error: "Delta Table Transaction Conflict"

What it means: Concurrent writes to same Delta table failed.

Causes:

  • Multiple jobs writing simultaneously
  • Optimistic concurrency conflict
  • Incomplete transactions
  • Table locked

Quick fixes:

  1. Retry transaction:
    • Delta handles most conflicts automatically
    • Retry usually succeeds
  2. Check concurrent jobs:
    • Multiple jobs writing to same table?
    • Add job dependencies or locks
  3. Run OPTIMIZE:
    OPTIMIZE delta.`/path/to/table`
    
  4. Check table history:
    DESCRIBE HISTORY delta.`/path/to/table`
    
  5. Increase retry settings:
    spark.conf.set("spark.databricks.delta.retryWriteConflict.enabled", "true")
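
If conflicts keep recurring, wrapping the write in an explicit retry loop is often simpler; a minimal sketch, assuming the delta-spark Python bindings are available on the cluster (table name and write mode are illustrative):

# Retry Delta write conflicts with linear backoff
import time
from delta.exceptions import ConcurrentAppendException

def write_with_retry(df, table_name, attempts=3):
    for attempt in range(attempts):
        try:
            df.write.format("delta").mode("append").saveAsTable(table_name)
            return
        except ConcurrentAppendException:
            if attempt == attempts - 1:
                raise
            time.sleep(30 * (attempt + 1))  # back off before retrying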
    

Error: "SQL Warehouse Connection Failed"

What it means: Can't connect to SQL Warehouse endpoint.

Causes:

  • Warehouse stopped
  • Warehouse starting up
  • Network connectivity
  • Authentication failure
  • Warehouse configuration error

Quick fixes:

  1. Check warehouse status:
    • SQL Warehouses → Your Warehouse → Status
    • Start if stopped
  2. Wait for startup:
    • Warehouses take 1-3 minutes to start
    • Check status indicator
  3. Test the connection via the REST API:
    # Check warehouse status via the SQL Warehouses API
    curl -H "Authorization: Bearer <your-token>" \
      https://<workspace>.cloud.databricks.com/api/2.0/sql/warehouses/<warehouse-id>
    
  4. Check network access:
    • IP Access Lists blocking you?
    • VPN required for workspace?
  5. Verify credentials:
    • Token valid and not expired?
    • User has warehouse access?
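
For a programmatic connectivity test, the databricks-sql-connector package gives a cleaner signal than curl; a sketch with placeholder connection details:

# Requires: pip install databricks-sql-connector
from databricks import sql

with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<your-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchone())  # (1,) means the warehouse is reachable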

Quick Fixes: Databricks Not Working?

Fix #1: Restart Cluster (The Classic)

Why it works: Clears connection cache, restarts Spark driver, resets configurations.

How to do it right:

For interactive clusters:

  1. Compute → Select your cluster
  2. Click "Restart" (not "Terminate")
  3. Wait 2-5 minutes for startup
  4. Check cluster event log if restart fails

For job clusters:

  1. Workflows → Select job
  2. "Run Now" creates new cluster automatically
  3. Or edit job → Cluster → Change configuration
  4. Save and run

Pro tip: Use "Restart" not "Terminate" to keep cluster config and libraries installed.
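
The same restart is scriptable when the UI is flaky; a sketch against the Clusters API with placeholder values:

# Restart a cluster via the REST API, preserving its configuration
import requests

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/restart",
    headers={"Authorization": "Bearer <your-token>"},
    json={"cluster_id": "<cluster-id>"},
    timeout=30,
)
resp.raise_for_status()
print("Restart requested; poll /api/2.0/clusters/get until state is RUNNING")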


Fix #2: Check Cloud Provider Quotas

Databricks needs cloud resourcesβ€”quotas limit what you can provision.

Common quota issues:

AWS:

  • vCPU limits: new accounts often start with low On-Demand vCPU quotas per instance family
  • Spot instance limits: Lower than on-demand
  • EBS volume limits: Storage quotas

Check AWS quotas:

  1. AWS Console → Service Quotas
  2. Search "EC2"
  3. Look for "Running On-Demand instances"
  4. Request increase if needed

Azure:

  • VM core limits: Total vCPUs per region
  • Spot VM limits: Separate quota
  • Storage account limits: IOPS/throughput

Check Azure quotas:

  1. Azure Portal → Subscriptions
  2. Usage + quotas
  3. Search "Compute"
  4. Request increase if needed

GCP:

  • Compute Engine quotas: CPUs, GPUs, IP addresses
  • Preemptible VM quotas: Separate from regular VMs
  • Persistent disk quotas: Storage limits

Check GCP quotas:

  1. GCP Console → IAM & Admin → Quotas
  2. Filter by "Compute Engine"
  3. Request increase if needed

Pro tip: Request quota increases before launching large clusters. Approval can take 24-48 hours.
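
On AWS you can check the relevant quota from code before launching a large cluster; a sketch using boto3 (the quota code below is the standard On-Demand instances vCPU limit, worth double-checking for your account):

# Requires: pip install boto3, plus credentials allowed to call servicequotas:GetServiceQuota
import boto3

client = boto3.client("service-quotas", region_name="us-east-1")
quota = client.get_service_quota(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",  # Running On-Demand Standard instances (vCPU limit)
)
print("On-Demand vCPU quota:", quota["Quota"]["Value"])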


Fix #3: Clear Browser Cache and Cookies

Workspace UI issues are often caused by stale cache.

Chrome:

  1. Press Ctrl+Shift+Delete (Windows) or Cmd+Shift+Delete (Mac)
  2. Time range: "All time"
  3. Check: Cookies, Cached images and files
  4. Click "Clear data"
  5. Reload workspace

Firefox:

  1. Press Ctrl+Shift+Delete (Windows) or Cmd+Shift+Delete (Mac)
  2. Time range: "Everything"
  3. Check: Cookies, Cache
  4. Click "Clear Now"
  5. Reload workspace

Safari:

  1. Safari → Preferences → Privacy
  2. Click "Manage Website Data"
  3. Remove databricks.com entries
  4. Reload workspace

Quick test: Try incognito/private mode first; if it works, cache is the issue.


Fix #4: Check Cluster Logs

Cluster logs show what went wrong.

View cluster logs:

  1. Compute → Select cluster
  2. Click "Event Log" tab (for cluster lifecycle events)
  3. Click "Spark UI" → Executors (for Spark errors)
  4. Click "Driver Logs" (for detailed driver errors)

Common log messages:

"Driver not responding":

  • Driver crashed (OOM, error)
  • Network connectivity lost
  • Fix: Increase driver memory, check network

"Executor lost":

  • Executor node failed
  • Cloud provider reclaimed spot instance
  • Fix: Use on-demand instances, add retry logic

"Failed to bind to port":

  • Port conflict (rare)
  • Fix: Restart cluster, try different cluster

"Cannot connect to S3/ADLS/GCS":

  • Storage credentials expired/invalid
  • Fix: Update workspace storage credentials

Fix #5: Verify Network Configuration

Network issues prevent cluster communication.

Check VPC/VNet configuration:

AWS:

  • VPC must allow outbound internet (for cluster communication)
  • Security groups must allow internal cluster traffic
  • Subnet must have NAT gateway or internet gateway
  • Check: Databricks workspace → Settings → Network

Azure:

  • VNet must allow outbound internet
  • NSG rules must allow cluster communication
  • Subnet delegation required for Databricks
  • Check: Azure Portal → Virtual Networks

GCP:

  • VPC must allow outbound internet
  • Firewall rules must allow cluster traffic
  • Subnet must have Private Google Access enabled
  • Check: GCP Console → VPC Networks

Quick test:

# From cluster notebook, test outbound connectivity
%sh
curl -I https://pypi.org

If curl fails:

  • Network configuration issue
  • Check firewall/security groups
  • Verify NAT gateway/internet gateway configured

Fix #6: Update Libraries and Dependencies

Outdated or conflicting libraries cause failures.

Check installed libraries:

  1. Compute → Select cluster
  2. Click "Libraries" tab
  3. Look for red "Failed" status

Common library issues:

"Library installation failed":

  • PyPI/Maven package not found
  • Network connectivity to package repository
  • Conflicting dependencies

Fix:

  1. Remove failing library
  2. Restart cluster
  3. Install compatible version
  4. Check library logs for details

Best practices:

  • Pin library versions (pandas==1.5.3, not just pandas)
  • Test libraries on test cluster first
  • Use init scripts for complex setups
  • Avoid conflicting libraries (e.g., TensorFlow + PyTorch issues)
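
To spot version drift quickly, dump the installed versions of your key packages from a notebook; a small sketch (the package list is illustrative):

# Print installed versions of the packages your pipeline depends on
import importlib.metadata as metadata

for pkg in ("pandas", "numpy", "pyarrow"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")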

Fix #7: Check Workspace Storage Credentials

Databricks needs credentials to access your cloud storage.

AWS S3:

  • IAM role attached to cluster
  • Instance profile configured
  • S3 bucket policy allows Databricks role

Check credentials:

# Test S3 access from notebook
dbutils.fs.ls("s3://your-bucket/")

If access denied:

  1. Workspace Admin → Settings → AWS Credentials
  2. Verify instance profile ARN correct
  3. Check S3 bucket policy
  4. Test with aws s3 ls from cluster

Azure ADLS:

  • Service principal credentials
  • OAuth tokens
  • Managed identity

Check credentials:

# Test ADLS access
dbutils.fs.ls("abfss://container@storage.dfs.core.windows.net/")

If access denied:

  1. Workspace Settings → Azure ADLS Gen2
  2. Verify service principal credentials
  3. Check storage account IAM roles

GCP GCS:

  • Service account keys
  • Workload identity

Check credentials:

# Test GCS access
dbutils.fs.ls("gs://your-bucket/")

If access denied:

  1. Workspace Settings → GCP Credentials
  2. Verify service account has Storage Object Admin role

Fix #8: Adjust Cluster Configuration

Wrong cluster config causes failures.

Common configuration issues:

1. Instance type not available:

  • Try different instance type
  • Check cloud provider availability
  • Use instance pool for guaranteed capacity

2. Insufficient resources:

  • Increase driver memory (Edit → Driver → Memory)
  • Add more worker nodes
  • Use larger instance types

3. Auto-scaling issues:

  • Set min/max workers appropriately
  • Don't set min = max (disables autoscaling)
  • Allow 2-3x headroom for scaling

4. Spark configuration:

  • Check advanced options for custom Spark configs
  • Common settings:
    spark.sql.shuffle.partitions 200
    spark.executor.memory 4g
    spark.driver.memory 8g
    

Test configuration:

  1. Create new cluster with default config
  2. If works, issue was custom config
  3. Add custom configs one by one to isolate problem
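
When isolating a bad custom config, it helps to print what the cluster actually ended up with; run this in a notebook attached to the cluster:

# Dump the effective Spark configuration to compare against your custom settings
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if key.startswith(("spark.sql", "spark.executor", "spark.driver")):
        print(key, "=", value)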

Databricks Workspace Not Loading?

Issue: "Workspace URL Not Responding"

Troubleshoot:

1. Check workspace status:

  • Log into cloud provider console (AWS/Azure/GCP)
  • Find Databricks workspace resource
  • Check if workspace running/healthy
  • Check for cost/budget alerts (workspace suspended?)

2. Check DNS resolution:

# Test DNS lookup
nslookup <workspace-url>.cloud.databricks.com

If DNS fails:

  • DNS server issue
  • Try Google DNS (8.8.8.8)
  • Try different network

3. Check browser:

  • Try incognito mode
  • Try different browser
  • Clear cache and cookies (see Fix #3)
  • Check browser console for errors (F12)

4. Check network:

  • Disable VPN temporarily
  • Try mobile hotspot (bypass corporate network)
  • Check firewall rules
  • Try from different location

Issue: "403 Forbidden" or "Access Denied"

Troubleshoot:

1. Check workspace permissions:

  • Workspace admin may have removed your access
  • Contact workspace admin
  • Check email for access revocation notice

2. Check IP Access Lists:

  • Workspace → Settings → IP Access Lists
  • Your IP might be blocked
  • VPN might change your IP to blocked range

3. Check SSO/SAML:

  • Corporate SSO might be down
  • Re-authenticate via SSO portal
  • Contact IT if persistent

4. Check user status:

  • User account might be disabled
  • Check with workspace admin

Issue: "Slow Workspace Performance"

Causes:

  • Too many notebooks/jobs open
  • Large result sets loading
  • Browser memory exhaustion
  • Network latency

Fixes:

1. Close unused notebooks:

  • File → Close other notebooks
  • Detach from clusters when not in use

2. Limit result display:

# Don't display huge DataFrames
# Instead of:
display(df)

# Use:
display(df.limit(100))

3. Clear output:

  • Cell menu → Clear All Outputs
  • Reduces page memory

4. Use dedicated browser:

  • Use separate browser profile for Databricks
  • Avoid 50+ tabs in same browser

Databricks Clusters Not Starting?

Issue: "Cluster Stuck on 'Pending'"

Troubleshoot:

1. Check cloud provider capacity:

  • No available VMs in region/zone
  • Try different instance type
  • Try different availability zone
  • Use on-demand instead of spot

2. Check cluster event log:

  • Compute → Cluster → Event Log
  • Look for error messages
  • Common: "Cannot launch instances", "Insufficient capacity"

3. Check quotas:

  • See Fix #2 (Check Cloud Provider Quotas)
  • Request quota increase if needed

4. Wait and retry:

  • Capacity issues often transient
  • Wait 10-15 minutes
  • Terminate and restart cluster

Issue: "Cluster Starts Then Immediately Terminates"

Troubleshoot:

1. Check init scripts:

  • Init script failure causes cluster termination
  • Edit cluster → Init Scripts → Remove temporarily
  • Test if cluster starts without init scripts
  • Fix init script errors

2. Check cluster policy:

  • Policy restrictions preventing cluster launch?
  • Contact workspace admin
  • Try cluster without policy

3. Check driver logs:

  • Compute → Cluster → Driver Logs
  • Look for startup errors
  • Common: Library conflicts, configuration errors

4. Check instance profile/service principal:

  • Invalid credentials cause startup failure
  • Test credentials separately
  • Update workspace credentials if needed

Issue: "Cluster Running But Notebooks Won't Execute"

Troubleshoot:

1. Detach and reattach notebook:

  • Notebook → Cluster dropdown → Detach
  • Wait 10 seconds
  • Reattach to cluster

2. Check cluster status:

  • Green = Running
  • Gray = Stopped
  • Orange = Starting/Restarting
  • Red = Failed

3. Check notebook language:

  • Notebook language must match cluster
  • SQL notebooks need SQL-compatible cluster
  • Python notebooks work on all clusters

4. Test with simple command:

# Test if cluster responding
print("Hello from cluster!")

If timeout:

  • Cluster may be overloaded
  • Check Spark UI → Executors → Active tasks
  • Restart cluster if needed

Databricks Jobs Not Running?

Issue: "Job Stuck in 'Pending' State"

Troubleshoot:

1. Check job queue:

  • Workflows → Job runs
  • Look for many pending runs
  • Max concurrent runs limit reached?

2. Check cluster availability:

  • If using cluster pool, pool might be empty
  • If using existing cluster, cluster might be stopped
  • Try "Run Now" manually to test

3. Check permissions:

  • User must have "Can Manage Run" permission
  • Check job → Permissions tab
  • Contact job owner if needed

4. Check job schedule:

  • Edit job → Schedule
  • Verify the schedule is enabled
  • Check whether the job has been paused

Issue: "Job Runs But Fails Immediately"

Troubleshoot:

1. Check job run output:

  • Workflows → Job → Latest run → View run
  • Click failed task
  • Check error message and stack trace

2. Check notebook/script:

  • Syntax errors
  • Missing parameters
  • Broken dependencies
  • Test manually in notebook first

3. Check job parameters:

  • Edit job → Parameters
  • Verify parameter values correct
  • Especially file paths, credentials

4. Check cluster logs:

  • Click failed run → Cluster Logs
  • Look for startup or execution errors

Issue: "Job Runs Slower Than Expected"

Causes:

  • Undersized cluster
  • Data skew
  • Inefficient queries
  • Cold start (cluster creation time)

Fixes:

1. Use existing cluster:

  • Edit job → Cluster → Use existing cluster
  • Avoid cold start time
  • But: cluster must be running when job triggers

2. Use cluster pools:

  • Pre-warmed instances
  • Faster startup (30-60 seconds vs 3-5 minutes)
  • Edit job → Cluster → Pool

3. Optimize job:

  • Check Spark UI for bottlenecks
  • Reduce data shuffles
  • Add partitioning
  • Cache intermediate results

4. Scale up cluster:

  • Increase worker nodes
  • Use larger instance types
  • Enable autoscaling

Databricks SQL Warehouses Issues?

Issue: "SQL Warehouse Won't Start"

Troubleshoot:

1. Check warehouse size:

  • Larger warehouses take longer to start (1-3 minutes)
  • Wait patiently
  • Check status indicator

2. Check cloud quotas:

  • Same quota issues as clusters
  • See Fix #2 (Check Cloud Provider Quotas)

3. Check permissions:

  • User must have "Can Use" permission
  • SQL Warehouses → Warehouse → Permissions

4. Check workspace status:

  • Confirm the SQL Warehouses service is operational on status.databricks.com

Issue: "SQL Query Timeout"

Troubleshoot:

1. Check query complexity:

  • Large joins, aggregations take time
  • Break into smaller queries
  • Add filters to reduce data scanned

2. Increase warehouse size:

  • Edit warehouse → Cluster size
  • Larger = more query slots, faster execution
  • 2X-Large for heavy workloads

3. Check query queue:

  • SQL Warehouses → Query History
  • Too many concurrent queries?
  • Increase warehouse cluster size or concurrency

4. Optimize query:

-- Add filters to reduce data
SELECT * FROM large_table
WHERE date >= '2026-01-01'  -- Partition filter
LIMIT 1000

-- Use materialized views for common queries
CREATE MATERIALIZED VIEW my_view AS
SELECT ...

Issue: "SQL Dashboard Not Loading"

Troubleshoot:

1. Check warehouse status:

  • Dashboard queries need running warehouse
  • Start warehouse if stopped
  • Auto-stop might have terminated it

2. Check query refresh:

  • Dashboard → Refresh settings
  • Manual refresh vs auto-refresh
  • Long-running queries block dashboard

3. Check data permissions:

  • Unity Catalog permissions required
  • User must have SELECT on tables
  • Check with data owner

4. Check widget queries:

  • Dashboard → Edit → Check each widget
  • Individual widget query might be failing
  • Fix or disable problematic widgets

Unity Catalog Issues?

Issue: "Cannot Access Table: Unity Catalog Error"

Troubleshoot:

1. Check catalog permissions:

-- Check grants on catalog
SHOW GRANTS ON CATALOG your_catalog;

-- Check grants on schema
SHOW GRANTS ON SCHEMA your_catalog.your_schema;

-- Check grants on table
SHOW GRANTS ON TABLE your_catalog.your_schema.your_table;

2. Request access:

  • Contact data owner
  • Use "Request Access" button in Catalog Explorer
  • Workspace admin can grant permissions

3. Check catalog exists:

-- List available catalogs
SHOW CATALOGS;

-- List schemas in catalog
SHOW SCHEMAS IN your_catalog;

4. Check table path:

  • Unity Catalog uses three-level namespace
  • Format: catalog.schema.table
  • Check for typos

Issue: "Metastore Connection Failed"

Troubleshoot:

1. Check workspace metastore assignment:

  • Workspace Settings → Unity Catalog
  • Verify metastore assigned to workspace
  • Contact workspace admin if not assigned

2. Check network connectivity:

  • Metastore in different region?
  • Network rules blocking connection?
  • Check VPC/VNet peering if using private connectivity

3. Check metastore status:

  • Account Console → Metastores
  • Check if metastore healthy
  • Look for error messages

Delta Lake Issues?

Issue: "Delta Table Not Found"

Troubleshoot:

1. Check table path:

# Verify path exists
dbutils.fs.ls("dbfs:/path/to/delta/table")

# Or for Unity Catalog
spark.sql("DESCRIBE TABLE your_catalog.your_schema.your_table")

2. Check table registration:

-- List tables in schema
SHOW TABLES IN your_schema;

-- Register external Delta table
CREATE TABLE your_table
USING DELTA
LOCATION '/path/to/delta/table';

3. Check permissions:

  • Read permissions on storage location
  • Unity Catalog permissions if using UC
  • Check with workspace admin

Issue: "Delta Transaction Failed"

Troubleshoot:

1. Retry operation:

  • Delta handles most conflicts automatically
  • Simply retry the operation

2. Check concurrent writes:

  • Multiple jobs writing same table?
  • Use merge operations instead of inserts
  • Add transaction isolation

3. Run table maintenance:

-- Optimize table
OPTIMIZE your_table;

-- Vacuum old files (default 7 day retention)
VACUUM your_table RETAIN 168 HOURS;

-- Check table history
DESCRIBE HISTORY your_table;

Regional Outages: Is It Just Me?

Databricks deploys across multiple cloud regions:

| Cloud Provider | Common Regions |
| --- | --- |
| AWS | us-east-1, us-west-2, eu-west-1, ap-southeast-1 |
| Azure | East US, West Europe, Southeast Asia, UK South |
| GCP | us-central1, europe-west1, asia-southeast1 |

How to check for regional issues:

1. Check DownDetector:
🔗 downdetector.com/status/databricks

Shows:

  • Real-time outage reports
  • Heatmap of affected regions
  • Spike in reports = likely real outage

2. Check cloud provider status:

  • AWS outage might affect only us-east-1
  • Azure issue might affect only one region
  • GCP regional issues isolated

3. Check Databricks status by region:

  • status.databricks.com breaks status out by cloud provider and region; check yours specifically

4. Test from different region:

  • If available, try workspace in different region
  • Isolates if issue is regional vs global

When Databricks Actually Goes Down

What Happens

Recent major outages:

  • October 2023: 4-hour AWS us-east-1 control plane outage
  • July 2023: 2-hour authentication service disruption (all clouds)
  • March 2023: 6-hour Azure East US regional outage

Typical causes:

  1. Cloud provider outages (AWS/Azure/GCP failures)
  2. Control plane authentication issues
  3. Network connectivity problems
  4. Database backend failures
  5. Deployment issues (rare)

How Databricks Responds

Communication channels:

  • Status page: status.databricks.com (primary source for incident updates)
  • Twitter/X: @databricks
  • Support tickets for enterprise customers

Timeline:

  1. 0-15 min: Users report issues on Twitter/DownDetector
  2. 15-30 min: Databricks acknowledges on status page
  3. 30-120 min: Updates posted every 30 min
  4. Resolution: Usually 1-4 hours for major outages

What to Do During Outages

1. Check if data plane still works:

  • Running clusters may continue working
  • Jobs may complete even if UI is down
  • Check via CLI: databricks clusters list

2. Use backup compute:

  • AWS EMR for Spark workloads
  • Azure HDInsight or Synapse
  • GCP Dataproc
  • Run critical jobs elsewhere temporarily

3. Monitor status page:

  • Watch status.databricks.com and subscribe to email/SMS updates for resolution progress

4. Document impact:

  • Note affected jobs/workflows
  • Capture error messages
  • Will help with root cause analysis later

5. Prepare for recovery:

  • Have restart procedures ready
  • Check data consistency after outage
  • Re-run failed jobs when service restored

Databricks Down Checklist

Follow these steps in order:

Step 1: Verify it's actually down

  • Check Databricks Status
  • Check API Status Check
  • Check cloud provider status (AWS/Azure/GCP)
  • Search Twitter: "Databricks down"
  • Test REST API: curl workspace API endpoint
  • Try different browser/incognito mode

Step 2: Quick fixes (if Databricks is up)

  • Restart cluster
  • Clear browser cache and cookies
  • Check cluster event logs
  • Verify network connectivity
  • Check cloud provider quotas
  • Update workspace credentials

Step 3: Cluster troubleshooting

  • Check cluster configuration (instance type, size)
  • Verify cloud provider capacity available
  • Check init scripts (disable temporarily)
  • Review driver logs for errors
  • Test with default cluster config
  • Check cluster policy restrictions

Step 4: Network troubleshooting

  • Test outbound connectivity from cluster
  • Verify VPC/VNet configuration
  • Check security group / NSG rules
  • Verify NAT gateway / internet gateway
  • Check storage credentials (S3/ADLS/GCS)
  • Test with different network/VPN

Step 5: Job/workflow troubleshooting

  • Check job run history and error messages
  • Test notebook manually first
  • Verify job parameters correct
  • Check job permissions
  • Review cluster logs for failed run
  • Test with simpler job configuration

Step 6: Nuclear option

  • Create new cluster with default config
  • Re-import notebook from revision history
  • Contact Databricks support: databricks.com/support
  • Open ticket with cloud provider if quota/capacity issue

Prevent Future Issues

1. Set Up Proactive Monitoring

Monitor Databricks status:

  • Subscribe to updates on status.databricks.com (email/SMS)
  • Add an uptime check against your workspace URL

Monitor your workloads:

# Add health checks to critical notebooks
try:
    # Your data pipeline code
    df = spark.read.table("my_table")
except Exception as e:
    # Exit with a failure message so the job scheduler can alert on it
    dbutils.notebook.exit(f"FAILED: {str(e)}")

Monitor cluster health:

  • Set up alerts for cluster failures
  • Monitor job success rates
  • Track cluster startup times (increasing = potential issues)
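
A minimal polling sketch for cluster health using the Clusters API (workspace URL and token are placeholders; wire the result into your alerting tool of choice):

# Flag clusters that terminated with an error rather than a clean shutdown
import requests

def unhealthy_clusters(workspace_url, token):
    resp = requests.get(
        f"{workspace_url}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return [
        c["cluster_id"] for c in resp.json().get("clusters", [])
        if c.get("state") == "TERMINATED"
        and c.get("termination_reason", {}).get("type") != "SUCCESS"
    ]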

2. Use Cluster Pools

Why cluster pools help:

  • Pre-warmed instances
  • Faster startup (30-60 seconds vs 3-5 minutes)
  • Guaranteed capacity
  • Consistent environment

Create cluster pool:

  1. Compute → Pools → Create Pool
  2. Set min/max idle instances
  3. Choose instance type
  4. Use pool for interactive clusters and jobs

Pro tip: Size pool based on peak demand. Keep 2-3 idle instances ready.
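
Pool creation is also scriptable; a sketch against the Instance Pools API (pool name, instance type, and sizes are illustrative):

# Create a pool with two pre-warmed idle instances
import requests

resp = requests.post(
    "https://<workspace-url>/api/2.0/instance-pools/create",
    headers={"Authorization": "Bearer <your-token>"},
    json={
        "instance_pool_name": "warm-pool",
        "node_type_id": "i3.xlarge",
        "min_idle_instances": 2,
        "max_capacity": 20,
        "idle_instance_autotermination_minutes": 60,
    },
    timeout=30,
)
print(resp.json().get("instance_pool_id"))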


3. Build Redundancy

For critical pipelines:

Multi-region strategy:

  • Deploy workspaces in multiple regions
  • Failover to backup region during outage
  • Use cross-region storage replication

Retry logic:

# Requires: pip install retry
from retry import retry

@retry(tries=3, delay=60)
def run_critical_job():
    # Your job code; retried up to 3 times with a 60-second pause between attempts
    spark.sql("INSERT INTO target_table SELECT * FROM source_table")

Backup compute:

  • Keep alternate compute ready (EMR, HDInsight, Dataproc)
  • Document failover procedures
  • Test failover quarterly

4. Optimize Cluster Configuration

Right-size clusters:

  • Start small, scale up as needed
  • Use autoscaling for variable workloads
  • Don't over-provision (wastes cost)

Best practices:

  • Use cluster pools for fast startup
  • Set appropriate auto-termination (30-60 min idle)
  • Use spot/preemptible for non-critical workloads
  • Pin library versions for consistency

Test configurations:

  • Test new configs on dev cluster first
  • Gradually roll out to production
  • Monitor performance metrics

5. Implement Job Orchestration Best Practices

Job dependencies:

  • Use Databricks Workflows for orchestration
  • Set proper task dependencies
  • Add retry policies (3 retries with exponential backoff)

Job monitoring:

# Send notifications on job completion
import json

dbutils.notebook.exit(json.dumps({
    "status": "SUCCESS",
    "rows_processed": row_count,
    "duration_seconds": duration
}))

Failure handling:

  • Set up alerts for job failures
  • Use dead letter queues for failed records
  • Log detailed error messages

6. Maintain Cloud Provider Health

Monitor quotas:

  • Set up alerts for quota usage (80% threshold)
  • Request quota increases proactively
  • Keep buffer for burst capacity

Track cloud provider status:

  • Subscribe to AWS/Azure/GCP status pages
  • Monitor your specific regions
  • Note cloud provider maintenance windows

Resource management:

  • Clean up unused clusters/pools
  • Delete old job runs (retention policy)
  • Archive unused notebooks/data

7. Keep Credentials Updated

Regular credential rotation:

  • Rotate personal access tokens quarterly
  • Update service principal credentials before expiration
  • Test credentials after rotation

Credential management:

  • Use Databricks Secrets for sensitive data
  • Avoid hardcoding credentials
  • Use service principals for automation

# Use Databricks Secrets
secret = dbutils.secrets.get(scope="my_scope", key="api_key")

8. Document Your Setup

Critical documentation:

  • Cluster configurations (save as JSON)
  • Job configurations and dependencies
  • Network architecture (VPC/VNet setup)
  • Credential management procedures
  • Incident response runbooks

Keep updated:

  • Review docs quarterly
  • Update after any config changes
  • Share with team members

Key Takeaways

Before assuming Databricks is down:

  1. ✅ Check Databricks Status
  2. ✅ Check cloud provider status (AWS/Azure/GCP)
  3. ✅ Test REST API with curl
  4. ✅ Search Twitter for "Databricks down"
  5. ✅ Check cluster event logs and driver logs

Common fixes:

  • Restart cluster (fixes a surprising share of issues)
  • Check cloud provider quotas (capacity limits)
  • Clear browser cache and cookies
  • Verify network configuration (VPC/VNet)
  • Update storage credentials
  • Adjust cluster configuration (instance type, size)

Cluster issues:

  • Check event logs for startup failures
  • Verify cloud provider capacity available
  • Use on-demand instances instead of spot
  • Remove init scripts temporarily to test
  • Check cluster policy restrictions

Job/workflow issues:

  • Test notebooks manually first
  • Check job run history for error details
  • Verify job parameters and permissions
  • Review cluster logs for failed runs
  • Add retry logic and monitoring

SQL Warehouse issues:

  • Wait for startup (1-3 minutes)
  • Check warehouse permissions
  • Increase warehouse size for heavy workloads
  • Optimize slow queries

Unity Catalog issues:

  • Check three-level namespace (catalog.schema.table)
  • Verify permissions with SHOW GRANTS
  • Request access from data owner

If Databricks is actually down:

  • Monitor status.databricks.com
  • Running clusters may continue working
  • Use backup compute for critical jobs
  • Usually resolved within 1-4 hours

Prevent future issues:

  • Use cluster pools for fast, reliable startup
  • Build retry logic into critical jobs
  • Monitor proactively with alerts
  • Keep cloud quotas sized appropriately
  • Document configurations and procedures
  • Test failover scenarios regularly

Remember: Most "Databricks down" issues are actually cluster configuration, cloud provider quotas, network setup, or permissions problems. Try the fixes in this guide before assuming Databricks is down.


Need real-time Databricks status monitoring? Track Databricks uptime with API Status Check and get instant alerts when Databricks goes down.


Related Resources

Monitor Your APIs

Check the real-time status of 100+ popular APIs used by developers.

View API Status β†’