Is Milvus Down? How to Check Milvus / Zilliz Cloud Status in 2026

Q: How do I know if Milvus is down?

Check Milvus status by: 1) Visiting the Zilliz Cloud status page if you use the managed service, 2) Running a health check against your self-hosted Milvus cluster with the Milvus SDK, 3) Checking pod/container health if self-hosted on Kubernetes, or 4) Searching "Milvus down" or "Zilliz down" on X/Twitter and GitHub issues.

Q: Why does Milvus go down?

Milvus outages are typically caused by etcd or object storage backend failures (Milvus depends on both for metadata and segment storage), query node memory pressure from oversized indexes, coordinator node crashes, or — for self-hosted deployments — Kubernetes resource exhaustion.

Q: What should I do when Milvus is down?

When Milvus is down: check the health of dependent services (etcd, MinIO/S3, Pulsar/Kafka if used), restart unhealthy coordinator or query node pods, and if the outage is prolonged, route read traffic to a fallback vector store or cached embeddings while you resolve the underlying issue.

Milvus is one of the most widely deployed open-source vector databases, powering similarity search for RAG pipelines, recommendation systems, and image/video search at scale. Teams run it either self-hosted on Kubernetes or through the managed Zilliz Cloud service. Because Milvus separates compute (query nodes, coordinators) from storage (etcd for metadata, object storage for vector segments), an outage can originate in several different layers, not just the database itself.

Whether you're seeing connection timeouts, failed collection loads, or query errors, this guide will help you determine: is Milvus down entirely, or is a specific dependency — like etcd or object storage — the actual culprit?

How to Check if Milvus is Down (Fastest Methods)

1. Check Zilliz Cloud Status (If Using Managed Milvus)

If you run Milvus through Zilliz Cloud, check their status page for incidents scoped to your cluster's region before you start debugging your own client code or network.

2. Run a Minimal Health Check

Test connectivity with a lightweight SDK call:

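from pymilvus import connections, utility

# Replace the placeholders with your endpoint and credentials.
connections.connect(uri="YOUR_MILVUS_URI", token="YOUR_TOKEN")
# Listing collections is a cheap metadata call that proves the endpoint responds.
print(utility.list_collections())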

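A successful response confirms that the proxy and the metadata path behind it are reachable. It does not exercise query nodes, so a cluster can pass this check and still fail searches. A connection timeout or UnexpectedError points to a platform or networking issue.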
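
To tell a slow response apart from a hard failure, the same call can be wrapped in a small timing helper. This is a minimal sketch: the helper name probe and the 2-second threshold are assumptions chosen for illustration, not part of pymilvus or a Milvus default.

```python
import time
from typing import Callable, Dict


def probe(call: Callable[[], object], slow_after_s: float = 2.0) -> Dict[str, object]:
    """Run one cheap call and classify the result as up, degraded, or down.

    `call` is any zero-argument function that talks to Milvus, for example
    `utility.list_collections` after `connections.connect(...)`.
    The 2-second threshold is an illustrative assumption.
    """
    started = time.monotonic()
    try:
        call()
    except Exception as exc:  # SDK errors and network errors both count as down
        return {
            "status": "down",
            "latency_s": time.monotonic() - started,
            "error": f"{type(exc).__name__}: {exc}",
        }
    latency = time.monotonic() - started
    status = "degraded" if latency > slow_after_s else "up"
    return {"status": status, "latency_s": latency, "error": None}
```

With pymilvus installed and a connection open, probe(utility.list_collections) would return a dictionary whose status field can feed a dashboard or alert rule.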

3. Check Dependent Services (etcd, Object Storage)

Self-hosted Milvus depends on etcd for metadata and MinIO/S3 for segment storage. A failure in either dependency cascades into search failures even if the Milvus process itself is running. Check etcd cluster health and object storage connectivity directly.
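
One way to check both dependencies without going through Milvus is to call their HTTP health endpoints. The sketch below assumes etcd's /health endpoint on its client port and MinIO's /minio/health/live liveness endpoint; the URLs you pass in are placeholders for your own deployment, and S3 or other object stores would need a different check.

```python
import json
import urllib.error
import urllib.request


def etcd_body_is_healthy(body: bytes) -> bool:
    """Interpret the JSON returned by etcd's /health endpoint.

    etcd reports {"health": "true"}; a boolean true is accepted as well
    in case something in between normalizes the value.
    """
    try:
        value = json.loads(body).get("health")
    except (ValueError, AttributeError):
        return False
    return value is True or value == "true"


def http_get(url: str, timeout_s: float = 3.0):
    """Return (status_code, body), or (None, b"") if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        return exc.code, b""
    except (urllib.error.URLError, OSError):
        return None, b""


def check_dependencies(etcd_url: str, minio_url: str) -> dict:
    """Probe etcd and MinIO directly, independent of Milvus."""
    etcd_status, etcd_body = http_get(etcd_url.rstrip("/") + "/health")
    minio_status, _ = http_get(minio_url.rstrip("/") + "/minio/health/live")
    return {
        "etcd": etcd_status == 200 and etcd_body_is_healthy(etcd_body),
        "object_storage": minio_status == 200,
    }
```

If either value comes back False while the Milvus pods are running, the dependency is the more likely culprit.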

4. Inspect Kubernetes Pod Health

If self-hosted on Kubernetes, check for crashing components:

kubectl get pods -n milvus
kubectl logs -n milvus <querynode-pod-name> --tail=100

Look for CrashLoopBackOff on query node, data node, or coordinator pods, and check logs for out-of-memory kills, which are the most common cause of query node crashes.
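
The same inspection can be scripted by parsing kubectl's JSON output. This is a rough sketch that assumes the milvus namespace used above and kubectl on your PATH; the function names are hypothetical.

```python
import json
import subprocess


def find_unhealthy(pod_list: dict) -> list:
    """Return (pod, container, reason, restarts) tuples for crashing containers.

    `pod_list` is the parsed output of `kubectl get pods -o json`.
    """
    findings = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unknown>")
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = (cs.get("state") or {}).get("waiting") or {}
            last_term = (cs.get("lastState") or {}).get("terminated") or {}
            reasons = {waiting.get("reason"), last_term.get("reason")}
            for reason in ("CrashLoopBackOff", "OOMKilled"):
                if reason in reasons:
                    findings.append(
                        (name, cs.get("name"), reason, cs.get("restartCount", 0))
                    )
    return findings


def load_pods(namespace: str = "milvus") -> dict:
    """Shell out to kubectl; requires kubectl on PATH and cluster access."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Calling find_unhealthy(load_pods()) lists each container that is in CrashLoopBackOff or was last terminated by an OOM kill, along with its restart count.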

5. Use API Status Check for Automated Monitoring

For production RAG pipelines, API Status Check monitors your Milvus or Zilliz Cloud endpoint every 30 seconds and sends instant alerts via Slack, email, or PagerDuty.

Why Does Milvus Go Down?

Milvus's distributed, multi-component architecture creates several distinct failure points:

etcd Metadata Store Failures: Milvus stores collection schemas, index metadata, and cluster state in etcd. If etcd becomes unavailable or its disk fills up, Milvus can lose the ability to load collections or serve queries.
Object Storage Backend Issues: Vector segments are stored in MinIO, S3, or another object store. Latency or availability problems there directly translate into slow or failed searches.
Query Node Memory Pressure: Large indexes loaded entirely into memory can trigger OOM kills on query nodes, especially after adding new collections without scaling resources accordingly.
Coordinator Node Crashes: Milvus's coordinator components (root coord, data coord, query coord) manage cluster state. A coordinator crash can stall collection loads and index builds even if data nodes are healthy.
Kubernetes Resource Exhaustion: For self-hosted clusters, node-level CPU, memory, or disk exhaustion on the underlying Kubernetes nodes can cascade into Milvus component failures.
Message Queue Backpressure: Cluster deployments of Milvus commonly use Pulsar or Kafka as the log broker for write-ahead logging. Backpressure or outages in the message queue layer can block inserts and delay index updates.

Common Milvus Error Signals and What They Mean

Connection refused / timeout

The proxy or coordinator endpoint is unreachable. Check network policies, load balancer health, and whether coordinator pods are running.

CollectionNotLoaded

The collection exists but hasn't been loaded into query nodes, often because a query node crashed mid-load. Reissue the load call after confirming query node health.

OOMKilled on query node pods

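The query node ran out of memory holding indexes in RAM. Raise query node memory limits, or add query nodes so loaded segments are spread across more memory. Adding replicas does not help here, because each replica loads its own copy of the data.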

etcd server has no leader

The etcd cluster backing Milvus metadata has lost quorum. Milvus cannot reliably serve or update metadata until etcd recovers.

Slow search latency with no errors

Object storage or message queue latency is degrading performance without causing outright failures. Check MinIO/S3 and Pulsar/Kafka metrics.
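
The signals above can be turned into a small lookup for first-pass triage of log lines or client errors. This is only a sketch: the substrings, layer names, and the triage function are illustrative choices based on the list above, not an official Milvus error catalog.

```python
from typing import Optional, Tuple

# (substring, suspected layer, first action). Matching is case-insensitive
# and the first matching entry wins, so more specific signals come first.
SIGNALS = [
    ("etcd server has no leader", "etcd",
     "restore etcd quorum before changing anything in Milvus"),
    ("oomkilled", "query node",
     "raise memory limits or add query nodes"),
    ("collectionnotloaded", "query node",
     "confirm query node health, then reissue the load call"),
    ("collection not loaded", "query node",
     "confirm query node health, then reissue the load call"),
    ("connection refused", "proxy or network",
     "check network policies, load balancer health, and coordinator pods"),
    ("timeout", "proxy or network",
     "check network policies, load balancer health, and coordinator pods"),
]


def triage(message: str) -> Optional[Tuple[str, str]]:
    """Map an error message to (suspected layer, first action), or None."""
    text = message.lower()
    for needle, layer, action in SIGNALS:
        if needle in text:
            return layer, action
    return None
```

A None result with rising latency is itself a hint: it matches the "slow search latency with no errors" case, where object storage and message queue metrics are the next place to look.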

What to Do When Milvus Is Down

Isolate the failing layer: Distinguish between a Milvus process issue, an etcd metadata problem, and an object storage outage before making changes — they require different fixes.
Restart unhealthy pods: For self-hosted clusters, restarting crashed query node or coordinator pods often restores service quickly once the underlying resource issue is resolved.
Check dependency health directly: Verify etcd quorum and object storage connectivity independently of Milvus — many "Milvus outages" are actually dependency failures.
Fail over reads if possible: For read-heavy RAG applications, temporarily route retrieval to a cached result set or a secondary vector store while you resolve the primary cluster.
Set up automated monitoring: Configure API Status Check to ping your Milvus or Zilliz Cloud endpoint and alert you within 30 seconds of any downtime.
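
The read failover step can be sketched as a thin wrapper that stops calling the primary after repeated failures and retries it after a cooldown. The class name, the thresholds, and the primary and fallback callables are all hypothetical; wire them to your own Milvus client and to whatever cache or secondary store you use.

```python
import time
from typing import Callable, List, Sequence


class FallbackSearcher:
    """Route reads to a fallback after repeated primary failures."""

    def __init__(
        self,
        primary: Callable[[Sequence[float]], List],
        fallback: Callable[[Sequence[float]], List],
        max_failures: int = 3,       # illustrative default
        cooldown_s: float = 30.0,    # illustrative default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._primary = primary
        self._fallback = fallback
        self._max_failures = max_failures
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._open_until = 0.0

    def search(self, vector: Sequence[float]) -> List:
        now = self._clock()
        if now >= self._open_until:
            try:
                hits = self._primary(vector)
                self._failures = 0
                return hits
            except Exception:
                self._failures += 1
                if self._failures >= self._max_failures:
                    # Stop calling the primary until the cooldown has passed.
                    self._open_until = now + self._cooldown_s
                    self._failures = 0
        return self._fallback(vector)
```

Results from the fallback may be stale or lower quality, so it is worth logging which path served each request while the primary cluster is being repaired.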

Milvus Alternatives When the Database is Down

These vector databases can serve as a temporary fallback during an extended Milvus or Zilliz Cloud incident:

Pinecone: Fully managed vector database with a simple API, useful as a fast fallback since there's no infrastructure to debug.
Qdrant: Open-source vector database with a simpler single-binary deployment model than Milvus's multi-component architecture.
Weaviate: Vector database with built-in hybrid search, available both self-hosted and as a managed cloud service.
Chroma: Lightweight, developer-friendly vector store good for smaller-scale or prototype fallback during an outage.
pgvector: If you already run PostgreSQL, pgvector can serve as an emergency fallback for smaller collections without standing up new infrastructure.

Frequently Asked Questions

How do I know if Milvus is down?

Check Zilliz Cloud's status page if using the managed service, run a minimal SDK health check, or inspect Kubernetes pod health if self-hosted. A connection timeout on a simple call points to a platform-level issue, while a CollectionNotLoaded error usually means a query node lost the collection and it needs to be reloaded.

Why does Milvus go down?

Common causes include etcd metadata store failures, object storage backend issues, query node memory pressure (OOM kills), coordinator crashes, and Kubernetes resource exhaustion on self-hosted clusters.

What should I do when Milvus is down?

Isolate which layer failed (Milvus itself, etcd, or object storage), restart unhealthy pods, verify dependency health directly, and temporarily fail over reads to a secondary vector store if the outage is prolonged.

How long do Milvus outages last?

Query node crashes often recover within minutes once Kubernetes reschedules the pod. etcd quorum loss or object storage outages can take much longer and may require manual intervention.

Can I monitor Milvus automatically?

Yes. API Status Check monitors Milvus and Zilliz Cloud endpoints continuously, alerting you via Slack, email, or PagerDuty the moment downtime is detected.
