Is Milvus Down? How to Check Milvus / Zilliz Cloud Status in 2026
Complete guide to verifying Milvus and Zilliz Cloud outages and keeping your RAG or semantic search pipeline running when vector search fails.
Milvus is one of the most widely deployed open-source vector databases, powering similarity search for RAG pipelines, recommendation systems, and image/video search at scale. Teams run it either self-hosted on Kubernetes or through the managed Zilliz Cloud service. Because Milvus separates compute (query nodes, coordinators) from storage (etcd for metadata, object storage for vector segments), an outage can originate in several different layers, not just the database itself.
Whether you're seeing connection timeouts, failed collection loads, or query errors, this guide will help you determine: is Milvus down entirely, or is a specific dependency — like etcd or object storage — the actual culprit?
How to Check if Milvus is Down (Fastest Methods)
1. Check Zilliz Cloud Status (If Using Managed Milvus)
If you run Milvus through Zilliz Cloud, check their status page for incidents scoped to your cluster's region before you start debugging your own application or network.
2. Run a Minimal Health Check
Test connectivity with a lightweight SDK call:
    from pymilvus import connections, utility

    connections.connect(uri="YOUR_MILVUS_URI", token="YOUR_TOKEN")
    print(utility.list_collections())
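A successful response confirms that the proxy and the coordinator serving metadata are reachable. It does not exercise the query nodes, so searches can still fail after this check passes. A connection timeout or an UnexpectedError points to a platform or networking issue.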
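The same check can be wrapped so that it reports latency and the exception type instead of raising. This is a minimal sketch, not an official pymilvus utility: `ProbeResult`, `run_probe`, and `milvus_probe` are hypothetical names, the `healthcheck` alias is arbitrary, and the 5-second timeout is an assumed default.

```python
import time
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    latency_s: float
    detail: str


def run_probe(probe: Callable[[], object]) -> ProbeResult:
    """Run any zero-argument probe and report success, latency, and detail."""
    start = time.monotonic()
    try:
        value = probe()
    except Exception as exc:  # report the failure instead of raising
        return ProbeResult(False, time.monotonic() - start, f"{type(exc).__name__}: {exc}")
    return ProbeResult(True, time.monotonic() - start, repr(value))


def milvus_probe(uri: str, token: str, timeout: float = 5.0) -> Callable[[], object]:
    """Build a probe that lists collections. Requires pymilvus at call time."""
    def _probe() -> object:
        from pymilvus import connections, utility  # imported lazily

        connections.connect(alias="healthcheck", uri=uri, token=token, timeout=timeout)
        try:
            return utility.list_collections(timeout=timeout, using="healthcheck")
        finally:
            connections.disconnect("healthcheck")
    return _probe
```

Calling `run_probe(milvus_probe("YOUR_MILVUS_URI", "YOUR_TOKEN"))` returns a `ProbeResult` whose `detail` field holds either the collection list or the exception that was raised.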
3. Check Dependent Services (etcd, Object Storage)
Self-hosted Milvus depends on etcd for metadata and MinIO/S3 for segment storage. A failure in either dependency cascades into search failures even if the Milvus process itself is running. Check etcd cluster health and object storage connectivity directly.
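Both dependencies expose HTTP health endpoints that can be probed without going through Milvus: etcd serves `/health` on its client port, and MinIO serves `/minio/health/live`. The sketch below uses only the standard library. The function names are hypothetical, and the base URLs in the usage note are assumptions based on default ports, so adjust them to your deployment. The MinIO check does not apply to S3 or other object stores.

```python
import json
import urllib.request
from typing import Callable, Tuple

Getter = Callable[[str], Tuple[int, bytes]]


def http_get(url: str, timeout: float = 3.0) -> Tuple[int, bytes]:
    """Return (status, body). urlopen raises on network errors and HTTP 4xx/5xx."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read()


def etcd_healthy(base_url: str, get: Getter = http_get) -> bool:
    """True if etcd's /health endpoint reports a healthy member."""
    try:
        status, body = get(f"{base_url}/health")
        health = json.loads(body).get("health")
    except (OSError, ValueError, AttributeError):
        return False
    # etcd reports this field as the string "true"; accept a boolean as well.
    return status == 200 and health in ("true", True)


def minio_live(base_url: str, get: Getter = http_get) -> bool:
    """True if MinIO's liveness endpoint answers with HTTP 200."""
    try:
        status, _ = get(f"{base_url}/minio/health/live")
    except OSError:
        return False
    return status == 200
```

For example, `etcd_healthy("http://etcd:2379")` and `minio_live("http://minio:9000")` (assumed service names and default ports) each return a plain boolean. The `get` parameter exists so the checks can be exercised without a live cluster.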
4. Inspect Kubernetes Pod Health
If self-hosted on Kubernetes, check for crashing components:
    kubectl get pods -n milvus
    kubectl logs -n milvus <querynode-pod-name> --tail=100
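Look for CrashLoopBackOff on query node, data node, or coordinator pods. Out-of-memory kills are a common cause of query node crashes. They usually appear as an OOMKilled reason in the pod's last state (visible with kubectl describe pod) rather than in the container's own logs.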
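To scan a whole namespace at once, the JSON form of `kubectl get pods` can be filtered for the two signals named above. This is a rough sketch with hypothetical function names. It assumes `kubectl` is on the PATH and that Milvus runs in a namespace called `milvus`.

```python
import json
import subprocess
from typing import List, Tuple


def unhealthy_containers(pod_list: dict) -> List[Tuple[str, str, str]]:
    """Return (pod, container, reason) for crash-looping or OOM-killed containers."""
    findings = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            last_term = cs.get("lastState", {}).get("terminated") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                findings.append((pod_name, cs["name"], "CrashLoopBackOff"))
            if last_term.get("reason") == "OOMKilled":
                findings.append((pod_name, cs["name"], "OOMKilled"))
    return findings


def get_pods(namespace: str = "milvus") -> dict:
    """Fetch the pod list as parsed JSON (namespace name is an assumption)."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Running `unhealthy_containers(get_pods())` returns an empty list when no container matches either condition.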
5. Use API Status Check for Automated Monitoring
For production RAG pipelines, API Status Check monitors your Milvus or Zilliz Cloud endpoint every 30 seconds and sends instant alerts via Slack, email, or PagerDuty.
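If you also run your own poller, alerting on a single failed probe tends to be noisy, so one common approach is to alert only after several consecutive failures. The sketch below illustrates that idea. It is not a description of how API Status Check works, and `FailureTracker` and its default threshold of 3 are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureTracker:
    threshold: int = 3      # consecutive failures required before alerting
    failures: int = 0
    alerting: bool = False

    def record(self, ok: bool) -> Optional[str]:
        """Return "down" or "recovered" on a state change, otherwise None."""
        if ok:
            self.failures = 0
            if self.alerting:
                self.alerting = False
                return "recovered"
            return None
        self.failures += 1
        if self.failures >= self.threshold and not self.alerting:
            self.alerting = True
            return "down"
        return None
```

Feed it the boolean outcome of each probe and send a notification only when `record` returns a non-None value.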
Why Does Milvus Go Down?
Milvus's distributed, multi-component architecture creates several distinct failure points:
- etcd Metadata Store Failures: Milvus stores collection schemas, index metadata, and cluster state in etcd. If etcd becomes unavailable or its disk fills up, Milvus can lose the ability to load collections or serve queries.
- Object Storage Backend Issues: Vector segments are stored in MinIO, S3, or another object store. Latency or availability problems there directly translate into slow or failed searches.
- Query Node Memory Pressure: Large indexes loaded entirely into memory can trigger OOM kills on query nodes, especially after adding new collections without scaling resources accordingly.
- Coordinator Node Crashes: Milvus's coordinator components (root coord, data coord, query coord) manage cluster state. A coordinator crash can stall collection loads and index builds even if data nodes are healthy.
- Kubernetes Resource Exhaustion: For self-hosted clusters, node-level CPU, memory, or disk exhaustion on the underlying Kubernetes nodes can cascade into Milvus component failures.
- Message Queue Backpressure: Milvus uses Pulsar or Kafka internally for write-ahead logging. Backpressure or outages in the message queue layer can block inserts and delay index updates.
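For the memory pressure case, a back-of-the-envelope estimate helps decide whether a collection can fit on a query node before loading it. Raw float32 vectors take 4 bytes per dimension. The `overhead_factor` below is an assumed placeholder for index and runtime overhead, which varies by index type and is not a Milvus-documented constant. Both function names are hypothetical.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vectors alone (float32 by default)."""
    return num_vectors * dim * bytes_per_value


def fits_in_memory(num_vectors: int, dim: int, node_memory_bytes: int,
                   overhead_factor: float = 1.5) -> bool:
    """Rough check; overhead_factor is an assumption, tune it for your index type."""
    return raw_vector_bytes(num_vectors, dim) * overhead_factor <= node_memory_bytes


# 10 million 768-dimensional float32 vectors:
print(raw_vector_bytes(10_000_000, 768))  # 30720000000 bytes, about 28.6 GiB
```

With the assumed 1.5x overhead, that collection would not fit on a single query node with a 32 GiB memory limit.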
Common Milvus Error Signals and What They Mean
- Connection refused / timeout: The proxy or coordinator endpoint is unreachable. Check network policies, load balancer health, and whether coordinator pods are running.
- CollectionNotLoaded: The collection exists but hasn't been loaded into query nodes, often because a query node crashed mid-load. Reissue the load call after confirming query node health.
- OOMKilled on query node pods: The query node ran out of memory holding indexes in RAM. Raise query node memory limits or add query nodes so that segments are spread across more of them.
- etcd server has no leader: The etcd cluster backing Milvus metadata has lost quorum. Milvus cannot reliably serve or update metadata until etcd recovers.
- Slow search latency with no errors: Object storage or message queue latency is degrading performance without causing outright failures. Check MinIO/S3 and Pulsar/Kafka metrics.
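The signals above can be turned into a small triage helper that maps an error message to the layer most likely at fault. The matching is deliberately naive substring matching on assumed message fragments. Real Milvus and etcd messages vary by version, so treat the patterns and the `classify` function as illustrative, not exhaustive.

```python
from typing import Tuple

# Each entry: (substrings to look for, likely layer, suggested first check).
SIGNALS = [
    (("no leader",), "etcd", "verify etcd quorum"),
    (("oomkilled",), "query node memory", "raise memory limits or add query nodes"),
    (("collectionnotloaded", "collection not loaded"), "collection load",
     "confirm query node health, then reissue the load call"),
    (("connection refused", "timeout", "timed out"), "network or proxy",
     "check network policies, load balancer health, and coordinator pods"),
]


def classify(message: str) -> Tuple[str, str]:
    """Return (likely layer, suggested first check) for an error message."""
    text = message.lower()
    for needles, layer, action in SIGNALS:
        if any(needle in text for needle in needles):
            return layer, action
    return "unknown", "check object storage and message queue metrics"
```

For instance, `classify("etcdserver: no leader")` returns the `etcd` layer along with the quorum check.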
What to Do When Milvus Is Down
- Isolate the failing layer: Distinguish between a Milvus process issue, an etcd metadata problem, and an object storage outage before making changes — they require different fixes.
- Restart unhealthy pods: For self-hosted clusters, restarting crashed query node or coordinator pods often restores service quickly once the underlying resource issue is resolved.
- Check dependency health directly: Verify etcd quorum and object storage connectivity independently of Milvus — many "Milvus outages" are actually dependency failures.
- Fail over reads if possible: For read-heavy RAG applications, temporarily route retrieval to a cached result set or a secondary vector store while you resolve the primary cluster.
- Set up automated monitoring: Configure API Status Check to ping your Milvus or Zilliz Cloud endpoint and alert you within 30 seconds of any downtime.
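The read failover step can be sketched as a thin wrapper around two search callables. Nothing here is Milvus-specific. `search_with_fallback` is a hypothetical helper, and `primary` and `fallback` stand in for whatever functions query your Milvus cluster and your cache or secondary store.

```python
import logging
from typing import Callable, List, Sequence, Tuple

Search = Callable[[Sequence[float], int], List]


def search_with_fallback(query: Sequence[float], top_k: int,
                         primary: Search, fallback: Search) -> Tuple[List, str]:
    """Try the primary store; on any exception, serve from the fallback.

    Returns (results, source) so callers can tell degraded answers apart.
    """
    try:
        return primary(query, top_k), "primary"
    except Exception as exc:
        logging.warning("primary vector search failed, using fallback: %s", exc)
        return fallback(query, top_k), "fallback"
```

Returning the source alongside the results lets the application label or rate-limit degraded responses while the primary cluster is being repaired.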
Milvus Alternatives When the Database is Down
These vector databases can serve as a temporary fallback during an extended Milvus or Zilliz Cloud incident:
- Pinecone: Fully managed vector database with a simple API, useful as a fast fallback since there's no infrastructure to debug.
- Qdrant: Open-source vector database with a simpler single-binary deployment model than Milvus's multi-component architecture.
- Weaviate: Vector database with built-in hybrid search, available both self-hosted and as a managed cloud service.
- Chroma: Lightweight, developer-friendly vector store good for smaller-scale or prototype fallback during an outage.
- pgvector: If you already run PostgreSQL, pgvector can serve as an emergency fallback for smaller collections without standing up new infrastructure.
Frequently Asked Questions
How do I know if Milvus is down?
Check Zilliz Cloud's status page if using the managed service, run a minimal SDK health check, or inspect Kubernetes pod health if self-hosted. A connection timeout or CollectionNotLoaded error on a simple query points to a platform-level issue.
Why does Milvus go down?
Common causes include etcd metadata store failures, object storage backend issues, query node memory pressure (OOM kills), coordinator crashes, and Kubernetes resource exhaustion on self-hosted clusters.
What should I do when Milvus is down?
Isolate which layer failed (Milvus itself, etcd, or object storage), restart unhealthy pods, verify dependency health directly, and temporarily fail over reads to a secondary vector store if the outage is prolonged.
How long do Milvus outages last?
Query node crashes often recover within minutes once Kubernetes reschedules the pod. etcd quorum loss or object storage outages can take much longer and may require manual intervention.
Can I monitor Milvus automatically?
Yes. API Status Check monitors Milvus and Zilliz Cloud endpoints continuously, alerting you via Slack, email, or PagerDuty the moment downtime is detected.