Is Weaviate Down? How to Check Weaviate Status in Real-Time

Quick Answer: To check if Weaviate is down, visit apistatuscheck.com/api/weaviate for real-time monitoring of Weaviate Cloud Services. For self-hosted instances, check your cluster health endpoint at http://your-instance:8080/v1/.well-known/ready. Common signs include schema creation failures, batch import timeouts, GraphQL query errors, and cluster synchronization issues.

When your vector database suddenly stops responding, your entire AI application grinds to a halt. Weaviate powers semantic search, RAG systems, and hybrid search applications for thousands of organizations worldwide. Whether you're running Weaviate Cloud Services (WCS) or a self-hosted cluster, knowing how to quickly diagnose connectivity issues, performance degradation, or complete outages is critical for maintaining your AI infrastructure uptime.

How to Check Weaviate Status in Real-Time

1. API Status Check (Fastest Method for WCS)

The quickest way to verify Weaviate Cloud Services operational status is through apistatuscheck.com/api/weaviate. This real-time monitoring service:

  • Tests actual Weaviate endpoints every 60 seconds
  • Monitors query response times and vector search latency
  • Tracks historical uptime over 30/60/90 days
  • Provides instant alerts when WCS issues are detected
  • Checks cluster health across multiple regions

Unlike status pages that rely on manual updates, API Status Check performs active health checks against Weaviate's production API endpoints, giving you the most accurate real-time picture of service availability for cloud-hosted instances.

2. Weaviate Cloud Services Status Page

Weaviate maintains an official status page at status.weaviate.io for their cloud service. The page displays:

  • Current operational status for WCS clusters
  • Active incidents and investigations
  • Scheduled maintenance windows
  • Historical incident reports
  • Regional availability (US, EU, Asia-Pacific)

Pro tip: Subscribe to status updates via email or RSS feed to receive immediate notifications when incidents affecting Weaviate Cloud Services occur.

3. Check Your Self-Hosted Cluster Health

For self-hosted Weaviate instances, use the built-in health check endpoints:

import requests

# Health check endpoint
response = requests.get('http://localhost:8080/v1/.well-known/ready')

if response.status_code == 200:
    print("✓ Weaviate is healthy and ready")
else:
    print(f"✗ Weaviate health check failed: {response.status_code}")

# Liveness check
liveness = requests.get('http://localhost:8080/v1/.well-known/live')
print(f"Liveness status: {liveness.status_code}")

# Cluster nodes status
meta = requests.get('http://localhost:8080/v1/meta')
print(f"Cluster meta: {meta.json()}")

Health check endpoints:

  • /v1/.well-known/ready - Returns 200 when instance is ready to accept traffic
  • /v1/.well-known/live - Returns 200 when instance is running (but may not be ready)
  • /v1/meta - Returns cluster metadata and node information
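The readiness/liveness distinction is easy to get wrong in automation; a small helper (the function name and return values are illustrative, not part of the Weaviate API) turns the two status codes into a single verdict:

```python
def classify_health(ready_code: int, live_code: int) -> str:
    """Interpret readiness + liveness status codes into one verdict."""
    if ready_code == 200:
        return "healthy"      # ready implies the process is also live
    if live_code == 200:
        return "starting"     # process is up, but dependencies or shards not ready yet
    return "down"

# Example: live but not yet ready (e.g. still loading shards)
print(classify_health(503, 200))
```

Wire this into routing logic: send traffic only on "healthy", and hold off (rather than restart) on "starting".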

4. Query Performance Testing

For deeper diagnostics, run a test vector search query:

import weaviate
from weaviate.classes.query import MetadataQuery

client = weaviate.connect_to_local()

try:
    # Simple test query with metrics
    collection = client.collections.get("YourCollection")
    
    response = collection.query.near_text(
        query="test query",
        limit=5,
        return_metadata=MetadataQuery(distance=True, certainty=True)
    )
    
    print(f"✓ Query successful, returned {len(response.objects)} results")
    
except Exception as e:
    print(f"✗ Query failed: {str(e)}")
    
finally:
    client.close()

5. Monitor Docker/Kubernetes Health

For containerized deployments, check container and pod health:

# Docker health check
docker ps | grep weaviate
docker logs weaviate --tail 50

# Kubernetes pod status
kubectl get pods -n weaviate
kubectl describe pod weaviate-0 -n weaviate
kubectl logs weaviate-0 -n weaviate --tail=100

# Check resource consumption
kubectl top pod weaviate-0 -n weaviate

Look for OOMKilled status, restart loops, or resource exhaustion indicators.
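These checks can be scripted: the sketch below (helper name and restart threshold are illustrative) parses `kubectl get pod ... -o json` output and flags OOM kills and restart loops:

```python
import json

def pod_red_flags(pod_json: str, restart_threshold: int = 5):
    """Return warning strings for a pod's container statuses."""
    pod = json.loads(pod_json)
    flags = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated", {})
        if terminated.get("reason") == "OOMKilled":
            flags.append(f"{cs['name']}: OOMKilled")
        if cs.get("restartCount", 0) >= restart_threshold:
            flags.append(f"{cs['name']}: restart loop ({cs['restartCount']} restarts)")
    return flags

# Feed it the output of: kubectl get pod weaviate-0 -n weaviate -o json
sample = '''{"status": {"containerStatuses": [
  {"name": "weaviate", "restartCount": 7,
   "lastState": {"terminated": {"reason": "OOMKilled"}}}]}}'''
print(pod_red_flags(sample))
```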

Common Weaviate Issues and How to Identify Them

Schema Creation Failures

Symptoms:

  • 422 Unprocessable Entity errors when creating classes
  • Schema validation errors about property types
  • "class already exists" errors despite not being visible
  • Timeout errors during schema operations

Example error:

from weaviate.classes.config import Property, DataType

try:
    client.collections.create(
        name="Article",
        properties=[
            Property(name="title", data_type=DataType.TEXT)
        ]
    )
except weaviate.exceptions.UnexpectedStatusCodeException as e:
    print(f"Schema creation failed: {e}")
    # Often indicates: cluster sync issues, version mismatch, or service degradation

Common causes:

  • Cluster nodes out of sync (multi-node setups)
  • Raft consensus failures in distributed deployments
  • Schema migration conflicts during version upgrades
  • Insufficient permissions (WCS authentication issues)

Diagnostic check:

# Verify existing schema
schema = client.collections.list_all()
print(f"Current collections: {list(schema.keys())}")

# Check for orphaned schemas
meta = client.get_meta()
print(f"Weaviate version: {meta['version']}")

Batch Import Timeouts

Symptoms:

  • Batch operations hanging indefinitely
  • ConnectionTimeout or ReadTimeout exceptions
  • Partial batches succeeding while others fail
  • Memory spike followed by OOM errors

Example scenario:

from weaviate.util import generate_uuid5

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="https://your-cluster.weaviate.network",
    auth_credentials=weaviate.auth.AuthApiKey("your-key")
)

collection = client.collections.get("Documents")

# This may timeout during Weaviate degradation
with collection.batch.dynamic() as batch:
    for i in range(10000):
        batch.add_object(
            properties={
                "title": f"Document {i}",
                "content": "Large text content here..." * 100
            },
            uuid=generate_uuid5(i)
        )

# Check for failures
if collection.batch.failed_objects:
    print(f"Failed objects: {len(collection.batch.failed_objects)}")
    for failed in collection.batch.failed_objects[:5]:
        print(f"Error: {failed.message}")

Root causes:

  • Cluster under heavy load (CPU/memory exhaustion)
  • Network connectivity issues between nodes
  • Vectorization module (transformers) timing out
  • Insufficient batch size configuration
  • Disk I/O bottlenecks

Mitigation strategies:

# Configure batch settings for reliability
with collection.batch.rate_limit(requests_per_minute=600) as batch:
    for item in data:
        batch.add_object(properties=item)
        
# Use smaller batch sizes during degradation
with collection.batch.fixed_size(batch_size=50) as batch:
    # Smaller batches = better error recovery
    pass

GraphQL Query Errors

Symptoms:

  • 500 Internal Server Error from GraphQL endpoint
  • Query parsing errors for valid syntax
  • Timeout on complex queries (filters, aggregations)
  • Inconsistent results between identical queries

Testing GraphQL health:

# Direct GraphQL query
query = """
{
    Get {
        Article(limit: 10) {
            title
            content
            _additional {
                id
                certainty
            }
        }
    }
}
"""

try:
    result = client.graphql_raw_query(query)
    articles = result.get.get("Article", [])
    print(f"Query successful: {len(articles)} results")
except Exception as e:
    print(f"GraphQL query failed: {str(e)}")
    # Indicates: backend service issues, schema corruption, or cluster problems

Common GraphQL errors during outages:

  • Cannot query field "X" on type "Y" - Schema synchronization issues
  • context deadline exceeded - Query timeout (backend overloaded)
  • connection refused - Complete service unavailability
  • invalid character '<' looking for beginning of value - Backend returning HTML error pages
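Because these messages are stable strings, first-pass triage can be automated; a rough classifier whose mapping mirrors the list above (function and dict names are illustrative):

```python
ERROR_HINTS = {
    "cannot query field": "schema synchronization issue",
    "context deadline exceeded": "query timeout - backend overloaded",
    "connection refused": "complete service unavailability",
    "invalid character '<'": "backend returning HTML error pages",
}

def triage_graphql_error(message: str) -> str:
    """Map a raw GraphQL error message to a likely cause."""
    lowered = message.lower()
    for needle, hint in ERROR_HINTS.items():
        if needle in lowered:
            return hint
    return "unknown - inspect server logs"

print(triage_graphql_error("Post http://weaviate:8080/v1/graphql: context deadline exceeded"))
```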

Advanced diagnostics:

# Test with different query complexities
simple_query = '{ Get { Article(limit: 1) { title } } }'
complex_query = '''
{
    Get {
        Article(
            nearText: { concepts: ["AI"] }
            where: { path: ["published"], operator: Equal, valueBoolean: true }
            limit: 100
        ) {
            title
            _additional {
                distance
                vector
            }
        }
    }
}
'''

# If simple succeeds but complex fails = performance degradation
# If both fail = complete outage

Cluster Health Issues

Symptoms in multi-node deployments:

  • Inconsistent query results across requests
  • Node availability fluctuating
  • Leader election failures (Raft logs)
  • Data replication lag

Cluster health check:

import requests

meta_response = requests.get('http://localhost:8080/v1/meta')
meta = meta_response.json()

print(f"Version: {meta['version']}")
print(f"Modules: {meta.get('modules', {})}")

# Check cluster node status (multi-node deployments)
nodes_response = requests.get(
    'http://localhost:8080/v1/nodes',
    headers={'Authorization': 'Bearer YOUR_TOKEN'}
)

if nodes_response.status_code == 200:
    nodes = nodes_response.json()
    for node in nodes.get('nodes', []):
        print(f"Node {node['name']}: {node['status']}")
        print(f"  Shards: {node.get('shards', [])}")
else:
    print("Cannot retrieve cluster node information")

Red flags in logs:

ERROR raft: failed to contact node-2
WARN  replication: shard sync failed, retrying...
ERROR storage: compaction failed: disk full
FATAL vectorizer: module not responding
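A minimal scanner for these patterns (the pattern list is taken straight from the red flags above) can run over `kubectl logs` or `docker logs` output:

```python
RED_FLAGS = ("raft:", "shard sync failed", "compaction failed", "module not responding")

def scan_logs(log_text: str):
    """Return log lines matching known red-flag patterns."""
    hits = []
    for line in log_text.splitlines():
        if any(flag in line for flag in RED_FLAGS):
            hits.append(line.strip())
    return hits

logs = """INFO  startup complete
ERROR raft: failed to contact node-2
WARN  replication: shard sync failed, retrying..."""
print(scan_logs(logs))
```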

Memory and Resource Limits

Symptoms:

  • Queries slowing down progressively
  • OOMKilled in container logs
  • Swap usage at 100%
  • Vector index rebuild failures

Resource monitoring:

import requests

# Check Weaviate metrics endpoint (if enabled)
metrics = requests.get('http://localhost:2112/metrics')

# Look for these indicators:
# - weaviate_object_count_total (growing unbounded?)
# - weaviate_vector_index_size (memory pressure)
# - weaviate_batch_durations_seconds (increasing latency)

# Alternative: inspect the meta endpoint for version and module health
meta = client.get_meta()
print(f"Weaviate version: {meta.get('version', 'unknown')}")

Common resource issues:

  • Heap exhaustion: OOM errors, crash loops. Fix: increase GOMEMLIMIT, scale vertically
  • Vector index overflow: slow queries, high CPU. Fix: enable compression (PQ/SQ)
  • Disk full: write failures, compaction errors. Fix: increase volume size, clean up old data
  • CPU throttling: query timeouts, batch delays. Fix: scale horizontally, optimize queries
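For heap exhaustion specifically, the Go runtime's GOMEMLIMIT variable applies to Weaviate (a Go process) and can be set through the container environment. This is a config fragment with illustrative sizes; keep the heap cap below the container memory limit:

```shell
# Leave headroom between the Go heap cap and the container limit (illustrative sizes)
docker run -d \
  -p 8080:8080 \
  -e GOMEMLIMIT=4GiB \
  --memory=6g \
  semitechnologies/weaviate:latest
```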

The Real Impact When Weaviate Goes Down

RAG Systems Completely Broken

Modern Retrieval-Augmented Generation (RAG) applications depend entirely on vector database availability:

  • ChatGPT-style interfaces: Cannot retrieve relevant context for answers
  • Documentation assistants: Unable to search knowledge bases
  • Customer support bots: No access to historical conversation embeddings
  • Code completion tools: Cannot fetch similar code examples

When Weaviate is down, the entire RAG pipeline fails. LLMs receive no context and either hallucinate answers or return "I don't have enough information" responses, rendering the application effectively useless.

Example impact:

# Typical RAG flow - breaks at step 2 when Weaviate is down
def answer_question(question: str) -> str:
    # 1. Generate embedding (works - uses OpenAI/Cohere)
    embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )

    # 2. Search vector DB (FAILS when Weaviate is down)
    collection = weaviate_client.collections.get("Documents")
    results = collection.query.near_vector(
        near_vector=embedding.data[0].embedding,
        limit=5
    )

    # 3. Generate answer with context (never reached)
    context = "\n".join(r.properties['text'] for r in results.objects)
    answer = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": question}
        ]
    )
    return answer.choices[0].message.content

Related reading: Is OpenAI Down? | Is Cohere Down?

Semantic Search Outages

E-commerce, content platforms, and SaaS applications using semantic search lose critical functionality:

  • Product discovery: "Find similar products" features break
  • Content recommendations: Related articles/videos fail to load
  • Internal knowledge bases: Employees cannot search documentation
  • Research platforms: Academic paper similarity search unavailable

Unlike keyword search fallbacks, semantic search is uniquely dependent on vector databases. There's no simple degraded mode—either it works or it doesn't.

Business impact:

  • Sharp drops in engagement for semantic-search-driven features
  • Increased bounce rates on content platforms
  • Support ticket volume spikes as users can't self-serve
  • Lost revenue on recommendation-driven purchases

Hybrid Search Degradation

Weaviate's hybrid search (combining BM25 keyword + vector similarity) offers the best of both worlds—until an outage forces fallback to keyword-only:

# Normal hybrid search (superior results)
response = collection.query.hybrid(
    query="machine learning tutorials",
    alpha=0.5,  # Balance between keyword and vector
    limit=20
)

# During outage, forced to use basic keyword search
# Results quality drops significantly
fallback_response = elasticsearch.search(
    index="articles",
    body={"query": {"match": {"content": "machine learning tutorials"}}}
)

Quality degradation:

  • Semantic understanding lost (synonyms, concepts)
  • Reduced result relevance, especially for conceptual queries
  • User dissatisfaction with search quality
  • Increased "no results found" for conceptual queries

AI Application Development Halted

Development teams building AI features experience immediate blockers:

  • Testing pipelines: Cannot validate embedding generation
  • Staging environments: Integration tests fail
  • Demo environments: Sales demos crash during presentations
  • CI/CD pipelines: End-to-end tests timeout

Every minute of downtime delays feature releases and product iterations.

Data Pipeline Failures

Real-time data ingestion pipelines break when Weaviate becomes unavailable:

# Streaming pipeline - fails to commit batches
def process_stream(kafka_consumer):
    with collection.batch.dynamic() as batch:
        for message in kafka_consumer:
            embedding = vectorize(message.value)
            batch.add_object(
                properties={
                    'content': message.value,
                    'timestamp': message.timestamp
                },
                vector=embedding
            )
            # BREAKS HERE during Weaviate outage
            # Messages accumulate in Kafka, causing backlog

Cascade effects:

  • Message queue backpressure (Kafka/RabbitMQ)
  • Data freshness issues (stale embeddings)
  • Batch job failures (nightly data loads)
  • Increased infrastructure costs (queue storage)

Competitive Disadvantage vs. Managed Alternatives

Organizations evaluating vector databases compare Weaviate downtime against competitors:

  • Pinecone's 99.9% SLA (fully managed)
  • Qdrant's high-availability architecture
  • Milvus with Kubernetes auto-recovery
  • PostgreSQL pgvector (simpler but less featured)

Extended outages in self-hosted Weaviate may accelerate migration to managed alternatives, despite Weaviate's technical advantages.

What to Do When Weaviate Goes Down

1. Implement Comprehensive Health Checks

Proactive monitoring catches issues before user impact:

import time
import logging
from datetime import datetime
from weaviate.util import generate_uuid5

class WeaviateHealthMonitor:
    def __init__(self, client, alert_callback):
        self.client = client
        self.alert = alert_callback
        self.consecutive_failures = 0
        
    def check_health(self):
        """Comprehensive health check suite"""
        checks = {
            'connectivity': self._check_connectivity(),
            'schema_access': self._check_schema(),
            'query_performance': self._check_query(),
            'write_capability': self._check_write()
        }
        
        if not all(checks.values()):
            self.consecutive_failures += 1
            if self.consecutive_failures >= 3:
                self.alert(f"Weaviate health degraded: {checks}")
        else:
            self.consecutive_failures = 0
            
        return checks
    
    def _check_connectivity(self):
        """Basic connectivity test"""
        try:
            self.client.get_meta()
            return True
        except Exception as e:
            logging.error(f"Connectivity check failed: {e}")
            return False
    
    def _check_schema(self):
        """Verify schema operations"""
        try:
            collections = self.client.collections.list_all()
            return len(collections) > 0
        except Exception as e:
            logging.error(f"Schema check failed: {e}")
            return False
    
    def _check_query(self):
        """Test query performance"""
        try:
            start = time.time()
            collection = self.client.collections.get("TestCollection")
            collection.query.fetch_objects(limit=1)
            latency = time.time() - start
            
            if latency > 5.0:  # 5 second threshold
                logging.warning(f"Query latency high: {latency}s")
                return False
            return True
        except Exception as e:
            logging.error(f"Query check failed: {e}")
            return False
    
    def _check_write(self):
        """Test write capability"""
        try:
            collection = self.client.collections.get("TestCollection")
            test_uuid = generate_uuid5("health_check")
            collection.data.insert(
                properties={"test": "health_check", "timestamp": datetime.now().isoformat()},
                uuid=test_uuid
            )
            collection.data.delete_by_id(test_uuid)
            return True
        except Exception as e:
            logging.error(f"Write check failed: {e}")
            return False

# Usage
monitor = WeaviateHealthMonitor(
    client=weaviate_client,
    alert_callback=lambda msg: send_alert_to_slack(msg)
)

# Run every 60 seconds
while True:
    monitor.check_health()
    time.sleep(60)

2. Implement Circuit Breaker Pattern

Prevent cascade failures by failing fast:

import logging
from enum import Enum
from datetime import datetime, timedelta

class CircuitState(Enum):
    CLOSED = "closed"  # Normal operation
    OPEN = "open"      # Fast-fail mode
    HALF_OPEN = "half_open"  # Testing recovery

class WeaviateCircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        
    def call(self, func, *args, **kwargs):
        """Execute function with circuit breaker protection"""
        if self.state == CircuitState.OPEN:
            if datetime.now() - self.last_failure_time > timedelta(seconds=self.timeout):
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker OPEN - Weaviate unavailable")
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise e
    
    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED
        
    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = datetime.now()
        
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            logging.error(f"Circuit breaker OPEN after {self.failure_count} failures")

# Usage
breaker = WeaviateCircuitBreaker(failure_threshold=5, timeout=60)

def search_vectors(query):
    return breaker.call(
        lambda: collection.query.near_text(query=query, limit=10)
    )

try:
    results = search_vectors("machine learning")
except Exception as e:
    # Fallback to cached results or keyword search
    results = fallback_search("machine learning")

3. Queue Write Operations for Retry

Don't lose data during outages:

import json
import logging
import threading
import time
from datetime import datetime
from pathlib import Path
from queue import Queue

class WeaviateWriteQueue:
    def __init__(self, client, queue_file='weaviate_queue.jsonl'):
        self.client = client
        self.queue_file = Path(queue_file)
        self.queue = Queue()
        self.running = True
        
        # Load persisted queue
        self._load_queue()
        
        # Start background worker
        self.worker = threading.Thread(target=self._process_queue)
        self.worker.start()
    
    def add_object(self, collection_name, properties, uuid=None):
        """Add object to queue instead of direct write"""
        item = {
            'collection': collection_name,
            'properties': properties,
            'uuid': uuid,
            'timestamp': datetime.now().isoformat()
        }
        self.queue.put(item)
        self._persist_item(item)
    
    def _persist_item(self, item):
        """Persist queue item to disk"""
        with open(self.queue_file, 'a') as f:
            f.write(json.dumps(item) + '\n')
    
    def _load_queue(self):
        """Load persisted queue on startup"""
        if self.queue_file.exists():
            with open(self.queue_file, 'r') as f:
                for line in f:
                    self.queue.put(json.loads(line))
    
    def _remove_from_disk(self, item):
        """Rewrite the persistent queue file without the completed item"""
        if self.queue_file.exists():
            lines = self.queue_file.read_text().splitlines()
            remaining = [line for line in lines if json.loads(line) != item]
            self.queue_file.write_text(''.join(line + '\n' for line in remaining))

    def _process_queue(self):
        """Background worker to process queue"""
        while self.running:
            if self.queue.empty():
                time.sleep(1)
                continue

            item = self.queue.get()
            try:
                # Try to write to Weaviate
                collection = self.client.collections.get(item['collection'])
                collection.data.insert(
                    properties=item['properties'],
                    uuid=item['uuid']
                )

                logging.info(f"Successfully wrote queued item: {item['uuid']}")

                # Remove from persistent queue
                self._remove_from_disk(item)

            except Exception as e:
                logging.warning(f"Failed to process queue item: {e}")
                self.queue.put(item)  # Put back in queue for retry
                time.sleep(5)  # Back off before retry
    
    def shutdown(self):
        """Graceful shutdown"""
        self.running = False
        self.worker.join()

# Usage
write_queue = WeaviateWriteQueue(weaviate_client)

# Instead of direct writes during potential outages
write_queue.add_object(
    collection_name="Documents",
    properties={"title": "New document", "content": "..."},
    uuid=generate_uuid5("doc123")
)

4. Implement Multi-Region Failover (WCS)

For Weaviate Cloud Services, use multi-region deployment:

class MultiRegionWeaviate:
    def __init__(self, regions):
        self.clients = {
            region: weaviate.connect_to_weaviate_cloud(
                cluster_url=config['url'],
                auth_credentials=weaviate.auth.AuthApiKey(config['key'])
            )
            for region, config in regions.items()
        }
        self.primary_region = list(regions.keys())[0]
        self.current_region = self.primary_region
    
    def query(self, collection_name, **kwargs):
        """Query with automatic failover"""
        for region in self._get_region_priority():
            try:
                client = self.clients[region]
                collection = client.collections.get(collection_name)
                return collection.query.near_text(**kwargs)
            except Exception as e:
                logging.warning(f"Query failed in {region}: {e}")
                continue
        
        raise Exception("All Weaviate regions unavailable")
    
    def _get_region_priority(self):
        """Current region first, then others"""
        regions = list(self.clients.keys())
        regions.remove(self.current_region)
        return [self.current_region] + regions

# Usage
multi_region = MultiRegionWeaviate({
    'us-east-1': {'url': 'https://cluster-us.weaviate.network', 'key': '...'},
    'eu-west-1': {'url': 'https://cluster-eu.weaviate.network', 'key': '...'}
})

results = multi_region.query("Articles", query="AI", limit=10)

5. Cache Frequent Queries

Reduce dependency on Weaviate for common queries:

import hashlib
import logging
import time

class WeaviateQueryCache:
    def __init__(self, client, ttl=300):
        self.client = client
        self.ttl = ttl
        self.cache = {}
    
    def query_with_cache(self, collection_name, query, limit=10):
        """Cache query results"""
        cache_key = self._make_cache_key(collection_name, query, limit)
        
        # Check cache first
        if cache_key in self.cache:
            cached_result, timestamp = self.cache[cache_key]
            if time.time() - timestamp < self.ttl:
                logging.info(f"Cache HIT: {cache_key}")
                return cached_result
        
        # Cache miss - query Weaviate
        try:
            collection = self.client.collections.get(collection_name)
            result = collection.query.near_text(query=query, limit=limit)
            
            # Store in cache
            self.cache[cache_key] = (result, time.time())
            logging.info(f"Cache MISS: {cache_key}")
            return result
            
        except Exception as e:
            # If Weaviate is down and we have stale cache, return it
            if cache_key in self.cache:
                logging.warning(f"Weaviate down, returning stale cache for {cache_key}")
                return self.cache[cache_key][0]
            raise e
    
    def _make_cache_key(self, collection, query, limit):
        """Generate cache key from query parameters"""
        key_data = f"{collection}:{query}:{limit}"
        return hashlib.md5(key_data.encode()).hexdigest()

# Usage
cache = WeaviateQueryCache(weaviate_client, ttl=300)  # 5 minute TTL
results = cache.query_with_cache("Articles", "machine learning", limit=10)

6. Set Up Comprehensive Alerting

Get notified immediately when issues occur:

# Integration with API Status Check
import requests

def setup_weaviate_monitoring():
    """Subscribe to Weaviate status alerts"""
    response = requests.post(
        'https://apistatuscheck.com/api/weaviate/alerts',
        json={
            'email': 'team@yourcompany.com',
            'webhook': 'https://yourapp.com/webhooks/weaviate-status',
            'channels': ['email', 'slack', 'webhook']
        }
    )
    return response.json()

# Custom health check with alerting
def monitor_weaviate_health():
    """Comprehensive monitoring with multi-channel alerts"""
    try:
        # Run health checks
        health = WeaviateHealthMonitor(weaviate_client, alert_callback)
        results = health.check_health()
        
        if not all(results.values()):
            # Alert via multiple channels
            send_slack_alert(f"Weaviate health degraded: {results}")
            send_pagerduty_alert(f"Weaviate outage detected", severity="critical")
            send_email_alert("engineering@company.com", f"Weaviate issues: {results}")
            
    except Exception as e:
        # Even health check failed - critical alert
        send_pagerduty_alert(f"Cannot reach Weaviate: {e}", severity="critical")

# Run every minute
import schedule
schedule.every(1).minutes.do(monitor_weaviate_health)

7. Document Runbook for Incidents

Prepare your team with clear incident response procedures:

# Weaviate Outage Runbook

## Immediate Actions (0-5 minutes)
1. Confirm outage: Check apistatuscheck.com/api/weaviate
2. Verify scope: WCS vs self-hosted? Single region vs global?
3. Check Weaviate status page: status.weaviate.io
4. Post status update in #engineering Slack
5. Enable maintenance mode if user-facing impact severe

## Investigation (5-15 minutes)
1. Check cluster logs: kubectl logs weaviate-0
2. Review resource usage: kubectl top pod weaviate-0
3. Test health endpoints: curl http://weaviate:8080/v1/.well-known/ready
4. Check recent deployments/changes: git log --since="1 hour ago"
5. Review monitoring dashboards: Grafana/Datadog

## Mitigation (15-30 minutes)
1. Restart affected pods: kubectl rollout restart deployment/weaviate
2. Scale horizontally if resource exhaustion: kubectl scale deployment weaviate --replicas=5
3. Enable circuit breakers in application code
4. Activate query cache with extended TTL
5. Switch to fallback vector DB if available (Pinecone, Qdrant)

## Communication
- Internal: Update #incident channel every 15 minutes
- External: Post status on status.yourcompany.com if customer-facing
- Support: Prepare templated responses for tickets
- Stakeholders: Notify leadership if revenue impact >$X

## Post-Incident (After resolution)
1. Document root cause
2. Update monitoring/alerting based on learnings
3. Schedule post-mortem meeting
4. Implement preventive measures
5. Update this runbook with improvements

Frequently Asked Questions

How often does Weaviate Cloud Services go down?

Weaviate Cloud Services (WCS) maintains high availability with typical uptime above 99.9%. Major outages affecting all customers are rare (1-2 times per year), though regional issues or specific cluster degradation may occur more frequently. Self-hosted Weaviate uptime depends entirely on your infrastructure, configuration, and operational practices.

What's the difference between liveness and readiness checks?

The liveness endpoint (/v1/.well-known/live) indicates whether the Weaviate process is running—it returns 200 if the application hasn't crashed. The readiness endpoint (/v1/.well-known/ready) indicates whether Weaviate is ready to accept traffic—it checks that dependencies (storage, vectorizers) are available and the instance can serve requests. Use readiness for health checks in load balancers and orchestrators.
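In Kubernetes terms, each endpoint maps to its matching probe. A config sketch (the delay and period values are illustrative; tune them to your startup time):

```yaml
# Liveness: restart the container only if the process itself is wedged
livenessProbe:
  httpGet:
    path: /v1/.well-known/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
# Readiness: gate traffic until dependencies and shards are loaded
readinessProbe:
  httpGet:
    path: /v1/.well-known/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```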

Can I use Weaviate offline for local development?

Yes! Weaviate can run entirely locally using Docker for development and testing:

docker run -d \
  -p 8080:8080 \
  -e AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true \
  -e PERSISTENCE_DATA_PATH='/var/lib/weaviate' \
  semitechnologies/weaviate:latest

This local instance doesn't require internet connectivity (except when using remote vectorizers like OpenAI or Cohere). For offline development, use local vectorizer modules like text2vec-transformers.

How do I migrate from Weaviate to Pinecone during an outage?

Migrating vector databases during an outage requires preparation in advance. You'll need to:

  1. Export your data regularly using Weaviate's backup API
  2. Maintain schema mapping between Weaviate and Pinecone formats
  3. Transform vectors to Pinecone's format (if using different dimensions)
  4. Update application code to use Pinecone SDK
  5. Bulk upload vectors to Pinecone

For detailed migration guidance, see our Is Pinecone Down? guide. Realistic migration during an active outage takes hours to days depending on data volume—better to implement multi-provider support proactively.
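Step 3 is the mechanical part: a sketch of reshaping one exported Weaviate object into Pinecone's upsert record shape. The field names here are illustrative; adapt them to your actual export format:

```python
def weaviate_to_pinecone(obj: dict) -> dict:
    """Map one exported Weaviate object to a Pinecone-style upsert record."""
    return {
        "id": obj["uuid"],
        "values": obj["vector"],        # must match the target index dimension
        "metadata": obj["properties"],  # Pinecone metadata must be flat key/value pairs
    }

exported = {
    "uuid": "doc-123",
    "vector": [0.1, 0.2, 0.3],
    "properties": {"title": "New document"},
}
print(weaviate_to_pinecone(exported))
```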

What causes "context deadline exceeded" errors in Weaviate?

This error indicates that a query or operation exceeded its timeout limit. Common causes:

  • Cluster overload: Too many concurrent queries exhausting CPU/memory
  • Large result sets: Queries returning millions of objects without pagination
  • Complex filters: Multiple nested where clauses with aggregations
  • Slow vectorizers: Remote API calls to OpenAI/Cohere timing out
  • Network issues: Latency between client and Weaviate cluster

To fix: Add pagination (limit), optimize filters, increase timeout in client configuration, or scale your Weaviate cluster.
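The pagination fix can be sketched generically. Here `fetch_page` is a stand-in for whatever client call you use (for example, a Weaviate query issued with `limit` and `offset`), so the helper itself is not tied to any SDK:

```python
from typing import Any, Callable, Iterator

def paginate(fetch_page: Callable[[int, int], list[Any]],
             page_size: int = 100) -> Iterator[Any]:
    """Yield all results in fixed-size pages instead of one huge query.

    fetch_page(limit, offset) must return at most `limit` items;
    iteration stops at the first short (or empty) page.
    """
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size

# Usage with a stub standing in for a real Weaviate query:
data = list(range(250))
results = list(paginate(lambda limit, offset: data[offset:offset + limit],
                        page_size=100))
assert results == data
```

Bounding each request this way keeps every individual query well under the server's deadline, which is usually enough to eliminate "context deadline exceeded" on large result sets.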

Should I use Weaviate Cloud or self-hosted for production?

Choose Weaviate Cloud Services (WCS) if:

  • You want managed infrastructure (no DevOps overhead)
  • You need quick scaling without cluster management
  • You value predictable costs and SLAs
  • Your team is small or lacks Kubernetes expertise

Choose self-hosted if:

  • You need complete control over infrastructure
  • You have strict data residency requirements
  • You already have Kubernetes expertise in-house
  • You need custom configurations not available in WCS
  • You're cost-optimizing at very large scale (100M+ vectors)

For most organizations, WCS provides better reliability and lower operational burden. Self-hosted makes sense for large enterprises with dedicated platform teams.

How do I handle Weaviate outages in real-time data pipelines?

Implement a dead-letter queue (DLQ) pattern:

from datetime import datetime

from weaviate.exceptions import WeaviateBaseError  # Python client v4

def process_stream_with_dlq(kafka_consumer):
    """Process stream with automatic failover to DLQ"""
    while True:
        message = kafka_consumer.poll(timeout=1.0)
        if message is None:
            continue  # no message available this poll

        try:
            # Try writing to Weaviate (client v4: the vector goes in the
            # dedicated vector argument, not in properties)
            embedding = vectorize(message.value)
            collection.data.insert(
                properties={'content': message.value},
                vector=embedding,
            )
            kafka_consumer.commit()

        except WeaviateBaseError as e:
            # Weaviate unavailable - write to DLQ
            dlq_topic.send({
                'original_message': message.value,
                'error': str(e),
                'timestamp': datetime.now().isoformat(),
                'retry_count': 0
            })
            kafka_consumer.commit()  # Don't block pipeline

# Separate consumer reprocesses DLQ when Weaviate recovers

This prevents data loss while keeping your pipeline flowing during outages.
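The reprocessing consumer mentioned in the comment above can be sketched with a bounded retry count. Here `insert_into_weaviate`, `requeue`, and `parking_lot` are hypothetical stand-ins for your write path, your DLQ topic producer, and a final destination for poison messages:

```python
MAX_RETRIES = 5  # assumption: tune to your tolerance for stale data

def reprocess_dlq(dlq_messages, insert_into_weaviate, requeue, parking_lot):
    """Drain the DLQ once Weaviate is healthy again.

    Messages that still fail get retry_count bumped and are requeued;
    after MAX_RETRIES they go to a parking lot for manual inspection.
    """
    for msg in dlq_messages:
        try:
            insert_into_weaviate(msg["original_message"])
        except Exception:
            msg["retry_count"] += 1
            if msg["retry_count"] >= MAX_RETRIES:
                parking_lot.append(msg)   # give up: needs human attention
            else:
                requeue(msg)              # retry on the next drain pass
```

The retry ceiling matters: without it, a single malformed message (rather than an outage) can keep the reprocessor spinning forever.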

What's the best way to monitor Weaviate performance degradation before complete outage?

Implement progressive alerting based on key metrics:

# Alert thresholds
METRICS_THRESHOLDS = {
    'query_latency_p95': 1000,  # milliseconds
    'batch_import_rate': 100,   # objects/second (alert when rate drops below)
    'memory_usage_percent': 85,
    'cpu_usage_percent': 80,
    'error_rate_percent': 1
}

def check_performance_degradation():
    """Monitor for early warning signs"""
    metrics = get_weaviate_metrics()
    
    alerts = []
    if metrics['query_latency_p95'] > METRICS_THRESHOLDS['query_latency_p95']:
        alerts.append('⚠️ Query latency degraded')
    
    if metrics['memory_usage_percent'] > METRICS_THRESHOLDS['memory_usage_percent']:
        alerts.append('🔴 Memory pressure high')
    
    if metrics['cpu_usage_percent'] > METRICS_THRESHOLDS['cpu_usage_percent']:
        alerts.append('⚠️ CPU usage high')
    
    if metrics['batch_import_rate'] < METRICS_THRESHOLDS['batch_import_rate']:
        alerts.append('⚠️ Batch import rate dropped')
    
    if metrics['error_rate_percent'] > METRICS_THRESHOLDS['error_rate_percent']:
        alerts.append('⚠️ Error rate elevated')
    
    if alerts:
        send_warning_alert(f"Weaviate degradation detected: {', '.join(alerts)}")

Early warning systems prevent surprises and enable proactive intervention before user-facing outages occur.

How does Weaviate downtime compare to other vector databases?

Based on monitoring across vector database providers:

Provider         Typical Uptime   Incident Frequency   Recovery Time
Weaviate Cloud   99.9%+           1-2 major/year       1-4 hours
Pinecone         99.9%+           1-3 major/year       1-3 hours
Qdrant Cloud     99.5%+           2-4 major/year       2-6 hours
Self-hosted      Varies widely    Depends on ops       Depends on team

All major providers maintain excellent reliability. Self-hosted solutions offer more control but require dedicated operational expertise. For uptime monitoring across providers, check our comparison guides: Is Pinecone Down? | Is OpenAI Down?

Stay Ahead of Weaviate Outages

Don't let vector database issues take down your AI applications. Subscribe to real-time Weaviate alerts and get notified instantly when issues are detected—before your users notice.

API Status Check monitors Weaviate 24/7 with:

  • 60-second health checks on Weaviate Cloud Services
  • Instant alerts via email, Slack, Discord, or webhook
  • Historical uptime tracking and incident reports
  • Multi-region monitoring for global deployments
  • Query performance and latency tracking

Start monitoring Weaviate now →

Building AI infrastructure? Monitor your entire stack:

Monitor all your AI APIs →


Last updated: February 4, 2026. Weaviate status information is provided in real-time based on active monitoring. For official incident reports, always refer to status.weaviate.io.
