Why Database Monitoring Is Different from API Monitoring
Application and API monitoring is relatively straightforward: check if the endpoint responds, measure response time, alert on errors. Database monitoring is more nuanced because:
- Databases are stateful. A slow query doesn't just affect that one request — it can hold locks that cascade to block hundreds of other queries.
- Resource contention is invisible from the outside. Your API returns 200, but behind it a database query is taking 8 seconds instead of 8 milliseconds.
- Degradation is gradual. Databases don't typically go from healthy to down instantly. They degrade — slowly — until they tip over.
- Replication adds complexity. Primary/replica setups create a new failure mode: replication lag, where reads return stale data without any error.
The Core Database Metrics to Monitor
These seven metrics cover the failure modes that cause the majority of database incidents:
1. Query Response Time (P95/P99)
The most important performance metric. Track query latency at the 95th and 99th percentile — not the average. Database averages lie because a small percentage of extremely slow queries (N+1 problems, missing indexes, table scans) can have an outsized impact without moving the mean much.
| Percentile | Good | Warning | Critical |
|---|---|---|---|
| P50 | < 5ms | 5-20ms | > 20ms |
| P95 | < 50ms | 50-200ms | > 200ms |
| P99 | < 200ms | 200-500ms | > 500ms |
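To see why the table above tracks percentiles rather than the average, here's a minimal sketch using synthetic numbers and a nearest-rank percentile: a 2% tail of table scans leaves the mean looking merely elevated while P99 surfaces the two-second queries.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * N)."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Synthetic sample: 98 fast queries at 5 ms, 2 table scans at 2000 ms
latencies_ms = [5.0] * 98 + [2000.0] * 2

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.1f}ms "
      f"p95={percentile(latencies_ms, 95)}ms "
      f"p99={percentile(latencies_ms, 99)}ms")
# mean=44.9ms p95=5.0ms p99=2000.0ms
```

A 44.9 ms mean barely hints that some queries take two full seconds; the P99 makes it unmissable.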
2. Connection Pool Utilization
Every database has a maximum connection limit. When you hit it, new application requests fail immediately with connection errors. Track:
- Active connections — queries currently executing
- Idle connections — connections in the pool, waiting
- Waiting connections — requests waiting for a connection (bad sign)
- Max connections — your configured limit
Alert at 80% pool utilization. Page at 95%. If you see waiting connections, you're already in an incident.
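The 80%/95% rule above can be expressed as a simple check. This is a sketch only: the stats and function name are hypothetical, not any specific driver's or pooler's API.

```python
def pool_alert_level(active, waiting, max_connections):
    """Map connection-pool stats to an alert level (hypothetical helper)."""
    utilization = active / max_connections
    if waiting > 0 or utilization >= 0.95:
        return "page"   # waiting requests mean the pool is effectively exhausted
    if utilization >= 0.80:
        return "alert"  # headroom shrinking; investigate before saturation
    return "ok"

print(pool_alert_level(active=85, waiting=0, max_connections=100))  # alert
```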
3. Replication Lag
For any primary/replica database setup, replication lag is the time delay between a write on the primary and its appearance on the replica. Applications routing reads to replicas will serve stale data during lag spikes.
Alert thresholds by criticality: <1s (good), 1-10s (warning), >30s (critical for most applications), >5 min (potential data loss window if primary fails).
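As a sketch, those thresholds map to a small classifier; assumptions here: lag is measured in seconds, and the 10-30s band (unstated above) is treated as a warning.

```python
def replication_lag_severity(lag_seconds):
    """Classify replica lag per the thresholds above (illustrative only)."""
    if lag_seconds > 300:
        return "data-loss-window"  # > 5 min behind: risky if the primary fails
    if lag_seconds > 30:
        return "critical"
    if lag_seconds >= 1:
        return "warning"
    return "good"

print(replication_lag_severity(45))  # critical
```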
4. Lock Waits and Deadlocks
Long-running transactions hold row locks, blocking other queries. Deadlocks are circular lock dependencies that database engines resolve by killing one of the transactions. Both degrade application throughput.
```sql
-- PostgreSQL: find queries waiting for locks
SELECT
    pid,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state,
    wait_event_type,
    wait_event
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY duration DESC;
```
5. Buffer/Cache Hit Rate
A healthy database serves most reads from memory (buffer cache), not disk. Low cache hit rates mean expensive disk I/O on every query.
Target: >95% cache hit rate for OLTP workloads. Below 90% indicates your working set doesn't fit in memory — consider increasing RAM or caching at the application layer.
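The hit rate is derived from two counters the database already exposes (for InnoDB, the buffer pool read counters shown in the MySQL section below). A sketch with synthetic numbers:

```python
def cache_hit_rate(read_requests, disk_reads):
    """Fraction of logical reads served from the buffer cache.

    Mirrors the InnoDB formula:
    hit_rate = 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)
    """
    if read_requests == 0:
        return 1.0  # no reads yet; treat as fully cached
    return 1 - disk_reads / read_requests

# Synthetic counters: one million logical reads, 30k of them hit disk
rate = cache_hit_rate(read_requests=1_000_000, disk_reads=30_000)
print(f"{rate:.1%}")  # 97.0% — above the 95% OLTP target
```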
6. Disk I/O and Storage
- Disk throughput (MB/s) — alert when approaching disk bandwidth limits
- IOPS utilization — especially important on provisioned IOPS storage (AWS RDS gp3, io1)
- Disk usage percentage — alert at 75%, page at 90% (full disk = database crash)
- Write-ahead log (WAL) growth — PostgreSQL WAL accumulating faster than it can replay is a replication warning sign
7. Slow Query Rate
Enable the slow query log and track the count of queries exceeding your latency threshold (typically >100ms or >1s). A sudden spike in slow query count is often the first signal of a missing index, lock contention, or a bad query introduced in a new deployment.
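Counting slow queries from the log is straightforward once logging is enabled. A minimal sketch against synthetic PostgreSQL-style log lines (real lines would come from your configured log file):

```python
import re

# PostgreSQL's log_min_duration_statement emits lines like:
#   LOG:  duration: 1523.004 ms  statement: SELECT ...
# Synthetic sample lines for illustration:
log_lines = [
    "LOG:  duration: 3.214 ms  statement: SELECT 1",
    "LOG:  duration: 1523.004 ms  statement: SELECT * FROM orders",
    "LOG:  duration: 250.780 ms  statement: UPDATE users SET active = true",
]

DURATION_RE = re.compile(r"duration: ([\d.]+) ms")

def count_slow(lines, threshold_ms=100.0):
    """Count logged statements exceeding the slow-query threshold."""
    durations = (float(m.group(1)) for m in map(DURATION_RE.search, lines) if m)
    return sum(1 for d in durations if d > threshold_ms)

print(count_slow(log_lines))  # 2
```

Track this count over time; a sudden jump against its baseline is the alertable signal.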
Database Monitoring by Type
PostgreSQL Monitoring
PostgreSQL ships with powerful built-in statistics views:
- pg_stat_activity — active connections, running queries, wait events
- pg_stat_statements — query-level performance statistics (requires extension)
- pg_stat_replication — replication lag per replica
- pg_stat_bgwriter — buffer writes, checkpoints
- pg_stat_user_tables — table-level scan/fetch/insert/update/delete counts
Enable pg_stat_statements (it's not on by default) — this single extension unlocks per-query performance data that's essential for finding slow queries.
MySQL / MariaDB Monitoring
Key MySQL monitoring queries:
```sql
-- Active connections and queries
SHOW PROCESSLIST;

-- Global status counters
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Slow_queries';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
-- Cache hit rate = 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)

-- Replication lag (on replica)
SHOW REPLICA STATUS\G
-- Look for Seconds_Behind_Source
```
MongoDB Monitoring
MongoDB key metrics differ from relational databases:
- Current operations (db.currentOp()) — long-running operations in progress
- WiredTiger cache usage — MongoDB's storage engine cache hit rate
- Replication oplog window — how much time the oplog covers; if a secondary falls behind beyond the oplog window, it needs a full resync
- Index miss rate — queries performing full collection scans instead of using indexes
Redis Monitoring
Redis monitoring focuses on memory and throughput:
- Memory usage vs maxmemory — when Redis hits maxmemory, it evicts keys (or rejects writes, depending on policy)
- Keyspace hit/miss rate — cache miss rate should stay below 5-10% for a healthy cache layer
- Connected clients — Redis has a client limit (default 10,000)
- Commands/second — throughput; alert on sudden drops (may indicate connection issues)
- Replication offset lag — distance between master and replica in bytes
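The keyspace miss rate comes straight from two counters in Redis's `INFO stats` output. A sketch with synthetic values:

```python
# Fields as returned by Redis `INFO stats`; the values here are synthetic.
info_stats = {"keyspace_hits": 950_000, "keyspace_misses": 50_000}

def keyspace_miss_rate(stats):
    """Miss rate = misses / (hits + misses); target below 5-10% per above."""
    total = stats["keyspace_hits"] + stats["keyspace_misses"]
    return stats["keyspace_misses"] / total if total else 0.0

print(f"{keyspace_miss_rate(info_stats):.1%}")  # 5.0%
```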
Best Database Monitoring Tools in 2026
For Managed Databases (RDS, Cloud SQL, Supabase)
- Better Stack: Cross-service alerting — combine database metrics with API and uptime monitoring in a single dashboard. Integrates with RDS, Cloud SQL, and Supabase via CloudWatch/log forwarding.
- DataDog: Deep RDS and CloudSQL integrations. Per-query performance tracking, anomaly detection, and correlation with application traces.
- Cloud provider native: AWS CloudWatch for RDS, Google Cloud Monitoring for Cloud SQL. Good for basic metrics but limited for cross-service correlation.
For Self-Hosted Databases
- Prometheus + Grafana: The most flexible open-source stack. Use database-specific exporters: postgres_exporter, mysqld_exporter, mongodb_exporter, redis_exporter.
- pgBadger: PostgreSQL slow query log analyzer. Generates beautiful reports from your pg_log with query categorization and timing breakdowns.
- Percona Monitoring and Management (PMM): Free, purpose-built for MySQL/MongoDB/PostgreSQL. Excellent slow query analysis.
For Query-Level Insights
- New Relic APM: Traces individual application queries to database calls. Identifies the exact code path generating slow queries.
- Scout APM: Lightweight APM focused on query performance. Good for Rails/Django/Laravel applications with heavy database usage.
- Metabase: Business intelligence tool that doubles as a query analysis dashboard when pointed at your production database replica.
Setting Up Your Database Monitoring Stack
Step 1: Enable Slow Query Logging
This is the single highest-leverage action for database observability. Enable it everywhere, always.
```ini
# PostgreSQL (postgresql.conf)
log_min_duration_statement = 100   # Log queries > 100ms
log_statement = 'none'             # Don't log all statements
```

```ini
# MySQL (my.cnf)
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.1   # 100ms threshold
```

```yaml
# MongoDB (mongod.conf)
operationProfiling:
  mode: slowOp
  slowOpThresholdMs: 100
```
Step 2: Set Up Connection Pool Monitoring
If you use PgBouncer, Vitess, or application-level pooling (HikariCP, pg-pool), monitor the pool itself — not just the database:
- Pool size: total connections provisioned
- Active: connections currently servicing a query
- Idle: available connections waiting
- Wait queue: requests waiting for a connection (should be zero at steady state)
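The wait-queue check above can be automated; this sketch assumes a snapshot shaped like a PgBouncer `SHOW POOLS` row (cl_waiting = clients queued for a server connection), with synthetic values.

```python
# Hypothetical snapshot shaped like one PgBouncer SHOW POOLS row.
pool_row = {"cl_active": 42, "cl_waiting": 0, "sv_active": 42, "sv_idle": 8}

def pool_healthy(row):
    """At steady state the wait queue should be empty."""
    return row["cl_waiting"] == 0

print(pool_healthy(pool_row))  # True
```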
Step 3: Create a Runbook for Each Alert
Every database alert should have a corresponding runbook. When your on-call engineer gets paged at 3am for "Connection pool at 92%", they should have a document explaining: what this means, what queries to run to diagnose it, and what actions to take (kill connections, scale the database, add replicas). See our runbook guide for templates.
Database Monitoring Checklist
- ☐ Slow query logging enabled with threshold ≤ 100ms
- ☐ pg_stat_statements or performance_schema enabled for query-level metrics
- ☐ Connection pool monitoring — alert at 80%, page at 95%
- ☐ Replication lag monitoring — alert > 10s, page > 60s
- ☐ Disk usage alert at 75%, page at 90%
- ☐ Deadlock rate monitoring — alert on any increase above baseline
- ☐ Buffer cache hit rate alert below 90%
- ☐ Backup verification — confirm backups completed within expected window
- ☐ Runbook for every database alert type
- ☐ Read replica health monitoring separate from primary
Frequently Asked Questions
What is database monitoring?
Database monitoring is the ongoing collection and analysis of metrics from your database system — including query performance, connection usage, replication lag, and resource utilization — to detect and resolve issues before they cause application downtime.
What are the most important database metrics to monitor?
The most critical metrics are: query response time (P95/P99), connection pool saturation, replication lag, slow query count, lock waits, disk I/O, and buffer cache hit rate. Track all seven and you'll catch the vast majority of database problems early.
How do I monitor PostgreSQL performance?
Enable pg_stat_statements and query pg_stat_activity, pg_stat_replication, and pg_stat_user_tables. Use pgBadger for slow query log analysis. Connect an APM tool (DataDog, New Relic) for automated metric collection and alerting.
What is a good alert threshold for database connection pool?
Alert at 80% pool utilization to give you time to respond. Page at 95%. Alert immediately if you see any waiting connections, as that indicates the pool is already exhausted.
What is the best database monitoring tool in 2026?
For managed databases, Better Stack or DataDog provide excellent cross-service visibility. For self-hosted databases, Prometheus + Grafana with database-specific exporters is the most flexible option. For query-level insights, your APM tool (New Relic, DataDog APM) is most valuable.