Azure is the second-largest cloud provider, powering millions of production workloads. Microsoft gives you a rich monitoring stack, including Azure Monitor, Application Insights, Log Analytics, and Microsoft Defender for Cloud (formerly Azure Security Center), but knowing which tools to use for which problems, and how to avoid spending a fortune on telemetry ingestion, requires real production experience. This guide distills Azure monitoring into actionable best practices.
The Azure Monitoring Stack: What's Available
Azure provides several native monitoring services, each covering a different layer of observability:
| Service | What It Does | Best For |
|---|---|---|
| Azure Monitor Metrics | Collect and graph metrics from all Azure services | CPU, memory, latency, error rates |
| Log Analytics Workspace | Centralize and query logs via KQL | Log aggregation, complex analysis |
| Azure Monitor Alerts | Alert on metrics, logs, or activity events | Proactive alerting, auto-scaling |
| Application Insights | Full APM for web apps and APIs | Request tracing, exceptions, dependency tracking |
| Azure Activity Log | Audit log of all Azure control-plane operations | Security, compliance, change tracking |
| Azure Service Health | Azure service health events affecting your subscription | Outage awareness, incident tracking |
| Network Watcher | Network topology, traffic analytics, packet capture | Network debugging, security |
| Microsoft Defender for Cloud (formerly Azure Security Center) | Security posture and threat detection | Compliance, vulnerability assessment |
Azure Monitor: Core Infrastructure Monitoring
Understanding Azure Monitor Metrics
Azure Monitor Metrics is the foundation of Azure infrastructure monitoring. Every Azure resource (VMs, AKS clusters, Functions, SQL databases, App Services, and hundreds more) automatically publishes metrics. Metrics are organized by resource type and metric namespace.
Key Azure Monitor Metrics concepts:
- Platform metrics: Free metrics published automatically by all Azure services (no configuration needed)
- Custom metrics: Publish application metrics via the Azure Monitor REST API, SDK, or OpenTelemetry
- Metric dimensions: Filter and split metrics by dimensions (e.g., HTTP status code, operation name)
- Metric retention: 93 days for most metrics; high-resolution (sub-minute) retained 3 days
- Diagnostic settings: Required to route metrics and logs to Log Analytics or a storage account
Critical Metrics by Azure Service
Virtual Machines (VMs)
- Percentage CPU: Alert at >80% sustained for 5+ minutes
- Network In/Out: Baseline and alert on anomalies
- Disk Read/Write Bytes: I/O saturation on managed disks
- OS Disk IOPS Consumed Percentage: Alert at >90%
- Memory (via Azure Monitor Agent): Not available by default; install the AMA extension
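To make "sustained for 5+ minutes" concrete, here is a minimal Python sketch of the idea. It is a simplified illustration, not how Azure Monitor evaluates alert rules internally: the function name `breaches_sustained` and the sample values are hypothetical, and it assumes one metric sample per minute.

```python
from typing import Sequence


def breaches_sustained(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """Return True if at least `min_consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical "Percentage CPU" samples, one per minute.
cpu_spike = [45, 92, 88, 60, 50, 47, 85, 40]      # short spikes only
cpu_saturated = [70, 82, 85, 91, 88, 84, 79]      # five minutes above 80%

print(breaches_sustained(cpu_spike, threshold=80, min_consecutive=5))      # False
print(breaches_sustained(cpu_saturated, threshold=80, min_consecutive=5))  # True
```

Requiring a run of breaches rather than a single sample is what keeps short CPU spikes from paging anyone.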
Azure Kubernetes Service (AKS)
- Node CPU usage percentage: Alert at >80%; scale the node pool before saturation
- Node memory working set percentage: Alert at >80%
- Pod count by phase: Watch Pending/Failed counts, which indicate scheduling issues
- Unschedulable pod count: Immediate alert; pods are waiting for capacity
- Restarting container count: CrashLoopBackOff detection
Azure SQL Database
- DTU percentage (DTU model): Alert at >80%; scale the tier before saturation
- CPU percentage (vCore model): Alert at >70%
- Storage percentage: Alert when >80% to prevent autogrow issues
- Deadlocks: Alert on any deadlocks; they indicate application logic issues
- Connection failures: Alert on sustained failures, which can indicate connection pool exhaustion
Azure App Service
- CPU percentage: Alert at >80%; scale out instances
- Memory percentage: Alert at >85% to prevent OOM kills
- Http5xx: Error rate; alert on >1% of requests
- Response time: Alert on P95 > your SLA threshold
- Requests: Traffic baseline; drops indicate upstream issues
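The ">1% of requests" rule for Http5xx is a ratio, so it helps to see how it behaves at low traffic. The sketch below is illustrative only: the function names and the `min_requests` guard are assumptions added here, not part of any Azure API.

```python
def error_rate_percent(total_requests: int, http_5xx: int) -> float:
    """Share of requests that returned a 5xx status, as a percentage."""
    if total_requests <= 0:
        return 0.0
    return 100.0 * http_5xx / total_requests


def should_alert(total_requests: int, http_5xx: int,
                 threshold_percent: float = 1.0, min_requests: int = 100) -> bool:
    """Alert only when the window has enough traffic for the ratio to mean something."""
    if total_requests < min_requests:
        return False
    return error_rate_percent(total_requests, http_5xx) > threshold_percent


print(should_alert(12_000, 90))   # False (0.75% of requests failed)
print(should_alert(12_000, 180))  # True  (1.5% of requests failed)
print(should_alert(20, 5))        # False (25%, but only 20 requests in the window)
```

The minimum-traffic guard is a design choice: without it, a handful of failed requests overnight can trip a percentage-based alert.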
Azure Functions
- Execution count: Traffic baseline and anomaly detection
- Function errors: Alert on sustained error rate >1%
- Function execution duration: Watch P99; the timeout is configurable, but defaults to 5 minutes on the Consumption plan
- Throttled function runs: Hitting concurrency limits on Consumption plan
- Active instance count: Scale monitoring for Premium/Dedicated plans
Azure Monitor Alerts: Getting Effective Alerts
Azure Monitor Alerts fire when a condition is met on a metric, log query, or activity log event. Azure supports three alert types: metric alerts (fastest, near real-time), log search alerts (flexible KQL queries on Log Analytics data), and activity log alerts (service health, resource changes).
Alert Best Practices
- Use dynamic thresholds: Azure Monitor can learn a metric's normal pattern and alert on deviations, which works better than static thresholds for seasonal workloads
- Set evaluation frequency thoughtfully: Default is every 1 minute; evaluate non-critical alerts every 5 minutes instead to cut noise
- Use Action Groups: Route alerts to email, SMS, webhook, Logic App, or ITSM integrations via Action Groups; define once, reuse across alerts
- Alert on symptom, not cause: Alert on HTTP 5xx rate (symptom) rather than CPU (cause); high CPU doesn't always mean user impact
- Severity levels: Use Sev 0 (Critical) for production outages, Sev 1 for degraded performance, Sev 2-4 for warnings and informational
- Enable service health alerts: Free alerts when Azure itself has issues in your region; these should be Sev 1
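To illustrate the idea behind dynamic thresholds, the following Python sketch derives alert bounds from recent history instead of a fixed number. This is a deliberately simple stand-in (mean plus or minus a multiple of the standard deviation), not the model Azure Monitor actually uses; the function names, the `sensitivity` parameter, and the sample data are all hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def dynamic_bounds(history: Sequence[float], sensitivity: float = 3.0) -> Tuple[float, float]:
    """Derive lower/upper bounds from history: mean +/- sensitivity * sample std dev."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - sensitivity * sigma, mu + sensitivity * sigma


def is_anomalous(value: float, history: Sequence[float], sensitivity: float = 3.0) -> bool:
    """Flag a value that falls outside the bounds learned from history."""
    low, high = dynamic_bounds(history, sensitivity)
    return value < low or value > high


# Hypothetical requests-per-minute history for one time-of-day bucket.
history = [120, 118, 125, 130, 122, 119, 127, 124]

print(is_anomalous(126, history))  # False: within the learned band
print(is_anomalous(180, history))  # True: far above normal
print(is_anomalous(60, history))   # True: traffic drop
```

A static threshold of, say, 150 would catch the spike but miss the drop; bounds learned from history catch both.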
Application Insights: APM for Azure Apps
Application Insights is Microsoft's application performance monitoring (APM) service. It instruments your code (web apps, APIs, Functions, and workers) to capture telemetry automatically. Unlike Azure Monitor Metrics (infrastructure layer), Application Insights operates at the application layer: requests, exceptions, dependencies, page views, custom events, and traces.
Auto-Instrumented vs SDK Instrumentation
Application Insights offers two instrumentation approaches:
Auto-Instrumentation (Codeless)
Available for Azure App Service (.NET, Java, Node.js, Python), Azure Functions, and AKS via the Application Insights Kubernetes agent. No code changes required; enable via the Azure Portal.
Captures: HTTP requests/responses, SQL queries, HTTP dependencies, exceptions, performance counters
SDK Instrumentation (Code-Based)
Add the Application Insights SDK to your code for full control. Available for .NET, Java, Node.js, Python, JavaScript, and more.
Adds: Custom events, custom metrics, custom traces, user journey tracking, custom dimensions on all telemetry
Key Application Insights Features
- Application Map: Visual graph of your app's component dependencies; instantly see where failures propagate
- Live Metrics Stream: Real-time telemetry with sub-second latency, invaluable during incident response
- Smart Detection: AI-powered anomaly detection that automatically alerts on unusual failure rates, response time degradation, and dependency failures
- Availability Tests: Run URL ping tests or multi-step web tests from global locations every 1-5 minutes
- Distributed Tracing: End-to-end request correlation across microservices using correlation IDs
- Profiler: CPU profiling in production; capture call stacks to find hot paths
- Snapshot Debugger: Capture exception snapshots with local variables in production without stopping the app
Log Analytics: Querying Logs with KQL
Log Analytics Workspace is where Azure routes all diagnostic logs โ VM syslog, AKS container logs, App Service application logs, Activity Logs, and Application Insights telemetry. You query it using Kusto Query Language (KQL).
Essential KQL Queries
// Top failing requests in App Insights
requests | where success == false | summarize count() by name, resultCode | order by count_ desc | take 20
// P95 response time by operation in last 1 hour
requests | where timestamp > ago(1h) | summarize percentile(duration, 95) by name | order by percentile_duration_95 desc
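// AKS container restart count per pod (pods reporting in the last 24h)
// ContainerRestartCount is cumulative, so take the max rather than summing repeated records
KubePodInventory | where TimeGenerated > ago(24h) | where ContainerRestartCount > 0 | summarize max(ContainerRestartCount) by Name, Namespace | order by max_ContainerRestartCount desc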
// Exception volume by type (last 6h)
exceptions | where timestamp > ago(6h) | summarize count() by type, outerMessage | order by count_ desc
Log Ingestion Cost Optimization
Log Analytics can become expensive quickly. Strategies to control costs:
- Use Commitment Tiers: If ingesting 100 GB/day or more, Commitment Tiers offer a per-GB discount over pay-as-you-go that grows with the tier; check current pricing for the exact savings
- Set daily cap: Configure a daily ingestion cap per workspace to prevent runaway costs from verbose log sources
- Filter at the source: Use diagnostic setting filters to exclude noisy log categories (e.g., Azure Firewall flow logs can be massive)
- Separate workspaces by tier: High-retention compliance data in one workspace, short-retention debug data in another
- Use Basic Logs: A cheaper tier (around $0.65/GB vs. $2.76/GB for Analytics; prices vary by region) for verbose logs you rarely query
- Application Insights sampling: Enable adaptive sampling in App Insights SDK to reduce telemetry volume by 50-90%
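A rough back-of-the-envelope model shows how these levers combine. The Python sketch below uses the per-GB prices quoted in the list above as assumptions (real prices vary by region and change over time), and the function and parameter names are hypothetical. For simplicity it applies the sampling reduction to the whole ingestion volume, whereas in practice sampling only affects Application Insights telemetry.

```python
# Assumed prices, taken from the figures quoted in this guide; verify against current pricing.
ANALYTICS_USD_PER_GB = 2.76
BASIC_USD_PER_GB = 0.65


def monthly_ingestion_cost(gb_per_day: float, basic_share: float = 0.0,
                           sampling_reduction: float = 0.0, days: int = 30) -> float:
    """Estimate monthly ingestion cost in USD.

    basic_share: fraction of (post-sampling) volume routed to Basic Logs.
    sampling_reduction: fraction of volume dropped by sampling before ingestion.
    """
    if not 0.0 <= basic_share <= 1.0:
        raise ValueError("basic_share must be between 0 and 1")
    if not 0.0 <= sampling_reduction < 1.0:
        raise ValueError("sampling_reduction must be in [0, 1)")
    effective_gb = gb_per_day * (1.0 - sampling_reduction)
    basic_gb = effective_gb * basic_share
    analytics_gb = effective_gb - basic_gb
    daily_cost = analytics_gb * ANALYTICS_USD_PER_GB + basic_gb * BASIC_USD_PER_GB
    return round(daily_cost * days, 2)


baseline = monthly_ingestion_cost(50)
optimized = monthly_ingestion_cost(50, basic_share=0.4, sampling_reduction=0.5)
print(baseline)   # 4140.0
print(optimized)  # 1437.0
```

Under these assumptions, halving volume with sampling and moving 40% of what remains to Basic Logs cuts the monthly bill by roughly two thirds.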
Azure Workbooks & Dashboards
Azure Monitor Workbooks are interactive reports that combine metrics, logs, parameters, and visualizations. They're the preferred way to build custom Azure dashboards beyond the basic Azure Portal charts.
- Azure Dashboard: Pin charts from any Azure service for a quick at-a-glance operational view
- Workbooks: Full interactive reports with drill-downs, parameters (time range, subscription, resource), and mixed data sources
- Insights: Pre-built Workbooks for specific services (VM Insights, Container Insights, Network Insights, SQL Insights)
- Grafana with Azure Data Source: Use managed Grafana (Azure Managed Grafana) or self-hosted Grafana with the Azure Monitor plugin for Grafana-style dashboards
Third-Party Azure Monitoring Tools
Native Azure Monitor is powerful but has limitations: KQL has a steep learning curve, the alert UI is complex, and cross-cloud visibility is limited. Third-party tools are common in Azure production environments:
| Tool | Azure Integration | Best For |
|---|---|---|
| Datadog | Native Azure integration with 800+ metrics and log forwarding | Full-stack observability, multi-cloud |
| New Relic | Azure polling + Log Forwarding integration | APM + infrastructure in one platform |
| Grafana Cloud | Azure Monitor data source plugin | Teams already using Grafana/Prometheus |
| Better Stack | Webhook integration with Azure Monitor Alerts | Uptime monitoring + on-call management |
| Dynatrace | Auto-discovery via Azure API | Enterprise APM with AI-powered root cause |
| Elastic | Azure Metricbeat + Filebeat + Functionbeat | Log search + APM on self-managed stack |
Azure Monitoring Best Practices
1. Enable Diagnostic Settings on Every Resource
By default, only platform metrics are collected. Enable diagnostic settings on each resource to route logs and additional metrics to Log Analytics. Use Azure Policy to enforce this at scale across subscriptions.
2. Use Tags for Cost Allocation and Alert Routing
Tag resources with environment, team, and service. Use tags in Action Groups to route alerts to the right team and in Cost Management to allocate monitoring spend.
3. Set Up Azure Service Health Alerts
Configure Service Health alerts (free) to notify your team when Azure has incidents, planned maintenance, or health advisories in your regions. Integrate these with your on-call tool via webhook or Logic App.
4. Define SLOs and Use Error Budgets
Define SLOs in Log Analytics Workbooks (e.g., 99.9% uptime = 43 minutes downtime/month budget). Track error budget burn rate and alert when burning too fast. See our SLO guide for implementation details.
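The error budget arithmetic is easy to check in code. This is a minimal Python sketch with hypothetical function names; it assumes a 30-day window and measures "bad" time in minutes, which is only one of several ways to define an SLO.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total allowed downtime (minutes) in the window for a given availability SLO."""
    return window_days * 24 * 60 * (1 - slo_percent / 100)


def burn_rate(bad_minutes: float, elapsed_days: float,
              slo_percent: float, window_days: int = 30) -> float:
    """Budget consumed so far relative to an even spend over the window.

    1.0 means the budget will last exactly the window; above 1.0 it runs out early.
    """
    budget = error_budget_minutes(slo_percent, window_days)
    expected_so_far = budget * elapsed_days / window_days
    return bad_minutes / expected_so_far


print(round(error_budget_minutes(99.9), 1))                                      # 43.2
print(round(burn_rate(bad_minutes=10, elapsed_days=3, slo_percent=99.9), 2))     # 2.31
```

In this example, 10 minutes of downtime in the first 3 days burns the budget about 2.3 times faster than sustainable, which is the kind of signal worth alerting on.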
5. Implement Distributed Tracing for Microservices
If you run multiple services on Azure (AKS + Functions + App Service + Service Bus), enable Application Insights distributed tracing or OpenTelemetry to correlate requests across service boundaries. Without it, debugging multi-hop failures is guesswork.
6. Monitor Your Monitoring: Alert on Data Gaps
Create a "Heartbeat absent" alert in Log Analytics to detect when an agent stops reporting. If the monitoring agent on a VM (for example, the Azure Monitor Agent) crashes, you want to know before a real incident leaves you without visibility.
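The logic of a "heartbeat absent" check is simple: compare each agent's last-seen time to the current time. The Python sketch below shows that logic in isolation; the function name, the VM names, and the 5-minute silence threshold are assumptions for illustration, and a real alert would run as a log search rule against the workspace rather than in application code.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_agents(last_heartbeat: Dict[str, datetime], now: datetime,
                 max_silence: timedelta = timedelta(minutes=5)) -> List[str]:
    """Return the names of agents whose latest heartbeat is older than max_silence."""
    return sorted(
        name for name, seen in last_heartbeat.items()
        if now - seen > max_silence
    )


# Fixed "now" so the example is deterministic; VM names are hypothetical.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_heartbeat = {
    "vm-web-01": now - timedelta(minutes=1),
    "vm-web-02": now - timedelta(minutes=12),
    "vm-db-01": now - timedelta(minutes=4),
}

print(stale_agents(last_heartbeat, now))  # ['vm-web-02']
```

Choose the silence threshold to be a few multiples of the expected heartbeat interval so a single delayed record does not page anyone.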
Azure vs AWS Monitoring: Key Differences
If you're coming from AWS, here's how Azure's monitoring stack maps:
| AWS Service | Azure Equivalent | Key Difference |
|---|---|---|
| CloudWatch Metrics | Azure Monitor Metrics | Azure uses diagnostic settings instead of automatic log routing |
| CloudWatch Logs | Log Analytics Workspace | Azure uses KQL; AWS uses CloudWatch Logs Insights (SQL-like) |
| CloudWatch Alarms | Azure Monitor Alerts | Azure has dynamic thresholds; CloudWatch's closest equivalent is anomaly detection alarms |
| AWS X-Ray | Application Insights (distributed tracing) | App Insights also does full APM; X-Ray is tracing only |
| CloudTrail | Azure Activity Log | Very similar; both audit all API calls |
| AWS Health | Azure Service Health | Both offer subscription-level health events |
| CloudWatch Dashboards | Azure Workbooks / Dashboards | Workbooks are more powerful; Azure Portal Dashboards are simpler |
Getting Started: Azure Monitoring Quickstart Checklist
Day 1 Setup
- Create a central Log Analytics Workspace in each region
- Enable diagnostic settings on all critical resources (route to Log Analytics)
- Enable Application Insights on all web apps and APIs
- Configure Azure Service Health alerts (free, high value)
- Set Action Groups for email/SMS/webhook routing
- Create metric alerts for CPU, memory, error rates on production resources
- Enable Container Insights on AKS clusters
- Set a daily ingestion cap on Log Analytics to avoid bill shock
Related Azure Monitoring Resources
- AWS Monitoring Guide: CloudWatch, X-Ray, and Best Practices
- Kubernetes Monitoring Guide: AKS and Self-Managed Clusters
- OpenTelemetry Guide: Instrument Once, Export Anywhere
- Distributed Tracing Guide: Correlate Requests Across Services
- Cloud Monitoring Guide: Multi-Cloud Observability
- SLA vs SLO vs SLI: Define and Track Service Reliability
- Is Azure Down? Check Azure Status in Real Time