Monitoring¶
Overview¶
Monitoring is the process of collecting, analyzing, and acting on data about the health, performance, and availability of systems and applications. It encompasses three main data types: metrics (numerical measurements over time), logs (event records), and traces (request paths through distributed systems).
Effective monitoring enables proactive problem detection, capacity planning, performance optimization, and incident response. A typical monitoring stack includes data collection agents (exporters, log forwarders), storage backends (time-series databases, log aggregators), and visualization/alerting tools (dashboards, alert managers).
How It Works¶
The Three Pillars of Observability¶
Metrics¶
Metrics are numerical measurements collected at regular intervals that represent the behavior and health of systems. Examples: CPU usage, memory consumption, request latency, disk I/O, network throughput.
Metrics are stored in time-series databases that enable efficient querying and aggregation of historical and real-time data. They are visualized in dashboards, charts, and graphs.
The typical metrics pipeline:
- Exporters (e.g., node_exporter) expose metrics from the system in a standard format
- Collection server (e.g., Prometheus) periodically scrapes metrics from exporters
- Visualization (e.g., Grafana) queries the collection server and renders dashboards
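The pipeline above is wired together in the Prometheus server's configuration. A minimal sketch (hostnames are placeholders; node_exporter's default port 9100 is assumed):

```yaml
# prometheus.yml — minimal scrape configuration sketch
global:
  scrape_interval: 15s        # how often Prometheus scrapes each target

scrape_configs:
  - job_name: node            # one job covering all node_exporter instances
    static_configs:
      - targets:
          - "server1.example.com:9100"   # placeholder hostnames
          - "server2.example.com:9100"
```

Each target listed here is scraped every 15 seconds, and the resulting samples are stored with a `job="node"` and `instance="<target>"` label.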
Logs¶
Logs are timestamped event records that capture what happened in a system. Unlike metrics (which are numeric), logs are text-based and capture contextual information: error messages, access records, configuration changes, security events.
The typical logging pipeline:
- Applications and services write logs to stdout, files, or syslog
- Log forwarders (e.g., rsyslog, Promtail) collect and ship logs to a central location
- Log aggregators (e.g., Loki, Elasticsearch) store and index logs for searching
- Visualization tools (e.g., Grafana) provide search and analysis interfaces
Traces¶
Traces follow the path of a request through a distributed system. While less relevant for traditional sysadmin work, they are critical for debugging microservices architectures.
Metrics Collection with Prometheus¶
Prometheus is an open-source monitoring system that uses a pull model — it periodically scrapes (fetches) metrics from configured targets.
The workflow:
- node_exporter runs on each machine, exposing OS-level metrics (CPU, memory, disk, network) on port 9100 in Prometheus format
- Prometheus server is configured with a list of targets to scrape at regular intervals
- Metrics are stored in Prometheus's built-in time-series database
- Grafana connects to Prometheus as a data source and renders dashboards with graphs, gauges, and alerts
Prometheus metrics are exposed as plain text in a specific format:
```
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_memory_MemAvailable_bytes 4294967296
node_filesystem_avail_bytes{mountpoint="/"} 10737418240
```
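Once scraped, these metrics are queried with PromQL. Two illustrative queries over the sample metrics above (a sketch; label names follow node_exporter's conventions):

```promql
# Fraction of CPU time each core spent busy over the last 5 minutes
1 - rate(node_cpu_seconds_total{mode="idle"}[5m])

# Available memory converted from bytes to GiB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
```

These can be run in the Prometheus web UI (port 9090) or used as panel queries in Grafana.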
Log Management with rsyslog¶
rsyslog is the standard log daemon on most Linux distributions. It handles receiving, processing, and forwarding log messages.
Key capabilities:
- Accept logs from local applications via the system journal
- Receive logs from remote servers over TCP or UDP on port 514
- Forward logs to remote log aggregation servers
- Write logs to files with customizable templates and filters
Configuration for centralized logging involves:
- Enable the TCP input module (imtcp) on the receiving server
- Define templates that organize received logs by hostname into separate files
- Configure sending machines to forward their logs to the central server
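The steps above can be sketched in rsyslog's RainerScript syntax. On the receiving server (the directory layout and template name are illustrative):

```
# /etc/rsyslog.d/central.conf — receiver sketch
module(load="imtcp")                 # enable the TCP input module
input(type="imtcp" port="514")       # listen for remote logs on 514/tcp

# Write each sender's logs to its own file, keyed by hostname
template(name="PerHostFile" type="string"
         string="/var/log/remote/%HOSTNAME%/syslog.log")
if $fromhost-ip != "127.0.0.1" then action(type="omfile" dynaFile="PerHostFile")
```

And on each sending machine (the central server's hostname is a placeholder):

```
# /etc/rsyslog.d/forward.conf — sender sketch
*.* action(type="omfwd" target="logs.example.com" port="514" protocol="tcp")
```

The `*.*` selector forwards every facility and severity; it can be narrowed (e.g. `auth.*`) if only a subset should leave the machine.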
Centralized Logging with Loki¶
Loki is a log aggregation system designed to work with Grafana. Combined with Promtail (a log shipping agent), it provides a modern alternative to rsyslog for centralized log management.
The stack:
- Promtail — runs on each machine, tails log files, and ships entries to Loki
- Loki — stores and indexes logs (labels-based indexing, not full-text)
- Grafana — provides a unified interface for querying both metrics (Prometheus) and logs (Loki)
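A minimal Promtail configuration sketch tying this stack together (the Loki URL and labels are placeholders):

```yaml
# promtail-config.yaml — minimal sketch
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml     # where Promtail records how far it has read

clients:
  - url: http://loki.example.com:3100/loki/api/v1/push   # placeholder Loki endpoint

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs              # label used for querying in Grafana/LogQL
          __path__: /var/log/*.log  # glob of files to tail
```

In Grafana, the shipped entries can then be queried with LogQL, e.g. `{job="varlogs"}`.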
Visualization with Grafana¶
Grafana is an open-source visualization platform that connects to multiple data sources (Prometheus, Loki, InfluxDB, etc.) and renders customizable dashboards.
Key features:
- Pre-built dashboard templates (importable by ID from grafana.com)
- Custom queries using PromQL (for Prometheus) or LogQL (for Loki)
- Alerting rules that notify when metrics cross thresholds
- Multi-data-source dashboards combining metrics and logs
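Data sources can be added through the UI or provisioned from files. A provisioning sketch covering both backends described above (URLs are placeholders):

```yaml
# /etc/grafana/provisioning/datasources/monitoring.yaml — sketch
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus.example.com:9090
    isDefault: true               # default source for new panels
  - name: Loki
    type: loki
    url: http://loki.example.com:3100
```

Provisioning keeps data-source definitions in version control instead of clicking through the UI on each install.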
Key Terminology¶
- Time-Series Database
- A database optimized for storing and querying timestamped numerical data (metrics).
- Exporter
- A program that exposes metrics in a format that Prometheus can scrape (e.g., node_exporter for OS metrics).
- Scraping
- Prometheus's pull-based approach to collecting metrics — it periodically fetches data from configured targets.
- PromQL
- Prometheus Query Language — used to query and aggregate metrics data.
- Dashboard
- A visual display of metrics and logs, typically showing graphs, gauges, and tables in Grafana.
- SLA (Service Level Agreement)
- A commitment to maintain a certain level of service quality. Monitoring data is used to measure and enforce SLAs.
- Alert
- A notification triggered when a metric crosses a defined threshold (e.g., CPU > 90% for 5 minutes).
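The CPU example above maps directly onto a Prometheus alerting rule. A sketch (group and alert names are illustrative):

```yaml
# alert-rules.yaml — sketch of the "CPU > 90% for 5 minutes" example
groups:
  - name: host-alerts
    rules:
      - alert: HighCPU
        # Average busy fraction per host, derived from node_exporter's idle counter
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
        for: 5m                   # condition must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "CPU above 90% on {{ $labels.instance }}"
```

The `for:` clause is what prevents brief spikes from paging anyone, which ties directly into the alert-fatigue pitfall discussed below.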
Common Ports and Protocols¶
| Port | Protocol | Purpose |
|---|---|---|
| 514 | TCP/UDP | rsyslog — remote log reception |
| 9090 | TCP | Prometheus — web UI and API |
| 9100 | TCP | Node Exporter — host metrics |
| 3000 | TCP | Grafana — dashboards and visualization |
| 3100 | TCP | Loki — log aggregation API |
In a Kubernetes deployment, these are typically exposed via NodePort services (e.g. 30909, 30910, 30000, 30310).
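For illustration, exposing one of these components via NodePort might look like the following sketch (the selector, namespace, and the choice of 30000 for Grafana are assumptions, not part of any particular deployment):

```yaml
# grafana-service.yaml — NodePort sketch
apiVersion: v1
kind: Service
metadata:
  name: grafana
spec:
  type: NodePort
  selector:
    app: grafana        # assumed pod label
  ports:
    - port: 3000        # Grafana's default port inside the cluster
      targetPort: 3000
      nodePort: 30000   # externally reachable on every node at this port
```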
Why It Matters¶
As a system administrator, monitoring allows you to:
- Detect problems proactively before users are affected
- Debug issues by correlating metrics spikes with log entries
- Plan capacity by observing resource usage trends over time
- Meet SLAs by measuring and reporting service availability
- Perform root cause analysis after incidents using historical data
- Centralize visibility across multiple machines and services
Common Pitfalls¶
- No monitoring until something breaks — monitoring should be set up from the start, not after an incident.
- Alert fatigue — too many alerts desensitize operators. Only alert on actionable conditions.
- Not centralizing logs — checking logs on individual machines doesn't scale. Use rsyslog or Loki to aggregate.
- Monitoring only infrastructure — OS metrics matter, but application-level metrics (request rate, error rate, latency) are equally important.
- Not retaining enough history — short retention periods prevent trend analysis and capacity planning.
- Ignoring security of monitoring systems — Prometheus and Grafana endpoints should not be publicly accessible without authentication.
Further Reading¶
Related Documentation¶
- Technologies: Prometheus, Grafana, rsyslog
- SOPs: Monitoring Setup