Monitoring¶
Overview¶
Monitoring is the process of collecting, analyzing, and acting on data about the health, performance, and availability of systems and applications. It encompasses three main data types: metrics (numerical measurements over time), logs (event records), and traces (request paths through distributed systems).
Effective monitoring enables proactive problem detection, capacity planning, performance optimization, and incident response. A typical monitoring stack includes data collection agents (exporters, log forwarders), storage backends (time-series databases, log aggregators), and visualization/alerting tools (dashboards, alert managers).
How It Works¶
The Three Pillars of Observability¶
Metrics¶
Metrics are numerical measurements collected at regular intervals that represent the behavior and health of systems. Examples: CPU usage, memory consumption, request latency, disk I/O, network throughput.
Metrics are stored in time-series databases that enable efficient querying and aggregation of historical and real-time data. They are visualized in dashboards, charts, and graphs.
The typical metrics pipeline:
- Exporters (e.g., node_exporter) expose metrics from the system in a standard format
- Collection server (e.g., Prometheus) periodically scrapes metrics from exporters
- Visualization (e.g., Grafana) queries the collection server and renders dashboards
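The pipeline above is wired together in the Prometheus server's configuration. A minimal sketch (hostnames are placeholders; node_exporter's default port 9100 is assumed):

```yaml
# prometheus.yml — minimal scrape configuration sketch
global:
  scrape_interval: 15s        # how often Prometheus scrapes each target

scrape_configs:
  - job_name: node            # one job covering all node_exporter instances
    static_configs:
      - targets:
          - "server1.example.com:9100"   # placeholder hostnames
          - "server2.example.com:9100"
```

Each target listed here is scraped every 15 seconds, and the resulting samples are stored with a `job="node"` and `instance="<target>"` label.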
Logs¶
Logs are timestamped event records that capture what happened in a system. Unlike metrics (which are numeric), logs are text-based and capture contextual information: error messages, access records, configuration changes, security events.
The typical logging pipeline:
- Applications and services write logs to stdout, files, or syslog
- Log forwarders (e.g., rsyslog, Promtail) collect and ship logs to a central location
- Log aggregators (e.g., Loki, Elasticsearch) store and index logs for searching
- Visualization tools (e.g., Grafana) provide search and analysis interfaces
Traces¶
Traces follow the path of a request through a distributed system. While less relevant for traditional sysadmin work, they are critical for debugging microservices architectures.
Metrics Collection with Prometheus¶
Prometheus is an open-source monitoring system that uses a pull model — it periodically scrapes (fetches) metrics from configured targets.
The workflow:
- node_exporter runs on each machine, exposing OS-level metrics (CPU, memory, disk, network) on port 9100 in Prometheus format
- Prometheus server is configured with a list of targets to scrape at regular intervals
- Metrics are stored in Prometheus's built-in time-series database
- Grafana connects to Prometheus as a data source and renders dashboards with graphs, gauges, and alerts
Prometheus metrics are exposed as plain text in a specific format:
```
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_memory_MemAvailable_bytes 4294967296
node_filesystem_avail_bytes{mountpoint="/"} 10737418240
```
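Once scraped, these metrics are queried with PromQL. Two illustrative queries over the sample metrics above (a sketch; label names follow node_exporter's conventions):

```promql
# Fraction of CPU time each core spent busy over the last 5 minutes
1 - rate(node_cpu_seconds_total{mode="idle"}[5m])

# Available memory converted from bytes to GiB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
```

These can be run in the Prometheus web UI (port 9090) or used as panel queries in Grafana.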
Log Management with rsyslog¶
rsyslog is the standard log daemon on most Linux distributions. It handles receiving, processing, and forwarding log messages.
Key capabilities:
- Accept logs from local applications via the system journal
- Receive logs from remote servers over TCP or UDP on port 514
- Forward logs to remote log aggregation servers
- Write logs to files with customizable templates and filters
Configuration for centralized logging involves:
- Enable the TCP input module (imtcp) on the receiving server
- Define templates that organize received logs by hostname into separate files
- Configure sending machines to forward their logs to the central server
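The steps above can be sketched in rsyslog's RainerScript syntax. On the receiving server (the directory layout and template name are illustrative):

```
# /etc/rsyslog.d/central.conf — receiver sketch
module(load="imtcp")                 # enable the TCP input module
input(type="imtcp" port="514")       # listen for remote logs on 514/tcp

# Write each sender's logs to its own file, keyed by hostname
template(name="PerHostFile" type="string"
         string="/var/log/remote/%HOSTNAME%/syslog.log")
if $fromhost-ip != "127.0.0.1" then action(type="omfile" dynaFile="PerHostFile")
```

And on each sending machine (the central server's hostname is a placeholder):

```
# /etc/rsyslog.d/forward.conf — sender sketch
*.* action(type="omfwd" target="logs.example.com" port="514" protocol="tcp")
```

The `*.*` selector forwards every facility and severity; it can be narrowed (e.g. `auth.*`) if only a subset should leave the machine.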
Centralized Logging with Loki¶
Loki is a log aggregation system designed to work with Grafana. Combined with Promtail (a log shipping agent), it provides a modern alternative to rsyslog for centralized log management.
The stack:
- Promtail — runs on each machine, tails log files, and ships entries to Loki
- Loki — stores and indexes logs (labels-based indexing, not full-text)
- Grafana — provides a unified interface for querying both metrics (Prometheus) and logs (Loki)
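A minimal Promtail configuration sketch tying this stack together (the Loki URL and labels are placeholders):

```yaml
# promtail-config.yaml — minimal sketch
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml     # where Promtail records how far it has read

clients:
  - url: http://loki.example.com:3100/loki/api/v1/push   # placeholder Loki endpoint

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs              # label used for querying in Grafana/LogQL
          __path__: /var/log/*.log  # glob of files to tail
```

In Grafana, the shipped entries can then be queried with LogQL, e.g. `{job="varlogs"}`.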
Visualization with Grafana¶
Grafana is an open-source visualization platform that connects to multiple data sources (Prometheus, Loki, InfluxDB, etc.) and renders customizable dashboards.
Key features:
- Pre-built dashboard templates (importable by ID from grafana.com)
- Custom queries using PromQL (for Prometheus) or LogQL (for Loki)
- Alerting rules that notify when metrics cross thresholds
- Multi-data-source dashboards combining metrics and logs
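Data sources can be added through the UI or provisioned from files. A provisioning sketch covering both backends described above (URLs are placeholders):

```yaml
# /etc/grafana/provisioning/datasources/monitoring.yaml — sketch
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus.example.com:9090
    isDefault: true               # default source for new panels
  - name: Loki
    type: loki
    url: http://loki.example.com:3100
```

Provisioning keeps data-source definitions in version control instead of clicking through the UI on each install.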
Key Terminology¶
- Time-Series Database
- A database optimized for storing and querying timestamped numerical data (metrics).
- Exporter
- A program that exposes metrics in a format that Prometheus can scrape (e.g., node_exporter for OS metrics).
- Scraping
- Prometheus's pull-based approach to collecting metrics — it periodically fetches data from configured targets.
- PromQL
- Prometheus Query Language — used to query and aggregate metrics data.
- Dashboard
- A visual display of metrics and logs, typically showing graphs, gauges, and tables in Grafana.
- SLA (Service Level Agreement)
- A commitment to maintain a certain level of service quality. Monitoring data is used to measure and enforce SLAs.
- Alert
- A notification triggered when a metric crosses a defined threshold (e.g., CPU > 90% for 5 minutes).
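The CPU example above maps directly onto a Prometheus alerting rule. A sketch (group and alert names are illustrative):

```yaml
# alert-rules.yaml — sketch of the "CPU > 90% for 5 minutes" example
groups:
  - name: host-alerts
    rules:
      - alert: HighCPU
        # Average busy fraction per host, derived from node_exporter's idle counter
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
        for: 5m                   # condition must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "CPU above 90% on {{ $labels.instance }}"
```

The `for:` clause is what prevents brief spikes from paging anyone, which ties directly into the alert-fatigue pitfall discussed below.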
Common Ports and Protocols¶
| Port | Protocol | Purpose |
|---|---|---|
| 514 | TCP/UDP | rsyslog — remote log reception |
| 9090 | TCP | Prometheus — web UI and API |
| 9100 | TCP | Node Exporter — host metrics |
| 3000 | TCP | Grafana — dashboards and visualization |
| 3100 | TCP | Loki — log aggregation API |
In a Kubernetes deployment, these are typically exposed via NodePort services (e.g. 30909, 30910, 30000, 30310).
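For illustration, exposing one of these components via NodePort might look like the following sketch (the selector, namespace, and the choice of 30000 for Grafana are assumptions, not part of any particular deployment):

```yaml
# grafana-service.yaml — NodePort sketch
apiVersion: v1
kind: Service
metadata:
  name: grafana
spec:
  type: NodePort
  selector:
    app: grafana        # assumed pod label
  ports:
    - port: 3000        # Grafana's default port inside the cluster
      targetPort: 3000
      nodePort: 30000   # externally reachable on every node at this port
```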
Why It Matters¶
As a system administrator, monitoring allows you to:
- Detect problems proactively before users are affected
- Debug issues by correlating metrics spikes with log entries
- Plan capacity by observing resource usage trends over time
- Meet SLAs by measuring and reporting service availability
- Perform root cause analysis after incidents using historical data
- Centralize visibility across multiple machines and services
Common Pitfalls¶
- No monitoring until something breaks — monitoring should be set up from the start, not after an incident.
- Alert fatigue — too many alerts desensitize operators. Only alert on actionable conditions.
- Not centralizing logs — checking logs on individual machines doesn't scale. Use rsyslog or Loki to aggregate.
- Monitoring only infrastructure — OS metrics matter, but application-level metrics (request rate, error rate, latency) are equally important.
- Not retaining enough history — short retention periods prevent trend analysis and capacity planning.
- Ignoring security of monitoring systems — Prometheus and Grafana endpoints should not be publicly accessible without authentication.
Further Reading¶
Related Documentation¶
- Technologies: Prometheus, Grafana, rsyslog
- SOPs: Monitoring Setup