Week 11 - Monitoring¶
Topic¶
Building a full observability stack on top of the Docker Compose environment from lab 10: host metrics with Prometheus and node_exporter, log aggregation with Loki and Promtail, visualization dashboards in Grafana, and distributed tracing with a central Jaeger instance.
Company Requests¶
Ticket #1101: Host metrics monitoring
"Operations needs visibility into the health of each VM — CPU, memory, disk, network. Deploy Prometheus and node_exporter and expose them so we can monitor them from the scoring server."
Ticket #1102: Centralized logs and dashboards
"Log files on crashed VMs are inaccessible. We need logs centralized somewhere we can always reach them. Deploy Loki to store system logs, Promtail to ship them, and Grafana to visualize both metrics and logs in one place."
Ticket #1103: Distributed tracing for the inventory API
"The dev team wants to understand where time is spent on each inventory API request. The API has been instrumented with OpenTelemetry; configure it to push traces to the central Jaeger instance at `jaeger.sysadm.ee` and investigate the trace data."
Port access required
The scoring server makes direct HTTP connections to your VM on ports 9100, 9090, and 3100. All three must be open in firewalld. Grafana is checked via HTTPS through your Apache reverse proxy.
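Opening the scored ports is a one-off firewalld change. A minimal sketch, assuming the default zone as in earlier labs (adjust `--zone=...` if your setup differs):

```shell
# Open the three scored ports permanently, then reload so they take effect.
# Assumes the default firewalld zone; an assumption, not from the lab text.
sudo firewall-cmd --permanent --add-port=9100/tcp   # node_exporter
sudo firewall-cmd --permanent --add-port=9090/tcp   # Prometheus
sudo firewall-cmd --permanent --add-port=3100/tcp   # Loki
sudo firewall-cmd --reload

# Verify all three appear in the active configuration:
sudo firewall-cmd --list-ports
```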
Scoring Checks¶
- Check 11.1: node_exporter is running on port 9100
    - Method: HTTP `GET http://<vm-ip>:9100/metrics`
    - Expected: HTTP 200 with OpenMetrics content (`# HELP` lines)
- Check 11.2: Prometheus is healthy and scraping node_exporter
    - Method: Health check at port 9090; then `/api/v1/targets`; the node_exporter target must be `UP`
    - Expected: Prometheus healthy; target UP. WARNING if healthy but target not yet UP.
- Check 11.3: Loki is ready and Promtail is shipping logs
    - Method: `GET http://<vm-ip>:3100/ready`; then Loki query API for `{job="varlogs"}`
    - Expected: `ready`; at least one log entry. WARNING if Loki ready but no logs yet.
- Check 11.4: Grafana is running and accessible via HTTPS
    - Method: `GET https://grafana.<vm_name>.sysadm.ee/api/health`
    - Expected: HTTP 200 with `{"database":"ok"}`
- Check 11.5: Inventory API traces are appearing in central Jaeger
    - Method: Scoring server generates a request to your inventory API, waits 2 s, then queries Jaeger for the service `inventory.<vm_name>.sysadm.ee`
    - Expected: Service name present. WARNING if API responds but traces not in Jaeger.
- Check 11.6: Slow span identified correctly
    - Method: SSH to VM; read `/home/centos/lab11/slow_handler.txt`; compare against known span name (case-insensitive)
    - Expected: File exists and content matches
Tasks¶
Task 1: Deploy Prometheus and node_exporter¶
Operations needs host-level metrics from every VM. Deploy Prometheus and node_exporter as Docker Compose services so the scoring server can scrape them directly.
Complete
- Create `prometheus.yml` configuring Prometheus to scrape node_exporter. See Technologies: Prometheus for the configuration format and the Docker Compose service definitions.
- Create a `docker-compose.yml` with the `node-exporter` and `prometheus` services. node_exporter requires read-only host filesystem mounts; Prometheus needs a named volume for storage persistence.
- Start the services and open ports 9100 and 9090 in the firewall.
- Verify in the Prometheus web UI (Status → Targets) that the `node-exporter` target is `UP`.
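A minimal sketch of the scrape configuration, assuming the Compose service is named `node-exporter` so Compose DNS resolves it inside the network. The job name and scrape interval here are illustrative; see Technologies: Prometheus for the authoritative format.

```shell
# Write a minimal prometheus.yml. Inside the Compose network the
# Compose service name works as a hostname, so the target is
# node-exporter:9100 rather than an IP address.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter:9100']
EOF
```

Mount this file into the `prometheus` container at `/etc/prometheus/prometheus.yml` (read-only) in your Compose service definition.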
Reference: Technologies: Prometheus, SOP: Monitoring Setup — Deploy Prometheus and node_exporter
Task 2: Deploy Loki and Promtail¶
Centralize system logs so they remain accessible even if a VM becomes unreachable. Promtail will tail the main system log files and ship them to Loki.
Complete
- Create `loki-config.yaml` and `promtail-config.yaml`. See Technologies: Loki and Technologies: Promtail for the minimal working configurations.
- Add `loki` and `promtail` services to `docker-compose.yml`. Promtail needs the host's `/var/log` mounted read-only and should declare a `depends_on` on Loki.
- Start the services and open port 3100 in the firewall.
- Verify Loki is ready and that Promtail has started shipping logs from `/var/log/messages`.
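The scoring query targets `{job="varlogs"}`, so Promtail's scrape config must attach exactly that label. A minimal sketch, assuming the Loki Compose service is named `loki` and the host's `/var/log` is mounted at the same path inside the container; see Technologies: Promtail for the full reference configuration.

```shell
# Minimal promtail-config.yaml: tail /var/log/messages and push entries
# to the loki service over the Compose network, labelled job=varlogs
# so the scoring query {job="varlogs"} matches them.
cat > promtail-config.yaml <<'EOF'
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: varlogs
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/messages
EOF
```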
Reference: Technologies: Loki, Technologies: Promtail, SOP: Monitoring Setup — Deploy Loki and Promtail
Task 3: Deploy Grafana and connect data sources¶
Add Grafana as the single pane of glass for both metrics and logs. Connect it to Prometheus and Loki, and import pre-built dashboards so operations can start using it immediately.
Complete
- Add a `grafana` service to `docker-compose.yml` with a named volume for persistence. Bind the port to `127.0.0.1:3000:3000`; Grafana must not be directly exposed.
- Add a `grafana` DNS record and configure an Apache HTTPS reverse proxy vhost. See Technologies: Grafana for the exact vhost blocks.
- Add Prometheus as a data source. The URL must use the Docker Compose service name as the hostname.
- Import dashboard ID `1860` (Node Exporter Full).
- Add Loki as a data source using the same approach.
- Import dashboard ID `13639` (Logs App).
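The data sources can be added in the UI as the steps describe, or provisioned from a file; either way the URLs must use the Compose service names. A sketch of the file-based approach (the directory layout and data-source names below are illustrative; mount the directory under Grafana's standard `/etc/grafana/provisioning/datasources/` path):

```shell
# Provision both data sources declaratively instead of clicking through
# the UI. The URLs use Compose service names, which resolve inside the
# Compose network; localhost would not work from the Grafana container.
mkdir -p grafana-provisioning/datasources
cat > grafana-provisioning/datasources/datasources.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
EOF
```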
Reference: Technologies: Grafana, SOP: Monitoring Setup — Deploy Grafana and Connect Data Sources
Task 4: Configure inventory API tracing and investigate Jaeger¶
The inventory API has been updated with OpenTelemetry instrumentation. Configure it to push traces to the central Jaeger instance, then use the Jaeger UI to investigate the performance profile of a request.
Complete
- Pull the latest version of the inventory API source and rebuild the Docker image.
- Update the `inventory` service in `docker-compose.yml` to use the new image and add the three OTEL environment variables described in Technologies: Jaeger. The service name must be `inventory.<vm_name>.sysadm.ee`.
- Restart the inventory service, send a test request, then open the Jaeger UI at `https://jaeger.sysadm.ee` and verify that your service appears in the service list.
- Open any trace for your service. In the waterfall view, one span takes several hundred milliseconds while all others are in the single-digit millisecond range. Write the operation name of that slow span to `/home/centos/lab11/slow_handler.txt`.
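The three variables themselves are defined in Technologies: Jaeger; the names and endpoint below are an assumption based on the standard OpenTelemetry environment variables for an OTLP exporter, so verify them against that page. `<vm_name>` stays a placeholder for your own VM name.

```shell
# Illustrative OTEL settings for the inventory service (assumed standard
# OpenTelemetry variable names and OTLP/HTTP port; confirm the actual
# three variables and endpoint in Technologies: Jaeger).
cat > otel.env <<'EOF'
OTEL_SERVICE_NAME=inventory.<vm_name>.sysadm.ee
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger.sysadm.ee:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
EOF
```

Reference the file from the `inventory` service with `env_file: otel.env`, or inline the same three variables under `environment:`.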
Reference: Technologies: Jaeger, SOP: Monitoring Setup — Configure Inventory API Tracing
Ansible Tips¶
This week there is nothing new for Ansible. Everything that can be automated (creating/templating the Compose files, configuring Apache and DNS for Grafana, opening firewall ports) you already know how to do.