Week 11 - Monitoring¶
Topic¶
Building a full observability stack on top of the Docker Compose environment from lab 10: host metrics with Prometheus and node_exporter, log aggregation with Loki and Promtail, visualization dashboards in Grafana, and distributed tracing with a central Jaeger instance.
Company Requests¶
Ticket #1101: Host metrics monitoring
"Operations needs visibility into the health of each VM — CPU, memory, disk, network. Deploy Prometheus and node_exporter and expose them so we can monitor them from the scoring server."
Ticket #1102: Centralized logs and dashboards
"Log files on crashed VMs are inaccessible. We need logs centralized somewhere we can always reach them. Deploy Loki to store system logs, Promtail to ship them, and Grafana to visualize both metrics and logs in one place."
Ticket #1103: Distributed tracing for the inventory API
"The dev team wants to understand where time is spent on each inventory API request. The API has been instrumented with OpenTelemetry; configure it to push traces to the central Jaeger instance at `jaeger.sysadm.ee` and investigate the trace data."
Port access required
The scoring server makes direct HTTP connections to your VM on ports 9100, 9090, and 3100. All three must be open in firewalld. Grafana is checked via HTTPS through your Apache reverse proxy.
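Opening the scored ports is a one-off firewalld change. A minimal sketch, assuming the default zone as in earlier labs (adjust `--zone=...` if your setup differs):

```shell
# Open the three scored ports permanently, then reload so they take effect.
# Assumes the default firewalld zone; an assumption, not from the lab text.
sudo firewall-cmd --permanent --add-port=9100/tcp   # node_exporter
sudo firewall-cmd --permanent --add-port=9090/tcp   # Prometheus
sudo firewall-cmd --permanent --add-port=3100/tcp   # Loki
sudo firewall-cmd --reload

# Verify all three appear in the active configuration:
sudo firewall-cmd --list-ports
```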
Scoring Checks¶
- Check 11.1: node_exporter is running on port 9100
    - Method: HTTP `GET http://<vm-ip>:9100/metrics`
    - Expected: HTTP 200 with OpenMetrics content (`# HELP` lines)
- Check 11.2: Prometheus is healthy and scraping node_exporter
    - Method: Health check at port 9090; then `/api/v1/targets`; the node_exporter target must be `UP`
    - Expected: Prometheus healthy; target UP. WARNING if healthy but target not yet UP.
- Check 11.3: Loki is ready and Promtail is shipping logs
    - Method: `GET http://<vm-ip>:3100/ready`; then Loki query API for `{job="varlogs"}`
    - Expected: `ready`; at least one log entry. WARNING if Loki ready but no logs yet.
- Check 11.4: Grafana is running and accessible via HTTPS
    - Method: `GET https://grafana.<vm_name>.sysadm.ee/api/health`
    - Expected: HTTP 200 with `{"database":"ok"}`
- Check 11.5: Inventory API traces are appearing in central Jaeger
    - Method: Scoring server generates a request to your inventory API, waits 2 s, then queries Jaeger for the service `inventory.<vm_name>.sysadm.ee`
    - Expected: Service name present. WARNING if API responds but traces not in Jaeger.
- Check 11.6: Slow span identified correctly
    - Method: SSH to VM; read `/home/centos/lab11/slow_handler.txt`; compare against known span name (case-insensitive)
    - Expected: File exists and content matches
Tasks¶
Task 1: Deploy Prometheus and node_exporter¶
Operations needs host-level metrics from every VM. Deploy Prometheus and node_exporter as Docker Compose services so the scoring server can scrape them directly.
Complete
- Create `prometheus.yml` configuring Prometheus to scrape node_exporter. See Technologies: Prometheus for the configuration format and the Docker Compose service definitions.
- Create a `docker-compose.yml` with the `node-exporter` and `prometheus` services. node_exporter requires read-only host filesystem mounts; Prometheus needs a named volume for storage persistence.
- Start the services and open ports 9100 and 9090 in the firewall.
- Verify in the Prometheus web UI (Status → Targets) that the `node-exporter` target is `UP`.
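A minimal sketch of the scrape configuration, assuming the Compose service is named `node-exporter` so Compose DNS resolves it inside the network. The job name and scrape interval here are illustrative; see Technologies: Prometheus for the authoritative format.

```shell
# Write a minimal prometheus.yml. Inside the Compose network the
# Compose service name works as a hostname, so the target is
# node-exporter:9100 rather than an IP address.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter:9100']
EOF
```

Mount this file into the `prometheus` container at `/etc/prometheus/prometheus.yml` (read-only) in your Compose service definition.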
Reference: Technologies: Prometheus, SOP: Monitoring Setup — Deploy Prometheus and node_exporter
Task 2: Deploy Loki and Promtail¶
Centralize system logs so they remain accessible even if a VM becomes unreachable. Promtail will tail the main system log files and ship them to Loki.
Complete
- Create `loki-config.yaml` and `promtail-config.yaml`. See Technologies: Loki and Technologies: Promtail for the minimal working configurations.
- Add `loki` and `promtail` services to `docker-compose.yml`. Promtail needs the host's `/var/log` mounted read-only and should declare a `depends_on` on Loki.
- Start the services and open port 3100 in the firewall.
- Verify Loki is ready and that Promtail has started shipping logs from `/var/log/messages`.
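The scoring query targets `{job="varlogs"}`, so Promtail's scrape config must attach exactly that label. A minimal sketch, assuming the Loki Compose service is named `loki` and the host's `/var/log` is mounted at the same path inside the container; see Technologies: Promtail for the full reference configuration.

```shell
# Minimal promtail-config.yaml: tail /var/log/messages and push entries
# to the loki service over the Compose network, labelled job=varlogs
# so the scoring query {job="varlogs"} matches them.
cat > promtail-config.yaml <<'EOF'
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: varlogs
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/messages
EOF
```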
Reference: Technologies: Loki, Technologies: Promtail, SOP: Monitoring Setup — Deploy Loki and Promtail
Task 3: Deploy Grafana and connect data sources¶
Add Grafana as the single pane of glass for both metrics and logs. Connect it to Prometheus and Loki, and import pre-built dashboards so operations can start using it immediately.
Complete
- Add a `grafana` service to `docker-compose.yml` with a named volume for persistence. Bind the port to `127.0.0.1:3000:3000`; Grafana must not be directly exposed.
- Add a `grafana` DNS record and configure an Apache HTTPS reverse proxy vhost. See Technologies: Grafana for the exact vhost blocks.
- Add Prometheus as a data source. The URL must use the Docker Compose service name as the hostname.
- Import dashboard ID `1860` (Node Exporter Full).
- Add Loki as a data source using the same approach.
- Import dashboard ID `13639` (Logs App).
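The data sources can be added in the UI as the steps describe, or provisioned from a file; either way the URLs must use the Compose service names. A sketch of the file-based approach (the directory layout and data-source names below are illustrative; mount the directory under Grafana's standard `/etc/grafana/provisioning/datasources/` path):

```shell
# Provision both data sources declaratively instead of clicking through
# the UI. The URLs use Compose service names, which resolve inside the
# Compose network; localhost would not work from the Grafana container.
mkdir -p grafana-provisioning/datasources
cat > grafana-provisioning/datasources/datasources.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
EOF
```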
Reference: Technologies: Grafana, SOP: Monitoring Setup — Deploy Grafana and Connect Data Sources
Task 4: Configure inventory API tracing and investigate Jaeger¶
The inventory API has been updated with OpenTelemetry instrumentation. Configure it to push traces to the central Jaeger instance, then use the Jaeger UI to investigate the performance profile of a request.
Complete
- Pull the latest version of the inventory API source and rebuild the Docker image.
- Update the `inventory` service in `docker-compose.yml` to use the new image and add the three OTEL environment variables described in Technologies: Jaeger. The service name must be `inventory.<vm_name>.sysadm.ee`.
- Restart the inventory service, send a test request, then open the Jaeger UI at `https://jaeger.sysadm.ee` and verify that your service appears in the service list.
- Open any trace for your service. In the waterfall view, one span takes several hundred milliseconds while all others are in the single-digit millisecond range. Write the operation name of that slow span to `/home/centos/lab11/slow_handler.txt`.
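The three variables themselves are defined in Technologies: Jaeger; the names and endpoint below are an assumption based on the standard OpenTelemetry environment variables for an OTLP exporter, so verify them against that page. `<vm_name>` stays a placeholder for your own VM name.

```shell
# Illustrative OTEL settings for the inventory service (assumed standard
# OpenTelemetry variable names and OTLP/HTTP port; confirm the actual
# three variables and endpoint in Technologies: Jaeger).
cat > otel.env <<'EOF'
OTEL_SERVICE_NAME=inventory.<vm_name>.sysadm.ee
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger.sysadm.ee:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
EOF
```

Reference the file from the `inventory` service with `env_file: otel.env`, or inline the same three variables under `environment:`.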
Reference: Technologies: Jaeger, SOP: Monitoring Setup — Configure Inventory API Tracing
Ansible Tips¶
This week there is nothing new for Ansible. Everything that can be automated (creating/templating the Compose files, configuring Apache and DNS for Grafana, opening firewall ports) you already know how to do.