# Building a Monitoring Stack
## Introduction
If you can’t measure it, you can’t improve it. Every production system needs observability – the ability to understand what’s happening inside your services from the outside. In this post, we’ll walk through the key pieces of a monitoring stack: querying metrics with SQL, collecting them with Python, visualising the architecture, and presenting status dashboards with inline HTML.
Whether you’re running a handful of microservices or a full-blown platform, the fundamentals are the same. Let’s build it up from scratch.
## Querying Service Health with SQL
Before you build fancy dashboards, it helps to understand the raw data. Suppose you have a `service_metrics` table that receives health check pings every minute. Here’s a query that identifies services with degraded availability over the past 24 hours:
```sql
WITH hourly_health AS (
    SELECT
        service_name,
        DATE_TRUNC('hour', recorded_at) AS hour,
        COUNT(*) AS total_checks,
        COUNT(*) FILTER (WHERE status = 'healthy') AS healthy_checks,
        ROUND(
            100.0 * COUNT(*) FILTER (WHERE status = 'healthy') / COUNT(*),
            2
        ) AS uptime_pct
    FROM service_metrics
    WHERE recorded_at >= NOW() - INTERVAL '24 hours'
    GROUP BY service_name, DATE_TRUNC('hour', recorded_at)
),
service_summary AS (
    SELECT
        service_name,
        MIN(uptime_pct) AS min_uptime,
        AVG(uptime_pct) AS avg_uptime,
        COUNT(*) FILTER (WHERE uptime_pct < 99.9) AS degraded_hours
    FROM hourly_health
    GROUP BY service_name
)
SELECT
    service_name,
    ROUND(avg_uptime, 2) AS avg_uptime_pct,
    min_uptime AS worst_hour_pct,
    degraded_hours,
    CASE
        WHEN avg_uptime >= 99.9 THEN 'HEALTHY'
        WHEN avg_uptime >= 99.0 THEN 'DEGRADED'
        ELSE 'CRITICAL'
    END AS status
FROM service_summary
ORDER BY avg_uptime ASC;
```
This gives you a quick triage view. Services at the top of the list are the ones that need attention first. The `degraded_hours` column is particularly useful for distinguishing between a brief blip and a sustained problem.
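For reference, a minimal schema that would support this query (and the collector below) might look like the following. Column types are my assumptions, chosen to match the fields the query and the insert statement use; adapt them to your setup:

```sql
CREATE TABLE service_metrics (
    id           BIGSERIAL PRIMARY KEY,
    service_name TEXT        NOT NULL,
    status       TEXT        NOT NULL,   -- 'healthy' | 'unhealthy'
    response_ms  NUMERIC,                -- NULL when the probe failed outright
    status_code  INTEGER,
    recorded_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- The 24-hour window filter benefits from an index on (service_name, recorded_at).
CREATE INDEX idx_service_metrics_name_time
    ON service_metrics (service_name, recorded_at);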
## Collecting Metrics with a Python Agent
Dashboards are only as good as the data feeding them. Here’s a lightweight metrics collector that polls service endpoints, records latency and status, and ships the results to a Postgres sink:
```python
import logging
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone

import httpx
import psycopg2
from psycopg2.extras import execute_values

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class ServiceEndpoint:
    name: str
    url: str
    timeout: float = 5.0


@dataclass
class MetricsCollector:
    dsn: str
    endpoints: list[ServiceEndpoint] = field(default_factory=list)
    interval: int = 60

    def probe(self, endpoint: ServiceEndpoint) -> dict:
        """Send a health check request and measure response time."""
        try:
            start = time.monotonic()
            resp = httpx.get(endpoint.url, timeout=endpoint.timeout)
            elapsed_ms = round((time.monotonic() - start) * 1000, 2)
            return {
                "service_name": endpoint.name,
                "status": "healthy" if resp.status_code < 500 else "unhealthy",
                "response_ms": elapsed_ms,
                "status_code": resp.status_code,
                "recorded_at": datetime.now(timezone.utc),
            }
        except httpx.RequestError as exc:
            logger.warning("Probe failed for %s: %s", endpoint.name, exc)
            return {
                "service_name": endpoint.name,
                "status": "unhealthy",
                "response_ms": None,
                "status_code": None,
                "recorded_at": datetime.now(timezone.utc),
            }

    def collect(self) -> list[dict]:
        """Probe all endpoints and return a batch of results."""
        return [self.probe(ep) for ep in self.endpoints]

    def store(self, results: list[dict]) -> int:
        """Persist metric results to Postgres."""
        rows = [
            (r["service_name"], r["status"], r["response_ms"],
             r["status_code"], r["recorded_at"])
            for r in results
        ]
        insert_sql = """
            INSERT INTO service_metrics
                (service_name, status, response_ms, status_code, recorded_at)
            VALUES %s
        """
        # psycopg2's connection context manager commits the transaction but
        # does NOT close the connection, so close it explicitly to avoid
        # leaking one connection per collection cycle.
        conn = psycopg2.connect(self.dsn)
        try:
            with conn, conn.cursor() as cur:
                execute_values(cur, insert_sql, rows)
        finally:
            conn.close()
        return len(rows)

    def run(self):
        """Main loop: collect and store metrics at a fixed interval."""
        logger.info("Starting collector with %d endpoints", len(self.endpoints))
        while True:
            results = self.collect()
            count = self.store(results)
            healthy = sum(1 for r in results if r["status"] == "healthy")
            logger.info(
                "Stored %d metrics (%d/%d healthy)",
                count, healthy, len(results),
            )
            time.sleep(self.interval)


if __name__ == "__main__":
    endpoints = [
        ServiceEndpoint("api-gateway", "https://api.example.com/health"),
        ServiceEndpoint("auth-service", "https://auth.example.com/health"),
        ServiceEndpoint("payments", "https://payments.example.com/health"),
        ServiceEndpoint("notifications", "https://notify.example.com/health"),
    ]
    collector = MetricsCollector(
        dsn="postgresql://monitor:pass@localhost:5432/monitoring_db",
        endpoints=endpoints,
        interval=60,
    )
    collector.run()
```
The collector is intentionally simple – probe, record, sleep, repeat. In production you’d add retries, circuit breaking, and maybe push to a time-series database like TimescaleDB or InfluxDB instead. But the core loop stays the same.
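As a sketch of what that hardening could look like, here is one way to add bounded retries with exponential backoff around a probe. The helper name and parameters are illustrative, not part of any particular library:

```python
import time


def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying up to `attempts` times with exponential backoff.

    Returns fn()'s result on the first success; re-raises the last
    exception if every attempt fails.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # in practice, narrow this to httpx.RequestError
            last_exc = exc
            if attempt < attempts - 1:
                # 0.5s, 1s, 2s, ... between successive attempts
                time.sleep(base_delay * (2 ** attempt))
    raise last_exc
```

You would then wrap each probe as `with_retries(lambda: collector.probe(endpoint))`, turning a transient network blip into a retried call instead of a recorded failure.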
## Monitoring Architecture
Here’s how these pieces fit together in a typical observability setup:
```mermaid
graph TD
    A[Microservices] -->|Health checks| B[Metrics Collector]
    A -->|Logs| C[Log Aggregator]
    A -->|Traces| D[Trace Collector]
    B --> E[(Metrics DB)]
    C --> F[(Log Store)]
    D --> G[(Trace Store)]
    E --> H[Grafana]
    F --> H
    G --> H
    H --> I[Alerting]
    I --> J[PagerDuty / Slack]
    style A fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style B fill:#f5a623,stroke:#c17d1a,color:#fff
    style C fill:#f5a623,stroke:#c17d1a,color:#fff
    style D fill:#f5a623,stroke:#c17d1a,color:#fff
    style E fill:#7ed321,stroke:#5a9a18,color:#fff
    style F fill:#7ed321,stroke:#5a9a18,color:#fff
    style G fill:#7ed321,stroke:#5a9a18,color:#fff
    style H fill:#bd10e0,stroke:#8a0ba3,color:#fff
    style I fill:#d0021b,stroke:#9b0116,color:#fff
```
And a sequence diagram showing the alert flow when a service goes down:
```mermaid
sequenceDiagram
    participant Collector
    participant MetricsDB
    participant AlertEngine
    participant Slack
    participant OnCall
    Collector->>MetricsDB: INSERT unhealthy check
    MetricsDB->>AlertEngine: Threshold breached (3 consecutive failures)
    AlertEngine->>AlertEngine: Evaluate routing rules
    AlertEngine->>Slack: Post #incidents alert
    AlertEngine->>OnCall: Page on-call engineer
    OnCall-->>Slack: Acknowledge incident
    OnCall->>OnCall: Investigate & mitigate
    OnCall-->>Slack: Post resolution update
```
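The "3 consecutive failures" threshold in the diagram is simple to express in code. Here is a minimal sketch of such a rule; the function name and signature are my own, not taken from any particular alerting engine:

```python
def should_alert(statuses, threshold=3):
    """Return True if the most recent `threshold` checks are all unhealthy.

    `statuses` is a list of status strings ordered oldest to newest,
    e.g. ["healthy", "unhealthy", "unhealthy", "unhealthy"].
    """
    if len(statuses) < threshold:
        return False
    return all(s == "unhealthy" for s in statuses[-threshold:])
```

Requiring consecutive failures rather than firing on a single bad check is what separates a real incident from a one-off timeout.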
## Service Status Dashboard
A status page doesn’t need to be complicated. Here’s a quick visual using inline HTML that you could embed in any markdown-friendly page:
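A minimal sketch of such a status strip (service names, colours, and statuses are placeholder values, loosely matching the endpoints probed earlier):

```html
<div style="display: flex; gap: 12px; font-family: sans-serif;">
  <div style="padding: 8px 16px; border-radius: 6px; background: #e6f4ea;">
    <strong>api-gateway</strong><br>
    <span style="color: #1e7e34;">&#9679; Operational</span>
  </div>
  <div style="padding: 8px 16px; border-radius: 6px; background: #e6f4ea;">
    <strong>auth-service</strong><br>
    <span style="color: #1e7e34;">&#9679; Operational</span>
  </div>
  <div style="padding: 8px 16px; border-radius: 6px; background: #fdecea;">
    <strong>payments</strong><br>
    <span style="color: #c62828;">&#9679; Degraded</span>
  </div>
</div>
```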
And a comparison of popular monitoring tools:
| Tool | Best For | Data Model | Self-Hosted |
|---|---|---|---|
| Prometheus | Metrics & alerting | Time-series (pull) | Yes |
| Grafana Loki | Log aggregation | Log streams (push) | Yes |
| Jaeger | Distributed tracing | Spans & traces | Yes |
| Datadog | All-in-one SaaS | Metrics, logs, traces | No |
## Wrapping Up
Monitoring is not a feature you bolt on at the end – it’s infrastructure you build from day one. We covered how to query service health from raw metrics, how to collect those metrics with a Python agent, how the observability stack fits together architecturally, and how to present status information visually.
The three pillars of observability – metrics, logs, and traces – each tell a different part of the story. Metrics tell you something is wrong, logs tell you what went wrong, and traces tell you where it went wrong. Use all three.
Ship the dashboard. Set up the alerts. Sleep better at night.