# Building a Monitoring Stack
## Introduction
If you can’t measure it, you can’t improve it. Every production system needs observability – the ability to understand what’s happening inside your services from the outside. In this post, we’ll walk through the key pieces of a monitoring stack: querying metrics with SQL, collecting them with Python, visualising the architecture, and presenting status dashboards with inline HTML.
Whether you’re running a handful of microservices or a full-blown platform, the fundamentals are the same. Let’s build it up from scratch.
## Querying Service Health with SQL
Before you build fancy dashboards, it helps to understand the raw data. Suppose you have a `service_metrics` table that receives health check pings every minute. Here’s a query that identifies services with degraded availability over the past 24 hours:
```sql
WITH hourly_health AS (
    SELECT
        service_name,
        DATE_TRUNC('hour', recorded_at) AS hour,
        COUNT(*) AS total_checks,
        COUNT(*) FILTER (WHERE status = 'healthy') AS healthy_checks,
        ROUND(
            100.0 * COUNT(*) FILTER (WHERE status = 'healthy') / COUNT(*),
            2
        ) AS uptime_pct
    FROM service_metrics
    WHERE recorded_at >= NOW() - INTERVAL '24 hours'
    GROUP BY service_name, DATE_TRUNC('hour', recorded_at)
),
service_summary AS (
    SELECT
        service_name,
        MIN(uptime_pct) AS min_uptime,
        AVG(uptime_pct) AS avg_uptime,
        COUNT(*) FILTER (WHERE uptime_pct < 99.9) AS degraded_hours
    FROM hourly_health
    GROUP BY service_name
)
SELECT
    service_name,
    ROUND(avg_uptime, 2) AS avg_uptime_pct,
    min_uptime AS worst_hour_pct,
    degraded_hours,
    CASE
        WHEN avg_uptime >= 99.9 THEN 'HEALTHY'
        WHEN avg_uptime >= 99.0 THEN 'DEGRADED'
        ELSE 'CRITICAL'
    END AS status
FROM service_summary
ORDER BY avg_uptime ASC;
```
This gives you a quick triage view. Services at the top of the list are the ones that need attention first. The `degraded_hours` column is particularly useful for distinguishing between a brief blip and a sustained problem.
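For reference, a minimal schema that would support this query (and the collector below) might look like the following. Column types are my assumptions, chosen to match the fields the query and the insert statement use; adapt them to your setup:

```sql
CREATE TABLE service_metrics (
    id           BIGSERIAL PRIMARY KEY,
    service_name TEXT        NOT NULL,
    status       TEXT        NOT NULL,   -- 'healthy' | 'unhealthy'
    response_ms  NUMERIC,                -- NULL when the probe failed outright
    status_code  INTEGER,
    recorded_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- The 24-hour window filter benefits from an index on (service_name, recorded_at).
CREATE INDEX idx_service_metrics_name_time
    ON service_metrics (service_name, recorded_at);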
## Collecting Metrics with a Python Agent
Dashboards are only as good as the data feeding them. Here’s a lightweight metrics collector that polls service endpoints, records latency and status, and ships the results to a Postgres sink:
```python
import logging
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone

import httpx
import psycopg2
from psycopg2.extras import execute_values

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class ServiceEndpoint:
    name: str
    url: str
    timeout: float = 5.0


@dataclass
class MetricsCollector:
    dsn: str
    endpoints: list[ServiceEndpoint] = field(default_factory=list)
    interval: int = 60

    def probe(self, endpoint: ServiceEndpoint) -> dict:
        """Send a health check request and measure response time."""
        try:
            start = time.monotonic()
            resp = httpx.get(endpoint.url, timeout=endpoint.timeout)
            elapsed_ms = round((time.monotonic() - start) * 1000, 2)
            return {
                "service_name": endpoint.name,
                "status": "healthy" if resp.status_code < 500 else "unhealthy",
                "response_ms": elapsed_ms,
                "status_code": resp.status_code,
                "recorded_at": datetime.now(timezone.utc),
            }
        except httpx.RequestError as exc:
            logger.warning("Probe failed for %s: %s", endpoint.name, exc)
            return {
                "service_name": endpoint.name,
                "status": "unhealthy",
                "response_ms": None,
                "status_code": None,
                "recorded_at": datetime.now(timezone.utc),
            }

    def collect(self) -> list[dict]:
        """Probe all endpoints and return a batch of results."""
        return [self.probe(ep) for ep in self.endpoints]

    def store(self, results: list[dict]) -> int:
        """Persist metric results to Postgres."""
        rows = [
            (r["service_name"], r["status"], r["response_ms"],
             r["status_code"], r["recorded_at"])
            for r in results
        ]
        insert_sql = """
            INSERT INTO service_metrics
                (service_name, status, response_ms, status_code, recorded_at)
            VALUES %s
        """
        # psycopg2's connection context manager commits the transaction but
        # does NOT close the connection, so close it explicitly to avoid
        # leaking one connection per collection cycle.
        conn = psycopg2.connect(self.dsn)
        try:
            with conn, conn.cursor() as cur:
                execute_values(cur, insert_sql, rows)
        finally:
            conn.close()
        return len(rows)

    def run(self):
        """Main loop: collect and store metrics at a fixed interval."""
        logger.info("Starting collector with %d endpoints", len(self.endpoints))
        while True:
            results = self.collect()
            count = self.store(results)
            healthy = sum(1 for r in results if r["status"] == "healthy")
            logger.info(
                "Stored %d metrics (%d/%d healthy)",
                count, healthy, len(results),
            )
            time.sleep(self.interval)


if __name__ == "__main__":
    endpoints = [
        ServiceEndpoint("api-gateway", "https://api.example.com/health"),
        ServiceEndpoint("auth-service", "https://auth.example.com/health"),
        ServiceEndpoint("payments", "https://payments.example.com/health"),
        ServiceEndpoint("notifications", "https://notify.example.com/health"),
    ]
    collector = MetricsCollector(
        dsn="postgresql://monitor:pass@localhost:5432/monitoring_db",
        endpoints=endpoints,
        interval=60,
    )
    collector.run()
```
The collector is intentionally simple – probe, record, sleep, repeat. In production you’d add retries, circuit breaking, and maybe push to a time-series database like TimescaleDB or InfluxDB instead. But the core loop stays the same.
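As a sketch of what that hardening could look like, here is one way to add bounded retries with exponential backoff around a probe. The helper name and parameters are illustrative, not part of any particular library:

```python
import time


def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying up to `attempts` times with exponential backoff.

    Returns fn()'s result on the first success; re-raises the last
    exception if every attempt fails.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # in practice, narrow this to httpx.RequestError
            last_exc = exc
            if attempt < attempts - 1:
                # 0.5s, 1s, 2s, ... between successive attempts
                time.sleep(base_delay * (2 ** attempt))
    raise last_exc
```

You would then wrap each probe as `with_retries(lambda: collector.probe(endpoint))`, turning a transient network blip into a retried call instead of a recorded failure.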
## Monitoring Architecture
Here’s how these pieces fit together in a typical observability setup:
```mermaid
graph TD
    A[Microservices] -->|Health checks| B[Metrics Collector]
    A -->|Logs| C[Log Aggregator]
    A -->|Traces| D[Trace Collector]
    B --> E[(Metrics DB)]
    C --> F[(Log Store)]
    D --> G[(Trace Store)]
    E --> H[Grafana]
    F --> H
    G --> H
    H --> I[Alerting]
    I --> J[PagerDuty / Slack]
    style A fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style B fill:#f5a623,stroke:#c17d1a,color:#fff
    style C fill:#f5a623,stroke:#c17d1a,color:#fff
    style D fill:#f5a623,stroke:#c17d1a,color:#fff
    style E fill:#7ed321,stroke:#5a9a18,color:#fff
    style F fill:#7ed321,stroke:#5a9a18,color:#fff
    style G fill:#7ed321,stroke:#5a9a18,color:#fff
    style H fill:#bd10e0,stroke:#8a0ba3,color:#fff
    style I fill:#d0021b,stroke:#9b0116,color:#fff
```
And a sequence diagram showing the alert flow when a service goes down:
```mermaid
sequenceDiagram
    participant Collector
    participant MetricsDB
    participant AlertEngine
    participant Slack
    participant OnCall
    Collector->>MetricsDB: INSERT unhealthy check
    MetricsDB->>AlertEngine: Threshold breached (3 consecutive failures)
    AlertEngine->>AlertEngine: Evaluate routing rules
    AlertEngine->>Slack: Post #incidents alert
    AlertEngine->>OnCall: Page on-call engineer
    OnCall-->>Slack: Acknowledge incident
    OnCall->>OnCall: Investigate & mitigate
    OnCall-->>Slack: Post resolution update
```
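The "3 consecutive failures" threshold in the diagram is simple to express in code. Here is a minimal sketch of such a rule; the function name and signature are my own, not taken from any particular alerting engine:

```python
def should_alert(statuses, threshold=3):
    """Return True if the most recent `threshold` checks are all unhealthy.

    `statuses` is a list of status strings ordered oldest to newest,
    e.g. ["healthy", "unhealthy", "unhealthy", "unhealthy"].
    """
    if len(statuses) < threshold:
        return False
    return all(s == "unhealthy" for s in statuses[-threshold:])
```

Requiring consecutive failures rather than firing on a single bad check is what separates a real incident from a one-off timeout.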
## Service Status Dashboard
A status page doesn’t need to be complicated. Here’s a quick visual using inline HTML that you could embed in any markdown-friendly page:
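A minimal sketch of such a status strip (service names, colours, and statuses are placeholder values, loosely matching the endpoints probed earlier):

```html
<div style="display: flex; gap: 12px; font-family: sans-serif;">
  <div style="padding: 8px 16px; border-radius: 6px; background: #e6f4ea;">
    <strong>api-gateway</strong><br>
    <span style="color: #1e7e34;">&#9679; Operational</span>
  </div>
  <div style="padding: 8px 16px; border-radius: 6px; background: #e6f4ea;">
    <strong>auth-service</strong><br>
    <span style="color: #1e7e34;">&#9679; Operational</span>
  </div>
  <div style="padding: 8px 16px; border-radius: 6px; background: #fdecea;">
    <strong>payments</strong><br>
    <span style="color: #c62828;">&#9679; Degraded</span>
  </div>
</div>
```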
And a comparison of popular monitoring tools:
| Tool | Best For | Data Model | Self-Hosted |
|---|---|---|---|
| Prometheus | Metrics & alerting | Time-series (pull) | Yes |
| Grafana Loki | Log aggregation | Log streams (push) | Yes |
| Jaeger | Distributed tracing | Spans & traces | Yes |
| Datadog | All-in-one SaaS | Metrics, logs, traces | No |
## Wrapping Up
Monitoring is not a feature you bolt on at the end – it’s infrastructure you build from day one. We covered how to query service health from raw metrics, how to collect those metrics with a Python agent, how the observability stack fits together architecturally, and how to present status information visually.
The three pillars of observability – metrics, logs, and traces – each tell a different part of the story. Metrics tell you something is wrong, logs tell you what went wrong, and traces tell you where it went wrong. Use all three.
Ship the dashboard. Set up the alerts. Sleep better at night.