System Monitoring and Logging

In modern distributed systems, monitoring and logging are the twin pillars of observability. They enable engineers to detect anomalies, understand system behavior, and perform root‑cause analysis with confidence. This tutorial walks through the principles, design patterns, and practical tooling required to build robust monitoring and logging solutions for large‑scale applications.

1. The Business Value of Monitoring & Logging

Effective observability translates directly into reduced mean time to detection (MTTD) and mean time to recovery (MTTR). It also supports compliance, capacity planning, and continuous improvement initiatives.

  • Proactive issue detection before customers notice problems
  • Data‑driven performance tuning and cost optimization
  • Regulatory compliance through immutable audit trails
  • Improved developer productivity via fast feedback loops

2. Core Concepts and Terminology

2.1 Monitoring

Monitoring focuses on the collection, aggregation, and visualization of metrics—numerical data points that describe system health (CPU usage, request latency, error rates, etc.).

2.2 Logging

Logging captures event streams—structured or unstructured text that records what happened, when, and why. Logs are essential for debugging and forensic analysis.

2.3 Tracing

Tracing stitches together a single request’s journey across services, providing end‑to‑end visibility of latency and failures.
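
As a rough illustration, a span can be opened around a unit of work with the OpenTelemetry Python API; exporter and backend configuration are deployment-specific and omitted here, and the service and attribute names are placeholders:

from opentelemetry import trace

# With no tracer provider configured this returns a no-op tracer, which keeps the sketch runnable.
tracer = trace.get_tracer("order-service")

def create_order(order):
    # Each handler opens a span; downstream calls (DB, HTTP) nest under it,
    # producing the end-to-end view of a single request's journey.
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.id", order["id"])
        # ... business logic ...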

3. Designing a Monitoring Architecture

A typical monitoring stack consists of four layers: Instrumentation → Collection → Storage → Visualization & Alerting.

Layer                    | Responsibility              | Common Tools
Instrumentation          | Expose metrics from code    | Prometheus client libraries, OpenTelemetry
Collection               | Scrape or push metrics      | Prometheus server, StatsD, Telegraf
Storage                  | Persist time-series data    | Prometheus TSDB, InfluxDB, VictoriaMetrics
Visualization & Alerting | Dashboards & notifications  | Grafana, Alertmanager, PagerDuty
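
As a minimal illustration of where the instrumentation and collection layers meet, the Python client library can expose a /metrics endpoint for the Prometheus server to scrape; the port and metric name below are arbitrary choices for this sketch:

import time
from prometheus_client import start_http_server, Gauge

IN_FLIGHT = Gauge('app_in_flight_requests', 'Requests currently being handled')

if __name__ == '__main__':
    start_http_server(8000)    # serves http://localhost:8000/metrics for Prometheus to scrape
    while True:
        time.sleep(1)          # real application work would happen here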

3.1 Metric Types

  1. Counter – monotonically increasing (e.g., request count)
  2. Gauge – arbitrary value that can go up and down (e.g., current memory usage)
  3. Histogram – distribution of observations (e.g., request latency)
  4. Summary – quantiles over a sliding window
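
Counter and Histogram appear in the instrumentation example in section 3.2; as a brief sketch, the remaining two types look like this with the Python client (metric names are illustrative):

from prometheus_client import Gauge, Summary

QUEUE_DEPTH = Gauge('worker_queue_depth', 'Jobs currently waiting in the queue')
JOB_DURATION = Summary('job_duration_seconds', 'Time spent processing a job')

@JOB_DURATION.time()          # records each call's duration, from which quantiles are derived
def process_job(job):
    QUEUE_DEPTH.dec()         # gauge moves down when a job is taken off the queue
    # ... do the work ...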

3.2 Best‑Practice Instrumentation

  • Use a context object to propagate request IDs
  • Tag metrics with low‑cardinality labels only (e.g., service, status_code)
  • Avoid high‑cardinality dimensions such as user IDs
  • Instrument at the library level (HTTP server, database client) to reduce duplication

For example, with the Python prometheus_client library:

from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency', ['method', 'endpoint'])

def handle_request(request):
    # Time the whole request and record the duration in the latency histogram.
    with REQUEST_LATENCY.labels(method=request.method, endpoint=request.path).time():
        # ... handle request ...
        status = 200
        REQUEST_COUNT.labels(method=request.method, endpoint=request.path, status=status).inc()
import ("github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
        "net/http"
)

var (
    requestCount = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "http_requests_total", Help: "Total HTTP requests"},
        []string{"method", "endpoint", "code"},
    )
    requestLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{Name: "http_request_duration_seconds", Help: "Request latency"},
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(requestCount, requestLatency)
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    // ... other handlers that record metrics ...
    http.ListenAndServe(":9090", nil)
}

4. Designing a Logging Architecture

Logging systems must handle high throughput, provide efficient search, and retain data for compliance periods. The typical pipeline includes Log Generation → Collection → Processing → Indexing → Retention.

Stage          | Goal                    | Typical Tools
Log Generation | Emit structured events  | Log4j, Serilog, Bunyan
Collection     | Ship logs reliably      | Filebeat, Fluent Bit, Fluentd
Processing     | Enrich & parse          | Logstash, Vector
Indexing       | Make logs searchable    | Elasticsearch, OpenSearch
Retention      | Store long-term         | S3, Azure Blob, GCS with lifecycle policies

4.1 Structured Logging

Instead of free‑form text, use JSON (or another machine‑readable format) with consistent field names. This enables powerful filtering and correlation.

{
  "timestamp": "2025-10-31T13:45:22Z",
  "level": "INFO",
  "service": "order-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "message": "Order created",
  "order_id": "12345",
  "customer_id": "9876",
  "region": "us-east-1"
}
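
One way to emit records shaped like the one above is a small custom formatter on top of Python's standard logging module. This is only a sketch that mirrors the field names in the example; in practice a dedicated library such as structlog or python-json-logger may be a better fit:

import json, logging, datetime

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "order-service",
            "message": record.getMessage(),
        }
        # Merge per-event fields such as order_id or trace_id passed via `extra`.
        payload.update(getattr(record, "extra_fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Order created", extra={"extra_fields": {
    "order_id": "12345", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}})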

4.2 Log Correlation

  • Propagate a trace_id or request_id across all services.
  • Include the same identifier in metrics, logs, and traces.
  • Use centralized log aggregation to join events across components.

⚠ Warning: Never log personally identifiable information (PII) or secrets in plaintext. Apply redaction or encryption at the collection layer.
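
One way to satisfy the first two points above in Python is to hold the incoming request's identifier in a contextvars.ContextVar and stamp it onto every record with a logging filter. This is only a sketch; the header name and fallback values are assumptions, and most web frameworks provide middleware hooks for the same purpose:

import contextvars
import logging

trace_id_var = contextvars.ContextVar("trace_id", default="unknown")

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = trace_id_var.get()   # exposed to formatters as %(trace_id)s
        return True

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s")
for handler in logging.getLogger().handlers:
    handler.addFilter(TraceIdFilter())

def handle_request(headers):
    # Set once per request, e.g. from a trace header injected by an upstream proxy.
    trace_id_var.set(headers.get("x-trace-id", "generated-locally"))
    logging.getLogger(__name__).info("Order created")

handle_request({"x-trace-id": "4bf92f3577b34da6a3ce929d0e0e4736"})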

5. Alerting Strategies

Alert fatigue is a common pitfall. Follow the SMART criteria (Specific, Measurable, Actionable, Relevant, Timely) and use multi‑stage alerts.

  1. Level‑0: Auto‑recovery (e.g., restart container).
  2. Level‑1: Pager‑on‑call for critical SLO breaches.
  3. Level‑2: Incident manager escalation for prolonged outages.

💡 Tip: Group related alerts using Alertmanager routing trees to reduce duplicate notifications.

6. Observability in Practice: End‑to‑End Example

A simplified end-to-end setup looks like this: application code instrumented with OpenTelemetry and the Prometheus client libraries exposes metrics that the Prometheus server scrapes, emits structured logs that Fluent Bit ships into Elasticsearch, and exports traces to a tracing backend. Grafana and Alertmanager sit on top for dashboards and notifications, and a shared trace_id ties the three signals together.
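
As a compact sketch of how the three signals meet in a single handler, assuming the Prometheus and OpenTelemetry setup from earlier sections and the extra_fields logging convention from the sketch in section 4.1:

import logging
from opentelemetry import trace
from prometheus_client import Counter

ORDERS_CREATED = Counter('orders_created_total', 'Orders successfully created')
tracer = trace.get_tracer("order-service")
logger = logging.getLogger("order-service")

def create_order(order):
    with tracer.start_as_current_span("create_order") as span:
        # With no tracer provider configured this yields the all-zero invalid trace ID.
        trace_id = format(span.get_span_context().trace_id, '032x')
        # ... business logic ...
        ORDERS_CREATED.inc()                                   # metric
        logger.info("Order created", extra={"extra_fields": {  # structured log, joined to the trace
            "order_id": order["id"], "trace_id": trace_id}})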

7. Common Pitfalls & How to Avoid Them

  • Over‑instrumentation – leads to high cardinality and performance overhead.
  • Logging at too fine‑grained a level in production – inflates storage costs.
  • Missing context – logs without request IDs are hard to correlate.
  • Static thresholds – prefer dynamic, SLO‑based alerting.

📝 Note: Regularly review and prune unused dashboards, alert rules, and log retention policies to keep the system maintainable.

8. Frequently Asked Questions

Q: What is the difference between monitoring and logging?
A: Monitoring collects quantitative metrics for real‑time health checks and alerting. Logging records qualitative events that provide the narrative behind those metrics.


Q: Should I use a hosted observability service or self‑hosted?
A: It depends on scale, compliance, and team expertise. Hosted services (e.g., Datadog, New Relic) reduce operational overhead, while self‑hosted solutions (Prometheus, Loki) offer full control and cost predictability.


Q: How often should I rotate logs?
A: Rotate logs based on size (e.g., 100 MB) or time (daily). Ensure the rotation policy aligns with your retention requirements and does not interrupt the collection pipeline.
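
For illustration, both policies exist in Python's standard library; the file names and retention counts below are arbitrary, and in containerized environments rotation is often delegated to the runtime or the log shipper instead:

import logging
from logging.handlers import RotatingFileHandler, TimedRotatingFileHandler

size_handler = RotatingFileHandler("app.log", maxBytes=100 * 1024 * 1024, backupCount=5)
daily_handler = TimedRotatingFileHandler("app-daily.log", when="midnight", backupCount=30)

logger = logging.getLogger("order-service")
logger.addHandler(size_handler)    # roll over at ~100 MB, keep 5 rotated files
logger.addHandler(daily_handler)   # roll over at midnight, keep 30 days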


9. Quick Knowledge Check

Q. Which metric type is best suited for measuring request latency distribution?
  • Counter
  • Gauge
  • Histogram
  • Summary

Answer: Histogram
Histograms bucket observations, enabling latency percentile calculations.

Q. What label cardinality practice should you follow?
  • Use user IDs as labels
  • Limit labels to low‑cardinality values
  • Add timestamps as labels
  • Never use labels

Answer: Limit labels to low‑cardinality values
High‑cardinality labels explode the time‑series database size and degrade query performance.

📘 Summary: Monitoring provides real‑time metrics for health checks and alerts, while logging offers detailed event records for post‑mortem analysis. By combining structured logging, low‑cardinality metrics, distributed tracing, and well‑designed alerting pipelines, you can achieve a resilient, observable system that meets both operational and compliance goals.