In modern distributed systems, monitoring and logging are the twin pillars of observability. They enable engineers to detect anomalies, understand system behavior, and perform root‑cause analysis with confidence. This tutorial walks through the principles, design patterns, and practical tooling required to build robust monitoring and logging solutions for large‑scale applications.
1. The Business Value of Monitoring & Logging
Effective observability translates directly into reduced mean time to detection (MTTD) and mean time to recovery (MTTR). It also supports compliance, capacity planning, and continuous improvement initiatives.
- Proactive issue detection before customers notice problems
- Data‑driven performance tuning and cost optimization
- Regulatory compliance through immutable audit trails
- Improved developer productivity via fast feedback loops
2. Core Concepts and Terminology
2.1 Monitoring
Monitoring focuses on the collection, aggregation, and visualization of metrics—numerical data points that describe system health (CPU usage, request latency, error rates, etc.).
2.2 Logging
Logging captures event streams—structured or unstructured text that records what happened, when, and why. Logs are essential for debugging and forensic analysis.
2.3 Tracing
Tracing stitches together a single request’s journey across services, providing end‑to‑end visibility of latency and failures.
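As a minimal sketch of the idea using the OpenTelemetry Python SDK (the console exporter and the span/service names are placeholders; a real deployment would export to a tracing backend and propagate context between services via headers):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

def place_order():
    with tracer.start_as_current_span("place_order"):          # covers the whole request
        with tracer.start_as_current_span("charge_payment"):   # child span nests automatically
            pass  # ... call the payment service ...

place_order()
```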
3. Designing a Monitoring Architecture
A typical monitoring stack consists of four layers: Instrumentation → Collection → Storage → Visualization & Alerting.
| Layer | Responsibility | Common Tools |
|---|---|---|
| Instrumentation | Expose metrics from code | Prometheus client libraries, OpenTelemetry |
| Collection | Scrape or push metrics | Prometheus server, StatsD, Telegraf |
| Storage | Persist time‑series data | Prometheus TSDB, InfluxDB, VictoriaMetrics |
| Visualization & Alerting | Dashboards & notifications | Grafana, Alertmanager, PagerDuty |
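To make the first two layers concrete, the sketch below (using `prometheus_client`; the port, metric name, and fake measurement are arbitrary) exposes a `/metrics` endpoint that a Prometheus server can then be configured to scrape:

```python
import random
import time

from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge("worker_queue_depth", "Jobs currently waiting in the queue")

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics
    while True:
        QUEUE_DEPTH.set(random.randint(0, 50))  # stand-in for a real measurement
        time.sleep(5)
```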
3.1 Metric Types
- Counter – monotonically increasing (e.g., request count)
- Gauge – arbitrary value that can go up and down (e.g., current memory usage)
- Histogram – distribution of observations (e.g., request latency)
- Summary – quantiles over a sliding window
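In the Prometheus Python client, each of these types maps to a class; the metric names below are placeholders (note that the Python client's `Summary` tracks only a running count and sum, not quantiles):

```python
from prometheus_client import Counter, Gauge, Histogram, Summary

REQUESTS = Counter("requests_total", "Requests served")             # monotonically increasing
IN_FLIGHT = Gauge("in_flight_requests", "Requests in progress")     # can go up and down
LATENCY = Histogram("request_latency_seconds", "Request latency")   # bucketed observations
PAYLOAD = Summary("payload_bytes", "Payload size")                  # running count and sum

REQUESTS.inc()
IN_FLIGHT.inc()
IN_FLIGHT.dec()
LATENCY.observe(0.23)  # seconds
PAYLOAD.observe(512)   # bytes
```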
3.2 Best‑Practice Instrumentation
- Use a `context` object to propagate request IDs
- Tag metrics with low‑cardinality labels only (e.g., `service`, `status_code`)
- Avoid high‑cardinality dimensions such as user IDs
- Instrument at the library level (HTTP server, database client) to reduce duplication
A Python example using `prometheus_client`:

```python
from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency', ['method', 'endpoint'])

def handle_request(request):
    # Time the handler and record its outcome with low-cardinality labels only.
    with REQUEST_LATENCY.labels(method=request.method, endpoint=request.path).time():
        # ... handle request ...
        response_status = 200
    REQUEST_COUNT.labels(method=request.method, endpoint=request.path, status=response_status).inc()
```
import ("github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
"net/http"
)
var (
requestCount = prometheus.NewCounterVec(
prometheus.CounterOpts{Name: "http_requests_total", Help: "Total HTTP requests"},
[]string{"method", "endpoint", "code"},
)
requestLatency = prometheus.NewHistogramVec(
prometheus.HistogramOpts{Name: "http_request_duration_seconds", Help: "Request latency"},
[]string{"method", "endpoint"},
)
)
func init() {
prometheus.MustRegister(requestCount, requestLatency)
}
func main() {
http.Handle("/metrics", promhttp.Handler())
// ... other handlers that record metrics ...
http.ListenAndServe(":9090", nil)
}
4. Designing a Logging Architecture
Logging systems must handle high throughput, provide efficient search, and retain data for compliance periods. The typical pipeline includes Log Generation → Collection → Processing → Indexing → Retention.
| Stage | Goal | Typical Tools |
|---|---|---|
| Log Generation | Emit structured events | Log4j, Serilog, Bunyan |
| Collection | Ship logs reliably | Filebeat, Fluent Bit, Fluentd |
| Processing | Enrich & parse | Logstash, Vector |
| Indexing | Make logs searchable | Elasticsearch, OpenSearch |
| Retention | Store long‑term | S3, Azure Blob, GCS with lifecycle policies |
4.1 Structured Logging
Instead of free‑form text, use JSON (or another machine‑readable format) with consistent field names. This enables powerful filtering and correlation.
```json
{
  "timestamp": "2025-10-31T13:45:22Z",
  "level": "INFO",
  "service": "order-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "message": "Order created",
  "order_id": "12345",
  "customer_id": "9876",
  "region": "us-east-1"
}
```
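One way to emit records like this from Python is a small custom formatter over the standard `logging` module. This is a minimal sketch: the field names follow the example above, and the `fields` convention for extra attributes is an assumption rather than a library API.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent field names."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "order-service",  # in practice, read from configuration
            "message": record.getMessage(),
        }
        # Attributes passed via `extra=` (e.g., order_id, trace_id) are merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

root.info("Order created", extra={"fields": {"order_id": "12345", "region": "us-east-1"}})
```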
4.2 Log Correlation
- Propagate a `trace_id` or `request_id` across all services.
- Include the same identifier in metrics, logs, and traces.
- Use centralized log aggregation to join events across components.
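One lightweight way to do this in Python is to hold the identifier in a `contextvars` variable and inject it into every record with a logging filter. This is an illustrative pattern rather than a specific framework's API; in practice the ID would be set by middleware from an incoming header.

```python
import contextvars
import logging

# Holds the current request's ID for the duration of that request.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request_id to every record before it is formatted."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s")
logging.getLogger().addFilter(RequestIdFilter())

def handle_request(incoming_request_id: str):
    request_id_var.set(incoming_request_id)  # normally done by middleware
    logging.info("processing order")         # request_id is attached automatically

handle_request("req-42")
```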
5. Alerting Strategies
Alert fatigue is a common pitfall. Follow the SMART criteria (Specific, Measurable, Actionable, Relevant, Timely) and use multi‑stage alerts.
- Level‑0: Auto‑recovery (e.g., restart container).
- Level‑1: Pager‑on‑call for critical SLO breaches.
- Level‑2: Incident manager escalation for prolonged outages.
Use Alertmanager routing trees to group related alerts and reduce duplicate notifications.
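Beyond routing, deciding when a Level‑1 page should fire is easier with SLO‑based burn‑rate checks than with static thresholds. The following is a minimal sketch, assuming a 99.9% availability SLO and error ratios fed in from your metrics backend; the multi‑window thresholds follow commonly cited burn‑rate guidance and are not tied to any specific tool.

```python
# Illustrative SLO burn-rate check.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(observed_error_ratio: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    return observed_error_ratio / ERROR_BUDGET

def should_page(error_ratio_1h: float, error_ratio_5m: float) -> bool:
    # Multi-window rule: both a long and a short window must be burning fast,
    # which ignores brief spikes but still catches sustained incidents quickly.
    return burn_rate(error_ratio_1h) > 14.4 and burn_rate(error_ratio_5m) > 14.4

# Example: 2% of requests failing in both windows -> page the on-call.
print(should_page(0.02, 0.02))  # True (burn rate = 20)
```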
6. Observability in Practice: End‑to‑End Example
A simplified end‑to‑end architecture ties the three signals together: services instrumented with OpenTelemetry and Prometheus client libraries emit metrics, structured logs, and traces; Prometheus scrapes the metrics, Fluent Bit ships the logs to Elasticsearch, and a shared `trace_id` correlates all three, with dashboards and alerts layered on top via Grafana and Alertmanager.
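The in‑process side of that picture, for a single handler, might look like the sketch below. It assumes a tracer provider is configured as in Section 2.3, reuses the `fields` logging convention from Section 4.1, and the metric and span names are illustrative.

```python
import logging

from opentelemetry import trace
from prometheus_client import Counter

ORDERS_CREATED = Counter("orders_created_total", "Orders created")
tracer = trace.get_tracer("order-service")
log = logging.getLogger("order-service")

def create_order(order_id: str):
    with tracer.start_as_current_span("create_order"):
        # Derive the current trace ID so this log line can be joined with the trace.
        span_ctx = trace.get_current_span().get_span_context()
        trace_id = format(span_ctx.trace_id, "032x")

        # ... business logic ...

        ORDERS_CREATED.inc()  # metric: scraped by Prometheus
        log.info("Order created",
                 extra={"fields": {"order_id": order_id, "trace_id": trace_id}})
```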
7. Common Pitfalls & How to Avoid Them
- Over‑instrumentation – leads to high cardinality and performance overhead.
- Logging at too fine‑grained a level in production – inflates storage costs.
- Missing context – logs without request IDs are hard to correlate.
- Static thresholds – prefer dynamic, SLO‑based alerting.
8. Frequently Asked Questions
Q: What is the difference between monitoring and logging?
A: Monitoring collects quantitative metrics for real‑time health checks and alerting. Logging records qualitative events that provide the narrative behind those metrics.
Q: Should I use a hosted observability service or self‑hosted?
A: It depends on scale, compliance, and team expertise. Hosted services (e.g., Datadog, New Relic) reduce operational overhead, while self‑hosted solutions (Prometheus, Loki) offer full control and cost predictability.
Q: How often should I rotate logs?
A: Rotate logs based on size (e.g., 100 MB) or time (daily). Ensure the rotation policy aligns with your retention requirements and does not interrupt the collection pipeline.
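As a concrete illustration, Python's standard library supports both policies out of the box (the file names and limits below are placeholders):

```python
import logging
from logging.handlers import RotatingFileHandler, TimedRotatingFileHandler

logger = logging.getLogger("order-service")

# Size-based rotation: roll over at ~100 MB, keep the five most recent files.
logger.addHandler(RotatingFileHandler("app.log", maxBytes=100 * 1024 * 1024, backupCount=5))

# Time-based rotation: roll over at midnight, keep 14 days of history.
logger.addHandler(TimedRotatingFileHandler("app-daily.log", when="midnight", backupCount=14))
```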
9. Quick Knowledge Check
Q. Which metric type is best suited for measuring request latency distribution?
- Counter
- Gauge
- Histogram
- Summary
Answer: Histogram
Histograms bucket observations, enabling latency percentile calculations.
Q. What label cardinality practice should you follow?
- Use user IDs as labels
- Limit labels to low‑cardinality values
- Add timestamps as labels
- Never use labels
Answer: Limit labels to low‑cardinality values
High‑cardinality labels explode the time‑series database size and degrade query performance.