We have a Jakarta app that exposes Prometheus metrics for OIDC session billing. The success rate metric occasionally spikes above 100% (observed 133%, 300%) in production.
The metric:
(
sum by(stage)(rate(oidc_session_success_total{stage="PROD"}[15m]))
) / sum by(stage)(rate(oidc_session_total{stage="PROD"}[15m])) * 100
Root cause
The counters are not event-driven. Instead, on every Prometheus scrape, the app queries the DB for aggregate totals and tries to sync the in-memory counter by computing a delta:
final var oidcSessionSuccessTotal = Counter.builder("oidc_session_success_total")
        .tags(tags)
        .register(registry);
// Called on every scrape: add the difference between the DB total and the current counter value
oidcSessionSuccessTotal.increment(getTotalSuccess() - oidcSessionSuccessTotal.count());
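To make the arithmetic concrete, here is a minimal, library-free sketch of how two counters synced this way can produce a ratio above 100% over a window. `DeltaSyncDemo`, `SyncedCounter`, and the DB values are hypothetical and chosen only for illustration. The sketch assumes that non-positive deltas are dropped and that successes can be recorded for sessions already counted in an earlier scrape. Whether this is exactly what happens in our production setup is not confirmed.

```java
import java.util.Locale;

public class DeltaSyncDemo {

    /** Hypothetical stand-in for a monotonic counter synced to a DB total on each scrape. */
    static final class SyncedCounter {
        private double count;

        /** Same arithmetic as increment(dbTotal - count()). */
        void syncTo(double dbTotal) {
            double delta = dbTotal - count;
            // Assumption: a counter must never decrease, so non-positive deltas are dropped.
            if (delta > 0) {
                count += delta;
            }
        }

        double count() {
            return count;
        }
    }

    /**
     * Success rate in percent between the first and last scrape, i.e.
     * increase(success) / increase(total) * 100 over that window.
     * dbTotals[i] and dbSuccesses[i] are the DB aggregates read at scrape i.
     */
    static double windowSuccessRate(long[] dbTotals, long[] dbSuccesses) {
        SyncedCounter total = new SyncedCounter();
        SyncedCounter success = new SyncedCounter();

        // The first scrape establishes the start of the window.
        total.syncTo(dbTotals[0]);
        success.syncTo(dbSuccesses[0]);
        double totalStart = total.count();
        double successStart = success.count();

        // Later scrapes sync each counter to whatever the DB reports at that moment.
        for (int i = 1; i < dbTotals.length; i++) {
            total.syncTo(dbTotals[i]);
            success.syncTo(dbSuccesses[i]);
        }

        return (success.count() - successStart) / (total.count() - totalStart) * 100.0;
    }

    public static void main(String[] args) {
        // Illustrative values: 3 new sessions in the window, but 4 new successes,
        // because one success belongs to a session counted before the window began.
        long[] dbTotals = {100, 103};
        long[] dbSuccesses = {90, 94};
        System.out.println(String.format(Locale.ROOT, "%.1f%%",
                windowSuccessRate(dbTotals, dbSuccesses))); // prints 133.3%
    }
}
```

Cumulatively the data is consistent (94 successes out of 103 sessions), yet the per-window ratio is 4/3. With 3 successes against 1 new session, the same arithmetic gives 300%.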
Switching to gauges would fix the >100% readings, but it removes the ability to use rate() and increase() on session counts, so we would lose visibility into sessions-per-second throughput.
Question: What is the correct pattern for metrics that are derived from DB aggregates — where the DB is the source of truth — but where you also need rate/throughput visibility in Prometheus?