We have a requirement for our Camel Kafka consumer to keep retrying indefinitely when it hits a failure while connecting to the broker, instead of giving up. The only way we were able to get a true "retry forever" effect was by also setting metadataMaxAgeMs on the consumer endpoint — even though the Camel Kafka component docs listed metadataMaxAgeMs strictly under (producer) options, not consumer.
Currently using camel-kafka 4.x and spring-boot 3.x.
Config
We build the consumer URI dynamically and pass these (currently hard-coded, would move to config if this approach is validated):
public String getURI() {
int reconnectBackoffMaxMs = SpringContextLookupUtil.getSystemProperty(
"armada.kafka.consumer.auth.retry.max.backoff.ms", Integer.class);
int reconnectBackoffMs = SpringContextLookupUtil.getSystemProperty(
"armada.kafka.consumer.auth.retry.initial.backoff.ms", Integer.class);
int metadataMaxAgeMs = SpringContextLookupUtil.getSystemProperty(
"armada.kafka.consumer.auth.retry.metadata.max.age.ms", Integer.class);
String uri = "%s?groupId=%s&autoOffsetReset=%s&autoCommitIntervalMs=%d&consumersCount=%d"
+ "&sessionTimeoutMs=%d&heartbeatIntervalMs=%d&consumerRequestTimeoutMs=%d&maxPollRecords=%d"
+ "&maxPollIntervalMs=%d&reconnectBackoffMs=%d&reconnectBackoffMaxMs=%d&metadataMaxAgeMs=%d";
String endpointURI = String.format(uri, endpoint,
ComponentConstants.KAFKA_CONSUMER_GROUP_ID_PREFIX + groupId,
autoOffsetReset, autoCommitIntervalMs, consumersCount, sessionTimeout,
heartbeatInterval, requestTimeout, maxPollRecords, maxPollInterval,
reconnectBackoffMs, reconnectBackoffMaxMs, metadataMaxAgeMs);
log.debug("Endpoint Uri : {}", endpointURI);
return endpointURI;
}
We also plugged in a custom PollExceptionStrategy to drive exponential backoff manually on auth failures:
public class KafkaAuthorizationReconnectStrategy implements PollExceptionStrategy {
@Value("${armada.kafka.consumer.auth.retry.initial.backoff.ms:1000}")
private long reconnectBackoffMs;
@Value("${armada.kafka.consumer.auth.retry.max.backoff.ms:30000}")
private long reconnectBackoffMaxMs;
@Override
public void handle(long partitionLastOffset, Exception exception) {
// compute exponential backoff, cap it at reconnectBackoffMaxMs,
// log, then Thread.sleep(backoffMs) before letting Camel retry
...
}
private long computeExponentialBackoff(int attempt) {
double raw = reconnectBackoffMs * Math.pow(2, attempt - 1);
return Math.min((long) raw, reconnectBackoffMaxMs);
}
@Override
public boolean canContinue() {
return true; // always keep retrying
}
}
Example flow
- Connection fails → wait 1s (initial backoff)
- Still failing → wait 2s
- Still failing → wait 4s
- …continues doubling…
- Capped at 30s max between retries, repeats indefinitely
This works in our testing, but only once metadataMaxAgeMs=1000 is also set on the consumer URI. Without it, the consumer would eventually stop retrying after the backoff ceiling was hit a few times.
Summary
We're only intending to control reconnectBackoffMs / reconnectBackoffMaxMs behavior for the consumer, but per the component reference table, metadataMaxAgeMs is documented only as a (producer) option — there's no (consumer) entry for it at all. Adding it anyway to the consumer endpoint URI changed our retry behavior in a way that "fixed" our problem, but we don't fully understand why, and we'd rather not ship a workaround that depends on undocumented/unsupported behavior.
Questions
- Is
metadataMaxAgeMsactually usable/honored on a Camel Kafka consumer endpoint, or is it being silently ignored/dropped by Camel'sKafkaConfigurationfor consumers (in which case our improvement might be coincidental and happening for a different reason)? - Is there a supported Camel/Kafka consumer property for "retry forever with capped exponential backoff" on connection/authorization failures, rather than rolling our own
PollExceptionStrategy? Right now after the backoff ceiling is hit, the consumer effectively keeps retrying everyreconnectBackoffMaxMsuntil the underlying issue is fixed — is that the intended/recommended pattern forTopicAuthorizationException-type failures, or is there a cleaner built-in option (e.g. something equivalent topollOnError=RECONNECTcombined with nativereconnect.backoff.ms/reconnect.backoff.max.ms) that doesn't need a custom strategy class at all? - Why does setting
metadataMaxAgeMson the consumer URI appear to change behavior at all, given its documented purpose ("the period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes, to proactively discover any new brokers or partitions") doesn't obviously relate to authorization-failure reconnect/backoff? Is this an unintended side effect of how Camel parses/forwards endpoint options, or is there a legitimate mechanism by which a metadata refresh interacts with our retry loop?
Any clarification on whether metadataMaxAgeMs is genuinely consumer-applicable in Camel, and whether there's a more "supported" way to achieve indefinite capped-backoff retries on consumer authorization failures, would be appreciated.