Reliability Engineering

Kafka Health Checks with Full Config Binding: Fixing Secure Cluster Readiness


PR #860 patched a subtle but high-impact reliability issue: Kafka health checks were configured with bootstrap servers only and ignored SSL/SASL options, so they reported failures in secured environments even when the cluster itself was reachable. This insight explains the implementation, its release value, and why it matters to feature flag delivery.

TL;DR

The PR replaced partial Kafka health check wiring with full section binding in both API and Evaluation Server. This preserves authentication and TLS settings during readiness checks and removes the mismatch between runtime clients and health probes.

Core Changes

Before this fix, health checks only set bootstrap.servers. Any cluster requiring security.protocol, SASL credentials, or SSL cert options would fail health checks even when producers were correctly configured elsewhere.
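As an illustration of what the bootstrap-only wiring dropped, consider a secured section like the one below. The layout (a Kafka:Producer section in appsettings.json), the hostname, the credentials, and the certificate path are assumptions made for this example; the property names are standard librdkafka client settings. Under the old wiring, only the first key would have reached the health check's producer.

```json
{
  "Kafka": {
    "Producer": {
      "bootstrap.servers": "broker.example.com:9093",
      "security.protocol": "SASL_SSL",
      "sasl.mechanism": "PLAIN",
      "sasl.username": "health-probe",
      "sasl.password": "change-me",
      "ssl.ca.location": "/etc/ssl/certs/kafka-ca.pem"
    }
  }
}
```

A probe built from the first key alone would attempt a plaintext connection to a listener that expects TLS and SASL, which is consistent with the failures described in this post.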

PR #860 introduced a shared helper that binds each configuration section into a dictionary, validates the presence of bootstrap.servers, and constructs ProducerConfig from the full key set.

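private static ProducerConfig BuildHealthCheckProducerConfig(IConfigurationSection section)
{
    // Copy every key in the section (bootstrap.servers, security.protocol, sasl.*, ssl.*, ...)
    // so the probe sees the same options as the runtime client.
    var configDictionary = new Dictionary<string, string>();
    section.Bind(configDictionary);

    // bootstrap.servers is still the one required key; only its presence is checked here.
    if (!configDictionary.ContainsKey("bootstrap.servers"))
    {
        throw new InvalidOperationException($"{section.Path}:bootstrap.servers must be configured.");
    }

    // ProducerConfig accepts the full dictionary, so no option is silently dropped.
    return new ProducerConfig(configDictionary);
}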
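To see the helper's effect in isolation, the following sketch feeds it an in-memory configuration and lists the keys that reach ProducerConfig. It assumes the Confluent.Kafka, Microsoft.Extensions.Configuration, and Microsoft.Extensions.Configuration.Binder packages are referenced. The section name Kafka:Producer, the broker address, the credentials, and the wrapper class HealthCheckConfigDemo are illustrative placeholders, not taken from the PR.

```csharp
using System;
using System.Collections.Generic;
using Confluent.Kafka;
using Microsoft.Extensions.Configuration;

// Hypothetical settings for a SASL_SSL cluster; every value is a placeholder.
var settings = new Dictionary<string, string?>
{
    ["Kafka:Producer:bootstrap.servers"] = "broker.example.com:9093",
    ["Kafka:Producer:security.protocol"] = "SASL_SSL",
    ["Kafka:Producer:sasl.mechanism"] = "PLAIN",
    ["Kafka:Producer:sasl.username"] = "health-probe",
    ["Kafka:Producer:sasl.password"] = "change-me",
};

IConfiguration configuration = new ConfigurationBuilder()
    .AddInMemoryCollection(settings)
    .Build();

ProducerConfig probeConfig = HealthCheckConfigDemo.BuildHealthCheckProducerConfig(
    configuration.GetSection("Kafka:Producer"));

// ProducerConfig enumerates its key/value pairs; print keys only to avoid echoing secrets.
foreach (KeyValuePair<string, string> entry in probeConfig)
{
    Console.WriteLine(entry.Key);
}

static class HealthCheckConfigDemo
{
    // Same logic as the helper in this post, exposed as internal so the demo can call it.
    internal static ProducerConfig BuildHealthCheckProducerConfig(IConfigurationSection section)
    {
        var configDictionary = new Dictionary<string, string>();
        section.Bind(configDictionary);

        if (!configDictionary.ContainsKey("bootstrap.servers"))
        {
            throw new InvalidOperationException($"{section.Path}:bootstrap.servers must be configured.");
        }

        return new ProducerConfig(configDictionary);
    }
}
```

All five keys should appear in the output, including the security.protocol and sasl.* entries that a bootstrap-only probe would never have seen.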

Developer Insights

Technical Innovation

The change aligns health checks with production client initialization, eliminating config drift between probe path and runtime path.

Technical Optimization

Replacing the bootstrap-only logic that was duplicated across two services with one small helper made secure-cluster support implicit.

Release Impact

This PR was merged to main after v5.2.4, so it is not in that release and is part of the stabilization work for the next release train. It is backward-compatible: the helper keeps the original required-key check on bootstrap.servers and passes the remaining Kafka options through.
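The backward-compatibility claim can be checked with a small sketch: a plaintext section that sets only bootstrap.servers still produces a usable config, and a section missing that key is still rejected. The same package assumptions apply (Confluent.Kafka plus the Microsoft.Extensions.Configuration packages). The section names, addresses, and the BackCompatDemo class are hypothetical and exist only for this example.

```csharp
using System;
using System.Collections.Generic;
using Confluent.Kafka;
using Microsoft.Extensions.Configuration;

IConfiguration configuration = new ConfigurationBuilder()
    .AddInMemoryCollection(new Dictionary<string, string?>
    {
        // Plaintext deployment: only the key the old health check already required.
        ["Kafka:Producer:bootstrap.servers"] = "localhost:9092",
        // Misconfigured section: a security option is present but the broker list is missing.
        ["Kafka:Consumer:security.protocol"] = "SSL",
    })
    .Build();

// Existing plaintext setups keep working without any new keys.
ProducerConfig plaintext = BackCompatDemo.Build(configuration.GetSection("Kafka:Producer"));
Console.WriteLine($"Plaintext probe targets {plaintext.BootstrapServers}");

// A section without bootstrap.servers is rejected, as it was before the change.
try
{
    BackCompatDemo.Build(configuration.GetSection("Kafka:Consumer"));
}
catch (InvalidOperationException ex)
{
    Console.WriteLine($"Rejected: {ex.Message}");
}

static class BackCompatDemo
{
    // Mirrors the helper from this post under a shorter, hypothetical name.
    internal static ProducerConfig Build(IConfigurationSection section)
    {
        var configDictionary = new Dictionary<string, string>();
        section.Bind(configDictionary);

        if (!configDictionary.ContainsKey("bootstrap.servers"))
        {
            throw new InvalidOperationException($"{section.Path}:bootstrap.servers must be configured.");
        }

        return new ProducerConfig(configDictionary);
    }
}
```

Failing fast on the missing key keeps a misconfigured section from reaching the probe at all, so the error surfaces at startup instead of as a timeout.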

Operationally, this reduces noisy health check failures and helps teams trust readiness signals during deployments.

Customer Feedback

The PR description confirms a customer-facing incident: health checks against secure Kafka endpoints failed with task-canceled and protocol-mismatch errors. The fix was also flagged as a customer requirement.

In short, the patch translates direct production pain into a minimal, auditable code fix.

Feature Flag Connection

Feature flags are only as reliable as their event and evaluation infrastructure. In FeatBit, Kafka is a key delivery backbone for updates and evaluation data paths.

When health checks correctly reflect secure Kafka connectivity, teams can ship and roll out flags with confidence instead of reacting to false negatives from infrastructure probes.

Getting Started