FeatBit

How to Build and Maintain a 100% Uptime SLA for a Feature Flag Platform

Introduction

Customers have been asking FeatBit how to maintain a 100% uptime SLA for their feature flag platform. This article outlines the considerations and solutions FeatBit has already implemented and plans to implement to ensure continuous availability.

We will continue to update this article as FeatBit deploys additional measures to guarantee 100% uptime for its feature flag management solution.

Considerations and Solution

FeatBit Architecture

Essentially, we believe each component in the architecture—shown in the diagram below—must meet a specific uptime requirement, from the remote server all the way to the SDK:

  1. Feature Flag Management Server – Ensure the service is designed for high availability.
  2. Edge Agents (optional) – Provide a local service that remains operational even if the remote server goes down.
  3. SDK – Continue functioning even if both the remote server and edge agents are unavailable.
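To see why layering helps, here is a rough back-of-the-envelope sketch in Python. It assumes the three layers fail independently, which real deployments only approximate, and the availability figures are purely illustrative, not FeatBit measurements. Under that assumption, flag evaluation is unavailable only when every layer is unavailable at the same time.

```python
from math import prod


def layered_availability(availabilities):
    """Availability of a fallback chain, assuming independent failures.

    The chain is down only when every layer is down at the same time,
    so the combined availability is 1 - product(1 - a_i).
    """
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"availability must be in [0, 1], got {a}")
    return 1.0 - prod(1.0 - a for a in availabilities)


# Illustrative numbers only (assumptions, not measured figures):
# remote server, edge agent, SDK with locally stored configuration.
layers = [0.999, 0.99, 0.9]
combined = layered_availability(layers)
print(f"combined availability: {combined:.6f}")
```

With these inputs the chain is unavailable about one millionth of the time. Correlated failures, such as a shared network path, would make the real figure worse, which is why each layer still needs its own availability target.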

Feature Flag Management Server

The Feature Flag Management Server is the core component of the feature flag platform. It is responsible for storing and managing feature flag configurations, and for providing the endpoints that SDKs and Edge Agents use to evaluate feature flags. It consists of databases, caches, and compute instances, each of which requires a high-availability solution (for example, GCP's High Availability PostgreSQL solution for the database). A multi-region deployment, or a multi-zone infrastructure within the same region, is recommended.

In an enterprise-grade feature flag platform, there are typically two key components: the feature flag configuration service (APIs + UI Dashboard) and the feature flag evaluation service.

  • Feature Flag Evaluation Service: This is critical as it directly interacts with SDKs, Edge Agents, and external services. It must handle a high volume of requests, and any downtime can impact end users.
  • Feature Flag Configuration Service: Used for internal operations (developers, product managers, etc.), it generally does not need to support a high request volume.

To ensure high availability and scalability, we recommend decoupling the feature flag evaluation service from the configuration service. This allows the evaluation service to scale up and out independently, optimizing resource usage and cost efficiency. We also highly recommend designing the evaluation service as a serverless architecture, which makes it easier to maintain and scale.

FeatBit Architecture

The image above demonstrates a standalone architecture for the feature flag management server. The Evaluation Service and Configuration Service are separated, and PostgreSQL is designed for high availability across multiple zones within the same region.

In an enterprise-grade feature flag platform, data such as flag usage events can become a bottleneck under high request volumes. We recommend using a distributed event store and a stream-processing platform (such as Kafka) to prevent event record loss. As a result, a more robust high-availability solution for the feature flag management service is significantly more complex than the image above suggests (e.g., Multi-Region Kafka at Uber). We will continue to update this section in the future.

Edge Agents

There’s always a possibility that the remote server goes down or a network issue occurs between the remote server and the enterprise’s data center. To ensure continuous availability of the feature flag platform, edge agents are deployed within the enterprise’s data center.

How Edge Agents Work:

  1. The feature flag server synchronizes the latest feature flag configurations (which can be scoped to a specific project or environment) with edge agents.
  2. Both server-side and client-side SDKs can connect to an edge agent to evaluate feature flag values, just as they would with the remote server.
  3. Edge agents can also be deployed in multiple regions to maintain high availability. To ensure seamless connectivity, a traffic manager may be needed to route traffic to the nearest edge agent while keeping the same connection endpoints.

Key Considerations:

  1. In-Memory Storage: Feature flag configurations should be stored in memory to maintain functionality even if the remote server goes down.
  2. Persistence for Restarts: If an agent needs to restart while the remote server is down, it should be able to load the latest feature flag configuration from persistent storage.
  3. Data Consistency: A mechanism is needed to ensure feature flag configurations remain consistent across agents deployed in different locations.

For the second point, using local persistent storage is a reliable solution. Possible implementations include local files (SQLite), Redis, or a CDN. However, CDNs have a drawback — configuration updates are not real-time and may experience propagation delays.
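The first two considerations can be sketched as follows. This is a minimal illustration, not the FeatBit agent's actual implementation: the class name `AgentFlagStore`, the single-row `snapshot` table, and the monotonically increasing integer `version` are all assumptions made for the example. Reads are served from memory, and every accepted sync is also written to a local SQLite file so that a restarted agent can recover without the remote server.

```python
import json
import os
import sqlite3
import tempfile


class AgentFlagStore:
    """Hypothetical edge-agent store: memory for reads, SQLite for restarts."""

    def __init__(self, db_path):
        self._flags = {}   # served from memory on every evaluation
        self._version = 0  # assumed: the server sends increasing versions
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS snapshot ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), version INTEGER, payload TEXT)"
        )
        self._db.commit()
        self._load_snapshot()

    def _load_snapshot(self):
        row = self._db.execute(
            "SELECT version, payload FROM snapshot WHERE id = 1"
        ).fetchone()
        if row is not None:
            self._version, self._flags = row[0], json.loads(row[1])

    def apply_sync(self, version, flags):
        """Accept a configuration from the server; ignore stale versions."""
        if version <= self._version:
            return False
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO snapshot (id, version, payload) "
                "VALUES (1, ?, ?)",
                (version, json.dumps(flags)),
            )
        self._flags, self._version = dict(flags), version
        return True

    def get(self, key, default=None):
        return self._flags.get(key, default)

    @property
    def version(self):
        return self._version

    def close(self):
        self._db.close()


# Usage: sync once, then simulate a restart while the server is down.
db_path = os.path.join(tempfile.mkdtemp(), "agent.db")
store = AgentFlagStore(db_path)
store.apply_sync(1, {"new-checkout": True})
store.close()

restarted = AgentFlagStore(db_path)  # no server contact needed
print(restarted.get("new-checkout"), restarted.version)
```

Writing to disk before updating memory means a crash mid-sync never leaves the agent serving a configuration it could not reload later. The version check is only a local guard against stale updates; it does not solve cross-agent consistency.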

Maintaining data consistency across multiple edge agents in different regions presents a challenge, which is discussed further in the Challenges section.

NOTE: A diagram of this design will be added in a future update.

Server-side SDK

Combining the SDK with persistent configuration storage should be considered a high-uptime solution in its own right. The details differ between server-side and client-side SDKs.

In the server-side SDK, feature flags are evaluated locally using configurations stored in memory. If the SDK loses contact with the FeatBit server or Agent, it continues to function and automatically retries the connection.
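The behavior described above can be sketched roughly as follows. This is not the FeatBit SDK's API: `LocalEvaluator`, `backoff_delays`, `reconnect`, and the simplified flag schema (`enabled`, `targets`, `fallthrough`) are hypothetical names chosen for illustration. The point is that evaluation reads only from memory, so a lost connection affects freshness rather than availability, while reconnection is retried in the background with capped exponential backoff.

```python
import random
import time


class LocalEvaluator:
    """Hypothetical in-memory evaluator with a simplified flag schema."""

    def __init__(self, flags):
        self._flags = dict(flags)
        self.connected = True

    def on_disconnect(self):
        self.connected = False  # evaluation keeps using the last known flags

    def on_update(self, flags):
        self._flags = dict(flags)
        self.connected = True

    def variation(self, key, user_id, default):
        flag = self._flags.get(key)
        if flag is None or not flag.get("enabled", False):
            return default
        return flag.get("targets", {}).get(user_id, flag["fallthrough"])


def backoff_delays(base=1.0, cap=60.0, attempts=6, jitter=random.random):
    """Yield capped exponential delays, scaled by a jitter factor in [0, 1)."""
    for attempt in range(attempts):
        yield min(cap, base * 2 ** attempt) * jitter()


def reconnect(connect, delays, sleep=time.sleep):
    """Call connect() until it succeeds or the delays run out."""
    for delay in delays:
        try:
            return connect()
        except ConnectionError:
            sleep(delay)
    return None


# Usage: the connection drops, but evaluation still answers from memory.
evaluator = LocalEvaluator(
    {"dark-mode": {"enabled": True, "targets": {"u1": "on"}, "fallthrough": "off"}}
)
evaluator.on_disconnect()
print(evaluator.variation("dark-mode", "u1", "off"))
```

Jitter spreads reconnect attempts out so that many SDK instances do not hit a recovering server at the same moment.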

The FeatBit Server provides an endpoint for exporting the latest feature flag configuration as JSON data. This allows you to create a program that calls the endpoint periodically (e.g., every minute) and syncs the latest configuration to persistent storage within your private network.

If the FeatBit server is down when your service restarts, the FeatBit server SDK can load the feature flag configuration from persistent storage, so it starts with the most recently synced data rather than with nothing.
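A sync program of this kind could look roughly like the sketch below. The export call is passed in as a hypothetical `fetch_export` callable because the endpoint's URL and authentication are not covered here, and the function names and the snapshot file are likewise assumptions. The snapshot is written to a temporary file and then renamed, so a service that restarts in the middle of a sync never reads a half-written file.

```python
import json
import os
import tempfile


def save_snapshot(config, path):
    """Write the exported configuration atomically (temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(config, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic when both are on one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path):
    """Return the last saved configuration, or None if none is usable."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def sync_once(fetch_export, path):
    """One sync cycle: keep the previous snapshot if the export call fails."""
    try:
        config = fetch_export()
    except OSError:  # includes ConnectionError and urllib's URLError
        return False
    save_snapshot(config, path)
    return True


# Usage: one successful sync, then an outage, then a service restart.
snapshot_path = os.path.join(tempfile.mkdtemp(), "flags.json")
sync_once(lambda: {"flags": [{"key": "new-checkout", "enabled": True}]}, snapshot_path)


def server_down():
    raise ConnectionError("FeatBit server unreachable")


sync_once(server_down, snapshot_path)      # fails; the old snapshot is kept
bootstrap = load_snapshot(snapshot_path)   # what a restarting service loads
print(bootstrap)
```

In practice `sync_once` would be run on a schedule, for example from a cron job or a small loop that sleeps for a minute between cycles, and the snapshot could equally live in Redis or object storage instead of a file.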

Allowing the server SDK to save the feature flag configuration to the local disk of the machine it runs on could also be a solution. However, in modern cloud-native applications, where instances are often ephemeral or serverless, saving state to the local disk is not recommended.

Below is a diagram of a solution we discussed for the .NET server-side SDK:

.NET Server-side SDK

If the Agent is used as middleware between the SDK and the FeatBit Server, you can simply replace the FeatBit Server with the Agent in the diagram above.

Client-side SDK

In the client-side SDK, feature flags are evaluated remotely and the resulting values are stored locally (in memory and on disk). Once the application has loaded the feature flag values, it keeps working during an outage, even if the application restarts.

However, there is a challenge for users who encounter the feature flag for the first time. If the remote server or edge agent is down, the SDK will not be able to retrieve the latest feature flag values. We discuss this further in the Challenges section.
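The difference between a returning user and a first-time user can be illustrated with the sketch below. It is not the FeatBit client SDK: `ClientFlagCache`, the cache file, and in particular the idea of shipping default values with the application are assumptions made for the example. Shipped defaults are one common mitigation for the first-use case, not a solution FeatBit has committed to.

```python
import json
import os
import tempfile


class ClientFlagCache:
    """Hypothetical client-side cache of remotely evaluated flag values."""

    def __init__(self, path, defaults):
        self._path = path
        self._defaults = dict(defaults)  # assumed: shipped with the app
        self._values = self._read_disk()

    def _read_disk(self):
        try:
            with open(self._path, encoding="utf-8") as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return {}

    def refresh(self, fetch_values):
        """Try the remote evaluation; keep cached values if it is unreachable."""
        try:
            values = fetch_values()
        except OSError:
            return False
        self._values = dict(values)
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(self._values, f)
        return True

    def variation(self, key):
        if key in self._values:
            return self._values[key]
        return self._defaults.get(key)


def server_down():
    raise ConnectionError("remote server and edge agent unreachable")


cache_path = os.path.join(tempfile.mkdtemp(), "client-flags.json")
defaults = {"new-checkout": False}

# First-time user during an outage: nothing cached, so the default applies.
first_time = ClientFlagCache(cache_path, defaults)
first_time.refresh(server_down)
first_value = first_time.variation("new-checkout")

# The service recovers, values are cached, and the app later restarts offline.
first_time.refresh(lambda: {"new-checkout": True})
returning = ClientFlagCache(cache_path, defaults)
returning.refresh(server_down)
returning_value = returning.variation("new-checkout")
print(first_value, returning_value)
```

The returning user keeps the last evaluated value, while the first-time user only gets a static default that may not match their targeting rules. That gap is the open problem.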

Challenges

Data Consistency Between Applications and Edge Agents

Ensuring data consistency between edge agents is crucial. If the remote server goes down, edge agents in different regions may end up with inconsistent feature flag configurations. The FeatBit team already has a solution for this problem, which we will discuss in a future update.

Client-Side SDK

We are considering storing feature flag values for each user/project/environment in a CDN file. This would allow users to continue experiencing the feature seamlessly, even when switching devices. However, there is a challenge: when users encounter the feature flag for the first time on a new device, the SDK will not be able to retrieve the latest feature flag values, even if a CDN file is available.

Therefore, we need to find a solution to this issue. In other words, the CDN serves essentially the same purpose as the edge agent, since both act as distributed copies of the centralized feature flag management server, but the CDN has a latency issue.

Conclusion

By increasing the uptime of each component, we can bring the feature flag platform very close to 100% uptime. However, this has to be balanced against cost and complexity. Today, we think making the SDK itself a high-uptime solution is the most cost-effective approach.

We will continue to update this article as FeatBit deploys additional measures to guarantee 100% uptime for its feature flag management solution, especially for the challenges we discussed above.
