An architectural pattern for reducing DevOps complexity in web services is to offload application state to the various external APIs that the service integrates with. This approach has some drawbacks, but can be very useful for small teams working on a new project. In the limit, the application is entirely stateless, which simplifies the infrastructure and scalability story significantly. Many developers already do this to some extent, but it is rarely articulated as a specific goal.
The thesis of the approach is that the companies that host these APIs spend significant time and money to maintain them. As a result, for the cost of using a service that was already a planned integration, they provide:
- High availability, multi-region databases
- Automatic backup and restore
- Monitoring and dedicated SRE team
- Better security for stored data
There are some drawbacks to this approach, most notably vendor lock-in and lack of control over critical data. When the assumptions above turn out to be wrong, however unlikely that may be, you must rely on others to fix the outage, and if a change of vendor becomes necessary, you face a costly rewrite of your application.
A second difficulty is data access patterns: these APIs are not colocated with your application and are not always designed for single-request retrieval of the information you need. This can introduce request latency and requires care so as not to abuse the API. In some cases we’ve addressed this by adding caching to the API client.
Finally, this approach can slow down development, because adding each “table” requires careful consideration of whether and where to store the data, reading API documentation to determine what can be stored and how it can be accessed, and investigating potential consistency issues.
In this post we will share how this pattern was used to build DryDock, which is a registry-agnostic container image build service. For DryDock, we used Stripe and Kubernetes (AWS EKS) to store the necessary state for the application.
The main inspiration for this architecture was the amazing quality of Stripe’s API. This API is well-known for its excellent design and documentation, but it also offers the capability to store arbitrary metadata on most entities, and sometimes has particular support for things you may wish to track like metered usage.
It makes a lot of sense to use Stripe as a user database. To accept payments, Stripe already stores a wide variety of information about customers, including subscription, contact, and payment details. Creating a Customer does not require any particular information, nor does the customer need to be paying. Maintaining a separate table of users that must be kept in sync with Stripe only adds unnecessary complexity.
In DryDock’s case, we only store username and usage information (metered build time) for each customer, which Stripe has first-class support for. You can store arbitrary key-value pairs using the API’s metadata functionality for Customer and Subscription API objects. One thing to keep in mind is that these values must be strings, so you will need to handle serializing and deserializing other types.
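Because metadata values must be strings, one way to handle this is a small pair of helpers that JSON-encode and JSON-decode each value; a minimal sketch (the `stripe` call is shown only as a hypothetical example, and the field names are illustrative):

```python
import json

def encode_metadata(values):
    """Serialize arbitrary JSON-compatible values into the string-only
    key-value pairs that Stripe metadata accepts."""
    return {key: json.dumps(value) for key, value in values.items()}

def decode_metadata(metadata):
    """Reverse of encode_metadata: parse each string back into its value."""
    return {key: json.loads(value) for key, value in metadata.items()}

# Example: serialize usage information for a Customer (hypothetical fields).
meta = encode_metadata({"username": "octocat", "build_minutes": 42})
# meta == {"username": '"octocat"', "build_minutes": "42"}
# A real update would then look something like:
# stripe.Customer.modify(customer_id, metadata=meta)
```

Round-tripping through JSON keeps the types unambiguous, at the cost of quoted strings in the Stripe dashboard.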
Another complexity is record lookup. DryDock uses GitHub as an OAuth2 identity provider. A user’s session is stored in browser cookies to avoid storing session information server side. Each incoming request includes headers indicating the client’s username, but Stripe does not support Customer lookup by anything other than the Stripe Customer id, so each replica maintains an in-memory map from username to Customer id, populated from the Stripe API during startup.
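Building that in-memory index might look like the following sketch, where customers are plain dicts for illustration and the `username` metadata key is an assumption:

```python
def build_username_index(customers):
    """Build the in-memory username -> Stripe Customer id map from an
    iterable of Customer objects (represented as dicts here)."""
    index = {}
    for customer in customers:
        username = customer.get("metadata", {}).get("username")
        if username:
            index[username] = customer["id"]
    return index

# At startup, each replica would populate this from the real API, e.g.:
# index = build_username_index(stripe.Customer.list(limit=100).auto_paging_iter())
customers = [
    {"id": "cus_123", "metadata": {"username": "octocat"}},
    {"id": "cus_456", "metadata": {}},  # no username stored yet; skipped
]
index = build_username_index(customers)
# index == {"octocat": "cus_123"}
```

Customers created after startup would also need to be inserted into the map at creation time, since the index is only rebuilt on restart.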
Stripe can also be used to manage certain types of application configuration, such as settings related to product tiers, and offers a great user interface for this with audit capabilities. Metadata fields on those entities can define parameters like resource limits. DryDock is able to dynamically load this information from Stripe, display it to the user, and begin creating builds matching the changes right away.
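One way to read such configuration is to overlay metadata-defined values onto hardcoded defaults, so each tier only needs to define what differs. A sketch, with entirely hypothetical limit names and values:

```python
# Fallback limits; keys and values here are illustrative, not DryDock's.
DEFAULT_LIMITS = {"cpu_millicores": 1000, "memory_mib": 2048, "build_minutes": 60}

def tier_limits(metadata):
    """Overlay limits parsed from a Product's metadata (all strings)
    onto the defaults, casting each value to the default's type."""
    limits = dict(DEFAULT_LIMITS)
    for key, default in DEFAULT_LIMITS.items():
        if key in metadata:
            limits[key] = type(default)(metadata[key])
    return limits

# A "pro" tier Product might only override one field in its metadata:
limits = tier_limits({"build_minutes": "240"})
# limits == {"cpu_millicores": 1000, "memory_mib": 2048, "build_minutes": 240}
```

Because the values live on the Stripe entity, changing a tier in the dashboard takes effect on the next load without a deploy.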
Another record that must be tracked corresponds to the image definitions that our users configure to trigger builds. DryDock was already being designed to execute builds on Kubernetes, so choosing to store data there using Custom Resources had little marginal cost.
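A custom resource of this kind is declared with a CustomResourceDefinition manifest; the following is a sketch only, with a made-up group and illustrative fields rather than DryDock’s actual schema:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: imagedefinitions.drydock.example.com
spec:
  group: drydock.example.com
  scope: Namespaced
  names:
    plural: imagedefinitions
    singular: imagedefinition
    kind: ImageDefinition
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                repository:
                  type: string
                dockerfilePath:
                  type: string
                owner:
                  type: string  # username of the customer who configured it
```

Once applied, image definitions can be created, listed, and watched through the Kubernetes API like any built-in resource.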
Kubernetes typically represents substantial platform engineering, but using it purely for object storage only requires kube-apiserver and etcd, both of which are entirely maintained by the cloud provider in managed offerings (and can be used without nodes). The Kubernetes API provides a lot of nice things that many custom APIs end up re-implementing:
- High availability etcd storage
- Audit logging (CloudWatch with EKS)
- Extensible API with schema validation
- Object lifetime webhooks
- Efficient watching for object changes
- Rich client libraries in many languages
All of these features make an appealing case for creating your API by extending the Kubernetes API. There are many non-domain-specific behaviors that APIs need to exhibit, and this approach provides them uniformly so that developers can focus on the domain-specific parts.
There are a number of drawbacks to this approach. First, while you can in theory access this API directly from frontend applications, this would require careful RBAC management and does not let you store non-public information on your resources. Instead, we chose to implement a proxy which performs additional checks and filtering for each request, and merges this information with other data sources like Stripe.
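The proxy’s merge step can be sketched as a function that selects only the public fields of the Kubernetes resource and joins them with the customer’s Stripe data; all field names here are hypothetical:

```python
def public_view(resource, customer):
    """Merge the public fields of a Kubernetes custom resource with the
    customer's billing data from Stripe, dropping internal annotations.
    Field names are illustrative, not DryDock's real schema."""
    spec = resource.get("spec", {})
    return {
        "name": resource["metadata"]["name"],
        "repository": spec.get("repository"),
        "tier": customer.get("metadata", {}).get("tier"),
    }

resource = {
    "metadata": {
        "name": "my-image",
        # Internal detail that should never reach the frontend:
        "annotations": {"internal.example.com/node": "ip-10-0-0-1"},
    },
    "spec": {"repository": "github.com/octocat/app"},
}
customer = {"id": "cus_123", "metadata": {"tier": "pro"}}
view = public_view(resource, customer)
# view == {"name": "my-image", "repository": "github.com/octocat/app", "tier": "pro"}
```

Centralizing this filtering in one proxy means the frontend never needs Kubernetes credentials or Stripe keys.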
Another drawback is schema migration. Custom resources provide the capability for object versioning and schema migration, but this is a relatively heavy process. Just adding or altering a single field can be delicate, and care must be taken to test any “migration” that you have implemented.
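For reference, a CRD serves multiple versions through its `versions` list, with exactly one marked as the storage version and a conversion webhook translating between them; a config sketch (group and version names illustrative):

```yaml
spec:
  versions:
    - name: v1alpha1
      served: true    # old clients still work
      storage: false
    - name: v1
      served: true
      storage: true   # objects are persisted in this schema
  conversion:
    strategy: Webhook  # your service must implement the conversion
```

Even with this machinery, every stored object remains in the old storage version until rewritten, which is part of what makes migrations delicate.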
Finally, we have scalability concerns about this approach. Relative to a database like Postgres, a Kubernetes control plane can handle far fewer records and introduces latency. In the event that you need to vertically scale your control plane, this can be a difficult process with some managed offerings. We believe that this limit is still quite large when using the API efficiently, and would represent thousands of active users, making this a champagne problem for now.
As you can see, there are many appealing reasons to push the state of your application into external services. This architectural pattern is not without drawbacks, but for a small team working on an initial release, it can make a lot of sense.
We hope that you enjoyed this look into our engineering process, and encourage you to check out DryDock!