The growing pain of caches in microservices

Working with caches is supposed to make our lives better. But what if implementing them creates more problems?

It all starts with an innocent idea…

Imagine the following situation: a newly built, freshly minted microservice is up and running in the production environment. Developers love it, the team celebrates it, and everything goes really well. Until someone looks at the health monitor and realizes: hey, the microservice takes quite some time just to fulfill its requests. The developers start troubleshooting and find out that there are just too many calls to external entities, such as calls to relational databases or waits for responses from other microservices.

The developers start thinking that perhaps a cache mechanism is a great idea! After all, caches are meant to store ready-to-use data, right? So the developers work hard and diligently to implement this idea and voilà, the microservice now responds quicker! Latencies decrease and everybody is happy; case closed. Everybody starts assuming that caches just work forever and do not require any maintenance.

Over time (perhaps years), engineers come and go, and everybody forgets how important it is to maintain the caches in these microservices. As the number of instances grows to meet business needs, somebody starts noticing a problem: this fleet of microservices starts having cache failures, such as unavailable or expired caches. The problem also creates a domino effect, because other microservices depend heavily on the microservices in question. Those services start having outages, and everybody is looking to you to mitigate the issue.

What would you do?

The situation above is typical when building and maintaining microservices. It shows how caches were designed to be a great booster that complements the microservices. But as time goes by, the microservices become heavily dependent on their caches, and the caches are no longer a complementary feature; they are a binding, obligatory feature the microservices cannot live without.

Simply put, the microservice has become addicted to its own caching mechanism.

This article attempts to elaborate on the considerations when implementing caches, why we need them (or do we really?), and how to implement and monitor a cache mechanism that makes full use of the advantages each cache type offers.

Why do we need caches in the first place?

The first question we should ask ourselves before developing is why caches are necessary for our microservices in the first place. The answer may vary from one team to another. Some teams want to reduce latencies to a minimum. Some want to reduce database calls, or to provide easy-to-work-with configurations so that other engineers need minimal development effort when tackling projects. Perhaps a combination of these reasons is what motivates a team to implement caches in its microservices. In my line of work, we use caches to improve payment experiences by storing configurations and properties for quicker data access (instead of relying on database calls), and to reduce development effort in the future. Whatever the reasons may be, they have to be technically and economically justifiable; and most importantly, the cache must be impactful and improve your core business. After all, we build systems to serve our business, right?

The biggest mistake a software developer can make is to implement a cache mechanism without any justifiable reason, or just for the sake of being cool. You must fully understand the impact caches bring, including their advantages and disadvantages. One advantage is quicker access to data, which is something we should all strive for. However, caches are often stored locally on each microservice instance (i.e. in-memory caches, which are elaborated on in later sections), which may create a cache coherence issue: a situation whereby cached data is inconsistent across a fleet of instances. This may be due to improper data handling, among many other causes. For example, the operations team may want to update configuration properties that are stored locally in each and every microservice instance. After applying the changes, it turns out that not all instances have the latest configuration, due to the cache coherence issue. This creates inconsistency among the instances, as one instance may respond differently from another instance of the same microservice.

Another mistake a software developer can make is to assume that a cache mechanism is the ultimate all-in-one solution for storing data. I personally think that caches are a complementary solution, not the main solution. As an analogy: when you are sick, you may need to take medicine to get healthy. But there are various ways to keep yourself healthy, for instance drinking more water, sleeping sufficiently, and exercising regularly, and taking medicine every day is not necessarily good for your health. Similarly, caches are here to provide quicker access, but they are not necessarily your permanent solution for storing data. In fact, storing all data only in caches is extremely dangerous, as the data may be lost if the instance experiences an outage.

Moreover, when assessing whether a cache is necessary, it is important to determine how often a given piece of data is accessed by your microservice. If the data is accessed frequently, or if it tends to be static (i.e. its value rarely changes, such as configuration properties), you should consider caching it. More dynamic data tends to have accuracy issues when cached (more on this later), so you should also consider how frequently your caches are refreshed to reach eventual consistency.

Having a strong, justifiable reason is very important when developing caches. In a team of developers, the initiator usually voices the intention to the team, and the team discusses and decides whether it is feasible and easy to implement. There may be a few differences of opinion among developers, but we often come out with a very strong outcome.

Considerations when building a cache mechanism

Cache Types

There are various factors to consider when developing caches. One of them is the type of cache you want to build. In-memory caches are the most common; they usually live in a hash table on each instance of the microservice. They are easy to implement, have low operational overhead, and carry low risk when integrating with other services. However, as mentioned previously, cache coherence can be a potential issue, so these caches must be managed carefully. In-memory caches also have a start-up cost: a microservice that has just started may take some time to load all of its caches, while incoming requests may already be pouring in and need those caches to be fulfilled. As a result, the microservice cannot operate at full speed instantly and may perform terribly in its first few minutes, before slowly recovering to normal. This is referred to as a cold start, and it is a common pitfall when developing a cache mechanism.

A simple illustration of the differences between in-memory and external caches.

On the other hand, external caches, in which cached data is stored in a centralized instance, can also be a robust solution. Their advantage is coherent data across the fleet of microservices, as the information is stored in one central entity. However, they require some effort to develop, as microservices need to perform some sort of remote call to get the data. This introduces more complexity into existing microservices, such as timeouts when requesting data. If this is not handled carefully, a microservice may get null data and throw an exception. This is also a common pitfall; you must avoid it by providing a backup mechanism in case the cache is unavailable.
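To make the timeout pitfall concrete, here is a minimal sketch of a guarded remote lookup. It is not our production code: the class name GuardedExternalCache, the remoteLookup and fallback functions, and the timeout value are all assumptions made for illustration. The idea is simply to wait for the external cache for a bounded time and to use a backup source instead of passing null or an exception to the caller.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical wrapper around a remote cache call; all names are illustrative.
class GuardedExternalCache {
    private final Function<String, String> remoteLookup; // stands in for the network call
    private final Function<String, String> fallback;     // backup source, e.g. safe defaults
    private final long timeoutMillis;

    GuardedExternalCache(Function<String, String> remoteLookup,
                         Function<String, String> fallback,
                         long timeoutMillis) {
        this.remoteLookup = remoteLookup;
        this.fallback = fallback;
        this.timeoutMillis = timeoutMillis;
    }

    // Returns the remote value if it arrives in time, otherwise the fallback value.
    Optional<String> get(String key) {
        CompletableFuture<String> call =
                CompletableFuture.supplyAsync(() -> remoteLookup.apply(key));
        try {
            String value = call.get(timeoutMillis, TimeUnit.MILLISECONDS);
            if (value != null) {
                return Optional.of(value);
            }
        } catch (TimeoutException | ExecutionException e) {
            // Timed out or failed: stop waiting and use the backup source below.
            call.cancel(true);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            call.cancel(true);
        }
        // Null, timeout, or error: never hand a raw null to the caller.
        return Optional.ofNullable(fallback.apply(key));
    }
}
```

Returning an Optional forces the caller to decide what to do when neither the external cache nor the backup has the value, instead of discovering a null later in the request.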

Data to cache and its size

Another factor to consider is what data you want to cache. Caches are often small in size, as they are not meant for persistent storage. Thus, the key here is to be wise when deciding which data to cache.

Always remember that caches are a complementary feature of your microservices. While they are great to have, your microservices must not cache every single piece of data that exists in them. This is often regarded as a general tip everybody should already know, but I have often found developers becoming complacent and starting to cache everything. To preserve the scalability and maintainability of your microservices, keeping a close eye on which data gets cached is critical.

For example, static data such as endpoint URLs, switches, and constant properties can be cached, while more dynamic data, such as users who are currently transacting, can be stored with a reasonable expiry time. Again, the solution may vary from one team to another and depends on one’s business needs. The important takeaway is to manage your cache effectively and efficiently.

If cache size becomes critical, another solution is to enforce cache eviction, a mechanism in which some cache entries are purged to make room when the cache hits its full capacity. Usually the least recently used (LRU) data is purged first. When the purged data is suddenly needed, a remote call can be made to an external cache to get the latest version of the data. Finally, we can give each entry a time-to-live (TTL) so that data that is rarely used can be safely purged from the cache before eviction has to be enforced. Similarly, when the data is needed after its cache entry has expired, the microservice can get the data from an external cache.
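Both ideas fit in a few lines of Java. The sketch below is an illustration under my own assumptions (the names LruTtlCache and TimedValue, and the injected clock, are hypothetical): it relies on the access-order mode of java.util.LinkedHashMap for LRU eviction and stores an expiry timestamp next to each value for the time-to-live. A null return means the caller should go to the external cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Illustrative in-memory cache with LRU eviction and a per-entry time-to-live.
class LruTtlCache<K, V> {
    // Value together with the time at which it stops being valid.
    private static final class TimedValue<V> {
        final V value;
        final long expiresAtMillis;

        TimedValue(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final LongSupplier clock; // injected so that expiry can be tested
    private final Map<K, TimedValue<V>> entries;

    LruTtlCache(int maxEntries, long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        // accessOrder = true orders entries from least to most recently used.
        this.entries = new LinkedHashMap<K, TimedValue<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, TimedValue<V>> eldest) {
                // Called after each insertion: evict the LRU entry when over capacity.
                return size() > maxEntries;
            }
        };
    }

    synchronized void put(K key, V value) {
        entries.put(key, new TimedValue<>(value, clock.getAsLong() + ttlMillis));
    }

    // Returns null when the entry is missing, evicted, or expired.
    synchronized V get(K key) {
        TimedValue<V> entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (clock.getAsLong() >= entry.expiresAtMillis) {
            entries.remove(key); // expired: purge it and report a miss
            return null;
        }
        return entry.value;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

In a service you would pass System::currentTimeMillis as the clock. Note that expired entries are only purged lazily here, when they are read; a background sweep is another option if memory is tight.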

Availability and Accuracy

Finally, caches must be readily available and they must be correct. Since microservices work around the clock, the cache must be accessible at all times, yet outages are impossible to avoid entirely. As developers, we must assume that caches can become unavailable at any given time. Thus, it is important to implement a backup mechanism that can serve the microservices when caches become unavailable.

Data accuracy is also important, as caches must always provide the latest and most correct data when requested. Returning expired or incorrect data may introduce behavioral errors and bugs into your critical systems. Again, this may be a simple tip for everyone, but when more dynamic data is cached (data that changes its state as time goes by), developers are often not careful enough and end up returning incorrect data.

The combination of availability and accuracy indicates the effectiveness of your caching mechanism. An always-available but inaccurate cache returns incorrect data, and an accurate but unavailable cache creates outages. Both factors are equally important when implementing caches.

Implementations

Now that we have taken every consideration into account, it is time to implement.

When implementing caches, one crucial thing to keep in mind is that we must build the mechanism to last for a long time. This means the implementation must be easy to maintain and debug, and must make full use of any advantages that we have. The microservices may not necessarily perform better after the cache is implemented, so we need a monitoring system to verify the effectiveness of the mechanism. I will divide this section into three parts: develop, maintain, and document.

Develop

Combining the best of both worlds, in-memory and external caches, is the approach that I often use. Every microservice has a standardized in-memory cache that is loaded during start-up. This lets the cache mechanism work as intended: frequently used data is served from local memory.

To avoid a cold start, another simple microservice called the cache broker is deployed. It acts as a backup in case the in-memory cache is not ready and has yet to be loaded. The cache broker has a direct connection to an external cache, which allows data to be transferred. The cache broker also notifies any microservices that listen to a particular type of data, and those microservices fetch the latest data in the background whenever the data changes. This way, cache coherence issues can be resolved within seconds (less than three seconds would be a good benchmark to start with).
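The notification part can be sketched as a small publish/subscribe arrangement. The code below is an in-process simulation under my own assumptions, not the real cache broker: the names InProcessCacheBroker and ServiceInstance are hypothetical, a map stands in for the external cache, and the new value is pushed synchronously. A real broker would deliver notifications over the network (for example through a message queue), and each microservice would refresh its data in the background.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Hypothetical in-process stand-in for the cache broker.
class InProcessCacheBroker {
    // Stands in for the external cache the broker is connected to.
    private final Map<String, String> externalCache = new ConcurrentHashMap<>();
    // Listeners grouped by the type of data they are interested in.
    private final Map<String, List<BiConsumer<String, String>>> listeners =
            new ConcurrentHashMap<>();

    void subscribe(String dataType, BiConsumer<String, String> listener) {
        listeners.computeIfAbsent(dataType, t -> new CopyOnWriteArrayList<>()).add(listener);
    }

    // Stores the new value, then notifies every listener of that data type.
    void update(String dataType, String key, String value) {
        externalCache.put(dataType + ":" + key, value);
        List<BiConsumer<String, String>> interested =
                listeners.getOrDefault(dataType, Collections.emptyList());
        for (BiConsumer<String, String> listener : interested) {
            listener.accept(key, value);
        }
    }

    String get(String dataType, String key) {
        return externalCache.get(dataType + ":" + key);
    }
}

// Hypothetical microservice instance that keeps its in-memory cache in sync.
class ServiceInstance {
    private final Map<String, String> localCache = new ConcurrentHashMap<>();

    ServiceInstance(InProcessCacheBroker broker, String dataType) {
        // On every notification, overwrite the local copy with the new value.
        broker.subscribe(dataType, localCache::put);
    }

    String localValue(String key) {
        return localCache.get(key);
    }
}
```

With two instances subscribed to the same data type, a single update on the broker brings both local caches to the same value, which is exactly the coherence property we are after.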

Combining both in-memory and external caches produces a great result!

When retrieving the most recent data, a timeout may occur between a microservice and the cache broker. If it does, the microservice records that it has missing data, and if that data is required to fulfill one of its requests, it proactively fetches the data from the cache broker and rebuilds its own cache. This way we can recover from a missing cache entry and, at the same time, avoid having to ask the external cache again for the same data.

The following is a simple illustration of how to use both in-memory and external caches in a microservice. The implementation of CacheBrokerService is intentionally not shown for simplicity:

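import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ChannelConfigurationCacheImpl {
    // In-memory cache, keyed by channel ID; thread-safe for concurrent requests.
    private final Map<String, String> channelConfigurationMap = new ConcurrentHashMap<>();

    // Client for the cache broker (implementation not shown).
    private CacheBrokerService cacheBrokerService;

    public String getChannelConfiguration(String channelId) {
        String result = channelConfigurationMap.get(channelId);
        if (result == null) {
            // Local miss: ask the cache broker, then rebuild the local entry.
            result = getFromCacheBroker(channelId);
            if (result != null) {
                channelConfigurationMap.put(channelId, result);
            }
        }
        return result;
    }

    public String getFromCacheBroker(String channelId) {
        return cacheBrokerService.getChannelConfiguration(channelId);
    }

    public void setCacheBrokerService(CacheBrokerService cacheBrokerService) {
        this.cacheBrokerService = cacheBrokerService;
    }
}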

Finally, let us implement a backup to the backup plan. The database that holds the source data for the cache is connected to the cache broker, and database calls are used as a last resort to get the data. Hopefully, we never have to use them.
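Put together, the lookup order is: in-memory cache first, then the cache broker, then the database. Here is a minimal sketch of that chain. The class name TieredConfigurationLookup and the two lookup functions are assumptions for illustration; in this simplified version a failing source is treated the same way as a miss, and the in-memory cache is rebuilt with whatever value is found.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative lookup chain: in-memory cache, then cache broker, then database.
class TieredConfigurationLookup {
    private final Map<String, String> inMemory = new ConcurrentHashMap<>();
    private final Function<String, String> brokerLookup;   // hypothetical broker call
    private final Function<String, String> databaseLookup; // hypothetical last resort

    TieredConfigurationLookup(Function<String, String> brokerLookup,
                              Function<String, String> databaseLookup) {
        this.brokerLookup = brokerLookup;
        this.databaseLookup = databaseLookup;
    }

    String get(String key) {
        String value = inMemory.get(key);
        if (value != null) {
            return value; // fastest path: local memory
        }
        value = tryLookup(brokerLookup, key);
        if (value == null) {
            // Broker missed or is unavailable: fall back to the database.
            value = tryLookup(databaseLookup, key);
        }
        if (value != null) {
            inMemory.put(key, value); // rebuild the local entry for next time
        }
        return value;
    }

    // A failing source is treated like a miss so that the next tier is tried.
    private static String tryLookup(Function<String, String> source, String key) {
        try {
            return source.apply(key);
        } catch (RuntimeException e) {
            return null;
        }
    }
}
```

Because the local entry is rebuilt after a successful fallback, the database is only hit once per key even while the broker stays down.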

The design above takes advantage of what every type of cache mechanism has to offer. By combining in-memory and external caches, data is retrieved systematically: first from the entity with the fastest access, then from progressively slower entities as backup plans.

On the other hand, there is no such thing as a perfect system. The above implementation still somewhat depends on database calls, which, on the surface, defeats the purpose of caching. The recurring theme is to choose a solution that meets your business needs. In my line of work, a slow response is better than no response at all (but not too slow!), and the design above performs a remote call to the database only when all other approaches fail.

Finally, the train does not stop here; implementing a cache mechanism is a long road to Rome. That is why the next step, maintain, is crucial; it is elaborated on in the next section.

Maintain

As software engineers, we must always measure and monitor our implementations to conclude whether they are correct and, most importantly, impactful. By monitoring cache usage or network penalties, we can deduce the next actions to improve our cache even further.

A simple indicator is cache capacity, which can usually be retrieved with a built-in command in Redis or Memcached. Check whether the cache usage spikes within a short period of time, and whether you are caching the correct data (remember, do not cache unnecessary data). If possible, it would be great to measure how often a cache entry is missing, or how often the external cache is invoked (commonly referred to as the network penalty, in which you must request data from the external cache because your in-memory one is unavailable). The possibilities here are endless; you can be creative and measure the effectiveness of each data type you cache, or measure how often a piece of data is accessed by sorting from least to most recently used.
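As a starting point, hits and misses can be counted right where the in-memory cache is read. The sketch below is only an illustration under my own assumptions (the class name MeteredCache and the externalLookup function are hypothetical); every miss in it also corresponds to one call to the external cache, i.e. one network penalty. In a real service you would export these numbers to your monitoring system rather than read them from getters.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

// Illustrative in-memory cache that counts hits and misses.
class MeteredCache {
    private final Map<String, String> local = new ConcurrentHashMap<>();
    private final Function<String, String> externalLookup; // hypothetical external cache call
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    MeteredCache(Function<String, String> externalLookup) {
        this.externalLookup = externalLookup;
    }

    String get(String key) {
        String value = local.get(key);
        if (value != null) {
            hits.increment();
            return value;
        }
        // Every miss costs one call to the external cache (the network penalty).
        misses.increment();
        value = externalLookup.apply(key);
        if (value != null) {
            local.put(key, value);
        }
        return value;
    }

    long hits() {
        return hits.sum();
    }

    long misses() {
        return misses.sum();
    }

    // Fraction of reads served from local memory; 0.0 before any read.
    double hitRatio() {
        long total = hits.sum() + misses.sum();
        return total == 0 ? 0.0 : (double) hits.sum() / total;
    }
}
```

A hit ratio that drops suddenly, or a miss count that climbs right after a deployment, is usually the first visible sign of the cold start and eviction problems described earlier.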

Finally, maintaining caches is not a difficult task, yet it is often overlooked due to massive workloads and other commitments. Keep in mind that the cache mechanism is here to make your job easier, and it would not hurt to take some time out of your crazy schedule to maintain it.

Document

Ah, the last missing piece of the puzzle: documentation. We developers tend to procrastinate when it comes to writing down and documenting our progress, and sometimes we just assume that someone will simply understand. If we do that, I believe other developers who try to understand what you implemented months or perhaps years later will shake their heads and say, “It would be much easier for me to understand if there were documentation!”

So, let’s make our lives easy by making others’ easier. The way to go is to start a “living” document, with proper versioning, that evolves from the first time the cache mechanism is implemented. It does not have to be tidy and neat; it just needs to contain all the information a fellow software engineer, regardless of experience and seniority level, needs in order to understand the mechanism.

What comes next?

Finally, I hope that after reading this article you have some idea of how to implement a cache mechanism yourself, be it for a school project, the company you work for, or simply a hobby. Let your curiosity guide you on your path, and I hope you become a better craftsman in software development.

Note: I want to point out that these implementations cannot be done alone, and teamwork plays a huge part in the success of any implementation. Huge shout-out to my partner Ando, who took the time to work on this and had my back throughout the implementation. Kudos!

Software Developer @ DANA