Lies, Damned Lies & Timeouts • Engineering • Yao Yue • YOW! 2017
Table of contents
- Don't let timeouts fool you; they're often just the tip of the iceberg in a much bigger system issue.
- Understanding the limits of your system is key to mastering its performance; embrace simplicity and resilience in the face of complexity.
- In distributed systems, handling failure isn't just about retries; it's about making smarter decisions when things go wrong.
- In complex systems, what works in theory often leads to chaos in reality; the deeper the stack, the more unpredictable the outcomes.
- In a multi-tenant environment, your service isn't just competing for resources; it's fighting for survival against hidden complexities and unpredictable neighbors.
- Timeouts in distributed systems often mislead us about service health; they don't equal causality. To truly understand issues, we need to look beyond the surface and find the real signals.
- In distributed systems, remember: correlation does not equal causation. Every retry and timeout can create unexpected chaos, turning a healthy service into a bottleneck.
- Timeouts can create a perfect storm of cascading failures, dragging healthy services into chaos if not managed carefully.
- Relying solely on local information can turn small failures into global disasters; to truly understand and prevent issues, we need a broader perspective and multiple feedback loops.
- Understanding the nature of failures in microservices is crucial; visibility comes at a cost, so prioritize tracing and thoughtful configuration over blindly copying others.
- Design your systems with back pressure in mind; it’s the key to scaling effectively and avoiding catastrophic failures.
Don't let timeouts fool you; they're often just the tip of the iceberg in a much bigger system issue.
Hello everybody! Today, I'm going to talk about timeouts. You cannot trust them, as they often give you misconceptions—sort of like how this is called a green room, but it's not green. Yet, you keep believing that it's the Green Room. This discussion is mostly based on my experience working at Twitter, so let me share a little story.
This is a very famous tweet, at least within Twitter, from the 2014 Oscars, where Ellen called her followers to retweet and favorite this tweet as much as possible. While I do not believe that this is the best photo ever, it was the Tweet of the Year in 2014 for Twitter. For anyone who was actually on call that night, they remember it as the tweet that caused a spectacular incident, keeping everybody busy for the next several months. As much as Ellen wants to believe that she single-handedly brought down Twitter, the reality is she started it with her enthusiastic followers. However, the whole situation quickly spiraled out of control due to retries, timeouts, and all those mechanisms in our data centers that were supposed to prevent bad things from happening. At the center of that storm was actually cache.
This brings me to who I am. I was working on cache at Twitter for about six and a half years, and I've been there for a total of seven years. For the majority of that time, I was focused on cache, which makes me a dinosaur, especially by Silicon Valley standards. Now, I have carried the spirit of cache onto my new team, which works on performance in general. Cache is an interesting example because it's such a simple concept but critical in terms of performance. This allowed me to observe a lot of the distributed system behaviors at play at a very high scale.
So, why do I want to give this talk? For a number of years, I was on cache, which means I was on call for cache during all those years. Every time something happened—and inevitably, the things that matter use cache—someone would observe that their cache requests started timing out. When that happened, their retries kicked in, and they would say, "Oh, the cache is not behaving quite well; let's page the cache on call." So, I often got woken up in the middle of the night to look at a bunch of dashboards, and I would say, "You know what? I don't think this is cache."
If anybody watches House MD, this happened about as often as when they say, "It's not lupus." It's never lupus, except when it is—like once or twice. You can imagine that when I went to the incident room or the chat room to tell people, "It's not my fault," they weren't the happiest about it. They kept pressing, asking, "Okay, if it's not cache, why are my cache requests timing out? Why am I retrying cache requests? You have to have an explanation for this. If you think it's something else, tell us what that thing is!"
That's when I started looking at other people's code because I realized this would keep happening if I didn't tell them exactly what was going on. I did that for a number of years, and after explaining to about 20 or 30 teams—who knows how many—I realized that maybe it wasn't just one, two, or three teams experiencing this issue. Perhaps it was worth telling more people about what actually happens when you see behavior like that.
The reason I'm here at a conference called YOW! is that it's literally calling my name. When I got the invitation, I thought, "I have to come here!" To my knowledge, I'm the first Yao speaking at YOW!, so I'm very proud of that.
What do I know? I think you need to understand my limitations to take the right message away. I've only really worked on cache in any kind of depth, which shapes my view. If you have a system that's very different, it may behave differently, so this may not matter as much. The other thing to note is that I have been working in Twitter's environment, which may have some similarities to what you're familiar with, but maybe not.
Understanding the limits of your system is key to mastering its performance; embrace simplicity and resilience in the face of complexity.
Let's look a little bit more closely at what I mean by these two things. First, what is caching in data centers? Caching data really is just a very large distributed in-memory key-value store. The operations are extremely simple: you open a TCP connection, send a request in some very cheap-to-encode format, and you get a response back. The catch here is that we're serving many, many requests at any given time. I'm talking about a single cluster serving tens of millions of requests per second, or collectively hundreds of millions of requests per second, to give you an idea of the scale. We can achieve this because it's really straightforward; I'm not trying to say I'm working on super fancy stuff. It's very simple.
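To make that simplicity concrete, here is a minimal sketch of what such a request can look like, assuming a memcached-style ASCII exchange purely for illustration; the talk does not specify the exact wire format, and the host, port, and key below are made up.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class CacheGetSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical cache host and port; the memcached-style ASCII form is used
        // here only to illustrate how cheap the encoding is.
        try (Socket socket = new Socket("cache.local", 11211)) {
            OutputStream out = socket.getOutputStream();
            out.write("get user:12345\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            // Typical reply: a "VALUE <key> <flags> <bytes>" line, the payload, then "END".
            String line;
            while ((line = in.readLine()) != null && !line.equals("END")) {
                System.out.println(line);
            }
        }
    }
}
```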
The other important aspect is that it has very tight latency expectations. People expect cache requests to come back usually under a millisecond, but in the worst cases, only a few milliseconds or tens of milliseconds. This means that there is much less room for all kinds of weird things.
Now, what does Twitter's environment mean? Twitter's environment has gone through architectural changes. We started with Ruby on Rails, which we call the "monorail," and then it was broken down into a Service-Oriented Architecture (SOA). Today, everybody wants to call it microservices; we will come to that later. Because we own our data centers, there are many things we can do right, and we have access to the physical hardware. Twitter is known as a Scala shop, but in reality we use both Java and Scala; the commonality is that most of the heavy lifting happens on top of the Java Virtual Machine (JVM). If you don't run on the JVM, things may be different.
We have been very aggressively moving to a containerized environment. Our container solution of choice is Aurora on Mesos, and I think Kubernetes actually shares a lot of the same constraints when it comes to resources. In terms of scale, what I have seen is thousands of nodes per service, maybe 10,000 at the largest. With that picture of the environment in mind, let's take a step back and see why we are talking about timeouts at all.
It's a very short introduction, so don't worry. I think it boils down to the fact that the real world is not perfect. In the 1970s, there were papers and discussions about what is formalized as the Two Generals' Problem. What it essentially states is that you cannot reach consensus over links that are unreliable. Theoretically, the system can end up in any state; it's just not going to be happy. However, that did not stop engineers. Engineers say, "Theoretically, we're not going to reach consensus, but that doesn't matter in practice; we can get very, very close to the ideal."
The way they do that is by using a timer. They say, "I will bound my attempt by this amount of time; if things don't work out, I'll do it again." This gives us retries. Of course, the failure could happen in either direction of the communication, but in this case it doesn't make a difference for the person on the left; she will still retry upon the timeout. There are other systems in which you would just proactively send out many messages, hoping that a majority of them, or some number of them, will come back. This is known as quorum. It's not used as widely, because sending a lot of requests instead of one and waiting is a lot more costly.
In distributed systems, handling failure isn't just about retries; it's about making smarter decisions when things go wrong.
The essence of the engineering approach to the problem of imperfect communication revolves around timeouts and retries. This is particularly relevant for those of us who run services, as nothing is done only once; we keep doing the same thing over and over again. If you attempt the same request several times and the remote service does not respond, it becomes necessary to adopt a smarter strategy. Instead of continuing to beat a dead horse, you might consider reaching out to another service or, alternatively, decide to give up altogether. In such cases, you could return a 404 or a 500 error and simply accept that the task cannot be completed at that moment.
In distributed systems, we utilize timeouts, retries, and sometimes take larger-scale actions as mechanisms to cope with failure. Anyone who has worked on large-scale distributed systems understands that while creating a prototype can be accomplished quickly, bringing that system to production involves a significant amount of work. The complexity of distributed systems is often found in handling failure cases. A common saying reflects this reality: "The first 90% of the project takes 90% of the time, and the last 10% of the project also takes 90% of the time." This highlights the challenges in resolving edge cases and ensuring robust failure handling.
To provide a clearer understanding, let's outline a typical failure handling model. You initiate a request to a remote service, set a timer, and plan to retry the request because you intuitively know that a second attempt may succeed, thus improving your overall success rate. If the result remains unsatisfactory, you may choose to take additional actions to prevent future failures.
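To make this model concrete, here is a minimal sketch in Java of the request, timer, and retry loop described above; the callRemote helper, the 50 ms per-attempt timeout, and the two-attempt limit are placeholders, not anyone's actual settings.

```java
import java.util.concurrent.*;

public class RetryModelSketch {
    // Hypothetical remote call; stands in for any RPC client.
    static String callRemote(String request) throws Exception {
        // ... network I/O would happen here ...
        return "ok";
    }

    static String callWithTimeoutAndRetry(String request) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            int maxAttempts = 2;      // original try plus one retry (placeholder)
            long timeoutMillis = 50;  // per-attempt timer (placeholder)
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<String> f = pool.submit(() -> callRemote(request));
                try {
                    return f.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // The attempt is labeled a failure, but the remote work may still finish.
                    f.cancel(true);
                }
            }
            // All attempts timed out: give up deterministically instead of retrying forever.
            throw new Exception("request failed after " + maxAttempts + " attempts");
        } finally {
            pool.shutdownNow();
        }
    }
}
```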
When characterizing these mechanisms, a few key points stand out. First, due to the inherent uncertainty in communication over a link, a timeout is merely an approximation. While many people understand this concept theoretically, it is often overlooked in practice. When a timeout occurs, it is labeled a failure, but it may not necessarily indicate a true failure; it could simply be that the timeout period was too short or that an unexpected issue arose.
Regarding retries, it is important to recognize that each retry represents an effort, meaning your service is actively working. This work may not have been budgeted for, especially if retries are not typically needed. Additionally, retries can change the state of the system, which is a less obvious but crucial point. When a system is in a retry state, the characteristics of that system may differ from its normal state, potentially leading to different behaviors than what was previously observed.
Finally, failure prevention can be viewed as an escalation. If you make a good decision, it positively impacts the larger system. Conversely, if a poor decision is made, there is a risk of propagating that bad decision throughout the entire system. It can go either way.
In complex systems, what works in theory often leads to chaos in reality; the deeper the stack, the more unpredictable the outcomes.
What we have observed, at least in our production environment, is that mechanisms that work intuitively in our minds or during small local tests often lead to catastrophic outcomes, as exemplified by the case of the Oscar tweet. So, why does this happen?
First, let's examine timeouts. In this illustration, a timeout is depicted as something happening in the middle, which is how many people perceive it: as communication being interrupted between two services. However, one common misconception is that a timeout informs us about the communication to remote services. This assumption requires a deeper look at what a modern service entails.
A modern service rarely runs on bare metal by itself; instead, it operates on a very deep stack. At the top, you have your service application logic, which may utilize several libraries. This runs on a virtual machine—such as a JVM or other VM language VMs—then on the operating system, followed by the kernel, and finally on physical hardware. Underneath that, you have your shared network infrastructure in the data center.
If we scale this out to examine the local service you care about and the service you are communicating with, the construct is often symmetrical. Both the local and remote sides have a very deep stack. For the purposes of this discussion, we don't care so much about the remote side. If the remote stack is compromised, I would simply consider the remote service to be faulty and take appropriate action.
The more interesting aspect here is that your service operates on this very deep stack, and there is little you can do about it. You cannot simply decide, "I don't like the kernel that I'm currently running on; it's doing something stupid, so I will swap it out for a better one." While I have heard of failure prevention mechanisms where a service kills itself in hopes of moving to a different environment, this is not a common practice. Most people do not adopt this approach.
Essentially, this is the family you are born into, and you can never escape it. When a request is made, it has to traverse this deep stack to reach the remote service and then return. If you have a timer, chances are that your timer is at the very top, measuring the entire trip—not just the time spent on the network or the time spent on the remote service.
There are numerous points where things could go wrong in that stack. For instance, in the application itself, resources can be contended: a queue with a shared lock, or head-of-line blocking if things are processed in order. If you use libraries, do you ever call malloc? Malloc has a nondeterministic completion time; it can be very fast if it is served from memory the allocator already holds, but it can also be very slow if it has to go all the way to the kernel to get a new page and perform paging.
In a multi-tenant environment, your service isn't just competing for resources; it's fighting for survival against hidden complexities and unpredictable neighbors.
When it comes to the kernel, it does not only support your application; it also needs to handle interrupts and I/O operations, all of which are completely out of your control and visibility. Additionally, in the realm of hardware, there can be blocking I/O, network congestion, and other unpredictable factors. I recall an incident from years ago where the firmware on a server we were using caused all the CPUs to stop without any good reason. As a result, the server appeared frozen for several seconds, and the kernel was unable to provide any information because it was entirely halted due to the firmware's actions beneath it.
These complexities are often hidden from modern developers. Furthermore, the network can introduce issues such as packet loss, deep queue overflows, and significant delays. The deep stack of interactions can complicate matters even further, particularly in a multi-tenant environment. This is especially true if you are operating within a container that shares hardware with multiple other containers. In such cases, you have no visibility into what others are doing, yet all these services still share the underlying kernel and hardware. Unless you are operating within your own virtual machine, the protections between these services are not as robust as one might assume.
In a multi-tenant environment, it is commonly believed that your service is the good one, while others may be potentially harmful. The reality is that other services can "steal" your CPU cycles. You might think, "No, it's a container; I have assurances that these are the two CPUs and this is the memory that belongs to me." While there are policies in place to enforce these assurances, the enforcement is not as reliable as it seems. For example, both Kubernetes and Mesos utilize cgroups to enforce CPU quotas. In theory, you are supposed to receive the fair share that you declared; however, cgroup CPU policies are enforced by the scheduler every 100 milliseconds. During this time, another process can swoop in and consume all the CPU cycles, leaving you at a disadvantage.
Although the offending process may be penalized and pushed into a lower priority, the damage is done during those 100 milliseconds when your service is not running, and another process is stealing your cycles. In a data center, 100 milliseconds is an incredibly long amount of time. The typical end-to-end round trip time within a single data center is well under one millisecond, often around 400 to 500 microseconds. This means that in ideal conditions, you could send 200 requests back and forth before you are freed from the constraints imposed by cgroups.
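To put that 100-millisecond window in perspective, here is a quick back-of-the-envelope calculation using the round-trip figure from the talk:

```java
public class ThrottleMath {
    public static void main(String[] args) {
        double cfsPeriodMicros = 100_000;  // cgroup CPU quota is enforced per 100 ms period
        double typicalRttMicros = 500;     // in-data-center round trip, roughly 400-500 microseconds
        // Number of round trips you could have completed while throttled for one period.
        System.out.println("Round trips lost per throttle period: "
                + (long) (cfsPeriodMicros / typicalRttMicros));  // ~200
    }
}
```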
Moreover, multi-tenancy exacerbates these issues, leading to what I refer to as a timeout cascade. This phenomenon occurs when you have dependencies that extend beyond a single step. For instance, when the first RPC call goes from service A to service B, it will inevitably wait in some queue, which could include network queues or TCP socket buffers, where the time spent is essentially unaccounted for. Following this, you may need to perform some preparatory computation to handle the request, which could lead to another RPC request for a second dependency. In real-life scenarios, the time spent in these queues can accumulate significantly, leading to cascading delays across dependent services.
Timeouts in distributed systems often mislead us about service health; they don't equal causality. To truly understand issues, we need to look beyond the surface and find the real signals.
To clarify the situation regarding timeouts, it is important to understand that timeouts often do not indicate the health of a remote service, despite common assumptions. The optimal timeout is a moving target, as changes in the environment can disrupt the predictability of response times. In shared environments, which are often cost-efficient, various aspects can complicate the predictability of response times, leading to gaps in understanding where time is spent. These gaps can create black holes in the timeline of service interactions.
Consider a practical exercise involving a chain of dependencies, labeled A, B, and C. If requests from A are timing out while those from B are not, and you are sitting in the control tower, whom do you page? In this case, C is clearly returning responses to B, so either A or B could be at fault; the control tower would page both A and B to investigate.
Now, if both A and B are timing out, the situation becomes more complex. Analyzing just B and C leads to the conclusion that the fault could lie with either B or C, or both. When A is added to the equation, the combinations of potential faults multiply, leading to confusion. At this point, the person in charge might feel compelled to page everyone involved, resulting in a chaotic situation where the root cause is difficult to ascertain.
Modern application architecture often resembles a mesh of dependencies, which is the promise of microservice architecture. While this allows for independent evolution of services, it can also lead to a convoluted web of interactions. An example of this complexity can be seen in a dependency chart from Twitter, which illustrates how intricate these relationships can become. Compared to this, traditional spaghetti code may seem less daunting.
In distributed systems, remember: correlation does not equal causation. Every retry and timeout can create unexpected chaos, turning a healthy service into a bottleneck.
If you go look in the archive, this chart was in her slides as well. What does it mean? It means that a lot of people have had this drilled into them over a number of years: correlation does not equal causation. For distributed systems engineers, I think it's important to repeat this to yourself every time something like this happens: a timeout does not establish causality. If you want to figure out what causes what, you really need to find the signal somewhere else.
Now, let's talk about retries. Unfortunately, it doesn't really get better; it's going to get worse. We still have our mental model, and the intuitive understanding is that retries aim to improve the success rate of my service: the first time it failed, the second time it may not. This is true if the failures are entirely independent and you start from the same place, but neither of those two assumptions is necessarily true in your production system.
What ends up happening is that there are a few things to consider. First, we said a retry is extra work. Clearly, the sender needs to do twice as much work, if not more, to perform the same request twice. And if the communication broke down on the way back, now both sides need to do twice as much work.
Secondly, retry does not equal replay. It's not the same thing twice. Unlike the simple illustration, you may actually change the behavior. For example, when you have an RPC timeout, you might say, "I'm going to close the connection because I have no idea at this point what data is in the pipe." So, you’re starting with a clean slate, which means reopening a new connection, and opening your connection is actually work. This is not the same as just sending a request over an existing connection.
The third point is that the state has changed. You have a new connection, a new object, and guess what? You have more garbage if you are running on top of the JVM. Additionally, there are hidden retries lower in the stack, such as TCP retransmissions, which nobody ever thinks about. You might think you're doing one retry at the top, but it could be three packets going out on the wire. All these things happen without anyone noticing.
Moreover, retry is not as simple a decision as timeouts. With timeouts, you have one number; that’s pretty much it. For retries, you need to consider how many times to retry, how to set the timeout for subsequent retries, how much to wait between retries, and how to write the retry logic exactly because it can be different. You also need to decide where to go, as you may not go to the original service. All these are fiddly tuning knobs that you need to be aware of.
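As a rough illustration of how many knobs this adds up to, here is a hypothetical retry policy object that simply names the parameters listed above; it is not any particular RPC library's API, and the field names are made up.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical retry policy, just to make the tuning knobs from the talk explicit.
public class RetryPolicySketch {
    int maxAttempts;                 // how many times to try in total
    Duration perAttemptTimeout;      // how to set the timeout for subsequent attempts
    Duration backoffBase;            // how long to wait between attempts
    boolean exponentialBackoff;      // how exactly the retry logic behaves
    List<String> fallbackEndpoints;  // where to go: not necessarily the original service

    RetryPolicySketch(int maxAttempts, Duration perAttemptTimeout,
                      Duration backoffBase, boolean exponentialBackoff,
                      List<String> fallbackEndpoints) {
        this.maxAttempts = maxAttempts;
        this.perAttemptTimeout = perAttemptTimeout;
        this.backoffBase = backoffBase;
        this.exponentialBackoff = exponentialBackoff;
        this.fallbackEndpoints = fallbackEndpoints;
    }
}
```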
Finally, because retries are extra work, they can create a positive feedback loop where you are adding load to an already overloaded system. This can change the system behavior and the system state, so you're looking at an entirely different service from the one you are familiar with in steady state. Configuring retries in a sophisticated way therefore means dealing with a lot of parameters.
If you have chained dependencies, here's another fun sort of exercise. If C is slow, it’s straightforward: B would time out and retry. If you don't set your timeout correctly, B timing out could actually result in A timing out as well. So, A decides to retry, and now you suddenly have four times as much traffic to a service that is overloaded. If B is slow, things can look strikingly similar to if C is slow. B is slow, and therefore B times out. B thinks, "Oh, C must be bad," so it does twice as much work trying to send more across to C. Of course, because B is slow, A is going to time out, and you still end up with four times as much traffic to C.
Thus, an otherwise healthy service can be dragged into oblivion simply because something that depends on it is slow.
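The four-times figure is just the retry multipliers compounding across layers, as this tiny worked example shows (one retry per layer is assumed):

```java
public class RetryAmplification {
    public static void main(String[] args) {
        int retriesAtA = 1;  // A retries once when B appears slow
        int retriesAtB = 1;  // B retries once when C appears slow
        int multiplier = (1 + retriesAtA) * (1 + retriesAtB);
        System.out.println("Traffic reaching C relative to steady state: " + multiplier + "x");
    }
}
```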
Timeouts can create a perfect storm of cascading failures, dragging healthy services into chaos if not managed carefully.
Now we can talk about the famous incident. I think it was interesting because it was essentially a perfect storm that facilitated this kind of positive feedback loop. Let's look at what those conditions were. First, when we had timeouts, at the time we had decided to tear down the connection and start with a clean slate. Second, we had a large number of clients; you can imagine the service responsible for tweets would be one of the more prominent services at Twitter. Each of these clients has multiple connections, with a variable number of connections to the cache because the load can go up and down. And we have a full connectivity mesh because we don't want a middleman; it's cheaper to run that way. So every front end talks to every cache.
Finally, we had already moved to container-based deployment, so everything ran in containers subject to the cgroup rules I talked about earlier. The event unfolded as follows: Ellen successfully introduced a hotspot on one of the cache backends, which, of course, made that cache backend a little slower. Consequently, timeouts occurred on the clients for that particular tweet. In response, they cleaned up their connections, leading to an initial connect storm. They still wanted the tweet, so they reconnected and resent requests. Now you have a lot of outstanding connection requests to the same host, and connecting turns out to be more expensive than the actual requests themselves.
This situation led to even more timeouts. Because we have a full mesh, the timeouts affected every single front-end server trying to communicate with the cache. As more clients did the same thing, we ended up with a bigger storm, and in the end, the network was saturated with just TCP SYN packets. That's when we declared bankruptcy.
In a service mesh, the top-level retries have a lot of power; they can trigger a lot of work deeper down the stack. The lower you are in the stack, the bigger the multiplier that could potentially occur if the timeouts tend to compound. It's very hard to predict where the original bottleneck is because things are connected in such a way that either side of the link could be the cause, and the cause can propagate to other parts. Thus, reasoning about it is really hard.
After all that, we still want to do more; we want to prevent failures from affecting future requests. I won't spend too much time here because how people do failure prevention is highly service-specific. The way you prevent future failures for a stateful service would be very different from a stateless service. What tends to happen is that you have a bit of information, and you try to make a decision on behalf of the entire system with this information. There's always the possibility of propagating what is a local failure to global if you only act on local information. You can make what would otherwise be a transient failure severe enough to promote it to a global failure, making things worse for a longer time.
If you make failure prevention decisions on local nodes individually, they may end up in different places, causing inconsistency in the entire system. Therefore, none of the intuitive understandings of the failure coping mechanisms are actually true when you look closely.
Relying solely on local information can turn small failures into global disasters; to truly understand and prevent issues, we need a broader perspective and multiple feedback loops.
This situation arises fundamentally from the limitations of having very little input, as well as having your logic dominated by a single feedback loop that is prone to becoming a positive feedback loop. If you are familiar with systems theory, it is evident that if there is only one loop in your system, that loop can sometimes dominate behavior. Equilibrium can only be reached when multiple loops are present, constraining each other. Therefore, the goal should be to bring in more information and incorporate different mechanisms into the system.
To begin, it is essential to establish what I would call the baseline. Understanding the reality of running services in data centers is vital. Most people are familiar with the experience of a machine or rack going down. However, what is often overlooked are the events that occur just as frequently, or even more often, than machine failures. For instance, garbage collection (GC) likely occurs much more frequently than service outages. Have you ever characterized the behavior of GC? How often does it cause your system to freeze or hold up? How frequently do kernel background tasks run, and do they happen more or less than hard failures? Many of these events occur far more often, yet they are not given the attention they deserve. These commonly known failures serve as great baselines for consideration.
Once there is a common understanding of what constitutes common events, it lays a solid foundation for building effective timeouts. The primary goal of good timeouts is to minimize the chance of having false positives. This requires examining the behavior of the library, the kernel, and the hardware to understand how long various anomalies take. Recognizing that one cannot avoid engaging with the layers they depend on is crucial; therefore, timeouts should account for these interactions. For example, you should not penalize a remote server if you know that your local allocation will take around 20 milliseconds; you should be able to tolerate that kind of behavior if it is local.
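As a sketch of what "accounting for the layers you depend on" can look like, the snippet below adds up known local stalls to derive a timeout floor; all of the figures are placeholders that you would replace with measurements of your own stack.

```java
import java.time.Duration;

public class TimeoutBudgetSketch {
    public static void main(String[] args) {
        // Placeholder figures for known local stalls; measure your own stack.
        Duration worstLocalAlloc = Duration.ofMillis(20);   // e.g. a slow allocation or page fault
        Duration worstGcPause    = Duration.ofMillis(100);  // assumed stop-the-world GC budget
        Duration cgroupThrottle  = Duration.ofMillis(100);  // CFS enforcement period
        Duration typicalRtt      = Duration.ofMillis(1);    // in-DC round trip, rounded up

        // A timeout below this floor will regularly blame the remote service
        // for stalls that happened entirely on the local stack.
        Duration floor = worstLocalAlloc.plus(worstGcPause).plus(cgroupThrottle).plus(typicalRtt);
        System.out.println("Do not set the request timeout below ~" + floor.toMillis() + " ms");
    }
}
```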
Additionally, it is beneficial to pull in other information to gain a clearer understanding of what is truly happening. Many services simply state that something has happened and the request timed out, leading to a timeout exception. However, it is worth considering why it is not framed as, "Oh, I just GC'd for 100 milliseconds, therefore things are slow," or "Oh, malloc actually returned super slowly this time." A timeout exception is not particularly informative. Often, there is readily available information that can help characterize the nature of the failure better. This point was also emphasized in a talk earlier in the day.
Understanding the nature of failures in microservices is crucial; visibility comes at a cost, so prioritize tracing and thoughtful configuration over blindly copying others.
Understanding what is going on is actually very powerful. Whenever possible, trace the life cycle of a request. It is important to establish the timeline of when a request goes out and when it comes in. This timeline can definitively tell you the causal relationship between all these events. However, if you are using microservices, I have bad news for you: microservices make this type of reasoning much harder. While a monolith can be unwieldy and slow to develop, there are many tools readily available to reason about events happening inside a single monolith. Once you leave the service boundary, you often leave the user space or at least the host, which complicates reasoning about these events.
The visibility in a microservice architecture comes at a much higher cost; you need to put in extra effort to gain that visibility. This aspect of microservice architecture is often overlooked, creating a significant blind spot. If you want to improve retries, I suggest starting from a point of not copying someone else's configuration. This happens surprisingly often. For example, if my favorite service uses three retries, I might think they must have a good reason for it and decide to use three retries as well. However, it is crucial to understand why they use three retries and whether I am talking to the same service or sending the same amount of traffic. This understanding is very close to the heart of distributed systems and their behavior.
Configuration is important and should not be treated as a mere task to complete the day before launching. Instead, it is better to be more conservative rather than aggressive. This might seem counterintuitive because we often think that retrying more increases the chances of success. In reality, it is better to do the least amount of retrying necessary to achieve the same success rate. This approach prevents feeding ammunition into a positive feedback loop too much.
Additionally, if there are rules that are not based on timeouts, consider implementing rules based on the state of the system. For instance, you might ask whether you are going into garbage collection soon, holding too many outstanding requests, or retrying too much in the recent past. These states can help determine the health of the system and serve as brakes to prevent going over the edge.
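A minimal sketch of such state-based rules, with hypothetical thresholds, might look like this:

```java
public class RetryGateSketch {
    // Hypothetical local state; the thresholds are placeholders.
    private int outstandingRequests = 0;   // maintained by the request path (not shown)
    private int retriesInLastWindow = 0;

    private static final int MAX_OUTSTANDING = 100;
    private static final int RETRY_BUDGET_PER_WINDOW = 20;

    /** Decide whether a retry is allowed based on local state, not just the timeout. */
    public synchronized boolean mayRetry(boolean gcImminent) {
        if (gcImminent) return false;                                      // about to pause anyway
        if (outstandingRequests > MAX_OUTSTANDING) return false;           // already holding too much
        if (retriesInLastWindow >= RETRY_BUDGET_PER_WINDOW) return false;  // retried too much recently
        retriesInLastWindow++;
        return true;
    }
}
```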
If you have dependencies, it is beneficial to maintain a global view of them. Start your budget from the top, even if there is an accountability gap where you are unsure how time is spent. Having a budget and a breakdown by dependencies is better than not having that information at all. Furthermore, if you can apply back pressure, definitely do it. Who knows better about a service that is under load than the service itself? If your remote service indicates that it is under load, it is likely true. This information is highly valuable, yet I do not see it incorporated much in modern protocol design. A significant percentage of remote procedure protocols lack built-in back pressure mechanisms. Therefore, the next time you design a protocol, consider adding back pressure as one of the default features, as it may save you from potential issues.
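As a sketch of what a baked-in back pressure mechanism could look like, the response below carries the server's own load signal and the client honors it; the field names are made up for illustration, not taken from any existing protocol.

```java
// Sketch only: the server reports its own view of its load in the response,
// and the client treats that signal as authoritative instead of retrying blindly.
public class BackPressureSketch {
    static class Response {
        byte[] payload;
        boolean overloaded;     // server says: I am under load
        long retryAfterMillis;  // server's hint on when it is worth trying again
    }

    static boolean shouldSendMore(Response lastResponse) {
        // The remote service knows its own load better than any client-side guess,
        // so back off when it says it is overloaded.
        return lastResponse == null || !lastResponse.overloaded;
    }
}
```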
Design your systems with back pressure in mind; it’s the key to scaling effectively and avoiding catastrophic failures.
In terms of failure prevention, again, there's nothing too specific, but the two things that worked out well in my experience have been: if you want to make a global decision, try not to act on local information only. We definitely made this mistake in our own experience, and it often escalates in the wrong way. The other thing is to suppress the urge of thinking, "Oh, I know something is wrong; I'm going to fix it for everybody." Just wait on it a little longer. You may ask, "My system is suffering while I wait; what do I do?" What could happen is you have a local mechanism that's fast in reacting, but it is short-lived in its effect. You can have another global mechanism that is taking bigger strides but will take much longer and therefore be much surer about the global state of the service. Using a combination of these two generally will give you pretty good reliable results.
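Here is a rough sketch of combining a fast, short-lived local mechanism with a slower, better-informed global one; the durations and names are placeholders.

```java
import java.time.Duration;

public class TwoLoopSketch {
    // Local loop: reacts within milliseconds, but its effect expires quickly,
    // so a wrong local guess cannot poison the whole system for long.
    static final Duration LOCAL_EJECTION_TTL = Duration.ofSeconds(5);

    // Global loop: a third-party observer aggregates signals from many nodes
    // and takes bigger, surer actions on a much slower cadence.
    static final Duration GLOBAL_DECISION_INTERVAL = Duration.ofMinutes(1);

    boolean locallyEjected(long ejectedAtMillis, long nowMillis) {
        return nowMillis - ejectedAtMillis < LOCAL_EJECTION_TTL.toMillis();
    }
}
```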
Finally, nothing can replace the value of testing. For the first point, I think everybody knows you need to test everything, including behavior under load and common failures. The last point I initially thought of as testing catastrophic scenarios, but then I realized that introducing incidents into your services on purpose is not always wise; there's a good chance your service may actually fail. So, one interesting thing to test is whether you really need a particular retry policy or timeout to achieve a specific service level objective. Can you achieve the same with more relaxed timeouts and retries? That approach gives much less fuel to the positive feedback loop that can bite you in an incident.
To give you an idea of what we are doing, the service that I own has evolved over the years. We put all these insights and experiences into what we call a configuration guide for cache, specifically for one type of client, which is fCache, the one that Twitter uses. The guide alone is 2,000 words, not including all the sample code, which tells you that a lot of work is required to do this type of configuration properly. Our timeouts are also fairly nuanced. We tested that we can achieve 5 milliseconds at p999, but if you look at the timeouts, they are super lenient, on the order of hundreds of milliseconds. Why? Because all our JVM applications can go into garbage collection (GC) for that long, and the cgroup rules are enforced on a 100-millisecond interval. Therefore, it doesn't make any sense to set anything less than 100 milliseconds.
This is something that's quite counterintuitive to all the people using cache because they expect what's on the right, but actually, they need to do what's on the left. We have different retry policies for read and write operations because the semantics are different, and the purpose is different. We decided that we can protect the system better if we treat them differently, and they have different backoff strategies. Additionally, there is an overall retry budget to break the circuit when we are retrying too much.
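The sketch below mirrors the shape of such a configuration; the field names and values are illustrative placeholders, not the actual settings from the Twitter guide.

```java
import java.time.Duration;

// Illustrative shape of a cache client configuration as described in the talk.
public class CacheClientConfigSketch {
    // Lenient per-request timeout: well above GC pauses and the 100 ms cgroup window,
    // even though the measured tail latency is only a few milliseconds.
    Duration requestTimeout = Duration.ofMillis(300);

    // Reads and writes have different semantics and purposes, so they get different policies.
    int readMaxAttempts = 3;
    Duration readBackoff = Duration.ofMillis(10);          // short, constant backoff (placeholder)

    int writeMaxAttempts = 2;
    Duration writeInitialBackoff = Duration.ofMillis(50);  // longer, growing backoff (placeholder)

    // Overall retry budget: break the circuit when the client is retrying too much.
    double maxRetryFractionOfTraffic = 0.1;
}
```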
We have two levels of failure prevention: the local rule engages quickly but its effect only lasts a short amount of time, while a global decision is made by a third-party observer elsewhere in the system and doesn't kick in until after a minute. For something as simple as cache, if you want it to work reliably, you actually need a fairly sophisticated configuration.
I want to close by briefly mentioning something else we did that is not quite along the lines of configuring timeouts and retries. At some point, we realized that properly configuring timeouts and retries is just too hard. This is often an indicator that you need to do something to reduce the variance and improve the reliability of your system. For example, with cache, we changed the application in several ways to remove some of the contentions. We removed the blocking cache, and at the system level, we did a lot of low-level tuning to cap and limit the number of connections. We are not on a shared host because shared hosting simply was not performing well enough for us, and we even set CPU affinities to ensure that the behavior of the kernel is much more deterministic.
So, while timeouts and retries can get you very far, when that becomes too hard, you have more interesting work to do to make the system behave in a more predictable way. In conclusion, we know theoretically that perfection is not possible, and in practice, as engineers, we need to internalize that it's okay to fail sometimes. It's actually better to give up and fail in a deterministic way than to keep retrying until we end up in catastrophe. If you understand the common behavior of your system, including digging down the stack that’s not owned by you but where you depend on, you can have a bigger picture about the health and state of the service and act accordingly.
Introducing a second feedback loop to counteract the primary one that everyone understands will likely put you in a state where you are happy most of the time. That's what happened with cache; we haven't had a single incident in the past year, and knock on wood, hopefully that won't change anytime soon. Can it fail? Yes, it can, but compared to a few years ago, we're way better off. This improvement is the result of all the exercises we did around configuration and understanding what's going on in our system.
Thank you for listening. If you have any questions, I think we have five minutes.