Code4IT

The place for .NET enthusiasts, Azure lovers, and backend developers

Harnessing the Power of Jitter: Enhancing Retry Patterns with a bit of randomness

2025-05-06 8 min read Code and Architecture Notes

Operations may fail for transient reasons. How can you implement retry patterns? And how can a simple Jitter help you stabilize the system?


When building complex systems, you may encounter situations where you have to retry an operation several times before giving up due to transient errors.

How can you implement proper retry strategies? And how can a little thing called “Jitter” help avoid the so-called “Thundering Herd” problem?

Retry Patterns and their strategies

Retry patterns are strategies for retrying operations that failed due to transient, temporary errors, such as packet loss or a temporarily unavailable resource.

Suppose you have a database that can handle up to 3 requests per second (yay! so performant!).

Now, by pure coincidence, four clients try to execute an operation at the exact same instant. What happens?

Well, the DB becomes temporarily unavailable, and it won’t be able to serve those requests. So, since this issue occurred by chance, you just have to wait and retry.

How long should we wait before the next attempt?

You can imagine that the time between one attempt and the next follows a mathematical function, where the wait time (called Backoff) depends on the attempt number:

Backoff = f(RetryAttemptNumber)

With that in mind, we can think of two main retry strategies: linear backoff retries and exponential backoff retries.
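
To give a rough idea in code, the two strategies differ only in the function used to compute the backoff. Here is a minimal sketch, using the same constants as the full examples below:

// Linear backoff: the delay does not depend on the attempt number
Func<int, double> linearBackoff = attempt => 5.0;

// Exponential backoff: the delay doubles at every attempt
Func<int, double> exponentialBackoff = attempt => 2.0 * Math.Pow(2, attempt);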

Linear backoff retries

The simplest way to handle retries is with Linear backoff.

Let’s continue with the mathematical function analogy. In this case, the function we can use is a linear function.

We can simplify the idea by saying that, regardless of the attempt number, the delay between one retry and the next one stays constant.

Linear backoff

Let’s see an example in C#. Say that you have defined an operation that may fail randomly, stored in an Action instance. You can call the following RetryOperationWithLinearBackoff method to execute the operation passed as input with linear retries.

static void RetryOperationWithLinearBackoff(Action operation)
{
    int maxRetries = 5;
    double delayInSeconds = 5.0;

    for (int attempt = 0; attempt < maxRetries; attempt++)
    {
        try
        {
            // If the operation succeeds, stop retrying and return immediately
            operation();
            return;
        }
        catch (Exception e)
        {
            // The delay is constant: it does not depend on the attempt number
            Console.WriteLine($"Attempt {attempt + 1} failed ({e.Message}). Retrying in {delayInSeconds:F2} seconds...");
            Thread.Sleep(TimeSpan.FromSeconds(delayInSeconds));
        }
    }
}

The input operation will be retried up to 5 times, and every time it fails, the system waits 5 seconds before the next retry.
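
Just to see it in action, here is a minimal usage sketch, where the operation is a made-up Action that fails roughly half of the time:

Random faultInjector = new Random();

RetryOperationWithLinearBackoff(() =>
{
    // Simulate a transient failure about 50% of the time
    if (faultInjector.NextDouble() < 0.5)
        throw new InvalidOperationException("Transient failure, please retry!");

    Console.WriteLine("Operation completed successfully");
});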

Linear backoff is simple to implement, as you just saw. However, it falls short when the system is in a faulty state and takes a long time to get back to work: a fixed delay and a fixed maximum number of retries limit the timespan in which an operation can be retried, so you can end up exhausting your attempts while the downstream system is still recovering.

There may be better ways.

Exponential backoff retries

An alternative is to use Exponential Backoff.

With this approach, the backoff becomes longer after every attempt: usually, it doubles at every retry, which is why it is called “exponential” backoff.

This way, if the downstream system takes a long time to recover, the top-level operation has a better chance of being completed successfully.

Exponential Backoff

Of course, the downside of this approach is that getting a final answer from the operation (did it complete? did it fail?) may take much longer, depending on the number of retries. The top-level operation may even time out while waiting to access the resource, because the retries become increasingly spread out over time.

A simple implementation in C# would be something like this:

static void RetryOperationWithExponentialBackoff(Action operation)
{
    int maxRetries = 5;
    double baseDelayInSeconds = 2.0;

    for (int attempt = 0; attempt < maxRetries; attempt++)
    {
        try
        {
            // If the operation succeeds, stop retrying and return immediately
            operation();
            return;
        }
        catch (Exception e)
        {
            // The delay doubles at every attempt: base * 2^attempt
            double exponentialDelay = baseDelayInSeconds * Math.Pow(2, attempt);
            Console.WriteLine($"Attempt {attempt + 1} failed ({e.Message}). Retrying in {exponentialDelay:F2} seconds...");
            Thread.Sleep(TimeSpan.FromSeconds(exponentialDelay));
        }
    }
}

The key to understanding the exponential backoff is how the delay is calculated:

double exponentialDelay = baseDelayInSeconds * Math.Pow(2, attempt);
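
For instance, with baseDelayInSeconds set to 2.0, the computed waits are 2, 4, 8, 16, and 32 seconds across the five attempts: roughly one minute of waiting in total.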

Understanding the Thundering Herd problem

The “basic” versions of these retry patterns are effective in overcoming temporary service unavailability, but they can inadvertently cause a thundering herd problem. This occurs when multiple clients retry simultaneously, overwhelming the system with a surge of requests, potentially leading to further failures.

Suppose that a hypothetical downstream system becomes unavailable if 5 or more requests occur simultaneously.

What happens when five requests start at the exact same moment? They start, overwhelm the system, and they all fail.

Their retries will always be in sync, since the backoff is deterministic (yes, it can grow over time, but every client computes exactly the same value).

So, all five requests will wait for the same amount of time before the next retry. This means that they will always stay in sync.
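
For example, with the linear version above and its fixed 5-second delay, five clients that all fail at t = 0 will retry together at t = 5, then again at t = 10, and so on: the herd never breaks up.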

Let’s make it more clear with these simple diagrams, where each color represents a different client trying to perform the operation, and the number inside the star represents the attempt number.

In the case of linear backoff, all the requests are always in sync.

Multiple retries with linear backoff

The same happens when using exponential backoff: even if the backoff grows exponentially, all the requests stay in sync, making the system unstable.

Multiple retries with exponential backoff

What is Jitter?

Jitter refers to the introduction of randomness into timing mechanisms: the term was first used when talking about network communications, but it has since spread to other areas of system design.

Jitter helps mitigate the risk of synchronized retries that can lead to spikes in server load, by forcing clients that try to access a resource simultaneously to perform their retries with a slightly randomized delay.

In fact, by randomizing the delay intervals between retries, jitter ensures that retries are spread out over time, reducing the likelihood of overwhelming a service.

Benefits of Jitter in Distributed Systems

This is where Jitter comes in handy: it adds a random offset around the moment a retry should happen, to minimize the number of retries that happen in sync.

Exponential Backoff with Jitter

Jitter introduces randomness to the delay intervals between retries. By staggering these retries, jitter helps distribute the load more evenly over time.

This reduces the risk of server overload and allows backend systems to recover and process requests efficiently. Implementing jitter can transform a simple retry mechanism into a robust strategy that enhances system reliability and performance.

Incorporating jitter into your system design offers several advantages:

  • Reduced Load Spikes: By spreading out retries, Jitter minimizes sudden surges in traffic and prevents clients from retrying at the same time, avoiding server overload.
  • Enhanced System Stability: With less synchronized activity, systems remain more stable, even during peak usage times.
  • Improved Resource Utilization: Staggered retries keep the load on servers more consistent, so requests are processed more evenly and resources are used more efficiently.
  • Greater Resilience: Systems become more resilient to transient errors and network fluctuations, reducing the likelihood of cascading failures.

Let’s now revisit the retry methods we defined before, adding some Jitter to each of them.

static void RetryOperationWithLinearBackoffAndJitter(Action operation)
{
    int maxRetries = 5;
    double baseDelayInSeconds = 5.0;

    Random random = new Random();

    for (int attempt = 0; attempt < maxRetries; attempt++)
    {
        try
        {
            operation();
            return;
        }
        catch (Exception e)
        {
            // Random jitter between -2 and +2 seconds, so every client waits a slightly different time
            double jitter = random.NextDouble() * 4 - 2;
            double delay = baseDelayInSeconds + jitter;
            Console.WriteLine($"Attempt {attempt + 1} failed ({e.Message}). Retrying in {delay:F2} seconds...");
            Thread.Sleep(TimeSpan.FromSeconds(delay));
        }
    }
}

And, for Exponential Backoff,

static void RetryOperationWithExponentialBackoffAndJitter(Action operation)
{
    int maxRetries = 5;
    double baseDelayInSeconds = 2.0;

    Random random = new Random();

    for (int attempt = 0; attempt < maxRetries; attempt++)
    {
        try
        {
            operation();
            return;
        }
        catch (Exception e)
        {
            // Exponential backoff plus a random jitter of up to half the computed delay
            double exponentialDelay = baseDelayInSeconds * Math.Pow(2, attempt);
            double jitter = random.NextDouble() * (exponentialDelay / 2);
            double delay = exponentialDelay + jitter;
            Console.WriteLine($"Attempt {attempt + 1} failed ({e.Message}). Retrying in {delay:F2} seconds...");
            Thread.Sleep(TimeSpan.FromSeconds(delay));
        }
    }
}

In both cases, the key is in how the delay variable is created: a random value (the Jitter) is added to the base delay.

Notice that, as in the linear example, the Jitter can also be a negative value!
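
To make the effect more concrete, here is a small sketch (a hypothetical helper, reusing the same constants and formulas as RetryOperationWithExponentialBackoffAndJitter) that prints the interval each retry delay can fall into:

static void PrintExponentialJitterRanges()
{
    int maxRetries = 5;
    double baseDelayInSeconds = 2.0;

    for (int attempt = 0; attempt < maxRetries; attempt++)
    {
        double exponentialDelay = baseDelayInSeconds * Math.Pow(2, attempt);
        double minDelay = exponentialDelay;                         // jitter = 0
        double maxDelay = exponentialDelay + exponentialDelay / 2;  // jitter = exponentialDelay / 2
        Console.WriteLine($"Attempt {attempt + 1}: wait between {minDelay:F1}s and {maxDelay:F1}s");
    }
}

Two clients that fail at the same moment will now almost certainly pick different delays inside those intervals, which is exactly what breaks the synchronization.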

Further readings

Retry patterns and Jitter make your system more robust, but if badly implemented, they can make your code a mess. So, a question arises: should you focus on improving performance or on writing cleaner code?

🔗 Code opinion: performance or clean code? | Code4IT

This article first appeared on Code4IT 🐧

Clearly, if the downstream system cannot handle too many requests, you may need to implement a way to limit the number of incoming requests within a timeframe. You can choose between 4 well-known algorithms to implement Rate Limiting.

🔗 4 algorithms to implement Rate Limiting, with comparison | Code4IT

Wrapping up

While adding jitter may seem like a minor tweak, its impact on distributed systems can be significant. By introducing randomness into retry patterns, jitter helps create a more balanced, efficient, and robust system.

As we continue to build and scale our systems, incorporating jitter is a best practice that can prevent cascading failures and optimize performance. All in all, a little randomness can be just what your system needs to thrive.

I hope you enjoyed this article! Let’s keep in touch on LinkedIn, Twitter or BlueSky! 🤜🤛

Happy coding!

🐧

About the author

Davide Bellone is a Principal Backend Developer with more than 10 years of professional experience with Microsoft platforms and frameworks.

He loves learning new things and sharing these learnings with others: that’s why he writes on this blog and is involved as a speaker at tech conferences.

He's a Microsoft MVP 🏆, conference speaker (here's his Sessionize Profile) and content creator on LinkedIn.