Waiting for something

It works

13-Dec-2023

Introduction

Recently, I was working on a ticket where the application had to create a volume [disk] level snapshot. There is a pod running in Kubernetes with a PVC attached. The PVC is an abstraction over an AWS EBS volume, which is in turn a software layer on top of real SSD hardware. We want to take a snapshot of the PVC so that all the contents of the disk are backed up and can be used as a restore point later.

The goal is to take a snapshot of the PVC so that when the next pod comes up, it can reuse the pre-existing data from the snapshot as a restore mechanism and doesn’t have to replicate from transaction 0. The snapshot is used to stamp and flash the PVC that the new pod uses.

What was interesting about this ticket was that when a snapshot is triggered via the Kubernetes client, the call returns before the snapshot is actually ready to use. If I weren’t careful, I could end up using a snapshot that AWS hadn’t finished creating yet.

The snapshot name is used as a data source for the next pod - and if it [the snapshot] isn’t ready - then the pod creation fails, which is something that I wanted to avoid.

So the question becomes: how do you wait for something that is happening in the background and will eventually finish?

This post is not a comparison of the various approaches available to wait - it is meant to be a code example/template for how I use a polling approach to wait for something that will eventually finish before my method returns to the client.

Polling

Kubernetes provides a nice way to verify whether a snapshot is ready to use: the snapshot object’s status has a field called readyToUse, which is set to true once the snapshot can be used for a restore.

The easiest way would be to call the checkIfReady method in a while loop with a Thread.sleep in it to avoid hammering the k8s API, and to exit with an error if the response is still invalid after a certain time bound or attempt count.
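For reference, that naive loop could look roughly like the sketch below. This is not the code from the ticket: the waitUntilReady helper, its parameters, and the stand-in readiness check are all hypothetical, and the real check would call the Kubernetes API instead. It returns false when attempts run out, which the caller can turn into an error.

```java
import java.util.function.BooleanSupplier;

public class NaivePolling {

    /**
     * Calls check up to maxAttempts times, sleeping between attempts.
     * Returns true as soon as check reports ready, false if attempts run out
     * or the thread is interrupted while sleeping.
     */
    public static boolean waitUntilReady(BooleanSupplier check, int maxAttempts, long sleepMillis) {
        int attempt = 0;
        while (attempt < maxAttempts) {
            attempt++;
            if (check.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis); // avoid hammering the API
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the real readiness check: ready on the third call.
        int[] calls = {0};
        boolean ready = waitUntilReady(() -> ++calls[0] >= 3, 10, 50);
        System.out.println("ready=" + ready + " after " + calls[0] + " calls");
    }
}
```

Even in this small form, the loop counter, the sleep, and the interrupt handling crowd out the one line that matters, which is the readiness check itself.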

Whenever there is a Thread.sleep in a while loop - my spidey senses start tingling and I try to think of a better way to structure the code.

In this situation the refactor was to use the resiliency technique of retries. Polling is basically retrying: the request is repeated whenever the response is invalid or an exception is thrown, until the expected response is received.

Code example

The library I used for retrying was - https://github.com/rholder/guava-retrying

And the retryUntilSnapshotIsReadyToUse method looked something like this

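import com.github.rholder.retry.Retryer;
import com.github.rholder.retry.RetryerBuilder;
import com.github.rholder.retry.StopStrategies;
import com.github.rholder.retry.WaitStrategies;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

Retryer<Boolean> retryer = RetryerBuilder.<Boolean>newBuilder()
    .retryIfExceptionOfType(Exception.class) // also covers runtime exceptions
    .retryIfResult(ready -> !Boolean.TRUE.equals(ready)) // null or false means not ready yet
    .withWaitStrategy(WaitStrategies.exponentialWait(100, 5, TimeUnit.MINUTES))
    .withStopStrategy(StopStrategies.stopAfterAttempt(100))
    .build();

Callable<Boolean> callable = () -> volumeSnapshotClient.volumeSnapshots().withName(snapshotName)
    .get().getStatus().getReadyToUse();
// blocks until a valid response is received or an exit criterion is met;
// throws RetryException once the stop strategy gives up
retryer.call(callable);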

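Now the retryer.call method will block and poll the snapshot’s readyToUse status until a valid response of true is received or an exit criterion is met - which in this case is a maximum of 100 attempts, with the wait between calls increasing exponentially. Note that the 5 minutes passed to exponentialWait is the cap on each individual wait, not on the total time [a limit on the total time would be StopStrategies.stopAfterDelay].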
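To get a feel for what an exponential wait with a cap does to the schedule, here is a small standalone sketch. It assumes the delay after the n-th failed attempt is multiplier * 2^n milliseconds, capped at the maximum; that is my understanding of how guava-retrying computes it, not something taken from its documentation, and the BackoffSchedule class and delayMillis helper are hypothetical names used only for illustration.

```java
import java.util.concurrent.TimeUnit;

public class BackoffSchedule {

    /**
     * Delay in milliseconds before the next try, after the given failed
     * attempt (1-based), capped at maxMillis.
     * Assumed formula: multiplierMillis * 2^failedAttempt.
     */
    public static long delayMillis(long multiplierMillis, long maxMillis, int failedAttempt) {
        double uncapped = multiplierMillis * Math.pow(2, failedAttempt);
        return (long) Math.min(uncapped, (double) maxMillis);
    }

    public static void main(String[] args) {
        long max = TimeUnit.MINUTES.toMillis(5);
        long total = 0;
        // 100 attempts means at most 99 waits between them.
        for (int attempt = 1; attempt < 100; attempt++) {
            long delay = delayMillis(100, max, attempt);
            total += delay;
            if (attempt <= 13) {
                System.out.println("after attempt " + attempt + ": wait " + delay + " ms");
            }
        }
        System.out.println("worst-case total wait: "
                + TimeUnit.MILLISECONDS.toMinutes(total) + " minutes");
    }
}
```

Under this assumption the waits reach the 5 minute cap after about a dozen attempts, so 100 attempts can add up to several hours in the worst case. That is fine if the snapshot normally becomes ready within the first few polls, but worth knowing when picking the stop strategy.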

This pattern has come in handy quite a few times in various places - I hope it helps you as well.

Conclusion

I like code that expresses my intent instead of imperatively writing out all the logic line by line.