Fsync and k8s PVC snapshots
Introduction
My last post was about waiting for k8s to finish taking the PVC snapshot before using it for restoration. Well, like most things in software development, the restore didn’t work.
Why?
Why not though? The StatefulSet is only updated to restore from the volume snapshot once Kubernetes sets the readyToUse flag to true - so why is it still not ready?
After further debugging, it turned out that the snapshot itself was ready but the data inside the snapshot was not clean.
The application in question is a database server with multiple logical dataspaces running at any time. When a snapshot is taken the DBs are shut down, which drains all the ongoing transactions and then writes out their latest state to disk. Once that is done, a PVC volume-level snapshot is triggered. The shutdown is done before the PVC snapshot to make sure there are no active mutations happening on disk when the snapshot is triggered - this gives us better consistency and correctness of the on-disk state in the snapshot.
Each of those dataspaces, when it writes out its latest state to disk, also writes a small metadata file called snapshot.json to its own data folder.
The pseudocode of the flow looks something like this:

for (db in Set<Databases>) {
    db.shutdown()        // drains in-flight transactions and writes the db’s latest state + snapshot.json to its data folder
}
triggerPvcSnapshot()     // the volume-level snapshot is only triggered once every db has been closed
waitForReadyToUse()      // covered in the last post
Once the PVC snapshot is ready it is used to create the next pod’s PVC, but the DB would fail to start. Why though!?
The snapshot should be an exact copy of what was on disk at the time it was triggered - so why is the new pod unable to restore and start up?
Even in this world of cloud and kubernetes I did what we did 10 years ago - ssh’d into the server to poke around. Going through the data folders of the database and tracing the server’s start-up exception logs led me to the actual problem. The snapshot.json file was empty. Why though?
I then went back to the snapshot trigger logic and double-checked whether the file is actually present on disk when the PVC snapshot is taken - and yup, it is there just fine. It has the data as expected and can be read back normally at the time the snapshot is triggered, but every time that snapshot is restored the file is present at the right location yet empty - 0 bytes. Why though??
The best part - all of this started happening at 4pm on a Friday, just as the team was gearing up to do a release to prod. Of course, right - such great timing.
So here I am on a Friday at 4pm with the biggest feature that is supposed to go out no longer working - under pressure, stressed, and frantically trying to debug the issue to figure out if I can patch it in time for the deploy deadline. Moments like these are what make software dev jobs both stressful and exhausting.
Sleep solves all
My initial gut reaction was to try out a sleep before and after the snapshot - maybe something in the down/up methods was messing up the file while the snapshot was being taken. A 2-second sleep was not enough, the file was still blank, but a 2-minute sleep worked - the file had the data as expected. Why though!?
The test was enough to give me a lead that eventually pointed me to the fix.
If a wait leads to the file being there, then the file originally wasn’t actually on disk, which is why it never made it into the snapshot. And the reason for that is something devs who work closely with the filesystem know all too well - the write wasn’t flushed out to disk, i.e. it was not fsync’d.
Don’t forget to fsync
When a file write call is made to the operating system, the data isn’t immediately written out to permanent storage. The write call returns once the OS has buffered the changes, even if it hasn’t actually persisted them to non-volatile storage. Eventually the OS will flush the data out to disk on its own. We can force that flush by calling fsync ourselves.
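To make that concrete, here is a minimal Java sketch - not the app’s actual code, and the helper name writeAndFsync is mine - showing the two steps: write() returning only means the OS has the bytes, while force(true) is what asks the OS to push them down to the storage device.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

static void writeAndFsync(Path path, byte[] data) throws IOException {
    try (FileChannel channel = FileChannel.open(path,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.hasRemaining()) {
            channel.write(buf); // returns once the OS has buffered the bytes - not necessarily on disk yet
        }
        channel.force(true);    // fsync: block until the OS has written content and metadata to the device
    }
}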
The file was initially being written out using Apache Commons’ FileUtils.writeStringToFile - because it is easy and convenient. I later switched it out for Google Guava’s Files.write because it seemed to do a flush in the source code, but neither worked - at restore time the snapshot.json file was still empty. In hindsight that makes sense: a Java-level flush only moves the stream’s buffered bytes into the OS’s buffers, it still doesn’t force the OS to write them to disk.
I changed it to use FileOutputStream, which was the only way I knew of that gave me the ability to force both a flush and a sync.
The write of the snapshot.json file after the fix looked something like this:

File file = new File("./snapshot.json");
try (FileOutputStream fos = new FileOutputStream(file)) {
    fos.write(snapshotJson.getBytes(StandardCharsets.UTF_8)); // snapshotJson holds the metadata being persisted
    fos.flush();        // flush the stream’s own buffer down to the OS
    fos.getFD().sync(); // fsync: block until the OS has written the file’s data to disk
}
Changing the write method to use a FileOutputStream with a sync and force fixed the issue - the file had the contents as expected at restore time. With the correct snapshot file present on the new pod’s volume, the DB was able to start up and perform the restore procedure. It works!
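As an aside - my suggestion, not what the app ended up using - java.nio can express the same durability guarantee more compactly: StandardOpenOption.SYNC requires that every update to the file be written synchronously to the underlying storage device. A sketch (the helper name writeDurably is hypothetical):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

static void writeDurably(Path path, String content) throws IOException {
    Files.write(path, content.getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE,
            StandardOpenOption.TRUNCATE_EXISTING,
            StandardOpenOption.WRITE,
            StandardOpenOption.SYNC); // content and metadata are forced to the device before the write returns
}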
Conclusion
You know how they say “It’s always DNS” - in the DB world it should be “It’s always fsync”.
I was relieved that everything started working - I was able to fix the issue before the release and we didn’t have to delay the deploy.