Fsync and k8s PVC snapshots
Introduction
My last post was about waiting for k8s to finish taking the PVC snapshot before using it for restoration. Well, like most things in software development, the restore didn’t work.
Why?
Why not though? The StatefulSet is only updated to restore from the volume snapshot once Kubernetes sets the readyToUse flag to true - so why is it still not ready?
After further debugging, it turned out that the snapshot itself was ready but the data inside the snapshot was not clean.
The application in question is a database server with multiple logical dataspaces running at any time. When a snapshot is taken the DBs are shut down, which drains all the ongoing transactions and then writes out their latest state to disk. Once that is done, a PVC volume-level snapshot is triggered. The shutdown is done before the PVC snapshot to make sure there are no active mutations happening on disk when the snapshot is triggered - this gives us better consistency and correctness of the on-disk state in the snapshot.
Each of those dataspaces, when it writes out its latest state to disk, also writes a small metadata file called snapshot.json to its own data folder.
The pseudocode of the flow looks something like this:

for (db in Set<Databases>) {
    db.shutdown()        // drains in-flight transactions and writes the db’s latest state + snapshot.json to its data folder
}
triggerPvcSnapshot()     // the volume-level snapshot is only triggered once every db has been closed
waitForReadyToUse()      // covered in the last post
Once the PVC snapshot is ready it is used to create the next pod’s PVC, but the DB would fail to start. Why though!?
The snapshot should be an exact copy of what was on disk at the time it was triggered - so why is the new pod unable to restore and start up?
Even in this world of cloud and kubernetes I did what we did 10 years ago - ssh’d into the server to poke around. Going through the data folders of the database and tracing the server’s start-up exception logs led me to the actual problem. The snapshot.json file was empty. Why though?
I then went back to the snapshot trigger logic and double-checked whether the file is actually present on disk when the PVC snapshot is taken - and yup, it is there just fine. It has the data as expected and can be read back normally at the time the snapshot is triggered, but every time that snapshot is restored the file is present at the right location yet empty - 0 bytes. Why though??
The best part - all of this started happening at 4pm on a Friday, just as the team was gearing up to do a release to prod. Of course, right - such great timing.
So here I am on a Friday at 4pm with the biggest feature that is supposed to go out no longer working - under pressure, stressed, and frantically trying to debug the issue to figure out if I can patch it in time for the deploy deadline. Moments like these are what make software dev jobs both stressful and exhausting.
Sleep solves all
My initial gut reaction was to try out a sleep before and after the snapshot - maybe something in the down/up methods was messing up the file while the snapshot was being taken. A 2-second sleep was not enough, the file was still blank, but a 2-minute sleep worked - the file had the data as expected. Why though!?
The test was enough to give me a lead that eventually pointed me to the fix.
If a wait leads to the file being there, then the file originally wasn’t actually on disk, which is why it never made it into the snapshot. And the reason for that is something devs who work closely with the filesystem know all too well - the write wasn’t flushed out to disk, i.e. it was not fsync’d.
Don’t forget to fsync
When a file write call is made to the operating system, the data isn’t immediately written out to permanent storage. The write call returns once the OS has buffered the changes, even if it hasn’t actually persisted them to non-volatile storage. Eventually the OS will flush the data out to disk on its own. We can force that flush by calling fsync ourselves.
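To make that concrete, here is a minimal Java sketch - not the app’s actual code, and the helper name writeAndFsync is mine - showing the two steps: write() returning only means the OS has the bytes, while force(true) is what asks the OS to push them down to the storage device.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

static void writeAndFsync(Path path, byte[] data) throws IOException {
    try (FileChannel channel = FileChannel.open(path,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.hasRemaining()) {
            channel.write(buf); // returns once the OS has buffered the bytes - not necessarily on disk yet
        }
        channel.force(true);    // fsync: block until the OS has written content and metadata to the device
    }
}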
The file was initially being written out using Apache Commons’ FileUtils.writeStringToFile - because it is easy and convenient. I later switched it out for Google Guava’s Files.write because it seemed to do a flush in the source code, but neither worked - at restore time the snapshot.json file was still empty. In hindsight that makes sense: a Java-level flush only moves the stream’s buffered bytes into the OS’s buffers, it still doesn’t force the OS to write them to disk.
I changed it to use FileOutputStream, which was the only way I knew of that gave me the ability to force both a flush and a sync.
The write of the snapshot.json file after the fix looked something like this:

File file = new File("./snapshot.json");
try (FileOutputStream fos = new FileOutputStream(file)) {
    fos.write(snapshotJson.getBytes(StandardCharsets.UTF_8)); // snapshotJson holds the metadata being persisted
    fos.flush();        // flush the stream’s own buffer down to the OS
    fos.getFD().sync(); // fsync: block until the OS has written the file’s data to disk
}
Changing the write method to use a FileOutputStream with a sync and force fixed the issue - the file had the contents as expected at restore time. With the correct snapshot file present on the new pod’s volume, the DB was able to start up and perform the restore procedure. It works!
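As an aside - my suggestion, not what the app ended up using - java.nio can express the same durability guarantee more compactly: StandardOpenOption.SYNC requires that every update to the file be written synchronously to the underlying storage device. A sketch (the helper name writeDurably is hypothetical):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

static void writeDurably(Path path, String content) throws IOException {
    Files.write(path, content.getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE,
            StandardOpenOption.TRUNCATE_EXISTING,
            StandardOpenOption.WRITE,
            StandardOpenOption.SYNC); // content and metadata are forced to the device before the write returns
}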
Conclusion
You know how they say “It’s always DNS” - in the DB world it should be “It’s always fsync”.
I was relieved that everything started working - I was able to fix the issue before the release and we didn’t have to delay the deploy.