Fixing a race
Introduction
Recently, I was tasked with fixing a race condition bug. The basic gist of it was that a map in a global singleton was being used externally by a few threads before the singleton thread itself had filled in the details in the map leading to a null pointer exception.
Let’s look at the individual pieces
The first call site
The first call place is directly in the main method during the start
method where all the wiring and orchestration happens. There is a check to see if the currently running self instance is a secondary and if so then call HeartBeatManager.getInstance
The second call site
During the startup phase a meta level server manager is started - it has the responsibility to restore from the latest available snapshot on disk. During that restore process if the restored database has a secondary replicator then the heartbeat manager is also called to get the information from the latest heartbeat from the primary. This is the call site of the NPE that was being thrown.
During the orchestration in main the HeartBeatManager is initialized after the meta level server manager is started because the heartbeat manager is dependent upon the server manager [and server manager needs the heart beat manager sometimes during the restore process]
The singleton
Due to an inability to have dependency injection via constructor parameters the codebase uses the pattern of creating singletons with static getInstance
method to use across the various places it is needed.
The singleton here is a heart beat manager that periodically makes a heartbeat request to the primary database to grab system level details such as transaction ids and nodes available in the primary cluster.
The code looks something like
1 | public class HeartBeatManager { |
Async
At first I thought it was because of an rxJava async subscribe call i.e. the code was returning the object before filling in the details but after adding some logs to trace through that was not happening.
Server manager vs main
If the server manager calls the method before the heart beat manager is initialized in main it should still be fine since the synchronized block makes sure that the data is filled in the map before the instance is returned.
If main calls the method first then the data is filled and server manager should get a properly initialized instance
The subtlety
Then I saw it. The issue is definitely happening because of the race between the initialization in main vs the call coming from server manager during the restore process.
What was happening was main was calling it first and then server manager was calling it but server manager was getting a half initialized instance. Why was that?
the key line here is hbManager = new HeartBeatManager();
This line causes the if guards in the getInstance
method to start returning true.
Main was calling getInstance
first but at the same time server manager was calling it too.
The call from main would get to line hbManager - new HeartBeatManager()
and at that point is when the server manager would call getInstance
at that point in time the first if statement would return true and return the uninitialized heart beat manager [lines 19 to 21 had not executed even in the original call from main]
1 | public class HeartBeatManager { |
The fix
The fix was quite simple - moved the assignment line to after the data in the map had been initialized
1 | public class HeartBeatManager { |
Takeaways
We will be fixing concurrency bugs till we are old and grey. We think adding a thread pool will solve our software problems but all we are left with is a pool of problems.