Fixing a race

It works

25-Aug-2023

concurrency

Introduction

Recently, I was tasked with fixing a race condition bug. The basic gist of it was that a map in a global singleton was being used externally by a few threads before the singleton thread itself had filled in the details in the map leading to a null pointer exception.

Let’s look at the individual pieces

The first call site

The first call place is directly in the main method during the start method where all the wiring and orchestration happens. There is a check to see if the currently running self instance is a secondary and if so then call HeartBeatManager.getInstance

The second call site

During the startup phase a meta level server manager is started - it has the responsibility to restore from the latest available snapshot on disk. During that restore process if the restored database has a secondary replicator then the heartbeat manager is also called to get the information from the latest heartbeat from the primary. This is the call site of the NPE that was being thrown.

During the orchestration in main the HeartBeatManager is initialized after the meta level server manager is started because the heartbeat manager is dependent upon the server manager [and server manager needs the heart beat manager sometimes during the restore process]

The singleton

Due to an inability to have dependency injection via constructor parameters the codebase uses the pattern of creating singletons with static getInstance method to use across the various places it is needed.

The singleton here is a heart beat manager that periodically makes a heartbeat request to the primary database to grab system level details such as transaction ids and nodes available in the primary cluster.

The code looks something like

public class HeartBeatManager {
    private volatile HeartBeatManager hbManager;
    private HashMap<> heartbeatResponseMap;
    private HeartBeatManager() { }

    public static getInstance() {
        if (hbManager != null) {
            return hbManager;
        }

        synchronized(HeartBeatManager.class) {
            if (hbManager != null) {
                return hbManager;
            }

            hbManager = new HeartBeatManager();
            hbManager.initializeEmptyMap();
            HeartBeatResponse response = hbManager.makeHeartBeat();
            hbManager.fillMapWithResponse(response);

            return hbManager;
        }

        return hbManager;
    }

    // other parts of the grab information from the heartbeat using HeartBeatManager.getInstance().getInfromationFromMap()
    // the methods other places in the code are calling to get the information from the last heartbeat response
    // this throws an NPE if the map has not been filled yet i.e. it gets a null value for a key from the map and then calls a method on that
    // null value
    public ParsedData getInformationFromMap() { ... }
}

Async

At first I thought it was because of an rxJava async subscribe call i.e. the code was returning the object before filling in the details but after adding some logs to trace through that was not happening.

Server manager vs main

If the server manager calls the method before the heart beat manager is initialized in main it should still be fine since the synchronized block makes sure that the data is filled in the map before the instance is returned.

If main calls the method first then the data is filled and server manager should get a properly initialized instance

The subtlety

Then I saw it. The issue is definitely happening because of the race between the initialization in main vs the call coming from server manager during the restore process.

What was happening was main was calling it first and then server manager was calling it but server manager was getting a half initialized instance. Why was that?

the key line here is hbManager = new HeartBeatManager(); This line causes the if guards in the getInstance method to start returning true.

Main was calling getInstance first but at the same time server manager was calling it too.

The call from main would get to line hbManager - new HeartBeatManager() and at that point is when the server manager would call getInstance at that point in time the first if statement would return true and return the uninitialized heart beat manager [lines 19 to 21 had not executed even in the original call from main]

public class HeartBeatManager {
    public static getInstance() {
        // since hbMnager was already set in line 19 it was returned from here to server manager
        if (hbManager != null) {
            return hbManager;
        }

        synchronized(HeartBeatManager.class) {
            if (hbManager != null) {
                return hbManager;
            }

            hbManager = new HeartBeatManager(); // main would get to here when server manager called getInstance
            hbManager.initializeEmptyMap();
            HeartBeatResponse response = hbManager.makeHeartBeat();
            hbManager.fillMapWithResponse(response);

            return hbManager;
        }

        return hbManager;
    }
}

The fix

The fix was quite simple - moved the assignment line to after the data in the map had been initialized

public class HeartBeatManager {
    public static getInstance() {
        if (hbManager != null) {
            return hbManager;
        }

        synchronized(HeartBeatManager.class) {
            if (hbManager != null) {
                return hbManager;
            }

            // assign to a local instance variable
            HeartBeatManager localmanager = new HeartBeatManager();
            localmanager.initializeEmptyMap();
            HeartBeatResponse response = localmanager.makeHeartBeat();
            localmanager.fillMapWithResponse(response);
            // assign to singletone member only after map has been filled
            hbManager = localmanager;
            return hbManager;
        }

        return hbManager;
    }
}

Takeaways

We will be fixing concurrency bugs till we are old and grey. We think adding a thread pool will solve our software problems but all we are left with is a pool of problems.