Building my own data structures server

02-Oct-2022

Introduction

Most software projects interact with information or data. The standard way is to use a database such as Postgres that persists all this information to disk. Quite often I find myself in situations where most, if not all, of the dataset can comfortably sit in memory. Hardware has gotten really cheap - DigitalOcean provides instances with up to 256GB of RAM, which is more than sufficient for most projects. Keeping everything in memory gives major performance benefits [SSDs although fast in general are quite slow in cloud environments] and avoids having to bring in an external database which in itself requires operational overhead such as setting up backups, security, and replication for high availability. I am all in on the philosophy that every KB of RAM unused is a KB of RAM wasted specially when working with data storage and retrieval.

Redis

Redis is awesome. Antirez [developer of redis] is a legend. If the first data store that most projects bring in is Postgres then the next most popular on that list is Redis. You cannot have a conversation about in memory systems without talking about Redis. Redis markets itself as a data structures server i.e. it provides in-memory data structures such as lists, and sets over a tcp connection. It also snapshots the data to disk periodically and thus provides some semblance of data durability, which [data durability on restarts] is most people’s concern when in-memory systems are discussed.

Why not redis?

If the data fits in memory and redis is an amazing in memory server then why not just go ahead and use that? Initially, that is what I was doing - spun up redis and used that as the main datastore for the application. However, there are a few things that limit redis.

Redis is single threaded

For the most part redis is single threaded, which is a quasi-feature as it provides an easy way to control concurrency but it gets in the way as the load on the server grows since only 1 cpu can be used. More importantly as the dataset grows and the instance is vertically scaled up - it comes with more CPU as well. It is rare to get a larger instance with just more memory so on the extreme end you can get a server with 256GB of ram and 32 cpu - but redis will only be able to max out one of those cores and the rest will all be relatively unused and under utilized. To utilize the rest of the cores, multiple instances of redis must be deployed and then the requests will need to be load balanced across all of them which adds to complexity.

I want to be able to use all the cores on my server as I scale it up to accommodate a larger dataset.

Redis clients connect over TCP

The network protocol used to communicate with redis is TCP. This has advantages of being quite efficient in terms of the payloads that are sent [compact binary messages] and the connection itself is long lasting. Auth and security is provided via a password and TLS.

However I wanted a server that could be used over HTTPS, since TCP is not available in certain environments such as Cloudflare workers, and provided auth via simpler API key mechanism.

Inverted TTLs

One of my favourite feature of redis is the ability to provide TTLs i.e. after a certain amount of time the data is considered expired and is removed from the dataset. Not only does this allow keeping a check on how many items are kept in memory it also allows modeling various business requirements where data only needs to be kept around for a limited amount of time without having to write cron scripts that delete data periodically.

However what redis does not have is inverted TTL i.e. after a certain amount of time the data that was removed is added back. If you have a key userid:can:sendMsg which defines the permission that a user can send a message - it can be removed temporarily, thereby removing the ability for the user to send messages when they have been reported by multiple users for the specified TTL.

Delay queues

A delay queue is a queue where the items are not visible to the consumer until a pre-specified delay has passed. Redis lists are frequently used as a queue but items added to the queue are immediately visible to all consumers - there is no way to specify visibility.

I wanted the ability to have a data structure that would provide semantics where an item can be kept hidden from consumers for an arbritray amount of time.

Embedded

A pattern I picked up when doing Golang development was the ability to expose the same piece of logic via a CLI tool, http server, as well as being able to bring it in a regular imported dependency. Redis cannot be used as a cli tool or brought in as a dependency - it has to be used in a client-server deployment pattern.

I wanted the ability to bring in the data structures as a simple maven dependency if needed for any next project.

Overview of the design

The server is a Vert.x HTTP server that exposes actions on the data structures. It periodically snapshots all the data to disk [and auto backs it up to an S3 compatible storage] and loads everything back at startup so there is soft durability. All the data structures are concurrency friendly and the server itself is multi threaded and thus can scale to utilize multiple CPU cores as the instance is scaled up. Auth is provided by an x-api-key header and metrics are exposed at /metrics in prometheus format.

The data structures exposed are:

Counters [scalable long counters with both increment and decrement capabilities]
Delay queues [By default messages are not visible for 30 seconds]
Hash maps [includes set if not exist]
Sorted maps [includes set if not exist]
Lists [Double ended queue]
Sets [with ability to do union/intersection/diff between sets]
Sorted sets [with ability to do union/intersection/diff between sets]
Streams [kafka style log]

The code is written in kotlin and deployed as a docker image. The app is the data store and is completely self contained - it does not use anything external and is not a functionality-add wrapper over redis.

Conclusion

The project isn’t meant to compete with redis and is in-fact heavily inspired by redis but tweaked a bit to better suit the requirements of the projects I work on.

What’s next on the road map? Likely build a service that uses counters to count hits, delay queues to send out emails, hash-maps to store user data, sorted maps/sets to keep track of leader boards, and streams to write out events in an ordered log.

Reach out at [email protected] if you are interested in using this