Poor dev's replicated file system

Introduction

Concurrency is difficult. Distributed state is difficult. Dealing with distributed state getting concurrent writes is how you get a staff developer position at Google.

Context

I recently had to work in a situation where multiple instances of compute had to process and modify a single file in S3. The modification was an append to end of file but regardless that sort of operation is not possible in S3 as it is an object store and not a block store. To modify a file in S3, the compute instance needs to pull the file to its own local filesystem, edit it locally, and the re-upload it at the same key [overwriting].

Want

Wouldn’t it be great if we had a highly available cluster of machines that exposed a file system that supported concurrent transactional writes without having to pull the file locally.

Tradeoff

Nothing is free and there is no such thing as a perfect solution in software only trade-offs. The approach presented in this post will have reads be eventually consistent.

Key components

Deterministic ordering using a shared write ahead log

The big problem in distributed transactions is determining and coordinating ordering. There are strategies like last writer wins but answering who wrote last is difficult to determine in a distributed system.

Key idea: If our transactions are deterministic then each instance of the database can apply the transaction log to an initial starting state and they will all reach the same end state. Most databases use a write ahead log internally - we bring that piece out externally and let many consumers write to and read from it.

To have a shared write ahead log we can use redis streams that will act as our transaction log that all the replicas will consume events from and apply locally to their own state. Streams are a log data structure that allows us to do efficient range reads and cursor style reads from a position - they are conceptually similar to kafka.

Transactional File system

Instead of reading and writing raw files and creating a file system ourselves we can interact with an embedded storage manager that gives us a transactional virtual file system such as Xodus by Jetbrains.

API

Write operations:

  • Write file at specified position
  • Rename file
  • Delete file
  • Append to file

Read operations:

  • Read file from specified position
  • Get list of files

Write path

  1. Receive request
  2. Write mutate operation details to redis streams using xadd
  3. Return 200 back to client

Read path

  1. Receive request
  2. Read key [file] from Xodus
  3. Return 200 with value or not found response to client

Background stream consumer

In the background we have a forever loop that consumes write messages from the redis streams and applies them via Xodus. Redis guarantees that they will be delivered in order so we can apply them and get to the same end state on each machine.

Trim and backup

The server itself can backup the filesystem DB, which would include the redis stream offset of the stream. It would then trim the messages in redis up to that offset and upload the backup to remote storage.

Conclusion

Now I can have multiple machines behind a load balancer serving requests each connected to the same redis instance consuming the mutate messages, while still allowing concurrent writes. No need to download the file locally anymore.

Further readings

Deterministic Database: http://www.cs.umd.edu/~abadi/papers/abadi-cacm2018.pdf

Aries and Wal: https://cs.stanford.edu/people/chrismre/cs345/rl/aries.pdf

Xodus VFS: https://github.com/JetBrains/xodus/wiki/Virtual-File-Systems