Pytcask

Pytcask is a toy project aiming at re-implementing Bitcask from scratch in Python. This project is for educational purposes only: it aimed at understanding the core principles behind a simple storage engine.

Getting started

# Create the virtual environment
python3 -m venv .venv

# Activate the virtual env
source .venv/bin/activate

# Install the dependencies
python3 -m pip install -r requirements.txt

# Run tests
pytest

Implementation notes

Overview

Pytcask is a log-structured key-value store mimicking the behavior of Riak's default storage engine: Bitcask.

Main operations

Writes: Because Pytcask is log-structured, key-value records are appended sequentially to data files. Therefore, inserting and updating records is always done in constant time (o(1)).

Reads: During a key lookup, the key is searched in the KeyDir (the in-memory hash table, see below) which contains information about the record location on disk. The corresponding datafile is then read at the correct offset. Thus, reading requires only one single disk seek.

Deletions: Deleting a record cannot be done by removing it from the log file since it is append-only. Instead, a deleted record is marked by a tombstone (= a special record that indicates that this key has been deleted). The tombstone will be used to discard all records corresponding to that key during the merge process. In addition, the key is removed from the KeyDir to indicate that the key has no associated value.

Boot-up process: Since the KeyDir is stored in memory, it will be lost if the server crashes (or even if it stops gracefully). Upon restart, the KeyDir must be rebuilt from the records stored on disk. One way to do it would be to read all data files and build the KeyDir from there. To be more efficient, a hint file is associated to each data file: it contains only the information included in the KeyDir so that it can be built faster.

Merge operations: Since records are only ever appended to data files, anytime a key is assigned a new value, all previous values are never read anymore. To reclaim disk space corresponding to all those obsolete records, a merge process runs in the background: it reads all data files in descending order of creation (from most recent to oldest) and stores the metadata (data file and offset) of the most recently written record for each key in an in-memory hash map. This hash map will then override the current KeyDir. In addition, new merged (and compressed) files are created, and old data files are discarded.

Characteristics and limitations: Writes are made sequentially, and thus in constant time (o(1)). Reads are also made in constant time, requiring one lookup in the KeyDir and one disk seek in the file indicated by the KeyDir lookup. The main limitation is that all keys must hold in memory (inside the KeyDir). Therefore, this storage engine is adapted for applications that are both read and write-heavy, as long as the number of distinct keys remains relatively small.

Main components

KeyDir: Hash table kept in memory that records each key in the dataset and maps them with their offset in data files.
DataFile: Contains all records, i.e. pairs of key-value + metadata: timestamp. Serialization and deserialization of records occur upon insertion into and retrieval from data files.
HintFile: There is one per data file. It contains all the keys from its associated data file and the meta-information (offset of the record within the data file). It is used to allow performant bootups.
MergeWorker: Handles merge operations in the background to reclaim disk space by compacting and merging data files and discarding obsolete records.
Storage: Exposes all commands (get, insert, delete, ...).

References

To build the project, I used the following:

The reference paper explaining the main concepts of Bitcask: https://riak.com/assets/bitcask-intro.pdf
The source code of Bitcask (in Erlang): https://github.com/basho/bitcask

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
src		src
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pytcask

Getting started

Implementation notes

Overview

Main operations

Main components

References

TODO:

About

Uh oh!

Releases

Packages

Uh oh!

Languages

MaudGautier/pytcask

Folders and files

Latest commit

History

Repository files navigation

Pytcask

Getting started

Implementation notes

Overview

Main operations

Main components

References

TODO:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages