This project is a simple implementation of OpenAI's GPT-2 (124M parameters) language model inference, written entirely in plain C++ for CPU (unfortunately, I don't have a GPU). The main goal is to create a self-contained GPT-2 inference engine without relying on external libraries or dependencies. By building everything from scratch, I want to gain a deeper understanding of how LLMs work and have full control over performance optimizations. The initial focus is on correctness and simplicity; performance optimizations will be introduced incrementally to improve inference speed and efficiency.
- C++ compiler: A compiler that supports C++17 or higher (e.g., GCC, Clang, MSVC).
- CMake: Version 3.15 or higher for building the project.
- Python 3 with PyTorch installed: For running the conversion script to prepare model weights.
- OpenMP (optional): For running inference on multiple threads.
Or just use the provided Dockerfile. It will create an image with all requirements (including downloading the model weights and vocabulary).
Obtain the GPT-2 model weights (pytorch_model.bin) from Hugging Face. The C++ code expects the model weights in a specific binary format. Use the provided export_weights.py script to convert the pytorch_model.bin file into a binary file containing all weights serialized as 32-bit floats:
```bash
python export_weights.py <output_filepath> <pytorch_model.bin_filepath>
```
Then you also need to download vocab.json from the same Hugging Face repo.
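On the C++ side, loading such a file reduces to streaming raw floats from disk. Here is a hypothetical sketch (the `read_floats` name and the assumption of a flat, headerless stream of 32-bit floats are mine for illustration; the real layout is whatever export_weights.py writes):

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <vector>

// Hypothetical loader: assumes the file is a flat stream of 32-bit floats
// with no header, and reads exactly n of them.
std::vector<float> read_floats(const char* path, std::size_t n) {
    std::vector<float> buf(n);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) throw std::runtime_error("cannot open weights file");
    std::size_t got = std::fread(buf.data(), sizeof(float), n, f);
    std::fclose(f);
    if (got != n) throw std::runtime_error("weights file shorter than expected");
    return buf;
}
```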
```bash
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
```
or, if you want to compile with OpenMP:
```bash
cmake -DCMAKE_BUILD_TYPE=Release -DWITH_OPENMP=1 ..
```
After building the executable, try an example:
```bash
./gpt2_plain_cpp <model_weights_path> <vocab_json_path> -p "I'm learning deep neural networks" -n 20
```
The following options are supported:
```
Required:
  [model_weights_path] -- path to the .bin file generated with the export_weights.py script
  [vocab_json_path]    -- path to vocab.json for GPT-2
Optional:
  [-p <std::string>]   -- prompt to start generation with
  [-n <int>]           -- number of tokens to generate
  [-s <int>]           -- seed, if you want to reproduce the result
  [-t <int>]           -- number of threads (default: 1)
```
Here I will try to describe all the performance optimizations I have made so far. Our main metrics will be Time To First Token (TTFT), Time Per Output Token (TPOT), and total generation time (TTFT + TPOT × number of generated tokens).
My hardware: MacBook Pro (M2, 2023)
| | TTFT (50-token prompt) | TPOT (200 tokens total) | Total generation time |
|---|---|---|---|
| Optimization 0 | 7900 ms | 155 ms | 38.9 s |
| Optimization 1 | 614.8 ms | 12.3 ms | 3.1 s |
| Optimization 2 | 322.19 ms | 7.2 ms | 1.7 s |
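As a sanity check, the totals follow from the formula above; e.g., for Optimization 1: 614.8 ms + 12.3 ms × 200 ≈ 3.1 s.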
Initially, I implemented a very naive inference pass without any optimizations (except a KV-cache).
Let's profile the code to find the hot path:
As we can see, the program spends the vast majority of its time (97.5%) in the forward function of the Linear layer, which is hardly surprising. We can also observe that the current implementation is not cache-friendly: because the weight matrices are stored row-contiguous, the inner loop's memory access pattern is non-sequential. To address this, we can simply swap the inner and outer loops (see the corresponding commit).
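To make the access pattern concrete, here is a minimal sketch of both loop orders (not the exact project code), assuming a single token and weights stored row-contiguous as `w[in_features][out_features]`, which matches GPT-2's Conv1D layout:

```cpp
#include <vector>

// Naive order: the inner loop walks w with stride out_features, so consecutive
// iterations touch different cache lines and most of each fetched line is wasted.
void forward_naive(const std::vector<float>& w, const std::vector<float>& bias,
                   const std::vector<float>& x, std::vector<float>& out,
                   int in_features, int out_features) {
    for (int o = 0; o < out_features; ++o) {
        float acc = bias[o];
        for (int i = 0; i < in_features; ++i)
            acc += x[i] * w[i * out_features + o];  // stride = out_features
        out[o] = acc;
    }
}

// Swapped order: w is now read strictly sequentially, one contiguous row per
// input channel, while `out` is small enough to stay resident in cache.
void forward_swapped(const std::vector<float>& w, const std::vector<float>& bias,
                     const std::vector<float>& x, std::vector<float>& out,
                     int in_features, int out_features) {
    for (int o = 0; o < out_features; ++o) out[o] = bias[o];
    for (int i = 0; i < in_features; ++i) {
        const float xi = x[i];
        const float* row = &w[i * out_features];
        for (int o = 0; o < out_features; ++o)
            out[o] += xi * row[o];  // stride = 1
    }
}
```

Both versions compute the same matrix-vector product; only the traversal order changes, so every cache line fetched from the weight matrix is fully used.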
This first optimization is the easiest, yet it brings the most significant performance improvement (12.5x!).
We still have room for improvement in Linear::forward: we are using only a single thread, so adding multithreading could help, since this function is still the hot path of inference. First, we transpose the weight matrices of all linear layers to make them column-contiguous, then swap the inner and outer loops back. This keeps memory access sequential while enabling effective parallelism, as each thread can process its own batch of output channels (see the corresponding commit). This optimization brings an almost twofold performance improvement with 8 threads. For larger models, the improvement should be even more significant, because for a model this small the multithreading overhead is relatively high compared to the amount of computation inside the forward function.
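For illustration, here is a minimal sketch of the idea (again, not the exact project code), assuming the weights have been transposed at load time into `wT[out_features][in_features]`:

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch of the multithreaded Linear forward pass for a single token.
// Assumes wT is the transposed weight matrix, stored row-contiguous as
// wT[out_features][in_features], so each output channel reads one contiguous row.
void forward_parallel(const std::vector<float>& wT, const std::vector<float>& bias,
                      const std::vector<float>& x, std::vector<float>& out,
                      int in_features, int out_features) {
    // schedule(static) hands each thread a contiguous chunk of output channels;
    // threads write to disjoint parts of `out`, so no synchronization is needed.
    #pragma omp parallel for schedule(static)
    for (int o = 0; o < out_features; ++o) {
        const float* row = &wT[static_cast<std::size_t>(o) * in_features];
        float acc = bias[o];
        for (int i = 0; i < in_features; ++i)
            acc += x[i] * row[i];  // sequential reads within the row
        out[o] = acc;
    }
}
```

Note that when the project is built without -DWITH_OPENMP, the pragma is simply ignored and the loop runs single-threaded, so the same code serves both configurations.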