Training a coding large language model from scratch using a custom C++ tokenizer and PyTorch.
Given the prompt `def fib(`, the model generates:
```python
def fib( n ):
    if n == 1:
        return 1
    else:
        return fib( n-1 ) + fib( n-2 )
```

Generated in 0.2s at 496 tokens/second on an NVIDIA RTX 5070 Ti.
The model was trained on Python code for 2 days on an NVIDIA RTX 5070 Ti.
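For context, such completions come from standard autoregressive decoding: the prompt is tokenized, the model predicts the next token, and the result is fed back in until enough tokens are produced. Below is a minimal greedy-decoding sketch in PyTorch; the `model` and `tokenizer` objects and their methods are assumptions for illustration, not this project's actual API:

```python
import torch

@torch.no_grad()
def complete(model, tokenizer, prompt, max_new_tokens=64):
    # Greedy autoregressive decoding: append the most likely next token each step.
    ids = torch.tensor([tokenizer.encode(prompt)])  # shape (1, seq_len); encode() is an assumed API
    for _ in range(max_new_tokens):
        logits = model(ids)                         # assumed shape (batch, seq, vocab)
        next_id = logits[0, -1].argmax()            # most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0].tolist())
```

A completion like the `fib` example above would come out of this loop, possibly with temperature or top-k sampling in place of the argmax.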
This project is a work in progress. See the detailed journal at `docs/journal.pdf`.
Install the dependencies with:

```bash
uv sync
```

This project uses DVC. To execute the pipeline, run:
```bash
uv run dvc repro
```

The tokenizer is a separate module; it implements BPE (byte-pair encoding) training and encoding/decoding.
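As a rough illustration of what BPE training does (the real implementation is the C++ module, so treat this as a sketch of the algorithm rather than the project's code), each round counts adjacent symbol pairs across the corpus and merges the most frequent one:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Start from words split into individual characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

# On a toy corpus this learns merges like ('l','o'), ('lo','w'), ...
print(train_bpe(["low", "lower", "lowest", "low"], num_merges=3))
```

Encoding then replays the learned merges on new text; decoding concatenates the symbols back together.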
A Python binding is available (see the usage sketch below), as well as the option to compile the executables:
```bash
cmake -S tokenizer -B tokenizer/build -DCMAKE_BUILD_TYPE=Release && cmake --build tokenizer/build -j$(nproc)
```

This produces the following executables:
```
./tokenizer/build/tokenize
./tokenizer/build/tokenize-cli
./tokenizer/build/encoding
./tokenizer/build/visualize
```
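A hypothetical usage sketch of the Python binding mentioned above; the module name, class, methods, and model path are all assumptions for illustration, not the binding's documented API:

```python
# Hypothetical usage of the tokenizer's Python binding.
# All names below (module, class, path) are assumptions, not the actual API.
import tokenizer

bpe = tokenizer.Tokenizer("path/to/trained.bpe")  # assumed: load a trained BPE model
ids = bpe.encode("def fib(n):")                   # text -> token ids
print(bpe.decode(ids))                            # token ids -> text
```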
To run the Python tests, run:
```bash
uv run pytest src
```

To run the tests in `./tokenizer`, make sure you have compiled it first, then run:
```bash
cd ./tokenizer/build
make test
```