This repo contains two (unofficial) implementations of the infini-gram model described in Liu et al. (2024). This branch contains the Golang implementation. The main branch contains a Python implementation.
The tokenizers used here are the Go bindings to the official Rust library.
First, build the Rust tokenizers binary:

```shell
cd tokenizers
make
```

Then, you can build the infinigram binary:

```shell
cd ../
go build -ldflags "-s"
```

To build an index, run:

```shell
./infinigram --train_file corpus.txt --out_dir output --tokenizer_config tokenizer.json
```
where `corpus.txt` contains one document per line, and `tokenizer.json` is a pretrained HuggingFace Tokenizers file (e.g., the one for GPT-2).
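For example, a tiny corpus in the expected format (the contents are purely illustrative):

```shell
# Each line is one document.
printf 'the cat sat on the mat\n' > corpus.txt
printf 'the dog sat on the rug\n' >> corpus.txt
wc -l corpus.txt
```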
This implementation features:
- Next-token and greedy generation (`--interactive_mode {0,1}`)
- `mmap` to access both the tokenized documents and the suffix array; memory usage during inference should be minimal.
- Creating suffix arrays in chunks to further limit memory usage (`--max_mem`): you should hypothetically be able to train (and run inference) on a corpus of any size, regardless of how much memory you have.
- Setting the minimum number of continuations needed for a suffix to be valid (`--min_matches`); e.g., you may set this to a value >= 2 to avoid sparse predictions where the $(n-1)$-gram corresponds to only a single document.
- A WIP alternative that uses FM-indices + wavelet trees instead of suffix arrays. It uses ~7.5x less disk space, but some queries take longer. See the FM-index branch for more info.
Run `./infinigram --help` for more information.
Comparison with the official API: Pile-val with the Llama-2 tokenizer seems to match.

TODO:
- Parallel inference
- Use an external suffix array algorithm (e.g., fSAIS) to build indices for larger datasets.
I use the `text_64` function implemented in the Go `suffixarray` library; the files under `suffixarray/` are from this library with minor modifications.