This repo contains two (unofficial) implementations of the infini-gram model described in Liu et al. (2024). This branch contains the Golang implementation. The main branch contains a Python implementation.
The tokenizers used here are the Go bindings to the official Rust library.
First, build the Rust tokenizers binary:

```shell
cd tokenizers
make
```

Then, you can build the infinigram binary:

```shell
cd ../
go build -ldflags "-s"
```

To build an index, run:

```shell
./infinigram --train_file corpus.txt --out_dir output --tokenizer_config tokenizer.json
```
where `corpus.txt` contains one document per line, and `tokenizer.json` is a pretrained HuggingFace Tokenizers file (e.g., the one for GPT-2).
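For example, a tiny corpus in the expected format (the contents are purely illustrative):

```shell
# Each line is one document.
printf 'the cat sat on the mat\n' > corpus.txt
printf 'the dog sat on the rug\n' >> corpus.txt
wc -l corpus.txt
```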
This implementation features:
- Next-token and greedy generation (`--interactive_mode {0,1}`)
- `mmap` to access both the tokenized documents and the suffix array; memory usage during inference should be minimal.
- Creating suffix arrays in chunks to further limit memory usage (`--max_mem`): you should hypothetically be able to train (and run inference) on a corpus of any size, regardless of how much memory you have.
- Setting the minimum number of continuations needed for a suffix to be valid (`--min_matches`); e.g., you may set this to a value >= 2 to avoid sparse predictions where the $(n-1)$-gram corresponds to only a single document.
- A WIP alternative that uses FM-indices + wavelet trees instead of suffix arrays. It uses ~7.5x less disk space, but some queries take longer. See the FM-index branch for more info.
Run `./infinigram --help` for more information.
Comparison with the official API: Pile-val with the Llama-2 tokenizer seems to match.

TODO:
- Parallel inference
- Use an external suffix array algorithm (e.g., fSAIS) to build indices for larger datasets.
I use the `text_64` function implemented in the Go `suffixarray` library; the files under `suffixarray/` are from this library with minor modifications.