This repository contains the code and data for the paper "Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service"

Tokenization Multiplicity

This repository contains the code used in the paper Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service by Ivi Chatzi, Nina Corvelo Benz, Stratis Tsirtsis and Manuel Gomez-Rodriguez.

Contents:

  • Introduction
  • Repository structure
  • Setup & instructions
  • Experiments on tokenization multiplicity
  • Experiments on canonical generation
  • Contact & attribution

Introduction

Providers of LLM-as-a-service have predominantly adopted a simple pricing model: users pay a fixed price per token. One might therefore expect two users to pay the same price for the same output string generated from the same input prompt. In our work, we show that, surprisingly, this is not always true. We find empirical evidence that, particularly for non-English outputs, both proprietary and open-weights LLMs often generate the same output string with multiple different tokenizations, even under the same input prompt, and this in turn leads to arbitrary price variation. To address the problem of tokenization multiplicity, we introduce canonical generation, a type of constrained generation that restricts LLMs to generate only canonical tokenizations, that is, the unique tokenization in which each string is tokenized during the training process of an LLM. Further, we introduce an efficient sampling algorithm for canonical generation based on the Gumbel-Max trick. Experiments on a variety of natural language tasks demonstrate that canonical generation is comparable to standard generation in terms of performance and runtime, and it solves the problem of tokenization multiplicity.
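To make the pricing effect concrete, here is a toy, self-contained sketch (the token prices and tokenizations below are hypothetical, not taken from any provider): when the same output string is produced under two different tokenizations, per-token pricing charges different amounts for identical text.

```python
PRICE_PER_TOKEN = 0.00001  # hypothetical price in dollars per output token


def price(tokens):
    """Per-token pricing: the cost depends only on the number of tokens."""
    return len(tokens) * PRICE_PER_TOKEN


# Two tokenizations of the same output string "tokenization"
# (illustrative splits, not a real model's vocabulary):
canonical = ["token", "ization"]
alternative = ["tok", "en", "iz", "ation"]

assert "".join(canonical) == "".join(alternative)  # identical output string
print(price(canonical), price(alternative))  # the user is charged differently
```

The string the user receives is identical in both cases; only the (invisible) tokenization differs, yet the second user is billed for twice as many tokens.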

Repository structure

├── configs
├── data
├── figures
├── notebooks
├── outputs
│   ├── conflicts
│   └── ...
├── scripts
│   ├── coupled_generation.sh
│   ├── evals.sh
│   ├── multiplicity_open.sh
│   └── multiplicity_proprietary.sh
└── src
    └── ccan

  • configs contains YAML files that specify the experiment parameters.
  • data contains the data used for our experiments.
  • figures contains all the figures presented in the paper.
  • notebooks contains Python notebooks that generate all the figures included in the paper.
  • outputs/conflicts contains non-reproducible cases of tokenization multiplicity produced by proprietary models.
  • outputs/... contains intermediate output files generated by the experiment scripts.
  • scripts contains the scripts used to run all the experiments presented in the paper.
  • src/ccan contains all the code necessary to reproduce the results in the paper.

Setup & instructions

All experiments were performed using Python 3.11. To create a virtual environment and install the project dependencies, run:

python3 -m venv .env
source .env/bin/activate
pip install -e .

The experiments involve calling the OpenAI, Gemini and Claude APIs and require an API key for each:

export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export GOOGLE_API_KEY="your-gemini-api-key"

Our experiments use LLMs from the Llama family, which require a license to use. You can request access at https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Once you have access, you can download any model in the Llama family. Before running the scripts, authenticate with your Hugging Face account:

huggingface-cli login

export HF_HOME="/path/to/your/cache"  # optional: set the Hugging Face cache directory

Experiments on tokenization multiplicity

To obtain outputs for the tokenization multiplicity experiments, run the following scripts:

./scripts/multiplicity_open.sh
./scripts/multiplicity_proprietary.sh 
ccan run -c configs/long-translate.yaml

To recreate the plots, run the notebooks notebooks/tokenization_multiplicity.ipynb and notebooks/repeat.ipynb.

Experiments on canonical generation

To obtain outputs for the canonical generation experiments, run the following scripts:

./scripts/coupled_generation.sh
./scripts/coupled_generation.sh --interventional
./scripts/evals.sh

To recreate the results, run the notebook notebooks/evaluation.ipynb.
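The canonical generation experiments above rely on the paper's sampling algorithm, which is based on the Gumbel-Max trick. As background, here is a minimal, self-contained illustration of the trick itself (not the repository's implementation): perturbing each logit with independent Gumbel(0, 1) noise and taking the argmax is equivalent to sampling from the softmax distribution over the logits.

```python
import math
import random
from collections import Counter


def gumbel_max_sample(logits, rng):
    """Draw one categorical sample from softmax(logits) via the Gumbel-Max trick:
    argmax over logits perturbed by independent Gumbel(0, 1) noise."""
    perturbed = [l - math.log(-math.log(rng.random())) for l in logits]
    return max(range(len(perturbed)), key=perturbed.__getitem__)


rng = random.Random(0)
logits = [2.0, 1.0, 0.0]
counts = Counter(gumbel_max_sample(logits, rng) for _ in range(10_000))
# Empirical frequencies approximate softmax(logits) ~ (0.665, 0.245, 0.090)
print({i: counts[i] / 10_000 for i in sorted(counts)})
```

One appeal of this formulation for constrained generation is that the noise can be fixed and the argmax taken over a restricted candidate set, which is what makes coupling standard and constrained samples convenient.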

Contact & attribution

If you have questions about the code, identify potential bugs, or would like us to add functionality, feel free to open an issue or contact Ivi Chatzi.

If you use parts of the code in this repository for your own research, please consider citing:

@article{chatzi2026tokenization,
  title={Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service},
  author={Ivi Chatzi and Nina Corvelo Benz and Stratis Tsirtsis and Manuel Gomez-Rodriguez},
  journal={arXiv preprint arXiv:2506.06446},
  year={2026}
}
