This repository contains the code and data for the paper "Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service"

Tokenization Multiplicity

This repository contains the code used in the paper Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service by Ivi Chatzi, Nina Corvelo Benz, Stratis Tsirtsis and Manuel Gomez-Rodriguez.

Contents:

  • Introduction
  • Repository structure
  • Setup & instructions
  • Experiments on tokenization multiplicity
  • Experiments on canonical generation
  • Contact & attribution

Introduction

Providers of LLM-as-a-service have predominantly adopted a simple pricing model: users pay a fixed price per token. One might therefore expect two users to pay the same price for the same output string generated from the same input prompt. In our work, we show that, surprisingly, this is not always true. We find empirical evidence that, particularly for non-English outputs, both proprietary and open-weights LLMs often generate the same output string with multiple different tokenizations, even under the same input prompt, and this in turn leads to arbitrary price variation. To address the problem of tokenization multiplicity, we introduce canonical generation, a type of constrained generation that restricts LLMs to generate only canonical tokenizations, that is, the unique tokenization in which each string is tokenized during the training process of an LLM. Further, we introduce an efficient sampling algorithm for canonical generation based on the Gumbel-Max trick. Experiments on a variety of natural language tasks demonstrate that canonical generation is comparable to standard generation in terms of performance and runtime, and it solves the problem of tokenization multiplicity.
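To make the pricing effect concrete, here is a toy, self-contained sketch (the token prices and tokenizations below are hypothetical, not taken from any provider): when the same output string is produced under two different tokenizations, per-token pricing charges different amounts for identical text.

```python
PRICE_PER_TOKEN = 0.00001  # hypothetical price in dollars per output token


def price(tokens):
    """Per-token pricing: the cost depends only on the number of tokens."""
    return len(tokens) * PRICE_PER_TOKEN


# Two tokenizations of the same output string "tokenization"
# (illustrative splits, not a real model's vocabulary):
canonical = ["token", "ization"]
alternative = ["tok", "en", "iz", "ation"]

assert "".join(canonical) == "".join(alternative)  # identical output string
print(price(canonical), price(alternative))  # the user is charged differently
```

The string the user receives is identical in both cases; only the (invisible) tokenization differs, yet the second user is billed for twice as many tokens.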

Repository structure

├── configs
├── data
├── figures
├── notebooks
├── outputs
│   ├── conflicts
│   └── ...
├── scripts
│   ├── coupled_generation.sh
│   ├── evals.sh
│   ├── multiplicity_open.sh
│   └── multiplicity_proprietary.sh
└── src
    └── ccan

  • configs contains YAML files that specify the experiment parameters.
  • data contains the data used for our experiments.
  • figures contains all the figures presented in the paper.
  • notebooks contains Python notebooks that generate all the figures included in the paper.
  • outputs/conflicts contains non-reproducible cases of tokenization multiplicity produced by proprietary models.
  • outputs/... contains intermediate output files generated by the experiment scripts.
  • scripts contains the scripts used to run all the experiments presented in the paper.
  • src/ccan contains all the code necessary to reproduce the results in the paper.

Setup & instructions

All experiments were performed using Python 3.11. To create a virtual environment and install the project dependencies, run:

python3 -m venv .env
source .env/bin/activate
pip install -e .

The experiments involve calling the OpenAI, Gemini and Claude APIs and require an API key for each:

export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export GOOGLE_API_KEY="your-gemini-api-key"

Our experiments use LLMs from the Llama family, which require a license to use. You can request access at https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Once you have access, you can download any model in the Llama family. Before running the scripts, authenticate with your Hugging Face account:

huggingface-cli login

export HF_HOME="/path/to/your/cache"  # optional: set the Hugging Face cache directory

Experiments on tokenization multiplicity

To obtain outputs for the tokenization multiplicity experiments, run the following scripts:

./scripts/multiplicity_open.sh
./scripts/multiplicity_proprietary.sh 
ccan run -c configs/long-translate.yaml

To recreate the plots, run the notebooks notebooks/tokenization_multiplicity.ipynb and notebooks/repeat.ipynb.

Experiments on canonical generation

To obtain outputs for the canonical generation experiments, run the following scripts:

./scripts/coupled_generation.sh
./scripts/coupled_generation.sh --interventional
./scripts/evals.sh

To recreate the results, run the notebook notebooks/evaluation.ipynb.
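The canonical generation experiments above rely on the paper's sampling algorithm, which is based on the Gumbel-Max trick. As background, here is a minimal, self-contained illustration of the trick itself (not the repository's implementation): perturbing each logit with independent Gumbel(0, 1) noise and taking the argmax is equivalent to sampling from the softmax distribution over the logits.

```python
import math
import random
from collections import Counter


def gumbel_max_sample(logits, rng):
    """Draw one categorical sample from softmax(logits) via the Gumbel-Max trick:
    argmax over logits perturbed by independent Gumbel(0, 1) noise."""
    perturbed = [l - math.log(-math.log(rng.random())) for l in logits]
    return max(range(len(perturbed)), key=perturbed.__getitem__)


rng = random.Random(0)
logits = [2.0, 1.0, 0.0]
counts = Counter(gumbel_max_sample(logits, rng) for _ in range(10_000))
# Empirical frequencies approximate softmax(logits) ~ (0.665, 0.245, 0.090)
print({i: counts[i] / 10_000 for i in sorted(counts)})
```

One appeal of this formulation for constrained generation is that the noise can be fixed and the argmax taken over a restricted candidate set, which is what makes coupling standard and constrained samples convenient.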

Contact & attribution

If you have questions about the code, identify potential bugs, or would like us to add functionality, feel free to open an issue or contact Ivi Chatzi.

If you use parts of the code in this repository for your own research, please consider citing:

@article{chatzi2026tokenization,
  title={Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service},
  author={Ivi Chatzi and Nina Corvelo Benz and Stratis Tsirtsis and Manuel Gomez-Rodriguez},
  journal={arXiv preprint arXiv:2506.06446},
  year={2026}
}
