Infrascale is a tool developed by the Albert API team to estimate the GPU requirements for LLM inference at scale. It produces estimates within a 10% error margin for models between 8b and 27b and for node sizes below 8.
While working within the French public administration, we realized that nearly every organization faces the same question: how many GPUs are needed to reliably serve a given number of users? Operating the official GenAI API for the government, we confronted this challenge firsthand. To plan for the near future—scaling to 2.5 million public agents—we needed a way to benchmark and compare different approaches (e.g., commercial services versus self-hosted open-weight models). InfraScale was created to provide a reproducible, open, and practical methodology for estimating GPU requirements at scale.
- Create and activate a Python virtual environment
- Install the required dependencies with `pip install -r requirements.txt`
- Run the app with `streamlit run app.py`
We want to find the smallest number of GPUs that can serve inference while meeting the targets set by the user (number of concurrent users, GPU type, inference model, target throughput, target wait time, etc.).
We define the following variables:
- Let $S \in \mathbb N_+^*$ be the number of GPUs in a given node, i.e. the number of GPUs that a single copy of the model is loaded onto, corresponding to model parallelism.
- Let $N \in \mathbb N_+^*$ be the number of nodes, i.e. the number of parallel copies of the model that are loaded, corresponding to data parallelism.
- Let $B \in \mathbb N_+$ be such that the maximum request batch size is equal to $2^B$.
- Let $Q \in \mathbb R_+$ be the expected ratio of concurrent requests to batch size for any node at any given time. $Q \leq 1$ means that requests are processed immediately on average. $Q > 1$ means that requests wait for previous requests to be processed; this is sustainable without the queue expanding dramatically as long as, on average, $Q \times batch\textunderscore size$ requests can be processed faster than the time it takes for the same number of requests to be sent by users, which is implied by our throughput constraint.
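For intuition, the way these variables combine can be sketched as follows (a minimal illustration, not code from the repository):

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    S: int    # GPUs per node (one model copy spans S GPUs: model parallelism)
    N: int    # number of nodes (parallel model copies: data parallelism)
    B: int    # max request batch size per node is 2**B
    Q: float  # expected ratio of concurrent requests to batch size per node

    @property
    def total_gpus(self) -> int:
        # The quantity we ultimately want to minimize.
        return self.S * self.N

    @property
    def max_batch_size(self) -> int:
        return 2 ** self.B

    @property
    def expected_requests_per_node(self) -> float:
        # Q <= 1: requests are processed immediately on average; Q > 1: some queuing.
        return self.Q * self.max_batch_size
```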
Given the above variables, we lay out four constraints:
The throughput in tokens per second for any given request should be higher than the target set by the user. Currently, we use a theoretical model (see `constraints.py` and `calibration.ipynb`) of the throughput for one request on a single GPU and approximate its relation with node size $S$.
The whole system should be able to serve the expected number of concurrent users. We currently consider a fixed $reqs\textunderscore per\textunderscore user$ ratio to convert the number of concurrent users into a number of concurrent requests.
Wait time before the beginning of an answer should be lower than the target set by the user.
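For intuition, such feasibility conditions can be written as non-negative "slack" functions, as in the simplified sketch below; the names and formulas are illustrative placeholders, not the actual implementation in `constraints.py`.

```python
# Illustrative "slack" functions: a configuration satisfies a constraint
# when the returned value is >= 0.

def throughput_slack(tokens_per_s: float, target_tokens_per_s: float) -> float:
    # Per-request decode throughput must reach the user's target.
    return tokens_per_s - target_tokens_per_s

def concurrency_slack(N: int, B: int, Q: float,
                      concurrent_users: int, reqs_per_user: float) -> float:
    # N nodes, each expected to handle about Q * 2**B requests at a time,
    # must cover the expected demand.
    return N * Q * 2 ** B - concurrent_users * reqs_per_user

def wait_slack(expected_wait_s: float, target_wait_s: float) -> float:
    # Time before the first token of the answer must stay below the target.
    return target_wait_s - expected_wait_s

# Example: 2 nodes with max batch size 2**5 = 32 and Q = 1 cover about
# 64 concurrent requests.
print(concurrency_slack(N=2, B=5, Q=1.0, concurrent_users=1000, reqs_per_user=0.05))
# -> 14.0 (positive, so the concurrency constraint is satisfied)
```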
We solve the resulting constrained optimization problem (minimizing the total number of GPUs, $S \times N$, subject to the constraints above) with the Nelder-Mead algorithm.
We don't look for an analytical solution, as we expect to make updates to our theoretical model and to this optimization problem in the near future.
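As an illustration of the overall approach (penalizing violated constraints and minimizing $S \times N$ with a derivative-free method), here is a self-contained toy example using SciPy's Nelder-Mead; the throughput and wait models, the numbers and the variable handling are invented for this sketch and do not reflect `solver.py`.

```python
from scipy.optimize import minimize

# Toy stand-ins for the theoretical throughput and wait models; the real,
# calibrated formulas live in constraints.py.
def per_request_tokens_per_s(S: int, B: int, Q: float) -> float:
    base = 60.0  # assumed tokens/s for a single request on a single GPU
    return base * (1 + 0.4 * (S - 1)) / (1 + 0.1 * Q * 2 ** B)

def expected_wait_s(Q: float) -> float:
    return 0.2 + 0.5 * max(Q - 1.0, 0.0)  # queuing only appears once Q > 1

def objective(x, users=1000, reqs_per_user=0.05, target_tps=15.0, target_wait=2.0):
    # Nelder-Mead is derivative-free, so integer variables are rounded
    # and clamped inside the objective.
    S = min(max(1, int(round(x[0]))), 8)
    N = max(1, int(round(x[1])))
    B = max(0, int(round(x[2])))
    Q = max(float(x[3]), 1e-3)

    slacks = [
        per_request_tokens_per_s(S, B, Q) - target_tps,  # throughput target
        N * Q * 2 ** B - users * reqs_per_user,          # enough concurrent capacity
        target_wait - expected_wait_s(Q),                # wait-time target
    ]
    penalty = sum(1e3 * max(0.0, -s) for s in slacks)    # punish violated constraints
    return S * N + penalty                               # total GPUs to minimize

res = minimize(objective, x0=[1.0, 2.0, 4.0, 1.0], method="Nelder-Mead")
S, N, B, Q = res.x
print(f"S~{S:.0f}, N~{N:.0f}, B~{B:.0f}, Q~{Q:.2f}")
```

Because the rounded variables make this toy objective piecewise constant, Nelder-Mead can stall on it; the snippet only illustrates the structure of the approach (penalized constraints, $S \times N$ as the objective), not a tuned solver.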
Our framework has been tested on models from 1b to 27b in the Llama, Mistral, Qwen and Gemma families, on 1 H100 GPU and on 2 tensor-parallel H100 GPUs with PCIe interconnects (see `calibration.ipynb`, although it does not cover every test that has been made). Some tests have also been made on 1 A100 GPU for mistral-small-24b and llama-3.1-8b. We implicitly make the following hypotheses:
- Requests are questions with a 20/80 prefill-to-decode token ratio. Tests haven't been made with summarization or coding requests, which have a much higher prefill-to-decode ratio.
- GPUs within the same node are linked with SXM interconnects.
- The default parameters of vLLM (or the suggested parameters for Mistral and Qwen models) are used.
- Run empirical tests on bigger node sizes (4 and 8), bigger models, varying interconnect bandwidths and different request types.
- Find a more robust approximation of the relation between throughput, $S$, $B$ and $Q$ (related functions are `calculate_scaling_with_S` and `calculate_scaling_with_B_and_Q` inside `constraints.py`).
- Estimate $reqs\textunderscore per\textunderscore user$.
- Take GPU interconnects into account for the throughput and wait constraints.
The list of available GPUs and models in Infrascale can be found in `db/gpu.json` and `db/models.json`. New GPUs/models can be added by appending them at the bottom of these JSON files, following the current data format.
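After appending an entry, a quick way to check that the files still parse is a snippet like the following (an illustrative helper, not part of Infrascale):

```python
import json
from pathlib import Path

# Re-parse the db files after editing them; a typo (e.g. a trailing comma)
# will raise a JSONDecodeError here before it breaks the app.
for db_file in (Path("db/gpu.json"), Path("db/models.json")):
    data = json.loads(db_file.read_text())
    print(f"{db_file}: {len(data)} entries")
```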
Locales are set in the `translations` folder. When adding a new language, don't forget to update the `LANGUAGES` dictionary in `language.py`.
The underlying logic can be found in `solver.py` and `constraints.py`. The notebooks `solver.ipynb` and `calibration/calibration.ipynb` may be useful to understand the logic and how it was defined. A comprehensive article is currently being written to further explain how Infrascale works.
Thanks to the authors Jules Pondard, Théo Lartigau, Cyril Lay and Rémy Siahaan--Gensollen.