Add support for running vLLM #799
Open: amaslenn wants to merge 46 commits into main from am/vllm.
Commits (46, all by amaslenn):

- f699535 Add mock vLLM workload
- 92755ce Add get_vllm_serve_command()
- 2980e23 Add generate_serve_run_and_wait_block()
- 4bdf09c Initial implementation for _gen_srun_command()
- 2df1194 Add acceptance case
- 90b8eaa Control flow from the sbatch
- 8677a02 Add bench runs
- 3f513fa Redirect outputs and use --overlap
- ea48456 Update to use docker cache
- e9080fd Log steps and less output files
- 1390886 Check if run successfull
- 3544898 Fix tests
- 573a35f Prepare for disagg
- e57bc77 vLLM disagg mode
- 812c1c5 Add quotation
- 07074d9 Add wa for conflicting VLLM_NIXL_SIDE_CHANNEL_PORT
- 0ce7b01 Fix port offset value
- cf8cbe1 Use --export for per-run env vars
- a55a969 Support extra args
- ac7e603 More info on cleanup
- 6052bd7 Correct list of devices as arg
- f25e41e Better env vars handling
- f1ec11d Better check for success
- 11b00ed Update port offset logic
- dcee8ec Use ip instead of localhost
- f85086f Configure a git repo for proxy script
- ce78d85 Override prefill/decode gpu list
- bc09add Updates
- 29271b0 Configure output files for bench, allow any options
- 271df0f Add reporting for vLLM
- 1940f05 More human readable column
- b501ceb Control decode/prefill separately
- a353c06 Fix typo
- 1e85881 Install HF model and mount into container
- 34ad553 Merge branch 'main' into am/vllm
- a5bb85a Add doc for vLLM
- 4fd089f Fix copyright year
- 6c19287 Update doc/workloads/vllm.rst
- f2ef121 Update src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
- eafb045 Update tests/job_status_retrieval_strategy/test_vllm_job_status_retri…
- a81447e Make parse_vllm_bench_output safer
- ad9e6b5 Address review comments
- cd8c093 Update doc/workloads/vllm.rst
- d05b7c7 Address review comments
- 8c71afb Address review comments
- 488d1ee Address review comments
doc/workloads/vllm.rst (new file, +154 lines)
vLLM
====

This workload (``test_template_name`` is ``vllm``) allows users to execute vLLM benchmarks within the CloudAI framework.

vLLM is a high-throughput and memory-efficient inference engine for LLMs. This workload supports both aggregated and disaggregated prefill/decode modes.

Usage Examples
--------------

Test + Scenario example
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: toml
   :caption: test.toml (test definition)

   name = "vllm_test"
   description = "Example vLLM test"
   test_template_name = "vllm"

   [cmd_args]
   docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
   model = "Qwen/Qwen3-0.6B"

   [bench_cmd_args]
   random_input_len = 16
   random_output_len = 128
   max_concurrency = 16
   num_prompts = 30

.. code-block:: toml
   :caption: scenario.toml (scenario with one test)

   name = "vllm-benchmark"

   [[Tests]]
   id = "vllm.1"
   num_nodes = 1
   time_limit = "00:10:00"
   test_name = "vllm_test"

Test-in-Scenario example
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: toml
   :caption: scenario.toml (separate test toml is not needed)

   name = "vllm-benchmark"

   [[Tests]]
   id = "vllm.1"
   num_nodes = 1
   time_limit = "00:10:00"
   name = "vllm_test"
   description = "Example vLLM test"
   test_template_name = "vllm"

   [Tests.cmd_args]
   docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
   model = "Qwen/Qwen3-0.6B"

   [Tests.bench_cmd_args]
   random_input_len = 16
   random_output_len = 128
   max_concurrency = 16
   num_prompts = 30
Control number of GPUs
----------------------

The number of GPUs can be controlled using the options below, listed from lowest to highest priority (a sketch follows the list):

1. ``gpus_per_node`` system property (scalar value)
2. ``CUDA_VISIBLE_DEVICES`` environment variable (comma-separated list of GPU IDs)
3. ``gpu_ids`` command argument for ``prefill`` and ``decode`` configurations (comma-separated list of GPU IDs)
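For illustration, a minimal sketch combining the two highest-priority mechanisms; the values are hypothetical, and when both are set the per-role ``gpu_ids`` takes precedence:

.. code-block:: toml
   :caption: test.toml (GPU selection sketch, hypothetical values)

   [extra_env_vars]
   # Priority 2: restricts the run to these four GPUs
   CUDA_VISIBLE_DEVICES = "0,1,2,3"

   [cmd_args.prefill]
   # Priority 3: overrides CUDA_VISIBLE_DEVICES for the prefill instance
   gpu_ids = "0,1"

   [cmd_args.decode]
   gpu_ids = "2,3"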
Control disaggregation
----------------------

By default, vLLM runs without disaggregation as a single process. To enable disaggregation, set the ``prefill`` configuration:

.. code-block:: toml
   :caption: test.toml (disaggregated prefill/decode)

   [cmd_args]
   docker_image_url = "nvcr.io#nvidia/ai-dynamo/vllm-runtime:0.7.0"
   model = "Qwen/Qwen3-0.6B"

   [cmd_args.prefill]

   [extra_env_vars]
   CUDA_VISIBLE_DEVICES = "0,1,2,3"

The config above automatically splits the GPUs specified in ``CUDA_VISIBLE_DEVICES`` into two halves: the first half is used for prefill and the second for decode (here, GPUs 0,1 for prefill and 2,3 for decode).
For more control, one can specify the GPU IDs explicitly in the ``prefill`` and ``decode`` configurations:

.. code-block:: toml
   :caption: test.toml (disaggregated prefill/decode)

   [cmd_args.prefill]
   gpu_ids = "0,1"

   [cmd_args.decode]
   gpu_ids = "2,3"

In this case ``CUDA_VISIBLE_DEVICES`` will be ignored and only the GPUs specified in ``gpu_ids`` will be used.
Control ``proxy_script``
------------------------

``proxy_script`` is used to proxy requests from the client to the prefill and decode instances. It is ignored in non-disaggregated mode. The default value can be found below.

It can be overridden by setting ``proxy_script``, for example to use the latest version of the script from the vLLM repository:

.. code-block:: toml
   :caption: test_scenario.toml (override proxy_script)

   [[Tests.git_repos]]
   url = "https://github.com/vllm-project/vllm.git"
   commit = "main"
   mount_as = "/vllm_repo"

   [Tests.cmd_args]
   docker_image_url = "vllm/vllm-openai:v0.14.0-cu130"
   proxy_script = "/vllm_repo/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py"

In this case the proxy script will be mounted from the locally cloned vLLM repository as ``/vllm_repo`` and used for the test.
API Documentation
-----------------

Command Arguments
~~~~~~~~~~~~~~~~~

.. autoclass:: cloudai.workloads.vllm.vllm.VllmCmdArgs
   :members:
   :show-inheritance:

Benchmark Command Arguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: cloudai.workloads.vllm.vllm.VllmBenchCmdArgs
   :members:
   :show-inheritance:

Test Definition
~~~~~~~~~~~~~~~

.. autoclass:: cloudai.workloads.vllm.vllm.VllmTestDefinition
   :members:
   :show-inheritance:
src/cloudai/workloads/vllm/__init__.py (new file, +29 lines)
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .report_generation_strategy import VLLMBenchReportGenerationStrategy
from .slurm_command_gen_strategy import VllmSlurmCommandGenStrategy
from .vllm import VLLM_BENCH_LOG_FILE, VllmArgs, VllmBenchCmdArgs, VllmCmdArgs, VllmTestDefinition

__all__ = [
    "VLLM_BENCH_LOG_FILE",
    "VLLMBenchReportGenerationStrategy",
    "VllmArgs",
    "VllmBenchCmdArgs",
    "VllmCmdArgs",
    "VllmSlurmCommandGenStrategy",
    "VllmTestDefinition",
]
src/cloudai/workloads/vllm/report_generation_strategy.py (new file, +91 lines)
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import logging
from functools import cache
from pathlib import Path

from pydantic import BaseModel, ConfigDict
from rich.console import Console
from rich.table import Table

from cloudai.core import ReportGenerationStrategy

from .vllm import VLLM_BENCH_JSON_FILE


class VLLMBenchReport(BaseModel):
    """Report for vLLM benchmark results."""

    model_config = ConfigDict(extra="ignore")

    num_prompts: int
    completed: int
    mean_ttft_ms: float
    median_ttft_ms: float
    p99_ttft_ms: float
    mean_tpot_ms: float
    median_tpot_ms: float
    p99_tpot_ms: float


@cache
def parse_vllm_bench_output(res_file: Path) -> VLLMBenchReport | None:
    """Parse the vLLM benchmark output file and return a VLLMBenchReport object."""
    if not res_file.is_file():
        return None

    try:
        with res_file.open("r") as f:
            data = json.load(f)
        return VLLMBenchReport.model_validate(data)
    except Exception as e:
        logging.debug(f"Error parsing vLLM benchmark output: {e}")
        return None


class VLLMBenchReportGenerationStrategy(ReportGenerationStrategy):
    """Generate a report for vLLM benchmark results."""

    def can_handle_directory(self) -> bool:
        return parse_vllm_bench_output(self.test_run.output_path / VLLM_BENCH_JSON_FILE) is not None

    def generate_report(self) -> None:
        results = parse_vllm_bench_output(self.test_run.output_path / VLLM_BENCH_JSON_FILE)
        if results is None:
            return

        console = Console()
        table = Table(title=f"vLLM Benchmark Results ({self.test_run.output_path})", title_justify="left")
        table.add_column("Successful prompts", justify="right")
        table.add_column("TTFT Mean, ms", justify="right")
        table.add_column("TTFT Median, ms", justify="right")
        table.add_column("TTFT P99, ms", justify="right")
        table.add_column("TPOT Mean, ms", justify="right")
        table.add_column("TPOT Median, ms", justify="right")
        table.add_column("TPOT P99, ms", justify="right")
        table.add_row(
            f"{results.completed / results.num_prompts * 100:.2f}% ({results.completed} of {results.num_prompts})",
            f"{results.mean_ttft_ms:.4f}",
            f"{results.median_ttft_ms:.4f}",
            f"{results.p99_ttft_ms:.4f}",
            f"{results.mean_tpot_ms:.4f}",
            f"{results.median_tpot_ms:.4f}",
            f"{results.p99_tpot_ms:.4f}",
        )

        console.print(table)
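A minimal sketch (not part of this diff) exercising parse_vllm_bench_output against a hand-written results file. The keys mirror the VLLMBenchReport model above, all metric values are made up, and the extra key shows extra="ignore" at work:

import json
from pathlib import Path

from cloudai.workloads.vllm.report_generation_strategy import parse_vllm_bench_output

# Stand-in for the benchmark results JSON (values are illustrative).
sample = {
    "num_prompts": 30,
    "completed": 29,
    "mean_ttft_ms": 12.3,
    "median_ttft_ms": 11.8,
    "p99_ttft_ms": 25.0,
    "mean_tpot_ms": 3.1,
    "median_tpot_ms": 3.0,
    "p99_tpot_ms": 6.2,
    "request_rate": "inf",  # unknown key, dropped by extra="ignore"
}

res_file = Path("/tmp/results.json")
res_file.write_text(json.dumps(sample))

report = parse_vllm_bench_output(res_file)
assert report is not None and report.completed == 29

# Caveat: parse_vllm_bench_output is wrapped in @cache, so a second call with
# the same path returns the first result even if the file changed on disk.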