inference-server

An inference server for open-source LLMs running on TensorRT-LLM.

Setup

  1. Clone this repo inside the TensorRT-LLM repository.
  2. Edit docker/Makefile and docker/Dockerfile.multi to expose port 8000 (see the sketch after this list).
  3. Build TensorRT-LLM from source, following the official TensorRT-LLM build instructions.
  4. Run sudo make -C docker release_run to enter the container.
  5. Once inside the container, run pip3 install fastapi uvicorn and export NCCL_P2P_DISABLE=1 (this disables NCCL peer-to-peer transfers, which can hang on some multi-GPU hosts).
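
A minimal sketch of the port change from step 2. The exact mechanism depends on the TensorRT-LLM version you have checked out; in particular, DOCKER_RUN_OPTS is an assumption about how docker/Makefile collects docker run flags for the release_run target, so adapt the variable name to your checkout.

# docker/Dockerfile.multi -- declare that the server listens on port 8000
EXPOSE 8000

# docker/Makefile -- publish the port when release_run starts the container
# (assumption: run flags are collected in a variable like DOCKER_RUN_OPTS)
DOCKER_RUN_OPTS += -p 8000:8000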

Converting raw model weights to HuggingFace format

Raw Llama checkpoints (as distributed by Meta) are not in HuggingFace format, while the TensorRT-LLM build scripts expect HuggingFace-format weights, so the raw checkpoints must be converted first. For example, to convert the llama-2-7b-chat weights to llama-2-7b-chat-hf, run the conversion script that ships with the transformers repository:

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
  --input_dir /path/to/llama-2-7b-chat \
  --model_size 7B \
  --output_dir /llama-2-7b-chat-hf

Prepare TensorRT-LLM engines

The command below uses the build.py script from TensorRT-LLM's Llama example to build FP16 engines for CodeLlama-34b-Instruct-hf with 4-way tensor parallelism. More configurations are documented on the TensorRT-LLM Llama examples page.

python build.py \
  --model_dir CodeLlama-34b-Instruct-hf/ \
  --dtype float16 \
  --remove_input_padding \
  --use_gpt_attention_plugin float16 \
  --use_gemm_plugin float16 \
  --use_rmsnorm_plugin float16 \
  --enable_context_fmha \
  --use_parallel_embedding \
  --rotary_base 1000000 \
  --vocab_size 32000 \
  --parallel_build \
  --paged_kv_cache \
  --use_inflight_batching \
  --output_dir ./tmp/codellama/34B-Instruct/trt_engines/fp16/4-gpu/ \
  --world_size 4 \
  --tp_size 4 \
  --pp_size 1 \
  --max_input_len 8192 \
  --max_output_len 8192 \
  --max_batch_size 64

To test the build, run inference.py.
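
Because the engine above was built with tp_size 4, inference.py must be launched with one MPI rank per GPU, just like the server. A sketch only; the script's actual command-line arguments (engine directory, tokenizer path, and so on) are whatever inference.py in this repo expects:

mpirun -n 4 --allow-run-as-root python inference.py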

Start the server!

Start the server from inside the TensorRT-LLM container. Because the engine is 4-way tensor parallel, the server is launched under mpirun with one rank per GPU; once up, it listens on port 8000.

mpirun -n 4 --allow-run-as-root python main.py
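
Once the server is up, it can be exercised over HTTP. The request below is a sketch only: the /generate endpoint and the prompt / max_new_tokens fields are assumptions about what main.py exposes, not a documented API; check main.py for the actual routes and request schema.

curl http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a Python function that reverses a string.", "max_new_tokens": 128}'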
