An inference server for open-source LLMs running on TensorRT-LLM.
- Clone this repo inside `TensorRT-LLM`.
- Edit `docker/Makefile` and `docker/Dockerfile.multi` to expose port 8000 (see the sketch after this list).
- Build `TensorRT-LLM` from source. Follow the instructions here.
- Run `sudo make -C docker release_run` to enter the container.
- Once inside the container, run `pip3 install fastapi uvicorn` and `export NCCL_P2P_DISABLE=1`.
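The exact edits depend on the TensorRT-LLM version you are building, but the idea is to publish port 8000 from the release image and map it when the container is started. The lines below are a sketch: the `EXPOSE` directive goes in the final stage of `docker/Dockerfile.multi`, and the port mapping goes wherever `docker/Makefile` assembles its `docker run` flags (the variable name shown is illustrative, not necessarily the Makefile's real one).

```
# docker/Dockerfile.multi (final/release stage): publish the API port
EXPOSE 8000

# docker/Makefile: add a host-to-container port mapping to the docker run flags
# (illustrative variable name -- adapt to how the Makefile actually builds the command)
DOCKER_RUN_OPTS += -p 8000:8000
```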
The official Llama 2 weights are distributed as raw PyTorch checkpoints. To use them with TensorRT-LLM, they first need to be converted to HuggingFace format. For example, to convert the llama-2-7b-chat weights to llama-2-7b-chat-hf, run the following command:
```
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/llama-2-7b-chat \
    --model_size 7B \
    --output_dir /llama-2-7b-chat-hf
```

More examples can be found on the TensorRT-LLM Llama examples page.
Next, build the TensorRT engine. For example, to build CodeLlama-34b-Instruct in fp16 with 4-way tensor parallelism:

```
python build.py \
    --model_dir CodeLlama-34b-Instruct-hf/ \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_rmsnorm_plugin float16 \
    --enable_context_fmha \
    --use_parallel_embedding \
    --rotary_base 1000000 \
    --vocab_size 32000 \
    --parallel_build \
    --paged_kv_cache \
    --use_inflight_batching \
    --output_dir ./tmp/codellama/34B-Instruct/trt_engines/fp16/4-gpu/ \
    --world_size 4 \
    --tp_size 4 \
    --pp_size 1 \
    --max_input_len 8192 \
    --max_output_len 8192 \
    --max_batch_size 64
```

To test the build, run `inference.py` (a sketch of the invocation follows).
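Because the engine above is built with `--tp_size 4`, the test script also has to be launched across four MPI ranks. The exact arguments `inference.py` takes depend on this repo, so the command below is only a sketch of the launch pattern.

```
# Sketch: smoke-test the 4-GPU engine. Check inference.py for the arguments it
# actually expects (engine directory, tokenizer path, prompt, ...).
mpirun -n 4 --allow-run-as-root python inference.py
```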
Start the server from inside the TensorRT-LLM container. It will listen on port 8000.
```
mpirun -n 4 --allow-run-as-root python main.py
```
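Once the server is up you can query it over HTTP. The `/generate` route and JSON fields below are assumptions for illustration; check `main.py` for the actual FastAPI endpoint and request schema.

```
# Hypothetical request -- the route and fields are assumptions; see main.py for the
# real endpoint and schema.
curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a Python function that reverses a string.", "max_tokens": 256}'
```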