inference-server

An inference server for open-source LLMs running on TensorRT-LLM.

Setup

  1. Clone this repo inside the TensorRT-LLM repository.
  2. Edit docker/Makefile and docker/Dockerfile.multi to expose port 8000 (see the sketch after this list).
  3. Build TensorRT-LLM from source, following the official TensorRT-LLM build instructions.
  4. Run sudo make -C docker release_run to enter the container.
  5. Once inside the container, run pip3 install fastapi uvicorn and export NCCL_P2P_DISABLE=1 (this disables NCCL peer-to-peer transfers, which can hang on some multi-GPU hosts).
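
A minimal sketch of the port change from step 2. The exact mechanism depends on the TensorRT-LLM version you have checked out; in particular, DOCKER_RUN_OPTS is an assumption about how docker/Makefile collects docker run flags for the release_run target, so adapt the variable name to your checkout.

# docker/Dockerfile.multi -- declare that the server listens on port 8000
EXPOSE 8000

# docker/Makefile -- publish the port when release_run starts the container
# (assumption: run flags are collected in a variable like DOCKER_RUN_OPTS)
DOCKER_RUN_OPTS += -p 8000:8000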

Converting raw model weights to HuggingFace format

Raw Llama checkpoints (as distributed by Meta) are not in HuggingFace format, while the TensorRT-LLM build scripts expect HuggingFace-format weights, so the raw checkpoints must be converted first. For example, to convert the llama-2-7b-chat weights to llama-2-7b-chat-hf, run the conversion script that ships with the transformers repository:

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
  --input_dir /path/to/llama-2-7b-chat \
  --model_size 7B \
  --output_dir /llama-2-7b-chat-hf

Prepare TensorRT-LLM engines

The command below uses the build.py script from TensorRT-LLM's Llama example to build FP16 engines for CodeLlama-34b-Instruct-hf with 4-way tensor parallelism. More configurations are documented on the TensorRT-LLM Llama examples page.

python build.py \
  --model_dir CodeLlama-34b-Instruct-hf/ \
  --dtype float16 \
  --remove_input_padding \
  --use_gpt_attention_plugin float16 \
  --use_gemm_plugin float16 \
  --use_rmsnorm_plugin float16 \
  --enable_context_fmha \
  --use_parallel_embedding \
  --rotary_base 1000000 \
  --vocab_size 32000 \
  --parallel_build \
  --paged_kv_cache \
  --use_inflight_batching \
  --output_dir ./tmp/codellama/34B-Instruct/trt_engines/fp16/4-gpu/ \
  --world_size 4 \
  --tp_size 4 \
  --pp_size 1 \
  --max_input_len 8192 \
  --max_output_len 8192 \
  --max_batch_size 64

To test the build, run inference.py.
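
Because the engine above was built with tp_size 4, inference.py must be launched with one MPI rank per GPU, just like the server. A sketch only; the script's actual command-line arguments (engine directory, tokenizer path, and so on) are whatever inference.py in this repo expects:

mpirun -n 4 --allow-run-as-root python inference.py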

Start the server!

Start the server from inside the TensorRT-LLM container. Because the engine is 4-way tensor parallel, the server is launched under mpirun with one rank per GPU; once up, it listens on port 8000.

mpirun -n 4 --allow-run-as-root python main.py
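
Once the server is up, it can be exercised over HTTP. The request below is a sketch only: the /generate endpoint and the prompt / max_new_tokens fields are assumptions about what main.py exposes, not a documented API; check main.py for the actual routes and request schema.

curl http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a Python function that reverses a string.", "max_new_tokens": 128}'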
