Releases: b4rtaz/distributed-llama
0.16.3
This version improves the reliability of dllama-api. The API now runs as a persistent service designed for continuous operation. If any worker crashes, the API automatically attempts to reconnect to the failed node and reinitialize the cluster, with the goal of returning the API to service within moments of any node failure.
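The release doesn't publish the recovery code itself, but the described behavior follows a familiar pattern: mark a worker's socket as dead on failure, retry the connection in a loop, and reinitialize cluster state once it's back. Below is a minimal C++ sketch of that pattern using POSIX sockets; `WorkerSlot`, `tryConnect`, and `reconnectLoop` are illustrative names, not the actual dllama-api internals:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>

struct WorkerSlot {
    std::string host;
    int port;
    int fd = -1; // -1 marks a disconnected worker
};

// Attempt a single TCP connection to a worker; returns -1 on failure.
static int tryConnect(const WorkerSlot& w) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(w.port);
    inet_pton(AF_INET, w.host.c_str(), &addr.sin_addr);
    if (connect(fd, (sockaddr*)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Keep retrying until the failed node is back; the caller can then
// reinitialize cluster state (weight slices, KV cache, etc.).
static void reconnectLoop(WorkerSlot& w) {
    while (w.fd < 0) {
        w.fd = tryConnect(w);
        if (w.fd < 0) {
            std::fprintf(stderr, "worker %s:%d down, retrying...\n",
                         w.host.c_str(), w.port);
            std::this_thread::sleep_for(std::chrono::milliseconds(500));
        }
    }
}
```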
0.16.2
0.16.1
0.16.0
0.15.4
0.15.3
This version fixes a precision issue in multiplication for Qwen models on NVIDIA GPUs. Additionally, it includes several Vulkan shader improvements that increase inference speed.
Prediction benchmark from #249 (CPU: Xeon® E5-2650 v4, mainboard: Z10PG-D24 Series, GPU: NVIDIA GeForce RTX 3060 12GB):
| Model | Tokens/s (0.15.0) | Tokens/s (0.15.2) | Tokens/s (0.15.3) |
|---|---|---|---|
| qwen3_8b_q40 | 12.9 | 13.65 | 16.86 |
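The notes don't detail the root cause of the precision issue, but a common source of this kind of error is accumulating a long dot product at too low a precision, so that small addends get rounded away. A minimal, standard-C++ sketch of the effect (the low-precision accumulator is emulated by truncation here; this is not the actual shader code):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Truncate a float to bfloat16 precision (7 mantissa bits) to emulate a
// low-precision accumulator.
static float truncateToBf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFF0000u;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}

int main() {
    const int n = 4096; // a hidden-dim sized dot product
    std::vector<float> a(n, 0.01f), b(n, 0.01f);

    float lowAcc = 0.0f;  // emulated low-precision accumulator
    float fp32Acc = 0.0f; // full fp32 accumulator
    for (int i = 0; i < n; i++) {
        lowAcc = truncateToBf16(lowAcc + a[i] * b[i]); // small addends get lost
        fp32Acc += a[i] * b[i];
    }
    // exact sum: 4096 * 0.0001 = 0.4096; the low-precision accumulator
    // stalls far below it once each addend falls under its rounding step
    std::printf("low-precision accumulator: %f\nfp32 accumulator: %f\n",
                lowAcc, fp32Acc);
}
```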
0.15.2
0.15.1
This version introduces a small optimization for Vulkan that reduces the number of bytes required to synchronize between the CPU and GPU during prediction.
Tested on NVIDIA GeForce RTX 3060 12GB (with --steps 128):
| Model | Tokens/s (previous version) | Tokens/s (0.15.1) |
|---|---|---|
| llama3_1_8b_instruct_q40 | 13.68 | 14.83 |
| qwen3_0.6b_q40 | 44.41 | 61.98 |
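The diff isn't included in these notes, but the shape of the optimization is familiar: during prediction only one token's worth of state changes per step, so only that slice needs to cross the CPU-GPU boundary. A hedged sketch in plain C++, with invented buffer names and sizes standing in for the Vulkan readback:

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

struct SyncStats { size_t bytes = 0; };

// Stand-in for a GPU->CPU readback: copy `size` floats starting at `offset`.
static void readback(const std::vector<float>& device, std::vector<float>& host,
                     size_t offset, size_t size, SyncStats& stats) {
    std::memcpy(host.data() + offset, device.data() + offset,
                size * sizeof(float));
    stats.bytes += size * sizeof(float);
}

int main() {
    const size_t seqLen = 128, dim = 4096; // assumed example dimensions
    std::vector<float> device(seqLen * dim, 1.0f), host(seqLen * dim, 0.0f);

    SyncStats full, partial;
    // Naive: sync the entire activation buffer on every predicted token.
    readback(device, host, 0, seqLen * dim, full);
    // Optimized: prediction appends one token, so sync only that row.
    size_t lastToken = seqLen - 1;
    readback(device, host, lastToken * dim, dim, partial);

    std::printf("full sync: %zu kB, partial sync: %zu kB\n",
                full.bytes / 1024, partial.bytes / 1024);
}
```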
0.15.0
This version fixes a memory alignment bug that caused incorrect inference results on NVIDIA GPUs (and likely on AMD GPUs as well). Distributed Llama now generates correct output both on a single node and in a distributed setup. I tested a configuration with four Tesla V100-PCIE-16GB GPUs connected to the same mainboard, with each Distributed Llama node using a different GPU; the model was Llama 3.3 70B Q40.
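The patch itself isn't reproduced in these notes, but the standard fix for this class of bug is to round sub-buffer offsets up to the device's required alignment before binding. A minimal sketch, assuming an example alignment of 256 bytes (a typical Vulkan `minStorageBufferOffsetAlignment` value):

```cpp
#include <cassert>
#include <cstdio>

// Round `offset` up to the next multiple of `alignment` (a power of two).
static size_t alignUp(size_t offset, size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

int main() {
    const size_t kAlignment = 256;          // assumed device requirement
    size_t sizes[] = {4160, 1000, 8192};    // arbitrary tensor byte sizes
    size_t cursor = 0;

    for (size_t s : sizes) {
        // Without alignUp, a tensor could start at a misaligned offset and
        // the GPU would read shifted (i.e. wrong) data.
        size_t offset = alignUp(cursor, kAlignment);
        assert(offset % kAlignment == 0);
        std::printf("tensor of %zu B bound at offset %zu\n", s, offset);
        cursor = offset + s;
    }
}
```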
4 × Tesla V100-PCIE-16GB logs:
🌋 Device: Tesla V100-PCIE-16GB
🌋 DeviceApiVersion: 1.3.242
🌋 MaxComputeSharedMemory: 48 kB
🌋 Heap[0]: 16384 MB
💿 Loading weights...
💿 Weights loaded
🚁 Network is in non-blocking mode
Hello world!
🔷️ Eval 622 ms Sync 191 ms | Sent 12528 kB Recv 13367 kB | (3 tokens)
🔶 Pred 317 ms Sync 34 ms | Sent 4176 kB Recv 4455 kB | I
🔶 Pred 214 ms Sync 36 ms | Sent 4176 kB Recv 4455 kB | am
🔶 Pred 225 ms Sync 17 ms | Sent 4176 kB Recv 4455 kB | a
🔶 Pred 219 ms Sync 42 ms | Sent 4176 kB Recv 4455 kB | new
🔶 Pred 215 ms Sync 38 ms | Sent 4176 kB Recv 4455 kB | staff
🔶 Pred 203 ms Sync 30 ms | Sent 4176 kB Recv 4455 kB | member
...
🔶 Pred 214 ms Sync 44 ms | Sent 4176 kB Recv 4455 kB | I
🔶 Pred 201 ms Sync 50 ms | Sent 4176 kB Recv 4455 kB | am
🔶 Pred 213 ms Sync 42 ms | Sent 4176 kB Recv 4455 kB | committed
🔶 Pred 210 ms Sync 49 ms | Sent 4176 kB Recv 4455 kB | to
🔶 Pred 228 ms Sync 42 ms | Sent 4176 kB Recv 4455 kB | using
Evaluation
nBatches: 32
nTokens: 3
tokens/s: 3.69 (271.09 ms/tok)
Prediction
nTokens: 125
tokens/s: 3.86 (258.96 ms/tok)
📀 RequiredMemory: 14378212 kB
⭕ Socket[0]: connecting to 127.0.0.1:9999 worker
⭕ Socket[0]: connected
⭕ Socket[1]: connecting to 127.0.0.1:9998 worker
⭕ Socket[1]: connected
⭕ Socket[2]: connecting to 127.0.0.1:9997 worker
⭕ Socket[2]: connected
⭕ Network is initialized
🌋 Device: Tesla V100-PCIE-16GB
🌋 DeviceApiVersion: 1.3.242
🌋 MaxComputeSharedMemory: 48 kB
🌋 Heap[0]: 16384 MB
💿 Loading weights...
💿 Weights loaded
🚁 Network is in non-blocking mode
⭐ Chat template: llama3
🛑 Stop: <|end_of_text|>
🛑 Stop: <|eot_id|>
💻 System prompt (optional):
👱 User
> hello? where is Poland?
🤖 Assistant
Hello! Poland is a country located in Central Europe. It is bordered by:
* Germany to the west
* Czech Republic and Slovakia to the south
* Ukraine and Belarus to the east
* Russia (Kaliningrad Oblast) and Lithuania to the northeast
* Baltic Sea to the north
Poland is a member of the European Union and has a population of around 38 million people. The country has a rich history, beautiful landscapes, and a vibrant culture. Is there anything specific you would like to know about Poland?
👱 User
>
It’s also worth noting that the inference run utilized all four GPUs at all times.
