Releases: b4rtaz/distributed-llama

0.16.3

26 Oct 11:05
96c661e

This version improves the reliability of dllama-api. The API now runs as a persistent service designed for continuous operation. If any worker crashes, the API automatically attempts to reconnect to the failed node and reinitialize the cluster, so the API returns to service within moments of any node failure.
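
The release notes don't include the implementation, but the core of the behavior is a retry loop on the root node. A minimal sketch of the idea, assuming plain TCP sockets and hypothetical names (reconnectWorker is illustrative, not the actual dllama-api code):

```cpp
// Minimal sketch of the reconnect idea, with hypothetical names; the real
// dllama-api logic lives in the repository and differs in detail.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <thread>

// Blocks until a TCP connection to the crashed worker can be re-established,
// then returns the new socket; the caller reinitializes the cluster with it.
static int reconnectWorker(const char* host, int port) {
    for (;;) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd >= 0) {
            sockaddr_in addr{};
            addr.sin_family = AF_INET;
            addr.sin_port = htons(port);
            inet_pton(AF_INET, host, &addr.sin_addr);
            if (connect(fd, (sockaddr*)&addr, sizeof(addr)) == 0)
                return fd;
            close(fd);
        }
        std::fprintf(stderr, "Worker %s:%d unreachable, retrying...\n", host, port);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```

In the real service this would presumably run once per failed worker, followed by the same cluster-initialization path used at startup.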

0.16.2

20 Sep 13:58

Fixed Vulkan support on Raspberry Pi 5. Distributed Llama now runs with Vulkan on this board, though it is slower than CPU-only execution (#259).

0.16.1

16 Sep 21:07
649649f

This version adds support for Qwen3 MoE models on Vulkan.

0.16.0

05 Sep 17:18
5f5adaf

This version adds support for Qwen3 MoE models on CPU. Vulkan support will be added in a future release.

The performance of MoE models is quite impressive: Qwen3-30B-A3B-Q40 achieves 13.04 tok/s during prediction on 4× Raspberry Pi 5 (8GB). Check details here.

0.15.4

20 Aug 17:14
b9ec995

This version brings another speedup in Vulkan inference.

Prediction (--steps 128)

RTX 3090 24GB, AMD EPYC 7313 16-Core Processor (#252)

| Model | Tokens/s (0.15.1) | Tokens/s (0.15.2) | Tokens/s (0.15.3) | Tokens/s (0.15.4, this version) |
|---|---|---|---|---|
| llama3_1_8b_instruct_q40 | 24.80 | 24.80 | 33.32 | 45.33 🚀 |

0.15.3

17 Aug 10:24
8909825

This version fixes a precision issue in multiplication for Qwen models on NVIDIA GPUs. Additionally, it includes several Vulkan shader improvements that increase inference speed.

Prediction

CPU: Xeon® E5-2650 v4, Mainboard: Z10PG-D24 Series, GPU: NVIDIA GeForce RTX 3060 12GB (#249)

| Model | Tokens/s (0.15.0) | Tokens/s (0.15.2) | Tokens/s (0.15.3) |
|---|---|---|---|
| qwen3_8b_q40 | 12.9 | 13.65 | 16.86 |

0.15.2

13 Aug 23:03
eda0684

This version brings another small improvement for Vulkan.

Tested on NVIDIA GeForce RTX 3060 12GB (prediction, with --steps 128) (#247):

| Model | Tokens/s (0.15.1) | Tokens/s (0.15.2) |
|---|---|---|
| llama3_1_8b_instruct_q40 | 14.87 | 16.01 |

0.15.1

12 Aug 21:29
01305c9

This version introduces a small Vulkan optimization that reduces the number of bytes that must be synchronized between the CPU and GPU during prediction.

Tested on NVIDIA GeForce RTX 3060 12GB (with --steps 128):

| Model | Tokens/s (0.15.0) | Tokens/s (0.15.1) |
|---|---|---|
| llama3_1_8b_instruct_q40 | 13.68 | 14.83 |
| qwen3_0.6b_q40 | 44.41 | 61.98 |
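
The notes don't spell out the mechanism, but a standard way to cut CPU-GPU sync traffic in Vulkan is to flush or invalidate only the byte range that actually changed rather than the whole mapped allocation. A hedged sketch of that generic technique (not necessarily the exact change in this release; assumes a persistently mapped, non-coherent buffer, and flushDirtyRange is an illustrative name):

```cpp
// Sketch only: synchronize just the dirty sub-range of a persistently mapped
// Vulkan buffer instead of the whole allocation. This illustrates the generic
// technique for reducing CPU->GPU sync bytes, not the exact patch.
#include <vulkan/vulkan.h>

void flushDirtyRange(VkDevice device, VkDeviceMemory memory,
                     VkDeviceSize dirtyOffset, VkDeviceSize dirtySize,
                     VkDeviceSize atomSize /* VkPhysicalDeviceLimits::nonCoherentAtomSize */) {
    // Vulkan requires flush offsets/sizes to be aligned to nonCoherentAtomSize,
    // so round the dirty range outward (it must also stay within the allocation).
    VkDeviceSize begin = (dirtyOffset / atomSize) * atomSize;
    VkDeviceSize end = ((dirtyOffset + dirtySize + atomSize - 1) / atomSize) * atomSize;
    VkMappedMemoryRange range{};
    range.sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
    range.memory = memory;
    range.offset = begin;
    range.size = end - begin;
    vkFlushMappedMemoryRanges(device, 1, &range); // make CPU writes visible to the GPU
}
```

Flushing a few kilobytes per token instead of a whole multi-megabyte buffer is exactly the kind of change that shows up as the tok/s gains in the table above.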

0.15.0

12 Aug 14:11
33189ef

This version fixes a memory alignment bug that caused incorrect inference results on NVIDIA GPUs (and likely on AMD GPUs as well). Distributed Llama now generates correct output both on a single node and in a distributed setup. I tested a configuration of four Tesla V100-PCIE-16GB GPUs connected to the same mainboard, with each Distributed Llama node using a different GPU; the model was Llama 3.3 70B Q40.
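
The notes don't state the exact root cause beyond memory alignment, but fixes in this class usually come down to allocating host buffers at an explicit alignment instead of relying on malloc's default. A hedged illustration of the pattern (the 256-byte value and allocAligned are assumptions, not the actual patch):

```cpp
// Illustration of the bug class, not the actual patch: an under-aligned host
// buffer can silently corrupt data once it is copied to a GPU that expects a
// stricter alignment. Allocating with an explicit alignment avoids this.
#include <cassert>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t GPU_ALIGNMENT = 256; // assumed value; common minimum for GPU buffer offsets

float* allocAligned(std::size_t nFloats) {
    // std::aligned_alloc (C++17) requires the size to be a multiple of the alignment.
    std::size_t bytes = nFloats * sizeof(float);
    bytes = (bytes + GPU_ALIGNMENT - 1) / GPU_ALIGNMENT * GPU_ALIGNMENT;
    void* p = std::aligned_alloc(GPU_ALIGNMENT, bytes);
    assert(p != nullptr &&
           reinterpret_cast<std::uintptr_t>(p) % GPU_ALIGNMENT == 0);
    return static_cast<float*>(p);
}
```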

4× Tesla V100-PCIE-16GB logs:
🌋 Device: Tesla V100-PCIE-16GB
🌋 DeviceApiVersion: 1.3.242
🌋 MaxComputeSharedMemory: 48 kB
🌋 Heap[0]: 16384 MB
💿 Loading weights...
💿 Weights loaded
🚁 Network is in non-blocking mode
Hello world!
🔷️ Eval  622 ms Sync  191 ms | Sent 12528 kB Recv 13367 kB | (3 tokens)
🔶 Pred  317 ms Sync   34 ms | Sent  4176 kB Recv  4455 kB |  I
🔶 Pred  214 ms Sync   36 ms | Sent  4176 kB Recv  4455 kB |  am
🔶 Pred  225 ms Sync   17 ms | Sent  4176 kB Recv  4455 kB |  a
🔶 Pred  219 ms Sync   42 ms | Sent  4176 kB Recv  4455 kB |  new
🔶 Pred  215 ms Sync   38 ms | Sent  4176 kB Recv  4455 kB |  staff
🔶 Pred  203 ms Sync   30 ms | Sent  4176 kB Recv  4455 kB |  member
...
🔶 Pred  214 ms Sync   44 ms | Sent  4176 kB Recv  4455 kB |  I
🔶 Pred  201 ms Sync   50 ms | Sent  4176 kB Recv  4455 kB |  am
🔶 Pred  213 ms Sync   42 ms | Sent  4176 kB Recv  4455 kB |  committed
🔶 Pred  210 ms Sync   49 ms | Sent  4176 kB Recv  4455 kB |  to
🔶 Pred  228 ms Sync   42 ms | Sent  4176 kB Recv  4455 kB |  using

Evaluation
   nBatches: 32
    nTokens: 3
   tokens/s: 3.69 (271.09 ms/tok)
Prediction
    nTokens: 125
   tokens/s: 3.86 (258.96 ms/tok)
📀 RequiredMemory: 14378212 kB
⭕ Socket[0]: connecting to 127.0.0.1:9999 worker
⭕ Socket[0]: connected
⭕ Socket[1]: connecting to 127.0.0.1:9998 worker
⭕ Socket[1]: connected
⭕ Socket[2]: connecting to 127.0.0.1:9997 worker
⭕ Socket[2]: connected
⭕ Network is initialized
🌋 Device: Tesla V100-PCIE-16GB
🌋 DeviceApiVersion: 1.3.242
🌋 MaxComputeSharedMemory: 48 kB
🌋 Heap[0]: 16384 MB
💿 Loading weights...
💿 Weights loaded
🚁 Network is in non-blocking mode
⭐ Chat template: llama3
🛑 Stop: <|end_of_text|>
🛑 Stop: <|eot_id|>
💻 System prompt (optional): 

👱 User
> hello? where is Poland?

🤖 Assistant
Hello! Poland is a country located in Central Europe. It is bordered by:

* Germany to the west
* Czech Republic and Slovakia to the south
* Ukraine and Belarus to the east
* Russia (Kaliningrad Oblast) and Lithuania to the northeast
* Baltic Sea to the north

Poland is a member of the European Union and has a population of around 38 million people. The country has a rich history, beautiful landscapes, and a vibrant culture. Is there anything specific you would like to know about Poland?
👱 User
> 

It’s also worth noting that the inference utilized all GPUs at all times.

[Screenshot 2025-08-12: all four GPUs under load during inference]

0.14.2

09 Aug 10:11

This version fixes a bug on AVX2 CPUs that caused incorrect inference results for Qwen3 models (#239).