
MUSA error: operation not supported #26

@yeungtuzi

Description


Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 MUSA devices:
Device 0: MTT S4000, compute capability 2.2, VMM: yes
Device 1: MTT S4000, compute capability 2.2, VMM: yes
Device 2: MTT S4000, compute capability 2.2, VMM: yes
Device 3: MTT S4000, compute capability 2.2, VMM: yes
Device 4: MTT S3000, compute capability 2.1, VMM: yes
Device 5: MTT S3000, compute capability 2.1, VMM: yes
Device 6: MTT S3000, compute capability 2.1, VMM: yes
Device 7: MTT S3000, compute capability 2.1, VMM: yes
version: 4749 (ggml-org/llama.cpp@ee02ad0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems
Linux

GGML backends
MUSA

Hardware
CPU: Hygon 7385 x2 (32 cores each)
GPU: Moore Threads MTT S4000 x4 + MTT S3000 x4
RAM: 512 GB

Models
DeepSeek-R1-Distill-Qwen-32B-Q4_K_M

Problem description & steps to reproduce
Command:
./llama-server -t 32 -c 8192 -fa -np 4 -m /root/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf -ngl 100 --no-warmup --port 10086 --host 0.0.0.0 -v
First Bad Commit
No response

Relevant log output
Error log:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 MUSA devices:
Device 0: MTT S4000, compute capability 2.2, VMM: yes
Device 1: MTT S4000, compute capability 2.2, VMM: yes
Device 2: MTT S4000, compute capability 2.2, VMM: yes
Device 3: MTT S4000, compute capability 2.2, VMM: yes
Device 4: MTT S3000, compute capability 2.1, VMM: yes
Device 5: MTT S3000, compute capability 2.1, VMM: yes
Device 6: MTT S3000, compute capability 2.1, VMM: yes
Device 7: MTT S3000, compute capability 2.1, VMM: yes
build: 4749 (ee02ad02) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 32, n_threads_batch = 32, total_threads = 64

system_info: n_threads = 32 (n_threads_batch = 32) / 64 | MUSA : F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: HTTP server is listening, hostname: 0.0.0.0, port: 10086, http threads: 63
main: loading model
srv load_model: loading model '/root/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf'
llama_model_load_from_file_impl: using device MUSA0 (MTT S4000) - 49055 MiB free
llama_model_load_from_file_impl: using device MUSA1 (MTT S4000) - 49055 MiB free
llama_model_load_from_file_impl: using device MUSA2 (MTT S4000) - 49055 MiB free
llama_model_load_from_file_impl: using device MUSA3 (MTT S4000) - 49055 MiB free
llama_model_load_from_file_impl: using device MUSA4 (MTT S3000) - 32674 MiB free
llama_model_load_from_file_impl: using device MUSA5 (MTT S3000) - 32674 MiB free
llama_model_load_from_file_impl: using device MUSA6 (MTT S3000) - 32674 MiB free
llama_model_load_from_file_impl: using device MUSA7 (MTT S3000) - 32674 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 771 tensors from /root/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 32B
llama_model_loader: - kv 3: general.organization str = Deepseek Ai
llama_model_loader: - kv 4: general.basename str = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv 5: general.size_label str = 32B
llama_model_loader: - kv 6: qwen2.block_count u32 = 64
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: general.file_type u32 = 15
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q6_K: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 18.48 GiB (4.85 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5120
print_info: n_layer = 64
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 32B
print_info: model params = 32.76 B
print_info: general.name = DeepSeek R1 Distill Qwen 32B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device MUSA0
load_tensors: layer 1 assigned to device MUSA0
load_tensors: layer 2 assigned to device MUSA0
load_tensors: layer 3 assigned to device MUSA0
load_tensors: layer 4 assigned to device MUSA0
load_tensors: layer 5 assigned to device MUSA0
load_tensors: layer 6 assigned to device MUSA0
load_tensors: layer 7 assigned to device MUSA0
load_tensors: layer 8 assigned to device MUSA0
load_tensors: layer 9 assigned to device MUSA0
load_tensors: layer 10 assigned to device MUSA1
load_tensors: layer 11 assigned to device MUSA1
load_tensors: layer 12 assigned to device MUSA1
load_tensors: layer 13 assigned to device MUSA1
load_tensors: layer 14 assigned to device MUSA1
load_tensors: layer 15 assigned to device MUSA1
load_tensors: layer 16 assigned to device MUSA1
load_tensors: layer 17 assigned to device MUSA1
load_tensors: layer 18 assigned to device MUSA1
load_tensors: layer 19 assigned to device MUSA1
load_tensors: layer 20 assigned to device MUSA2
load_tensors: layer 21 assigned to device MUSA2
load_tensors: layer 22 assigned to device MUSA2
load_tensors: layer 23 assigned to device MUSA2
load_tensors: layer 24 assigned to device MUSA2
load_tensors: layer 25 assigned to device MUSA2
load_tensors: layer 26 assigned to device MUSA2
load_tensors: layer 27 assigned to device MUSA2
load_tensors: layer 28 assigned to device MUSA2
load_tensors: layer 29 assigned to device MUSA2
load_tensors: layer 30 assigned to device MUSA3
load_tensors: layer 31 assigned to device MUSA3
load_tensors: layer 32 assigned to device MUSA3
load_tensors: layer 33 assigned to device MUSA3
load_tensors: layer 34 assigned to device MUSA3
load_tensors: layer 35 assigned to device MUSA3
load_tensors: layer 36 assigned to device MUSA3
load_tensors: layer 37 assigned to device MUSA3
load_tensors: layer 38 assigned to device MUSA3
load_tensors: layer 39 assigned to device MUSA3
load_tensors: layer 40 assigned to device MUSA4
load_tensors: layer 41 assigned to device MUSA4
load_tensors: layer 42 assigned to device MUSA4
load_tensors: layer 43 assigned to device MUSA4
load_tensors: layer 44 assigned to device MUSA4
load_tensors: layer 45 assigned to device MUSA4
load_tensors: layer 46 assigned to device MUSA5
load_tensors: layer 47 assigned to device MUSA5
load_tensors: layer 48 assigned to device MUSA5
load_tensors: layer 49 assigned to device MUSA5
load_tensors: layer 50 assigned to device MUSA5
load_tensors: layer 51 assigned to device MUSA5
load_tensors: layer 52 assigned to device MUSA5
load_tensors: layer 53 assigned to device MUSA6
load_tensors: layer 54 assigned to device MUSA6
load_tensors: layer 55 assigned to device MUSA6
load_tensors: layer 56 assigned to device MUSA6
load_tensors: layer 57 assigned to device MUSA6
load_tensors: layer 58 assigned to device MUSA6
load_tensors: layer 59 assigned to device MUSA7
load_tensors: layer 60 assigned to device MUSA7
load_tensors: layer 61 assigned to device MUSA7
load_tensors: layer 62 assigned to device MUSA7
load_tensors: layer 63 assigned to device MUSA7
load_tensors: layer 64 assigned to device MUSA7
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors: MUSA0 model buffer size = 2905.04 MiB
load_tensors: MUSA1 model buffer size = 2760.66 MiB
load_tensors: MUSA2 model buffer size = 2724.57 MiB
load_tensors: MUSA3 model buffer size = 2724.57 MiB
load_tensors: MUSA4 model buffer size = 1641.96 MiB
load_tensors: MUSA5 model buffer size = 1939.68 MiB
load_tensors: MUSA6 model buffer size = 1714.15 MiB
load_tensors: MUSA7 model buffer size = 2097.71 MiB
load_tensors: CPU_Mapped model buffer size = 417.66 MiB
................................................................................................
llama_init_from_model: n_seq_max = 4
llama_init_from_model: n_ctx = 8192
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 1
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 34: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 35: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 36: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 37: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 38: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 39: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 40: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 41: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 42: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 43: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 44: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 45: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 46: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 47: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 48: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 49: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 50: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 51: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 52: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 53: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 54: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 55: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 56: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 57: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 58: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 59: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 60: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 61: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 62: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 63: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: MUSA0 KV buffer size = 320.00 MiB
llama_kv_cache_init: MUSA1 KV buffer size = 320.00 MiB
llama_kv_cache_init: MUSA2 KV buffer size = 320.00 MiB
llama_kv_cache_init: MUSA3 KV buffer size = 320.00 MiB
llama_kv_cache_init: MUSA4 KV buffer size = 192.00 MiB
llama_kv_cache_init: MUSA5 KV buffer size = 224.00 MiB
llama_kv_cache_init: MUSA6 KV buffer size = 192.00 MiB
llama_kv_cache_init: MUSA7 KV buffer size = 160.00 MiB
llama_init_from_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_init_from_model: MUSA_Host output buffer size = 2.32 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
llama_init_from_model: MUSA0 compute buffer size = 632.01 MiB
llama_init_from_model: MUSA1 compute buffer size = 568.01 MiB
llama_init_from_model: MUSA2 compute buffer size = 568.01 MiB
llama_init_from_model: MUSA3 compute buffer size = 568.01 MiB
llama_init_from_model: MUSA4 compute buffer size = 408.01 MiB
llama_init_from_model: MUSA5 compute buffer size = 448.01 MiB
llama_init_from_model: MUSA6 compute buffer size = 408.01 MiB
llama_init_from_model: MUSA7 compute buffer size = 547.02 MiB
llama_init_from_model: MUSA_Host compute buffer size = 10858.02 MiB
llama_init_from_model: graph nodes = 1991
llama_init_from_model: graph splits = 137
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
srv init: initializing slots, n_slots = 4
slot init: id 0 | task -1 | new slot n_ctx_slot = 2048
slot reset: id 0 | task -1 |
slot init: id 1 | task -1 | new slot n_ctx_slot = 2048
slot reset: id 1 | task -1 |
slot init: id 2 | task -1 | new slot n_ctx_slot = 2048
slot reset: id 2 | task -1 |
slot init: id 3 | task -1 | new slot n_ctx_slot = 2048
slot reset: id 3 | task -1 |
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '' in content %}{% set content = content.split('')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://0.0.0.0:10086 - starting the main loop
que start_loop: processing new tasks
que start_loop: update slots
srv update_slots: all slots are idle
srv kv_cache_cle: clearing KV cache
que start_loop: waiting for new tasks
request: {"messages":[{"role":"system","content":"\nCurrent model: gpt-4o\nCurrent date: 2025-02-26T02:45:26.215Z\n\nYou are a helpful assistant."},{"role":"user","content":"你好啊"},{"role":"user","content":"你好"}],"model":"gpt-4o","temperature":0.7,"top_p":0.9,"stream":true}
srv params_from_: Grammar:
srv params_from_: Grammar lazy: false
srv params_from_: Chat format: Content-only
srv add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que post: new task, id = 0/1, front = 0
que start_loop: processing new tasks
que start_loop: processing task, id = 0
slot get_availabl: id 0 | task -1 | selected slot by lru, t_last = -1
slot reset: id 0 | task -1 |
slot launch_slot_: id 0 | task 0 | launching slot : {"id":0,"id_task":0,"n_ctx":2048,"speculative":false,"is_processing":false,"non_causal":false,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.699999988079071,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":8192,"dry_sequence_breakers":["\n",":",""","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_trigger_words":[],"grammar_trigger_tokens":[],"preserved_tokens":[],"chat_format":"Content-only","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"<|begin▁of▁sentence|>\nCurrent model: gpt-4o\nCurrent date: 2025-02-26T02:45:26.215Z\n\nYou are a helpful assistant.\n\n<|User|>你好啊<|User|>你好<|Assistant|>","next_token":{"has_next_token":true,"has_new_line":false,"n_remain":-1,"n_decoded":0,"stopping_word":""}}
slot launch_slot_: id 0 | task 0 | processing task
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 1, front = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 52
slot update_slots: id 0 | task 0 | prompt token 0: 151646 '<|begin▁of▁sentence|>'
slot update_slots: id 0 | task 0 | prompt token 1: 198 '
'
slot update_slots: id 0 | task 0 | prompt token 2: 5405 'Current'
slot update_slots: id 0 | task 0 | prompt token 3: 1614 ' model'
slot update_slots: id 0 | task 0 | prompt token 4: 25 ':'
slot update_slots: id 0 | task 0 | prompt token 5: 342 ' g'
slot update_slots: id 0 | task 0 | prompt token 6: 417 'pt'
slot update_slots: id 0 | task 0 | prompt token 7: 12 '-'
slot update_slots: id 0 | task 0 | prompt token 8: 19 '4'
slot update_slots: id 0 | task 0 | prompt token 9: 78 'o'
slot update_slots: id 0 | task 0 | prompt token 10: 198 '
'
slot update_slots: id 0 | task 0 | prompt token 11: 5405 'Current'
slot update_slots: id 0 | task 0 | prompt token 12: 2400 ' date'
slot update_slots: id 0 | task 0 | prompt token 13: 25 ':'
slot update_slots: id 0 | task 0 | prompt token 14: 220 ' '
slot update_slots: id 0 | task 0 | prompt token 15: 17 '2'
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 52, n_tokens = 52, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 52, n_tokens = 52
srv update_slots: decoding batch, n_tokens = 52
/root/musa/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:73: MUSA error
MUSA error: operation not supported
current device: 0, in function alloc at /root/musa/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:443
muMemAddressReserve(&pool_addr, CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)
request: {"messages":[{"role":"user","content":"Based on the chat history, give this conversation a name.\nKeep it short - 10 characters max, no quotes.\nUse 简体中文.\nJust provide the name, nothing else.\n\nHere's the conversation:\n\n\n你好啊\n\n---------\n\n\n\n---------\n\n你好\n\n---------\n\n...\n\n\nName this conversation in 10 characters or less.\nUse 简体中文.\nOnly give the name, nothing else.\n\nThe name is:"}],"model":"gpt-4o","temperature":0.7,"top_p":0.9,"stream":true}
srv params_from_: Grammar:
srv params_from_: Grammar lazy: false
srv params_from_: Chat format: Content-only
srv add_waiting_: add task 2 to waiting list. current waiting = 1 (before add)
que post: new task, id = 2/1, front = 0
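
To help isolate the failure outside of llama.cpp: the error comes from the virtual-memory-management (VMM) device pool, where ggml reserves a large virtual address range per device (the `muMemAddressReserve(...)` call reported at ggml-cuda.cu:443). Below is a minimal standalone probe, sketched with the CUDA driver API names that the MUSA build maps to their mu* equivalents; the reserve size and per-device loop are assumptions for the probe, not the project's actual constants.

```cpp
// Minimal sketch (not the project's code): check whether each device reports
// VMM support and attempt the same kind of address reservation that fails in
// ggml_cuda_pool_vmm::alloc. Uses the CUDA driver API; the MUSA SDK exposes
// the corresponding mu* entry points.
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    int n_devices = 0;
    cuDeviceGetCount(&n_devices);

    for (int i = 0; i < n_devices; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);

        // This attribute is what produces the "VMM: yes" line in the init log.
        int vmm_supported = 0;
        cuDeviceGetAttribute(&vmm_supported,
            CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED, dev);

        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);

        // Same pattern as the failing call in the log:
        // muMemAddressReserve(&pool_addr, CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)
        const size_t reserve_size = 32ull * 1024 * 1024 * 1024; // assumed probe size
        CUdeviceptr pool_addr = 0;
        CUresult res = cuMemAddressReserve(&pool_addr, reserve_size, 0, 0, 0);

        printf("device %d: VMM attribute = %d, MemAddressReserve -> %d\n",
               i, vmm_supported, (int) res);

        if (res == CUDA_SUCCESS) {
            cuMemAddressFree(pool_addr, reserve_size);
        }
        cuCtxDestroy(ctx);
    }
    return 0;
}
```

If the attribute reports support but the reservation still returns "operation not supported", building with the VMM pool disabled (the GGML_CUDA_NO_VMM CMake option, if present in this tree) may be a usable workaround while the driver-side behavior is investigated.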
