Description
The model works when I use llama-cpp-python locally, but when I hit my inference process (lmql serve-model) I get error logs and no usable response.
Wondering if anyone can help me diagnose, thanks.
TL;DR, this is what the error looks like:
Exception ignored on calling ctypes callback function: <function CustomSampler.__init__.<locals>.apply_wrapper at 0x0000022357DC4D30>
Traceback (most recent call last):
File "C:\Users\Rory\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_cpp\_internals.py", line 735, in apply_wrapper
self.apply_func(cur_p)
File "C:\Users\Rory\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_cpp\llama.py", line 706, in apply_func
for logit_processor in logits_processor:
TypeError: 'BatchLogitsProcessor' object is not iterable
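In case it helps with diagnosis: the TypeError suggests that the llama-cpp-python on the serving side iterates over whatever it receives as logits_processor, i.e. it expects an iterable of callables (a LogitsProcessorList), while the LMTP backend hands it a single BatchLogitsProcessor object. A minimal sketch of the contract implied by the "for logit_processor in logits_processor" loop in llama.py (the passthrough processor below is a made-up placeholder, not anything from LMQL):

from llama_cpp import Llama, LogitsProcessorList

def passthrough(input_ids, scores):
    # no-op processor with the (input_ids, scores) -> scores signature llama-cpp-python expects
    return scores

llm = Llama(model_path="C:/Users/Rory/Downloads/Llama-3.2-3B-Instruct-Q6_K.gguf", n_ctx=2048)
out = llm(
    "Q: What is 2x3?\nA:",
    max_tokens=16,
    # must be an iterable of callables; a bare processor object would hit the same TypeError
    logits_processor=LogitsProcessorList([passthrough]),
)
print(out["choices"][0]["text"])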
Inference Process:
$ lmql serve-model llama.cpp:C:/Users/Rory/Downloads/Llama-3.2-3B-Instruct-Q6_K.gguf --n_ctx 2048 --verbose True
[Serving LMTP endpoint on ws://localhost:8080/]
[Loading llama.cpp model from llama.cpp:C:/Users/Rory/Downloads/Llama-3.2-3B-Instruct-Q6_K.gguf with {'n_ctx': 2048, 'verbose': True} ]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Laptop GPU, compute capability 8.6, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3070 Laptop GPU) - 7118 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from C:/Users/Rory/Downloads/Llama-3.2-3B-Instruct-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.license str = llama3.2
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 28
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 3072
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 13: llama.attention.head_count u32 = 24
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: llama.attention.key_length u32 = 128
llama_model_loader: - kv 18: llama.attention.value_length u32 = 128
llama_model_loader: - kv 19: general.file_type u32 = 18
llama_model_loader: - kv 20: llama.vocab_size u32 = 128256
llama_model_loader: - kv 21: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: quantize.imatrix.file str = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv 32: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q6_K: 197 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 2.45 GiB (6.56 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG
...
load: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3072
print_info: n_layer = 28
print_info: n_head = 24
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 3
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 8192
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 3B
print_info: model params = 3.21 B
print_info: general.name = Llama 3.2 3B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 128 'Ä'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: layer 0 assigned to device CPU
...
load_tensors: layer 28 assigned to device CPU
load_tensors: tensor 'token_embd.weight' (q6_K) (and 282 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 2513.90 MiB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch = 512
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 500000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
...
llama_kv_cache_init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: CPU KV buffer size = 224.00 MiB
llama_init_from_model: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_init_from_model: CPU output buffer size = 0.49 MiB
llama_init_from_model: CUDA0 compute buffer size = 564.73 MiB
llama_init_from_model: CUDA_Host compute buffer size = 10.01 MiB
llama_init_from_model: graph nodes = 902
llama_init_from_model: graph splits = 313 (with bs=512), 1 (with bs=1)
CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Model metadata: {'general.name': 'Llama 3.2 3B Instruct', 'general.architecture': 'llama', 'general.type': 'model', 'llama.block_count': '28', 'general.basename': 'Llama-3.2', 'general.finetune': 'Instruct', 'general.size_label': '3B', 'general.license': 'llama3.2', 'llama.context_length': '131072', 'llama.embedding_length': '3072', 'llama.feed_forward_length': '8192', 'llama.attention.head_count': '24', 'tokenizer.ggml.eos_token_id': '128009', 'general.file_type': '18', 'llama.attention.head_count_kv': '8', 'llama.rope.freq_base': '500000.000000', 'quantize.imatrix.entries_count': '196', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.attention.key_length': '128', 'llama.attention.value_length': '128', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.model': 'gpt2', 'tokenizer.ggml.pre': 'llama-bpe', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '128000', 'tokenizer.chat_template': '{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now("%d %b %Y") %}\n {%- else %}\n {%- set date_string = "26 Jul 2024" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0][\'role\'] == \'system\' %}\n {%- set system_message = messages[0][\'content\']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = "" %}\n{%- endif %}\n\n{#- System message #}\n{{- "<|start_header_id|>system<|end_header_id|>\\n\\n" }}\n{%- if tools is not none %}\n {{- "Environment: ipython\\n" }}\n{%- endif %}\n{{- "Cutting Knowledge Date: December 2023\\n" }}\n{{- "Today Date: " + date_string + "\\n\\n" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." 
}}\n {{- \'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.\' }}\n {{- "Do not use variables.\\n\\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\\n\\n" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- "<|eot_id|>" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0][\'content\']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception("Cannot put tools in the first user message when there\'s no first user message!") }}\n{%- endif %}\n {{- \'<|start_header_id|>user<|end_header_id|>\\n\\n\' -}}\n {{- "Given the following functions, please respond with a JSON for a function call " }}\n {{- "with its proper arguments that best answers the given prompt.\\n\\n" }}\n {{- \'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.\' }}\n {{- "Do not use variables.\\n\\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\\n\\n" }}\n {%- endfor %}\n {{- first_user_message + "<|eot_id|>"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == \'ipython\' or message.role == \'tool\' or \'tool_calls\' in message) %}\n {{- \'<|start_header_id|>\' + message[\'role\'] + \'<|end_header_id|>\\n\\n\'+ message[\'content\'] | trim + \'<|eot_id|>\' }}\n {%- elif \'tool_calls\' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception("This model only supports single tool-calls at once!") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- \'<|start_header_id|>assistant<|end_header_id|>\\n\\n\' -}}\n {{- \'{"name": "\' + tool_call.name + \'", \' }}\n {{- \'"parameters": \' }}\n {{- tool_call.arguments | tojson }}\n {{- "}" }}\n {{- "<|eot_id|>" }}\n {%- elif message.role == "tool" or message.role == "ipython" %}\n {{- "<|start_header_id|>ipython<|end_header_id|>\\n\\n" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- "<|eot_id|>" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- \'<|start_header_id|>assistant<|end_header_id|>\\n\\n\' }}\n{%- endif %}\n', 'quantize.imatrix.chunks_count': '125', 'quantize.imatrix.file': '/models_out/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct.imatrix', 'quantize.imatrix.dataset': '/training_dir/calibration_datav3.txt'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {{- bos_token }}
{%- if custom_tools is defined %}
{%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
{%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
{%- if strftime_now is defined %}
{%- set date_string = strftime_now("%d %b %Y") %}
{%- else %}
{%- set date_string = "26 Jul 2024" %}
{%- endif %}
{%- endif %}
{%- if not tools is defined %}
{%- set tools = none %}
{%- endif %}
{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
{%- set system_message = messages[0]['content']|trim %}
{%- set messages = messages[1:] %}
{%- else %}
{%- set system_message = "" %}
{%- endif %}
{#- System message #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if tools is not none %}
{{- "Environment: ipython\n" }}
{%- endif %}
{{- "Cutting Knowledge Date: December 2023\n" }}
{{- "Today Date: " + date_string + "\n\n" }}
{%- if tools is not none and not tools_in_user_message %}
{{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
{{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
{{- "Do not use variables.\n\n" }}
{%- for t in tools %}
{{- t | tojson(indent=4) }}
{{- "\n\n" }}
{%- endfor %}
{%- endif %}
{{- system_message }}
{{- "<|eot_id|>" }}
{#- Custom tools are passed in a user message with some extra guidance #}
{%- if tools_in_user_message and not tools is none %}
{#- Extract the first user message so we can plug it in here #}
{%- if messages | length != 0 %}
{%- set first_user_message = messages[0]['content']|trim %}
{%- set messages = messages[1:] %}
{%- else %}
{{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
{%- endif %}
{{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
{{- "Given the following functions, please respond with a JSON for a function call " }}
{{- "with its proper arguments that best answers the given prompt.\n\n" }}
{{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
{{- "Do not use variables.\n\n" }}
{%- for t in tools %}
{{- t | tojson(indent=4) }}
{{- "\n\n" }}
{%- endfor %}
{{- first_user_message + "<|eot_id|>"}}
{%- endif %}
{%- for message in messages %}
{%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
{{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
{%- elif 'tool_calls' in message %}
{%- if not message.tool_calls|length == 1 %}
{{- raise_exception("This model only supports single tool-calls at once!") }}
{%- endif %}
{%- set tool_call = message.tool_calls[0].function %}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
{{- '{"name": "' + tool_call.name + '", ' }}
{{- '"parameters": ' }}
{{- tool_call.arguments | tojson }}
{{- "}" }}
{{- "<|eot_id|>" }}
{%- elif message.role == "tool" or message.role == "ipython" %}
{{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
{%- if message.content is mapping or message.content is iterable %}
{{- message.content | tojson }}
{%- else %}
{{- message.content }}
{%- endif %}
{{- "<|eot_id|>" }}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif %}
Using chat eos_token: <|eot_id|>
Using chat bos_token: <|begin_of_text|>
Exception ignored on calling ctypes callback function: <function CustomSampler.__init__.<locals>.apply_wrapper at 0x0000022357DC4D30>
Traceback (most recent call last):
File "C:\Users\Rory\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_cpp\_internals.py", line 735, in apply_wrapper
self.apply_func(cur_p)
File "C:\Users\Rory\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_cpp\llama.py", line 706, in apply_func
for logit_processor in logits_processor:
TypeError: 'BatchLogitsProcessor' object is not iterable
Exception ignored on calling ctypes callback function: <function CustomSampler.__init__.<locals>.apply_wrapper at 0x0000022357DC4D30>
Traceback (most recent call last):
File "C:\Users\Rory\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_cpp\_internals.py", line 735, in apply_wrapper
self.apply_func(cur_p)
File "C:\Users\Rory\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_cpp\llama.py", line 706, in apply_func
for logit_processor in logits_processor:
TypeError: 'BatchLogitsProcessor' object is not iterable
...
Python code:
import lmql

def main():
    chain_of_thought = lmql.query("""
    "Q: {question}\\n"
    "A: Let's think step by step.\\n"
    "[ANSWER]"
    """, is_async=False, verbose=True, model="llama.cpp:C:/Users/Rory/Downloads/Llama-3.2-3B-Instruct-Q6_K.gguf")
    chain_of_thought("What is 2x3?")

if __name__ == '__main__':
    main()
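If it helps to isolate where things go wrong: as far as I understand, a model string like "llama.cpp:<PATH>" without any prefix routes the query through the running lmql serve-model endpoint, while a "local:" prefix would load llama.cpp in-process instead. A variant along these lines (the "local:" prefix is my reading of the docs, not something I've verified on this setup) could confirm whether the error only happens on the LMTP serving path:

import lmql

def main():
    # Hypothetical variant: run llama.cpp in-process ("local:" prefix) instead of
    # going through the LMTP server, to check whether the BatchLogitsProcessor
    # error is specific to the serving path.
    chain_of_thought = lmql.query("""
    "Q: {question}\\n"
    "A: Let's think step by step.\\n"
    "[ANSWER]"
    """, is_async=False, verbose=True,
        model="local:llama.cpp:C:/Users/Rory/Downloads/Llama-3.2-3B-Instruct-Q6_K.gguf")
    print(chain_of_thought("What is 2x3?"))

if __name__ == '__main__':
    main()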
Python Output:
UserWarning: File tokenizer.model not present in the same folder as the model weights. Using default 'huggyllama/llama-7b' tokenizer for all llama.cpp models. To change this, set the 'tokenizer' argument of your lmql.model(...) object.
warnings.warn("File tokenizer.model not present in the same folder as the model weights. Using default '{}' tokenizer for all llama.cpp models. To change this, set the 'tokenizer' argument of your lmql.model(...) object.".format("huggyllama/llama-7b", UserWarning))
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
lmtp generate: [1, 29984, 29901, 1724, 338, 29871, 29906, 29916, 29941, 29973, 13, 29909, 29901, 2803, 29915, 29879, 1348, 4331, 491, 4331, 29889, 13] / "<s>Q: What is 2x3?\nA: Let's think step by step.\n" (22 tokens, temperature=0.0, max_tokens=128)
lmtp generate: [1, 29984, 29901, 1724, 338, 29871, 29906, 29916, 29941, 29973, 13, 29909, 29901, 2803, 29915, 29879, 1348, 4331, 491, 4331, 29889, 13, 220, 2366, 18, 12, 2437, 12, 508, 220, 975, 25, 966, 25, 410, 13, 931, 198, 2366, 18, 12, 2437, 12, 508, 220, 975, 25, 966, 25, 410, 13, 931, 31871, 220, 510, 64222, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220, 510, 3902, 60, 220] / "<s>Q: What is 2x3?\nA: Let's think step by step.��vious\x0f\t inv\t can� over\x16 les\x16 pro\n time�vious\x0f\t inv\t can� over\x16 les\x16 pro\n time論�com��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��com expl��" (150 tokens, temperature=0.0, max_tokens=128)
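Possibly a separate issue, but likely related to the garbage output: the warning above shows LMQL falling back to the default 'huggyllama/llama-7b' tokenizer because there is no tokenizer.model next to the GGUF, and the token ids in the lmtp generate lines look like Llama-1 SentencePiece ids rather than Llama-3's 128256-token BPE vocabulary, which would explain the mojibake. Following the warning's own suggestion, the tokenizer can be set on the lmql.model(...) object; a sketch (the Hugging Face tokenizer id below is my guess and that repo may be gated, so any matching local tokenizer should also work):

import lmql

# Sketch: override the default 'huggyllama/llama-7b' fallback tokenizer with one
# that matches the Llama 3.2 vocabulary, as the warning above suggests.
# "meta-llama/Llama-3.2-3B-Instruct" is an assumed tokenizer id, not verified here.
m = lmql.model(
    "llama.cpp:C:/Users/Rory/Downloads/Llama-3.2-3B-Instruct-Q6_K.gguf",
    tokenizer="meta-llama/Llama-3.2-3B-Instruct",
)

chain_of_thought = lmql.query("""
"Q: {question}\\n"
"A: Let's think step by step.\\n"
"[ANSWER]"
""", is_async=False, verbose=True, model=m)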