[pull] master from ggml-org:master by pull[bot] · Pull Request #749 · LongLeCE/llama.cpp

pull · 2026-01-06T20:42:02Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)


* server : add thinking content blocks to Anthropic Messages API

  Add support for returning reasoning/thinking content in Anthropic API responses when using models with --reasoning-format deepseek and the thinking parameter enabled.

  - Non-streaming: adds thinking block before text in content array
  - Streaming: emits thinking_delta events with correct block indices
  - Partial streaming: tracks reasoning state across chunks via anthropic_has_reasoning member variable

  Tested with bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF model.

* server : fix Anthropic API streaming for thinking content blocks

  Add signature field and fix duplicate content_block_start events in Anthropic Messages API streaming responses for reasoning models.

* server: refactor Anthropic streaming state to avoid raw pointer

  Replace raw pointer to task_result_state with direct field copies:

  - Copy state fields in update() before processing chunk
  - Use local copies in to_json_anthropic() instead of dereferencing
  - Pre-compute state updates for next chunk in update()

  This makes the data flow clearer and avoids unsafe pointer patterns.
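To make the block-index bookkeeping described above concrete, here is a minimal sketch of a per-stream state tracker. It is not the llama.cpp server code: `stream_event` and `anthropic_stream_sketch` are hypothetical names, the event payloads are reduced to a type, an index, and a delta type, and the signature field mentioned in the commit message is left out. It only illustrates two ideas from the commits: the thinking block takes index 0 and pushes the text block to index 1, and `content_block_start` is emitted once per block.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified event record (hypothetical; real events carry JSON payloads).
struct stream_event {
    std::string type;       // "content_block_start", "content_block_delta", "content_block_stop"
    int         index;      // index of the content block the event refers to
    std::string delta_type; // "thinking_delta", "text_delta", or empty
};

// Hypothetical tracker: remembers across chunks whether reasoning was seen
// and which blocks are already open, so no start event is emitted twice.
class anthropic_stream_sketch {
public:
    std::vector<stream_event> on_chunk(const std::string & reasoning_delta,
                                       const std::string & text_delta) {
        std::vector<stream_event> ev;
        if (!reasoning_delta.empty()) {
            if (!thinking_open_) {
                ev.push_back({"content_block_start", 0, ""});
                thinking_open_ = true;
                has_reasoning_ = true;
            }
            ev.push_back({"content_block_delta", 0, "thinking_delta"});
        }
        if (!text_delta.empty()) {
            // Text moves to index 1 only if a thinking block came first.
            const int text_index = has_reasoning_ ? 1 : 0;
            if (thinking_open_) {
                ev.push_back({"content_block_stop", 0, ""});
                thinking_open_ = false;
            }
            if (!text_open_) {
                ev.push_back({"content_block_start", text_index, ""});
                text_open_ = true;
            }
            ev.push_back({"content_block_delta", text_index, "text_delta"});
        }
        return ev;
    }

private:
    bool has_reasoning_ = false;
    bool thinking_open_ = false;
    bool text_open_     = false;
};
```

Feeding the tracker one chunk at a time mirrors the partial-streaming case: the state carried between calls is what keeps the indices stable.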

* Patch perf regression for mmq kernels in ROCm

  recover performance regression for #17917

* add n_experts branch like the cdna path
* mmq.cu: tune mmq/wmma switching for RDNA
* mmq.cu: move amd wmma mmq/wmma switching behind IS_RDNA3
* Update ggml/src/ggml-cuda/mmq.cu

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Jiacheng (Jason) Chen <76919340+jiachengjason@users.noreply.github.com>
Co-authored-by: jiachengjason <jasonchen.jiacheng@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* ggml-cuda: refactor cuda graph usage
* use is_enabled() instead of enabled

* vulkan: support buffer_from_host_ptr
* hacky use of buffer_from_host_ptr for directio
* disable buffer_from_host_ptr cap
* use external memory for ggml_vk_host_malloc, revert model loader changes
* disable external_memory_host for MoltenVK
* take buffer memory types into account
* don't use external_memory_host for ggml_vk_host_malloc

* arg: use CSV escape style for multiple-value args
* add test
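As a rough illustration of what a CSV escape style means for a multiple-value argument, the sketch below splits a comma-separated string while honoring double quotes. It assumes RFC 4180-style rules (a quoted field may contain commas, and `""` inside a quoted field stands for one literal quote); the commit message does not spell out the exact rules, so the parser in llama.cpp may differ. `split_csv_values` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not the llama.cpp implementation.
// Splits a comma-separated argument value, assuming RFC 4180-style quoting:
//  - fields are separated by commas
//  - a field wrapped in double quotes may contain commas
//  - inside a quoted field, "" stands for one literal double quote
std::vector<std::string> split_csv_values(const std::string & input) {
    std::vector<std::string> out;
    std::string cur;
    bool in_quotes = false;
    for (std::size_t i = 0; i < input.size(); ++i) {
        const char c = input[i];
        if (in_quotes) {
            if (c == '"') {
                if (i + 1 < input.size() && input[i + 1] == '"') {
                    cur += '"'; // escaped quote
                    ++i;
                } else {
                    in_quotes = false; // closing quote
                }
            } else {
                cur += c;
            }
        } else if (c == '"') {
            in_quotes = true; // opening quote
        } else if (c == ',') {
            out.push_back(cur); // field boundary
            cur.clear();
        } else {
            cur += c;
        }
    }
    out.push_back(cur); // last field (an empty input yields one empty field)
    return out;
}
```

With these rules, a value that itself contains a comma can be passed as one element by quoting it, which plain splitting on commas cannot express.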

* ggml : optimize cuda ssm_scan using warp-level reduction
* ggml : apply code review suggestions (style, const, constexpr)
* ggml : add TODO regarding stride consistency
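For readers unfamiliar with the term, a warp-level reduction combines the values held by the 32 threads of a warp in five shuffle steps instead of going through shared memory. The sketch below emulates that pattern on the CPU in plain C++; it is not the ssm_scan kernel, and `warp_reduce_sum_emulated` is a hypothetical name. Each loop iteration stands in for one `__shfl_down_sync` step, on the assumption that the kernel uses a shuffle-down tree of this general shape.

```cpp
#include <array>
#include <cassert>

constexpr int WARP_SIZE = 32;

// CPU emulation of a shuffle-down tree reduction over one warp.
// On a GPU each lane would run: val += __shfl_down_sync(mask, val, offset);
float warp_reduce_sum_emulated(std::array<float, WARP_SIZE> lanes) {
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        // Snapshot so every lane reads the values from before this step.
        const std::array<float, WARP_SIZE> prev = lanes;
        for (int lane = 0; lane < WARP_SIZE; ++lane) {
            const int src = lane + offset;
            // An out-of-range source lane yields the caller's own value,
            // matching the behavior of the shuffle-down intrinsic.
            lanes[lane] = prev[lane] + (src < WARP_SIZE ? prev[src] : prev[lane]);
        }
    }
    return lanes[0]; // only lane 0 holds the full sum
}
```

After log2(32) = 5 steps, lane 0 holds the sum of all 32 inputs; the values left in the other lanes are partial results and are normally ignored.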

Change is decoupled from #18641. [LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) needs streaming istft for generating output audio.

* add streaming ISTFT class (`mtmd_audio_streaming_istft`) with overlap-add for audio reconstruction
* replace global audio cache with per-instance cache, the model requires two independent caches, for preprocessing (audio input) and for istft (audio output).
* unified templated FFT/IFFT implementation supporting both forward and inverse transforms
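The overlap-add step of a streaming ISTFT can be sketched in isolation. The class below is not `mtmd_audio_streaming_istft`; `streaming_overlap_add` is a hypothetical name, and the sketch assumes each incoming frame has already been inverse-transformed and windowed, so the IFFT and window normalization are left out. It shows only how a running buffer lets each call emit the samples that no later frame can still modify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical streaming overlap-add buffer (illustration only).
// Frames of length frame_len arrive every `hop` samples; each push()
// returns the `hop` samples that will receive no further contributions.
class streaming_overlap_add {
public:
    streaming_overlap_add(std::size_t frame_len, std::size_t hop)
        : hop_(hop), tail_(frame_len, 0.0f) {
        assert(hop > 0 && hop <= frame_len);
    }

    std::vector<float> push(const std::vector<float> & frame) {
        assert(frame.size() == tail_.size());
        // Add the new frame onto the overlap carried over from earlier frames.
        for (std::size_t i = 0; i < tail_.size(); ++i) {
            tail_[i] += frame[i];
        }
        // The first `hop` samples are final: emit them.
        std::vector<float> out(tail_.begin(), tail_.begin() + hop_);
        // Shift the buffer left by `hop` and zero-fill the new end.
        tail_.erase(tail_.begin(), tail_.begin() + hop_);
        tail_.resize(tail_.size() + hop_, 0.0f);
        return out;
    }

private:
    std::size_t        hop_;
    std::vector<float> tail_;
};
```

Keeping the buffer as a member is also why per-instance state matters here: two independent streams each need their own carried-over overlap.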

686f6c61 and others added 8 commits January 6, 2026 16:17

ggml-cuda: refactor cuda graph usage (#18637)

090b137

* ggml-cuda: refactor cuda graph usage
* use is_enabled() instead of enabled

arg: use CSV escape style for multiple-value args (#18643)

07fbe19

* arg: use CSV escape style for multiple-value args
* add test

ggml : optimize cuda ssm_scan using warp-level reduction (#18505)

24af22f

* ggml : optimize cuda ssm_scan using warp-level reduction
* ggml : apply code review suggestions (style, const, constexpr)
* ggml : add TODO regarding stride consistency

llama-params-fit: fix last devices with low VRAM (#18494)

68b4d51

pull bot locked and limited conversation to collaborators Jan 6, 2026

pull bot added the ⤵️ pull label Jan 6, 2026

pull bot merged commit ccbc84a into LongLeCE:master Jan 6, 2026
60 of 75 checks passed

github-actions bot added Nvidia GPU testing examples python ggml server Vulkan labels Jan 6, 2026

pull[bot] merged 8 commits into LongLeCE:master from ggml-org:master

8 participants
