[pull] master from ggml-org:master #749

Merged
pull[bot] merged 8 commits into LongLeCE:master from ggml-org:master
Jan 6, 2026
pull[bot] commented Jan 6, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


686f6c61 and others added 8 commits January 6, 2026 16:17
* server : add thinking content blocks to Anthropic Messages API

Add support for returning reasoning/thinking content in Anthropic API
responses when using models with --reasoning-format deepseek and the
thinking parameter enabled.

- Non-streaming: adds thinking block before text in content array
- Streaming: emits thinking_delta events with correct block indices
- Partial streaming: tracks reasoning state across chunks via
  anthropic_has_reasoning member variable

Tested with bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF model.
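For illustration, the non-streaming ordering described above (thinking block before text in the content array) could look roughly like the following Python sketch. The function name and its arguments are hypothetical and are not taken from the server code; only the `thinking` and `text` block types follow the Anthropic Messages API shape.

```python
def build_anthropic_content(reasoning_text, answer_text):
    """Return an Anthropic-style content array with the thinking block first.

    Hypothetical helper: the real server builds this in C++.
    """
    content = []
    # Only emit a thinking block when the model actually produced reasoning.
    if reasoning_text:
        content.append({"type": "thinking", "thinking": reasoning_text})
    # The visible answer always follows the (optional) thinking block.
    content.append({"type": "text", "text": answer_text})
    return content


if __name__ == "__main__":
    for block in build_anthropic_content("2 + 2 is 4.", "The answer is 4."):
        print(block)
```

When there is no reasoning text, the array contains only the text block, so clients that ignore thinking content see the same shape as before.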

* server : fix Anthropic API streaming for thinking content blocks

Add signature field and fix duplicate content_block_start events in
Anthropic Messages API streaming responses for reasoning models.
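A rough Python sketch of the intended event sequence: one `content_block_start` per block, `thinking_delta` events for reasoning, and a signature on the thinking block. The `(kind, text)` chunk shape, the helper names, the empty signature value, and the point at which the signature is emitted are all assumptions made for illustration; the commit message does not specify them.

```python
def _stop_events(index, kind):
    """Events that close a block; thinking blocks get a signature first (assumed)."""
    events = []
    if kind == "reasoning":
        events.append({"type": "content_block_delta", "index": index,
                       "delta": {"type": "signature_delta", "signature": ""}})
    events.append({"type": "content_block_stop", "index": index})
    return events


def anthropic_stream_events(chunks):
    """Turn (kind, text) chunks into Anthropic-style streaming events.

    kind is "reasoning" or "text" (hypothetical chunk shape).
    """
    events = []
    index = -1
    open_kind = None
    for kind, text in chunks:
        if kind != open_kind:
            # Close the previous block, then start exactly one new block.
            if open_kind is not None:
                events.extend(_stop_events(index, open_kind))
            index += 1
            open_kind = kind
            if kind == "reasoning":
                block = {"type": "thinking", "thinking": "", "signature": ""}
            else:
                block = {"type": "text", "text": ""}
            events.append({"type": "content_block_start", "index": index,
                           "content_block": block})
        if kind == "reasoning":
            delta = {"type": "thinking_delta", "thinking": text}
        else:
            delta = {"type": "text_delta", "text": text}
        events.append({"type": "content_block_delta", "index": index, "delta": delta})
    if open_kind is not None:
        events.extend(_stop_events(index, open_kind))
    return events
```

Tracking the currently open block kind is what prevents a second `content_block_start` from being emitted when several reasoning chunks arrive in a row.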

* server: refactor Anthropic streaming state to avoid raw pointer

Replace raw pointer to task_result_state with direct field copies:
- Copy state fields in update() before processing chunk
- Use local copies in to_json_anthropic() instead of dereferencing
- Pre-compute state updates for next chunk in update()

This makes the data flow clearer and avoids unsafe pointer patterns.

* Patch perf regression for mmq kernels in ROCm

recover performance regression for #17917

* add n_experts branch like the cdna path

* mmq.cu: tune mmq/wmma switching for RDNA

* mmq.cu: move amd wmma mmq/wmma switching behind IS_RDNA3

* Update ggml/src/ggml-cuda/mmq.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Jiacheng (Jason) Chen <76919340+jiachengjason@users.noreply.github.com>
Co-authored-by: jiachengjason <jasonchen.jiacheng@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* ggml-cuda: refactor cuda graph usage

* use is_enabled() instead of enabled

* vulkan: support buffer_from_host_ptr

* hacky use of buffer_from_host_ptr for directio

* disable buffer_from_host_ptr cap

* use external memory for ggml_vk_host_malloc, revert model loader changes

* disable external_memory_host for MoltenVK

* take buffer memory types into account

* don't use external_memory_host for ggml_vk_host_malloc

* arg: use CSV escape style for multiple-value args

* add test
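As a sketch of what CSV-style escaping means for a multiple-value argument, the following Python snippet splits one argument string with the standard `csv` module. The exact quoting rules implemented in the argument parser are not stated in the commit message, so treating them as standard CSV (double quotes around values, `""` for a literal quote) is an assumption, and `split_multi_value` is a hypothetical name.

```python
import csv


def split_multi_value(arg):
    """Split one comma-separated argument string using standard CSV quoting."""
    # csv.reader expects an iterable of lines; a single-element list is enough.
    rows = list(csv.reader([arg]))
    return rows[0] if rows else []


if __name__ == "__main__":
    # A value containing a comma survives when it is wrapped in double quotes.
    print(split_multi_value('a,"b,c",d'))
```

The point of this style is that a value may itself contain the separator without needing a second, ad hoc escape character.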
* ggml : optimize cuda ssm_scan using warp-level reduction

* ggml : apply code review suggestions (style, const, constexpr)

* ggml : add TODO regarding stride consistency
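For readers unfamiliar with warp-level reduction, the general technique can be simulated in plain Python as below. This is only a model of a shuffle-down tree reduction across the 32 lanes of a warp; it is not the `ssm_scan` kernel, and the function name is hypothetical.

```python
WARP_SIZE = 32


def warp_reduce_sum(lanes):
    """Simulate a shuffle-down sum reduction over one warp; lane 0 ends with the total."""
    if len(lanes) != WARP_SIZE:
        raise ValueError("expected exactly one value per lane")
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane i reads the value held by lane i + offset.
        # Lanes whose source would be out of range read their own value,
        # which leaves garbage in the upper lanes but never affects lane 0.
        shifted = [vals[i + offset] if i + offset < WARP_SIZE else vals[i]
                   for i in range(WARP_SIZE)]
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals[0]


if __name__ == "__main__":
    print(warp_reduce_sum(list(range(WARP_SIZE))))
```

The reduction finishes in five steps (offsets 16, 8, 4, 2, 1) rather than 31 sequential additions, and on a GPU it needs no shared memory because values move directly between registers.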
Change is decoupled from #18641.

[LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B)
needs streaming istft for generating output audio.

* add streaming ISTFT class (`mtmd_audio_streaming_istft`) with overlap-add for audio reconstruction
* replace global audio cache with per-instance cache, the model requires
  two independent caches, for preprocessing (audio input) and for istft
  (audio output).
* unified templated FFT/IFFT implementation supporting both forward and inverse transforms
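The overlap-add step of a streaming ISTFT can be sketched in Python as follows. This is a simplified model, not `mtmd_audio_streaming_istft`: it assumes each frame has already been inverse-transformed to the time domain, and it omits windowing and window-sum normalization. The class and method names are hypothetical.

```python
class StreamingOverlapAdd:
    """Accumulate overlapping time-domain frames and emit `hop` samples per frame."""

    def __init__(self, frame_len, hop):
        if not 0 < hop <= frame_len:
            raise ValueError("hop must be in 1..frame_len")
        self.frame_len = frame_len
        self.hop = hop
        # Per-instance buffer holding the not-yet-finalized tail of the signal.
        self.buf = [0.0] * frame_len

    def push(self, frame):
        """Add one frame into the buffer and return the samples that are now final."""
        if len(frame) != self.frame_len:
            raise ValueError("unexpected frame length")
        for i, sample in enumerate(frame):
            self.buf[i] += sample
        # The first `hop` samples will not be touched by any later frame.
        out = self.buf[:self.hop]
        # Slide the buffer forward and zero-fill the newly exposed tail.
        self.buf = self.buf[self.hop:] + [0.0] * self.hop
        return out
```

Keeping the buffer inside the instance, rather than in a global, is what lets two independent streams (for example one for input preprocessing and one for output reconstruction) run side by side without sharing state.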
pull[bot] locked and limited conversation to collaborators Jan 6, 2026
pull[bot] added the ⤵️ pull label Jan 6, 2026
pull[bot] merged commit ccbc84a into LongLeCE:master Jan 6, 2026
60 of 75 checks passed
8 participants