Surface model reasoning as delta.reasoning_content in streaming & sync responses#10

Merged
notabd7-deepshard merged 1 commit into deepshard:main from robsltd:feat/reasoning-content
Feb 12, 2026
Conversation

robsltd (Contributor) commented Feb 11, 2026

Summary

  • Rework _StreamFilter to capture thinking/reasoning content instead of silently dropping it, emitting it as delta.reasoning_content in streaming SSE chunks and as message.reasoning_content in sync responses
  • Follows the DeepSeek / OpenAI convention for surfacing chain-of-thought in OpenAI-compatible APIs

Edge cases handled:

  • <think> tag split across gRPC chunk boundaries
  • Stream ending mid-think (before </think>)
  • Multiple <think> blocks mid-response
  • Leading \n after the <think> tag stripped from reasoning output
  • Non-reasoner models never emit reasoning_content
  • Existing clients that don't read reasoning_content are unaffected (additive field)

Problem

When using a reasoning model through the proxy, the model's chain-of-thought is completely stripped from responses. Clients that want to display or log reasoning have no way to access it.

To reproduce:

  1. Start the proxy with the reasoning model loaded
  2. Send a streaming request:
    curl -N http://127.0.0.1:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"auto","stream":true,"messages":[{"role":"user","content":"What is 2+2?"}]}'
  3. Observe: all <think>...</think> content is silently dropped. Only delta.content chunks appear — no reasoning is surfaced anywhere in the streaming response.
  4. For non-streaming, reasoning is only available via the non-standard debug.reasoning field and only when --debug is enabled.

Solution

Instead of discarding thinking content in _StreamFilter, capture it and emit it as delta.reasoning_content (streaming) / message.reasoning_content (sync), matching the convention used by DeepSeek and OpenAI for reasoning models.

After this change, the same curl now produces:

data: {"choices":[{"delta":{"reasoning_content":"Okay, the user is asking..."}}]}
data: {"choices":[{"delta":{"reasoning_content":"...simple arithmetic..."}}]}
data: {"choices":[{"delta":{"content":"4"}}]}
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
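Each `data:` line above is a standalone JSON chunk, so a client can reassemble reasoning and answer text by accumulating the two delta fields separately. A minimal sketch of such a client-side helper (hypothetical, not part of this PR; assumes Python 3.9+ for `str.removeprefix`):

```python
import json

def split_stream_events(sse_lines):
    """Accumulate reasoning vs. visible text from OpenAI-style SSE data lines.

    Illustrative helper: assumes each element is a 'data: {...}' line as in
    the stream above, terminated by 'data: [DONE]'.
    """
    reasoning, content = [], []
    for line in sse_lines:
        payload = line.removeprefix("data: ").strip()
        if not payload or payload == "[DONE]":
            continue  # keep-alive blank or end-of-stream sentinel
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if delta.get("reasoning_content"):
            reasoning.append(delta["reasoning_content"])
        if delta.get("content"):
            content.append(delta["content"])
    return "".join(reasoning), "".join(content)
```

Because reasoning_content is additive, a client that only reads delta.content behaves exactly as before.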

Non-streaming responses include both fields on the message:

{
  "message": {
    "role": "assistant",
    "content": "4",
    "reasoning_content": "Okay, the user is asking..."
  }
}
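Assembling that sync message from the filter's output can be sketched as follows (a hypothetical helper, not code from this PR): reasoning_content is attached only when reasoning text was actually captured, which is how non-reasoner models and existing clients see no new field.

```python
def build_message(visible: str, reasoning: str) -> dict:
    """Illustrative sketch: build the sync-response message dict.

    reasoning_content is added only when non-empty, keeping the field
    strictly additive for models that emit no <think> block.
    """
    msg = {"role": "assistant", "content": visible}
    if reasoning:
        msg["reasoning_content"] = reasoning
    return msg
```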

_StreamFilter changes: feed() and finalize() now return (visible, reasoning) tuples instead of a single string. Phase 1 (initial CoT block) and Phase 2 (mid-stream <think> blocks) both capture reasoning instead of discarding it.

Testing

Tested against Qwen3-30B-A3B:

  • Streaming: reasoning_content chunks flow, clean transition to content, no tag leaks
  • Non-streaming: both fields present and clean
  • finish_reason + data: [DONE] emitted correctly

notabd7-deepshard merged commit 599e9fd into deepshard:main on Feb 12, 2026
3 checks passed