Description
Thanks for open-sourcing step-audio-r1. In my use, its audio comprehension capability surpasses that of the other open-source large audio models released around the same time. You did an amazing job!
However, the model sometimes outputs "\n\n{"model":"gemini-2.5-pro-vision-provider"". My guess is that the training data was contaminated in cases where the JSON response from gemini-2.5-pro was not parsed correctly. The contaminated data may also leak the prompts used to call gemini-2.5-pro, because I sometimes get responses like this:
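\n</think>\n{\"model\":\"gemini-2.5-pro-vision-provider\",\"prompt\":\"请仔细聆听音频内容,根据音频内容,以中文进行转写和分析。请先将音频内容准确转写为文字,然后对说话人的特征(如性别、年龄、情绪状态等)进行详细描述。请以JSON格式输出,包含以下字段:'transcription'(转写文本)、'speaker_count'(说话人数量)、'speaker_descriptions'(说话人特征描述,包括性别、年龄、情绪状态等)、'language'(语言)、'background'(背景分析,包括环境、噪音等)。请注意,说话人特征描述需详细,且需符合音频内容。
(English translation of the leaked prompt text: "Please listen carefully to the audio and, based on its content, transcribe and analyze it in Chinese. First transcribe the audio accurately into text, then describe the speakers' characteristics (such as gender, age, and emotional state) in detail. Output in JSON format with the following fields: 'transcription' (transcribed text), 'speaker_count' (number of speakers), 'speaker_descriptions' (speaker characteristics, including gender, age, emotional state, etc.), 'language' (language), 'background' (background analysis, including environment, noise, etc.). Note that the speaker descriptions must be detailed and consistent with the audio content." The output is cut off at this point.)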
Here is the corresponding screenshot:
You can reproduce it with the following call (audio samples from either Emilia or AudioCaps reproduce it):
import traceback


def uac_test(model, wav_path):
    """Test universal audio caption generation with detailed analysis."""
    # The system prompt (in Chinese) asks the model to act as an audio-analysis
    # expert: transcribe the audio and describe speaker traits (gender, age,
    # emotion), background sounds, environment, and any events involved.
    messages = [
        {"role": "system", "content": "你是一位经验丰富的音频分析专家,擅长对各种语音音频进行深入细致的分析。你的任务不仅仅是将音频内容准确转写为文字,还要对说话人的声音特征(如性别、年龄、情绪状态)、背景声音、环境信息以及可能涉及的事件进行全面描述。请以专业、客观的视角,详细、准确地完成每一次分析和转写。"},
        {"role": "human", "content": [{"type": "audio", "audio": wav_path}]},
        # Pre-fill the assistant turn so generation starts inside the thinking block.
        {"role": "assistant", "content": "<think>\n", "eot": False},
    ]
    full_text = ""
    try:
        # Accumulate only the text part of each streamed chunk.
        for response, text, audio in model.stream(messages, max_tokens=1024, temperature=0.5, top_p=0.9, stop_token_ids=[151665]):
            if text:
                full_text += text
    except Exception as e:
        print(f"Error during streaming: {e}")
        traceback.print_exc()
    print("\n\nFull response:", full_text)
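To check many outputs for this pattern without reading each one by hand, something like the following may help. It is only a minimal sketch: the regular expression is an assumption based on the two outputs quoted above, and `LEAK_PATTERN` and `find_leaked_provider` are hypothetical names that are not part of the step-audio-r1 code.

```python
import re

# Assumed leak signature, derived only from the outputs quoted in this issue:
# a JSON-like fragment that starts with {"model":"gemini-...".
LEAK_PATTERN = re.compile(r'\{\s*"model"\s*:\s*"(gemini-[^"]*)"')


def find_leaked_provider(text):
    """Return the leaked provider name found in `text`, or None if absent."""
    # Some outputs contain escaped quotes (\"), so unescape them before matching.
    normalized = text.replace('\\"', '"')
    match = LEAK_PATTERN.search(normalized)
    return match.group(1) if match else None
```

For example, calling `find_leaked_provider(full_text)` at the end of `uac_test` would flag the affected samples, which makes it easier to estimate how often the problem occurs on a dataset.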
Looking forward to the official response. Thank you!