The training data are contaminated, and the prompts are leaked? #12

@Qoboty

Thanks for open-sourcing step-audio-r1. In my use, its audio comprehension capability surpasses that of the other open-source large audio models released around the same time. You guys did an amazing job!

However, I find that step-audio-r1 sometimes outputs "\n\n{"model":"gemini-2.5-pro-vision-provider"". My guess is that the training data were contaminated when a JSON response from gemini-2.5-pro was not parsed correctly. The contaminated data may also leak the prompts that were used to call gemini-2.5-pro, because I sometimes get a response like this:

\n</think>\n{\"model\":\"gemini-2.5-pro-vision-provider\",\"prompt\":\"请仔细聆听音频内容,根据音频内容,以中文进行转写和分析。请先将音频内容准确转写为文字,然后对说话人的特征(如性别、年龄、情绪状态等)进行详细描述。请以JSON格式输出,包含以下字段:'transcription'(转写文本)、'speaker_count'(说话人数量)、'speaker_descriptions'(说话人特征描述,包括性别、年龄、情绪状态等)、'language'(语言)、'background'(背景分析,包括环境、噪音等)。请注意,说话人特征描述需详细,且需符合音频内容。

(English translation of the prompt text in the output above, which is cut off before the closing quote: "Please listen carefully to the audio and, based on its content, transcribe and analyze it in Chinese. First transcribe the audio accurately into text, then describe the speaker's characteristics (such as gender, age, emotional state) in detail. Output in JSON format with the following fields: 'transcription' (transcribed text), 'speaker_count' (number of speakers), 'speaker_descriptions' (speaker characteristics, including gender, age, emotional state, etc.), 'language' (language), 'background' (background analysis, including environment, noise, etc.). Note that the speaker descriptions must be detailed and must match the audio content.")

Here is the corresponding screenshot:

Image

You can reproduce it with the following call (audio from either Emilia or AudioCaps reproduces it):

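import traceback


def uac_test(model, wav_path):
    """Test universal audio caption generation with detailed analysis."""
    # System prompt (Chinese), roughly: "You are an experienced audio analysis
    # expert, skilled at in-depth, detailed analysis of all kinds of speech audio.
    # Your task is not only to transcribe the audio accurately, but also to
    # comprehensively describe the speaker's voice characteristics (gender, age,
    # emotional state), background sounds, environment, and any events involved.
    # Complete every analysis and transcription in detail and accurately, from a
    # professional, objective perspective."
    messages = [
        {"role": "system", "content": "你是一位经验丰富的音频分析专家,擅长对各种语音音频进行深入细致的分析。你的任务不仅仅是将音频内容准确转写为文字,还要对说话人的声音特征(如性别、年龄、情绪状态)、背景声音、环境信息以及可能涉及的事件进行全面描述。请以专业、客观的视角,详细、准确地完成每一次分析和转写。"},
        {"role": "human", "content": [{"type": "audio", "audio": wav_path}]},
        # Prefill the assistant turn so generation starts inside the reasoning block.
        {"role": "assistant", "content": "<think>\n", "eot": False},
    ]
    full_text = ""
    try:
        # Accumulate only the streamed text; the other two yielded values are unused here.
        for response, text, audio in model.stream(messages, max_tokens=1024, temperature=0.5, top_p=0.9, stop_token_ids=[151665]):
            if text:
                full_text += text
    except Exception as e:
        print(f"Error during streaming: {e}")
        traceback.print_exc()
    print("\n\nFull response:", full_text)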
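
To check how often this happens across a dataset, the accumulated `full_text` could be scanned for the leaked fragment. Below is a minimal sketch, not part of the step-audio-r1 codebase: `find_leak` and `LEAK_MARKER` are hypothetical names, and the marker string is simply the one seen in the output quoted above. Since the leaked JSON is usually truncated, the sketch uses a regular expression rather than a JSON parser.

```python
import re

# Marker copied from the leaked output quoted in this issue (assumption:
# other leaked samples contain the same provider string).
LEAK_MARKER = "gemini-2.5-pro-vision-provider"


def find_leak(full_text, marker=LEAK_MARKER):
    """Return details about a leaked fragment, or None if the marker is absent."""
    idx = full_text.find(marker)
    if idx == -1:
        return None
    # Walk back to the opening brace of the JSON-like fragment, if there is one.
    start = full_text.rfind("{", 0, idx)
    fragment = full_text[start:] if start != -1 else full_text[idx:]
    # The fragment is often cut off mid-string, so json.loads would fail.
    # Capture everything after the "prompt" key instead; the optional
    # backslashes tolerate output whose quotes are shown escaped.
    match = re.search(r'\\?"prompt\\?"\s*:\s*\\?"(.*)', fragment, flags=re.DOTALL)
    prompt = match.group(1) if match else None
    return {"offset": idx, "fragment": fragment, "prompt": prompt}
```

Calling `find_leak(full_text)` at the end of `uac_test` and counting the non-`None` results over a set of wav files would give a rough leak rate for a given sampling configuration.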

Looking forward to the official response. Thank you!
