TargetDiarization is a deep learning-based audio processing system designed to identify and extract the speech content of a specific target speaker from multi-speaker conversations. The system integrates multiple components including audio processing, speaker separation, automatic speech recognition (ASR), and speaker diarization. It can accurately isolate the target speaker’s speech from complex audio environments and convert it into text.
webui_demo.mp4
↑ Please unmute the video above before playing ↑
- Task: Separate the target speaker’s audio from a multi-speaker mixture and perform ASR for all speakers in the audio.
- Input: Multi-speaker mixture audio and a pre-recorded target speaker sample.
- Output: Per-speaker diarization results and the separated target speaker audio.
- A highly engineered, integrated project built on open-source models, fusing multiple state-of-the-art (SOTA) models to ensure top performance across processing stages (separation, denoising, recognition, etc.).
- End-to-end solution from audio preprocessing to transcription, supporting both non-streaming and real-time streaming modes for diverse scenarios.
- Multiple access methods provided out of the box: command line, REST API, WebSocket, and web UI.
- Parameterized design allowing you to swap or tune models and parameters as needed.
- Modular architecture: Python files starting with an uppercase letter can be used as standalone packages and imported into your own projects.
- 2025.9.25: Initial release
The system adopts a multi-model fusion architecture:
- Endpoint detection: CAM++ Diarization
- Overlap detection: Pyannote Diarization
- Audio denoising: UVR-MDX-Net
- VAD: FSMN-Monophone VAD
- Speech separation: MossFormer2 (self-finetuned version)
- Audio restoration: Apollo
- Speaker recognition: ERes2NetV2-Large
- ASR: Paraformer / Whisper / SenseVoice
- Punctuation restoration: CT-Transformer
- Python 3.10
- NVIDIA CUDA 12.1
- 16GB+ RAM
- 8GB+ VRAM
- Create a virtual environment

```bash
conda create -n target_diarization python=3.10
```

- Activate the environment

```bash
conda activate target_diarization
```

- Install PyTorch and other dependencies

```bash
conda install pytorch==2.2.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```
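Optionally, before moving on, a quick check (plain PyTorch, nothing project-specific) that the CUDA build is installed and the GPU is visible:

```python
# Sanity check: confirm the CUDA-enabled PyTorch build can see the GPU.
import torch

print(torch.__version__)               # expected: 2.2.2
print(torch.cuda.is_available())       # should print True
print(torch.cuda.get_device_name(0))   # name of the GPU that will be used
```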
Clone the codebase:

```bash
git clone https://github.com/jingzhunxue/TargetDiarization.git
```

Download all pretrained models:

```bash
git-lfs clone https://www.modelscope.cn/models/jzx-ai-lab/target_diarization_models.git
```
Model directory structure:

```
TargetDiarization/
├── iic/
│   ├── punc_ct-transformer_zh-cn-common-vocab272727-pytorch/
│   ├── speech_campplus_speaker-diarization_common/
│   ├── speech_campplus_sv_zh-cn_16k-common/
│   ├── speech_eres2netv2w24s4ep4_sv_zh-cn_16k-common/
│   ├── speech_fsmn_vad_zh-cn-16k-common-pytorch/
│   └── speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/
├── pyannote/
│   └── speaker-diarization-3.1/
├── checkpoints/
│   └── mossformer2-finetune/
├── mdx/
│   └── weights/
├── JusperLee/
│   └── Apollo/
├── main.py
└── ...
```
- Copy `.env.example` in the project to a new file named `.env` (see the command below).
- Parameters in `.env` are initialization parameters; they are fixed once the project starts, so changes require a restart.
- Adjustable items include model paths, model parameters, the GPU device, disabled modules, etc.
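On Linux or macOS, for example:

```bash
cp .env.example .env
```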
Non-streaming invocation:
```python
from TargetDiarization import TargetDiarization

# Initialize the pipeline
td = TargetDiarization(cuda_device=0)

# Process an audio file
target_spk, results, target_audio = td.infer(
    wav_file="conversation.wav",       # Input audio file to process
    target_file="target_speaker.wav"   # Target speaker sample (optional)
)

# Inspect results
for result in results:
    print(f"Speaker: {result['speaker']}")
    print(f"Time: {result['timerange']}")
    print(f"Text: {result['text']}")
```

Non-streaming demo only:
```bash
# Start the Gradio UI
python webui.py

# Open in browser: http://localhost:8300/target-diarization
```

Returned results:
```json
[
  {
    "speaker": "0",                // Speaker ID
    "timerange": [0.031, 1.702],   // Segment time range (seconds)
    "text": "Anyway, it’s just the freshman arrival.",  // ASR text (ASR model can be customized)
    "type": "single",              // Segment type (single = single speaker, overlap = overlapped)
    "score": 0.748                 // Similarity score to the target speaker (non-target = -1.0)
  },
  {...}
]
```
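Since non-target segments are returned with a score of -1.0, the target speaker's lines can be pulled out with a small filter; a sketch assuming `results` is the list returned by `td.infer()` above:

```python
# Keep only segments attributed to the target speaker
# (segments from other speakers carry score == -1.0).
target_segments = [r for r in results if r["score"] != -1.0]

for seg in target_segments:
    start, end = seg["timerange"]
    print(f"[{start:.3f}s - {end:.3f}s] {seg['text']}")
```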
```bash
# Start the API service
python main.py

# Service URL: http://localhost:8000
# API Docs: http://localhost:8000/docs
```

Test the API with curl (non-streaming):
```bash
curl -X POST "http://localhost:8000/diarization/infer" \
  -F "audio_file=@conversation.wav" \
  -F "target_file=@target_speaker.wav" \
  -F "sampling_rate=16000"
```
Open the web demo in a browser (non-streaming + streaming):

```bash
# Open: demo.html
```

< Refer to: Basic Usage - Web API Service >
GET /health
Check service status and model loading state.
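A quick way to probe it once the API service is running:

```bash
curl http://localhost:8000/health
```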
Response example:
```json
{
  "status": "healthy",
  "model_loaded": true,
  "timestamp": 1703123456.789
}
```
POST /diarization/infer

Upload an audio file for speaker separation and ASR.
Request parameters:
- `audio_file` (file, required): Input audio file
- `target_file` (file, optional): Target speaker sample audio
- `sampling_rate` (int, default = 16000): Audio sampling rate
- `is_single` (bool, default = false): Whether to use single-speaker mode
- `output_target_audio` (bool, default = true): Whether to return the target speaker audio
Usage example:
JavaScript fetch:
```javascript
const formData = new FormData();
formData.append('audio_file', audioFile);
formData.append('target_file', targetFile);  // optional
formData.append('sampling_rate', '16000');

fetch('http://localhost:8000/diarization/infer', {
  method: 'POST',
  body: formData
})
  .then(response => response.json())
  .then(data => console.log(data));
```

Response format:
```json
{
  "success": true,
  "data": {
    "target_speaker_id": "1",
    "total_speakers": 2,
    "results": [
      {
        "speaker": "1",
        "speaker_type": "target",
        "timerange": [0.0, 3.5],
        "text": "Hello, the weather is nice today.",
        "type": "single"
      },
      {
        "speaker": "0",
        "speaker_type": "other",
        "timerange": [3.5, 6.2],
        "text": "Yes, it’s great for going out.",
        "type": "single"
      }
    ],
    "statistics": {
      "total_duration": 15.3,
      "target_speaker_duration": 8.7,
      "other_speakers_duration": 6.6
    },
    "target_audio_base64": "UklGRiQAAABXQVZFZm10..."
  },
  "processing_time": 2.45
}
```
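The `target_audio_base64` field carries the separated target-speaker audio as a base64 string; judging by the `UklGR...` (RIFF) prefix in the example it is a complete WAV file, so saving it can be as simple as the sketch below (`response_json` is assumed to be the parsed API response):

```python
# Decode the separated target-speaker audio returned by /diarization/infer.
# Assumes the base64 payload is a complete WAV file (RIFF header in the example above).
import base64

audio_bytes = base64.b64decode(response_json["data"]["target_audio_base64"])

with open("target_speaker_output.wav", "wb") as f:
    f.write(audio_bytes)
```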
WS /diarization/stream

WebSocket streaming supports real-time audio transmission and returns results as soon as they are available.
Connection flow:
- Establish connection

```javascript
const websocket = new WebSocket('ws://localhost:8000/diarization/stream');
```

- Send configuration
```javascript
const config = {
  type: "config",
  data: {
    sampling_rate: 16000,
    is_single: false,
    output_target_audio: false,
    has_target_file: true  // if a target sample is provided
  }
};
websocket.send(JSON.stringify(config));
```

- Send target audio (optional)
```javascript
// Convert the audio file to base64
const targetAudioBase64 = await fileToBase64(targetFile);
websocket.send(JSON.stringify({
  type: "target_audio",
  data: targetAudioBase64
}));
```

- Send audio stream data
```javascript
// Send an audio chunk
websocket.send(JSON.stringify({
  type: "audio_chunk",
  data: audioChunkBase64
}));

// End the audio stream
websocket.send(JSON.stringify({
  type: "audio_end"
}));
```
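The same flow can also be driven from Python. Below is a minimal, illustrative sketch using the third-party `websockets` package; the message types mirror the protocol above and the message formats listed next, but the chunk size and the practice of streaming raw WAV bytes are assumptions, not the project's reference client:

```python
# Illustrative streaming client sketch using the third-party `websockets` package.
import asyncio
import base64
import json

import websockets


async def stream_file(wav_path: str, target_path: str | None = None):
    async with websockets.connect("ws://localhost:8000/diarization/stream") as ws:
        # Send configuration
        await ws.send(json.dumps({
            "type": "config",
            "data": {
                "sampling_rate": 16000,
                "is_single": False,
                "output_target_audio": False,
                "has_target_file": target_path is not None,
            },
        }))

        # Optionally send the target speaker sample
        if target_path:
            with open(target_path, "rb") as f:
                await ws.send(json.dumps({
                    "type": "target_audio",
                    "data": base64.b64encode(f.read()).decode(),
                }))

        # Send the audio in chunks, then signal the end of the stream
        with open(wav_path, "rb") as f:
            while chunk := f.read(32000):  # chunk size is arbitrary here
                await ws.send(json.dumps({
                    "type": "audio_chunk",
                    "data": base64.b64encode(chunk).decode(),
                }))
        await ws.send(json.dumps({"type": "audio_end"}))

        # Read messages until the server reports completion
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "segment_result":
                print(msg["data"]["segment"])
            elif msg["type"] == "status" and msg.get("message") == "completed":
                break
            elif msg["type"] == "error":
                raise RuntimeError(msg.get("message"))


asyncio.run(stream_file("conversation.wav", "target_speaker.wav"))
```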
Message formats:

Configuration acknowledgment:

```json
{
  "type": "config_ack",
  "data": {
    "config": {...},
    "target_file_loaded": true
  }
}
```

Real-time result:
```json
{
  "type": "segment_result",
  "data": {
    "target_speaker_id": "1",
    "segment": {
      "speaker": "1",
      "speaker_type": "target",
      "timerange": [2.0, 4.5],
      "text": "This is real-time recognized text",
      "type": "single"
    }
  }
}
```

Status message:
```json
{
  "type": "status",
  "message": "completed"
}
```

Error message:
```json
{
  "type": "error",
  "message": "Error details"
}
```

- CAM++ Diarization
- Pyannote Diarization
- FSMN-Monophone VAD
- Paraformer-Large
- CT-Transformer
- ERes2NetV2-Large
- Apollo
- Look2hear
- Ultimate Vocal Remover
Apache License 2.0
