Safety evaluation for large models currently lacks comprehensive, standardized protocols and dedicated assessment tools. DeepSafe is the first all-in-one framework that integrates 25+ safety datasets with the specialized ProGuard evaluation model, supporting full-modal LLM/VLM assessment.
DeepSafe is part of DeepSight and works best with DeepScan (an LLM/MLLM diagnosis toolkit). See the unified evaluation-to-diagnosis workflow on the HomePage.
DeepSafe features a modular, configuration-driven architecture that automates the full pipeline, from inference and generation through judging to evaluation reporting. It provides a reproducible, highly scalable safety infrastructure for AI Safety research, aiming to move safety assessment from superficial testing to in-depth analysis and accelerate the construction of Trustworthy AI.
- High Extensibility: Powered by a Registry mechanism, new components (datasets, metrics, etc.) can be integrated with minimal code. It supports one-click assembly through YAML and allows evaluation pipelines to be decoupled and reused.
- Streamlined Usability: Adopts a "Config-as-Execution" paradigm. Simply provide a single YAML file, and the framework automatically completes the full cycle and generates standardized reports.
- Comprehensive Coverage: Provides granular outputs, including evaluation scores, detailed response logs, error sampling, and human-readable Markdown reports, facilitating in-depth analysis and reproduction.
- Proactive Risk Identification: Introduces a pioneering proactive detection paradigm capable of reasoning about and describing unknown risks, transcending the rigid constraints of predefined classification systems.
- Eradicating Modality Bias: Implements a hierarchical multimodal safety taxonomy trained on a balanced dataset of 87,000 samples, ensuring fair and precise risk assessment across both text and visual modalities.
DeepSafe supports major open-source models and commercial APIs, allowing flexible switching between evaluation backends.
| Open-source Models (via vLLM/HF) | API Models |
|---|---|
| • Llama / Llama3 / Alpaca / Vicuna | • OpenAI (GPT-4/3.5) |
| • Qwen / Qwen2 / Qwen2.5 / Qwen3 | • Gemini |
| • GLM / ChatGLM2 / ChatGLM3 | • Claude |
| • InternLM / InternLM2.5 | • ZhipuAI (ChatGLM) |
| • Baichuan / Baichuan2 | • Baichuan API |
| • Yi / Yi-1.5 / Yi-VL | • ByteDance (YunQue) |
| • Mistral / Mixtral | • Huawei (PanGu) |
| • Gemma / Gemma 2 | • Baidu (ERNIEBot) |
| • DeepSeek (Coder/Math) | • 360 / MiniMax / SenseTime |
| • BlueLM / TigerBot / WizardLM | • Xunfei (Spark) |
| • ...... | • ...... |
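Switching backends amounts to swapping the `model` block of the task YAML. The snippet below is a minimal sketch: the field names follow the parameter table later in this README, while the model names, path, and URL are placeholders.

```yaml
# Local open-source backend served through vLLM (a sketch; values are placeholders).
model:
  type: VLLMLocalModel
  model_name: /your/path/to/an-open-source-model   # local path or HF ID
  max_tokens: 512

# Commercial API backend -- swap in a block like this instead.
# model:
#   type: APIModel
#   model_name: gpt-4
#   api_base: https://api.openai.com/v1             # service URL
#   concurrency: 8                                  # parallel inference requests
```

The currently integrated safety benchmarks are listed below.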
| Name | Description |
|---|---|
| Salad-Bench | Joint safety benchmark covering multi-dimensional and multi-lingual evaluation. |
| HarmBench | Standardized benchmark for model robustness against Jailbreak attacks. |
| Do-Not-Answer | Evaluates model refusal capabilities for harmful prompts. |
| BeaverTails | Large-scale safety dataset for human preference alignment. |
| MM-SafetyBench | Multi-dimensional benchmark for multi-modal large models. |
| VLSBench | Vision-language safety benchmark focusing on image-text alignment. |
| FLAMES | Fine-grained safety alignment and evaluation framework. |
| XSTest | Benchmark for measuring "Exaggerated Safety" (over-refusal) tendencies. |
| SIUO | Multi-modal hidden harmful intent discernment test. |
| Uncontrolled-AIRD | AI risk detection in uncontrolled scenarios. |
| TruthfulQA | Measures content truthfulness and resistance to misleading prompts. |
| HaluEval-QA | Hallucination evaluation for question-answering scenarios. |
| MedHallu | Hallucination benchmark for the medical domain. |
| MossBench | Comprehensive benchmark for safety and capabilities. |
| Fake-Alignment | Differentiation between true and deceptive alignment. |
| Sandbagging | Measures intentional hiding of model capabilities. |
| Evaluation-Faking | Assessment of cheating/manipulation during evaluation. |
| WMDP | Safety measurement in hazardous knowledge domains (e.g., bio-chemical). |
| MASK | Evaluation for model Deceptive Alignment. |
| MSSBench | Multi-stage fine-grained safety standard tests. |
| BeHonest | Honesty and self-knowledge evaluation for LLMs. |
| Deception-Bench | Specialized tests for deceptive model behaviors. |
| Ch3EF | Multi-level and multi-dimensional safety capability assessment. |
| Manipulation-Persuasion-Conv | Tests resistance to manipulation in conversations. |
| Reason-Under-Pressure | Logic reasoning tests under high-pressure constraints. |
DeepSafe provides a standardized workflow consisting of four stages: Configure -> Inference -> Evaluation -> Visualization.
```bash
# Recommended to use a virtual environment
pip install -r requirements.txt

# huggingface-cli is required for downloading datasets
pip install -U huggingface_hub

# Verify if the environment is set up correctly
python smoke_test.py
```

Most datasets supported by DeepSafe are hosted on Hugging Face. You can use huggingface-cli to download them.
Download Example (e.g., Do-Not-Answer):
```bash
# Download dataset to a local directory
huggingface-cli download --repo-type dataset --resume-download LibrAI/do-not-answer --local-dir data/do-not-answer --local-dir-use-symlinks False
```

After downloading, please update the `dataset.path` field in the corresponding YAML configuration file (e.g., `configs/eval_tasks/do_not_answer_v01.yaml`) to point to your local path.
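For reference, the `dataset` block would then look roughly like the sketch below; the `type` value shown here is a placeholder, so keep whatever class name the shipped config already uses and only change `path`.

```yaml
dataset:
  type: DoNotAnswerDataset        # placeholder; keep the class name from the shipped config
  path: data/do-not-answer        # local directory downloaded above
```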
The Salad-Bench dataset needs to be manually downloaded from the official repository. You can use the following command:
```bash
# Download the dataset
huggingface-cli download --repo-type dataset --resume-download OpenSafetyLab/Salad-Data --local-dir Salad-Data --local-dir-use-symlinks False
```

Then move or copy `base_set.json` to your specified local directory, and fill in your local path in the `dataset.path` field of the configuration file `configs/eval_tasks/salad_judge_local.yaml`:
```yaml
dataset:
  type: SaladDataset
  path: /your/path/Salad-Data/base_set.json
```

mdjudge (MD-Judge-v0.1) is the safety evaluator officially recommended by Salad-Bench. The model weights can be downloaded via HuggingFace:
```bash
# Download safetensors weight files and config files
huggingface-cli download --resume-download OpenSafetyLab/MD-Judge-v0.1 --local-dir MD-Judge-v0.1 --local-dir-use-symlinks False
```

Place the MD-Judge-v0.1 folder in any local directory, then set the `evaluator.judge_model_cfg.model_name` field to that local path, for example:
```yaml
evaluator:
  judge_model_cfg:
    type: VLLMLocalModel
    model_name: /your/local/dir/MD-Judge-v0.1
    # Other configurations remain unchanged
```

Note: For first-time use, make sure the evaluation machine can load the weight files and that the configured path matches the actual download location; otherwise, the evaluation will fail to find the model.
```yaml
dataset:
  type: SaladDataset
  path: /your/path/to/Salad-Data/base_set_sample_100.json

evaluator:
  type: ScorerBasedEvaluator
  batch_size: 32
  template_name: md_judge_v0_1
  judge_model_cfg:
    type: VLLMLocalModel
    model_name: /your/path/to/MD-Judge-v0.1
    tensor_parallel_size: 1
    gpu_memory_utilization: 0.4
    trust_remote_code: true
    temperature: 0.0
    max_tokens: 64
```

If you encounter other dependency or execution errors, please verify the config paths and model formats.
DeepSafe provides one-shot local scripts. Simply run:
```bash
# Execute from the project root
bash scripts/run_salad_local.sh configs/eval_tasks/salad_judge_local.yaml
```

- Configure: Specify target model, dataset paths, and judge model parameters in the YAML file.
- Inference: The script automatically starts vLLM (local mode) or calls APIs to generate responses, saved in `predictions.jsonl`.
- Evaluation: Launches ProGuard or other judge models to automate scoring and categorization.
- Visualization: Summarizes metrics and generates a human-readable `report.md` in the output directory (sketched below).
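Once the run finishes, the artifacts land under the directory configured in `runner.output_dir` (see the parameter table below). The layout sketched here assumes the default `results/` folder from the project structure; the task subdirectory name depends on your config.

```
results/
└── salad_judge_local/        # assumed task subdirectory
    ├── predictions.jsonl     # raw responses from the inference stage
    └── report.md             # human-readable summary from the visualization stage
```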
Parameters are explained below using `salad_judge_v01_qwen1.5-0.5b_vllm_local.yaml` as an example:
| Module | Parameter | Description |
|---|---|---|
| model | `type` | Loader class (`APIModel` / `VLLMLocalModel` / `HuggingFaceModel`) |
| | `model_name` | Local path or HF ID |
| | `api_base` | Service URL (starts vLLM automatically if set to localhost) |
| | `concurrency` | Parallel inference requests |
| | `max_tokens` | Maximum generation length |
| dataset | `type` | Dataset class name |
| | `path` | Path to dataset file |
| | `limit` | (Optional) Sample count for quick smoke tests |
| evaluator | `type` | Evaluation mode (e.g., `ScorerBasedEvaluator`) |
| | `template_name` | Prompt template ID |
| | `judge_model_cfg` | Judge model config (same structure as the model module) |
| metrics | `type` | Metric class (e.g., `SaladCategoryMetric`) |
| runner | `output_dir` | Path to save results and Markdown reports |
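Putting the modules together, a complete task config looks roughly like the following sketch. It mirrors the table above; model names and paths are placeholders, and any field not shown (including whether `metrics` takes a single entry or a list) should be kept as it appears in the shipped reference config.

```yaml
# End-to-end config sketch mirroring the parameter table above (values are placeholders).
model:
  type: VLLMLocalModel
  model_name: /your/path/to/qwen1.5-0.5b            # local path or HF ID
  max_tokens: 512

dataset:
  type: SaladDataset
  path: /your/path/to/Salad-Data/base_set.json
  limit: 100                                        # optional: quick smoke test

evaluator:
  type: ScorerBasedEvaluator
  template_name: md_judge_v0_1
  judge_model_cfg:                                  # same structure as the model block
    type: VLLMLocalModel
    model_name: /your/path/to/MD-Judge-v0.1

metrics:
  type: SaladCategoryMetric                         # structure may differ in the shipped config

runner:
  output_dir: results/salad_judge_local
```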
Integrate your own dataset in three simple steps:
Ensure your data file follows this JSONL structure:
{"id": "001", "prompt": "User Input", "reference": "Ground Truth", "category": "Label"}
- Dataset (`uni_eval/datasets/`): Inherit `BaseDataset` and override `load()` to read your JSONL (see the sketch below).
- Metric (`uni_eval/metrics/`): Inherit `BaseMetric` and override `compute()` to calculate scores.
- Evaluator: (Optional) Inherit `BaseEvaluator` for custom multi-stage judging logic.
Register your modules in the respective `__init__.py` files, create your YAML config, and you are ready!
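As a concrete illustration, here is a minimal sketch of a custom dataset and metric. The decorator names, registry objects, and base-class method signatures are assumptions based on the Registry mechanism described above; adapt them to the actual interfaces in `uni_eval/registry.py` and the existing implementations.

```python
# Minimal sketch of a custom dataset and metric for DeepSafe.
# NOTE: registry objects, decorator API, and method signatures are assumptions
# for illustration; follow the real base classes and registry in uni_eval/.
import json

from uni_eval.registry import DATASETS, METRICS      # assumed registry objects
from uni_eval.datasets import BaseDataset            # assumed import path
from uni_eval.metrics import BaseMetric              # assumed import path


@DATASETS.register_module()                          # assumed registration API
class MyJsonlDataset(BaseDataset):
    def load(self):
        """Read the JSONL file referenced by dataset.path in the YAML config."""
        with open(self.path, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


@METRICS.register_module()                           # assumed registration API
class RefusalRateMetric(BaseMetric):
    def compute(self, predictions, references):
        """Toy metric: fraction of responses containing an explicit refusal."""
        refused = sum("i cannot" in p.lower() or "i can't" in p.lower() for p in predictions)
        return {"refusal_rate": refused / max(len(predictions), 1)}
```

Once registered, the new class names can be referenced directly as `type` values in the `dataset` and `metrics` blocks of your YAML config.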
```
DeepSafe/
├── uni_eval/            # Core Evaluation Framework
│   ├── datasets/        # Dataset Loader Implementations
│   ├── models/          # Model Adapters (API/HF/vLLM)
│   ├── evaluators/      # Evaluation Logic Controllers
│   ├── metrics/         # Evaluation Metric Implementations
│   ├── runners/         # Task Execution Management
│   ├── summarizers/     # Result Summarization and Reporting
│   ├── cli/             # CLI Parsing Tools
│   └── registry.py      # Registry Center
├── configs/             # YAML Configuration Files
├── scripts/             # Launch Scripts (Shell)
├── tools/               # Utility Scripts
└── results/             # Evaluation Outputs (JSON/Markdown)
```
Contributions are welcome! Please submit an Issue or Pull Request to help us build DeepSafe.
- Email: shaojing@pjlab.org.cn