Harm Evaluation and Risk Assessment for Language Detection An in-development Benchmark for Evaluating Crisis Detection and Response in Large Language Models
Current state: Evaluation of OpenAI models on evaluating, labeling, and escalating synthetic conversational data Future state: Evaluate multiple models from various providers on a hybrid dataset (not provided in the github) based on expert labelled conversations created from real conversation data.
NOTE All of the data presented in this repository is fictional and synthetic. Any resemblance to real individuals, conversations, or events is purely coincidental and unintended. This data should not be interpreted as representing actual persons or real-world incidents; it exists solely for research, development, and evaluation purposes within the context of AI safety.
HERALD is a framework for generating and evaluating conversational safety datasets. It provides two main capabilities:
- Dataset Generation: Create realistic, synthetic conversational datasets with diverse personas and safety scenarios
- Safety Evaluation: Assess AI models' ability to detect and respond to safety risks in conversations
- Configuration Management: Centralized management of all parameters across the system (supports 200+ models via OpenRouter)
- based on the dataset generated on 2025-07-28 (n=100 conversations)
- Cost: ~ $1.00 USD
| Metric | Value |
|---|---|
| Precision | 0.806 |
| Recall | 0.862 |
| F1 Score | 0.833 |
| Escalation Flag | Correct / Total | Accuracy |
|---|---|---|
| escalation_required | 50 / 58 | 0.862 |
| no_escalation | 30 / 42 | 0.714 |
| Predicted Escalation | Predicted No Escalation | |
|---|---|---|
| Actual Escalation | True Positives: 50 | False Negatives: 8 |
| Actual No Escalation | False Positives: 12 | True Negatives: 30 |
- JSON-based configuration system with validation and error handling
- Configuration changes automatically propagate to all modules
- 5 personas representing different demographics and backgrounds
- Covers suicidal ideation, self-harm, harm to others, false positives, and benign conversations
- Enforced distribution across conversation types and turn lengths
- Uses function calling to trigger escalation when needed
- HTML reports with confusion matrices and conversation examples
- Full test suite for configuration management and integration
- Built-in commands for configuration validation and management in CLI
- Python 3.8+
- OpenRouter API key (provides access to multiple AI models including OpenAI, Anthropic, etc.)
- Clone the repository:
git clone <repository-url>
cd SafetyBench- Install dependencies:
pip install -r requirements.txt- Set up your OpenRouter API key:
export OPENROUTER_API_KEY="your-api-key-here"Or create a .env file:
cp .env.example .env
# Edit .env and add your OpenRouter API keyEvaluate an existing dataset:
python main.py evaluate --dataset data/safety_dataset_20250728T121437Z.jsonlGenerate 50 conversations with default settings:
python main.py generate --num-conversations 50View all available options:
python main.py --help
python main.py generate --help
python main.py evaluate --helpHerald uses OpenRouter to access multiple AI models. You can configure models in several ways:
Override models for specific operations:
# Use a specific model for generation
python main.py generate --generation-model "anthropic/claude-3.5-sonnet"
# Use a specific model for evaluation
python main.py evaluate --evaluation-model "openai/gpt-4o" --dataset data/my_dataset.jsonl
# Use different models for generation and evaluation
python main.py generate --generation-model "anthropic/claude-3.5-sonnet"
python main.py evaluate --evaluation-model "openai/gpt-4o" --dataset data/safety_dataset_*.jsonlPopular OpenRouter models include:
openai/gpt-5- Latest GPT-5 modelopenai/gpt-5-mini- Faster, more cost-effective GPT-5anthropic/claude-sonnet-4- Anthropic Claude Sonnet 4google/gemini-2.5-pro- Google's Gemini 2.5 Prodeepseek/deepseek-chat-v3-0324- Deepseek Chat V3
Default models are set in config.json:
{
"dataset_generation": {
"generation_model": "openai/gpt-4o-mini"
},
"safety_evaluation": {
"evaluation_model": "openai/gpt-4o"
}
}The dataset generation arm creates realistic conversations between AI assistants and personas experiencing various safety scenarios. Future validated datasets will not be stored in the repository, but synthetic datasets can be generated on demand.
# Generate 100 conversations (default)
python main.py generate
# Generate custom number of conversations
python main.py generate --num-conversations 200
# Specify output directory
python main.py generate --output custom_data/Each dataset contains:
- Conversation Types: Distributed across 5 categories
- Suicidal ideation (25%)
- Non-suicidal self-harm (25%)
- Harm to others (25%)
- False positives (10%)
- Benign conversations (15%)
- Turn Lengths: 2, 4, or 6 turns per conversation
- Personas: Randomly selected from diverse persona library
The output is a JSONL file where each line contains either:
- Dataset metadata (first line)
- Conversation metadata (persona info, safety labels)
- Message data (individual conversation turns)
Example conversation metadata:
{
"type": "metadata",
"conversation_id": "abc123",
"persona_name": "Persona Name",
"conversation_type": "suicidal_ideation",
"severity_level": "immediate_risk",
"escalation_flag": "escalation_required",
"turns": 4
}The safety evaluation arm assesses how well AI models can detect safety risks and trigger appropriate escalations.
# Evaluate a dataset
python main.py evaluate --dataset data/safety_dataset_20250728T121437Z.jsonl
# Save results to file
python main.py evaluate --dataset data/safety_dataset_20250728T121437Z.jsonl --output results.json
# Generate HTML visualizations
python main.py evaluate --dataset data/safety_dataset_20250728T121437Z.jsonl --save-htmlThe system provides comprehensive metrics:
- Overall Accuracy: Percentage of correct escalation decisions
- Tool Calling Accuracy: Accuracy of function calling behavior
- Per-Category Performance: Breakdown by conversation type
- Confusion Matrix: True/false positives and negatives
- Precision, Recall, F1: Standard classification metrics
For each conversation, the evaluator:
- Presents the conversation to the target model
- Asks the model to assess safety risks
- Checks if escalation function was called when needed
- Compares prediction to ground truth labels
- Calculates performance metrics
SafetyBench features a comprehensive configuration management system that centralizes all parameters and ensures consistency across all modules.
All configuration is managed through config.json:
{
"dataset_generation": {
"num_conversations": 100,
"generation_model": "gpt-4o-mini",
"generation_temperature": 0.7,
"max_tokens": 2048,
"cache_dir": "cache",
"output_dir": "data",
"distribution": {
"suicidal_ideation": 0.25,
"non_suicidal_self_harm": 0.25,
"harm_to_others": 0.25,
"false_positive": 0.10,
"benign": 0.15
},
"turns": [2, 4, 6]
},
"safety_evaluation": {
"evaluation_model": "gpt-4o",
"evaluation_temperature": 0.2,
"max_tokens": 1000
},
"personas": [...],
"conversation_prompts": {...},
"system_prompt": "...",
"safety_evaluation_prompt": "..."
}View current configuration:
python main.py config showValidate configuration file:
python main.py config validateReload configuration from file:
python main.py config reload --file path/to/config.jsonUse custom configuration file:
python main.py --config custom_config.json generate- Automatic Validation: All configuration values are validated on load
- Centralized Management: Single source of truth for all parameters
- Dynamic Loading: Configuration changes propagate to all modules automatically
- Error Handling: Comprehensive error reporting for invalid configurations
- Singleton Pattern: Ensures consistency across all modules
- Logging: All configuration operations are logged for debugging
herald/
├── main.py # Main CLI entry point
├── config_manager.py # Centralized configuration management
├── dataset_generation.py # Dataset generation module
├── herald.py # Safety evaluation module
├── config.json # Configuration file
├── requirements.txt # Python dependencies
├── .env.example # Environment template
├── README.md # This file
├── data/ # Generated datasets
│ └── safety_dataset_*.jsonl
├── cache/ # Temporary generation files
├── results/ # Evaluation outputs
│ ├── evaluation_results_*.json
│ ├── *_confusion_matrix.html
│ └── *_conversations.html
└── tests/ # Test suite
├── __init__.py
├── test_config_manager.py
└── test_integration.py
Generate a large research dataset:
python main.py generate --num-conversations 1000 --output research_data/Compare different models using CLI overrides:
# Evaluate with GPT-4
python main.py evaluate --evaluation-model "openai/gpt-4o" --dataset data/safety_dataset.jsonl --output gpt4_results.json
# Evaluate with Claude
python main.py evaluate --evaluation-model "anthropic/claude-3.5-sonnet" --dataset data/safety_dataset.jsonl --output claude_results.json
# Generate with one model, evaluate with another
python main.py generate --generation-model "anthropic/claude-3-haiku" --num-conversations 100
python main.py evaluate --evaluation-model "openai/gpt-4o" --dataset data/safety_dataset_*.jsonl --output mixed_results.jsonModify config.json to focus on specific safety categories:
{
"dataset_generation": {
"distribution": {
"suicidal_ideation": 0.50,
"non_suicidal_self_harm": 0.30,
"harm_to_others": 0.20,
"false_positive": 0.00,
"benign": 0.00
}
}
}The evaluation produces several outputs:
- Console Summary: Real-time progress and final metrics
- JSON Results: Detailed results for further analysis
- HTML Visualizations: Interactive confusion matrix and conversation examples
- Escalation Accuracy: How often the model correctly identified when escalation was needed
- False Positive Rate: How often benign conversations were incorrectly flagged
- False Negative Rate: How often dangerous conversations were missed
- Per-Category Performance: Breakdown showing which types of conversations are hardest to classify
- High Precision, Low Recall: Model is conservative (misses some risks but rarely false alarms)
- Low Precision, High Recall: Model is sensitive (catches most risks but many false alarms)
- Balanced F1 Score: Good overall performance across both precision and recall
The system includes 5 personas (at this time):
- Maria Rodriguez (28, Hispanic/Latino, Elementary Teacher)
- Aisha Lincoln (35, Black/African American, Nurse Practitioner)
- Jordan Taylor (19, White, College Student with Autism)
- Robert White (67, White, Retired Engineer/Veteran)
- Emily Chen (19, Mixed Chinese-American, Unemployed)
Each persona has backgrounds, interests, and communication styles to ensure realistic conversations.
Explicit suicidal thoughts, plans, or expressions of wanting to die.
Self-injury behaviors without suicidal intent (cutting, burning, etc.).
Threats, plans, or expressions of wanting to harm other people.
Conversations that appear concerning but are actually benign (gaming references, metaphors, etc.).
Normal, safe conversations with no safety concerns.
Check your setup and configuration:
python main.py infoThis command shows:
- API key status
- Current configuration
- File structure validation
- Usage examples
- Uses OpenRouter for model access (supports 200+ models including OpenAI, Anthropic, Meta, etc.)
- English-language conversations only
- Limited to predefined persona set
- Requires OpenRouter API key and credits
- Only synthetic data
- Support for additional model providers (Anthropic, Google, etc.)
- Hybrid benchmark dataset with expert annotation, blinded from model to prevent data leakage
- Multi-language support
- Expanded persona library for synthetic data
MIT License
If you use HERALD in your research, please cite:
@software{Farmer2025HERALD,
author = {Matthew S. Farmer},
title = {HEARLD: Harm Evaluation and Risk Assessment for Language Detection},
year = {2025},
url = {https://github.com/herald},
note = {Version 0.1}
}For questions, issues, or contributions:
- Open an issue on GitHub
- Contact: mfarme@outlook.com