HERALD

Harm Evaluation and Risk Assessment for Language Detection An in-development Benchmark for Evaluating Crisis Detection and Response in Large Language Models

Current state: Evaluation of OpenAI models on evaluating, labeling, and escalating synthetic conversational data Future state: Evaluate multiple models from various providers on a hybrid dataset (not provided in the github) based on expert labelled conversations created from real conversation data.

NOTE All of the data presented in this repository is fictional and synthetic. Any resemblance to real individuals, conversations, or events is purely coincidental and unintended. This data should not be interpreted as representing actual persons or real-world incidents; it exists solely for research, development, and evaluation purposes within the context of AI safety.

Overview

HERALD is a framework for generating and evaluating conversational safety datasets. It provides two main capabilities:

Dataset Generation: Create realistic, synthetic conversational datasets with diverse personas and safety scenarios
Safety Evaluation: Assess AI models' ability to detect and respond to safety risks in conversations
Configuration Management: Centralized management of all parameters across the system (supports 200+ models via OpenRouter)

Results for GPT-4.1

based on the dataset generated on 2025-07-28 (n=100 conversations)
Cost: ~ $1.00 USD

Safety Evaluation Results

Classification Metrics

Metric	Value
Precision	0.806
Recall	0.862
F1 Score	0.833

Per Escalation Flag

Escalation Flag	Correct / Total	Accuracy
escalation_required	50 / 58	0.862
no_escalation	30 / 42	0.714

Confusion Matrix

	Predicted Escalation	Predicted No Escalation
Actual Escalation	True Positives: 50	False Negatives: 8
Actual No Escalation	False Positives: 12	True Negatives: 30

Features

JSON-based configuration system with validation and error handling
Configuration changes automatically propagate to all modules
5 personas representing different demographics and backgrounds
Covers suicidal ideation, self-harm, harm to others, false positives, and benign conversations
Enforced distribution across conversation types and turn lengths
Uses function calling to trigger escalation when needed
HTML reports with confusion matrices and conversation examples
Full test suite for configuration management and integration
Built-in commands for configuration validation and management in CLI

Installation

Prerequisites

Python 3.8+
OpenRouter API key (provides access to multiple AI models including OpenAI, Anthropic, etc.)

Setup

Clone the repository:

git clone <repository-url>
cd SafetyBench

Install dependencies:

pip install -r requirements.txt

Set up your OpenRouter API key:

export OPENROUTER_API_KEY="your-api-key-here"

Or create a .env file:

cp .env.example .env
# Edit .env and add your OpenRouter API key

Quick Start

Evaluate a Dataset

Evaluate an existing dataset:

python main.py evaluate --dataset data/safety_dataset_20250728T121437Z.jsonl

Generate a Dataset

Generate 50 conversations with default settings:

python main.py generate --num-conversations 50

Get Help

View all available options:

python main.py --help
python main.py generate --help
python main.py evaluate --help

Model Configuration

Herald uses OpenRouter to access multiple AI models. You can configure models in several ways:

Using CLI Arguments

Override models for specific operations:

# Use a specific model for generation
python main.py generate --generation-model "anthropic/claude-3.5-sonnet"

# Use a specific model for evaluation  
python main.py evaluate --evaluation-model "openai/gpt-4o" --dataset data/my_dataset.jsonl

# Use different models for generation and evaluation
python main.py generate --generation-model "anthropic/claude-3.5-sonnet"
python main.py evaluate --evaluation-model "openai/gpt-4o" --dataset data/safety_dataset_*.jsonl

Available Models

Popular OpenRouter models include:

openai/gpt-5 - Latest GPT-5 model
openai/gpt-5-mini - Faster, more cost-effective GPT-5
anthropic/claude-sonnet-4 - Anthropic Claude Sonnet 4
google/gemini-2.5-pro - Google's Gemini 2.5 Pro
deepseek/deepseek-chat-v3-0324 - Deepseek Chat V3

Configuration File

Default models are set in config.json:

{
  "dataset_generation": {
    "generation_model": "openai/gpt-4o-mini"
  },
  "safety_evaluation": {
    "evaluation_model": "openai/gpt-4o"
  }
}

Usage Guide

Dataset Generation

The dataset generation arm creates realistic conversations between AI assistants and personas experiencing various safety scenarios. Future validated datasets will not be stored in the repository, but synthetic datasets can be generated on demand.

Basic Generation

# Generate 100 conversations (default)
python main.py generate

# Generate custom number of conversations
python main.py generate --num-conversations 200

# Specify output directory
python main.py generate --output custom_data/

Generated Dataset Structure

Each dataset contains:

Conversation Types: Distributed across 5 categories
- Suicidal ideation (25%)
- Non-suicidal self-harm (25%)
- Harm to others (25%)
- False positives (10%)
- Benign conversations (15%)
Turn Lengths: 2, 4, or 6 turns per conversation
Personas: Randomly selected from diverse persona library

Dataset Format

The output is a JSONL file where each line contains either:

Dataset metadata (first line)
Conversation metadata (persona info, safety labels)
Message data (individual conversation turns)

Example conversation metadata:

{
  "type": "metadata",
  "conversation_id": "abc123",
  "persona_name": "Persona Name",
  "conversation_type": "suicidal_ideation",
  "severity_level": "immediate_risk",
  "escalation_flag": "escalation_required",
  "turns": 4
}

Safety Evaluation

The safety evaluation arm assesses how well AI models can detect safety risks and trigger appropriate escalations.

Basic Evaluation

# Evaluate a dataset
python main.py evaluate --dataset data/safety_dataset_20250728T121437Z.jsonl

# Save results to file
python main.py evaluate --dataset data/safety_dataset_20250728T121437Z.jsonl --output results.json

# Generate HTML visualizations
python main.py evaluate --dataset data/safety_dataset_20250728T121437Z.jsonl --save-html

Evaluation Metrics

The system provides comprehensive metrics:

Overall Accuracy: Percentage of correct escalation decisions
Tool Calling Accuracy: Accuracy of function calling behavior
Per-Category Performance: Breakdown by conversation type
Confusion Matrix: True/false positives and negatives
Precision, Recall, F1: Standard classification metrics

Evaluation Process

For each conversation, the evaluator:

Presents the conversation to the target model
Asks the model to assess safety risks
Checks if escalation function was called when needed
Compares prediction to ground truth labels
Calculates performance metrics

Configuration System

SafetyBench features a comprehensive configuration management system that centralizes all parameters and ensures consistency across all modules.

Configuration File Structure

All configuration is managed through config.json:

{
  "dataset_generation": {
    "num_conversations": 100,
    "generation_model": "gpt-4o-mini",
    "generation_temperature": 0.7,
    "max_tokens": 2048,
    "cache_dir": "cache",
    "output_dir": "data",
    "distribution": {
      "suicidal_ideation": 0.25,
      "non_suicidal_self_harm": 0.25,
      "harm_to_others": 0.25,
      "false_positive": 0.10,
      "benign": 0.15
    },
    "turns": [2, 4, 6]
  },
  "safety_evaluation": {
    "evaluation_model": "gpt-4o",
    "evaluation_temperature": 0.2,
    "max_tokens": 1000
  },
  "personas": [...],
  "conversation_prompts": {...},
  "system_prompt": "...",
  "safety_evaluation_prompt": "..."
}

Configuration Management Commands

View current configuration:

python main.py config show

Validate configuration file:

python main.py config validate

Reload configuration from file:

python main.py config reload --file path/to/config.json

Use custom configuration file:

python main.py --config custom_config.json generate

Configuration Features

Automatic Validation: All configuration values are validated on load
Centralized Management: Single source of truth for all parameters
Dynamic Loading: Configuration changes propagate to all modules automatically
Error Handling: Comprehensive error reporting for invalid configurations
Singleton Pattern: Ensures consistency across all modules
Logging: All configuration operations are logged for debugging

File Structure

herald/
├── main.py                  # Main CLI entry point
├── config_manager.py        # Centralized configuration management
├── dataset_generation.py    # Dataset generation module
├── herald.py                # Safety evaluation module
├── config.json              # Configuration file
├── requirements.txt         # Python dependencies
├── .env.example             # Environment template
├── README.md                # This file
├── data/                    # Generated datasets
│   └── safety_dataset_*.jsonl
├── cache/                   # Temporary generation files
├── results/                 # Evaluation outputs
│   ├── evaluation_results_*.json
│   ├── *_confusion_matrix.html
│   └── *_conversations.html
└── tests/                   # Test suite
    ├── __init__.py
    ├── test_config_manager.py
    └── test_integration.py

Examples

Example 1: Research Dataset

Generate a large research dataset:

python main.py generate --num-conversations 1000 --output research_data/

Example 2: Model Comparison

Compare different models using CLI overrides:

# Evaluate with GPT-4
python main.py evaluate --evaluation-model "openai/gpt-4o" --dataset data/safety_dataset.jsonl --output gpt4_results.json

# Evaluate with Claude
python main.py evaluate --evaluation-model "anthropic/claude-3.5-sonnet" --dataset data/safety_dataset.jsonl --output claude_results.json

# Generate with one model, evaluate with another
python main.py generate --generation-model "anthropic/claude-3-haiku" --num-conversations 100
python main.py evaluate --evaluation-model "openai/gpt-4o" --dataset data/safety_dataset_*.jsonl --output mixed_results.json

Example 3: Custom Distribution

Modify config.json to focus on specific safety categories:

{
  "dataset_generation": {
    "distribution": {
      "suicidal_ideation": 0.50,
      "non_suicidal_self_harm": 0.30,
      "harm_to_others": 0.20,
      "false_positive": 0.00,
      "benign": 0.00
    }
  }
}

Understanding Results

Evaluation Output

The evaluation produces several outputs:

Console Summary: Real-time progress and final metrics
JSON Results: Detailed results for further analysis
HTML Visualizations: Interactive confusion matrix and conversation examples

Key Metrics

Escalation Accuracy: How often the model correctly identified when escalation was needed
False Positive Rate: How often benign conversations were incorrectly flagged
False Negative Rate: How often dangerous conversations were missed
Per-Category Performance: Breakdown showing which types of conversations are hardest to classify

Interpreting Results

High Precision, Low Recall: Model is conservative (misses some risks but rarely false alarms)
Low Precision, High Recall: Model is sensitive (catches most risks but many false alarms)
Balanced F1 Score: Good overall performance across both precision and recall

Personas

The system includes 5 personas (at this time):

Maria Rodriguez (28, Hispanic/Latino, Elementary Teacher)
Aisha Lincoln (35, Black/African American, Nurse Practitioner)
Jordan Taylor (19, White, College Student with Autism)
Robert White (67, White, Retired Engineer/Veteran)
Emily Chen (19, Mixed Chinese-American, Unemployed)

Each persona has backgrounds, interests, and communication styles to ensure realistic conversations.

Conversation Categories

1. Suicidal Ideation

Explicit suicidal thoughts, plans, or expressions of wanting to die.

2. Non-Suicidal Self-Harm

Self-injury behaviors without suicidal intent (cutting, burning, etc.).

3. Harm to Others

Threats, plans, or expressions of wanting to harm other people.

4. False Positives

Conversations that appear concerning but are actually benign (gaming references, metaphors, etc.).

5. Benign

Normal, safe conversations with no safety concerns.

System Information

Check your setup and configuration:

python main.py info

This command shows:

API key status
Current configuration
File structure validation
Usage examples

Limitations

Uses OpenRouter for model access (supports 200+ models including OpenAI, Anthropic, Meta, etc.)
English-language conversations only
Limited to predefined persona set
Requires OpenRouter API key and credits
Only synthetic data

Future Plans

Support for additional model providers (Anthropic, Google, etc.)
Hybrid benchmark dataset with expert annotation, blinded from model to prevent data leakage
Multi-language support
Expanded persona library for synthetic data

License

MIT License

Citation

If you use HERALD in your research, please cite:

@software{Farmer2025HERALD,
  author    = {Matthew S. Farmer},
  title     = {HEARLD: Harm Evaluation and Risk Assessment for Language Detection},
  year      = {2025},
  url       = {https://github.com/herald},
  note      = {Version 0.1}
}

Support

For questions, issues, or contributions:

Open an issue on GitHub
Contact: mfarme@outlook.com

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
data		data
findings		findings
tests		tests
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.json		config.json
config_manager.py		config_manager.py
dataset_generation.py		dataset_generation.py
evaluation-notebook.ipynb		evaluation-notebook.ipynb
herald.py		herald.py
main.py		main.py
requirements.txt		requirements.txt
synthetic-conversation-generation.ipynb		synthetic-conversation-generation.ipynb
testing_notebook.ipynb		testing_notebook.ipynb

License

mfarme/herald

Folders and files

Latest commit

History

Repository files navigation

HERALD

Overview

Results for GPT-4.1

Safety Evaluation Results

Classification Metrics

Per Escalation Flag

Confusion Matrix

Features

Installation

Prerequisites

Setup

Quick Start

Evaluate a Dataset

Generate a Dataset

Get Help

Model Configuration

Using CLI Arguments

Available Models

Configuration File

Usage Guide

Dataset Generation

Basic Generation

Generated Dataset Structure

Dataset Format

Safety Evaluation

Basic Evaluation

Evaluation Metrics

Evaluation Process

Configuration System

Configuration File Structure

Configuration Management Commands

Configuration Features

File Structure

Examples

Example 1: Research Dataset

Example 2: Model Comparison

Example 3: Custom Distribution

Understanding Results

Evaluation Output

Key Metrics

Interpreting Results

Personas

Conversation Categories

1. Suicidal Ideation

2. Non-Suicidal Self-Harm

3. Harm to Others

4. False Positives

5. Benign

System Information

Limitations

Future Plans

License

Citation

Support

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages