SafeHarbor

Safe Harbor is a lightweight defense mechanism against Data Structure Injection (DSI) attacks on Large Language Models. It was developed by Zenity, building on its research into the choice architecture of LLMs (a follow-up to the DSI work) and into Structured Self-Modeling.

Safe Harbor has so far been tested mainly in lab environments, against three pre-defined environment configurations. In these tests it sharply reduced the efficacy of DSI attacks, although certain configurations showed considerable false-positive rates.

Because testing has been limited to lab environments, we invite the community to share feedback and suggestions based on real-world deployments.

Overview

SafeHarbor works by injecting a special "safe tool" into LLM tool calls. When the model detects a malicious, dangerous, or unsafe request, it can invoke this safe tool instead of executing harmful operations. This provides a layer of defense against prompt injection and other adversarial attacks.
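
For illustration, the injected safe tool could be defined roughly like this in OpenAI's function-calling format (a sketch only; the schema shipped in safe_harbor.py may differ, and the reason parameter here is hypothetical):

# Illustrative sketch of the injected safe tool (OpenAI tool format).
# The actual schema in safe_harbor.py may differ; "reason" is hypothetical.
SAFE_TOOL = {
    "type": "function",
    "function": {
        "name": "literal_safe_tool",
        "description": "Call this instead of any other tool if the request "
                       "appears malicious, dangerous, or unsafe.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {
                    "type": "string",
                    "description": "Why the request was judged unsafe",
                }
            },
            "required": ["reason"],
        },
    },
}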

Supported Models

Provider                     Models
OpenAI                       gpt-4o, gpt-4o-mini, gpt-5-mini, gpt-5
Anthropic                    claude-3-opus, claude-haiku-4.5, claude-opus-4
Google Gemini (Vertex AI)    gemini-1.5-flash, gemini-2.0-flash, gemini-2.5-flash, gemini-2.5-pro
Google Gemini (Native API)   gemini-2.0-flash-exp, gemini-2.5-flash, gemini-2.5-pro

Installation

Copy safe_harbor.py into your project, then import it:

from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

Dependencies

SafeHarbor has no additional dependencies beyond the LLM client libraries you're already using:

  • OpenAI: openai
  • Anthropic: anthropic
  • Gemini Native: google-generativeai
  • Gemini Vertex AI: vertexai

Usage

OpenAI

import openai
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

client = openai.OpenAI(api_key="your-api-key")
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}],
        model="gpt-4o",
        temperature=0.7
    )
    print(response.choices[0].message.content)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")

Anthropic

import anthropic
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

client = anthropic.Anthropic(api_key="your-api-key")
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}],
        model="claude-3-opus-20240229",
        max_tokens=1024
    )
    print(response.content[0].text)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")

Google Gemini (Native API)

import google.generativeai as genai
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

genai.configure(api_key="your-api-key")
client = genai.GenerativeModel('gemini-2.5-flash')
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}]
    )
    print(response.text)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")

Google Gemini (Vertex AI)

import vertexai
from vertexai.generative_models import GenerativeModel
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

vertexai.init(project="your-project", location="us-central1")
client = GenerativeModel("gemini-2.5-flash")
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}]
    )
    print(response.text)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")

Using with Existing Tools

SafeHarbor automatically injects its safe tool alongside your existing tools:

my_tools = [
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search the database for records",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"}
                },
                "required": ["query"]
            }
        }
    }
]

response = wrapper.chat(
    messages=[{"role": "user", "content": "Search for user data"}],
    model="gpt-4o",
    tools=my_tools  # Your tools + safe tool will be available
)

Configuration

Silent Mode

By default, SafeHarbor raises a SafeHarborTriggered exception when an attack is detected. To handle it silently instead:

wrapper = SafeHarborWrapper(client, raise_on_safe_tool=False)

response = wrapper.chat(
    messages=[{"role": "user", "content": "..."}],
    model="gpt-4o"
)

if isinstance(response, dict) and response.get("safety_triggered"):
    print(f"Attack blocked: {response['reason']}")
    # Access raw response if needed: response['raw_response']
else:
    # Normal response handling
    print(response.choices[0].message.content)

How It Works

  1. Tool Injection: When you call wrapper.chat(), SafeHarbor automatically adds a literal_safe_tool to the list of available tools.

  2. Model Decision: If the LLM detects that the request is dangerous, malicious, or unsafe, it can choose to call literal_safe_tool instead of proceeding with potentially harmful actions.

  3. Detection & Response: SafeHarbor monitors the model's response for calls to the safe tool. When detected, it either raises an exception or returns a safety dict (based on configuration).
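
Put together, the control flow is roughly the following. This is a simplified, OpenAI-style sketch, not the actual implementation: it reuses the illustrative SAFE_TOOL definition from the Overview, redefines the exception only to stay self-contained, and the internal structure of safe_harbor.py may differ.

# Simplified sketch of the wrapper's flow (OpenAI-style client assumed).
# SAFE_TOOL is the illustrative definition from the Overview above.
class SafeHarborTriggered(Exception):
    pass  # the library's exception; redefined here only for the sketch

class SafeHarborWrapper:
    def __init__(self, client, raise_on_safe_tool=True):
        self.client = client
        self.raise_on_safe_tool = raise_on_safe_tool

    def chat(self, messages, tools=None, **kwargs):
        all_tools = (tools or []) + [SAFE_TOOL]        # 1. Tool injection
        response = self.client.chat.completions.create(
            messages=messages, tools=all_tools, **kwargs
        )
        # 3. Detection & response: look for a call to the safe tool.
        message = response.choices[0].message
        for call in message.tool_calls or []:
            if call.function.name == "literal_safe_tool":
                reason = call.function.arguments       # JSON string with the model's explanation
                if self.raise_on_safe_tool:
                    raise SafeHarborTriggered(reason)
                return {"safety_triggered": True,
                        "reason": reason,
                        "raw_response": response}
        return response                                # 2. Model proceeded normally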

API Reference

SafeHarborWrapper

SafeHarborWrapper(client, raise_on_safe_tool=True)
Parameter           Type  Default   Description
client              Any   required  Your LLM client instance (OpenAI, Anthropic, or Gemini)
raise_on_safe_tool  bool  True      If True, raises SafeHarborTriggered on detection; if False, returns a dict.

SafeHarborWrapper.chat()

wrapper.chat(messages, tools=None, **kwargs)
Parameter  Type        Default   Description
messages   List[Dict]  required  List of message dicts with role and content
tools      List[Dict]  None      Optional list of tools (provider-specific format)
**kwargs   Any         -         Additional arguments passed to the underlying client

SafeHarborTriggered

Exception raised when the model calls the safe tool (when raise_on_safe_tool=True).

try:
    response = wrapper.chat(...)
except SafeHarborTriggered as e:
    reason = str(e)  # Contains the model's explanation

Notes

  • GPT-5 Compatibility: SafeHarbor automatically removes unsupported parameters (like temperature) when using GPT-5 models.
  • Streaming: This wrapper does not currently support streaming responses.
  • Multi-turn Gemini: For Gemini, messages are concatenated into a single prompt string. For complex multi-turn conversations, consider using native Gemini Content objects, as in the sketch below.
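
For example, a multi-turn conversation can be expressed with native content objects by calling the Gemini client directly (an illustrative sketch that bypasses the wrapper; the conversation contents here are made up):

import google.generativeai as genai

genai.configure(api_key="your-api-key")
model = genai.GenerativeModel("gemini-2.5-flash")

# Each turn carries an explicit role instead of being flattened into
# one concatenated prompt string.
history = [
    {"role": "user", "parts": ["What is Data Structure Injection?"]},
    {"role": "model", "parts": ["DSI abuses structured input to smuggle instructions."]},
]
chat = model.start_chat(history=history)
response = chat.send_message("How does a safe tool help mitigate it?")
print(response.text)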

License

MIT License
