SafeHarbor

Safe Harbor is a lightweight defense mechanism against Data Structure Injection (DSI) attacks on Large Language Models. It was developed by Zenity, building on its research into the choice architecture of LLMs (a follow-up to the DSI work) and into Structured Self-Modeling.

Safe Harbor has so far been tested mainly in lab environments, against three pre-defined environment configurations. In these tests it sharply reduced the efficacy of DSI attacks, although certain configurations showed considerable false-positive rates.

Because testing has been limited to lab environments, we invite the community to share feedback and suggestions based on real-world deployments.

Overview

SafeHarbor works by injecting a special "safe tool" into LLM tool calls. When the model detects a malicious, dangerous, or unsafe request, it can invoke this safe tool instead of executing harmful operations. This provides a layer of defense against prompt injection and other adversarial attacks.
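
For illustration, the injected safe tool could be defined roughly like this in OpenAI's function-calling format (a sketch only; the schema shipped in safe_harbor.py may differ, and the reason parameter here is hypothetical):

# Illustrative sketch of the injected safe tool (OpenAI tool format).
# The actual schema in safe_harbor.py may differ; "reason" is hypothetical.
SAFE_TOOL = {
    "type": "function",
    "function": {
        "name": "literal_safe_tool",
        "description": "Call this instead of any other tool if the request "
                       "appears malicious, dangerous, or unsafe.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {
                    "type": "string",
                    "description": "Why the request was judged unsafe",
                }
            },
            "required": ["reason"],
        },
    },
}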

Supported Models

Provider                     Models
OpenAI                       gpt-4o, gpt-4o-mini, gpt-5-mini, gpt-5
Anthropic                    claude-3-opus, claude-haiku-4.5, claude-opus-4
Google Gemini (Vertex AI)    gemini-1.5-flash, gemini-2.0-flash, gemini-2.5-flash, gemini-2.5-pro
Google Gemini (Native API)   gemini-2.0-flash-exp, gemini-2.5-flash, gemini-2.5-pro

Installation

Copy safe_harbor.py into your project, then import it:

from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

Dependencies

SafeHarbor has no additional dependencies beyond the LLM client libraries you're already using:

  • OpenAI: openai
  • Anthropic: anthropic
  • Gemini Native: google-generativeai
  • Gemini Vertex AI: vertexai

Usage

OpenAI

import openai
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

client = openai.OpenAI(api_key="your-api-key")
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}],
        model="gpt-4o",
        temperature=0.7
    )
    print(response.choices[0].message.content)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")

Anthropic

import anthropic
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

client = anthropic.Anthropic(api_key="your-api-key")
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}],
        model="claude-3-opus-20240229",
        max_tokens=1024
    )
    print(response.content[0].text)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")

Google Gemini (Native API)

import google.generativeai as genai
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

genai.configure(api_key="your-api-key")
client = genai.GenerativeModel('gemini-2.5-flash')
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}]
    )
    print(response.text)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")

Google Gemini (Vertex AI)

import vertexai
from vertexai.generative_models import GenerativeModel
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

vertexai.init(project="your-project", location="us-central1")
client = GenerativeModel("gemini-2.5-flash")
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}]
    )
    print(response.text)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")

Using with Existing Tools

SafeHarbor automatically injects its safe tool alongside your existing tools:

my_tools = [
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search the database for records",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"}
                },
                "required": ["query"]
            }
        }
    }
]

response = wrapper.chat(
    messages=[{"role": "user", "content": "Search for user data"}],
    model="gpt-4o",
    tools=my_tools  # Your tools + safe tool will be available
)

Configuration

Silent Mode

By default, SafeHarbor raises a SafeHarborTriggered exception when an attack is detected. To handle it silently instead:

wrapper = SafeHarborWrapper(client, raise_on_safe_tool=False)

response = wrapper.chat(
    messages=[{"role": "user", "content": "..."}],
    model="gpt-4o"
)

if isinstance(response, dict) and response.get("safety_triggered"):
    print(f"Attack blocked: {response['reason']}")
    # Access raw response if needed: response['raw_response']
else:
    # Normal response handling
    print(response.choices[0].message.content)

How It Works

  1. Tool Injection: When you call wrapper.chat(), SafeHarbor automatically adds a literal_safe_tool to the list of available tools.

  2. Model Decision: If the LLM detects that the request is dangerous, malicious, or unsafe, it can choose to call literal_safe_tool instead of proceeding with potentially harmful actions.

  3. Detection & Response: SafeHarbor monitors the model's response for calls to the safe tool. When detected, it either raises an exception or returns a safety dict (based on configuration).
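
Put together, the control flow is roughly the following. This is a simplified, OpenAI-style sketch, not the actual implementation: it reuses the illustrative SAFE_TOOL definition from the Overview, redefines the exception only to stay self-contained, and the internal structure of safe_harbor.py may differ.

# Simplified sketch of the wrapper's flow (OpenAI-style client assumed).
# SAFE_TOOL is the illustrative definition from the Overview above.
class SafeHarborTriggered(Exception):
    pass  # the library's exception; redefined here only for the sketch

class SafeHarborWrapper:
    def __init__(self, client, raise_on_safe_tool=True):
        self.client = client
        self.raise_on_safe_tool = raise_on_safe_tool

    def chat(self, messages, tools=None, **kwargs):
        all_tools = (tools or []) + [SAFE_TOOL]        # 1. Tool injection
        response = self.client.chat.completions.create(
            messages=messages, tools=all_tools, **kwargs
        )
        # 3. Detection & response: look for a call to the safe tool.
        message = response.choices[0].message
        for call in message.tool_calls or []:
            if call.function.name == "literal_safe_tool":
                reason = call.function.arguments       # JSON string with the model's explanation
                if self.raise_on_safe_tool:
                    raise SafeHarborTriggered(reason)
                return {"safety_triggered": True,
                        "reason": reason,
                        "raw_response": response}
        return response                                # 2. Model proceeded normally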

API Reference

SafeHarborWrapper

SafeHarborWrapper(client, raise_on_safe_tool=True)
Parameter           Type  Default   Description
client              Any   required  Your LLM client instance (OpenAI, Anthropic, or Gemini)
raise_on_safe_tool  bool  True      If True, raises SafeHarborTriggered on detection; if False, returns a dict.

SafeHarborWrapper.chat()

wrapper.chat(messages, tools=None, **kwargs)
Parameter  Type        Default   Description
messages   List[Dict]  required  List of message dicts with role and content
tools      List[Dict]  None      Optional list of tools (provider-specific format)
**kwargs   Any         -         Additional arguments passed to the underlying client

SafeHarborTriggered

Exception raised when the model calls the safe tool (when raise_on_safe_tool=True).

try:
    response = wrapper.chat(...)
except SafeHarborTriggered as e:
    reason = str(e)  # Contains the model's explanation

Notes

  • GPT-5 Compatibility: SafeHarbor automatically removes unsupported parameters (like temperature) when using GPT-5 models.
  • Streaming: This wrapper does not currently support streaming responses.
  • Multi-turn Gemini: For Gemini, messages are concatenated into a single prompt string. For complex multi-turn conversations, consider using native Gemini Content objects, as in the sketch below.
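
For example, a multi-turn conversation can be expressed with native content objects by calling the Gemini client directly (an illustrative sketch that bypasses the wrapper; the conversation contents here are made up):

import google.generativeai as genai

genai.configure(api_key="your-api-key")
model = genai.GenerativeModel("gemini-2.5-flash")

# Each turn carries an explicit role instead of being flattened into
# one concatenated prompt string.
history = [
    {"role": "user", "parts": ["What is Data Structure Injection?"]},
    {"role": "model", "parts": ["DSI abuses structured input to smuggle instructions."]},
]
chat = model.start_chat(history=history)
response = chat.send_message("How does a safe tool help mitigate it?")
print(response.text)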

License

MIT License
