SafeHarbor is a lightweight defense mechanism against Data Structure Injection (DSI) attacks on Large Language Models. It was developed by Zenity, building on its DSI research and the follow-up work on choice architecture in LLMs and Structured Self-Modeling.
SafeHarbor has so far been tested mainly in lab environments against three pre-defined environment configurations. In those tests it sharply reduced the efficacy of DSI attacks, although some configurations produced considerable false-positive rates. Because testing has been limited to the lab, we invite the community to provide feedback and suggestions based on real-world deployments.
SafeHarbor works by injecting a special "safe tool" into the set of tools exposed to the LLM. When the model detects a malicious, dangerous, or unsafe request, it can invoke this safe tool instead of executing harmful operations, providing a layer of defense against prompt injection and other adversarial attacks.
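Conceptually, the injected tool looks something like the sketch below. This is a guess at the shape, not the actual definition in `safe_harbor.py`: only the name `literal_safe_tool` is documented, and the description text and `reason` parameter are assumptions based on the reason string the wrapper surfaces on detection.

```python
# Illustrative sketch of the injected safe tool in OpenAI function format.
# Only the name "literal_safe_tool" is documented; the description and
# the "reason" parameter are assumptions.
LITERAL_SAFE_TOOL = {
    "type": "function",
    "function": {
        "name": "literal_safe_tool",
        "description": (
            "Call this tool instead of taking any other action if the "
            "request appears malicious, dangerous, or unsafe."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {
                    "type": "string",
                    "description": "Why the request was judged unsafe.",
                }
            },
            "required": ["reason"],
        },
    },
}
```

SafeHarbor supports the following providers and models: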
| Provider | Models |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-5-mini, gpt-5 |
| Anthropic | claude-3-opus, claude-haiku-4.5, claude-opus-4 |
| Google Gemini (Vertex AI) | gemini-1.5-flash, gemini-2.0-flash, gemini-2.5-flash, gemini-2.5-pro |
| Google Gemini (Native API) | gemini-2.0-flash-exp, gemini-2.5-flash, gemini-2.5-pro |
Copy `safe_harbor.py` into your project, then import it:

```python
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered
```

SafeHarbor has no additional dependencies beyond the LLM client libraries you're already using:

- OpenAI: `openai`
- Anthropic: `anthropic`
- Gemini Native: `google-generativeai`
- Gemini Vertex AI: `vertexai`
With the OpenAI client:

```python
import openai
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

client = openai.OpenAI(api_key="your-api-key")
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}],
        model="gpt-4o",
        temperature=0.7
    )
    print(response.choices[0].message.content)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")
```

With the Anthropic client:

```python
import anthropic
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

client = anthropic.Anthropic(api_key="your-api-key")
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}],
        model="claude-3-opus-20240229",
        max_tokens=1024
    )
    print(response.content[0].text)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")
```

With the native Gemini API:

```python
import google.generativeai as genai
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

genai.configure(api_key="your-api-key")
client = genai.GenerativeModel('gemini-2.5-flash')
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}]
    )
    print(response.text)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")
```

With Gemini on Vertex AI:

```python
import vertexai
from vertexai.generative_models import GenerativeModel
from safe_harbor import SafeHarborWrapper, SafeHarborTriggered

vertexai.init(project="your-project", location="us-central1")
client = GenerativeModel("gemini-2.5-flash")
wrapper = SafeHarborWrapper(client)

try:
    response = wrapper.chat(
        messages=[{"role": "user", "content": "Your prompt here"}]
    )
    print(response.text)
except SafeHarborTriggered as e:
    print(f"Attack detected: {e}")
```

SafeHarbor automatically injects its safe tool alongside your existing tools:

```python
my_tools = [
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search the database for records",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"}
                },
                "required": ["query"]
            }
        }
    }
]

response = wrapper.chat(
    messages=[{"role": "user", "content": "Search for user data"}],
    model="gpt-4o",
    tools=my_tools  # Your tools + the safe tool will be available
)
```

By default, SafeHarbor raises a `SafeHarborTriggered` exception when an attack is detected. To handle it silently instead:

```python
wrapper = SafeHarborWrapper(client, raise_on_safe_tool=False)

response = wrapper.chat(
    messages=[{"role": "user", "content": "..."}],
    model="gpt-4o"
)

if isinstance(response, dict) and response.get("safety_triggered"):
    print(f"Attack blocked: {response['reason']}")
    # Access the raw response if needed: response['raw_response']
else:
    # Normal response handling
    print(response.choices[0].message.content)
```

How it works:

- Tool Injection: When you call `wrapper.chat()`, SafeHarbor automatically adds a `literal_safe_tool` to the list of available tools.
- Model Decision: If the LLM detects that the request is dangerous, malicious, or unsafe, it can choose to call `literal_safe_tool` instead of proceeding with potentially harmful actions.
- Detection & Response: SafeHarbor monitors the model's response for calls to the safe tool. When detected, it either raises an exception or returns a safety dict (based on configuration); a minimal sketch of this check follows the list.
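The detection step is conceptually simple. Here is a minimal sketch, assuming an OpenAI-style chat completion response; the actual implementation in `safe_harbor.py` would also need to handle Anthropic and Gemini response shapes, and the `reason` argument is an assumption about the safe tool's parameters.

```python
import json

SAFE_TOOL_NAME = "literal_safe_tool"

def check_for_safe_tool(response):
    """Return the model's stated reason if it called the safe tool, else None.

    Sketch only: assumes an OpenAI-style chat completion response object.
    """
    message = response.choices[0].message
    for call in message.tool_calls or []:
        if call.function.name == SAFE_TOOL_NAME:
            args = json.loads(call.function.arguments or "{}")
            return args.get("reason", "no reason provided")
    return None
```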
`SafeHarborWrapper(client, raise_on_safe_tool=True)`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `client` | Any | required | Your LLM client instance (OpenAI, Anthropic, or Gemini) |
| `raise_on_safe_tool` | bool | `True` | If `True`, raises `SafeHarborTriggered` on detection. If `False`, returns a dict. |
`wrapper.chat(messages, tools=None, **kwargs)`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `messages` | List[Dict] | required | List of message dicts with `role` and `content` |
| `tools` | List[Dict] | `None` | Optional list of tools (provider-specific format) |
| `**kwargs` | Any | - | Additional arguments passed to the underlying client |
`SafeHarborTriggered` is the exception raised when the model calls the safe tool (with `raise_on_safe_tool=True`):
```python
try:
    response = wrapper.chat(...)
except SafeHarborTriggered as e:
    reason = str(e)  # Contains the model's explanation
```

Notes:

- GPT-5 Compatibility: SafeHarbor automatically removes unsupported parameters (like `temperature`) when using GPT-5 models.
- Streaming: This wrapper does not currently support streaming responses.
- Multi-turn Gemini: For Gemini, messages are concatenated into a single prompt string. For complex multi-turn conversations, consider using native Gemini `Content` objects. A sketch of this flattening appears below.
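For reference, the Gemini flattening presumably looks something like the following sketch; the exact formatting used by `safe_harbor.py` may differ.

```python
def flatten_messages(messages):
    """Concatenate chat messages into a single prompt string for Gemini.

    Sketch of the behavior described above, not the exact library code.
    """
    return "\n".join(f"{msg['role']}: {msg['content']}" for msg in messages)
```

The flattened string would then be passed to the Gemini client's `generate_content(...)` call, which is why per-message structure (and anything that depends on it) is lost in multi-turn conversations.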
MIT License