Skip to content

Allow specifying different models for 'Attacker' and 'Target' locally #4

@peschull

Description

@peschull

Is your feature request related to a problem? Please describe.

When running MasterKey locally, the "Attacker" model (the one generating the jailbreaks) often needs to be an uncensored model (e.g., dolphin-llama3, wizardlm), whereas the "Target" model might be a standard aligned model (e.g., llama3, mistral).
Currently, it is difficult to configure the framework to use two different local model names that run on the same local server or different local ports.

Describe the solution you'd like

Please add a configuration option to explicitly set the model name for the attacker and the target separately, which is passed directly to the API call.
Example config structure:

{
"attacker": {
"model_name": "dolphin-llama3",
"api_base": "http://localhost:11434/v1"
},
"target": {
"model_name": "llama3",
"api_base": "http://localhost:11434/v1"
}
}

Additional context

Using standard models (like GPT-4 or standard Llama-3) as the 'Attacker' often fails locally because they refuse to generate the jailbreak prompts due to their own safety alignment. Supporting uncensored local models as attackers is essential for the framework's effectiveness in a local setup.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions