
feat: generic plugin system #53

Open
red40maxxer wants to merge 157 commits into p-e-w:master from red40maxxer:refusal-plugins

Conversation

@red40maxxer
Contributor

@red40maxxer red40maxxer commented Nov 27, 2025

Refactors refusal detection and refusal counting into a modular plugin system. This allows for arbitrary logic (regex, embeddings, external models) to be added without modifying core code.

Plugin system

  • Scorers take in responses and optional response metadata and produce a single Score that Optuna can use for optimization, e.g. ["I can't help you with that.", "Certainly!"] -> 0.67. By default, these are the KL divergence and the refusal rate.
  • You can add a new plugin by subclassing Scorer, implementing the abstract methods, and adding ScorerClassName.property entries in config.toml for any user-defined configuration you need. Make sure the config fields are defined in a PluginSettings; look at RefusalRate and KLDivergence for reference (see the sketch after this list).
  • Convenience methods for the Scorer to access important outputs from Model are provided through Context, such as get_responses(), get_logits(), etc.
  • Multiple instances of the same scorer are supported with optional configuration overrides in the config TOML. These are defined at the top level along with the optimization direction and scaling factor (if needed). To configure those instances specifically, create a table with the name ScorerClassName_InstanceName.
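
To make the subclassing and configuration flow concrete, here is a rough sketch of what a custom scorer might look like under this design. The import path, the score() method name, and the settings layout are illustrative assumptions rather than the actual interface; RefusalRate and KLDivergence remain the authoritative references.

# Hypothetical sketch only: import path, method name, and settings layout are assumptions.
import re
from dataclasses import dataclass

from heretic.plugins import Context, PluginSettings, Scorer  # path is illustrative

@dataclass
class RegexRefusalSettings(PluginSettings):
    # configurable as RegexRefusal.pattern in config.toml
    pattern: str = r"(?i)\bI can(?:no|')t\b"

class RegexRefusal(Scorer):
    def score(self, context: Context) -> float:
        responses = context.get_responses()
        hits = sum(bool(re.search(self.settings.pattern, r)) for r in responses)
        return hits / max(len(responses), 1)

A second instance with its own overrides would then live in its own table, e.g. [RegexRefusal_Strict] in config.toml.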

@red40maxxer red40maxxer marked this pull request as ready for review November 27, 2025 01:17
@p-e-w
Owner

p-e-w commented Nov 28, 2025

Thanks for the PR!

As mentioned previously, I am indeed planning to add a plugin system for several components of Heretic. IMO, the interface you propose is not generic enough.

Instead, we want plugins that can attach arbitrary attributes to responses. The refusal case would be covered by attaching the mapping

{"is_refusal": True}

Another plugin might use sentiment analysis to produce the attributes

{"positivity": 0.81, "formality": 0.36}

Then an evaluation plugin is passed the list of attribute maps for the responses, and produces a score that can be optimized by the optimizer. This will allow doing all kinds of things that are currently impossible, not just writing a different implementation of refusal detection.
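
As a purely illustrative picture of that data flow (not a proposed API), responses, their attribute maps, and a derived score might line up like this:

# Illustrative data only: one attribute plugin tags refusals, an evaluation plugin scores them.
responses = ["I can't help you with that.", "Sure, here's how it works..."]
attributes = [{"is_refusal": True}, {"is_refusal": False}]
refusal_rate = sum(a["is_refusal"] for a in attributes) / len(attributes)  # 0.5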

Even that is just a vague idea right now though. It's probably a good idea to discuss the plugin abstraction some more before proceeding with implementation.

@red40maxxer
Contributor Author

That makes sense, thanks for the explanation. Before going further, would something structurally similar to what I proposed (in terms of the import/loading mechanism and how plugins are registered/used) also be desirable for the broader attribute-generation and evaluation plugin systems you have in mind? If so, I can adapt the design toward that more general interface rather than building something too narrowly scoped.

Happy to iterate on the abstraction before implementing anything concrete.

@p-e-w
Owner

p-e-w commented Dec 2, 2025

would something structurally similar to what I proposed (in terms of the import/loading mechanism and how plugins are registered/used) also be desirable for the broader attribute-generation and evaluation plugin systems you have in mind?

Yes, I think your plugin loading logic looks good. Instead of a single plugins folder, we probably want one folder per plugin type, e.g. processors and evaluators.

I don't think we need the plugin interface to have a single-response detect method. Everything should operate on batches because that's the most flexible and performant approach. Likewise, evaluator plugins should receive the list of classifier outputs for all responses, and return a single score.

Not sure about the correct naming yet. "Classifiers"/"AttributeGenerators"? "Evaluators"/"Scorers"?

@red40maxxer
Contributor Author

Would we support multiple classifier plugins in parallel or in sequence? For example, I can imagine a situation where we'd have two plugins, one that uses regex and another that uses some other form of magic to detect refusals in order to overcome nondeterminism, such as when the prompt doesn't contain the full <think></think> block due to token limitations or whatnot.

AttributeGenerator and Scorer sound pretty reasonable to me tbh

@p-e-w
Owner

p-e-w commented Dec 2, 2025

I don't think we should support using multiple classifier plugins at the same time. There's always some use case that one can construct, but complexity just balloons, and Heretic can't cover every possible everything.

I don't like AttributeGenerator too much because it's long and two words. How about Tagger? The plugins live in taggers, and attach tags to responses. Other ideas very welcome.

The basic interface might look like this:

  • Taggers: list[str] -> list[dict[str, Any]] (responses to attributes)
  • Scorers: list[dict[str, Any]] -> float (attributes to score)
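
A minimal sketch of those two signatures as plain functions (the refusal heuristic below is just a placeholder, not a proposed implementation):

from typing import Any

def tag(responses: list[str]) -> list[dict[str, Any]]:
    # Tagger: attach attributes to each response
    return [{"is_refusal": r.lstrip().lower().startswith(("i can't", "i cannot"))} for r in responses]

def score(attributes: list[dict[str, Any]]) -> float:
    # Scorer: collapse the per-response attributes into one number the optimizer can use
    return sum(a["is_refusal"] for a in attributes) / max(len(attributes), 1)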

Does that make sense?

@p-e-w
Owner

p-e-w commented Dec 2, 2025

Note that there is also #51 which classifies model behavior based on hidden states, not the response text. The plugin system should cover that and similar approaches. This is a complex problem indeed.

@red40maxxer
Contributor Author

red40maxxer commented Dec 2, 2025

Mmm, the interface makes sense to me. I think enforcing a TypedDict or Pydantic schema for new plugins would be helpful as well; things could get very frustrating for users who would have to contend with certain tagging plugins only being compatible with certain scorers, etc. What level of complexity are you expecting for the scorers though? Would it be arbitrarily complex (e.g. trainable weights for attributes output by taggers), or do we foresee it being relatively simple for the near future?

On another note, I think #51 is actually covered by the code right now, as we instantiate the detector (tagger) plugins with self.settings and self.model so they have access to the hidden states. I'm not sure if that's the best approach though.

@p-e-w
Owner

p-e-w commented Dec 2, 2025

I think enforcing a TypedDict or Pydantic schema for new plugins would be helpful as well; things could get very frustrating for users who would have to contend with certain tagging plugins only being compatible with certain scorers, etc.

Not sure if we need that much type safety. It's probably reasonable to assume that users who experiment with non-default plugin combinations at least understand the structure of the attribute dict. Flexibility might beat safety here.

What level of complexity are you expecting for the scorers though? Would it be arbitrarily complex (e.g. trainable weights for attributes output by taggers), or do we foresee it being relatively simple for the near future?

I expect that most scorers will perform either a sum, a mean, or a tally, possibly followed by a normalization step, with either a positive or a negative sign depending on which direction they should optimize for.
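
For example, a scorer in that spirit could be as small as this sketch (function and parameter names are illustrative):

from typing import Any

def mean_scorer(attributes: list[dict[str, Any]], key: str, minimize: bool = False) -> float:
    # Mean of one tagged attribute, with the sign flipped depending on the optimization direction
    values = [float(a[key]) for a in attributes if key in a]
    mean = sum(values) / len(values) if values else 0.0
    return -mean if minimize else mean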

On another note, I think #51 is actually covered by the code right now, as we instantiate the detector (tagger) plugins with self.settings and self.model so they have access to the hidden states.

How? The tagger would need access to the hidden states for each response, and the responses are generated externally.

@Vinayyyy7
Contributor

I see the vision for the more modular "Tagger + Scorer" architecture you two are discussing...

Defining a stable plugin interface (our socket) is the critical first step before any implementation.

In my pending PR, I implemented a simple interface:

class Plugin(ABC):
    @abstractmethod
    def score(self, responses: list[str]) -> list[float]: ...
    @property
    @abstractmethod
    def minimize(self) -> bool: ...

This essentially combines the Tagger and Scorer into one unit. It works well for single-objective cases (like "just use regex" OR "just use a classifier", i.e. just a single plugin at a time),

but I see how the split approach is superior for composability (e.g. optimizing for BOTH low refusal and high sentiment).

Should we standardize on these two base classes?

Tagger (extracts features):

class Tagger(ABC):
    @abstractmethod
    def tag(self, responses: list[str]) -> list[dict[str, Any]]: ...

Scorer (calculates loss/reward):

class Scorer(ABC):
    @abstractmethod
    def evaluate(self, attributes: list[dict[str, Any]]) -> float: ...

We should think about what the standard interface for all plugins should be (one that can be regularly maintained if someone opens an issue...).

  • So, what standard do we actually need here?

I don't like AttributeGenerator too much because it's long and two words. How about Tagger? The plugins live in taggers, and attach tags to responses. Other ideas very welcome.

Kinda agree; having the word Attribute in the name sounds more like an error in general. A simpler name would be better.

@red40maxxer
Contributor Author

red40maxxer commented Dec 2, 2025

How? The tagger would need access to the hidden states for each response, and the responses are generated externally.

Sorry, the phrasing of my comment was probably misleading.

@lbartoszcze feel free to jump in in case I'm misrepresenting your code, but I don't think it accesses the hidden states for each response at the .generate() level. Instead, the evaluator gets the raw text and feeds it back into the model to re-tokenize the response text and recompute the hidden states for it. Doing a mean pool over the final hidden state is what gets us the sentence-level embedding for semantic similarity classification:

def _get_text_embedding(self, text: str) -> torch.Tensor:
    """
    Get embedding for a text using the model's hidden states.
    Uses mean pooling over the last hidden state.
    """
    inputs = self.model.tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512,
        return_token_type_ids=False,
    ).to(self.model.model.device)

    with torch.no_grad():
        outputs = self.model.model(
            **inputs,
            output_hidden_states=True,
            return_dict=True,
        )
        # Use last hidden state
        hidden_states = outputs.hidden_states[-1]

        # Mean pooling over sequence length (excluding padding)
        attention_mask = inputs["attention_mask"].unsqueeze(-1)
        masked_hidden = hidden_states * attention_mask
        summed = masked_hidden.sum(dim=1)
        counts = attention_mask.sum(dim=1)
        embedding = summed / counts

    return embedding.squeeze(0)

Even though we're doing a second forward pass here, it makes sense to me, since otherwise we'd need to call generate with output_hidden_states=True and then do a bunch of tensor plumbing instead of just operating on strings.
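
As a hypothetical follow-up building on the method quoted above (reference phrases and threshold are arbitrary), the pooled embedding could feed a cosine-similarity refusal check:

import torch
import torch.nn.functional as F

def _is_refusal(self, text: str, refusal_examples: list[str], threshold: float = 0.8) -> bool:
    # Compare the pooled embedding of the response against known refusal phrasings
    emb = self._get_text_embedding(text)
    refs = torch.stack([self._get_text_embedding(r) for r in refusal_examples])
    sims = F.cosine_similarity(emb.unsqueeze(0), refs, dim=-1)
    return sims.max().item() >= threshold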

Regardless, it's still important that plugins are allowed access to hidden states and other model internals beyond the raw text. Maybe we can attach a ResponseMetadata to each response and have those fields configurable by the plugin-specific TOML? I'm imagining something like this:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class ResponseMetadata:
    # basic prompt stuff
    prompt_text: str | None
    response_text: str

    # multi-turn in the future?
    conversation_id: str | None
    turn_index: int | None
    role: str

    # model and generation params
    model_name: str | None
    model_revision: str | None
    generation_params: dict[str, Any]
    finish_reason: str | None
    # ^ this would be really helpful for solving the current issues we have with COT

    # Tokenization
    input_ids: list[int] | None
    response_ids: list[int] | None
    response_tokens: list[str] | None
    response_offsets: list[tuple[int, int]] | None

    # Logprobs / uncertainty
    token_logprobs: list[float] | None
    token_logits: list[float] | None

    # Embeddings
    embedding_model_name: str | None
    response_embedding: list[float] | None
    prompt_embedding: list[float] | None

    # Hidden states / residuals (optional, heavy)
    last_hidden_states: list[list[float]] | None               # [seq_len][hidden_dim]
    residuals_last_token_per_layer: list[list[float]] | None   # [num_layers+1][hidden_dim]

    # Arbitrary plugin-specific extra
    extra: dict[str, Any] = field(default_factory=dict)

And if we want to expose all generation-time internals for some really complex plugins, we can have something like:

@dataclass
class GenerationStep:
    step_index: int
    token_id: int
    token: str
    logprob: float | None
    topk: list[dict[str, float]] | None  # [{"token": "...", "logprob": ...}, ...]
    entropy: float | None                # from logits, if computed

    # Optional internals
    last_hidden_state: list[float] | None          # [hidden_dim]
    residuals_per_layer: list[list[float]] | None  # [num_layers+1][hidden_dim]
    attention_summary: dict[str, Any] | None       # e.g. head-wise max/mean

@dataclass
class GenerationTrace:
    steps: list[GenerationStep]
    finish_reason: str | None

Obviously we'd only expose the most lightweight fields by default and come up with ways to optimize the big model internals later. Scorers would be a lot simpler, since I'm assuming they can only rely on the attributes generated by the taggers and nothing else (for now?).
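
To illustrate the "fields configurable by the plugin-specific TOML" idea (attribute and field names here are hypothetical), a tagger might declare up front which heavy metadata it needs so everything else stays off by default:

from typing import Any

class LogprobTagger:
    # Hypothetical: mirrored from a [LogprobTagger] section in config.toml,
    # e.g. metadata_fields = ["token_logprobs", "finish_reason"]
    metadata_fields = ["token_logprobs", "finish_reason"]

    def tag(self, responses: list[ResponseMetadata]) -> list[dict[str, Any]]:
        return [
            {"mean_logprob": sum(r.token_logprobs) / len(r.token_logprobs) if r.token_logprobs else None}
            for r in responses
        ]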

Let me know what you think!

@lbartoszcze

Hey man, I created a lot of methods to evaluate this without needing the activations, so it works with closed-source models: https://github.com/wisent-ai/uncensorbench. Check my main post and the response from the creator of Heretic here: https://www.reddit.com/r/LocalLLaMA/comments/1pc3iml/comment/nrv9dzr/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

I am hesitant to further contribute to this repo since the creator wants to stay with the (in my opinion and as shown by the data misguided) approach of using keywords.

@p-e-w
Owner

p-e-w commented Dec 3, 2025

@lbartoszcze

I am hesitant to further contribute to this repo since the creator wants to stay with the (in my opinion and as shown by the data misguided) approach of using keywords.

Huh? I never said anything of that sort. Why do you think I'm exploring refusal detector plugins here? It's precisely so alternative approaches can be used.

I have tested the claims you made in your Reddit post, and it appears that there are serious issues with your inference setup, effectively invalidating your analysis. See my comment demonstrating that the responses you posted are broken, and don't even remotely match how the model behaves when loaded correctly with HF Transformers.

@p-e-w
Owner

p-e-w commented Dec 3, 2025

@red40maxxer

Your ResponseMetadata design looks very promising; that's exactly the right approach! Though we need to be careful not to incur any additional computational costs by default, so as you wrote, the plugin should specify that it needs those fields so they don't always have to be calculated.

One improvement I would suggest is to separate response metadata from overall metadata, perhaps best called Context. The ResponseMetadata object should contain only information specific to a response, not general information like the model name. The Tagger should be passed the Context on __init__, and list[ResponseMetadata] each time it needs to tag.
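
A minimal sketch of that split (class and field names are placeholders for discussion, not a final interface):

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Context:
    # General, response-independent information
    model_name: str
    generation_params: dict[str, Any] = field(default_factory=dict)

class Tagger(ABC):
    def __init__(self, context: Context) -> None:
        self.context = context

    @abstractmethod
    def tag(self, responses: list["ResponseMetadata"]) -> list[dict[str, Any]]: ...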

@red40maxxer
Contributor Author

red40maxxer commented Dec 3, 2025

@p-e-w

Since we're now genericising the refusal approach, what nomenclature should we be using? Positive/negative? GPT suggested that instead of refusal_directions, it could be positive_directions, with the good_prompts being "negative" and bad_prompts being positive. Or we could call it "steering", like layer_steering_direction and steering_direction.

Curious to know your thoughts on this, as I don't really know where this is headed. This could also be premature; I know we have a lot of other PRs going on right now, so maybe a limited, backward-compatible change for the tagging + scoring plugins would be the most appropriate.

@p-e-w
Owner

p-e-w commented Dec 3, 2025

steering_directions sounds good, though I agree with deferring the name change to another PR when things have settled down a bit.

@red40maxxer red40maxxer changed the title from "feat: refusal detector plugin system" to "feat: generic plugin system" Dec 4, 2025
@red40maxxer
Contributor Author

red40maxxer commented Dec 5, 2025

@p-e-w what do you think now? I probably have a fair bit of work left to do, but there's now a metadata interface that gives the tagger access to model hidden states, embeddings, tokenization artifacts, etc. Scoring is relatively untouched, given I don't have much of an idea what we'd need to extend it for, but we should be able to start on more fine-grained refusal detection methods and personality steering with the metadata we have access to.

I'm also not sure how this would be integrated with #60, but I'm happy to wait until the next version and rebase

@red40maxxer red40maxxer requested a review from p-e-w February 3, 2026 01:14