
feat: generic plugin system #53

Open
red40maxxer wants to merge 157 commits into p-e-w:master from red40maxxer:refusal-plugins

Conversation

@red40maxxer
Contributor

@red40maxxer red40maxxer commented Nov 27, 2025

Refactors refusal detection and refusal counting into a modular plugin system. This allows for arbitrary logic (regex, embeddings, external models) to be added without modifying core code.

Plugin system

  • Scorers take in responses and optional response metadata and produce a single Score that Optuna can use for optimization, e.g. ["I can't help you with that.", "Certainly!"] -> 0.67. By default, these are the KL divergence and the refusal rate.
  • You can add a new plugin by subclassing Scorer, implementing the abstract methods, and adding ScorerClassName.property entries in config.toml for any user-defined configuration you need. Make sure the config fields are defined in a PluginSettings; look at RefusalRate and KLDivergence for reference (see the sketch after this list).
  • Convenience methods for the Scorer to access important outputs from Model are provided through Context, such as get_responses(), get_logits(), etc.
  • Multiple instances of the same scorer are supported with optional configuration overrides in the config TOML. These are defined at the top level along with the optimization direction and scaling factor (if needed). To configure those instances specifically, create a table with the name ScorerClassName_InstanceName.
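
To make the subclassing and configuration flow concrete, here is a rough sketch of what a custom scorer might look like under this design. The import path, the score() method name, and the settings layout are illustrative assumptions rather than the actual interface; RefusalRate and KLDivergence remain the authoritative references.

# Hypothetical sketch only: import path, method name, and settings layout are assumptions.
import re
from dataclasses import dataclass

from heretic.plugins import Context, PluginSettings, Scorer  # path is illustrative

@dataclass
class RegexRefusalSettings(PluginSettings):
    # configurable as RegexRefusal.pattern in config.toml
    pattern: str = r"(?i)\bI can(?:no|')t\b"

class RegexRefusal(Scorer):
    def score(self, context: Context) -> float:
        responses = context.get_responses()
        hits = sum(bool(re.search(self.settings.pattern, r)) for r in responses)
        return hits / max(len(responses), 1)

A second instance with its own overrides would then live in its own table, e.g. [RegexRefusal_Strict] in config.toml.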

@red40maxxer red40maxxer marked this pull request as ready for review November 27, 2025 01:17
@p-e-w
Owner

p-e-w commented Nov 28, 2025

Thanks for the PR!

As mentioned previously, I am indeed planning to add a plugin system for several components of Heretic. IMO, the interface you propose is not generic enough.

Instead, we want plugins that can attach arbitrary attributes to responses. The refusal case would be covered by attaching the mapping

{"is_refusal": True}

Another plugin might use sentiment analysis to produce the attributes

{"positivity": 0.81, "formality": 0.36}

Then an evaluation plugin is passed the list of attribute maps for the responses, and produces a score that can be optimized by the optimizer. This will allow doing all kinds of things that are currently impossible, not just writing a different implementation of refusal detection.
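
As a purely illustrative picture of that data flow (not a proposed API), responses, their attribute maps, and a derived score might line up like this:

# Illustrative data only: one attribute plugin tags refusals, an evaluation plugin scores them.
responses = ["I can't help you with that.", "Sure, here's how it works..."]
attributes = [{"is_refusal": True}, {"is_refusal": False}]
refusal_rate = sum(a["is_refusal"] for a in attributes) / len(attributes)  # 0.5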

Even that is just a vague idea right now though. It's probably a good idea to discuss the plugin abstraction some more before proceeding with implementation.

@red40maxxer
Contributor Author

That makes sense, thanks for the explanation. Before going further, would something structurally similar to what I proposed (in terms of the import/loading mechanism and how plugins are registered/used) also be desirable for the broader attribute-generation and evaluation plugin systems you have in mind? If so, I can adapt the design toward that more general interface rather than building something too narrowly scoped.

Happy to iterate on the abstraction before implementing anything concrete.

@p-e-w
Owner

p-e-w commented Dec 2, 2025

would something structurally similar to what I proposed (in terms of the import/loading mechanism and how plugins are registered/used) also be desirable for the broader attribute-generation and evaluation plugin systems you have in mind?

Yes, I think your plugin loading logic looks good. Instead of a single plugins folder, we probably want one folder per plugin type, e.g. processors and evaluators.

I don't think we need the plugin interface to have a single-response detect method. Everything should operate on batches because that's the most flexible and performant approach. Likewise, evaluator plugins should receive the list of classifier outputs for all responses, and return a single score.

Not sure about the correct naming yet. "Classifiers"/"AttributeGenerators"? "Evaluators"/"Scorers"?

@red40maxxer
Contributor Author

Would we support multiple classifier plugins in parallel or in sequence? For example, I can imagine a situation where we'd have two plugins, one that uses regex and another that uses some other form of magic to detect refusals in order to overcome nondeterminism, such as when the prompt doesn't contain the full <think></think> block due to token limitations or whatnot.

AttributeGenerator and Scorer sound pretty reasonable to me tbh

@p-e-w
Owner

p-e-w commented Dec 2, 2025

I don't think we should support using multiple classifier plugins at the same time. There's always some use case that one can construct, but complexity just balloons, and Heretic can't cover every possible everything.

I don't like AttributeGenerator too much because it's long and two words. How about Tagger? The plugins live in taggers, and attach tags to responses. Other ideas very welcome.

The basic interface might look like this:

  • Taggers: list[str] -> list[dict[str, Any]] (responses to attributes)
  • Scorers: list[dict[str, Any]] -> float (attributes to score)
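
A minimal sketch of those two signatures as plain functions (the refusal heuristic below is just a placeholder, not a proposed implementation):

from typing import Any

def tag(responses: list[str]) -> list[dict[str, Any]]:
    # Tagger: attach attributes to each response
    return [{"is_refusal": r.lstrip().lower().startswith(("i can't", "i cannot"))} for r in responses]

def score(attributes: list[dict[str, Any]]) -> float:
    # Scorer: collapse the per-response attributes into one number the optimizer can use
    return sum(a["is_refusal"] for a in attributes) / max(len(attributes), 1)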

Does that make sense?

@p-e-w
Owner

p-e-w commented Dec 2, 2025

Note that there is also #51 which classifies model behavior based on hidden states, not the response text. The plugin system should cover that and similar approaches. This is a complex problem indeed.

@red40maxxer
Contributor Author

red40maxxer commented Dec 2, 2025

Mmm, the interface makes sense to me. I think enforcing a TypedDict or Pydantic schema for new plugins would be helpful as well; things could get very frustrating for users who would have to contend with certain tagging plugins only being compatible with certain scorers, etc. What level of complexity are you expecting for the scorers though? Would it be arbitrarily complex (e.g. trainable weights for attributes output by taggers), or do we foresee it being relatively simple for the near future?

On another note, I think #51 is actually covered by the code right now, as we instantiate the detector (tagger) plugins with self.settings and self.model so they have access to the hidden states. I'm not sure if that's the best approach though.

@p-e-w
Owner

p-e-w commented Dec 2, 2025

I think enforcing a TypedDict or Pydantic schema for new plugins would be helpful as well; things could get very frustrating for users who would have to contend with certain tagging plugins only being compatible with certain scorers, etc.

Not sure if we need that much type safety. It's probably reasonable to assume that users who experiment with non-default plugin combinations at least understand the structure of the attribute dict. Flexibility might beat safety here.

What level of complexity are you expecting for the scorers though? Would it be arbitrarily complex (e.g. trainable weights for attributes output by taggers), or do we foresee it being relatively simple for the near future?

I expect that most scorers will perform either a sum, a mean, or a tally, possibly followed by a normalization step, with either a positive or a negative sign depending on which direction they should optimize for.
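
For example, a scorer in that spirit could be as small as this sketch (function and parameter names are illustrative):

from typing import Any

def mean_scorer(attributes: list[dict[str, Any]], key: str, minimize: bool = False) -> float:
    # Mean of one tagged attribute, with the sign flipped depending on the optimization direction
    values = [float(a[key]) for a in attributes if key in a]
    mean = sum(values) / len(values) if values else 0.0
    return -mean if minimize else mean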

On another note, I think #51 is actually covered by the code right now, as we instantiate the detector (tagger) plugins with self.settings and self.model so they have access to the hidden states.

How? The tagger would need access to the hidden states for each response, and the responses are generated externally.

@Vinayyyy7
Contributor

I see the vision for the more modular "Tagger + Scorer" architecture you two are discussing...

Defining a stable plugin interface (our socket) is the critical first step before any implementation.

In my pending PR, I implemented a simple interface:

class Plugin(ABC):
    @abstractmethod
    def score(self, responses: list[str]) -> list[float]: ...
    @property
    @abstractmethod
    def minimize(self) -> bool: ...

This essentially combines the Tagger and Scorer into one unit. It works well for single-objective cases (like "just use regex" OR "just use a classifier", i.e. just a single plugin at a time),

but I see how the split approach is superior for composability (e.g. optimizing for BOTH low refusal and high sentiment).

Should we standardize on these two base classes?

Tagger (extracts features):

class Tagger(ABC):
    @abstractmethod
    def tag(self, responses: list[str]) -> list[dict[str, Any]]: ...

Scorer (calculates loss/reward):

class Scorer(ABC):
    @abstractmethod
    def evaluate(self, attributes: list[dict[str, Any]]) -> float: ...

We should think about what the standard interface for all plugins should be (one that can be regularly maintained if someone opens an issue...).

  • So, what standard do we actually need here?

I don't like AttributeGenerator too much because it's long and two words. How about Tagger? The plugins live in taggers, and attach tags to responses. Other ideas very welcome.

Kinda agree; having the word Attribute in the name sounds more like an error in general. A simpler name would be better.

@red40maxxer
Contributor Author

red40maxxer commented Dec 2, 2025

How? The tagger would need access to the hidden states for each response, and the responses are generated externally.

Sorry, the phrasing of my comment was probably misleading.

@lbartoszcze feel free to jump in in case I'm misrepresenting your code, but I don't think it accesses the hidden states for each response at the .generate() level. Instead, the evaluator gets the raw text and feeds it back into the model to re-tokenize the response text and recompute the hidden states for it. Doing a mean pool over the final hidden state is what gets us the sentence-level embedding for semantic similarity classification:

def _get_text_embedding(self, text: str) -> torch.Tensor:
    """
    Get embedding for a text using the model's hidden states.
    Uses mean pooling over the last hidden state.
    """
    inputs = self.model.tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512,
        return_token_type_ids=False,
    ).to(self.model.model.device)

    with torch.no_grad():
        outputs = self.model.model(
            **inputs,
            output_hidden_states=True,
            return_dict=True,
        )
        # Use last hidden state
        hidden_states = outputs.hidden_states[-1]

        # Mean pooling over sequence length (excluding padding)
        attention_mask = inputs["attention_mask"].unsqueeze(-1)
        masked_hidden = hidden_states * attention_mask
        summed = masked_hidden.sum(dim=1)
        counts = attention_mask.sum(dim=1)
        embedding = summed / counts

    return embedding.squeeze(0)

Even though we're doing a second forward pass here, it makes sense to me, since otherwise we'd need to call generate with output_hidden_states=True and then do a bunch of tensor plumbing instead of just operating on strings.
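
As a hypothetical follow-up building on the method quoted above (reference phrases and threshold are arbitrary), the pooled embedding could feed a cosine-similarity refusal check:

import torch
import torch.nn.functional as F

def _is_refusal(self, text: str, refusal_examples: list[str], threshold: float = 0.8) -> bool:
    # Compare the pooled embedding of the response against known refusal phrasings
    emb = self._get_text_embedding(text)
    refs = torch.stack([self._get_text_embedding(r) for r in refusal_examples])
    sims = F.cosine_similarity(emb.unsqueeze(0), refs, dim=-1)
    return sims.max().item() >= threshold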

Regardless, it's still important that plugins are allowed access to hidden states and other model internals beyond the raw text. Maybe we can attach a ResponseMetadata to each response and have those fields configurable by the plugin-specific TOML? I'm imagining something like this:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class ResponseMetadata:
    # basic prompt stuff
    prompt_text: str | None
    response_text: str

    # multi-turn in the future?
    conversation_id: str | None
    turn_index: int | None
    role: str

    # model and generation params
    model_name: str | None
    model_revision: str | None
    generation_params: dict[str, Any]
    finish_reason: str | None
    # ^ this would be really helpful for solving the current issues we have with COT

    # Tokenization
    input_ids: list[int] | None
    response_ids: list[int] | None
    response_tokens: list[str] | None
    response_offsets: list[tuple[int, int]] | None

    # Logprobs / uncertainty
    token_logprobs: list[float] | None
    token_logits: list[float] | None

    # Embeddings
    embedding_model_name: str | None
    response_embedding: list[float] | None
    prompt_embedding: list[float] | None

    # Hidden states / residuals (optional, heavy)
    last_hidden_states: list[list[float]] | None               # [seq_len][hidden_dim]
    residuals_last_token_per_layer: list[list[float]] | None   # [num_layers+1][hidden_dim]

    # Arbitrary plugin-specific extra
    extra: dict[str, Any] = field(default_factory=dict)

And if we want to expose all generation-time internals for some really complex plugins, we can have something like:

@dataclass
class GenerationStep:
    step_index: int
    token_id: int
    token: str
    logprob: float | None
    topk: list[dict[str, float]] | None  # [{"token": "...", "logprob": ...}, ...]
    entropy: float | None                # from logits, if computed

    # Optional internals
    last_hidden_state: list[float] | None          # [hidden_dim]
    residuals_per_layer: list[list[float]] | None  # [num_layers+1][hidden_dim]
    attention_summary: dict[str, Any] | None       # e.g. head-wise max/mean

@dataclass
class GenerationTrace:
    steps: list[GenerationStep]
    finish_reason: str | None

Obviously we'd only expose the most lightweight fields by default and come up with ways to optimize the big model internals later. Scorers would be a lot simpler, since I'm assuming they can only rely on the attributes generated by the taggers and nothing else (for now?).
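
To illustrate the "fields configurable by the plugin-specific TOML" idea (attribute and field names here are hypothetical), a tagger might declare up front which heavy metadata it needs so everything else stays off by default:

from typing import Any

class LogprobTagger:
    # Hypothetical: mirrored from a [LogprobTagger] section in config.toml,
    # e.g. metadata_fields = ["token_logprobs", "finish_reason"]
    metadata_fields = ["token_logprobs", "finish_reason"]

    def tag(self, responses: list[ResponseMetadata]) -> list[dict[str, Any]]:
        return [
            {"mean_logprob": sum(r.token_logprobs) / len(r.token_logprobs) if r.token_logprobs else None}
            for r in responses
        ]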

Let me know what you think!

@lbartoszcze

Hey man, I created a lot of methods to evaluate this without needing the activations, so it works with closed-source models: https://github.com/wisent-ai/uncensorbench. Check my main post and the response from the creator of Heretic here: https://www.reddit.com/r/LocalLLaMA/comments/1pc3iml/comment/nrv9dzr/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

I am hesitant to further contribute to this repo since the creator wants to stay with the (in my opinion and as shown by the data misguided) approach of using keywords.

@p-e-w
Owner

p-e-w commented Dec 3, 2025

@lbartoszcze

I am hesitant to further contribute to this repo since the creator wants to stay with the (in my opinion and as shown by the data misguided) approach of using keywords.

Huh? I never said anything of that sort. Why do you think I'm exploring refusal detector plugins here? It's precisely so alternative approaches can be used.

I have tested the claims you made in your Reddit post, and it appears that there are serious issues with your inference setup, effectively invalidating your analysis. See my comment demonstrating that the responses you posted are broken, and don't even remotely match how the model behaves when loaded correctly with HF Transformers.

@p-e-w
Owner

p-e-w commented Dec 3, 2025

@red40maxxer

Your ResponseMetadata design looks very promising; that's exactly the right approach! Though we need to be careful not to incur any additional computational costs by default, so as you wrote, the plugin should specify that it needs those fields so they don't always have to be calculated.

One improvement I would suggest is to separate response metadata from overall metadata, perhaps best called Context. The ResponseMetadata object should contain only information specific to a response, not general information like the model name. The Tagger should be passed the Context on __init__, and list[ResponseMetadata] each time it needs to tag.
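
A minimal sketch of that split (class and field names are placeholders for discussion, not a final interface):

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Context:
    # General, response-independent information
    model_name: str
    generation_params: dict[str, Any] = field(default_factory=dict)

class Tagger(ABC):
    def __init__(self, context: Context) -> None:
        self.context = context

    @abstractmethod
    def tag(self, responses: list["ResponseMetadata"]) -> list[dict[str, Any]]: ...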

@red40maxxer
Contributor Author

red40maxxer commented Dec 3, 2025

@p-e-w

Since we're now genericising the refusal approach, what nomenclature should we be using? Positive/negative? GPT suggested that instead of refusal_directions, it could be positive_directions, with the good_prompts being "negative" and bad_prompts being positive. Or we could call it "steering", like layer_steering_direction and steering_direction.

Curious to know your thoughts on this, as I don't really know where this is headed. This could also be premature; I know we have a lot of other PRs going on right now, so maybe a limited, backward-compatible change for the tagging + scoring plugins would be the most appropriate.

@p-e-w
Owner

p-e-w commented Dec 3, 2025

steering_directions sounds good, though I agree with deferring the name change to another PR when things have settled down a bit.

@red40maxxer red40maxxer changed the title from "feat: refusal detector plugin system" to "feat: generic plugin system" Dec 4, 2025
@red40maxxer
Contributor Author

red40maxxer commented Dec 5, 2025

@p-e-w what do you think now? I probably have a fair bit of work left to do, but there's now a metadata interface that gives the tagger access to model hidden states, embeddings, tokenization artifacts, etc. Scoring is relatively untouched, given I don't have much of an idea what we'd need to extend it for, but we should be able to start on more fine-grained refusal detection methods and personality steering with the metadata we have access to.

I'm also not sure how this would be integrated with #60, but I'm happy to wait until the next version and rebase

@red40maxxer red40maxxer requested a review from p-e-w February 3, 2026 01:14