Conversation
Thanks for the PR! As mentioned previously, I am indeed planning to add a plugin system for several components of Heretic. IMO, the interface you propose is not generic enough. Instead, we want plugins that can attach arbitrary attributes to responses. The refusal case would be covered by attaching the mapping `{"is_refusal": True}`. Another plugin might use sentiment analysis to produce the attributes `{"positivity": 0.81, "formality": 0.36}`. Then an evaluation plugin is passed the list of attribute maps for the responses, and produces a score that can be optimized by the optimizer. This will allow doing all kinds of things that are currently impossible, not just writing a different implementation of refusal detection. Even that is just a vague idea right now, though. It's probably a good idea to discuss the plugin abstraction some more before proceeding with implementation.
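To illustrate the data flow (purely a sketch; the attribute names and numbers are made up for the example):

```python
# Illustrative only: attribute maps as two hypothetical plugins might attach them.
attribute_maps = [
    {"is_refusal": True, "positivity": 0.12, "formality": 0.40},
    {"is_refusal": False, "positivity": 0.81, "formality": 0.36},
]

# A refusal-oriented evaluation plugin might then reduce them to a single score,
# e.g. the refusal rate, which the optimizer can work with.
score = sum(m["is_refusal"] for m in attribute_maps) / len(attribute_maps)
print(score)  # 0.5
```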
That makes sense, thanks for the explanation. Before going further, would something structurally similar to what I proposed (in terms of the import/loading mechanism and how plugins are registered/used) also be desirable for the broader attribute-generation and evaluation plugin systems you have in mind? If so, I can adapt the design toward that more general interface rather than building something too narrowly scoped. Happy to iterate on the abstraction before implementing anything concrete.
Yes, I think your plugin loading logic looks good. Instead of a single plugin type, I'm thinking of separate plugins for attribute generation and evaluation, as described above. I don't think we need the plugin interface to have a single-response method. Not sure about the correct naming yet. "Classifiers"/"AttributeGenerators"? "Evaluators"/"Scorers"?
Would we support multiple classifier plugins in parallel or in sequence? For example, I can imagine a situation where we'd have two plugins, one that uses regex and another that uses some other form of magic to detect refusals, in order to overcome nondeterminism, such as when the prompt doesn't contain the full
I don't think we should support using multiple classifier plugins at the same time. There's always some use case that one can construct, but complexity just balloons, and Heretic can't cover every possible use case. The basic interface might look like this:
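A minimal sketch, assuming a tagger that attaches attribute dicts to responses and a scorer that reduces them to a single float (class and method names are placeholders):

```python
from abc import ABC, abstractmethod
from typing import Any


class Tagger(ABC):
    """Attaches arbitrary attributes to each response."""

    @abstractmethod
    def tag(self, responses: list[str]) -> list[dict[str, Any]]: ...


class Scorer(ABC):
    """Reduces the attribute maps for all responses to a single optimizable score."""

    @abstractmethod
    def evaluate(self, attributes: list[dict[str, Any]]) -> float: ...
```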
Does that make sense?
Note that there is also #51 which classifies model behavior based on hidden states, not the response text. The plugin system should cover that and similar approaches. This is a complex problem indeed.
Mmm, the interface makes sense to me. I think enforcing a TypedDict or Pydantic schema for new plugins would be helpful as well; things could get very frustrating for users who would have to contend with certain tagging plugins only being compatible with certain scorers, etc. What level of complexity are you expecting for the scorers, though? Would they be arbitrarily complex (e.g. trainable weights for attributes output by taggers), or do we foresee them being relatively simple for the near future? On another note, I think #51 is actually covered by the code right now, since we instantiate the detector (tagger) plugins with a reference to the model.
Not sure if we need that much type safety. It's probably reasonable to assume that users who experiment with non-default plugin combinations at least understand the structure of the attribute dict. Flexibility might beat safety here.
I expect that most scorers will perform either a sum, a mean, or a tally, possibly followed by a normalization step, with either a positive or a negative sign depending on which direction they should optimize for.
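For example, a refusal-rate scorer could be as simple as this sketch (assuming the `Scorer` shape sketched above; the class name and sign convention are just illustrative):

```python
class RefusalRateScorer(Scorer):
    """Sketch: mean of a boolean attribute, with a sign chosen so that
    the optimizer pushes the refusal rate down."""

    def evaluate(self, attributes: list[dict[str, Any]]) -> float:
        if not attributes:
            return 0.0
        rate = sum(bool(a.get("is_refusal", False)) for a in attributes) / len(attributes)
        return -rate  # negate if the optimizer maximizes the score
```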
How? The tagger would need access to the hidden states for each response, and the responses are generated externally.
I see the vision for the more modular "Tagger + Scorer" architecture you two are discussing... Defining a stable plugin interface (our socket) is the critical first step before any implementation. In my pending PR, I implemented a simple interface:

```python
class Plugin(ABC):
    def score(self, responses: list[str]) -> list[float]: ...

    @property
    def minimize(self) -> bool: ...
```

but I see how the split approach is superior for composability (e.g. optimizing for BOTH low refusal and high sentiment). Should we standardize on these two base classes?

Tagger (extracts features):

```python
class Tagger(ABC):
    def tag(self, responses: list[str]) -> list[dict[str, Any]]: ...
```

Scorer (calculates loss/reward):

```python
class Scorer(ABC):
    def evaluate(self, attributes: list[dict[str, Any]]) -> float: ...
```

We should think about what the standard interface for all plugins should be (which can be regularly maintained if someone opens an issue...).
Kinda agree like having the word
Sorry, the phrasing of my comment was probably misleading. @lbartoszcze feel free to jump in in case I'm misrepresenting your code, but I don't think it accesses the hidden states for each response at generation time:

```python
def _get_text_embedding(self, text: str) -> torch.Tensor:
    """
    Get embedding for a text using the model's hidden states.
    Uses mean pooling over the last hidden state.
    """
    inputs = self.model.tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512,
        return_token_type_ids=False,
    ).to(self.model.model.device)
    with torch.no_grad():
        outputs = self.model.model(
            **inputs,
            output_hidden_states=True,
            return_dict=True,
        )
    # Use last hidden state
    hidden_states = outputs.hidden_states[-1]
    # Mean pooling over sequence length (excluding padding)
    attention_mask = inputs["attention_mask"].unsqueeze(-1)
    masked_hidden = hidden_states * attention_mask
    summed = masked_hidden.sum(dim=1)
    counts = attention_mask.sum(dim=1)
    embedding = summed / counts
    return embedding.squeeze(0)
```

Even though we're doing a second forward pass here, it makes sense to me. Regardless, it's still important that plugins are allowed access to hidden states and other model internals, not just raw text. Maybe we can attach a `ResponseMetadata` object to each response, something like:

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ResponseMetadata:
    # basic prompt stuff
    prompt_text: str | None
    response_text: str
    # multi-turn in the future?
    conversation_id: str | None
    turn_index: int | None
    role: str
    # model and generation params
    model_name: str | None
    model_revision: str | None
    generation_params: Dict[str, Any]
    finish_reason: str | None
    # ^ this would be really helpful for solving the current issues we have with COT
    # Tokenization
    input_ids: List[int] | None
    response_ids: List[int] | None
    response_tokens: List[str] | None
    response_offsets: List[tuple[int, int]] | None
    # Logprobs / uncertainty
    token_logprobs: List[float] | None
    token_logits: List[float] | None
    # Embeddings
    embedding_model_name: str | None
    response_embedding: List[float] | None
    prompt_embedding: List[float] | None
    # Hidden states / residuals (optional, heavy)
    last_hidden_states: List[List[float]] | None  # [seq_len][hidden_dim]
    residuals_last_token_per_layer: List[List[float]] | None  # [num_layers+1][hidden_dim]
    # Arbitrary plugin-specific extra
    extra: Dict[str, Any] | None = None
```

And if we want to expose all generation-time internals for some really complex plugins, we can have something like:

```python
@dataclass
class GenerationStep:
    step_index: int
    token_id: int
    token: str
    logprob: float | None
    topk: List[Dict[str, float]] | None  # [{"token": "...", "logprob": ...}, ...]
    entropy: float | None  # from logits, if computed
    # Optional internals
    last_hidden_state: List[float] | None  # [hidden_dim]
    residuals_per_layer: List[List[float]] | None  # [num_layers+1][hidden_dim]
    attention_summary: Dict[str, Any] | None  # e.g. head-wise max/mean


@dataclass
class GenerationTrace:
    steps: List[GenerationStep]
    finish_reason: str | None
```

Obviously we'd only expose the most lightweight fields in the default configuration, and come up with ways to optimize the big model internals later. Scorers would be a lot simpler, since I'm assuming they can only rely on the attributes generated by the taggers and nothing else (for now?). Let me know what you think!
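As a rough illustration of the kind of tagger this metadata would enable (everything here is hypothetical: the function name, the attribute keys, the similarity threshold, and the assumption that a reference "refusal" embedding is available):

```python
import numpy as np

# Assumes the ResponseMetadata dataclass sketched above.


def tag_refusals_by_embedding(
    metadata: list[ResponseMetadata],
    refusal_embedding: np.ndarray,
    threshold: float = 0.8,
) -> list[dict[str, object]]:
    """Flag responses whose embedding is close to a reference refusal embedding."""
    attributes: list[dict[str, object]] = []
    for m in metadata:
        if m.response_embedding is None:
            attributes.append({"is_refusal": False, "refusal_similarity": None})
            continue
        v = np.asarray(m.response_embedding, dtype=np.float64)
        sim = float(v @ refusal_embedding) / (
            np.linalg.norm(v) * np.linalg.norm(refusal_embedding)
        )
        attributes.append({"is_refusal": sim >= threshold, "refusal_similarity": sim})
    return attributes
```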
Hey man, I created a lot of methods to evaluate this without needing the activations, so it works with closed-source models: https://github.com/wisent-ai/uncensorbench. Check my main post and the response from the creator of Heretic here: https://www.reddit.com/r/LocalLLaMA/comments/1pc3iml/comment/nrv9dzr/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button I am hesitant to further contribute to this repo since the creator wants to stay with the (in my opinion, and as shown by the data, misguided) approach of using keywords.
Huh? I never said anything of that sort. Why do you think I'm exploring refusal detector plugins here? It's precisely so alternative approaches can be used. I have tested the claims you made in your Reddit post, and it appears that there are serious issues with your inference setup, effectively invalidating your analysis. See my comment demonstrating that the responses you posted are broken, and don't even remotely match how the model behaves when loaded correctly with HF Transformers.
Regarding your `ResponseMetadata` sketch: one improvement I would suggest is to separate per-response metadata from the overall metadata, perhaps under a separate, appropriately named class.
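Roughly something like this (a sketch only; `RunMetadata` is a placeholder name, and the field grouping is just one possibility):

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class RunMetadata:  # hypothetical name for the "overall" metadata
    model_name: str | None
    model_revision: str | None
    generation_params: Dict[str, Any]
    embedding_model_name: str | None


@dataclass
class ResponseMetadata:  # keeps only the per-response fields
    prompt_text: str | None
    response_text: str
    finish_reason: str | None
    response_ids: List[int] | None
    token_logprobs: List[float] | None
    response_embedding: List[float] | None
    extra: Dict[str, Any] | None = None
```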
Since we're now genericising the refusal approach, what nomenclature should we be using? Positive/negative? Curious to know your thoughts on this, as I don't really know where this is headed. This could also be premature; I know we have a lot of other PRs going on right now, so maybe a limited backward-compatible change for the tagging + scoring plugins would be the most appropriate.
@p-e-w what do you think now? I probably have a fair bit of work left to do, but there's now a metadata interface that gives the tagger access to model hidden states, embeddings, tokenization artifacts, etc. Scoring is relatively untouched, since I don't have much of an idea what we'd need to extend it for, but we should be able to start on more fine-grained refusal detection methods and personality steering with the metadata we have access to. I'm also not sure how this would be integrated with #60, but I'm happy to wait until the next version and rebase.
Co-authored-by: Philipp Emanuel Weidmann <pew@worldwidemann.com>
Refactors refusal detection and refusal counting into a modular plugin system. This allows for arbitrary logic (regex, embeddings, external models) to be added without modifying core code.
Plugin system
- `Scorers` take in responses and optional response metadata and produce a single `Score` that Optuna can use for optimization, e.g. `["I can't help you with that.", "Certainly!"] -> 0.67`. By default, this is the KL divergence and refusal rate.
- To add a new scorer, subclass `Scorer`, adding implementations for the abstract methods, and add a `ScorerClassName.property` entry in the `config.toml` for any user-defined configuration you need. Make sure the config fields are defined in a `PluginSettings` - look at `RefusalRate` and `KLDivergence` for reference.
- Ways for a `Scorer` to access important outputs from `Model` are provided through `Context`, such as `get_responses()`, `get_logits()`, etc.
- Scorer instances are named `ScorerClassName_InstanceName`.
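A rough sketch of what a user-defined scorer might look like under this scheme (the method name, the settings plumbing, and the exact `Context` accessors shown here are assumptions based on the description above, not the actual Heretic API):

```python
# Hypothetical sketch: class names, the evaluate() signature, and self.settings
# are assumptions for illustration only.
class ResponseLengthSettings(PluginSettings):
    target_length: int = 200  # configured in config.toml, e.g. under ResponseLength


class ResponseLength(Scorer):
    """Example scorer that prefers responses close to a target length,
    showing that scorers can optimize things other than refusals."""

    def evaluate(self, context: Context) -> float:
        responses = context.get_responses()
        if not responses:
            return 0.0
        deviations = [
            abs(len(r) - self.settings.target_length) / self.settings.target_length
            for r in responses
        ]
        return sum(deviations) / len(deviations)  # lower is better
```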