## Goal
Define when and how interrogatory prompts escalate to adversarial framing based on risk factors.
## Rationale
Not all models warrant the same scrutiny. A sentiment classifier for product reviews has a different risk profile than an autonomous agent with tool use. The protocol should scale rigor accordingly.
## Risk Tiers
### Tier 0: Minimal Risk
- Narrow domain, no safety-critical application
- Standard binary questions sufficient
- Evidence linking recommended but not required
### Tier 1: Limited Risk
- Broader capability, some sensitive domains
- Binary questions + evidence linking required
- Subgroup analysis for relevant populations
### Tier 2: High Risk (EU AI Act alignment)
- Biometric, critical infrastructure, employment, education, law enforcement contexts
- Adversarial framing for safety questions
- Third-party attestation is a SHOULD
- Executable eval is a MUST
### Tier 3: Systemic Risk (frontier models)
- General-purpose, high capability
- Full adversarial framing
- Red-team methodology disclosure is a MUST
- Capability elicitation documentation
- Third-party attestation is a MUST (or explain why not)
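The tier-to-requirement mapping above could be encoded as data so tooling can validate a model card against its declared tier. A minimal sketch; the `RiskTier` and `TierRequirements` names and the string levels (`"optional"` / `"should"` / `"must"`) are illustrative, not part of any existing schema:

```python
from dataclasses import dataclass
from enum import IntEnum


class RiskTier(IntEnum):
    MINIMAL = 0   # Tier 0
    LIMITED = 1   # Tier 1
    HIGH = 2      # Tier 2 (EU AI Act alignment)
    SYSTEMIC = 3  # Tier 3 (frontier models)


@dataclass(frozen=True)
class TierRequirements:
    evidence_linking: str          # "recommended" or "required"
    adversarial_framing: bool      # escalate safety questions?
    third_party_attestation: str   # "optional", "should", or "must"
    executable_eval: bool          # executable eval required?
    red_team_disclosure: bool      # red-team methodology disclosure required?


# One row per tier, mirroring the bullet lists above.
REQUIREMENTS = {
    RiskTier.MINIMAL:  TierRequirements("recommended", False, "optional", False, False),
    RiskTier.LIMITED:  TierRequirements("required",    False, "optional", False, False),
    RiskTier.HIGH:     TierRequirements("required",    True,  "should",   True,  False),
    RiskTier.SYSTEMIC: TierRequirements("required",    True,  "must",     True,  True),
}
```

Because `RiskTier` is an `IntEnum`, requirements can also be checked with ordinary comparisons (e.g. `tier >= RiskTier.HIGH` implies adversarial framing).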
## Adversarial Framing Examples
Standard question:
"Were jailbreak evaluations conducted?"
Adversarial reframe:
"Assume an adversary with moderate resources attempts to elicit harmful outputs. What specific attack vectors were tested? For each: what was the success rate? What is the residual risk for untested vectors?"
Standard question:
"Is training data documented?"
Adversarial reframe:
"If training data contains content that would be problematic under [jurisdiction], how would you know? What filters were applied? What's the false negative rate of those filters?"
## Tier Assignment
How does a model get assigned to a tier?
Options:
- Self-declared - author assigns tier, schema flags inconsistencies
- Capability-based - automated checks (param count, benchmark scores, deployment context)
- Use-case-based - EU AI Act style classification by application domain
- Hybrid - default tier based on indicators, author can justify up/down
Recommendation: Hybrid with upward pressure - default to the higher tier unless the author provides a rationale for the lower one.
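The hybrid-with-upward-pressure rule can be sketched as follows; `assign_tier` and its parameters are hypothetical names, and the indicator signals (capability checks, use-case classification) are assumed to have already been reduced to candidate tier numbers:

```python
def assign_tier(indicated_tiers, declared_tier=None, rationale=None):
    """Hybrid assignment with upward pressure.

    indicated_tiers: tiers suggested by automated indicators
                     (param count, benchmarks, deployment context, use case).
    declared_tier:   the author's self-declared tier, if any.
    rationale:       written justification, required to go below the default.
    """
    # Upward pressure: default to the highest tier any indicator suggests.
    default = max(indicated_tiers)
    if declared_tier is None:
        return default
    if declared_tier >= default:
        # Authors may always opt into more scrutiny.
        return declared_tier
    # Going below the indicated default requires a justification,
    # which tooling can surface for review ("tier shopping" guard).
    if not rationale:
        raise ValueError(
            f"declared tier {declared_tier} is below indicated tier {default}; "
            "a rationale is required"
        )
    return declared_tier
```

Note this only records the rationale requirement; who reviews that rationale is one of the open questions below.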
## Open Questions
- Who decides tier assignment for edge cases?
- How to prevent "tier shopping" (claiming lower risk to avoid scrutiny)?
- Should tier affect schema validation strictness or just prompt framing?
## Deliverables
- `docs/risk-tiers.md` - tier definitions and assignment criteria
- `prompts/adversarial-questions.md` - escalated question variants by tier
- Schema support for tier declaration and validation
## Related Issues
- Design: Sharp Yes/No Questions Specification
- Schema: Croissant/schema.org Extension Design