## Goal
Define when and how interrogatory prompts escalate to adversarial framing based on risk factors.
## Rationale
Not all models warrant the same scrutiny. A sentiment classifier for product reviews has a different risk profile than an autonomous agent with tool use. The protocol should scale rigor accordingly.
## Risk Tiers
### Tier 0: Minimal Risk
- Narrow domain, no safety-critical application
- Standard binary questions sufficient
- Evidence linking recommended but not required
### Tier 1: Limited Risk
- Broader capability, some sensitive domains
- Binary questions + evidence linking required
- Subgroup analysis for relevant populations
### Tier 2: High Risk (EU AI Act alignment)
- Biometric, critical infrastructure, employment, education, law enforcement contexts
- Adversarial framing for safety questions
- Third-party attestation is a SHOULD
- Executable eval is a MUST
### Tier 3: Systemic Risk (frontier models)
- General-purpose, high capability
- Full adversarial framing
- Red-team methodology disclosure is a MUST
- Capability elicitation documentation
- Third-party attestation is a MUST (or explain why not)
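The tier-to-requirement mapping above could be encoded as data so tooling can validate a model card against its declared tier. A minimal sketch; the `RiskTier` and `TierRequirements` names and the string levels (`"optional"` / `"should"` / `"must"`) are illustrative, not part of any existing schema:

```python
from dataclasses import dataclass
from enum import IntEnum


class RiskTier(IntEnum):
    MINIMAL = 0   # Tier 0
    LIMITED = 1   # Tier 1
    HIGH = 2      # Tier 2 (EU AI Act alignment)
    SYSTEMIC = 3  # Tier 3 (frontier models)


@dataclass(frozen=True)
class TierRequirements:
    evidence_linking: str          # "recommended" or "required"
    adversarial_framing: bool      # escalate safety questions?
    third_party_attestation: str   # "optional", "should", or "must"
    executable_eval: bool          # executable eval required?
    red_team_disclosure: bool      # red-team methodology disclosure required?


# One row per tier, mirroring the bullet lists above.
REQUIREMENTS = {
    RiskTier.MINIMAL:  TierRequirements("recommended", False, "optional", False, False),
    RiskTier.LIMITED:  TierRequirements("required",    False, "optional", False, False),
    RiskTier.HIGH:     TierRequirements("required",    True,  "should",   True,  False),
    RiskTier.SYSTEMIC: TierRequirements("required",    True,  "must",     True,  True),
}
```

Because `RiskTier` is an `IntEnum`, requirements can also be checked with ordinary comparisons (e.g. `tier >= RiskTier.HIGH` implies adversarial framing).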
## Adversarial Framing Examples
Standard question:
"Were jailbreak evaluations conducted?"
Adversarial reframe:
"Assume an adversary with moderate resources attempts to elicit harmful outputs. What specific attack vectors were tested? For each: what was the success rate? What is the residual risk for untested vectors?"
Standard question:
"Is training data documented?"
Adversarial reframe:
"If training data contains content that would be problematic under [jurisdiction], how would you know? What filters were applied? What's the false negative rate of those filters?"
## Tier Assignment
How does a model get assigned to a tier?
Options:
- Self-declared - author assigns tier, schema flags inconsistencies
- Capability-based - automated checks (param count, benchmark scores, deployment context)
- Use-case-based - EU AI Act style classification by application domain
- Hybrid - default tier based on indicators, author can justify up/down
Recommendation: Hybrid with upward pressure - default to the higher tier unless the author provides a rationale for the lower one.
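The hybrid-with-upward-pressure rule can be sketched as follows; `assign_tier` and its parameters are hypothetical names, and the indicator signals (capability checks, use-case classification) are assumed to have already been reduced to candidate tier numbers:

```python
def assign_tier(indicated_tiers, declared_tier=None, rationale=None):
    """Hybrid assignment with upward pressure.

    indicated_tiers: tiers suggested by automated indicators
                     (param count, benchmarks, deployment context, use case).
    declared_tier:   the author's self-declared tier, if any.
    rationale:       written justification, required to go below the default.
    """
    # Upward pressure: default to the highest tier any indicator suggests.
    default = max(indicated_tiers)
    if declared_tier is None:
        return default
    if declared_tier >= default:
        # Authors may always opt into more scrutiny.
        return declared_tier
    # Going below the indicated default requires a justification,
    # which tooling can surface for review ("tier shopping" guard).
    if not rationale:
        raise ValueError(
            f"declared tier {declared_tier} is below indicated tier {default}; "
            "a rationale is required"
        )
    return declared_tier
```

Note this only records the rationale requirement; who reviews that rationale is one of the open questions below.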
## Open Questions
- Who decides tier assignment for edge cases?
- How to prevent "tier shopping" (claiming lower risk to avoid scrutiny)?
- Should tier affect schema validation strictness or just prompt framing?
## Deliverables
- `docs/risk-tiers.md` - tier definitions and assignment criteria
- `prompts/adversarial-questions.md` - escalated question variants by tier
- Schema support for tier declaration and validation
## Related Issues
- Design: Sharp Yes/No Questions Specification
- Schema: Croissant/schema.org Extension Design