Design: Risk-Tiered Adversarial Framing #7

@PipFoweraker

Goal

Define when and how interrogatory prompts escalate to adversarial framing based on risk factors.

Rationale

Not all models warrant the same scrutiny. A sentiment classifier for product reviews has a different risk profile than an autonomous agent with tool use. The protocol should scale rigor appropriately.

Risk Tiers

Tier 0: Minimal Risk

  • Narrow domain, no safety-critical application
  • Standard binary questions sufficient
  • Evidence linking recommended but not required

Tier 1: Limited Risk

  • Broader capability, some sensitive domains
  • Binary questions + evidence linking required
  • Subgroup analysis for relevant populations

Tier 2: High Risk (EU AI Act alignment)

  • Biometric, critical infrastructure, employment, education, law enforcement contexts
  • Adversarial framing for safety questions
  • Third-party attestation SHOULD be provided
  • An executable eval MUST be included

Tier 3: Systemic Risk (frontier models)

  • General-purpose, high capability
  • Full adversarial framing
  • Red-team methodology MUST be disclosed
  • Capability elicitation documentation
  • Third-party attestation MUST be provided (or an explanation of why not)
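The tier definitions above could be encoded as a requirement matrix that tooling consumes directly. A minimal sketch in Python; the field names (`evidence_linking`, `executable_eval`, etc.) are illustrative placeholders, not final schema vocabulary:

```python
from enum import IntEnum

class RiskTier(IntEnum):
    MINIMAL = 0   # narrow domain, no safety-critical application
    LIMITED = 1   # broader capability, some sensitive domains
    HIGH = 2      # EU AI Act high-risk contexts
    SYSTEMIC = 3  # frontier / general-purpose models

# Hypothetical requirement matrix mirroring the tier bullets above.
TIER_REQUIREMENTS = {
    RiskTier.MINIMAL: {
        "evidence_linking": "RECOMMENDED",
        "adversarial_framing": "none",
    },
    RiskTier.LIMITED: {
        "evidence_linking": "REQUIRED",
        "subgroup_analysis": "REQUIRED",
        "adversarial_framing": "none",
    },
    RiskTier.HIGH: {
        "evidence_linking": "REQUIRED",
        "adversarial_framing": "safety_questions",
        "third_party_attestation": "SHOULD",
        "executable_eval": "MUST",
    },
    RiskTier.SYSTEMIC: {
        "evidence_linking": "REQUIRED",
        "adversarial_framing": "full",
        "red_team_disclosure": "MUST",
        "third_party_attestation": "MUST_OR_EXPLAIN",
        "executable_eval": "MUST",
    },
}

def requirements_for(tier: RiskTier) -> dict:
    """Look up the documentation requirements for a given tier."""
    return TIER_REQUIREMENTS[tier]
```

Encoding the tiers as data (rather than prose only) would let the schema validator and the prompt generator share a single source of truth.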

Adversarial Framing Examples

Standard question:

"Were jailbreak evaluations conducted?"

Adversarial reframe:

"Assume an adversary with moderate resources attempts to elicit harmful outputs. What specific attack vectors were tested? For each: what was the success rate? What is the residual risk for untested vectors?"

Standard question:

"Is training data documented?"

Adversarial reframe:

"If training data contains content that would be problematic under [jurisdiction], how would you know? What filters were applied? What's the false negative rate of those filters?"
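The escalation from standard to adversarial framing could be driven by tier. A rough sketch, assuming a simple question registry keyed by identifier (the registry shape and `frame_question` helper are assumptions, not the proposed prompt format):

```python
# Hypothetical prompt registry: each question carries a standard and an
# adversarial variant; the tier decides which one is emitted.
QUESTIONS = {
    "jailbreak_evals": {
        "standard": "Were jailbreak evaluations conducted?",
        "adversarial": (
            "Assume an adversary with moderate resources attempts to elicit "
            "harmful outputs. What specific attack vectors were tested? For "
            "each: what was the success rate? What is the residual risk for "
            "untested vectors?"
        ),
    },
}

def frame_question(key: str, tier: int, safety_related: bool = True) -> str:
    """Select a question variant: Tier 2 escalates safety questions,
    Tier 3 escalates everything, lower tiers stay with the standard form."""
    q = QUESTIONS[key]
    if tier >= 3 or (tier == 2 and safety_related):
        return q["adversarial"]
    return q["standard"]
```

For example, `frame_question("jailbreak_evals", 1)` returns the standard binary question, while the same key at tier 3 returns the adversarial reframe.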

Tier Assignment

How does a model get assigned to a tier?

Options:

  1. Self-declared - author assigns tier, schema flags inconsistencies
  2. Capability-based - automated checks (param count, benchmark scores, deployment context)
  3. Use-case-based - EU AI Act style classification by application domain
  4. Hybrid - default tier based on indicators, author can justify up/down

Recommendation: Hybrid with upward pressure: default to the higher tier suggested by the indicators unless the author provides a rationale for a lower one.

Open Questions

  1. Who decides tier assignment for edge cases?
  2. How to prevent "tier shopping" (claiming lower risk to avoid scrutiny)?
  3. Should tier affect schema validation strictness or just prompt framing?

Deliverables

  1. docs/risk-tiers.md - tier definitions and assignment criteria
  2. prompts/adversarial-questions.md - escalated question variants by tier
  3. Schema support for tier declaration and validation
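For deliverable 3, a tier declaration might look like the sketch below, with validation enforcing that higher tiers carry a rationale. The field names (`riskTier`, `tierRationale`, `indicators`) are placeholders, not the final Croissant/schema.org vocabulary:

```python
# Hypothetical tier declaration block as it might appear in model metadata.
declaration = {
    "riskTier": 2,
    "tierRationale": "Employment-screening deployment context (EU AI Act high risk).",
    "indicators": {"deployment_context": "employment", "tool_use": False},
}

def validate_declaration(decl: dict) -> list[str]:
    """Return validation errors; an empty list means the declaration passes."""
    errors = []
    tier = decl.get("riskTier")
    if not isinstance(tier, int) or not 0 <= tier <= 3:
        errors.append("riskTier must be an integer in 0..3")
    if isinstance(tier, int) and tier >= 2 and not decl.get("tierRationale"):
        errors.append("tiers 2-3 require a tierRationale")
    return errors
```

Whether validation strictness itself should scale with tier (open question 3) would determine if this check stays a flat rule or reads from the tier requirement matrix.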

Related Issues

  • Design: Sharp Yes/No Questions Specification
  • Schema: Croissant/schema.org Extension Design
