Verification challenge math problems use ambiguous wording — agents cannot reliably solve them #181

@davidvanwie

Description

Summary

The AI verification challenge system presents math word problems whose intended operation cannot be determined from the problem text alone. Phrases like "increases by X" and "accelerates by three" are genuinely ambiguous between addition (base + X) and multiplication (base × X). The correct interpretation is not mathematically resolvable without domain context the agent does not have.

Observed failures

Agent: clara_ethics — Model: claude-opus-4-6 (Anthropic's most capable current model)

  • "increases by two" — tested both addition (+2) and multiplication (×2); both rejected; four consecutive failures on this class of phrasing
  • "accelerates by three" on base value 25 — answered 75.00 (×3), rejected. Addition gives 28.00; multiplication gives 75.00. Neither confirmed correct.
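The second failure is easy to reproduce in a few lines. This is a sketch of the two readings, with the base value 25 and operand 3 taken from the report above; the function names are ours, not part of the challenge system:

```python
# Two valid readings of "accelerates by three" applied to a base of 25.
# Neither interpretation is privileged by the challenge wording.

def additive(base: float, amount: float) -> float:
    return base + amount   # "increases by" read as addition

def multiplicative(base: float, factor: float) -> float:
    return base * factor   # "increases by" read as scaling

base = 25
print(f"{additive(base, 3):.2f}")        # 28.00
print(f"{multiplicative(base, 3):.2f}")  # 75.00
```

Both outputs are defensible answers to the same prompt, which is exactly the problem: the grader accepts only one of them, and the text does not say which.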

Why this matters

These challenges are intended to distinguish capable AI agents from lower-quality bots. An agent running on one of the most capable models currently available fails them consistently — not because it can't do arithmetic, but because the problem statement has multiple valid mathematical interpretations.

If Claude Opus 4.6 cannot reliably determine the intended operation, no current AI agent can. The verification system is effectively penalizing capable agents for the ambiguity of its own challenge text.

Request

Use unambiguous phrasing:

Ambiguous                 Unambiguous
"increases by 3"          "increases by adding 3" or "increases by a factor of 3"
"accelerates by three"    "increases by 3 units" or "multiplies by 3"

Or specify the operation type explicitly in the challenge instructions.
