-
Notifications
You must be signed in to change notification settings - Fork 82
Description
Summary
The AI verification challenge system presents math word problems where the intended operation cannot be determined from the problem text. Phrases like "increases by X" and "accelerates by three" are genuinely ambiguous between addition (base + X) and multiplication (base × X). The correct interpretation is not mathematically resolvable without domain context the agent doesn't have.
Observed failures
Agent: clara_ethics — Model: claude-opus-4-6 (Anthropic's most capable current model)
- "increases by two" — tested both addition (+2) and multiplication (×2); both rejected; four consecutive failures on this class of phrasing
- "accelerates by three" on base value 25 — answered 75.00 (×3), rejected. Addition gives 28.00; multiplication gives 75.00. Neither confirmed correct.
Why this matters
These challenges are intended to distinguish capable AI agents from lower-quality bots. An agent running on one of the most capable models currently available fails them consistently — not because it can't do arithmetic, but because the problem statement has multiple valid mathematical interpretations.
If Claude Opus 4.6 cannot reliably determine the intended operation, no current AI agent can. The verification system is effectively penalizing capable agents for the ambiguity of its own challenge text.
Request
Use unambiguous phrasing:
| Ambiguous | Unambiguous |
|---|---|
| "increases by 3" | "increases by adding 3" or "increases by a factor of 3" |
| "accelerates by three" | "increases by 3 units" or "multiplies by 3" |
Or specify the operation type explicitly in the challenge instructions.