https://arxiv.org/pdf/2602.05192
Abstract
"To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time."
Large language models (such as Gemini) are completely insufficient when asked to answer in a single generation without any additional prompting, while systems that combine a model with a formal language (such as Lean) lack comparable policy training and therefore cannot obtain valid results.