OpenAI drew sharp criticism from AI researchers and mathematicians after claims that its model solved well-known Erdős problems fell apart. According to FindArticles, a senior OpenAI executive celebrated GPT-5 for making progress on multiple open Erdős problems and finding solutions to others. The claims unraveled when mathematician Thomas Bloom, curator of the Erdős Problems site, explained that "open" on his page meant only that he was unaware of a solution, not that none existed. The model had retrieved existing proofs from the literature rather than deriving new mathematics.
Retrieval Is Not Discovery
OpenAI researcher Sébastien Bubeck later acknowledged that the model had found solutions in the existing literature, adding the caveat that this is still nontrivial because mathematical research is sprawling and fragmented. Competitors were less forgiving. Senior figures at Meta and Google DeepMind publicly called the episode a self-inflicted wound: if literature search is confused with the discovery of new knowledge, they argued, credibility will suffer.
Finding that a proof already exists is helpful. Producing one that did not exist is transformative. The distinction matters because language models can be superb at retrieval, summarization, and pattern completion without being any good at rigorous deductive reasoning. They can also hallucinate steps that sound plausible but fail under a formal proof checker.
The Standard for Real Breakthroughs
In mathematics, a breakthrough requires a new argument that survives expert scrutiny or mechanical validation. That bar is high by design. Communities around proof assistants such as Lean, Isabelle, and Coq have demonstrated how computer-checked proofs can raise standards. The Lean-driven formalization of parts of Peter Scholze's work is a famous example of humans and machines collaborating to raise rigor.
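To make "mechanical validation" concrete, here is a minimal illustrative sketch (not taken from any of the work discussed above) of what a computer-checked proof looks like in Lean 4. The kernel either accepts the argument or rejects it; a plausible-sounding but invalid step simply fails to compile:

```lean
-- A machine-checked statement: the sum of two even numbers is even.
-- This is a textbook exercise, shown only to illustrate the standard
-- of rigor a proof assistant enforces.
theorem even_add_even (m n : Nat)
    (hm : ∃ k, m = 2 * k) (hn : ∃ k, n = 2 * k) :
    ∃ k, m + n = 2 * k := by
  obtain ⟨a, ha⟩ := hm   -- unpack the witness for m
  obtain ⟨b, hb⟩ := hn   -- unpack the witness for n
  -- m + n = 2*a + 2*b = 2*(a + b), verified by the kernel
  exact ⟨a + b, by rw [ha, hb, Nat.mul_add]⟩
```

The point is not the theorem, which is trivial, but the workflow: every step is checked against the formal rules, so "the proof exists in the literature" and "the proof type-checks" are independently verifiable claims.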
Competition Drives Overselling
The episode comes as OpenAI, Google DeepMind, and Meta race to be seen as the technical leader in reasoning. That competition can hasten real progress, but it also deepens the incentive to oversell. On popular math benchmarks, recent models regularly exceed 90% accuracy with chain-of-thought prompting and careful sampling. That is impressive, but it is not the same as producing a new theorem or an extended original proof that specialists would recognize.
The quickest way to reset expectations is simple: let proofs, code, and third-party verification do the talking. Until that happens, claiming victory on open problems is less an act of innovation than an own goal.