Friday, August 8, 2025

AI Can Do a Lot—but Still Seems Totally Stumped by Sudoku

Artificial intelligence chatbots can whip up the code for a website in just a few seconds and summarize the important parts of a 90-minute meeting in moments. But how trustworthy is the technology? High-profile examples of AI hallucinating or gaslighting users have made some people understandably wary. A group of researchers at the University of Colorado Boulder has come up with an interesting way to test the technology's trustworthiness: by playing Sudoku.

The researchers gave AI models 2,300 six-by-six Sudokus (simpler than the nine-by-nine grids most humans play). They then set five different models loose, asking each to solve the puzzles and then to explain its answers. The AI struggled a bit with the puzzles themselves. OpenAI's o1 model, for instance, solved only 65 percent of the puzzles correctly; that model was state of the art two years ago (the company introduced o4-mini in April). The other AI systems did even worse.

Nobody's perfect, not even a machine, but things got really interesting when the researchers asked the AI platforms to explain how they chose their answers. “Sometimes, the AI explanations made up facts,” said Ashutosh Trivedi, a co-author of the study and an associate professor of computer science at CU Boulder, in a statement. “So it might say, ‘There cannot be a two here because there’s already a two in the same row,’ but that wasn’t the case.”

One of the AIs, when asked about Sudoku, responded with an unprompted weather forecast. “At that point, the AI had gone berserk and was completely confused,” said study co-author Fabio Somenzi, a professor in the Department of Electrical, Computer, and Energy Engineering.
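The rule the models misstated is mechanically checkable. As a minimal sketch (not code from the study; the grid representation is an assumption, with 0 marking an empty cell), here is what verifying a claimed justification in a six-by-six Sudoku looks like, where each row, column, and 2x3 box must contain the digits 1 through 6 exactly once:

```python
def is_valid_move(grid, row, col, digit):
    """Return True if `digit` can be placed at (row, col) of a
    six-by-six Sudoku without clashing with its row, column,
    or 2x3 box. `grid` is a list of six lists of six ints,
    with 0 marking an empty cell (an assumed representation)."""
    if digit in grid[row]:                          # row constraint
        return False
    if any(grid[r][col] == digit for r in range(6)):  # column constraint
        return False
    box_r, box_c = (row // 2) * 2, (col // 3) * 3   # top-left of the 2x3 box
    for r in range(box_r, box_r + 2):
        for c in range(box_c, box_c + 3):
            if grid[r][c] == digit:                 # box constraint
                return False
    return True
```

A claim like “there cannot be a two here because there’s already a two in the same row” reduces to a single membership test (`digit in grid[row]`), which is either true or false; the study's point is that the models sometimes asserted it when it was false.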
The hallucinations and glitches, the authors note, “underscore significant challenges that must be addressed before LLMs can become effective partners in human-AI collaborative decision-making.” OpenAI's o1 model was especially bad at explaining its actions, despite vastly outpacing the other models on the puzzles themselves. (The others, the study says, were “not currently capable” of solving six-by-six Sudoku puzzles.) The researchers found that its answers failed to justify moves, misused basic terminology, and poorly articulated the path it had taken to solve each puzzle.

On a broader scale, the public’s trust in AI has a long way to go. A study by KPMG found that just 41 percent of people are willing to trust AI, even when they are eager to see its benefits. The World Economic Forum, meanwhile, says trust will shape outcomes in the AI-powered economy, and McKinsey reported in March of this year that 78 percent of organizations use AI in at least one business function.

The Sudoku study was less about whether artificial intelligence could solve the puzzles than about the logic behind its answers; the goal was to gain insight into how AI systems think. A better understanding of how AI reasons could ultimately improve people's trust and help ensure that the results it spits out, whether computer code or something to do with your finances, are more reliable. “Puzzles are fun, but they’re also a microcosm for studying the decision-making process in machine learning,” said Somenzi. “If you have AI prepare your taxes, you want to be able to explain to the IRS why the AI wrote what it wrote.”

By Chris Morris @morrisatlarge
