
Apple study reveals major AI flaw in OpenAI, Google and Meta LLMs

Large language models (LLMs) may not be as smart as they seem, according to a study by Apple researchers.

LLMs from OpenAI, Google, Meta and others have been touted for their impressive reasoning skills. But research suggests that their supposed intelligence may be closer to “sophisticated pattern matching” than “true logical reasoning.” Yes, even OpenAI’s advanced reasoning model o1.

The most common benchmark for reasoning skills is a test called GSM8K, but because it is so popular, there is a risk of data contamination. This means that LLMs may know the answers to the test because they were trained on those answers, not because of any inherent intelligence.

To test this, the researchers developed a new benchmark called GSM-Symbolic that keeps the essence of the reasoning problems but changes variables such as names and numbers, adjusts the complexity, and adds irrelevant information. What they discovered was a surprising "fragility" in LLM performance. The study tested more than 20 models, including OpenAI's o1 and GPT-4o, Google's Gemma 2, and Meta's Llama 3, and performance decreased across every model when the variables were changed.
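
The paper's actual generation code isn't shown here, but the idea is essentially templating: hold the reasoning structure fixed while swapping out surface details. A minimal sketch, with hypothetical names and number ranges, of what such a variant generator could look like:

```python
import random

# Sketch only: a GSM-Symbolic-style template where names and numbers vary
# but the underlying reasoning (and therefore the answer formula) stays fixed.
TEMPLATE = ("{name} picks {friday} kiwis on Friday, {saturday} on Saturday, "
            "and twice as many as Friday on Sunday. How many kiwis does {name} have?")

def make_variant(seed: int) -> tuple[str, int]:
    rng = random.Random(seed)
    name = rng.choice(["Oliver", "Sofia", "Liam", "Mei"])   # hypothetical name pool
    friday = rng.randint(20, 60)
    saturday = rng.randint(20, 60)
    question = TEMPLATE.format(name=name, friday=friday, saturday=saturday)
    answer = friday + saturday + 2 * friday  # same reasoning in every variant
    return question, answer

if __name__ == "__main__":
    for seed in range(3):
        question, answer = make_variant(seed)
        print(question, "->", answer)
```

A model that genuinely reasons should score the same on every variant; a model that memorized the original GSM8K phrasing will not.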

Accuracy decreased by a few percentage points when only names and numbers were changed. And as the researchers noted, OpenAI's models performed better than the open-source models. Still, the variance was deemed "non-negligible", since changing surface details shouldn't have caused any variation at all. Things got really interesting, though, when the researchers added "apparently relevant but ultimately unimportant statements" into the mix.

To test the hypothesis that LLMs relied more on pattern matching than actual reasoning, the study added superfluous sentences to math problems to see how the models would react. For example: "Oliver picks 44 kiwis on Friday. Then he picks 58 on Saturday. On Sunday, he picks twice as many kiwis as he did on Friday, but five of them were slightly smaller than average. How many kiwis does Oliver have?"

What resulted was a significant drop in performance across the board. OpenAI's o1 Preview fared the best, with a 17.5 percent drop in accuracy. That's still pretty bad, but not as bad as Microsoft's Phi 3 model, which performed 65 percent worse.

In the kiwi example, the study found that LLMs tended to subtract the five smaller kiwis from the equation, without understanding that the size of the kiwis was irrelevant to the problem. This indicates that "models tend to turn statements into operations without really understanding their meaning," which supports the researchers' hypothesis that LLMs look for patterns in reasoning problems rather than innately understanding the concept.
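
Worked out, the gap is plain arithmetic. A short sketch contrasting the correct answer with the distractor-subtracting answer the study describes:

```python
# Kiwi problem from the article: the "five were smaller" clause is irrelevant to the count.
friday, saturday = 44, 58
sunday = 2 * friday            # "twice as many kiwis as he did on Friday"
smaller = 5                    # irrelevant detail: size doesn't change how many he has

correct = friday + saturday + sunday    # 44 + 58 + 88 = 190
pattern_matched = correct - smaller     # 185: subtracting the distractor, the error the study observed

print(correct, pattern_matched)         # 190 185
```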

The study didn't mince words about its findings. Benchmark testing that includes irrelevant information "exposes a critical flaw in LLMs' ability to truly understand mathematical concepts and discern information relevant to problem solving." However, it's worth mentioning that the authors of this study work for Apple, which is obviously a major competitor of Google, Meta, and even OpenAI. Although Apple and OpenAI have a partnership, Apple is also working on its own AI models.

That said, the apparent lack of formal reasoning skills of LLMs cannot be ignored. Ultimately, it’s a good reminder to temper AI hype with healthy skepticism.