Language models are great with words, but budgets need numbers. Our benchmark revealed a gap that general-purpose AI still can't close.

In our latest research, one of the world’s most widely used large language models reported that a user spent over $28,000 on bills last month. The actual total? Just over $3,000.
That’s an error of more than $25,000, delivered with total confidence. As people increasingly use LLMs like ChatGPT and Claude for everyday tasks, including financial advice, mistakes like this can have serious real-world consequences. We set out to measure how often (and how badly) standalone language models get money questions wrong, and how a purpose-built agent architecture can close the gap.
If your AI can’t add, you probably shouldn’t ask it to manage your budget.
LLMs like OpenAI’s GPT-4o and Google’s Gemini now field questions on everything from poetry to portfolio management. According to research we conducted in early 2025, one in four Americans already uses a general-purpose LLM for financial advice. Increasingly, apps such as ChatGPT and Claude let users upload files like PDFs or spreadsheets and ask questions about the contents. But how well can these models actually extract and interpret that data?
General-purpose LLMs excel at what they’re built for: generating plausible-sounding text from patterns in their training data. When precision is non-negotiable, however, an LLM on its own falls short: it relies on learned associations rather than grounded computation and has no true mathematical reasoning. Tasks like budgeting, cash flow analysis, and transaction classification demand exact numbers and domain-specific context, which is why they call for agentic systems that can do more than generate fluent text.
To quantify the gap, we benchmarked Cleo, our specialized AI financial assistant, against three leading general-purpose LLMs: GPT-4o (OpenAI), Gemini 2.5 Flash (Google), and Claude Sonnet 4 (Anthropic).
Each model answered representative budgeting questions over a dataset of 129 real transactions totaling $29,502 in inflows and $29,958 in outflows for the month. Our test queries reflected the types of questions real users ask when managing their money: they go beyond simple lookups, requiring multi-step aggregation, time-based grouping, and merchant-level analysis.
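To make that concrete, here is a minimal sketch of the kind of deterministic arithmetic these questions reduce to. The Transaction schema, field names, and category labels are illustrative assumptions for this post, not Cleo’s actual data model.

```python
# Illustrative only: a toy transaction schema and the exact-arithmetic helpers
# that benchmark-style questions ("total bill spend", "top merchants",
# "income vs. expenses") reduce to.
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Transaction:
    posted: date      # settlement date, for time-based grouping
    merchant: str     # normalized merchant name
    category: str     # e.g. "bills", "groceries", "income"
    amount: float     # positive = inflow, negative = outflow


def total_spend(txns: list[Transaction], category: str) -> float:
    """Sum the outflows in one category; plain arithmetic, no estimation."""
    return -sum(t.amount for t in txns if t.category == category and t.amount < 0)


def spend_by_merchant(txns: list[Transaction]) -> dict[str, float]:
    """Merchant-level totals for questions like 'where did I spend the most?'"""
    totals: defaultdict[str, float] = defaultdict(float)
    for t in txns:
        if t.amount < 0:
            totals[t.merchant] += -t.amount
    return dict(totals)


def income_vs_expenses(txns: list[Transaction]) -> tuple[float, float]:
    """Total inflows and outflows for the period, for cash-flow comparisons."""
    inflows = sum(t.amount for t in txns if t.amount > 0)
    outflows = -sum(t.amount for t in txns if t.amount < 0)
    return inflows, outflows
```

None of this is hard computation; the question the benchmark asks is whether a system routes queries to arithmetic like this or guesses the totals.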

We asked each model the same six benchmark questions. They reflect real user needs, such as identifying spending patterns, comparing income to expenses, and understanding financial behavior across merchants and over time. We scored each model on accuracy, assessing whether it could be trusted to answer realistic financial questions correctly.
The difference was clear. Across all six benchmark questions and 21 total data points, Cleo’s agentic system answered 17 correctly (81%), compared with 13 (62%) for GPT-4o, 7 (33%) for Claude Sonnet 4, and 6 (29%) for Gemini 2.5 Flash.

These gaps weren’t limited to edge cases or ambiguous prompts. General-purpose LLMs returned wildly inaccurate answers even on fundamental questions, like “What was my total bill spend last month?” In one case, Gemini overestimated spending on bills by nearly $26,000, reporting $28,813 instead of the correct $3,047.

Only Cleo's agent-based system returned accurate results for both bill spend and percentage of income.
The gap comes down to architecture. When used alone, general-purpose LLMs try to answer financial queries using only internal representations, or statistical associations learned during training. In contrast, Cleo operates as an agentic system that routes all calculations through deterministic tools built for financial data. The LLM’s role is strictly interpretive—parsing the query, formatting the output—not doing the math. This separation of responsibilities eliminates a major failure mode: the mathematical hallucinations that plague standalone, general-purpose LLMs.
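As a rough sketch of that separation (the intent schema, tool registry, and stubbed parser below are hypothetical illustrations, not Cleo’s implementation):

```python
# A minimal sketch of the interpret-vs-compute split described above.
# Everything here (intent schema, tool registry, stubbed parser) is
# hypothetical and only illustrates the pattern.
from dataclasses import dataclass, field


@dataclass
class Intent:
    tool: str                                   # which deterministic tool to run
    params: dict = field(default_factory=dict)  # validated arguments


def parse_intent(user_query: str) -> Intent:
    # In a real agentic system the LLM produces this structured intent,
    # e.g. via constrained JSON output. Stubbed here for illustration.
    return Intent(tool="total_spend", params={"category": "bills"})


# Deterministic tools own the math. Transactions are plain dicts with
# "category" and "amount" keys; negative amounts are outflows.
TOOLS = {
    "total_spend": lambda txns, category: -sum(
        t["amount"] for t in txns if t["category"] == category and t["amount"] < 0
    ),
}


def answer(user_query: str, txns: list[dict]) -> str:
    intent = parse_intent(user_query)                   # LLM: interpretation only
    result = TOOLS[intent.tool](txns, **intent.params)  # tool: exact arithmetic
    # The model may phrase the reply, but the number always comes from the tool.
    return f"You spent ${result:,.2f} on {intent.params['category']} last month."
```

The property this buys is that a wrong answer can only come from misreading the question, never from the arithmetic itself, and misinterpretation is a far more tractable failure mode.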
While Cleo outperformed the general-purpose models across the board, some challenges remain, especially in nuanced classification tasks. One difficult case for every system we tested was categorizing one-off transfers, such as reimbursements between friends. Accurately labeling such transactions requires context-sensitive judgment, and it remains an open challenge for Cleo and general-purpose LLMs alike. Our architecture is well suited to minimizing mathematical hallucinations, but improving transaction classification is an active area of development as we scale Cleo’s capabilities.
The difference isn’t just in the answers, but in how the system gets there.
Financial coaching inherently demands precision. Digital financial products promise accuracy and actionable guidance, and AI tools need to meet the same bar. Because general-purpose LLMs are mainly optimized for fluent language, they introduce error classes that would be unacceptable in any production-grade financial system, including hallucinated figures, faulty aggregations, and misleading correlations.
Cleo is built differently. Our agentic architecture delegates calculations and categorization to well-tested, domain-specific tools, while the LLM layer is used to interpret user intent and generate responses. By separating interpretation from computation, Cleo produces answers that are grounded in real data and controlled logic, not probabilistic guesswork.
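To illustrate what one such domain-specific tool might look like (the merchant table and category names below are invented for this example), a categorization tool can be deliberately conservative:

```python
# Illustrative sketch of a deterministic categorization tool. The merchant
# lookup table and category names are invented; the point is that unknown or
# ambiguous items are surfaced for review, never silently guessed.
KNOWN_MERCHANTS = {
    "acme energy": "bills",
    "city water co": "bills",
    "grocer & sons": "groceries",
}


def categorize(merchant: str) -> str:
    """Map a normalized merchant name to a category, or flag it for review."""
    category = KNOWN_MERCHANTS.get(merchant.strip().lower())
    if category is not None:
        return category
    # One-off transfers and unknown merchants (the hard cases noted earlier)
    # are routed to context-sensitive handling rather than misclassified.
    return "needs_review"
```

Keeping the ambiguous cases visible like this means classification errors can be caught and corrected instead of silently distorting downstream aggregates.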
The next generation of financial software will require reliability by design. General-purpose models weren’t built for that; Cleo was. With an agentic, tool-augmented architecture, we can ensure accuracy while providing the contextual, personalized coaching people need.