Cleo 3.0 extends the capabilities of off-the-shelf LLMs by adding memory, tools, and real-time agentic reasoning.

For too many people, financial stress is the default. Nearly seven in ten Americans say they’ve experienced depression and anxiety driven by financial uncertainty.
That stress shapes both long-term and everyday decisions. Yet most financial tools still treat personal finance as a series of transactions, not a lived emotional experience. They wait for prompts, offer generic advice, and overlook signs of distress or progress.
Today, baseline conversational intelligence is available to anyone through off-the-shelf large language models—what we’ll refer to here as commodified LLMs. The differentiator is no longer raw text generation, but the systems and experiences built around it.
Our most recent release, Cleo 3.0, treats commodified LLMs as a substrate, layering on agentic reasoning, long-term memory, and multimodal interaction to create a proactive, emotionally intelligent financial ally. This post explains what makes Cleo different, the architecture behind our system, the principles we use to safeguard trust, and the results we’re seeing in the hands of our users.
Cleo 3.0 remembers and adapts in real time.
Most consumer finance products, even those that incorporate AI, still behave like help centers or search engines. They’re reactive systems built around templated advice that forget interactions as soon as they’re over.
Cleo 3.0 takes a different approach: an agentic AI system that remembers and adjusts in real time. A background agent monitors transaction history and prior conversations to detect changes, like shifts in spending or missed goals, and surfaces relevant insights or follow-ups automatically.
This proactive behavior is paired with individual adaptation. Cleo’s memory is personal, remembering why users set certain goals, what they’ve said about their financial aspirations, and what they find most stressful about money. The agent uses this to update Cleo’s internal context over time, enabling her to adjust tone, timing, and message content based on how users react.
Importantly, input isn’t limited to financial transactions. Cleo parses typed messages, voice input, and UI interactions, integrating that data to maintain continuity and decide what to say or do next. These behaviors depend on LLM agents with access to tools and memory. Cleo 3.0 can reason over user state, invoke relevant actions, and generate outputs that reflect both financial and emotional context.
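To make the background monitoring concrete, here is a minimal sketch of the kind of check such an agent might run. All names and thresholds are illustrative assumptions, not Cleo's actual implementation:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Insight:
    kind: str
    message: str

def detect_spending_shift(history, recent, category, threshold=0.3):
    """Flag a category whose recent average spend deviates from its baseline.

    `history` and `recent` are lists of per-period totals for one category.
    The 30% threshold is a placeholder, not Cleo's real tuning.
    """
    baseline = mean(history)
    if baseline == 0:
        return None
    change = (mean(recent) - baseline) / baseline
    if abs(change) >= threshold:
        direction = "up" if change > 0 else "down"
        return Insight(
            kind="spending_shift",
            message=f"{category} spending is {direction} {abs(change):.0%} vs. your usual",
        )
    return None
```

A real system would run checks like this on a schedule and hand any resulting insights to the agent, which decides whether and how to surface them.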
Cleo 3.0 started from a familiar place: a rule-based chatbot with machine learning elements. Early versions used dialogue flows, intent classifiers, and recommender systems to deliver financial guidance and nudges. This infrastructure supported a mix of reactive and proactive interactions, but it lacked memory, long-range context, and the flexibility needed for adaptive, emotionally responsive behavior.
The transition to Cleo 3.0 began with small changes. We introduced emotion-sensitive check-ins and voice input, tested retrieval-based systems for long-term recall, and fine-tuned Cleo’s tone on real conversations. The current architecture centers on an LLM agent that plans multi-step interactions, invokes tools, and reasons over both financial and emotional context.
Memory components track goals, stress signals, and tool usage across sessions, while a shared interface natively supports voice and text. This compositional shift means that Cleo’s handcrafted voice, rich dataset of emotional interactions, and behavioral insights now serve as inputs to a system that can reason over them.
Cleo 3.0’s agentic architecture is grounded in four capabilities: agentic reasoning, memory, tools, and multimodal interaction.

Cleo’s LLM agent plans actions, invokes tools, and adapts behavior based on evolving user context. Rather than responding in isolation, it decomposes goals using recursive tool calls to handle tasks like budgeting setup or cash flow triage end-to-end. The system also dynamically scales computational effort, allowing for lightweight responses to simpler prompts and deeper planning when needed.
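The plan-act-observe pattern described above can be sketched as a simple loop. The `llm` and `tools` arguments are stand-ins for Cleo's internals, which are not public; the step format is an assumption for illustration:

```python
def run_agent(goal, llm, tools, max_steps=8):
    """Minimal agent loop: the LLM proposes either a tool call or a final
    answer; tool results are fed back into the context so the next step
    can build on them. `max_steps` bounds how much compute a task gets."""
    context = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        step = llm(context)  # {"tool": name, "args": {...}} or {"answer": str}
        if "answer" in step:
            return step["answer"]
        result = tools[step["tool"]](**step["args"])
        context.append({"role": "tool", "name": step["tool"], "content": result})
    return "I need more time to finish this - let's pick it up later."
```

Simple prompts resolve in one pass (the model answers immediately), while multi-step tasks like budgeting setup consume more loop iterations, which is one way the "scaled computational effort" behavior falls out naturally.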
Cleo maintains structured summaries of what matters to each user, such as financial goals, expressed preferences, and sources of stress. These memories are retrieved selectively based on context to support continuity across sessions. Cleo 3.0 can reference past goals, follow up on incomplete tasks, and avoid repeating questions the user has already answered.
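Selective retrieval might look like the following sketch. Production systems typically score relevance with embedding similarity; keyword overlap is used here only to keep the example self-contained:

```python
def retrieve_memories(memories, query, k=2):
    """Return the k stored summaries most relevant to the current turn,
    so only pertinent memories enter the prompt rather than the full
    history. Scoring by word overlap is a stand-in for embeddings."""
    q = set(query.lower().split())
    scored = sorted(
        memories,
        key=lambda m: len(q & set(m.lower().split())),
        reverse=True,
    )
    return scored[:k]
```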
Cleo 3.0 uses a library of purpose-built tools to retrieve data and perform actions like budget lookups, goal setting, and subscription changes. Each tool is defined by a structured schema and invoked directly by the LLM, which plans and executes tool calls and integrates the results into the conversation. This enables multi-step workflows, ensures precision in tasks involving calculation or user state, and reduces reliance on fragile natural-language completions while avoiding prompt bloat from irrelevant context.
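A structured tool schema in this style is commonly expressed as JSON Schema. The example below is a hypothetical budget tool with a validating dispatcher; the names and fields are assumptions, not Cleo's actual tool library:

```python
SET_BUDGET_TOOL = {
    "name": "set_budget",
    "description": "Create or update a monthly budget for one category.",
    "parameters": {
        "type": "object",
        "properties": {
            "category": {"type": "string"},
            "monthly_limit": {"type": "number", "minimum": 0},
        },
        "required": ["category", "monthly_limit"],
    },
}

def dispatch(call, registry, impls):
    """Check a model-emitted tool call against its schema's required
    fields, then run the matching implementation. Rejecting malformed
    calls here is what keeps tool execution precise."""
    schema = next(t for t in registry if t["name"] == call["name"])
    missing = [p for p in schema["parameters"]["required"] if p not in call["args"]]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return impls[call["name"]](**call["args"])
```

Because only the tools relevant to the current task need to be exposed to the model, this structure also supports the prompt-bloat point: irrelevant schemas simply stay out of the context.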
Cleo supports real-time, two-way interaction via voice and text. Voice input is transcribed and processed by the same reasoning engine, while output is synthesized and streamed back to the user, allowing users to switch between modes without losing state. Beyond accessibility and emotional expressiveness, this also enables new product features like voice notes and spoken trivia games, all connected to the same underlying agent.
We wanted Cleo 3.0 to do more than answer questions. The upgraded system needed to reason across sessions, initiate conversations, and respond thoughtfully to sensitive topics like debt and financial setbacks.
To do that well, accuracy isn’t enough. Outputs must also be well timed, emotionally appropriate, and grounded in context in order to earn user trust. In building Cleo’s agentic system, we encountered three recurring challenges and used corresponding principles to guide product design and evaluation.
LLMs can produce very different outputs depending on prompt phrasing, context length, and conversation history. That variability increases when agents initiate actions or follow up across sessions.
For Cleo, that means more surface area for failure, with small changes in wording or context leading to inconsistent outputs, especially in multi-step conversations. To manage this, we took an evaluation-first approach to development, testing Cleo 3.0 against a wide-ranging suite of test cases based on real-world user queries.
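An evaluation-first workflow can be sketched as a harness that replays realistic queries against the agent and checks each reply with a predicate, so tone and safety checks can live alongside exact-answer checks. This is an illustrative shape, not Cleo's actual eval framework:

```python
def run_eval(agent, cases):
    """Score an agent against regression cases derived from real user
    queries. Each case pairs a prompt with a check over the reply;
    failures are collected by name for triage."""
    failures = []
    for case in cases:
        reply = agent(case["prompt"])
        if not case["check"](reply):
            failures.append(case["name"])
    return {"passed": len(cases) - len(failures), "failed": failures}
```

Running a suite like this on every prompt or model change is what turns LLM variability from a silent risk into a measurable regression signal.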
For AI assistants, timing and tone matter as much as content. Advice that might technically be accurate can still land poorly if the user is feeling overwhelmed, avoidant, or frustrated. And traditional chatbots often default to neutrality, which can feel cold or disconnected.
Cleo 3.0 is designed to be opinionated but adaptive. Proactive messages draw on memory and sentiment history to determine what to say and when to say it. We apply confidence thresholds and content filters to manage emotional risk, especially around debt, overspending, and missed goals.
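A confidence gate for proactive messages might look like the sketch below. The topic list and thresholds are illustrative assumptions:

```python
# Hypothetical topic set; sensitive topics demand higher confidence
# before the agent reaches out unprompted.
SENSITIVE_TOPICS = {"debt", "overspending", "missed_goal"}

def should_send(topic, confidence, sensitive_threshold=0.8, default_threshold=0.5):
    """Gate a proactive message on model confidence, with a stricter
    bar for emotionally risky topics."""
    required = sensitive_threshold if topic in SENSITIVE_TOPICS else default_threshold
    return confidence >= required
```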
Cleo users interact via voice, text, and in-app UI, sometimes within the same session. In earlier versions, these modes were handled separately, which fragmented the experience: Cleo's advice could feel repetitive or contradict earlier context.
Cleo 3.0 was built to unify input modes. Whether users type, speak, or tap, the agent remembers the inputs, calls tools, and updates its state in the same way. This ensures consistency across channels and simplifies evaluation, since prompts and behaviors can be tested using the same underlying framework.
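One way to achieve that unification is to normalize every input into a single event type before it reaches the agent, so memory and tool calls never branch on modality. The event shape below is a hypothetical sketch:

```python
from dataclasses import dataclass

@dataclass
class AgentEvent:
    modality: str  # "text" | "voice" | "ui"
    text: str      # normalized textual form fed to the agent

def normalize(raw):
    """Collapse typed messages, voice transcripts, and UI taps into one
    event type; downstream reasoning sees a uniform stream."""
    if raw["type"] == "voice":
        return AgentEvent("voice", raw["transcript"])
    if raw["type"] == "ui":
        return AgentEvent("ui", f"[tapped: {raw['element']}]")
    return AgentEvent("text", raw["text"])
```

Because every channel funnels through the same representation, one evaluation suite can exercise all three input modes.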
Agentic reasoning and emotional context drive meaningful improvements.
We evaluate Cleo’s agentic AI system not just on language quality, but on behavioral outcomes: whether users engage more often, complete more financial goals, and respond positively to proactive interventions. Early signals suggest that agentic reasoning and emotional context drive meaningful improvements.
When Cleo’s weekly review asked users how they were feeling, not just what they were spending, it led to longer, more in-depth conversations. This emotional check-in format is also especially well suited to Cleo 3.0; its open-ended design plays to the strengths of LLM-driven conversation and serves as a natural entry point for memory, helping the agent track not just goals but user sentiment over time.
With Cleo 3.0, users don’t need to manually initiate every task. The agent can monitor goal progress in the background and follow up without explicit prompts. By surfacing incomplete tasks and prompting next actions, the system supports higher goal completion, especially for budgeting and savings milestones.
We benchmarked Cleo 3.0 against leading general-purpose LLMs using six realistic budgeting questions, such as comparing income to expenses and calculating spending on groceries. Cleo achieved 81% accuracy, outperforming competitors by a wide margin. The advantage comes from our system design: Cleo offloads calculations to structured, deterministic tools, avoiding the logical and mathematical hallucinations common with general-purpose LLMs.
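The deterministic-tool pattern behind that result is simple: instead of asking the model to do arithmetic in free text, it delegates to plain code that cannot hallucinate a sum. A minimal sketch, with an assumed transaction format:

```python
def spend_in_category(transactions, category):
    """Deterministic aggregation the LLM delegates to instead of doing
    arithmetic itself. `transactions` is a list of (category, amount)
    pairs; the format is illustrative."""
    return round(sum(amount for cat, amount in transactions if cat == category), 2)
```

The model's job reduces to choosing the right tool and arguments; the number it reports back is computed, not generated.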