
Memory as a step toward more human AI

Inside the architecture that helps Cleo remember what matters

July 22, 2025

Effective AI financial coaching depends on continuity. Each interaction should build on the last, with the AI remembering details like spending habits, savings goals, and financial anxieties.

Importantly, users shouldn't have to repeat information the AI should already know. Like a great human coach, a great AI assistant should draw on past sessions to deliver precise, individualized guidance.

For Cleo 3.0, we wanted to give our AI financial assistant that type of long-term memory. This post outlines the user problem our research uncovered, the technical approach we chose, and the retrieval system that helps Cleo remember what matters.

Forgetfulness hurts user engagement

Before we introduced memory, user feedback and retention data indicated a problem: Cleo felt forgetful.

Cleo often asked the same questions and gave similar advice. She might respond to someone expressing financial stress with a generic tip, like cooking instead of dining out, while overlooking that they’d already said they were cutting back. This created an emotional disconnect, leading to lower trust and higher churn.

The cause? A lack of scalable long-term memory. Appending full chat histories to every prompt wasn’t feasible; it created a needle-in-a-haystack problem that caused the LLM to overlook key details. We needed a way for Cleo to recall essential context without bloating prompts or slowing response times.


How we chose our approach

  1. Knowledge graphs structure user data as interconnected entities and relationships (e.g., user → has_goal → save_for_house). This enables powerful reasoning over a user’s full financial context, including income, spending, and savings goals. But graph databases entail high engineering overhead and were unnecessarily complex for our goal of conversational continuity.


  2. Agent-centric stateful systems maintain a persistent internal state, with the AI agent autonomously deciding what to remember and adjusting its behavior over time. While powerful, this would have required fundamentally refactoring Cleo’s conversational logic.


  3. Semantic insight retrieval extracts and summarizes unstructured or semistructured insights from conversations, then stores them as vector embeddings for future retrieval. Put simply, Cleo saves what matters from each session and brings it up when relevant using retrieval-augmented generation (RAG). This approach was lightweight, integrated cleanly with our existing architecture, and directly supported continuity and personalization.

Of the three, semantic insight retrieval offered the most efficient, scalable path to value by enabling us to launch memory quickly and immediately improve user experience. It also let us remain less opinionated about the types of insights we expect from users, encouraging more dynamic conversations.

To strengthen our implementation, we incorporated structured metadata tagging inspired by knowledge-graph techniques, allowing for memory filtering by topic, recency, and user identity. This also sets the stage for richer reasoning in future updates.
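
To make that concrete, here’s a minimal sketch of what a tagged memory record and a pre-retrieval filter might look like. The field names, types, and defaults are illustrative assumptions on our part, not Cleo’s actual schema:

```python
# Sketch of a metadata-tagged memory record; field names are illustrative.
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class MemoryRecord:
    user_id: str
    summary: str                  # LLM-generated summary of the session
    topics: list[str]             # e.g. ["goal_setting", "cash_advance"]
    sentiment: str                # e.g. "anxious", "positive"
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    embedding: list[float] = field(default_factory=list)


def filter_memories(
    memories: list[MemoryRecord],
    user_id: str,
    topic: str | None = None,
    max_age_days: int = 90,
) -> list[MemoryRecord]:
    """Narrow candidates by user identity, topic, and recency
    before any semantic search runs."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [
        m for m in memories
        if m.user_id == user_id
        and m.created_at >= cutoff
        and (topic is None or topic in m.topics)
    ]
```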

Memory architecture options for Cleo 3.0

[Table comparing knowledge graphs, agent-centric stateful systems, and semantic insight retrieval]

Cleo 3.0’s memory pipeline

We built a memory system to capture and recall essential context—not every word, but key summaries, topics, and user sentiment.

Here’s how Cleo’s memory pipeline works:

  1. Capture and parse. After a meaningful free-text exchange, a dedicated service passes the interaction to an LLM (GPT-4o or Gemini 2.0 Flash), which summarizes it and extracts structured metadata, including key topics (such as goal_setting and cash_advance) and overall sentiment.


  2. Embed and store. The summary is converted into a vector embedding and stored in our vector database alongside the summary text and associated metadata.


  3. Retrieve and filter. In future conversations, the LLM agent decides when memory adds value. For example, if a user asks “What’s up?”, the agent calls a retrieval tool to semantically search past memories, filtering by recency and topic.


  4. Inject and generate. Retrieved memories are injected into the LLM prompt, enabling Cleo to respond with greater fluency, relevance, and awareness of user history.

This pipeline provides conversational continuity without bloating prompts, consistently meeting our latency target (p95 < 500 ms).
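
For intuition, here’s a minimal, self-contained sketch of those four steps. The LLM and embedder are injected as plain callables, and the vector store is an in-memory list ranked by cosine similarity; none of this is Cleo’s actual code or infrastructure:

```python
# Minimal sketch of the four-step pipeline with an in-memory vector store.
import math
from typing import Callable

Embedder = Callable[[str], list[float]]
Llm = Callable[[str], str]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryStore:
    """In-memory stand-in for a real vector database."""

    def __init__(self) -> None:
        self._rows: list[tuple[list[float], str]] = []  # (embedding, summary)

    def add(self, embedding: list[float], summary: str) -> None:
        # Step 2 (embed and store): keep the vector next to its summary text.
        self._rows.append((embedding, summary))

    def search(self, query_embedding: list[float], k: int = 3) -> list[str]:
        # Step 3 (retrieve): rank stored memories by semantic similarity.
        ranked = sorted(
            self._rows,
            key=lambda row: cosine(query_embedding, row[0]),
            reverse=True,
        )
        return [summary for _, summary in ranked[:k]]


def remember(store: MemoryStore, llm: Llm, embed: Embedder, transcript: str) -> None:
    # Step 1 (capture and parse): ask the LLM for a compact session summary.
    summary = llm(f"Summarize the key financial facts, goals, and sentiment in:\n{transcript}")
    store.add(embed(summary), summary)


def respond(store: MemoryStore, llm: Llm, embed: Embedder, user_msg: str) -> str:
    # Step 4 (inject and generate): add retrieved memories to the prompt.
    context = "\n".join(f"- {m}" for m in store.search(embed(user_msg)))
    return llm(f"What we remember about this user:\n{context}\n\nUser says: {user_msg}")
```

In production the in-memory store would be a real vector database, and the search step would apply the metadata filters (user, topic, recency) described above before ranking.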

Our design also addresses specific risks. Memory filtering helps ensure that Cleo doesn’t surface outdated information that conflicts with real-time data, like a user’s bank balance. Likewise, topic and sentiment analysis prevent Cleo from recalling overly sensitive memories, preserving comfort and trust.
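
A guard along these lines illustrates the idea; the topic and sentiment labels are invented for the example, not Cleo’s taxonomy:

```python
# Sketch of a surfacing guard: drop memories that could conflict with live
# data or touch sensitive territory. Label values are illustrative only.
VOLATILE_TOPICS = {"bank_balance", "pending_transactions"}  # live data wins
SENSITIVE_SENTIMENTS = {"acute_distress"}                   # avoid re-surfacing


def should_surface(topics: set[str], sentiment: str) -> bool:
    if topics & VOLATILE_TOPICS:
        return False  # defer to real-time account data instead
    if sentiment in SENSITIVE_SENTIMENTS:
        return False  # preserve user comfort and trust
    return True
```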


Early learnings and impact

The true test of any system is its real-world impact, and early results validate our core hypothesis: Recalling past context substantially improves user experience.

Building on these positive signals, we developed an evaluation framework to ensure reliability. We’re validating memory accuracy through offline testing to confirm that summaries reflect original conversations and avoid hallucinations. We’re also benchmarking retrieval quality against a custom dataset to assess how well the system surfaces relevant memories at the right time.

We’ve also gained valuable insights around refining retrieval triggers. Initially, we saw that memory could be counterproductive for narrow transactional queries like “What’s my current balance?” In response, we fine-tuned Cleo’s logic to trigger retrieval primarily during open-ended conversations, where contextual recall adds greater value.
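
A simplified version of that trigger logic might look like the following. The keyword heuristic is our stand-in for what is, per the post, an agent tool-use decision:

```python
# Sketch of a retrieval trigger: skip memory lookup for narrow transactional
# queries, run it for open-ended conversation. Markers are illustrative.
TRANSACTIONAL_MARKERS = ("balance", "how much", "transaction", "statement")


def should_retrieve_memory(user_msg: str) -> bool:
    msg = user_msg.lower()
    if any(marker in msg for marker in TRANSACTIONAL_MARKERS):
        return False  # answer from live account data; memory adds noise here
    return True       # open-ended chat benefits from recalled context
```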

These early findings affirm that we’re on the right track. Memory is already helping Cleo become a better listener and financial assistant, laying the groundwork for smarter, more proactive features.

What’s next

Our evaluation framework will help us refine the current system, but we’re also focused on three enhancements to make memory even more valuable: richer memory structures, proactive retrieval, and increased transparency.

Currently, Cleo summarizes each session independently. This supports short-term recall—like referencing a budgeting goal from a few days ago—but doesn’t build a persistent, evolving view of the user. For that, the agent relies on other tools, backed by a relational database.

But many important facts, like long-term goals or financial constraints, emerge through conversation and stay relevant over time. Going forward, we plan to have Cleo reflect on stored memories to consolidate and update her understanding of these facts, enabling deeper continuity across sessions.
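
One possible shape for that reflection step, sketched under our own assumptions about the prompt and flow rather than Cleo’s design:

```python
# Hypothetical consolidation step: merge per-session summaries on one topic
# into a single evolving fact, preferring newer information on conflict.
from typing import Callable


def consolidate(llm: Callable[[str], str], topic: str, summaries: list[str]) -> str:
    joined = "\n".join(f"- {s}" for s in summaries)
    return llm(
        f"These notes about a user's '{topic}' were taken across sessions:\n"
        f"{joined}\n"
        "Merge them into one up-to-date statement, preferring newer "
        "information when notes conflict."
    )
```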

We’re also exploring proactive retrieval, where Cleo initiates check-ins based on memory: following up on a user’s adherence to their budgeting plan, or encouraging them when they make progress on a savings goal. In practice, this means linking memory to push notifications and app entry messages, so relevant context is present before the user initiates an interaction.

Finally, we’re designing a user interface that lets users view and manage Cleo’s stored memories. This builds trust and creates a feedback loop: When users can correct or clarify memories, the coaching experience feels more personalized and aligned with their goals.

Takeaways

Success isn’t measured by backend improvements alone, but by whether we transform how users experience Cleo. Adding memory is a meaningful step toward an AI that remembers your story, not just data points.

  • Memory makes Cleo relational, not just reactive. She can now build on past conversations to offer more relevant, personalized coaching.

  • A lightweight architecture delivered fast value. Our LLM-powered summarization and tagging implementation lets us adjust what Cleo recalls at both storage time and retrieval time.

  • Memory unlocks a new generation of potential features. From proactive nudges to editable histories, Cleo is becoming a smarter, more supportive financial assistant.
