How we taught Cleo to talk back

When extending Cleo’s chat engine to real-time voice, we needed to keep Cleo’s personality and tone while maintaining high accuracy and low latency.

On October 13, 2025

Until recently, every interaction with Cleo took place in text-based chat. Text is efficient, searchable, and easy to scan. But real conversations, especially around sensitive topics like money, are about more than just effectively conveying information. 

When we speak, rhythm and tone add emotional elements that text alone can’t capture. The same is true for AI assistants, where a reminder spoken by a warm, encouraging voice registers differently than the same message displayed in a chat bubble. 

Voice Mode is our way of capturing that difference. In Cleo 3.0, we extended Cleo’s conversational core to incorporate a voice layer. Tap the equalizer icon next to the chat input, and you can talk with Cleo out loud in real time. We’ve designed this experience to feel as natural and fluid as possible, so that Cleo feels like a trusted companion and not another generic robot voice.

Why we built Voice Mode

Voice adds new dimensions to Cleo 3.0 in a few key areas:

  • Personality. The way Cleo’s unique personality builds rapport with users has always been a core strength. Adding a consistent identity across voice-enabled features, like live conversations and Money IQ, deepens that connection.

  • Empathy. Few domains are as emotionally charged as personal finance. Money can spark feelings of joy and pride just as easily as insecurity and shame. In those moments, the ability to introduce vocal intonation lets us convey warmth and encouragement in ways that aren’t possible with text alone.

  • Accessibility and efficiency. On a practical level, speaking aloud is simply easier for some users and in certain contexts. Voice Mode enables hands-free communication with Cleo and offers a faster, more natural way to interact.

How we designed Voice Mode

From the start, we decided that voice shouldn’t be a separate surface. Splitting the product into “text Cleo” and “voice Cleo” would risk fragmenting the user experience and internal architecture in undesirable ways. Instead, Voice Mode sets a flag on the existing chat pipeline, routing outputs to text-to-speech.
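
As a sketch of what that flag means in practice (the names here are illustrative stand-ins, not Cleo's actual internals), the same generation loop serves both surfaces, and Voice Mode simply mirrors the streamed output to text-to-speech:

```python
from typing import Iterator

def generate(prompt: str) -> Iterator[str]:
    """Stand-in for the shared chat model, streaming tokens as they arrive."""
    yield from ["Sure", "!", " Let's", " look", " at", " your", " spending", "."]

def speak(text: str) -> None:
    """Stand-in for the hand-off to text-to-speech."""
    print(f"[tts] {text}")

def handle_message(prompt: str, voice_mode: bool = False) -> str:
    """One pipeline for both surfaces: Voice Mode is a flag, not a fork.
    The text output is always produced; the flag mirrors it to TTS."""
    tokens = []
    for token in generate(prompt):
        tokens.append(token)
        if voice_mode:
            speak(token)
    return "".join(tokens)
```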

That design preserved product continuity but introduced engineering challenges. Text and audio needed to stay reasonably synchronized at low latency, all while the responses maintained the Cleo personality that users have come to expect after years of text interactions.

At a high level, Voice Mode works as follows (see the sketch after the list):

  1. The user speaks, and their words are transcribed using native on-device dictation APIs. This enabled us to ship Voice Mode without building a speech-to-text pipeline from scratch.

  2. The partial transcript of the user’s speech streams into Cleo’s model pipeline (the same one that underlies text chat).

  3. As soon as the model begins producing tokens, those tokens are mirrored to ElevenLabs’ v2 voice generation model for synthesis.

  4. Audio frames are then streamed back to the client and stitched together for playback, even before the full response is complete.
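
Strung together, those four stages form a single streaming path from microphone to speaker. The sketch below compresses them into stand-in functions (the real system is asynchronous and far more involved):

```python
from typing import Iterator

def transcribe() -> Iterator[str]:
    """Stage 1 stand-in: partial transcripts from on-device dictation."""
    yield from ["how much", "how much did I spend", "how much did I spend today?"]

def model_tokens(transcript: str) -> Iterator[str]:
    """Stage 2 stand-in: the shared chat model streaming its reply."""
    yield from ["You", " spent", " $40.60", " at", " Walmart", " today."]

def synthesize(segment: str) -> Iterator[bytes]:
    """Stage 3 stand-in: streaming voice synthesis returning audio frames."""
    yield segment.encode()

def play(frame: bytes) -> None:
    """Stage 4 stand-in: the client stitches frames together for playback."""
    print(f"[audio] {frame!r}")

def voice_turn() -> None:
    transcript = ""
    for partial in transcribe():        # partials stream in as the user speaks
        transcript = partial            # keep the latest, most complete one
    for token in model_tokens(transcript):
        for frame in synthesize(token): # mirror tokens to synthesis immediately
            play(frame)                 # playback begins before the reply ends

voice_turn()
```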

When designing Voice Mode, we also wanted the voice itself to sound right. Text chat with Cleo has always been marked by humor and individuality, and we needed the same identity to come through in audio; a generic, machine-like voice wouldn’t have felt like Cleo.

To choose Cleo’s voice, we sourced multiple voice actors and had them record test scripts. From there, we used ElevenLabs to clone the voices and generate Cleo-style lines, then compared them on qualities like cadence and emotional range. Ultimately, we chose a single voice actor whose voice we felt best captured Cleo’s wit and empathy without sacrificing clarity or tipping into the cartoonish.

Technical challenge spotlight: The latency–fluency tradeoff

While the high-level process might look straightforward, developing it required significant technical orchestration. For example, we needed to create rules to govern scenarios like a user interrupting Cleo mid-reply, or gracefully falling back to text when audio occasionally fails to synthesize.
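
As an illustration of such rules, here's a sketch of a delivery loop that handles both cases, with stand-in stubs for synthesis and interruption detection (the actual logic is more involved):

```python
import random

def synthesize(text: str):
    """Stand-in for streaming synthesis; occasionally unavailable."""
    if random.random() < 0.1:
        raise RuntimeError("synthesis failed")
    yield from (word.encode() for word in text.split())

def user_started_speaking() -> bool:
    """Stand-in for barge-in detection (mic activity while Cleo is talking)."""
    return random.random() < 0.05

def deliver_reply(reply_text: str) -> None:
    try:
        for frame in synthesize(reply_text):
            if user_started_speaking():            # user interrupts mid-reply:
                print("[stopped: handing the turn back]")
                return                             #   stop talking, start listening
            print(f"[audio] {frame!r}")
    except RuntimeError:
        print(f"[text fallback] {reply_text}")     # audio failed: degrade to text

deliver_reply("You spent $40.60 at Walmart today.")
```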

Balancing latency

Latency, in particular, was a notable technical challenge. The human ear is acutely sensitive to delay; even short pauses can break the illusion of fluid conversation. Hitting that bar of near-instantaneous replies is especially tough when coordinating multiple services, streamed text, and synthesized audio.

Our early approach was to emit tokens as they arrived, splitting on punctuation and sending each segment directly to ElevenLabs. But this led to some stilted delivery and misreading of certain phrases: Lists broke in the middle, Markdown symbols were read out literally, and numbers with decimals were interpreted incorrectly. In a financial context, the latter was especially problematic. In the phrase “you spent $40.60 at Walmart,” for example, Cleo might read out the amount as “forty dollars [pause] sixty,” rather than “forty dollars and sixty cents.”
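
In code, the failure mode looks something like this (a simplified reconstruction, not our actual implementation):

```python
import re

def naive_segments(tokens):
    """First approach: flush a segment to synthesis at every punctuation mark.
    Splitting on '.' also splits decimal amounts like $40.60."""
    buf = ""
    for tok in tokens:
        buf += tok
        if re.search(r"[.,!?]", tok):   # punctuation ends a segment
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()

# "$40.60" arrives as several tokens, so the '.' forces a premature flush:
print(list(naive_segments(["you spent ", "$40", ".", "60", " at Walmart"])))
# ['you spent $40.', '60 at Walmart']  ->  "forty dollars [pause] sixty"
```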

We addressed this with several layers of heuristics (sketched in code after the list):

  • Token classification. Each token is inspected for type (numeral, Markdown symbol, and so on).

  • Dynamic buffering. Adjacent numerals are held and recombined into a likely complete value before Cleo speaks.

  • Length bounds. Setting minimum and maximum phrase lengths prevents abrupt single-word flushes and unwieldy run-ons.

  • First-audio delay. The pipeline waits long enough to pair quick lead-ins with the next clause, smoothing transitions and avoiding awkward tonal shifts after short phrases like “That’s great!”
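
The sketch below shows how the first three heuristics compose; the thresholds and classification rules are illustrative, and first-audio delay is omitted for brevity:

```python
import re

MIN_CHARS, MAX_CHARS = 12, 80   # length bounds (illustrative values)

def classify(tok: str) -> str:
    """Token classification: numerals and Markdown need special handling."""
    if re.fullmatch(r"[\d.,$]+", tok.strip()):
        return "numeral"
    if tok.strip() in {"*", "**", "#", "`", "-"}:
        return "markdown"
    return "text"

def smart_segments(tokens):
    buf, number = "", ""
    for tok in tokens:
        kind = classify(tok)
        if kind == "markdown":
            continue                  # never read formatting symbols aloud
        if kind == "numeral":
            number += tok             # dynamic buffering: hold adjacent
            continue                  #   numerals until the value is whole
        if number:
            buf, number = buf + number, ""   # numeral run ended: flush intact
        buf += tok
        ends_clause = re.search(r"[.!?]\s*$", buf)
        if (ends_clause and len(buf) >= MIN_CHARS) or len(buf) >= MAX_CHARS:
            yield buf.strip()         # bounds avoid one-word flushes
            buf = ""                  #   and unwieldy run-ons
    tail = (buf + number).strip()
    if tail:
        yield tail

print(list(smart_segments(["you spent ", "$40", ".", "60", " at Walmart."])))
# ['you spent $40.60 at Walmart.']
```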

Building a voice-first personality

We also had to resolve cases where written conventions don’t translate well into audio. For example, when chatting with Cleo via text, swear words are masked with asterisks (e.g., sh*t). Early versions of Voice Mode read the symbol literally, speaking the full word “asterisk.” We reframed the problem in voice-first terms: Cleo now says softened variants (e.g., sh) that preserve the implication without fully pronouncing the word.

Likewise, our text-to-speech pipeline didn’t initially handle exclamation marks well. Although in text they usually suggest excitement, the voice model sometimes rendered them as angry or accusatory: a “Hey you!” intended as cheerful could instead sound like shouting. To maintain warmth, we chose to silently downgrade many exclamation marks to periods, prioritizing consistency of tone over perfect fidelity to the text.
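
Both fixes amount to a voice-first normalization pass over the text before it reaches synthesis. A toy version (the real rules are more nuanced, and these regexes are illustrative):

```python
import re

def voice_normalize(text: str) -> str:
    """Rewrite written conventions that don't translate to audio."""
    # Masked swear words: speak the softened lead-in, never "asterisk".
    text = re.sub(r"(\w+)\*+\w*", r"\1", text)   # "sh*t" -> "sh"
    # Exclamation marks can come out as shouting; downgrade them to periods.
    text = re.sub(r"!+", ".", text)
    return text

print(voice_normalize("That's sh*t luck! Hey you!"))
# "That's sh luck. Hey you."
```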

Lessons learned and next steps

What we learned building Voice Mode reflects a broader lesson from building conversational systems: The toughest challenges often aren’t the core AI models themselves, but making those models usable in messy, real-world scenarios. Today’s LLMs and text-to-speech engines can produce fluent text and natural voices, respectively. The hard part was weaving them together in a way that maintained low latency, preserved personality, and ensured reliability inside a product that people trust with their money.

Key takeaways from building Voice Mode include:

  • The right architecture matters as much as the underlying models; using a single pipeline for text and voice avoids fragmentation.

  • Latency, not raw model quality, is often the limiting factor in creating a natural conversational experience.

  • Personality and brand consistency deserve as much consideration as accuracy in sustaining user trust.

Our current iteration of Voice Mode is just the beginning. As models become faster, more agentic, and more tightly integrated with memory and reasoning, we’re aiming to make conversations with Cleo feel even more like talking to a trusted friend who fully understands your financial story.
