For systems and services in effect as of the date of disclosure: January 1, 2026
Internally collected data of Cleo AI
Improving chat functionality
Approximately 500,000 bank transactions
Approximately 10,000 chatbot messages
Unlabelled datasets: natural language text, structured data in JSON format
Labelled datasets: transaction category, merchant name
The datasets consist of internally collected information and may include copyrighted, trademarked, or patented material.
Proprietary/internal; the datasets we use are freely obtained, not purchased or licensed.
Categories of personal information:
Name or nickname
Purchase history
Employment data
Email address
Location data
Safeguards:
Before using data to train generative models that produce text for chatbot elements, we apply automated processes designed to remove personal information.
For models that infer bank transaction metadata, the system is designed to output the name of a single category or entity rather than free-form text. This greatly reduces the chance that unrelated personal information appears in user-facing results.
The chatbot implements user-level data isolation at inference time, so that the AI system can only retrieve data about the authenticated user for input to generative models.
Datasets include aggregate statistics on behavior across groups of Cleo users. We apply a minimum group size of 500 users per group to reduce the likelihood that individual users can be re-identified.
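As an illustration of the minimum group size safeguard, the following is a minimal sketch, not a description of Cleo's actual pipeline. It assumes records arrive as (group, user_id, value) tuples and that the published statistic is a per-group mean; the function name, record layout, and choice of statistic are all hypothetical. Only the 500-user threshold comes from the text above.

```python
from collections import defaultdict

# Threshold taken from the disclosure; everything else here is illustrative.
MIN_GROUP_SIZE = 500


def aggregate_with_min_group_size(records, min_group_size=MIN_GROUP_SIZE):
    """Return {group: mean value}, omitting groups with too few distinct users.

    `records` is an iterable of (group, user_id, value) tuples
    (a hypothetical layout chosen for this sketch).
    """
    users = defaultdict(set)      # distinct users seen per group
    totals = defaultdict(float)   # running sum of values per group
    counts = defaultdict(int)     # number of records per group

    for group, user_id, value in records:
        users[group].add(user_id)
        totals[group] += value
        counts[group] += 1

    # Suppress any group whose distinct-user count is below the threshold.
    return {
        group: totals[group] / counts[group]
        for group in users
        if len(users[group]) >= min_group_size
    }
```

The threshold is applied to distinct users rather than to record counts, so a few very active users cannot push a small group over the limit.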
Depending on the model, we may filter data to prioritize high-quality examples, for example on the basis of user feedback, in-house annotation, or automated classification.
For classification tasks, we label data using large language models and/or human annotation.
For open-ended generation tasks where the generated text is shown directly to users, we apply processes designed to remove personal information.
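The disclosure does not specify how personal information is removed. As a rough illustration only, the sketch below shows one common and simple approach: pattern-based redaction of email addresses and phone-like numbers before text is used for training. The patterns, placeholder tokens, and function name are assumptions made for this example; production systems typically combine such rules with named-entity recognition and human review.

```python
import re

# Illustrative patterns only; these are assumptions, not Cleo's actual rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(redact("Contact jane.doe@example.com or +44 20 7946 0958 today"))
```

A rule-based pass like this is easy to audit but misses names, addresses, and other free-form identifiers, and the phone pattern can also match long digit sequences that are not phone numbers.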
May 2023 - present
First used: Oct 2025
We do not currently use synthetic data for training generative AI models.