California Generative Artificial Intelligence System Disclosure

For systems and services in effect as of the date of disclosure: January 1, 2026

1. Dataset Overview

(1) Sources or Owners of the Datasets

Internally collected data of Cleo AI


(2) Purpose Alignment

Improving chat functionality


(3) Dataset Size

  • Approximately 500,000 bank transactions

  • Approximately 10,000 chatbot messages


(4) Types of Data Points

  • Unlabelled datasets: natural language text, structured data in JSON format

  • Labelled datasets: transaction category, merchant name


(5) Intellectual Property Status

The datasets consist of internally collected information and may include copyrighted, trademarked, or patented material.


(6) Data Acquisition

Proprietary/internal; the datasets we use are freely obtained, not purchased or licensed.


(7) Personal Information

Categories of personal information:

  • Name or nickname

  • Purchase history

  • Employment data

  • Email address

  • Location data

Safeguards:

  • Before using data to train generative models that produce text for chatbot elements, we apply automated processes designed to remove personal information.

  • For models that infer bank transaction metadata, the system is designed to output the name of a single category or entity rather than free-form text. This greatly reduces the chance that unrelated personal information appears in user-facing results.

  • The chatbot implements user-level data isolation at inference time, so that the AI system can only retrieve data about the authenticated user for input to generative models.
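
As an illustration only, user-level data isolation at inference time can be sketched as a retrieval layer bound to the authenticated user. This is a minimal sketch under stated assumptions, not a description of Cleo's actual implementation: the `Transaction` and `UserScopedStore` names, the fields, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    # Hypothetical record shape, for illustration only.
    user_id: str
    merchant: str
    amount: float


class UserScopedStore:
    """Hypothetical store whose only read path is bound to one authenticated user."""

    def __init__(self, transactions, authenticated_user_id):
        self._transactions = list(transactions)
        self._user_id = authenticated_user_id

    def retrieve(self):
        # The filter lives inside the store, so calling code (for example,
        # prompt-building code) has no way to request another user's records.
        return [t for t in self._transactions if t.user_id == self._user_id]


all_transactions = [
    Transaction("user_a", "Grocer", 12.50),
    Transaction("user_b", "Cinema", 9.00),
    Transaction("user_a", "Cafe", 3.75),
]

store = UserScopedStore(all_transactions, authenticated_user_id="user_a")
context = store.retrieve()  # only user_a's records are eligible as model input
```

The design choice illustrated here is that the user filter is applied where the data is read rather than by each caller, so a mistake in downstream code cannot widen what is retrieved.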


(8) Aggregate Consumer Information

Datasets include aggregate statistics on behavior across groups of Cleo users. We apply a minimum group size of 500 users per group to reduce the likelihood that individual users can be re-identified.
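
A minimum group size rule of this kind can be sketched as follows. Only the threshold of 500 comes from the disclosure; the record schema, the function name `aggregate_by_group`, and the choice of mean spend as the statistic are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

MIN_GROUP_SIZE = 500  # threshold stated in the disclosure


def aggregate_by_group(records, min_group_size=MIN_GROUP_SIZE):
    """Return mean spend per group, omitting groups with too few distinct users.

    `records` is an iterable of (user_id, group, spend) tuples; this schema
    is hypothetical.
    """
    users = defaultdict(set)
    spends = defaultdict(list)
    for user_id, group, spend in records:
        users[group].add(user_id)
        spends[group].append(spend)
    # Groups below the threshold are dropped entirely rather than reported.
    return {
        group: mean(spends[group])
        for group in spends
        if len(users[group]) >= min_group_size
    }


records = [(f"u{i}", "students", 10.0) for i in range(600)]
records += [(f"v{i}", "retirees", 20.0) for i in range(40)]
stats = aggregate_by_group(records)  # "retirees" (40 users) is suppressed
```

Counting distinct users rather than rows matters in this sketch: a group with many transactions from a handful of users would otherwise pass the threshold.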


(9) Data Processing and Modifications

  • Depending on the model, we may filter data to prioritize high-quality examples, including based on user feedback, in-house annotation, or automated classification.

  • For classification tasks, we label data using large language models and/or human annotation.

  • For open-ended generation tasks where the generated text is shown directly to users, we apply processes designed to remove personal information.
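
The disclosure does not specify how personal information is removed. As one hedged illustration, a pattern-based redaction step for a single category (email addresses) might look like the sketch below. The regular expression, the `redact_emails` function, and the `[EMAIL]` placeholder are assumptions; a production process would likely cover more categories and use more than pattern matching.

```python
import re

# Simplified email pattern for illustration; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text, placeholder="[EMAIL]"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


message = "Please send my statement to jane.doe@example.com today."
cleaned = redact_emails(message)
# cleaned == "Please send my statement to [EMAIL] today."
```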


(10) Data Collection Period

May 2023 - present


(11) Dataset Use Dates

First used: Oct 2025


(12) Synthetic Data

We do not currently use synthetic data for training generative AI models.