Thursday, September 5, 2024
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Imagine giving an AI not just one voice, but a billion different perspectives, each with its own personality, background, and style of thinking. This is exactly what the team behind the paper Scaling Synthetic Data Creation with 1,000,000,000 Personas set out to do.
The idea is to use "personas", fictional or real-world-inspired identities, to guide how a language model generates synthetic data. And not just a handful of personas, but literally a billion of them.
What Is a Persona Hub?
The authors introduce Persona Hub (about 10 billion tokens), which can be thought of as a compressed carrier of the world's knowledge (the roughly 100 trillion tokens of web text used to train LLMs). In other words, Persona Hub is a way to teach AI how to see the world through different lenses, with each persona reflecting a unique voice.
The paper shows that Persona Hub can be used to create highly diverse synthetic samples, from personas as different as a machine learning researcher who studies attention mechanisms and a street artist in Berlin. It demonstrates that using personas to make data is powerful, flexible, and easy to scale, and it could change how synthetic data is made and used in AI.
To help others build on this research, the authors are releasing 200,000 personas and the following sample data created with them:
- 🧠 50,000 logic problems
- 📚 50,000 instructions
- 🔢 50,000 math questions
- 🎮 10,000 game NPCs
- 🌍 10,000 knowledge texts
- 🛠️ 5,000 functional tools
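If you want to explore the release yourself, it can be loaded with the Hugging Face datasets library. The repo id and subset name below are my assumption of where the release lives; check the actual repository for the exact configuration names:

```python
# Sketch for loading the released personas from the Hugging Face Hub.
# The repo id "proj-persona/PersonaHub" and the "persona" subset name
# are assumptions -- verify them against the actual release.
from datasets import load_dataset

personas = load_dataset("proj-persona/PersonaHub", "persona", split="train")
print(personas[0])
```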
How It Works
Creating personas might sound tricky, but the process is surprisingly elegant. The team used two key strategies:
1. Text-to-Persona
Every piece of text reflects its writer or reader. So, the model looks at a chunk of web text and asks:
“What kind of person might this come from?”
These personas can be broad or highly specific:
- A general persona like "a software engineer"
- A more detailed one like "a frontend developer who loves open-source projects"
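In practice, Text-to-Persona is little more than a prompt template. Here is a minimal sketch, where `complete()` is a hypothetical stand-in for whatever LLM client you use, and the template wording is illustrative rather than the paper's exact prompt:

```python
# Sketch of Text-to-Persona: infer who might write or read a text.
# `complete` is a hypothetical helper wrapping an LLM call.
def text_to_persona(text: str, granularity: str = "fine-grained") -> str:
    prompt = (
        "Who is likely to read, write, or be interested in the text below? "
        f"Describe that person in one {granularity} sentence.\n\n"
        f"Text:\n{text}"
    )
    return complete(prompt)

# text_to_persona(attention_paper_abstract)
# -> "a machine learning researcher who studies attention mechanisms"
```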
2. Persona-to-Persona
Text-to-Persona is powerful, but it might miss perspectives that aren’t often represented online—like a child, a homeless person, or a behind-the-scenes movie editor.
To include these voices, the paper introduces Persona-to-Persona. This method creates new personas by exploring relationships between people. For example:
- From “a nurse at a children’s hospital” → we get “a child patient”
- From “a social worker” → we get “a person experiencing homelessness”
This way, the model creates personas not just from text, but from the social fabric.
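This too boils down to a prompt template. A minimal sketch, reusing the hypothetical `complete()` helper from above:

```python
# Sketch of Persona-to-Persona: expand the hub by asking who stands
# in a close relationship to an existing persona.
def persona_to_persona(persona: str, n: int = 3) -> str:
    prompt = (
        "Who is in a close relationship with the persona below "
        f"(e.g. patient, family member, colleague)? List {n} such "
        "people, one short persona description per line.\n\n"
        f"Persona: {persona}"
    )
    return complete(prompt)

# persona_to_persona("a nurse at a children's hospital")
# -> "a child patient\na worried parent\na hospital social worker"
```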
Cleaning It All Up: Deduplication
With millions (soon billions) of personas, duplication is a problem. To fix that, they used a two-step cleanup process:
- MinHash deduplication: builds MinHash signatures on 1-grams (individual words) and drops any persona that is more than 90% similar to one already kept.
- Embedding deduplication: uses text embeddings to catch meaning-level duplicates; if the similarity between two personas' embeddings exceeds 0.9, only one is kept.
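The MinHash step is easy to reproduce with an off-the-shelf library such as datasketch. This is a minimal sketch with plausible parameters (signature size, threshold), not necessarily the authors' exact configuration:

```python
# Sketch of MinHash deduplication over personas using datasketch.
from datasketch import MinHash, MinHashLSH

def minhash_dedup(personas, threshold=0.9, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, persona in enumerate(personas):
        m = MinHash(num_perm=num_perm)
        for word in persona.lower().split():  # 1-gram (word-level) shingles
            m.update(word.encode("utf8"))
        if not lsh.query(m):  # no already-kept persona is ~90% similar
            lsh.insert(str(i), m)
            kept.append(persona)
    return kept
```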
Persona-Driven Synthetic Data Creation
Their method for making synthetic data using personas is simple but powerful. They just add a persona to the prompt given to the language model. This helps the model think from that persona’s point of view and create more varied data.
With 1 billion personas in Persona Hub, they can generate a huge amount of diverse data.
Like other AI prompting methods, this approach is flexible. There are three ways to use it:
- Zero-shot prompting: no examples are given, just the persona and the task. This gives the model the most room to be creative.
- Few-shot prompting: a few examples are provided to guide the model's output.
- Persona-enhanced few-shot prompting: adds a matching persona to each few-shot example. This boosts performance but takes more effort, since each example needs a suitable persona.
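Concretely, the three modes differ only in how the prompt is assembled. A minimal sketch, with illustrative template wording rather than the paper's exact prompts:

```python
# Sketch of the three persona-driven prompting modes.
def zero_shot(persona: str, task: str) -> str:
    return f"{task}\nAssume you are the following persona: {persona}"

def few_shot(persona: str, task: str, examples: list[str]) -> str:
    demos = "\n\n".join(examples)
    return f"{demos}\n\n{zero_shot(persona, task)}"

def persona_enhanced_few_shot(
    persona: str, task: str, examples: list[tuple[str, str]]
) -> str:
    # Each demonstration carries its own matching persona.
    demos = "\n\n".join(f"Persona: {p}\n{ex}" for p, ex in examples)
    return f"{demos}\n\n{zero_shot(persona, task)}"
```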
Example: Math Problem
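As an illustration, here is what the zero-shot mode sketched above could produce for a math task (the persona and the model output shown are invented for illustration, not taken from the paper):

```python
prompt = zero_shot(
    persona="a chemist who studies reaction rates",
    task="Create a challenging math problem.",
)
# A model might respond with something like:
#   "A reactant's concentration follows C(t) = C0 * e^(-kt). If the
#    concentration halves every 12 minutes, find k, then compute how
#    long it takes to fall to 10% of C0."
```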
🚀✨