Tuesday, August 5, 2025
Mind the Gap: Why Arabic LLMs Still Lag Behind (and What We Can Do About It)
In training large language models (LLMs), post-training has become the secret sauce that turns powerful but generic models into finely tuned assistants that actually understand what you want. Most people have heard of "pre-training" and maybe "fine-tuning," but post-training is the layer that aligns a model with real-world expectations: following your instructions, staying safe, and behaving like a good digital citizen.
For Arabic, that magic is still hard to come by.
The Problem: Plenty of Models, Not Enough (Good) Data
Let’s start with the good news. In recent years, we’ve seen a wave of Arabic-centric LLMs, like JAIS, AceGPT, and the Allam series, enter the scene. These models are promising, but they’re still missing something crucial: high-quality, culturally nuanced Arabic post-training datasets.
Our recent study systematically reviewed all publicly available Arabic post-training datasets on Hugging Face. What we found was eye-opening:
- Over 360 datasets are available, but the majority are skewed toward only two tasks:
  - Translation (42%)
  - Question Answering (38%)
- More complex or essential tasks like function calling, code generation, or dialogue have almost no representation.
- Over 50% of the datasets are rarely or never used in real-world models.
- Only a handful are updated regularly or have strong documentation, licensing, or academic validation.
Why This Matters
Imagine building a chatbot for an Arabic-speaking audience. If all your data is translated from English, the bot might miss the tone, context, or even offend users unintentionally. That’s not just awkward—it can be dangerous in sensitive applications like education, healthcare, or civic services.
Three big problems stood out:
- Cultural Misalignment: Many datasets are translated from English without cultural filtering or adaptation. This leads to models that don't "speak" or "think" like native Arabic speakers.
- Low Dataset Quality & Visibility: Most Arabic datasets lack clear documentation, proper licensing, or validation. Without that, other researchers can't easily reuse or trust them.
- Missing Support for Advanced Tasks: Code generation, system prompting, and robust function calling? Arabic datasets for these are either missing or experimental at best.
What We Found
We analyzed Arabic datasets across 12 NLP domains. Here's a snapshot of where things stand:
| Task Domain | Current Coverage | Gap? |
|---|---|---|
| Translation, Q&A | Strong (150+ each) | None |
| Summarization | Moderate (45) | Quality issues |
| Reasoning & multi-step, dialogue/conversation, robustness & safety | Sparse (8 datasets) | Needs more scale |
| Cultural alignment, ethics, bias, and fairness | Very limited (≤3 each) | Critical gap |
| Code generation, function calling, official docs, and persona/ownership/system prompts | None | Total absence |
Only 57 out of 366 datasets are used in actual models. The rest sit unused, untested, or unnoticed.
What We Need (And How to Build It)
If we want Arabic LLMs to be as good as, or better than, their English counterparts, we need to change how we build post-training datasets. That starts with building for authenticity, culture, and complexity.
Here’s what we recommend:
🚫 Stop Translating. ✅ Start Creating.
Don’t just translate English data. Instead, crowdsource or generate native Arabic data that reflects real-life use cases, dialects, and values.
🎯 Target the Gaps
Focus on the domains where Arabic is critically underrepresented:
- Code generation
- Function/API calling
- Culturally aligned Q&A
- Dialogues in regional dialects
- Safety and ethical alignment
🧠 Use Smarter Methods
- Scrape Arabic GitHub repos to create code-comment pairs
- Collect dialectal conversations from native speakers
- Launch collaborative annotation platforms for cultural labeling
- Use LLMs for first-pass annotation, followed by human review
- Prompt existing Arabic-capable LLMs to generate synthetic—but verified—data
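To make the first of these methods concrete, here is a minimal sketch of the code-comment extraction step. It is not the study's pipeline; the function name and the Arabic sample source are invented for illustration. It uses Python's standard `ast` module to pull out functions whose docstrings contain Arabic characters, yielding (comment, code) pairs suitable for a post-training dataset.

```python
import ast
import re

# Basic Arabic Unicode block (U+0600 to U+06FF); an assumption for
# deciding whether a docstring "contains Arabic".
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")

def extract_code_comment_pairs(source: str):
    """Return (docstring, function_source) pairs for every function in
    `source` whose docstring contains Arabic text. Hypothetical helper
    sketching the scraping step, not a production scraper."""
    pairs = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc and ARABIC_RE.search(doc):
                # ast.unparse reconstructs the function's source code.
                pairs.append((doc, ast.unparse(node)))
    return pairs

# Invented sample: one Arabic-documented function, one English-only.
sample = '''
def jam3(a, b):
    """اجمع عددين وأعد الناتج."""
    return a + b

def helper(x):
    """English-only docstring, skipped."""
    return x
'''

pairs = extract_code_comment_pairs(sample)
```

Running this over real Arabic-commented repositories would of course need cloning, license checks, and deduplication on top; the sketch only shows the filtering core.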
Let’s Build Together
Building strong Arabic LLMs is more than a research goal; it's a cultural mission. We have the models, but we need better data: authentic, representative, and task-specific. That's where the next wave of innovation lies.
So whether you’re a researcher, a developer, or a language lover, consider this your call to action. Let’s create the datasets that Arabic deserves—rich in dialects, deep in meaning, and powerful enough to unlock the full potential of Arabic AI.
Reference
Study based on analysis of Arabic datasets on Hugging Face.