
Tuesday, August 5, 2025

Mind the Gap: Why Arabic LLMs Still Lag Behind (and What We Can Do About It)


In the development of large language models (LLMs), post-training has become the secret sauce that turns powerful but generic models into finely tuned assistants that actually understand what you want. While most people have heard of "training" and maybe even "fine-tuning," post-training is the layer that aligns a language model with real-world expectations: understanding your instructions, staying safe, and behaving like a good digital citizen.

For Arabic, that magic is still hard to come by.

The Problem: Plenty of Models, Not Enough (Good) Data

Let’s start with the good news. In recent years, we’ve seen a wave of Arabic-centric LLMs, like JAIS, AceGPT, and the Allam series, enter the scene. These models are promising, but they’re still missing something crucial: high-quality, culturally nuanced Arabic post-training datasets.

Our recent study systematically reviewed all publicly available Arabic post-training datasets on Hugging Face. What we found was eye-opening:

  • Over 360 datasets are available, but the majority are skewed toward only two tasks:

    • Translation (42%)
    • Question Answering (38%)
  • More complex or essential tasks like function calling, code generation, or dialogue have almost no representation.

  • Over 50% of the datasets are rarely or never used in real-world models.

  • Only a handful are updated regularly or have strong documentation, licensing, or academic validation.

Why This Matters

Imagine building a chatbot for an Arabic-speaking audience. If all your data is translated from English, the bot might miss the tone, context, or even offend users unintentionally. That’s not just awkward—it can be dangerous in sensitive applications like education, healthcare, or civic services.

Three big problems stood out:

  1. Cultural Misalignment. Many datasets are translated from English without cultural filtering or adaptation. This leads to models that don’t "speak" or "think" like native Arabic speakers.

  2. Low Dataset Quality & Visibility. Most Arabic datasets lack clear documentation, proper licensing, or validation. Without that, other researchers can’t easily reuse or trust them.

  3. Missing Support for Advanced Tasks. Code generation, system prompting, and robust function calling? Arabic datasets for these are either missing or experimental at best.

What We Found

We analyzed Arabic datasets across 12 NLP domains. Here's a snapshot of where things stand:

Task Domain                                                               Current Coverage         Gap?
Translation, Q&A                                                          Strong (150+ each)       None
Summarization                                                             Moderate (45)            Quality issues
Reasoning & multi-step, dialogue/conversation, robustness & safety        Sparse (8 datasets)      Needs more scale
Cultural alignment, ethics, bias, and fairness                            Very limited (≤3 each)   Critical gap
Code generation, function calling, official docs, persona/system prompts  None                     Total absence

Only 57 out of 366 datasets are used in actual models. The rest sit unused, untested, or unnoticed.
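As a rough illustration of the kind of tally behind these numbers, here is a minimal sketch in Python. The metadata records below are made up for the example; in practice they would come from the Hugging Face Hub rather than a hard-coded list.

```python
from collections import Counter

# Hypothetical metadata records; real entries would be fetched from the
# Hugging Face Hub API rather than hard-coded like this.
datasets = [
    {"name": "ar-mt-corpus",  "task": "translation",        "used_in_models": True},
    {"name": "ar-qa-pairs",   "task": "question-answering", "used_in_models": True},
    {"name": "ar-qa-dialect", "task": "question-answering", "used_in_models": False},
    {"name": "ar-summaries",  "task": "summarization",      "used_in_models": False},
]

# Tally coverage per task, and the share of datasets actually adopted by models.
coverage = Counter(d["task"] for d in datasets)
adoption = sum(d["used_in_models"] for d in datasets) / len(datasets)

print(coverage.most_common())
print(f"adoption rate: {adoption:.0%}")
```

On this toy sample the adoption rate comes out to 50%; the study's real figure (57 of 366, about 16%) shows how much lower it is at scale.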

What We Need (And How to Build It)

If we want Arabic LLMs to be as good as, or better than, their English counterparts, we need to change how we build post-training datasets. That starts with building for authenticity, culture, and complexity.

Here’s what we recommend:

🚫 Stop Translating. ✅ Start Creating.

Don’t just translate English data. Instead, crowdsource or generate native Arabic data that reflects real-life use cases, dialects, and values.

🎯 Target the Gaps

Focus on the domains where Arabic is critically underrepresented:

  • Code generation
  • Function/API calling
  • Culturally aligned Q&A
  • Dialogues in regional dialects
  • Safety and ethical alignment

🧠 Use Smarter Methods

  • Scrape Arabic GitHub repos to create code-comment pairs
  • Collect dialectal conversations from native speakers
  • Launch collaborative annotation platforms for cultural labeling
  • Use LLMs for first-pass annotation, followed by human review
  • Prompt existing Arabic-capable LLMs to generate synthetic—but verified—data
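The first idea above, mining repositories for code-comment pairs, can be sketched with Python's standard ast module. The jam3 function and its Arabic docstring ("compute the sum of two numbers") are made-up sample data standing in for scraped source files.

```python
import ast
import textwrap

def extract_pairs(source):
    """Pair each documented function's docstring with its source code."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # keep only functions carrying a natural-language description
                pairs.append((doc.strip(), ast.get_source_segment(source, node)))
    return pairs

# Made-up sample file; in practice `source` would be Arabic-commented code
# scraped from public repositories.
sample = textwrap.dedent('''
    def jam3(a, b):
        """احسب مجموع عددين."""
        return a + b
''')

for instruction, code in extract_pairs(sample):
    print(instruction, "->", code.splitlines()[0])
```

Each (docstring, code) pair becomes an instruction-response example for code-generation post-training; a human pass would still be needed to filter low-quality or auto-generated comments.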

Let’s Build Together

Building strong Arabic LLMs is more than a research goal; it’s a cultural mission. We have the models, but we need better data: authentic, representative, and task-specific. That’s where the next wave of innovation lies.

So whether you’re a researcher, a developer, or a language lover, consider this your call to action. Let’s create the datasets that Arabic deserves—rich in dialects, deep in meaning, and powerful enough to unlock the full potential of Arabic AI.


Reference

Study based on analysis of Arabic datasets on Hugging Face.