Research Publications

Our Research

Peer-reviewed publications advancing the field of data-driven AI research. Explore our latest findings and methodologies.

Mind the GAP: A Review of Arabic Post-Training Datasets and Their Limitations

RefineAI Team, Collaborators
2025

Abstract

Post-training has emerged as a crucial technique for aligning pre-trained Large Language Models (LLMs) with human instructions, significantly enhancing their performance across a wide range of tasks. Central to this process is the quality and diversity of post-training datasets. This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub, organized along four key dimensions: (1) LLM Capabilities (e.g., Question Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation, and Function Calling); (2) Steerability (e.g., Persona and System Prompts); (3) Alignment (e.g., Cultural, Safety, Ethics, and Fairness); and (4) Robustness. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution. Our review revealed critical gaps in the development of Arabic post-training datasets, including limited task diversity, inconsistent or missing documentation and annotation, and low adoption across the community. Finally, the paper discusses the implications of these gaps on the progress of Arabic-centric LLMs and applications while providing concrete recommendations for future efforts in Arabic post-training dataset development.
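
For readers who want to explore the same space of datasets the review covers, the snippet below is a minimal sketch of how one might enumerate Arabic datasets on the Hugging Face Hub with the huggingface_hub client. It is not the authors' survey code; the "language:ar" tag filter and the keyword heuristic for spotting post-training sets are our own assumptions.

from huggingface_hub import list_datasets

# Enumerate Hub datasets tagged as Arabic ("language:ar" tag filter).
arabic_datasets = list_datasets(filter="language:ar", full=True)

# Rough keyword heuristic (an assumption, not the paper's taxonomy)
# for spotting likely post-training datasets by repository name.
POST_TRAINING_HINTS = ("instruct", "sft", "dpo", "preference", "chat", "rlhf")

for ds in arabic_datasets:
    name = ds.id.lower()
    if any(hint in name for hint in POST_TRAINING_HINTS):
        # Popularity signals of the kind the review weighs: downloads and likes
        # (may be None for some repositories).
        print(ds.id, ds.downloads, ds.likes)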

Keywords

Data Curation, LLM, Data Quality, Arabic NLP, Post-Training
BibTeX
@misc{alkhowaiter2025mindgapreviewarabic,
      title={Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations}, 
      author={Mohammed Alkhowaiter and Norah Alshahrani and Saied Alshahrani and Reem I. Masoud and Alaa Alzahrani and Deema Alnuhait and Emad A. Alghamdi and Khalid Almubarak},
      year={2025},
      eprint={2507.14688},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.14688}, 
}

Joud: A Massive Pre-training Dataset for Arabic Large Language Models

RefineAI Research Lab, Contributors
2025

Abstract

A key factor behind the recent achievements of Large Language Models (LLMs) is their utilization of massive quantities of textual data for unsupervised pre-training. However, simply training a model on all accessible data may not be the best approach, as the quality of the available text can fluctuate. In this paper, we introduce Joud, a massive, diverse, and high-quality pre-training dataset for Arabic language models. We publicly release our highly filtered and cleaned dataset, our pre-processing and cleaning pipeline, and its code scripts, aiming to accelerate progress in Arabic LLM development and hoping to encourage researchers in the Arabic NLP community to open-source their work.
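
The abstract refers to a filtering and cleaning pipeline but this page does not reproduce it. The sketch below is a hypothetical illustration of one rule such a pipeline might apply; the Arabic-script-ratio and length thresholds are illustrative assumptions, not Joud's actual criteria.

import re

# Matches Arabic-script characters (basic block, Supplement, Extended-A).
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF]")

def keep_document(text: str,
                  min_chars: int = 200,
                  min_arabic_ratio: float = 0.6) -> bool:
    """Illustrative quality filter: keep documents that are long enough
    and predominantly written in Arabic script. Thresholds are assumptions."""
    if len(text) < min_chars:
        return False
    arabic_ratio = len(ARABIC_CHARS.findall(text)) / max(len(text), 1)
    return arabic_ratio >= min_arabic_ratio

# Example usage: the second (mostly Latin) document is dropped.
docs = ["مثال على نص عربي طويل بما يكفي ..." * 10, "short latin text"]
cleaned = [d for d in docs if keep_document(d)]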

Keywords

Training Data, Data Generation, Arabic NLP, LLM, Data Quality
BibTeX
@misc{refineai2025joud,
  title={Joud: A Massive Pre-training Dataset for Arabic Large Language Models},
  author={RefineAI Research Lab},
  year={2025}
}

Stay Updated with Our Research

Follow our latest publications and research updates. Join our community to participate in advancing the field of AI research.