naab: A ready-to-use plug-and-play corpus for Farsi

Sadra Sabouri; Elnaz Rahmati; Soroush Gooran; Hossein Sameti

doi:10.61838/jaiai.1.2.1

Authors

Sadra Sabouri *, Speech and Language Processing Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran (sadra@ee.sharif.edu)

Elnaz Rahmati, Speech and Language Processing Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Soroush Gooran, Speech and Language Processing Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Hossein Sameti, Speech and Language Processing Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

https://doi.org/10.61838/jaiai.1.2.1

Keywords:

Natural Language Processing, Low-resource Languages, Large Language Models, Textual Corpus, Open-source Dataset, Data Preprocessing, Persian Language Resources, Text Mining

Abstract

The rise of large language models (LLMs) has transformed numerous natural language processing (NLP) tasks, yet their performance in low- and mid-resource languages such as Farsi still lags behind their performance in resource-rich languages like English. To address this gap, we introduce naab, the largest publicly available, cleaned, and ready-to-use Farsi textual corpus. The corpus consists of 130 GB of data, comprising over 250 million paragraphs and 15 billion words. Named after the Farsi word ناب (meaning "pure" or "high-grade"), it is openly accessible via Hugging Face, offering researchers a valuable resource for Farsi NLP tasks. In addition to naab, we provide naab-raw, an unprocessed version of the dataset, along with a pre-processing toolkit that allows users to clean their own custom corpora. These resources help NLP researchers and practitioners, particularly those working on low-resource languages, improve the performance of LLMs in their respective domains and narrow the gap between resource-rich and resource-poor languages.
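To make the idea of cleaning a custom Farsi corpus concrete, the following is a minimal sketch of paragraph-level filtering. It is not the authors' pre-processing toolkit: the function names, the thresholds `MIN_WORDS` and `MIN_FARSI_RATIO`, and the choice of steps (character normalization, whitespace collapsing, script-ratio filtering, exact deduplication) are all assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical thresholds chosen for illustration; not taken from the paper.
MIN_WORDS = 3
MIN_FARSI_RATIO = 0.5

# Map Arabic Yeh (U+064A) and Kaf (U+0643) to their Persian forms
# (U+06CC and U+06A9), a common normalization step for Farsi text.
ARABIC_TO_PERSIAN = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})


def normalize(text: str) -> str:
    """Apply Unicode NFC, unify Yeh/Kaf variants, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(ARABIC_TO_PERSIAN)
    return re.sub(r"\s+", " ", text).strip()


def farsi_ratio(text: str) -> float:
    """Fraction of alphabetic characters that lie in the Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    farsi = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return farsi / len(letters)


def clean_paragraphs(paragraphs):
    """Normalize paragraphs, then drop short, mostly non-Farsi, or repeated ones."""
    seen = set()
    cleaned = []
    for paragraph in paragraphs:
        paragraph = normalize(paragraph)
        if len(paragraph.split()) < MIN_WORDS:
            continue  # too short to be a useful paragraph
        if farsi_ratio(paragraph) < MIN_FARSI_RATIO:
            continue  # mostly written in another script
        if paragraph in seen:
            continue  # exact duplicate after normalization
        seen.add(paragraph)
        cleaned.append(paragraph)
    return cleaned
```

Calling `clean_paragraphs` on a list of raw paragraph strings returns the surviving paragraphs in their original order. A corpus-scale pipeline would likely need streaming input and near-duplicate detection, which this sketch leaves out.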
