Datasets

Typological studies require datasets that represent the diversity of the languages of the world. However, Paralex datasets, as is often the case with corpora, tend to be biased towards Indo-European languages. Many large language families are not even represented. If you have some language data that could extend the coverage of Paralex datasets, converting it to the standard would be an invaluable contribution.

Download the datasets on Zenodo

Existing datasets

Coverage

The chart below shows the current language family coverage of Paralex.

Dataset contributors: Mari Aigro, Matthew Baerman, Sacha Beniamine, Olivier Bonami, Jules Bouton, Alessandra T. Cignarella, Maximin Coavoux, Maria Copot, Valts Ernštreits, Matías Guzmán Naranjo, Helen Sims-Williams, Borja Herce, Ana R. Luís, Francesco Mambrini, Marius Zemp, Giovanni Moretti, Marco Passarotti, Matteo Pellegrini, Fernando Perdigão, Bogdan Pricop, Tuuli Tuisk

Missing languages

The table below lists the 5 largest language families (according to Glottolog) that are not represented at all in Paralex datasets.

Language family            Number of languages in Glottolog
Austronesian               4095
Nuclear Trans New Guinea   832
Pama-Nyungan               642
Austroasiatic              526
Bookkeeping                383
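
As a rough illustration, a ranking like the one in the table above could be computed as follows. This is only a sketch, not the script actually used for this page: the function name `largest_uncovered`, the `covered` set, and the "Placeholder family" entry are hypothetical, and the counts are simply copied from the table.

```python
def largest_uncovered(family_sizes, covered, n=5):
    """Return the n largest families (by language count) that are not in `covered`."""
    uncovered = [
        (name, size) for name, size in family_sizes.items() if name not in covered
    ]
    # Sort by language count, largest first, and keep the top n.
    return sorted(uncovered, key=lambda item: item[1], reverse=True)[:n]


# Counts copied from the table above.
family_sizes = {
    "Austronesian": 4095,
    "Nuclear Trans New Guinea": 832,
    "Pama-Nyungan": 642,
    "Austroasiatic": 526,
    "Bookkeeping": 383,
    # Hypothetical entry standing in for any family that already has a dataset;
    # the count is a placeholder, not a Glottolog figure.
    "Placeholder family": 9999,
}

# Assumption: the set of families with at least one Paralex dataset is known.
covered = {"Placeholder family"}

top = largest_uncovered(family_sizes, covered)
for name, size in top:
    print(f"{name}\t{size}")
```

The covered family is filtered out before sorting, so it never appears in the ranking regardless of its size.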

The 5 largest languages with more than 50 million L1 speakers that are not covered in Paralex:

ISO 639-3   Language           L1 speakers (millions)   Family          Branch
cmn         Mandarin Chinese   941                      Sino-Tibetan    Sinitic
eng         English            380                      Indo-European   Germanic
hin         Hindi              345                      Indo-European   Indo-Aryan
ben         Bengali            237                      Indo-European   Indo-Aryan
rus         Russian            148                      Indo-European   Balto-Slavic

These statistics are automatically extracted from the Paralex Zenodo community. The source for the number of speakers is Ethnologue (2024).