Datasets
Typological studies require datasets that represent the diversity of the languages of the world. However, Paralex datasets, as is often the case with corpora, tend to be biased towards Indo-European languages. Many large language families are not even represented. If you have some language data that could extend the coverage of Paralex datasets, converting it to the standard would be an invaluable contribution.
Download the datasets on Zenodo
Existing datasets
Coverage
The chart below shows the current language family coverage of Paralex.
Dataset contributors: Mari Aigro, Matthew Baerman, Sacha Beniamine, Olivier Bonami, Jules Bouton, Alessandra T. Cignarella, Maximin Coavoux, Maria Copot, Valts Ernštreits, Matías Guzmán Naranjo, Helen Sims-Williams, Borja Herce, Ana R. Luís, Francesco Mambrini, Marius Zemp, Giovanni Moretti, Marco Passarotti, Matteo Pellegrini, Fernando Perdigão, Bogdan Pricop, Tuuli Tuisk
Missing languages
The table below lists the 5 largest language families (according to Glottolog) that are not represented at all in Paralex datasets.
Language family | Number of languages in Glottolog |
---|---|
Austronesian | 4095 |
Nuclear Trans New Guinea | 832 |
Pama-Nyungan | 642 |
Austroasiatic | 526 |
Bookkeeping | 383 |
The five languages with the most L1 speakers (each with more than 50 million) that are not covered in Paralex:
ISO 639-3 | Language | L1 Speakers (M) | Family | Branch |
---|---|---|---|---|
cmn | Mandarin Chinese | 941 | Sino-Tibetan | Sinitic |
eng | English | 380 | Indo-European | Germanic |
hin | Hindi | 345 | Indo-European | Indo-Aryan |
ben | Bengali | 237 | Indo-European | Indo-Aryan |
rus | Russian | 148 | Indo-European | Balto-Slavic |
These statistics are automatically extracted from the Paralex Zenodo community. The source for the number of speakers is Ethnologue (2024).
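As an illustration only, a ranking like the missing-families table above could be computed along the following lines. This is a minimal sketch, not the script actually used for this page: the function name `largest_missing_families`, the input shapes, and the placeholder covered family (with its invented size) are all assumptions made for the example. The five family sizes are copied from the table.

```python
from typing import Dict, Iterable, List, Tuple


def largest_missing_families(
    family_sizes: Dict[str, int],
    covered: Iterable[str],
    n: int = 5,
) -> List[Tuple[str, int]]:
    """Return the n largest families with no dataset, largest first."""
    covered_set = set(covered)
    missing = [
        (name, size)
        for name, size in family_sizes.items()
        if name not in covered_set
    ]
    # Sort by size (descending), then by name so ties have a stable order.
    missing.sort(key=lambda item: (-item[1], item[0]))
    return missing[:n]


# Sizes for the five real families are copied from the table above.
# "Covered family (placeholder)" and its size are invented for illustration.
family_sizes = {
    "Austronesian": 4095,
    "Nuclear Trans New Guinea": 832,
    "Pama-Nyungan": 642,
    "Austroasiatic": 526,
    "Bookkeeping": 383,
    "Covered family (placeholder)": 9999,
}
covered = {"Covered family (placeholder)"}

top = largest_missing_families(family_sizes, covered)
for name, size in top:
    print(f"{name} | {size} |")
```

In practice the set of covered families would have to come from the metadata of the published datasets, and the family sizes from Glottolog; both lookups are left out of this sketch.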