Datasets
Typological studies require datasets that represent the diversity of the languages of the world. However, Paralex datasets, as is often the case with corpora, tend to be biased towards Indo-European languages. Many large language families are not even represented. If you have some language data that could extend the coverage of Paralex datasets, converting it to the standard would be an invaluable contribution.
Accessing datasets
Most Paralex datasets are released on the Paralex Zenodo community, where you can browse current datasets, retrieve their metadata, and download the files.
Download the datasets on Zenodo
The `paralex` package also provides a simple command line interface to browse and download datasets:
paralex list # Returns all available datasets
paralex get <ZENODO_ID> --output <PATH> # Downloads the dataset with id ZENODO_ID to PATH
Optional arguments
`list` accepts the following arguments:
- `--iso <ISO>`: followed by ISO codes, filters the list and displays only the matching datasets.
- `-o/--output <PATH>`: saves the dataset list as a CSV table to `PATH`.

`get` accepts the following arguments:
- `-o/--output <PATH>`: downloads the dataset to `PATH`.
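As a minimal sketch of how the two commands and their options might be combined in a script, the example below filters the list by one ISO code, saves it as CSV, and then downloads a dataset. The ISO code, file names, and the `run` helper with its `DRY_RUN` switch are assumptions made for illustration; only the `paralex list`, `paralex get`, `--iso` and `--output` forms come from the text above. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Placeholders for illustration only; replace them with real values.
ISO_CODE="fra"                    # ISO code used to filter the list
LIST_CSV="paralex_datasets.csv"   # where the dataset list is saved
ZENODO_ID="<ZENODO_ID>"           # id of the dataset to download
DATASET_PATH="paralex_dataset"    # where the dataset is downloaded

# Dry-run switch: a convention of this sketch, not a paralex option.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List only the datasets matching the ISO code and save the list as CSV.
run paralex list --iso "$ISO_CODE" --output "$LIST_CSV"

# Download one dataset, using an id taken from the list.
run paralex get "$ZENODO_ID" --output "$DATASET_PATH"
```

Once the placeholders are filled in, running the script with `DRY_RUN=0` executes the commands instead of printing them.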
Existing datasets
Coverage
The chart below shows the current language family coverage of Paralex.
Dataset contributors: Mari Aigro, Matthew Baerman, Sacha Beniamine, Olivier Bonami, Jules Bouton, Alessandra T. Cignarella, Maximin Coavoux, Maria Copot, Valts Ernštreits, Matías Guzmán Naranjo, Helen Sims-Williams, Borja Herce, Ana R. Luís, Francesco Mambrini, Marius Zemp, Giovanni Moretti, Marco Passarotti, Matteo Pellegrini, Fernando Perdigão, Bogdan Pricop, Tuuli Tuisk
Missing languages
The table below lists the 5 largest language families (according to Glottolog) that are not represented at all in Paralex datasets.
Language family | Number of languages in Glottolog |
---|---|
Austronesian | 4095 |
Nuclear Trans New Guinea | 832 |
Pama-Nyungan | 642 |
Austroasiatic | 526 |
Bookkeeping | 383 |
The 5 largest languages with more than 50 million L1 speakers that are not covered in Paralex:
ISO 639-3 | Language | L1 Speakers (M) | Family | Branch |
---|---|---|---|---|
cmn | Mandarin Chinese | 941 | Sino-Tibetan | Sinitic |
eng | English | 380 | Indo-European | Germanic |
hin | Hindi | 345 | Indo-European | Indo-Aryan |
ben | Bengali | 237 | Indo-European | Indo-Aryan |
rus | Russian | 148 | Indo-European | Balto-Slavic |
These statistics are automatically extracted from the Paralex Zenodo community. The source for the number of speakers is Ethnologue (2024).