Datasets

Typological studies require datasets that represent the diversity of the languages of the world. However, Paralex datasets, as is often the case with corpora, tend to be biased towards Indo-European languages. Many large language families are not even represented. If you have some language data that could extend the coverage of Paralex datasets, converting it to the standard would be an invaluable contribution.

Accessing datasets

Most Paralex datasets are released on the Paralex Zenodo community. The Zenodo community lets you browse current datasets, retrieve their metadata, and download the files.

Download the datasets on Zenodo

The paralex package also provides a simple command line interface to browse and download datasets:

paralex list  # Lists all available datasets
paralex get <ZENODO_ID> --output <PATH>  # Downloads the dataset with id ZENODO_ID to PATH

Optional arguments

list accepts the following arguments:

--iso <ISO> (followed by ISO codes) filters the list and displays only the matching datasets.
-o/--output <PATH> saves the dataset list as a CSV table to PATH.
-u/--update forces an update of all the metadata, which makes the command take longer to complete.

get accepts the following arguments:

-o/--output <PATH> downloads the dataset to PATH.
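
When the commands are called from a script, it can help to assemble the arguments programmatically. The following is a minimal Python sketch that only uses the options documented above. The helper names (`paralex_list_args`, `paralex_get_args`, `run`) are hypothetical and not part of the paralex package, and passing several ISO codes as separate arguments after `--iso` is an assumption.

```python
import shutil
import subprocess


def paralex_list_args(iso_codes=(), output=None, update=False):
    """Build the argument list for `paralex list` from the documented options."""
    args = ["paralex", "list"]
    if iso_codes:
        # Assumption: several ISO codes are given as separate arguments after --iso.
        args += ["--iso", *iso_codes]
    if output is not None:
        args += ["--output", str(output)]
    if update:
        args.append("--update")
    return args


def paralex_get_args(zenodo_id, output):
    """Build the argument list for `paralex get <ZENODO_ID> --output <PATH>`."""
    return ["paralex", "get", str(zenodo_id), "--output", str(output)]


def run(args):
    """Run the command if the executable is on PATH.

    Returns the exit code, or None when the executable is not installed.
    """
    if shutil.which(args[0]) is None:
        return None
    return subprocess.run(args, check=False).returncode
```

For instance, `run(paralex_list_args(["fra"], output="datasets.csv"))` would save the filtered list to `datasets.csv`; the ISO code and the file name are placeholders.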

Existing datasets

Coverage

The chart below shows the current language family coverage of Paralex.

Dataset contributors: unknown, Sacha Beniamine, Jules Bouton, Dunstan Brown, Mae Carroll, Alessandra T. Cignarella, Valts Ernštreits, Matías Guzmán Naranjo, Borja Herce, Marco Passarotti, Matteo Pellegrini, Bogdan Pricop, Andrea Sims, Tuuli Tuisk, Sasha Wilmoth

Missing languages

The table below lists the 5 largest language families (according to Glottolog) that are not represented at all in Paralex datasets.

Language family	Number of languages in Glottolog
Austronesian	4095
Afro-Asiatic	1458
Nuclear Trans New Guinea	832
Austroasiatic	526
Mande	321

The 5 languages with the most L1 speakers (over 50 million each) that are not covered in Paralex:

ISO 639-3	Language	L1 Speakers (M)	Family	Branch
cmn	Mandarin Chinese	941	Sino-Tibetan	Sinitic
eng	English	380	Indo-European	Germanic
hin	Hindi	345	Indo-European	Indo-Aryan
ben	Bengali	237	Indo-European	Indo-Aryan
por	Portuguese	236	Indo-European	Romance

These statistics are automatically extracted from the Paralex Zenodo community. The source for the number of speakers is Ethnologue (2024).
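
A comparable listing of the community's records can be retrieved through Zenodo's public REST API. The following Python sketch is not the script used to produce the statistics above. It assumes that the community identifier is `paralex` and that the search response follows Zenodo's usual `hits.hits` layout. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def community_query_url(community="paralex", size=25, page=1):
    """Build a records search URL restricted to one Zenodo community."""
    # Assumption: "paralex" is the identifier of the Paralex community.
    query = urlencode({"communities": community, "size": size, "page": page})
    return f"{ZENODO_RECORDS_API}?{query}"


def summarize_records(payload):
    """Extract (record id, title) pairs from a decoded search response."""
    hits = payload.get("hits", {}).get("hits", [])
    return [(hit.get("id"), hit.get("metadata", {}).get("title")) for hit in hits]


def fetch_community_records(community="paralex", size=25):
    """Query Zenodo (requires network access) and summarize the first page."""
    with urlopen(community_query_url(community, size), timeout=30) as response:
        return summarize_records(json.load(response))
```

Calling `fetch_community_records()` returns the first page of records only; further pages can be requested by increasing the `page` parameter of `community_query_url`.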