Datasets

Typological studies require datasets that represent the diversity of the languages of the world. However, Paralex datasets, as is often the case with corpora, tend to be biased towards Indo-European languages. Many large language families are not even represented. If you have some language data that could extend the coverage of Paralex datasets, converting it to the standard would be an invaluable contribution.

Accessing datasets

Most Paralex datasets are released on the Paralex Zenodo community, where you can browse the current datasets, retrieve their metadata and download the files.

Download the datasets on Zenodo

The paralex package also provides a simple command line interface to browse and download datasets:

paralex list  # Lists all available datasets
paralex get <ZENODO_ID> --output <PATH>  # Downloads the dataset with id ZENODO_ID to PATH

Optional arguments

list accepts the following arguments:

--iso <ISO> (followed by ISO codes) filters the list to display only the matching datasets.
-o/--output <PATH> saves the dataset list as a CSV table to PATH.
-u/--update forces a refresh of all the metadata, which makes the command take longer to complete.

get accepts the following arguments:

-o/--output <PATH> downloads the dataset to PATH.
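As a rough sketch of how these options might be combined, the script below lists the datasets for one language, saves that listing, and shows where the download step would go. It only uses the options described above. The ISO code `fra`, the file name `paralex_datasets.csv` and the directory `paralex_data` are illustrative assumptions. That `--iso` and `--output` can be passed together is also an assumption, and `<ZENODO_ID>` is left as a placeholder to fill in from the listing.

```shell
#!/bin/sh
# Sketch only: the values below are illustrative assumptions, not Paralex defaults.
ISO_CODE="fra"                    # assumed example ISO code (French)
LIST_CSV="paralex_datasets.csv"   # hypothetical file name for the saved listing
DEST_DIR="paralex_data"           # hypothetical download directory

if command -v paralex >/dev/null 2>&1; then
    # Display only the datasets matching the ISO code and save the listing as CSV.
    # (Assumes --iso and --output can be combined in one call.)
    paralex list --iso "$ISO_CODE" --output "$LIST_CSV"

    # After picking an id from the listing, download that dataset, e.g.:
    # paralex get <ZENODO_ID> --output "$DEST_DIR"
else
    # Skip the calls if the command line interface is not available.
    echo "paralex not found; install the paralex package first" >&2
fi
```

The `command -v` guard lets the script exit cleanly on a machine where the paralex package is not installed.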

Existing datasets

Coverage

The chart below shows the current language family coverage of Paralex.

Dataset contributors: unknown, Sacha Beniamine, Olivier Bonami, Jules Bouton, Dunstan Brown, Mae Carroll, Alessandra T. Cignarella, Valts Ernštreits, Matías Guzmán Naranjo, Helen Sims-Williams, Borja Herce, Ana R. Luís, Marius Zemp, Marco Passarotti, Matteo Pellegrini, Fernando Perdigão, Bogdan Pricop, Andrea Sims, Tuuli Tuisk, Sasha Wilmoth

Missing languages

The table below lists the 5 largest language families (according to Glottolog) that are not represented at all in Paralex datasets.

Language family	Number of languages in Glottolog
Austronesian	4095
Nuclear Trans New Guinea	832
Austroasiatic	526
Mande	321
Dravidian	281

The 5 languages with the most L1 speakers (each with more than 50 million) that are not covered in Paralex:

ISO 639-3	Language	L1 speakers (millions)	Family	Branch
cmn	Mandarin Chinese	941	Sino-Tibetan	Sinitic
eng	English	380	Indo-European	Germanic
hin	Hindi	345	Indo-European	Indo-Aryan
ben	Bengali	237	Indo-European	Indo-Aryan
rus	Russian	148	Indo-European	Balto-Slavic

These statistics are automatically extracted from the Paralex Zenodo community. The source for the number of speakers is Ethnologue (2024).