
Datasets

Typological studies require datasets that represent the diversity of the languages of the world. However, Paralex datasets, as is often the case with corpora, tend to be biased towards Indo-European languages. Many large language families are not even represented. If you have some language data that could extend the coverage of Paralex datasets, converting it to the standard would be an invaluable contribution.

Accessing datasets

Most Paralex datasets are released on the Paralex Zenodo community. The community page lets you browse the current datasets, retrieve their metadata, and download the files.

Download the datasets on Zenodo

The paralex package also provides a simple command line interface to browse and download datasets:

paralex list  # Returns all available datasets
paralex get <ZENODO_ID> --output <PATH>  # Downloads the dataset with id ZENODO_ID to PATH
Optional arguments

list accepts the following arguments:

  • --iso <ISO> filters the list and displays only the datasets matching the given ISO codes.
  • -o/--output <PATH> saves the dataset list as a CSV table to PATH.

get accepts the following arguments:

  • -o/--output <PATH> downloads the dataset to PATH.
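
The commands and options above can also be assembled from a script. The following is a minimal sketch, not part of the paralex package: the helper names (build_list_command, build_get_command, run) are hypothetical, the Zenodo ID 12345 is a placeholder, and passing several ISO codes after a single --iso flag is an assumption based on the option description above.

```python
import shutil
import subprocess


def build_list_command(iso_codes=None, output=None):
    """Build the argument list for `paralex list` (hypothetical helper)."""
    cmd = ["paralex", "list"]
    if iso_codes:
        # Assumption: several ISO codes may follow a single --iso flag.
        cmd += ["--iso", *iso_codes]
    if output:
        # Save the dataset list as a CSV table at this path.
        cmd += ["--output", str(output)]
    return cmd


def build_get_command(zenodo_id, output):
    """Build the argument list for `paralex get` (hypothetical helper)."""
    return ["paralex", "get", str(zenodo_id), "--output", str(output)]


def run(cmd):
    """Run the command only if the paralex executable is on PATH."""
    if shutil.which(cmd[0]) is None:
        print("paralex not found on PATH; would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


# Only print the commands here; call run(...) to actually execute them.
list_cmd = build_list_command(iso_codes=["fra"], output="french_datasets.csv")
get_cmd = build_get_command(12345, "data")  # 12345 is a placeholder ID
print(" ".join(list_cmd))
print(" ".join(get_cmd))
```

Building each command as a list of arguments, rather than a single string, avoids shell quoting problems when PATH contains spaces.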

Existing datasets

Coverage

The chart below shows the current language family coverage of Paralex.

Dataset contributors: Mari Aigro, Matthew Baerman, Sacha Beniamine, Olivier Bonami, Jules Bouton, Alessandra T. Cignarella, Maximin Coavoux, Maria Copot, Valts Ernštreits, Matías Guzmán Naranjo, Helen Sims-Williams, Borja Herce, Ana R. Luís, Francesco Mambrini, Marius Zemp, Giovanni Moretti, Marco Passarotti, Matteo Pellegrini, Fernando Perdigão, Bogdan Pricop, Tuuli Tuisk

Missing languages

The table below lists the 5 largest language families (according to Glottolog) that are not represented at all in Paralex datasets.

Language family             Number of languages in Glottolog
Austronesian                4095
Nuclear Trans New Guinea    832
Pama-Nyungan                642
Austroasiatic               526
Bookkeeping                 383

The 5 largest languages with more than 50 million L1 speakers that are not covered in Paralex:

ISO 639-3   Language           L1 speakers (M)   Family          Branch
cmn         Mandarin Chinese   941               Sino-Tibetan    Sinitic
eng         English            380               Indo-European   Germanic
hin         Hindi              345               Indo-European   Indo-Aryan
ben         Bengali            237               Indo-European   Indo-Aryan
rus         Russian            148               Indo-European   Balto-Slavic

These statistics are automatically extracted from the Paralex Zenodo community. The source for the number of speakers is Ethnologue (2024).