Principles

Our standard aims to meet the FAIR and CARE principles, and adds a few of our own, the DeAR principles. Paralex was inspired by the Cross-Linguistic Data Formats (CLDF) standard, and adheres to a similar philosophy and the same design principles.

FAIR

The FAIR principles are meant to ensure that datasets are readable both by machines and by humans, across sub-fields, disciplines and time. Here is a very short summary of the FAIR principles and how this standard aims to meet them:

Findable

Data must have a persistent global identifier (F1), be described by rich metadata (F2) which include the identifier (F3), and be indexed in searchable resources (F4).

  • F1: Use Digital Object Identifiers (DOIs), generated either by your institution or by a repository. We suggest using Zenodo both for this and for archiving.
  • F2/F3: Add JSON metadata, following the Frictionless standard. The metadata refer to the DOIs.
  • F4: Archive the dataset on Zenodo, on your institutional repository and/or on any repository of your choice.
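
As an illustration of F2/F3, here is a minimal sketch of a Frictionless-style package descriptor built in Python. The property names (`name`, `title`, `id`, `resources`, `schema`, `fields`, `primaryKey`) come from the Frictionless Data Package and Table Schema specifications. The dataset name, the file name `forms.csv`, the column names and the DOI are placeholders chosen for this example, and the properties this standard actually requires may differ.

```python
import json


def build_descriptor(doi: str) -> dict:
    """Build a minimal Frictionless-style descriptor that embeds the DOI."""
    return {
        "name": "example-paradigms",           # hypothetical dataset name
        "title": "Example paradigm dataset",   # hypothetical title
        "id": f"https://doi.org/{doi}",        # F3: the metadata include the identifier
        "resources": [
            {
                "name": "forms",
                "path": "forms.csv",           # hypothetical table file
                "schema": {
                    "fields": [
                        {"name": "form_id", "type": "string"},
                        {"name": "lexeme", "type": "string"},
                        {"name": "phon_form", "type": "string"},
                    ],
                    "primaryKey": "form_id",
                },
            }
        ],
    }


# Placeholder DOI, not a real record.
descriptor = build_descriptor("10.5281/zenodo.0000000")
print(json.dumps(descriptor, indent=2))
```

Saving this output next to the data files keeps the identifier and the description of each table together with the data they describe.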

Accessible

The data and metadata can be reached using the identifier, and metadata persist even if the data is not accessible anymore. This is mostly achieved by using DOIs, and ensuring long term archiving.

Interoperable

Use a formal, accessible, shared, broadly applicable language for knowledge representation (I1), use FAIR vocabularies (I2) and refer to other (meta)data (I3).

Reusable

Data should be well described (R1) so that they can be re-used and combined in other contexts. This standard's main aim is to ensure that the data is richly and consistently described.

Because the FAIR principles make sure the data is widely shared, reused, and usable computationally, they focus on data users. However, two more groups of people are relevant when producing language datasets: the language communities and the dataset authors.

CARE

The CARE Principles for Indigenous Data Governance focus on the interests of the language communities whose languages are described by our datasets. They are meant to be compatible with the FAIR principles. These principles cannot be fulfilled simply by adhering to a formal standard; they require careful planning and engagement with language communities. In short, they state:

Collective Benefit:

"Data ecosystems shall be designed and function in ways that enable Indigenous Peoples to derive benefit from the data."

In the case of language data, native speakers should ideally be involved in the creation and authorship of resources, and the data should be made available in ways that are useful for language communities (such as the creation of pedagogical supports, dictionaries or grammar books).

Authority to control

Indigenous people must have control over how data is shared and how their culture is represented and identified. In particular, we should use endonyms and only distribute data openly with the consent of language communities.

Responsibility

Be accountable for how the data is used to benefit Indigenous people.

Ethics

"Indigenous Peoples’ rights and wellbeing should be the primary concern at all stages of the data life cycle and across the data ecosystem."

  • Ensure your data does not stigmatize Indigenous Peoples and cultures; explicitly assess harms and benefits.
  • Describe the limitations, provenance, and purposes of the data.
  • Ensure long-term preservation.

The principles invite us to question how language communities can benefit from our work, and to consider that even as authors of datasets, it is not our data.

DeAR

Beyond users and speakers, language data also needs to be planned in ways that are good for the dataset authors. Thus, we introduce our DeAR principles:

Decentralized

Data is decentralized, with no single team or institution operating a central database. The standard serves as a format to share data and as a means for researchers to create interoperable, high-quality data. We wish to make the standard as easy to use as possible, and to provide useful tools to its users.

Automated verification

Data is tested automatically against the descriptions in the metadata in order to guarantee data quality. Moreover, data quality can be checked by writing custom tests (as is done in software development), which are run after each change of the data.
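
To make this concrete, here is a minimal sketch of checking a table against the constraints declared in its metadata. It uses only the Python standard library rather than a dedicated validator. The schema, column names and sample rows are hypothetical; the `required` and `unique` constraint names follow the Frictionless Table Schema.

```python
import csv
import io

# Hypothetical schema, in the style of a Frictionless Table Schema.
SCHEMA = {
    "fields": [
        {"name": "form_id", "constraints": {"required": True, "unique": True}},
        {"name": "lexeme", "constraints": {"required": True}},
        {"name": "phon_form", "constraints": {"required": True}},
    ]
}


def validate(csv_text, schema):
    """Return a list of error messages; an empty list means the table passes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = [field["name"] for field in schema["fields"]]
    header = reader.fieldnames or []
    errors = [f"missing column: {name}" for name in expected if name not in header]
    if errors:
        return errors
    seen = {name: set() for name in expected}
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for field in schema["fields"]:
            name = field["name"]
            value = row[name]
            rules = field.get("constraints", {})
            if rules.get("required") and not value:
                errors.append(f"line {line_no}: {name} is empty")
            if rules.get("unique") and value:
                if value in seen[name]:
                    errors.append(f"line {line_no}: duplicate {name} {value!r}")
                seen[name].add(value)
    return errors


# Invented sample data: GOOD passes, BAD repeats an identifier and omits a form.
GOOD = "form_id,lexeme,phon_form\nf1,eat,iːt\nf2,eat,iːts\n"
BAD = "form_id,lexeme,phon_form\nf1,eat,iːt\nf1,eat,\n"

print(validate(GOOD, SCHEMA))
print(validate(BAD, SCHEMA))
```

A custom test would then be an ordinary assertion such as `assert validate(text, SCHEMA) == []`, run automatically after each change.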

Revisable pipelines

Dataset authors must be able to continuously update data presentations, in particular websites, reflecting the evolving nature of data. This is achieved by generating those publications automatically and directly from the standardized dataset. We will create automated tools that can generate user-friendly views of the data (for example static websites, publication-ready PDFs, etc.). These can be run again at any point, so that it is easy to re-generate the views from the data edited by the researchers.
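
The idea of a re-runnable view can be sketched as a pure function from the data to a page. This is an illustration only, not one of the tools mentioned above: the function name `render_page`, the column names and the sample rows are all hypothetical.

```python
import csv
import html
import io


def render_page(csv_text: str, title: str) -> str:
    """Render a CSV table as a standalone HTML page."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Escape every cell so that data containing "<" or "&" cannot break the page.
    head_cells = "".join(f"<th>{html.escape(cell)}</th>" for cell in header)
    body_rows = "\n".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in body
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body><h1>{html.escape(title)}</h1>\n"
        f"<table>\n<tr>{head_cells}</tr>\n{body_rows}\n</table>\n"
        "</body></html>\n"
    )


# Invented sample data standing in for a table of the dataset.
FORMS_CSV = "form_id,lexeme,phon_form\nf1,eat,iːt\nf2,eat,iːts\n"
page = render_page(FORMS_CSV, "Example forms")
print(page)
```

Because the page depends only on the data, running the function again after an edit regenerates an up-to-date view with no manual steps.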

Both principles A and R fit particularly well with the use of versioning systems such as git, where validation, testing and publishing can be done through continuous development pipelines.