Our standard aims to meet the FAIR and CARE principles, and adds a few principles of our own, the DeAR principles. Paralex was inspired by the Cross-Linguistic Data Formats (CLDF) standard, and adheres to a similar philosophy and the same design principles.
The FAIR principles are meant to ensure that datasets are readable both by machines and by humans, across sub-fields, disciplines and time. Here is a very short summary of the FAIR principles and how this standard aims to meet them:
Data must have a persistent global identifier (F1), be described by rich metadata (F2) which include the identifier (F3), and be indexed in searchable resources (F4).
- F1: Use Digital Object Identifiers (DOIs), generated either by your institution or by a repository. We suggest using Zenodo both for this and for archiving.
- F2/F3: Add JSON metadata following the frictionless standard. The metadata refers to the DOI.
- F4: Archive the dataset on Zenodo, on your institutional repository, and/or on any repository of your choice.
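As an illustration, frictionless-style package metadata can carry the persistent identifier alongside a description of each table. This is a minimal sketch only: the dataset name, DOI, file names and field list below are hypothetical placeholders, not prescribed by the standard.

```python
import json

# Hypothetical package metadata in the frictionless style.
metadata = {
    "name": "example-paralex-lexicon",          # placeholder dataset name
    "title": "An example Paralex lexicon",
    "id": "https://doi.org/10.5281/zenodo.0000000",  # F1/F3: persistent identifier (placeholder DOI)
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [
        {
            "name": "forms",
            "path": "forms.csv",                 # hypothetical table file
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "form_id", "type": "string"},
                    {"name": "lexeme", "type": "string"},
                    {"name": "phon_form", "type": "string"},
                ]
            },
        }
    ],
}

# Serialise the metadata next to the data tables (F2).
with open("example.package.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```

Keeping the identifier inside the metadata file means that even a copy of the dataset found outside its archive still points back to its canonical record.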
Both data and metadata can be reached using the identifier, and the metadata persist even if the data is no longer accessible. This is achieved mostly by using DOIs and by ensuring long-term archiving.
Use a formal, accessible, shared, broadly applicable language for knowledge representation (I1), use FAIR vocabularies (I2) and refer to other (meta)data (I3).
- I1: Metadata is written in JSON and tables in CSV, following the frictionless standard.
- I2: The standard documents our conventions and columns, providing a FAIR vocabulary.
- I2/I3: The standard comes with built-in links to other resources, and encourages references to other resources and vocabularies such as the GOLD ontology, the UniMorph schema, the Universal Dependencies tagset, CLTS's BIPA, Glottolog's glottocodes, ISO language codes, etc.
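Concretely, interoperability (I1) means that any consumer can pair the JSON metadata with the CSV tables using only standard tooling. A minimal stdlib sketch, with invented table content and field names:

```python
import csv
import io
import json

# Hypothetical fragment of package metadata describing one CSV table.
metadata = json.loads("""
{"resources": [{"name": "forms",
                "schema": {"fields": [{"name": "form_id"},
                                      {"name": "lexeme"},
                                      {"name": "phon_form"}]}}]}
""")

# A tiny invented table, as it might appear in forms.csv.
table = "form_id,lexeme,phon_form\ncat_sg,cat,kat\n"

# Check that the table header matches the declared schema.
declared = [f["name"] for f in metadata["resources"][0]["schema"]["fields"]]
header = next(csv.reader(io.StringIO(table)))
assert header == declared
```

Because both layers are plain, formally specified formats, this check works identically in any language with JSON and CSV support.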
Data should be well described (R1) so that they can be re-used and combined in other contexts. This standard's main aim is to ensure that the data is richly and consistently described.
Because the FAIR principles ensure that data is widely shared, reused, and usable computationally, they focus on data users. However, two more groups of people are relevant when producing language datasets.
The CARE Principles for Indigenous Data Governance focus on the interests of the language communities whose languages are described by our datasets. They are meant to be compatible with the FAIR principles. These are not principles that can be fulfilled simply by adhering to a formal standard; rather, they require careful planning and engagement with language communities. In short, they state:
"Data ecosystems shall be designed and function in ways that enable Indigenous Peoples to derive benefit from the data."
In the case of language data, native speakers should ideally be involved in the creation and authorship of resources, and the data should be made available in ways that are useful for language communities (such as the creation of pedagogical materials, dictionaries or grammar books).
Authority to control
Indigenous Peoples must have control over how their data is shared and how their culture is represented and identified. In particular, we should use endonyms, and only distribute data openly with the consent of language communities.
Be accountable for how the data is used, in favor of Indigenous Peoples.
"Indigenous Peoples’ rights and wellbeing should be the primary concern at all stages of the data life cycle and across the data ecosystem."
- Ensure your data does not stigmatize Indigenous Peoples and cultures; explicitly assess harms and benefits.
- Describe the limitations, provenance, and purposes of the data.
- Ensure long-term preservation.
The principles invite us to question how language communities can benefit from our work, and to consider that even as authors of datasets, it is not our data.
Beyond users and speakers, language data also needs to be planned in ways that are good for the dataset authors. Thus, we introduce our DeAR principles:
Data is decentralised, with no single team or institution operating a central database. The standard serves as a format for sharing data and as a means for researchers to create high-quality, interoperable data. We wish to make the standard as easy to use as possible, and to provide useful tools to its users.
Data is tested automatically against the descriptions in the metadata in order to guarantee data quality. Moreover, data quality can be checked by writing custom tests (as is done in software development), which are run after each change to the data.
Dataset authors must be able to continuously update data presentations, in particular websites, reflecting the evolving nature of the data. This is achieved by generating those publications automatically, directly from the standardized dataset. We will create automated tools which generate user-friendly views of the data (for example static websites or publication-ready PDFs). These can be run again at any point, so that it is easy to re-generate the views from the data as edited by the researchers.
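The idea of regenerating a presentation directly from the data can be sketched as follows; here a human-readable markdown table is rebuilt from a CSV table (column names and content are hypothetical), so the view can be recreated after every edit:

```python
import csv
import io

# A tiny invented table, standing in for an edited data file.
table = "lexeme,cell,orth_form\ncat,sg,cat\ncat,pl,cats\n"
rows = list(csv.DictReader(io.StringIO(table)))

# Regenerate a markdown view of the table from scratch.
lines = ["| lexeme | cell | orth_form |", "|---|---|---|"]
lines += [f"| {r['lexeme']} | {r['cell']} | {r['orth_form']} |" for r in rows]
page = "\n".join(lines)
print(page)
```

Because the view is never edited by hand, it can be discarded and rebuilt at any time without losing work, which is what keeps the presentation in sync with the data.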
Both principles A and R fit particularly well with the use of versioning systems such as git, where validation, testing and publishing can be done through continuous integration pipelines.