Paradigm structures are analyses, and dataset authors are free to formulate these analyses as they see fit. The main questions are:
- What is the inventory of paradigm cells?
- How should each cell be characterised?
- What counts as a lexeme?
What is the inventory of paradigm cells?
Data creators can provide labels of their choice, but should use a cells and features table to document the meaning of these labels, and map from these labels to existing standards and conventions.
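As a rough illustration of this kind of documentation, the sketch below pairs a small features table with a cells table and checks that every feature value used in a cell identifier is documented. All identifiers, column names (`label`, `external_tag`) and tag strings are hypothetical and have not been checked against any standard's official inventory.

```python
# Features table: one entry per feature value used in cell identifiers.
# Identifiers and column names are hypothetical.
FEATURES = {
    "prs": {"feature": "tense", "label": "present"},
    "pst": {"feature": "tense", "label": "preterite"},
    "3": {"feature": "person", "label": "third person"},
    "sg": {"feature": "number", "label": "singular"},
}

# Cells table: one entry per paradigm cell, with a free-text label and a
# mapping to an external scheme (tag strings are illustrative only).
CELLS = {
    "prs.3.sg": {"label": "present 3 singular", "external_tag": "V;PRS;3;SG"},
    "pst": {"label": "preterite", "external_tag": "V;PST"},
}


def cell_features(cell_id):
    """Split a dot-separated cell identifier into its feature values."""
    return cell_id.split(".")


def undocumented_values(cells, features):
    """Return feature values used in cell identifiers but missing from the features table."""
    used = {value for cell_id in cells for value in cell_features(cell_id)}
    return sorted(used - set(features))


def describe(cell_id, features):
    """Spell out a cell identifier using the labels in the features table."""
    return ", ".join(features[value]["label"] for value in cell_features(cell_id))
```

A check such as `undocumented_values(CELLS, FEATURES)` can be run before publishing a dataset, so that no label is left without an explanation.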
How should each cell be characterised?
For long-term usability, it is important to account for paradigm structure choices in the documentation. A particularly tricky case is that of overdifferentiation. For example, in English, one might want to expand the person/number combinations of verbs to match the pronouns, and define verb paradigms as follows:
Person and number | Present | Preterite |
---|---|---|
first person singular | I eat | I ate |
second person singular | you eat | you ate |
third person singular | he/she/it eats | he/she/it ate |
first person plural | we eat | we ate |
second person plural | you eat | you ate |
third person plural | they eat | they ate |
Imperative | Present participle | Past participle | Infinitive |
---|---|---|---|
eat | eating | eaten | to eat |
However, for most verbs, it would be sufficient to stipulate:
cell | form |
---|---|
present 3 singular | eats |
present others | eat |
preterite | ate |
past participle | eaten |
present participle | eating |
Unfortunately, this choice then requires extra cells solely for the verb "to be":
cell | form |
---|---|
present 1 singular | am |
present 3 singular | is |
present others | are |
preterite 1/3 singular | was |
preterite others | were |
past participle | been |
present participle | being |
We suggest preferring structures that allow for uniform paradigm shapes, and documenting these choices clearly. It is easier for users to reduce such annotations to a more minimal paradigm structure than to do the opposite. For proposals about "morphomic" paradigm structures, see Boyé & Schalchli (2016).
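The reduction from a uniform structure to a minimal one can be sketched as follows. This is only an illustration under simple assumptions: the cell identifiers (`prs.1sg`, `pst.3pl`, and so on) are hypothetical, only the finite cells of the English tables above are included, and two cells are merged whenever they carry identical forms in every lexeme of the dataset.

```python
from collections import defaultdict

# Uniform finite paradigms, using the English forms from the tables above.
# Cell identifiers are hypothetical.
EAT = {
    "prs.1sg": "eat", "prs.2sg": "eat", "prs.3sg": "eats",
    "prs.1pl": "eat", "prs.2pl": "eat", "prs.3pl": "eat",
    "pst.1sg": "ate", "pst.2sg": "ate", "pst.3sg": "ate",
    "pst.1pl": "ate", "pst.2pl": "ate", "pst.3pl": "ate",
}
BE = {
    "prs.1sg": "am", "prs.2sg": "are", "prs.3sg": "is",
    "prs.1pl": "are", "prs.2pl": "are", "prs.3pl": "are",
    "pst.1sg": "was", "pst.2sg": "were", "pst.3sg": "was",
    "pst.1pl": "were", "pst.2pl": "were", "pst.3pl": "were",
}


def merge_syncretic_cells(paradigms):
    """Group cells that carry identical forms in every lexeme."""
    cells = list(next(iter(paradigms.values())))
    groups = defaultdict(list)
    for cell in cells:
        # The signature of a cell is the tuple of its forms across all lexemes.
        signature = tuple(paradigm[cell] for paradigm in paradigms.values())
        groups[signature].append(cell)
    return list(groups.values())


minimal_without_be = merge_syncretic_cells({"eat": EAT})
minimal_with_be = merge_syncretic_cells({"eat": EAT, "be": BE})
```

With "eat" alone, the twelve finite cells collapse into three groups; once "be" is included, five groups remain, matching the finite cells of the table for "to be" above. Going in the other direction, from merged cells back to the uniform grid, would require knowing which person/number combinations each merged label stands for, which is why the uniform structure is the safer starting point.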
What counts as a lexeme?
The creators of a dataset are free to produce the analysis which they believe best fits their data.
In some cases, a lexeme is entirely overabundant because it can take any of several inflection classes or stems. In other words, the same lexeme could be split into several flexemes (see Fradin & Kerleroux 2003, Thornton 2018).
In this case, there are two main solutions:
- Either split these lexemes so that each lexeme identifier corresponds to a single flexeme,
- Or account for the two levels by maintaining a single lexeme and adding a flexeme table and flexeme identifiers.
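The two solutions can be compared with a small sketch. Treating English "dream" (dreamed / dreamt) as one lexeme with two flexemes is an assumption made purely for illustration, not an analysis taken from the cited works, and all identifiers and column names are hypothetical.

```python
# Second solution: a single lexeme, plus a flexeme table and flexeme identifiers.
# Identifiers and column names are hypothetical.
FLEXEMES = [
    {"flexeme_id": "dream_ed", "lexeme": "dream"},
    {"flexeme_id": "dream_t", "lexeme": "dream"},
]

FORMS = [
    {"lexeme": "dream", "flexeme": "dream_ed", "cell": "pst", "form": "dreamed"},
    {"lexeme": "dream", "flexeme": "dream_t", "cell": "pst", "form": "dreamt"},
    {"lexeme": "dream", "flexeme": "dream_ed", "cell": "pst.ptcp", "form": "dreamed"},
    {"lexeme": "dream", "flexeme": "dream_t", "cell": "pst.ptcp", "form": "dreamt"},
]


def split_by_flexeme(forms):
    """Derive the first solution: promote each flexeme to its own lexeme identifier."""
    return [
        {"lexeme": row["flexeme"], "cell": row["cell"], "form": row["form"]}
        for row in forms
    ]


SPLIT_FORMS = split_by_flexeme(FORMS)
```

In this sketch the second solution keeps strictly more information: the split representation can be derived from it mechanically, whereas recovering the shared lexeme from split identifiers would need an extra mapping.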
References
- Boyé, Gilles & Gauvain Schalchli. 2016. The status of paradigms. In Andrew Hippisley & Gregory Stump (eds.), The Cambridge handbook of morphology, 206–234. Cambridge: Cambridge University Press. DOI: 10.1017/9781139814720.009
- Fradin, Bernard & Françoise Kerleroux. 2003. Troubles with lexemes. In Geert Booij, Janet DeCesaris, Angela Ralli & Sergio Scalise (eds.), Selected papers from the third Mediterranean Morphology Meeting, 177–196. Barcelona: IULA – Universitat Pompeu Fabra.
- Thornton, Anna M. 2018. Troubles with flexemes. In Olivier Bonami, Gilles Boyé, Georgette Dal, Hélène Giraudo & Fiammetta Namer (eds.), The lexeme in descriptive and theoretical morphology, 303–321. Berlin: Language Science Press. DOI: 10.5281/zenodo.1407011