Consider CoNLL-U format for Amalgamations

Hello! I've written a couple of parsers for your Amalgamated format and recently found a competing format that I think may better suit this project. It's an established standard that already has parsers and large annotated corpora in dozens of languages, tagged over the past decade. Its structure is extremely similar to the one this repo uses.

Your custom TSV contains:

- Eng (Heb) Ref & Type
- Greek/Hebrew (split by morpheme)
- Transliteration
- Translation (English or English+Spanish)
- dStrongs
- Grammar (ETCBC)
- Meaning Variants
- Spelling Variants
- Root
- dStrong+Instance
- Alternative Strongs+Instance
- Conjoin word
- Expanded Strong tags

[CoNLL-U](https://universaldependencies.org/format.html#conll-u-format)'s TSV contains:
- ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
- FORM: Word form or punctuation symbol.
- LEMMA: Lemma or stem of word form.
- UPOS: [Universal part-of-speech tag](https://universaldependencies.org/u/pos/index.html).
- XPOS: Optional language-specific (or treebank-specific) part-of-speech / morphological tag; underscore if not available.
- FEATS: List of morphological features from the [universal feature inventory](https://universaldependencies.org/u/feat/index.html) or from a defined [language-specific extension](https://universaldependencies.org/ext-feat-index.html); underscore if not available.
- HEAD: Head of the current word, which is either a value of ID or zero (0).
- DEPREL: [Universal dependency relation](https://universaldependencies.org/u/dep/index.html) to the HEAD ([root](https://universaldependencies.org/u/dep/root.html) iff HEAD = 0) or a defined language-specific subtype of one.
- DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
- MISC: Any other annotation.
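
To make the ten-column layout concrete, here is a minimal Python sketch of a reader. It assumes plain tab-separated token lines, `# key = value` comment lines, and blank lines between sentences, as the spec describes. The function names and the two-word sample are made up for illustration, and it does not validate ID ranges or empty nodes.

```python
# Column names in the order the CoNLL-U spec lists them.
CONLLU_FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def parse_token_line(line):
    """Split one tab-separated token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


def parse_sentences(text):
    """Yield (comments, tokens) for each blank-line-separated sentence."""
    comments, tokens = {}, []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines such as "# sent_id = ..." carry sentence metadata.
            key, sep, value = line[1:].partition("=")
            if sep:
                comments[key.strip()] = value.strip()
        else:
            tokens.append(parse_token_line(line))
    if tokens:
        yield comments, tokens


# Toy sample, invented for this sketch (not taken from any treebank).
SAMPLE = (
    "# sent_id = demo-1\n"
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
)

for comments, tokens in parse_sentences(SAMPLE):
    print(comments["sent_id"], [t["FORM"] for t in tokens])
```

In practice an existing CoNLL-U library would likely be a better choice than hand-rolled parsing; the sketch is only meant to show how little structure the format needs.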

[Here's a CoNLL-U example for Genesis](https://github.com/mr-martian/hbo-UD/blob/master/data/checked/genesis.conllu).

The conversion work would entail:
1. Move from verses as the primary index to sentences.
1. Slightly change your comment headers to match the standard `sent_id`, `text`, `text_en`, and `text_es` attributes.
1. Replace the `/` morpheme delimiter with newlines.
1. Move meaning variants to new sentences.
1. Move the rest of the data (dStrongs, source, spelling variants, chapter/verse markers, translation) to the MISC column.
1. Map ETCBC's and James Tauber's grammar to Universal Dependencies (UD) grammar, which someone is already doing for [ETCBC](https://github.com/mr-martian/hbo-UD) and the [NT](https://github.com/mr-martian/grc-UD). You could just omit "UPOS" and "HEAD"/"DEPREL" to start. And you need not replace your existing grammars; they can just go into the "XPOS" column.
1. Find something to do with the "conjoin word" field. I don't know Greek, so I'm not quite sure what it means.
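
As a rough sketch of steps 3 and 5, the Python below splits one `/`-delimited word into one CoNLL-U line per morpheme. It rests on several assumptions: that the grammar and dStrongs fields use the same `/` delimiter with the same number of parts as the word, that the existing grammar tag goes into XPOS unchanged, and that `Ref` and `Strongs` are acceptable MISC keys. The function name and the sample values are placeholders, not real rows from this repo.

```python
def morphemes_to_conllu(start_id, form, grammar, strongs, ref):
    """Turn one '/'-delimited word into one CoNLL-U line per morpheme.

    Assumes (hypothetically) that form, grammar, and strongs all split
    on '/' into the same number of parts.
    """
    forms = form.split("/")
    tags = grammar.split("/")
    codes = strongs.split("/")
    if not (len(forms) == len(tags) == len(codes)):
        raise ValueError("morpheme counts differ across fields")

    lines = []
    for offset, (morpheme, tag, code) in enumerate(zip(forms, tags, codes)):
        # MISC key names here are assumptions, not an agreed convention.
        misc = f"Ref={ref}|Strongs={code}"
        columns = [
            str(start_id + offset),  # ID
            morpheme,                # FORM
            "_",                     # LEMMA (left empty in this sketch)
            "_",                     # UPOS (omitted to start, as suggested)
            tag,                     # XPOS keeps the existing grammar tag
            "_",                     # FEATS
            "_",                     # HEAD
            "_",                     # DEPREL
            "_",                     # DEPS
            misc,                    # MISC
        ]
        lines.append("\t".join(columns))
    return lines


# Placeholder values for illustration only.
for line in morphemes_to_conllu(1, "be/reshit", "R/Ncfsa", "H9003/H7225", "Gen.1.1"):
    print(line)
```

Sentence segmentation (step 1) and meaning variants (step 4) are deliberately left out, since they depend on decisions this sketch cannot make.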

I personally am a big fan of UD and may write a converter to do exactly this. I'll share here when I do.
