Consider CoNLL-U format for Amalgamations #60

@thesmartwon

Hello! I've written a couple of parsers for your Amalgamated format and recently found a competing format, CoNLL-U, that I think may better suit this project. It's an established standard that already has parsers and large annotated corpora in dozens of languages, tagged over the past decade. Its structure is extremely similar to the one this repo uses.

Your custom TSV contains:

  • Eng (Heb) Ref & Type
  • Greek/Hebrew (split by morpheme)
  • Transliteration
  • Translation (English or English+Spanish)
  • dStrongs
  • Grammar (ETCBC)
  • Meaning Variants
  • Spelling Variants
  • Root
  • dStrong+Instance
  • Alternative Strongs+Instance
  • Conjoin word
  • Expanded Strong tags

CoNLL-U's TSV contains:

  • ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
  • FORM: Word form or punctuation symbol.
  • LEMMA: Lemma or stem of word form.
  • UPOS: Universal part-of-speech tag.
  • XPOS: Optional language-specific (or treebank-specific) part-of-speech / morphological tag; underscore if not available.
  • FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
  • HEAD: Head of the current word, which is either a value of ID or zero (0).
  • DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  • DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
  • MISC: Any other annotation.
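
To show how little code the ten-column layout needs, here is a minimal reader sketch in Python. It assumes tab-separated token lines, `# key = value` comment lines, and a blank line between sentences; the names `Token` and `parse_conllu` are my own for illustration and do not come from any existing library.

```python
from typing import Dict, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    """One token line; every column is kept as a raw string."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_conllu(text: str) -> Iterator[Tuple[Dict[str, str], List[Token]]]:
    """Yield (comments, tokens) for each blank-line-separated sentence."""
    comments: Dict[str, str] = {}
    tokens: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines such as "# sent_id = ..." become dict entries.
            key, sep, value = line[1:].partition("=")
            if sep:
                comments[key.strip()] = value.strip()
        else:
            cols = line.split("\t")
            if len(cols) != 10:
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            tokens.append(Token(*cols))
    # Handle a final sentence that is not followed by a blank line.
    if tokens:
        yield comments, tokens
```

The ID column is left as a string on purpose, since it can be an integer, a range for multiword tokens, or a decimal for empty nodes.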

Here's a CoNLL-U example for Genesis.

The conversion work would entail:

  1. Move from verses as the primary index to sentences.
  2. Slightly change your comment headers to match the standard sent_id, text, text_en, and text_es attributes.
  3. Replace the / morpheme delimiter with newlines.
  4. Move meaning variants to new sentences.
  5. Move the rest of the data (dStrongs, source, spelling variants, chapter/verse markers, translation) to the MISC column.
  6. Map ETCBC's and James Tauber's grammars to Universal Dependencies (UD) grammar, which someone is already doing for ETCBC and the NT. You could just omit "UPOS" and "HEAD"/"DEPREL" to start. You also need not replace your existing grammars; they can go into the "XPOS" column.
  7. Find something to do with the "conjoin word" field. I don't know Greek, so I'm not quite sure what it means.
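
As a rough sketch of steps 3, 5, and 6, the function below splits one `/`-delimited word into CoNLL-U rows. It rests on several assumptions: that the grammar column is split by `/` in step with the morphemes, that the existing tags go into XPOS unchanged, and that the leftover data is attached to the first morpheme's MISC column (an arbitrary choice). The name `word_to_rows` and its parameters are hypothetical, not part of this repo or of any CoNLL-U tooling.

```python
from typing import Dict, List


def word_to_rows(first_id: int, form: str, xpos: str, misc: Dict[str, str]) -> List[str]:
    """Turn one '/'-delimited word into tab-separated CoNLL-U rows.

    Unmapped columns (LEMMA, UPOS, FEATS, HEAD, DEPREL, DEPS) are left as '_'.
    """
    morphemes = form.split("/")
    tags = xpos.split("/")
    if len(morphemes) != len(tags):
        raise ValueError("morpheme and grammar-tag counts differ")

    rows: List[List[str]] = []
    last_id = first_id + len(morphemes) - 1
    if len(morphemes) > 1:
        # Multiword token line: an ID range plus the joined surface form.
        rows.append([f"{first_id}-{last_id}", "".join(morphemes)] + ["_"] * 8)

    for offset, (morpheme, tag) in enumerate(zip(morphemes, tags)):
        # Assumption: leftover annotations ride on the first morpheme only.
        if offset == 0 and misc:
            misc_col = "|".join(f"{key}={value}" for key, value in misc.items())
        else:
            misc_col = "_"
        rows.append(
            [str(first_id + offset), morpheme, "_", "_", tag, "_", "_", "_", "_", misc_col]
        )
    return ["\t".join(row) for row in rows]
```

For example, `word_to_rows(1, "ab/cd", "T1/T2", {"Ref": "Gen.1.1"})` (placeholder values, not real data) produces a `1-2` range line followed by one row per morpheme. Meaning variants and the "conjoin word" field are not handled here.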

I personally am a big fan of UD and may write a converter to do exactly this. I'll share here when I do.
