CAPR: Computer Assisted Proto-language Reconstruction

CAPR is a Dockerized stack (Flask API + Svelte UI + Caddy) for managing wordlists, cognate boards, and finite-state transducers (FSTs). The project currently focuses on the Burmish and Germanic pipelines; the Germanic dataset now tracks four doculects (English, Old English, Dutch, German).

Quick start (development)

  1. From the repo root:
    docker compose up -d
    • Backend ⇨ http://127.0.0.1:5001
    • Frontend ⇨ http://127.0.0.1:8080
  2. In another terminal, proxy the stack through Caddy:
    caddy run --config Caddyfile.dev
  3. Open http://localhost:5002, choose burmish-aligned-final.tsv or germanic-aligned-final.tsv, and load the matching FST from server/fsts/.
  4. Need the longer checklist (regressions, tear-down, hand-offs)? See docs/runbook.md. (A quick reachability check is sketched below.)
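Before opening the UI, a quick reachability check only needs curl; the hosts and ports are the ones listed above, and the exact status code each root path returns depends on how the routes are defined:

    # Print the HTTP status each service answers with (any response means the container is up)
    curl -s -o /dev/null -w "backend:  %{http_code}\n" http://127.0.0.1:5001/
    curl -s -o /dev/null -w "frontend: %{http_code}\n" http://127.0.0.1:8080/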

Documentation map

  • docs/README.md – master index for all project docs.
  • SETUP.md – full installation guide (Docker + manual paths).
  • USAGE.md – UI walkthrough, including the FST editor workflow.
  • docs/runbook.md + docs/regression_checks.md – operational checklist and API smoke-test plan (server/tools/api_regression.py).
  • DEV_NOTES.md – dated hand-offs; add a new section per session.
  • docs/germanic_transducer_report.md – Germanic FST coverage/status summary (with supporting files under docs/germanic_*).

Old English data scaffolding

  • server/tools/add_old_english_rows.py duplicates every English row into an Old English placeholder, so the TSV always has one Old English row per English row.
  • server/tools/fetch_old_english_from_wiktionary.py queries the Wiktionary API to pull Old English lemmas from each English entry and writes server/data/old_english_wiktionary.tsv. Run it whenever you want a fresh scrape of the etymology data (results are cached under server/tmp/).
  • server/data/old_english_swadesh.tsv stores the Wiktionary Swadesh export used to seed real Old English forms.
  • server/tools/update_old_english_forms.py applies the Swadesh mappings to the gold-standard TSVs (updating IPA, TOKENS, COUNTERPART, NOTE). Run it whenever the stage3 export is regenerated.
  • server/tools/validate_old_english_pairs.py confirms both TSVs still have a matching Old English row for every English entry (and reports how many placeholders remain). A typical end-to-end run is sketched after this list.
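Taken together, these tools form a small pipeline. The sketch below shows a typical end-to-end run; it assumes each script runs with no required arguments, so check each script's source (or --help, if provided) before relying on it:

    # 1. Ensure every English row has an Old English placeholder
    python server/tools/add_old_english_rows.py
    # 2. Scrape Old English lemmas from Wiktionary (results cached under server/tmp/)
    python server/tools/fetch_old_english_from_wiktionary.py
    # 3. Apply the Swadesh mappings (IPA, TOKENS, COUNTERPART, NOTE)
    python server/tools/update_old_english_forms.py
    # 4. Confirm 1:1 English/Old English coverage and count remaining placeholders
    python server/tools/validate_old_english_pairs.py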

Project structure

.
├── cognate-app/        # Svelte interface (boards + FST editor)
├── docs/               # Project documentation & planning bundles
├── server/             # Flask API, FSTs, data, regression harness
├── docker-compose.yml  # Development stack (backend + frontend)
├── Caddyfile(.dev)     # Reverse proxy definitions
└── SETUP.md / USAGE.md # Detailed setup & usage notes

Current focus: Old English FST development

The project is actively developing the Proto-Germanic → Old English transducer pipeline.

Recent achievements

  • 31.9% match rate (120/376 OE lexemes) with systematically bucketed mismatches
  • Empirical discovery: Heavy-syllable nasal apocope rule (PGmc *-ą deletion after heavy stems; a rule of this shape is sketched after this list)
  • A-restoration fix: Corrected a foma syntax bug that caused unconditional fronting
  • Refined diagnostics: Split 256 mismatches into 20+ specific phenomenon buckets
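For orientation, a heavy-syllable nasal apocope rule of the kind described above could be written in foma roughly as follows. This is an illustrative sketch only: the segment classes and rule name are placeholders, not the actual contents of server/fsts/germanic.txt:

    # Placeholder segment classes; the real FST defines its own inventory.
    define LongV  [ā|ē|ī|ō|ū];
    define ShortV [a|e|i|o|u];
    define C      [p|t|k|b|d|g|f|þ|s|h|m|n|l|r|w|j];
    # A heavy stem ends in a long vowel + consonant, or a short vowel + two consonants.
    define HeavyCoda [LongV C] | [ShortV C C];
    # Delete word-final nasalized *-ą after a heavy stem.
    define NasalApocope ą -> 0 || HeavyCoda _ .#. ;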

Status (as of 2026-02-07)

  • Latest reports in server/docs/debug_snapshots/:
    • oe_mismatch_report_2026-02-07_refined_v3.txt (bucketed mismatches)
    • oe_full_trace_report_2026-02-07_refined_buckets.txt (stage-by-stage traces)
  • Top mismatch buckets: final_vowel_missing (38), vowel_quality_other (27), breaking_extra_other (22)
  • Diagnostic tools: server/tools/oe_mismatch_report.py, server/tools/oe_full_trace_report.py

Development workflow

  1. Run mismatch/trace reports to identify issues
  2. Investigate phonological phenomena in reference sources (Hogg, Ringe/Taylor)
  3. Implement/fix FST rules in server/fsts/germanic.txt
  4. Regenerate reports to verify improvements
  5. Document findings in DEV_NOTES.md (one pass through this loop is sketched below)
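As a hedged sketch, one pass through the loop looks like this; it assumes the report scripts run without required flags and that server/fsts/germanic.txt is a self-contained foma script, so adjust to the local setup:

    python server/tools/oe_mismatch_report.py      # 1. bucketed mismatch report
    python server/tools/oe_full_trace_report.py    #    stage-by-stage traces
    "$EDITOR" server/fsts/germanic.txt             # 3. implement/fix FST rules
    foma -f server/fsts/germanic.txt               #    recompile to catch syntax errors
    python server/tools/oe_mismatch_report.py      # 4. regenerate to verify improvements
    "$EDITOR" DEV_NOTES.md                         # 5. document findings for the session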

Operations

  • Keep Docker + Caddy steps documented in docs/runbook.md
  • Record each session in DEV_NOTES.md with regression results

Citations

  • Gong, X. & N. Hill (2020). Materials for an Etymological Dictionary of Burmish. Zenodo. https://doi.org/10.5281/zenodo.4311182
  • List, J.-M. & R. Forkel (2022). LingRex. Zenodo.
  • List, J.-M. & R. Forkel (2021). LingPy. https://lingpy.org
  • Hulden, M. (2009). “Foma: a finite-state compiler and library.” EACL.
