Update to docs, allele formatting by standage · Pull Request #159 · bioforensics/MicroHapDB

standage · 2025-04-24T17:15:03Z

In this PR I'm updating the binder demo notebook. In the process, I changed the allele formatting from A|T|T|A to A:T:T:A to avoid confusion with conventional genetic notation for haplotype phases. (I would love to have dropped the separators altogether, but some legacy functions of the database still need to handle microhaps with indels correctly.)
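For illustration only, the separator change can be sketched as below. The helper name `reformat_allele` is hypothetical and is not part of MicroHapDB, and the indel-style allele `A|TTG|C` is an invented example used only to show why a separator is still needed when one position spans more than one base.

```python
def reformat_allele(allele: str, old_sep: str = "|", new_sep: str = ":") -> str:
    """Swap the separator between the per-position allele states of a microhap allele."""
    return new_sep.join(allele.split(old_sep))


# SNP-only allele from the PR description
print(reformat_allele("A|T|T|A"))  # A:T:T:A

# Invented indel-style allele: without separators, "ATTGC" could not be
# split back into its three positions unambiguously.
print(reformat_allele("A|TTG|C").split(":"))  # ['A', 'TTG', 'C']
```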

I also found a bug with how non-1KGP allele frequencies were being renamed post-resolution of locus and allele definition identifiers. ~~It only affected four allele definitions at two loci, and was resolved with a simple change to the build procedure.~~ None of the standard 1KGP allele frequencies or Ae scores were affected.

mh05KK-023 --> mh05KK-023.v1
mh05KK-020 --> mh05KK-023.v2
mh05KK-120 --> mh05KK-120.v1
mh05KK-121 --> mh05KK-120.v2

Update: Actually, after running the new regression test on the master branch, I found three more affected loci—see comment below. As before, the 1KGP allele frequencies remain unaffected.

standage · 2025-04-24T17:16:33Z

dbbuild/lib/marker.py


    def __init__(self, name, rsids, index, xrefs=None, source=None):
        self.name = Marker.check_name(name)
+        self.source_name = str(self.name)


Bug fix part 1

standage · 2025-04-24T17:16:43Z

dbbuild/lib/locus.py

-                self.source_name_map[marker.source.name][marker.name] = self.definition_names[marker.posstr()]
+                self.source_name_map[marker.source.name][marker.source_name] = self.definition_names[marker.posstr()]
                continue
            else:
                new_name = marker.name
                if len(self.markers_by_definition) > 1:
                    new_name = f"{marker.name}.v{len(self.definition_names) + 1}"
                self.definition_names[marker.posstr()] = new_name
-                self.source_name_map[marker.source.name][marker.name] = new_name
+                self.source_name_map[marker.source.name][marker.source_name] = new_name


Bug fix part 2
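A minimal sketch of why the key matters, assuming (the diff suggests this but does not show it) that `marker.name` is reassigned to the merged locus identifier before the name map is filled in. `ToyMarker` is a hypothetical stand-in for `Marker`, not the real class; the identifiers come from the renaming list in the PR description.

```python
class ToyMarker:
    """Hypothetical stand-in for dbbuild's Marker, reduced to the two name fields."""

    def __init__(self, name):
        self.name = name
        # Frozen copy of the identifier used by the source study (the part 1 fix).
        self.source_name = str(name)


marker = ToyMarker("mh05KK-020")
# Assumption: identifier resolution overwrites the working name with the merged locus name.
marker.name = "mh05KK-023"
final_name = "mh05KK-023.v2"

buggy_map = {marker.name: final_name}         # keyed by the already-resolved name
fixed_map = {marker.source_name: final_name}  # keyed by the name the source study used

# Frequency records from the source study still carry the original identifier.
freq_record_marker = "mh05KK-020"
print(buggy_map.get(freq_record_marker))  # None
print(fixed_map.get(freq_record_marker))  # mh05KK-023.v2
```

With the buggy key the lookup misses, so the frequency record keeps a deprecated identifier. With the source name as key, the record is renamed to the versioned definition.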

standage · 2025-04-24T17:18:25Z

microhapdb/tests/test_cli.py

  - 2413 distinct loci
 [frequencies]
-  - 59753 haplotypes
+  - 59704 haplotypes


Correcting for frequency records using deprecated marker identifiers

standage · 2025-04-24T18:21:52Z

microhapdb/tests/test_frequency.py

+def test_marker_names_valid():
+    freq_markers = set(microhapdb.frequencies.Marker)
+    markers = set(microhapdb.markers.Name)
+    invalid = freq_markers - markers
+    print(invalid)
+    assert len(invalid) == 0


Added this regression test
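To show what a failure of this check looks like, here is the same set difference on toy data. The helper `find_invalid_markers` and both lists are hypothetical; plain Python lists stand in for the `microhapdb.frequencies.Marker` and `microhapdb.markers.Name` columns.

```python
def find_invalid_markers(freq_markers, marker_names):
    """Return frequency-table marker names that are missing from the marker table."""
    return sorted(set(freq_markers) - set(marker_names))


# Toy data: one frequency record still uses a deprecated identifier.
marker_names = ["mh05KK-023.v1", "mh05KK-023.v2"]
freq_markers = ["mh05KK-023.v1", "mh05KK-020", "mh05KK-020"]

print(find_invalid_markers(freq_markers, marker_names))  # ['mh05KK-020']
```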

standage · 2025-04-24T18:42:03Z

Additional issues discovered after running the regression test on the master branch.

- Three different allele definitions under the identifier mh01NK-001 (Staadig2021, Kidd2018|Turchi2019|Gandotra2020, Pakstis2021) were successfully merged into mh01NH-04 (Hiroaki2015). But weirdly all of the frequencies from these studies were renamed to mh01NH-01.v? instead of mh01NH-04.v?. This is resolved in this branch.
- An allele definition under the identifier mh09KK-010 (Gandotra2020|Pakstis) was successfully merged into mh09USC-9pA, but the frequencies from Gandotra2020 were not renamed correctly. This branch corrects the issue.
- Two allele definitions under the identifier mh22KK-340 (Gandotra2020|Nimagen2023, Pakstis2021) were successfully merged into mh22USC-22qB, but the frequencies from Gandotra2020 were not renamed correctly. This branch fixes the issue.

Update docs, squash bugs

a589e27

Add regression test

759b102

standage added 2 commits April 24, 2025 15:00

Strengthen regression tests

cfc6179

Fix regression test, update change log

66316d7

standage merged commit ffcdaf3 into master Apr 24, 2025
4 checks passed

standage deleted the docs branch April 24, 2025 19:12

standage merged 4 commits into master from docs
