Implementing lusSTR workflow for Amelogenin locus by rnmitchell · Pull Request #85 · bioforensics/lusSTR

rnmitchell · 2025-05-06T10:23:12Z

Previously, we've excluded the Amelogenin locus from processing, as there is no need to produce an alternate sequence format (no use for bracketed sequence format, etc.) and it is not used in the prob gen software. However, now that visualizations are implemented in lusSTR, it is important to be able to evaluate the Amelogenin results for sex determination and overall profile quality. This PR carries the Amelogenin locus through the lusSTR workflow but removes it before producing prob gen files.

…quences [skip ci]

rnmitchell · 2025-05-20T16:36:39Z

lusSTR/wrappers/filter.py

+                if not filt_df[filt_df["SampleID"] == sample_id].empty:
+                    make_plot(filt_df, sample_id, filters=True, at=False)
+                    pdf.savefig()


accounting for samples with no sequences that pass filters

rnmitchell · 2025-06-02T10:06:34Z

@standage this is ready for review.

rnmitchell · 2025-06-02T10:09:30Z

lusSTR/wrappers/filter.py

+        ax = fig.add_subplot(6, 5, n)
+        if not marker_df.empty:
+            if marker == "AMELOGENIN":
+                for i, row in marker_df.iterrows():
+                    marker_df.loc[i, "CE_Allele"] = (
+                        0 if marker_df.loc[i, "CE_Allele"] == "X" else 1
+                    )


Creates empty subplot if no alleles/sequences exist for marker

standage

Some questions and suggestions:

standage · 2025-06-02T13:44:24Z

lusSTR/cli/gui.py

+        for i, row in sample_df.iterrows():
+            if sample_df.loc[i, "Locus"] == "AMELOGENIN":
+                sample_df.loc[i, "CE_Allele"] = 0 if sample_df.loc[i, "CE_Allele"] == "X" else 1


A little numpy action could clean this code up a bit.

    sample_df["CE_Allele"] = np.where(
        sample_df["Locus"] == "AMELOGENIN",
        np.where(sample_df["CE_Allele"] == "X", 0, 1),
        sample_df["CE_Allele"],
    )

If you're not a fan of the functional approach and want to keep the row iteration, you may consider using the row object that you're currently ignoring.

    for i, row in sample_df.iterrows():
        if row["Locus"] == "AMELOGENIN":
            sample_df.loc[i, "CE_Allele"] = 0 if row["CE_Allele"] == "X" else 1

I can understand using .loc for all access and assignment operations, since it keeps the syntax consistent. But calling .iterrows without using the row objects is confusing on first read.
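For comparison, a pandas-only variant of the same recoding could look like the sketch below. It assumes a DataFrame with `Locus` and `CE_Allele` columns as in the diff. `recode_amelogenin` is a hypothetical helper name and is not part of lusSTR, and the sample rows are made-up illustration data.

```python
import pandas as pd


def recode_amelogenin(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Amelogenin alleles recoded: "X" -> 0, anything else -> 1.

    Hypothetical helper for illustration; not a lusSTR function.
    """
    out = df.copy()
    # Work on an object column so integers and strings can coexist.
    out["CE_Allele"] = out["CE_Allele"].astype(object)
    is_amel = out["Locus"] == "AMELOGENIN"
    # Boolean-mask assignment touches only the Amelogenin rows.
    out.loc[is_amel, "CE_Allele"] = (out.loc[is_amel, "CE_Allele"] != "X").astype(int)
    return out


# Made-up rows for illustration only.
sample_df = pd.DataFrame(
    {
        "Locus": ["AMELOGENIN", "AMELOGENIN", "D3S1358"],
        "CE_Allele": ["X", "Y", "15"],
    }
)
recoded = recode_amelogenin(sample_df)
```

Working on a copy leaves the caller's DataFrame untouched, which may or may not be what the GUI code wants. The in-place loop in the diff modifies `sample_df` directly.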

standage · 2025-06-02T13:45:17Z

lusSTR/cli/gui.py

+        plot_df = sample_df
+        for i, row in plot_df.iterrows():
+            if plot_df.loc[i, "Locus"] == "AMELOGENIN":
+                plot_df.loc[i, "CE_Allele"] = 0 if plot_df.loc[i, "CE_Allele"] == "X" else 1
+        plot_df["CE_Allele"] = pd.to_numeric(plot_df["CE_Allele"])


Same comment here.

standage · 2025-06-02T13:47:12Z

lusSTR/cli/gui.py

    increase_value = int(math.ceil((max_yvalue / 5) / n)) * n
    n = 0
-    for marker in sample_df["Locus"].unique():
+    all_loci = f_strs if st.session_state.kit == "forenseq" else p_strs


p_strs = powerseq and f_strs = forenseq? I could only figure that out by finding where they're used in the code. You could consider renaming the variables to be more self-explanatory.

standage · 2025-06-02T13:49:51Z

lusSTR/scripts/filter_settings.py

-    if len(locus_allele_info) == 1:
-        locus_allele_info = single_allele_thresholds(metadata, locus_reads, locus_allele_info)
+    if locus == "AMELOGENIN":
+        locus_allele_info = filter_amel(metadata, locus_allele_info, locus_reads)


There's nothing necessarily wrong with how you've updated this code, but we could keep the nesting complexity to a minimum with something like this, right?

    if locus == "AMELOGENIN":
        # ...
    elif len(locus_allele_info) == 1:
        # ...
    else:
        # ...

I was just trying to avoid having to repeat code (line 34 would have to be in both blocks).

standage · 2025-06-02T13:52:02Z

lusSTR/wrappers/convert.py

+        if (
+            len(sequence) <= (remove_5p + remove_3p + len(metadata["LUS"]))
+            and software != "uas"
+            and locus != "AMELOGENIN"
+        ) or (
+            software != "uas" and locus == "AMELOGENIN" and len(sequence) < (remove_5p + remove_3p)
+        ):


This conditional is dizzying!

I know, it's complicated. The Amelogenin true X chr sequences were getting thrown out because it's 0 bases after removal... but that first statement is still necessary for all the other markers.

I don't doubt it's necessary. I'll see if I can come up with a way to make the syntax a little less inscrutable.

Returning to this after myriad distractions.

Here's a step closer to something more manageable. The variable names are still too vague, but I think this breaks the complex conditional down into a structure that can actually be understood by mere mortals.

    short1 = len(sequence) <= (remove_5p + remove_3p + len(metadata["LUS"]))
    short2 = len(sequence) < (remove_5p + remove_3p)
    cond1 = short1 and software != "uas" and locus != "AMELOGENIN"
    cond2 = short2 and software != "uas" and locus == "AMELOGENIN"
    if cond1 or cond2:
        flank_summary = [...]

How would we improve the variable names short1, short2, cond1, and cond2 to communicate their purpose?

How about:

    flanks_removed_len = remove_5p + remove_3p
    amel_remove = len(sequence) < flanks_removed_len
    otherloci_remove = len(sequence) <= (flanks_removed_len + len(metadata["LUS"]))
    if (
        software != "uas" and locus != "AMELOGENIN" and otherloci_remove
    ) or (
        software != "uas" and locus == "AMELOGENIN" and amel_remove
    ):

Ok, results from our Teams call just now.

    locus_min_length = remove_5p + remove_3p + len(metadata["LUS"])
    if locus == "AMELOGENIN":
        locus_min_length += 1
    if software != "uas" and len(sequence) < locus_min_length:
        flank_summary = [...]

We may need to fiddle with += 1 vs -= 1 and < vs <=, but once those are in agreement this should work!

It ended up being -= 1, but it is updated and working!
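The exact merged code is not shown in this thread. As a hedged illustration only, the sketch below restates the original two-branch condition from the diff as a single per-locus minimum length. `is_too_short` is a hypothetical helper name, the `lus` parameter stands in for `metadata["LUS"]`, and the locus, software, and sequence values in the usage lines are made up.

```python
def is_too_short(sequence, locus, software, remove_5p, remove_3p, lus):
    """Return True when too little sequence would remain after flank removal.

    Hypothetical helper; mirrors the two-branch conditional in the diff,
    not necessarily the code that was merged.
    """
    if software == "uas":
        # The check in the diff applies only to non-UAS input.
        return False
    min_length = remove_5p + remove_3p
    if locus != "AMELOGENIN":
        # Original test for other loci: len(sequence) <= flanks + len(LUS),
        # which is the same as len(sequence) < flanks + len(LUS) + 1.
        min_length += len(lus) + 1
    # Amelogenin may legitimately have zero bases left after trimming.
    return len(sequence) < min_length


# Illustrative values only.
amel_ok = is_too_short("ACGT", "AMELOGENIN", "other", 2, 2, "")
other_short = is_too_short("ACGTAGAT", "D8S1179", "other", 2, 2, "AGAT")
```

Here `amel_ok` is `False` because a 4-base Amelogenin sequence with 2 + 2 bases trimmed is exactly at the limit, while `other_short` is `True` because an 8-base sequence at another locus would leave nothing beyond the 4-base LUS.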

standage · 2025-06-02T13:53:23Z

lusSTR/wrappers/filter.py

+p_strs = [
+    "AMELOGENIN",


Is there any way we could define this list in a single place and reference it as necessary?

Roger that. I created a JSON file storing this information.

standage · 2025-06-04T13:31:32Z

lusSTR/cli/gui.py

-        plot = interactive_plots(marker_df, marker, max_yvalue, increase_value, all=True)
+        if marker in missing_loci:
+            marker = f"⚠️{marker}⚠️"
+            plot = go.Figure()


I feel like this plotly import pattern is an elaborate ruse so people can put "go figure" in their code 😆

standage · 2025-06-04T13:34:02Z

lusSTR/wrappers/filter.py

+    strs = str_lists["powerseq_strs"] if kit == "powerseq" else str_lists["forenseq_strs"]
+    ystrs = str_lists["powerseq_ystrs"] if kit == "powerseq" else str_lists["forenseq_ystrs"]


Great! This avoids a scenario where we update a list in one place but forget to in another place. Not that this ever happens...
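A minimal sketch of this single-source-of-truth pattern, under stated assumptions: only the four key names and the "AMELOGENIN" entry in the powerseq list come from this PR. The other locus names, the inline JSON string (standing in for the JSON file, whose name is not shown here), and the `loci_for_kit` helper are hypothetical.

```python
import json

# Placeholder contents: only the key names and "AMELOGENIN" in the powerseq
# list appear in the PR; every other locus name is made up for illustration.
STR_LISTS_JSON = """
{
  "powerseq_strs": ["AMELOGENIN", "LOCUS_A", "LOCUS_B"],
  "powerseq_ystrs": ["Y_LOCUS_A"],
  "forenseq_strs": ["LOCUS_A", "LOCUS_C"],
  "forenseq_ystrs": ["Y_LOCUS_B"]
}
"""


def loci_for_kit(str_lists, kit):
    """Return (STR list, Y-STR list) for the given kit name.

    Hypothetical helper; not a lusSTR function.
    """
    if kit not in ("powerseq", "forenseq"):
        raise ValueError(f"unknown kit: {kit!r}")
    return str_lists[f"{kit}_strs"], str_lists[f"{kit}_ystrs"]


str_lists = json.loads(STR_LISTS_JSON)
strs, ystrs = loci_for_kit(str_lists, "powerseq")
```

Unlike the conditional expressions in the diff, which fall back to the forenseq lists for any kit other than "powerseq", this sketch raises on an unrecognized kit name. Which behavior is preferable depends on how the kit value is validated upstream.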

rnmitchell added 10 commits May 6, 2025 06:20

processing amelogenin from UAS sample details report [skip ci]

7752853

steps through convert can use amelogenin [skip ci]

fb465c4

fixed convert step for crappy sequences in amel and filtering amel se…

7a7ad3b

…quences [skip ci]

fixed typo in amel filtering function [skip ci]

086c7fe

amelogenin now plotting correctly in pdf [skip ci]

4634d42

fixed bug in combining reads when using custom sequence ranges [skip ci]

992e608

fixed bug with custom sequence ranges in amel [skip ci]

dc557e0

began implementing amel into GUI marker plots [skip ci]

4ce279c

fixed custom range for amel [skip ci]

65b878a

handling samples with no sequences passing filters [skip ci]

01472ee

rnmitchell commented May 20, 2025

rnmitchell added 8 commits May 20, 2025 13:43

fixed plotting amel in gui [skip ci]

ebc7fc3

added blank plots for missing loci [skip ci]

c2929b1

made str lists specific for each kit [skip ci]

694c980

added empty plots to GUI for missing markers [skip ci]

f089083

removed extra marker in powerseq list [skip ci]

cd44d52

removed extra marker in powerseq list [skip ci]

1fb7846

began updating tests [skip ci]

eec3ac1

updated remaining tests

740c5ea

rnmitchell marked this pull request as ready for review June 2, 2025 10:06

rnmitchell requested a review from standage June 2, 2025 10:06

rnmitchell commented Jun 2, 2025

standage requested changes Jun 2, 2025

rnmitchell added 2 commits June 4, 2025 06:10

fixed formatting issues; added str lists as json file

1f03738

fixed bug

ac814c7

standage reviewed Jun 4, 2025

simplified convert code

ae3c813

standage approved these changes Jun 5, 2025

standage merged commit 2425112 into master Jun 5, 2025
2 checks passed

standage deleted the amelogenin branch June 5, 2025 16:18
