Implementing lusSTR workflow for Amelogenin locus#85
Conversation
…quences [skip ci]
| if not filt_df[filt_df["SampleID"] == sample_id].empty: | ||
| make_plot(filt_df, sample_id, filters=True, at=False) | ||
| pdf.savefig() |
There was a problem hiding this comment.
accounting for samples with no sequences that pass filters
|
@standage this is ready for review. |
| ax = fig.add_subplot(6, 5, n) | ||
| if not marker_df.empty: | ||
| if marker == "AMELOGENIN": | ||
| for i, row in marker_df.iterrows(): | ||
| marker_df.loc[i, "CE_Allele"] = ( | ||
| 0 if marker_df.loc[i, "CE_Allele"] == "X" else 1 | ||
| ) |
There was a problem hiding this comment.
Creates empty subplot if no alleles/sequences exist for marker
standage
left a comment
There was a problem hiding this comment.
Some questions and suggestions:
lusSTR/cli/gui.py
Outdated
| for i, row in sample_df.iterrows(): | ||
| if sample_df.loc[i, "Locus"] == "AMELOGENIN": | ||
| sample_df.loc[i, "CE_Allele"] = 0 if sample_df.loc[i, "CE_Allele"] == "X" else 1 |
There was a problem hiding this comment.
A little numpy action could clean this code up a bit.
sample_df = np.where(
sample_df["Locus"] == "AMELOGENIN",
np.where(sample_df["CE_Allele"] == "X", 0, 1),
sample_df["CE_Allele"],
)If you're not a fan of the functional approach and want to keep the row iteration, you may consider using the row object that you're currently ignoring.
for i, row in sample_df.iterrows():
if row["Locus"] == "AMELOGENIN":
sample_df.loc[i, "CE_Allele"] = 0 of row.CE_Allele == "X" else 1I can understand using .loc for all access and assignment operations, since it keeps the syntax consistent. But calling .iterrows without using the row objects is confusing on first read.
| plot_df = sample_df | ||
| for i, row in plot_df.iterrows(): | ||
| if plot_df.loc[i, "Locus"] == "AMELOGENIN": | ||
| plot_df.loc[i, "CE_Allele"] = 0 if plot_df.loc[i, "CE_Allele"] == "X" else 1 | ||
| plot_df["CE_Allele"] = pd.to_numeric(plot_df["CE_Allele"]) |
lusSTR/cli/gui.py
Outdated
| increase_value = int(math.ceil((max_yvalue / 5) / n)) * n | ||
| n = 0 | ||
| for marker in sample_df["Locus"].unique(): | ||
| all_loci = f_strs if st.session_state.kit == "forenseq" else p_strs |
There was a problem hiding this comment.
p_strs = powerseq and f_strs = forenseq? I could only figure that out by finding where they're called in the code. You could consider renaming the variables to be more self explanatory.
| if len(locus_allele_info) == 1: | ||
| locus_allele_info = single_allele_thresholds(metadata, locus_reads, locus_allele_info) | ||
| if locus == "AMELOGENIN": | ||
| locus_allele_info = filter_amel(metadata, locus_allele_info, locus_reads) |
There was a problem hiding this comment.
There's nothing necessarily wrong with how you've updated this code, but we could keep the nesting complexity to a minimum with something like this, right?
if locus == "AMELOGENIN":
# ...
elif len(locus_allele_info) == 1:
# ...
else:
# ...There was a problem hiding this comment.
I was just trying to avoid have to repeat code (line 34 would have to be in both blocks).
lusSTR/wrappers/convert.py
Outdated
| if ( | ||
| len(sequence) <= (remove_5p + remove_3p + len(metadata["LUS"])) | ||
| and software != "uas" | ||
| and locus != "AMELOGENIN" | ||
| ) or ( | ||
| software != "uas" and locus == "AMELOGENIN" and len(sequence) < (remove_5p + remove_3p) | ||
| ): |
There was a problem hiding this comment.
I know, it's complicated. The Amelogenin true X chr sequences were getting thrown out because it's 0 bases after removal... but that first statement is still necessary for all the other markers.
There was a problem hiding this comment.
I don't doubt it's necessary. I'll see if I can come up with a way to make the syntax a little less inscrutable.
There was a problem hiding this comment.
Returning to this after myriad distractions.
Here's a step closer to something more manageable. The variable names are still too vague, but I think this breaks the complex conditional down into a structure that can actually be understood by mere mortals.
short1 = len(sequence) <= (remove_5p + remove_3p + len(metadata["LUS"]))
short2 = len(sequence) < (remove_5p + remove_3p)
cond1 = short1 and software != "uas" and locus != "AMELOGENIN"
cond2 = short2 and software != "uas" and locus == "AMELOGENIN"
if cond1 or cond2:
flank_summary = [...]How would we improve the variable names short1, short2, cond1, and cond2 to communicate their purpose?
There was a problem hiding this comment.
How about:
flanks_removed_len = remove_5p + remove_3p
amel_remove = len(sequence) < flanks_removed_len
otherloci_remove = len(sequence) <= (flanks_removed_len + len(metadata["LUS"]))
if (
software != "uas"
and locus != "AMELOGENIN"
and otherloci_remove
) or (
software != "uas" and locus == "AMELOGENIN" and amel_remove
):
There was a problem hiding this comment.
Ok, results from our Teams call just now.
locus_min_length = remove_5p + remove_3p + len(metadata["LUS"])
if locus == "AMELOGENIN":
locus_min_length += 1
if software != "uas" and len(sequence) < locus_min_length:
flank_summary = [...]We may need to fiddle with += 1 vs -= 1 and < vs <=, but once those are in agreement this should work!
There was a problem hiding this comment.
It ended up being -=1 but is updated and working!
lusSTR/wrappers/filter.py
Outdated
| p_strs = [ | ||
| "AMELOGENIN", |
There was a problem hiding this comment.
Is there any way we could define this list in a single place and reference it as necessary?
There was a problem hiding this comment.
Roger that. I created json file storing this information.
| plot = interactive_plots(marker_df, marker, max_yvalue, increase_value, all=True) | ||
| if marker in missing_loci: | ||
| marker = f"⚠️{marker}⚠️" | ||
| plot = go.Figure() |
There was a problem hiding this comment.
I feel like this plotly import pattern is an elaborate ruse so people can put "go figure" in their code 😆
| strs = str_lists["powerseq_strs"] if kit == "powerseq" else str_lists["forenseq_strs"] | ||
| ystrs = str_lists["powerseq_ystrs"] if kit == "powerseq" else str_lists["forenseq_ystrs"] |
There was a problem hiding this comment.
Great! This avoids a scenario where we update a list in one place but forget to in another place. Not that this ever happens...
lusSTR/wrappers/convert.py
Outdated
| if ( | ||
| len(sequence) <= (remove_5p + remove_3p + len(metadata["LUS"])) | ||
| and software != "uas" | ||
| and locus != "AMELOGENIN" | ||
| ) or ( | ||
| software != "uas" and locus == "AMELOGENIN" and len(sequence) < (remove_5p + remove_3p) | ||
| ): |
There was a problem hiding this comment.
I don't doubt it's necessary. I'll see if I can come up with a way to make the syntax a little less inscrutable.
Previously, we've excluded the amelogenin locus from processing, as there is no need to produce an alternate sequence format (no use for bracketed sequence format, etc.) and it is not used in the prob gen software. However, now that visualizations is now implemented in lusSTR, it is important to be able to evaluate the Amelogenin results for sex determination and overall profile quality. This PR will include the Amelogenin locus through lusSTR but remove it before producing prob gen files.