Fixed bug: Updated parse_gtdbtk.Snakefile #19
Open
shiraz-shah wants to merge 1 commit into Russel88:main
Changed mmseqs gene clustering to coverage mode 1, so gene fragments do not end up as separate clusters.
MAGinator had a bug caused by the default mmseqs coverage mode used for gene clustering. Because of this bug, gene fragments from incomplete assemblies ended up as their own gene clusters. This inflated the total number of gene clusters, with unforeseen downstream consequences for signature gene selection and abundance estimation.
We have fixed this bug by changing the mmseqs clustering to coverage mode 1, so gene fragments no longer form separate clusters and are instead merged with their full-length counterparts.
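To illustrate why the coverage mode matters, here is a minimal Python sketch of the coverage check. It is a simplification for reviewers, not MMseqs2's actual code. The function name `passes_coverage`, the 0.8 threshold and the sequence lengths are assumptions made for the example. Coverage mode 0 is taken to require that the alignment covers both query and target, and coverage mode 1 (`--cov-mode 1` on the command line) to require coverage of the target only.

```python
def passes_coverage(aln_len, query_len, target_len, cov_mode, min_cov=0.8):
    """Simplified coverage filter (illustrative only, not MMseqs2 internals).

    cov_mode 0: alignment must cover >= min_cov of both query and target.
    cov_mode 1: alignment must cover >= min_cov of the target only.
    """
    query_cov = aln_len / query_len
    target_cov = aln_len / target_len
    if cov_mode == 0:
        return query_cov >= min_cov and target_cov >= min_cov
    if cov_mode == 1:
        return target_cov >= min_cov
    raise ValueError(f"unsupported cov_mode in this sketch: {cov_mode}")


# Hypothetical case: a 900-residue full-length gene (cluster representative,
# i.e. the query) and a 300-residue fragment (target) aligned over all 300
# residues of the fragment.
full_len, frag_len, aln_len = 900, 300, 300

# Mode 0: the fragment covers only a third of the full-length gene, so the
# pair is rejected and the fragment would seed its own cluster.
print(passes_coverage(aln_len, full_len, frag_len, cov_mode=0))

# Mode 1: the fragment itself is fully covered, so it joins the cluster of
# its full-length counterpart.
print(passes_coverage(aln_len, full_len, frag_len, cov_mode=1))
```

Under these assumptions the first call prints `False` and the second prints `True`, which matches the behaviour described above: only in coverage mode 1 does the fragment merge with the full-length gene.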
In addition, the mmseqs clustering workflow has been changed from easy-linclust to easy-cluster. easy-cluster is fast enough (20 minutes for a deep 500-sample metagenome data set), whereas easy-linclust relies on a number of heuristics that improve speed at the cost of accuracy.