Filtering faulty circularized genomes using conspecific MAGs.
Reconstructing circularized genomes is the ultimate goal of prokaryotic genome assembly. Accurate long-read sequencing technology now makes it possible to assemble circularized genomes from highly complex metagenomic samples. However, prokaryotic genomes tend to contain many repetitive sequences, which often cause the assembly to close incorrectly and produce circularized genomes with significant gaps. cMAGfilter filters out circularized metagenome-assembled genomes (cMAGs) with such gaps using their conspecific MAGs. For a given cMAG and its conspecific MAGs, it first identifies core contigs, that is, contigs shared by most of the conspecific MAGs. Next, it calculates the core contig retrieval rate, the fraction of core contigs that align to the cMAG, and uses this rate to decide whether to filter out the cMAG.
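The two steps described above can be illustrated with a minimal sketch. It is not the cMAGfilter implementation: the function names, the input structures, and the `min_fraction` cutoff used to define "most" are all assumptions made for illustration, and the real tool derives this information from nucmer alignments.

```python
from typing import Dict, Set


def find_core_contigs(found_in: Dict[str, Set[str]], n_mags: int,
                      min_fraction: float = 0.75) -> Set[str]:
    """Return contigs found in at least `min_fraction` of the conspecific MAGs.

    `found_in` maps a contig ID to the set of MAG IDs in which it was found.
    `min_fraction` is a hypothetical cutoff standing in for "most".
    """
    return {contig for contig, mags in found_in.items()
            if len(mags) / n_mags >= min_fraction}


def retrieval_rate(core_contigs: Set[str], aligned_to_cmag: Set[str]) -> float:
    """Fraction of core contigs that were recovered from the cMAG."""
    if not core_contigs:
        return 0.0
    return len(core_contigs & aligned_to_cmag) / len(core_contigs)


# Toy input: four conspecific MAGs and three contigs (made-up IDs).
found_in = {
    "c1": {"m1", "m2", "m3", "m4"},
    "c2": {"m1", "m2", "m3"},
    "c3": {"m1"},
}
core = find_core_contigs(found_in, n_mags=4)
rate = retrieval_rate(core, aligned_to_cmag={"c1"})
print(sorted(core), rate)
```

In this toy case `c3` is not a core contig because only one of the four MAGs contains it, and only one of the two core contigs aligns to the cMAG.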
The package has been tested on Linux (Ubuntu 20.04 LTS). cMAGfilter requires Python>=3.6 and the mummer4 package. You can install mummer4 from its tarball or from bioconda. Make sure the mummer4 executables are on your PATH, or specify their location with the -nuc parameter.
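Before running cMAGfilter, you may want to confirm that the nucmer executable from mummer4 is visible on your PATH. The snippet below is a small sketch using only the Python standard library; `locate_tool` is a helper name made up for this example and is not part of cMAGfilter.

```python
import shutil
from typing import Optional


def locate_tool(name: str = "nucmer", extra_dir: Optional[str] = None) -> Optional[str]:
    """Return the full path of `name`, or None if it cannot be found.

    If `extra_dir` is given, only that directory is searched;
    otherwise the directories listed in PATH are searched.
    """
    if extra_dir is not None:
        return shutil.which(name, path=extra_dir)
    return shutil.which(name)


path = locate_tool("nucmer")
if path is None:
    print("nucmer not found on PATH; install mummer4 or pass its location with -nuc")
else:
    print(f"nucmer found at {path}")
```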
git clone https://github.com/netbiolab/cMAGfilter.git
cd cMAGfilter
python3 setup.py install --user
The installation only takes a few minutes.
python3 cMAGfilter.py examples/input/circular_contigs/Akkermansia_muciniphila.fna examples/input/conspecific_MAGs/Akkermansia_muciniphila examples/output/Akkermansia_muciniphila
The test run was performed on the following system, using only a single core.
- OS: Ubuntu 20.04 LTS (Windows Subsystem for Linux 2.0)
- CPU: 12th Gen Intel(R) Core(TM) i7-12700H 2.3 GHz
- RAM: 8GB
System usage
- Running time: 6 min 20 sec
- Peak RAM usage: 135MB
Using multiple cores will reduce the running time. The running time and peak memory usage grow roughly with the square of the number of conspecific genomes.
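The quadratic growth follows from the all-by-all comparison of conspecific genomes. As a rough illustration, the sketch below counts the genome pairs for a given number of MAGs. It assumes one alignment per unordered pair; the tool may instead align each ordered pair, which doubles the count but does not change the quadratic trend. The function names and MAG IDs are made up for this example.

```python
from itertools import combinations
from typing import List, Tuple


def all_by_all_pairs(mags: List[str]) -> List[Tuple[str, str]]:
    """List every unordered pair of conspecific MAGs."""
    return list(combinations(mags, 2))


def pair_count(n: int) -> int:
    """Closed form for the number of unordered pairs among n genomes."""
    return n * (n - 1) // 2


mags = [f"MAG_{i}" for i in range(10)]
print(len(all_by_all_pairs(mags)), pair_count(10), pair_count(20))
```

Doubling the number of conspecific genomes from 10 to 20 raises the pair count from 45 to 190, roughly four times as many alignments.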
You can find the example output files in 'examples/output/Mesosutterella_multiformis'.
- all_by_all_alignment_results/: This directory contains all-by-all nucmer alignment results between conspecific MAGs.
- [circular-contig]_align_back_results/: This directory contains nucmer alignment results of the core contigs against the circular contig.
- conspecific_genomes.contig_report.tsv: contains the information on whether the contigs of a conspecific MAG are found in the other conspecific MAGs.
- [circular-contig]_core_contigs_alignment.core_contig_stat.tsv: contains the list of core contigs and their alignment results against the circular contig.
- core_contigs.fna: FASTA sequence file of the core contigs.

[circular-contig]_core_contigs_alignment.summary.tsv is the final result file. It contains the following fields:
1. circular contig name
2. circular contig alignment length
3. circular contig length
4. circular contig alignment coverage (2. / 3.)
5. core contig alignment length
6. core contig length
7. core contig alignment coverage (5. / 6.)
8. aligned core contig count
9. core contig count
10. core contig retrieval rate (8. / 9.): In the paper, we considered a circular contig genuine if its core contig retrieval rate is higher than 0.95. Therefore, in this example case, we filter out the Mesosutterella_multiformis circular contig, as its core contig retrieval rate is 0.867.
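A result row can be checked against the 0.95 threshold with a few lines of Python. The sketch below assumes that the summary file is tab-separated, that each row carries the ten fields listed above in that order, and that any header line has already been skipped. None of this is confirmed here, so inspect the file first. The column names and the sample row are made up for illustration; only the retrieval rate of 0.867 echoes the example above.

```python
from typing import Dict, Union

# Hypothetical column names for the ten fields, in the listed order.
COLUMNS = [
    "circular_contig", "circular_aligned_len", "circular_len", "circular_cov",
    "core_aligned_len", "core_len", "core_cov",
    "aligned_core_count", "core_count", "retrieval_rate",
]


def parse_summary_row(line: str) -> Dict[str, Union[str, float]]:
    """Split one tab-separated row into named fields (numbers become floats)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row: Dict[str, Union[str, float]] = {COLUMNS[0]: fields[0]}
    for name, value in zip(COLUMNS[1:], fields[1:]):
        row[name] = float(value)
    return row


def passes_filter(row: Dict[str, Union[str, float]], threshold: float = 0.95) -> bool:
    """A circular contig is kept only if its retrieval rate is higher than the threshold."""
    return float(row["retrieval_rate"]) > threshold


# Made-up sample row; the lengths and counts are illustrative, not real results.
sample = "contig_1\t1800000\t2000000\t0.9\t1700000\t1900000\t0.8947\t13\t15\t0.867"
row = parse_summary_row(sample)
print(row["circular_contig"], passes_filter(row))
```

The comparison is strict, matching the "higher than 0.95" wording, so a contig with a rate of exactly 0.95 would be filtered out under this sketch.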
All 110 HiFi circular contigs and their conspecific MAGs used in the paper are available from the link (6.6GB).
CY Kim, J Ma, I Lee, HiFi Metagenomic Sequencing Enables Assembly of Accurate and Complete Genomes from Human Gut Microbiota, bioRxiv preprint, Feb. 2022
