Pathotype Identification Pipeline for Escherichia coli (PIP-eco)
Overview: PIP-eco is an analytical tool for identifying and characterizing Escherichia coli (E. coli) pathotypes, both single and hybrid, from whole genome sequencing (WGS) data. The pipeline consists of three processes: marker gene alignment, pan-phylogenetic analysis, and pathogenicity island (PAI) analysis. It accepts collections of assembled bacterial genomes as input, either NCBI RefSeq records or the user's own data in FASTA format.

The pipeline first annotates the input genomes. Following this, it determines the pathotype from marker genes. It then performs a phylogenetic analysis based on pan-genome analysis to investigate genetic distances, which helps discriminate hybrid pathotypes. Through these processes, PIP-eco can be used not only for pathotype assignment but also for tracing the trajectories of pathogenic factors. Processing within the pipeline uses publicly available tools: PROKKA, USEARCH, MUSCLE, and MAFFT.
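To make the marker-based step more concrete, here is a minimal Python sketch of how detected marker genes could be mapped to single or hybrid pathotype calls. It is not PIP-eco's implementation: the `ILLUSTRATIVE_MARKERS` table and the functions `assign_pathotypes` and `describe` are hypothetical names, and the table holds only a few commonly cited marker genes rather than the pipeline's actual marker set. Real schemes also apply combination rules that this sketch ignores.

```python
# Simplified, illustrative marker table (an assumption for this sketch,
# NOT the marker database shipped with PIP-eco).
ILLUSTRATIVE_MARKERS = {
    "EPEC": {"eae"},
    "STEC": {"stx1", "stx2"},
    "ETEC": {"elt", "est"},
    "EAEC": {"aggR"},
    "EIEC": {"ipaH"},
}


def assign_pathotypes(detected_genes):
    """Return the sorted pathotypes that have at least one marker in detected_genes."""
    detected = set(detected_genes)
    return sorted(
        name
        for name, markers in ILLUSTRATIVE_MARKERS.items()
        if markers & detected  # non-empty intersection means a marker hit
    )


def describe(detected_genes):
    """Summarize the call as a single pathotype, a hybrid, or no hit."""
    hits = assign_pathotypes(detected_genes)
    if not hits:
        return "no marker detected"
    if len(hits) == 1:
        return hits[0]
    return "hybrid: " + "/".join(hits)


# A genome carrying both stx2 and aggR hits two marker sets, so it is
# reported as a hybrid.
print(describe(["stx2", "aggR"]))
```

In this sketch a genome is called hybrid as soon as markers from more than one pathotype are present. In PIP-eco, the marker-based call is complemented by the pan-genome phylogenetic analysis described above.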
conda env create -f pathotype.yaml
git clone https://github.com/SBL-Kimlab/PIP-eco.git
In the PIP-eco pipeline, each process is implemented as a separate module. Users can call the individual modules directly, as shown below, or run them in sequence to execute the whole pipeline at once.
# Before running the PIP-eco pipeline, load /include/include.ipynb.
import os
import os.path as path
from time import sleep
path_root = path.abspath( path.join( os.getcwd(), ".." ) )
path_local = path_root + "/PIPeco"; path_include = path_root + "/include"
file_include = path_include + "/include.ipynb"
%run $file_include
# PIP-eco pipeline execution
os.chdir( path_root )
pipeco = pathotype()
pipeco.method.genome_annotation()
pipeco.method.marker_alignment()
pipeco.method.vf_based_phylogenetic()
pipeco.method.pai_analysis()
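When all four modules are run at once, it can help to stop at the first failing stage and to report how long each stage took. The following is a minimal sketch of such a wrapper, not part of PIP-eco. The helper `run_stages` is a hypothetical name, and the commented usage assumes the `pipeco` object created above.

```python
import time


def run_stages(stages):
    """Run (name, callable) pairs in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, stage in stages:
        start = time.perf_counter()
        try:
            stage()
        except Exception as exc:
            # Later stages depend on earlier output, so do not continue.
            print(f"[{name}] failed: {exc}")
            break
        elapsed = time.perf_counter() - start
        print(f"[{name}] finished in {elapsed:.1f} s")
        completed.append(name)
    return completed


# Hypothetical usage inside the notebook, after `pipeco = pathotype()`:
# run_stages([
#     ("genome_annotation", pipeco.method.genome_annotation),
#     ("marker_alignment", pipeco.method.marker_alignment),
#     ("vf_based_phylogenetic", pipeco.method.vf_based_phylogenetic),
#     ("pai_analysis", pipeco.method.pai_analysis),
# ])
```

Passing the bound methods rather than calling them lets the wrapper control the order of execution and decide whether to continue.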
- Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14), 2068-2069.
- Edgar, R. C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19), 2460-2461.
- Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792-1797.
- Katoh, K., Misawa, K., Kuma, K. I., & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14), 3059-3066.
