A generalizable tool for visualizing the genomic neighbourhoods of Acr/Aca proteins and their taxonomic data.
To install the scripts needed to run the pipeline, run the following command in the directory where you want the project installed:
git clone https://git.cs.usask.ca/njt694/acrvis.git
If you wish to install any updates to the pipeline, cd to the directory created by the git clone and run:
git pull
Then add the scripts/ folder to your PATH if you don't want to specify its location every time you call main.sh. On Unix systems (assuming you cloned into your home directory), you can do this with:
export PATH="$HOME/acrvis/scripts:$PATH"
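The same idea can be checked with a self-contained snippet; the directory below is only illustrative:

```shell
# Add the scripts directory to PATH, using $HOME rather than a quoted
# tilde (the shell does not expand ~ inside double quotes).
scripts_dir="$HOME/acrvis/scripts"
export PATH="$scripts_dir:$PATH"
# Confirm the directory now appears on PATH.
case ":$PATH:" in
  *":$scripts_dir:"*) echo "on PATH" ;;
  *)                  echo "not on PATH" ;;
esac
```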
To run the pipeline, a database is required for psi-BLAST searches and building the genomic neighbourhoods. Scripts to generate/update your own databases are located in the db_scripts/ folder.
To build your own database you must specify a link to the NCBI FTP site. The link should start with "ftp://ftp.ncbi.nih.gov/genomes/genbank/" and end at an "assembly_summary.txt" file. For example, to build a database of all the bacterial genomes on NCBI in your current directory, you would run:
db_scripts/get_ncbi_genomes.sh -l "ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt"
To update that database, assuming it was created in your home directory, you would run:
db_scripts/get_ncbi_genomes.sh -l "ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt" -e ~/bacterial_db
For information on all of the optional arguments when building a database, run:
db_scripts/get_ncbi_genomes.sh -h
Note that some of these databases may be very large (terabytes!), so make sure the machine you are downloading them onto has enough space. Because of this size, the script may also take a long time to run, so it is recommended to run it inside a 'screen' session. You can start and attach to a new screen named 'db_building' with the following command:
screen -S db_building
To detach from the screen, press Ctrl+a followed by d, and to re-attach to the screen, run:
screen -r db_building
The main script for running the full pipeline is main.sh in the scripts/ folder. For information on all arguments and the default values of the optional ones, run:
main.sh -h
The required arguments for the script are [-q QUERY], [-g GBFF], and [-d DATABASE] (plus [-s SCRIPTS] if you didn't add the scripts to your PATH). [-q QUERY] must point to a folder containing the protein FASTA file(s) you wish to query against the database. [-g GBFF] points to the GenBank Flat File Format (GBFF) files containing the genome data for all of the sequences in the database. [-d DATABASE] must point to a folder created by db_scripts/get_ncbi_genomes.sh; this is the database that will be queried against.
If the bacterial database located at /birl2/data/Acr/data/bacterialdb is used as [-d DATABASE], then taxonomic trees and pie charts will be generated to visualize the distribution of the matches. If any other database is used, then only .csv files containing the raw taxonomic data will be generated.
Assuming we have a query FASTA file such as Aca7.fasta (from acr_aca_fastas/Aca/Aca7.fasta) in a directory called query/ in our home directory, and using the bacterial database mentioned above, you would run the pipeline as follows:
main.sh -q ~/query/ -d /birl2/data/Acr/data/bacterialdb -g /birl2/data/Acr/data/bacterial_genomes/gbff
If you hadn't added scripts/ to your PATH, you would instead run (assuming acrvis/scripts is in your home directory ~):
main.sh -q ~/query/ -d /birl2/data/Acr/data/bacterialdb -g /birl2/data/Acr/data/bacterial_genomes/gbff -s ~/acrvis/scripts
You can also skip the psi-BLAST step of the pipeline, which tends to be the longest step (150-200 minutes per query against the bacterial database), by specifying [-p PSIBLAST_OUT] instead of -q, as follows, assuming the psiBLAST out is located in ~/acrvis_out_Aca7/psiblast_out/:
main.sh -p ~/acrvis_out_Aca7/psiblast_out/ -d /birl2/data/Acr/data/bacterialdb -g /birl2/data/Acr/data/bacterial_genomes/gbff
Note that PSIBLAST_OUT must be in the same format as that generated by scripts/run_psiblasts.sh (-outfmt "6 qacc sacc pident qcovhsp length mismatch gapopen qstart qend sstart send qlen slen evalue").
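Since the columns are fixed by that -outfmt string, the subject accession and e-value of each hit are fields 2 and 14 of every tab-separated line. A minimal sketch, using a made-up hit (the accession below is hypothetical, not real pipeline output):

```shell
# One hypothetical hit line in the 14-column outfmt 6 layout above
# (qacc sacc pident qcovhsp length mismatch gapopen qstart qend
#  sstart send qlen slen evalue); extract sacc ($2) and evalue ($14).
printf 'Aca7\tWP_000000001.1\t98.5\t100\t120\t2\t0\t1\t120\t5\t124\t120\t130\t1e-50\n' \
  | awk -F'\t' '{print $2, $14}'
# prints: WP_000000001.1 1e-50
```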
You can also specify a labelling file to use to label the genes in the Clinker plots. If no labelling file is specified, one will be created in the output folder generated by the pipeline. The labelling file contains the accession numbers of the psi-BLAST matches mapped to the name of the query used, and the e-value of the match. If a labelling file is specified, then it will be updated with any new psi-BLAST matches, and if any accessions are found to match multiple queries, then only the match with the lowest e-value will be kept. To specify a labelling file use the -lf argument, for example:
main.sh -q ~/query/ -d /birl2/data/Acr/data/bacterialdb -g /birl2/data/Acr/data/bacterial_genomes/gbff -lf ~/all_match_names.txt
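The lowest-e-value-wins rule can be sketched with standard tools. The three-column layout (accession, query, e-value) and the sample accessions here are assumptions for illustration, not the pipeline's exact labelling-file format:

```shell
# Two queries match WP_0001.1; only the lower-e-value match (Aca2, 1e-50)
# should survive. Sort numerically by e-value (-g handles 1e-50 notation),
# then keep the first line seen for each accession.
printf 'WP_0001.1 Aca7 1e-30\nWP_0001.1 Aca2 1e-50\nWP_0002.1 Aca7 1e-10\n' \
  | sort -k3,3g | awk '!seen[$1]++'
# prints:
# WP_0001.1 Aca2 1e-50
# WP_0002.1 Aca7 1e-10
```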
Note that the query name used by the pipeline will be the name of the fasta file. For example, if Aca7.fasta was used as a query, then the query name would be Aca7.
By default, the pipeline generates an output folder in your current directory whose name begins with acrvis_out_.
Inside that folder are three sub-folders: data/, psiblast_out/, and parsed_outs/, as well as a text file containing labelling information called all_match_names.txt, if no labelling file was specified. The psiblast_out/ folder contains the psi-BLAST matches for each query and can be reused as the -p argument in future calls to the pipeline. The parsed_outs/ folder contains the non-redundant psi-BLAST matches (duplicate protein match accessions removed).
The data/ folder contains the bulk of the output. Within it you will find a sub-folder for each query that had psi-BLAST matches. Within each query folder you will find a folder called redundant/, which contains genomic neighbourhoods that had identical amino acid sequences, and a file ending in .map, which maps the genomic neighbourhoods in redundant/ to the representative genomic neighbourhood kept for visualization. There will also be a folder called all_non_redundants/ containing the clustered and plotted non-redundant genomic neighbourhoods. If any genomic neighbourhoods meet the criteria, there will be folders called homologous/ and potentially_homologous/, containing the genomic neighbourhoods above and below the PID threshold, respectively. Within each of those you will find a folder called in_size/ and one called out_size/, containing the genomic neighbourhoods within and outside the length threshold, respectively. Inside those folders are cluster folders, each containing similar genomic neighbourhoods along with their Clinker plots and taxonomic data.
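Putting those descriptions together, the tree below sketches a plausible layout for a single query named Aca7. The folder names follow the text above, but the exact .map file name and the cluster folder names are illustrative assumptions:

```text
acrvis_out_Aca7/
├── all_match_names.txt
├── psiblast_out/
├── parsed_outs/
└── data/
    └── Aca7/
        ├── redundant/
        ├── Aca7.map
        ├── all_non_redundants/
        ├── homologous/
        │   ├── in_size/
        │   └── out_size/
        └── potentially_homologous/
            ├── in_size/
            └── out_size/
```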
If you are getting "AttributeError: 'NoneType' object has no attribute 'split'" mentioning 'threadpoolctl.py' when running find_clusters.py, try running
pip install -U threadpoolctl
to update the package (the PyPI package is named threadpoolctl; threadpoolctl.py is the file named in the error). The error doesn't prevent the pipeline from running correctly, but it clutters the output.