AcrVis

Generalizable tool used to visualize the genomic neighbourhoods of Acr/Aca proteins and their taxonomic data.

Installation

To install the necessary scripts for running the pipeline, run the following command in the directory where you want the project installed:

git clone https://git.cs.usask.ca/njt694/acrvis.git

To install any updates to the pipeline later, cd into the directory created by the git clone and run:

git pull

Then add the scripts/ folder to your PATH so you don't have to specify its location every time you call main.sh. On Unix systems (assuming you cloned into your home directory), run:

export PATH="$HOME/acrvis/scripts:$PATH"
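To make the scripts available in future sessions as well, you can append an equivalent line to your shell's startup file. This is a sketch assuming bash and a clone at $HOME/acrvis; use ~/.zshrc or your shell's equivalent otherwise:

```shell
# Make the PATH change permanent for new shells.
# Assumes bash and that the repository was cloned into $HOME/acrvis.
echo 'export PATH="$HOME/acrvis/scripts:$PATH"' >> ~/.bashrc
```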

Database

To run the pipeline, a database is required for psi-BLAST searches and building the genomic neighbourhoods. Scripts to generate/update your own databases are located in the db_scripts/ folder.

To build your own database you must specify a link to the NCBI FTP site. The link should start with "ftp://ftp.ncbi.nih.gov/genomes/genbank/" and end at an "assembly_summary.txt" file. For example, to build a database of all the bacterial genomes on NCBI in your current directory, you would run:

db_scripts/get_ncbi_genomes.sh -l "ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt"

To update that database, assuming it was created in your home directory, you would run:

db_scripts/get_ncbi_genomes.sh -l "ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt" -e ~/bacterial_db

For information on all of the optional arguments when building a database, run:

db_scripts/get_ncbi_genomes.sh -h

Note that some of these databases can be very large (terabytes!), so make sure the machine you are downloading to has enough disk space. Because of the size, the script may also take a long time to run, so it is recommended to run it inside a 'screen' session. You can start and attach to a new screen named 'db_building' with the following command:

screen -S db_building

To detach from the screen, press Ctrl+a followed by d, and to re-attach to the screen, run:

screen -r db_building

Usage

The main script for running the full pipeline is main.sh in the scripts/ folder. For all argument information and default values for optional arguments, run:

main.sh -h

The required arguments for the script are [-q QUERY], [-g GBFF] and [-d DATABASE] (plus [-s SCRIPTS] if you didn't add the scripts to your PATH). [-q QUERY] must point to a folder containing the protein FASTA file(s) you wish to query against the database. [-g GBFF] points to the GenBank Flat File Format files containing the genome data for all of the sequences in the database. [-d DATABASE] must point to a folder created by db_scripts/get_ncbi_genomes.sh; this is the database that will be queried against.

If the bacterial database located at /birl2/data/Acr/data/bacterialdb is used as [-d DATABASE], then taxonomic trees and pie charts will be generated to visualize the distribution of the matches. If any other database is used, then only .csv files containing the raw taxonomic data will be generated.

Assuming we have a query FASTA file such as Aca7.fasta (from acr_aca_fastas/Aca/Aca7.fasta) in a directory called query/ in our home directory, and we use the bacterial database mentioned above, the pipeline is run as follows:

main.sh -q ~/query/ -d /birl2/data/Acr/data/bacterialdb -g /birl2/data/Acr/data/bacterial_genomes/gbff

If you hadn't added scripts/ to your PATH, you would need to run (assuming acrvis/scripts is in your home directory ~):

main.sh -q ~/query/ -d /birl2/data/Acr/data/bacterialdb -g /birl2/data/Acr/data/bacterial_genomes/gbff -s ~/acrvis/scripts

You can also skip the psi-BLAST step of the pipeline, which tends to be the longest step (150-200 minutes per query against the bacterial database), by specifying [-p PSIBLAST_OUT] instead of -q. For example, assuming the psi-BLAST output is located in ~/acrvis_out_Aca7/psiblast_out/:

main.sh -p ~/acrvis_out_Aca7/psiblast_out/ -d /birl2/data/Acr/data/bacterialdb -g /birl2/data/Acr/data/bacterial_genomes/gbff

Note that PSIBLAST_OUT must be in the same format as that generated by scripts/run_psiblasts.sh (-outfmt "6 qacc sacc pident qcovhsp length mismatch gapopen qstart qend sstart send qlen slen evalue").
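If you are supplying your own psi-BLAST output, a quick sanity check is to count columns: the outfmt string above produces exactly 14 tab-separated fields per row. The snippet below is an illustration, not part of the pipeline, and psiblast_out.tsv is a placeholder file name:

```shell
# Count rows that do not have the 14 tab-separated columns produced by
# -outfmt "6 qacc sacc pident qcovhsp length mismatch gapopen qstart qend sstart send qlen slen evalue"
awk -F'\t' 'NF != 14 { bad++ } END { print bad+0, "malformed rows" }' psiblast_out.tsv
```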

You can also specify a labelling file used to label the genes in the Clinker plots. If no labelling file is specified, one will be created in the output folder generated by the pipeline. The labelling file maps the accession numbers of the psi-BLAST matches to the name of the query used and the e-value of the match. If a labelling file is specified, it will be updated with any new psi-BLAST matches; if an accession matches multiple queries, only the match with the lowest e-value is kept. To specify a labelling file, use the -lf argument, for example:

main.sh -q ~/query/ -d /birl2/data/Acr/data/bacterialdb -g /birl2/data/Acr/data/bacterial_genomes/gbff -lf ~/all_match_names.txt

Note that the query name used by the pipeline will be the name of the fasta file. For example, if Aca7.fasta was used as a query, then the query name would be Aca7.
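The lowest-e-value rule for labelling files can be sketched with standard Unix tools. The tab-separated three-column layout (accession, query name, e-value) and the file name matches.tsv are assumptions for illustration only; the pipeline's actual labelling file format may differ:

```shell
# Keep only the best (lowest e-value) match per accession.
# Assumed columns: accession <TAB> query name <TAB> e-value.
TAB=$(printf '\t')
sort -t "$TAB" -k1,1 -k3,3g matches.tsv | awk -F'\t' '!seen[$1]++'
```

Sorting with -g (general numeric) orders scientific-notation e-values correctly, and awk then keeps the first (best) row seen for each accession.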

Output

By default, the pipeline will generate a folder beginning with acrvis_out_ in your current directory.

Inside that folder are three sub-folders: data/, psiblast_out/, and parsed_outs/, as well as a text file containing labelling information called all_match_names.txt if no labelling file was specified. The psiblast_out/ folder contains the psi-BLAST matches for each query and can be used as the -p argument in future calls to the pipeline. The parsed_outs/ folder contains the non-redundant psi-BLAST matches (duplicate protein match accessions removed).

The data/ folder contains the bulk of the output. Within it you will find a sub-folder for each query that had psi-BLAST matches. Within each query folder is a folder called redundant/, which contains genomic neighbourhoods with identical amino acid sequences, and a file ending in .map, which maps the genomic neighbourhoods in redundant/ to the representative genomic neighbourhood that was kept for visualization. There is also a folder called all_non_redundants/ that contains the clustered and plotted non-redundant genomic neighbourhoods. If any genomic neighbourhoods meet the criteria, there will be folders called homologous/ and potentially_homologous/, which contain the genomic neighbourhoods above and below the PID threshold, respectively. Within each of those you will find a folder called in_size/ and one called out_size/, which contain the genomic neighbourhoods within and outside of the length threshold, respectively. Inside those folders are cluster folders, each containing similar genomic neighbourhoods along with their Clinker plots and taxonomic data.
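Putting the description above together, a typical output layout looks roughly like the sketch below. Names in angle brackets vary by run, the .map file name is not fixed, and which folders appear depends on your matches:

```
acrvis_out_<name>/
├── all_match_names.txt          (labelling file, if none was specified)
├── psiblast_out/                (psi-BLAST matches per query; reusable via -p)
├── parsed_outs/                 (non-redundant psi-BLAST matches)
└── data/
    └── <query>/
        ├── redundant/           (identical-sequence neighbourhoods)
        ├── *.map                (redundant -> representative mapping)
        ├── all_non_redundants/  (clustered, plotted neighbourhoods)
        ├── homologous/          (above the PID threshold)
        │   ├── in_size/
        │   └── out_size/
        └── potentially_homologous/  (below the PID threshold)
            ├── in_size/
            └── out_size/
```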

Common Errors

If you get "AttributeError: 'NoneType' object has no attribute 'split'" mentioning threadpoolctl when running find_clusters.py, try running

pip install -U threadpoolctl

to update the package. The error doesn't prevent the pipeline from running correctly, but it clutters the output.
