Commits (30)

6eb8886  handle varscan iupac ambiguity codes (aryarm, Mar 27, 2021)
1ebe4f6  update install instructions to endorse mamba (aryarm, Apr 6, 2021)
01665bf  docs: new callers must be added to the env file (aryarm, Apr 26, 2021)
bd4e92f  add newline to callers README (aryarm, Apr 26, 2021)
7082e70  temporarily handle 2vcf exit code (see #28) (aryarm, Apr 29, 2021)
86e1453  handle pvals of 1 in random forest output when creating VCF (aryarm, Apr 29, 2021)
753f81a  add a script for analyzing allele conflicts (aryarm, Apr 29, 2021)
9e872de  update to snakemake v5.24.2 (aryarm, Jun 17, 2021)
75ee395  initial dockerfile attempt: untested (aryarm, Jun 17, 2021)
6121d43  Merge pull request #26 from aryarm/patch/varscan-iupac (aryarm, Jul 1, 2021)
af69606  subset columns before import in 2vcf.py; also resolve #28 (aryarm, Jul 2, 2021)
abfc6b7  add header to allele conflicts script (aryarm, Jul 2, 2021)
4167a82  be specific about the channels in each env file and remove default ch… (aryarm, Jul 2, 2021)
cbadb82  remove explicit channel specs in classify env but add to prc env (aryarm, Jul 3, 2021)
ff5bb00  Merge pull request #31 from aryarm/fix/2vcf-exit-code (aryarm, Jul 3, 2021)
366c71d  add sed, gzip, bash, and friends to the env files for reproducibility (aryarm, Jul 4, 2021)
248995b  Merge pull request #32 from aryarm/explicit-envs (aryarm, Jul 4, 2021)
54050bb  update install instructions to endorse mamba (aryarm, Apr 6, 2021)
5149592  docs: new callers must be added to the env file (aryarm, Apr 26, 2021)
7ba7780  add newline to callers README (aryarm, Apr 26, 2021)
786d7aa  update to snakemake v5.24.2 (aryarm, Jun 17, 2021)
5b34036  initial dockerfile attempt: untested (aryarm, Jun 17, 2021)
98e7a8a  Merge branch 'installation' of github.com:aryarm/varCA into installation (aryarm, Jul 6, 2021)
c59fe85  remove for-loop in pindel caller script (aryarm, Jul 11, 2021)
450afdc  create script for generating small test datasets (resolves #33) (aryarm, Jul 12, 2021)
546f012  both jurkat and molt4 test datasets must have at least 3 peaks to sat… (aryarm, Jul 12, 2021)
5b92328  make fillna.bash read from stdin and connect steps via pipes (resolve… (aryarm, Jul 12, 2021)
c936a5b  bump required snakemake version to 5.24.2 (aryarm, Jul 12, 2021)
3ea4d99  convert pipe to temp file (aryarm, Jul 16, 2021)
ac31acc  clarify required input in config (aryarm, Jul 22, 2022)
14 changes: 14 additions & 0 deletions Dockerfile
@@ -0,0 +1,14 @@
FROM snakemake/snakemake:v5.24.2

# work in the home directory
WORKDIR /home

# copy all the files to the container
COPY . .

# install the pipeline's dependencies
RUN snakemake --use-conda --conda-create-envs-only -j

# run varCA
# specify as a CMD, so it can be overridden by the user
CMD ["./run.bash"]
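For context, here is a minimal sketch of how this image might be built and run (the `varca` tag and the mounted host paths are assumptions, not part of the PR):

```
# build the image from the repository root
docker build -t varca .

# run the pipeline; mount your data and output directories into /home,
# since the Dockerfile copies the repository there
docker run --rm -v "$PWD/data":/home/data -v "$PWD/out":/home/out varca
```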
11 changes: 6 additions & 5 deletions README.md
@@ -1,4 +1,4 @@
[![Snakemake](https://img.shields.io/badge/snakemake-5.18.0-brightgreen.svg?style=flat-square)](https://snakemake.readthedocs.io/)
[![Snakemake](https://img.shields.io/badge/snakemake-5.24.2-brightgreen.svg?style=flat-square)](https://snakemake.readthedocs.io/)
[![License](https://img.shields.io/apm/l/vim-mode.svg)](LICENSE)

# varCA
@@ -21,11 +21,12 @@ wget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz
```

# setup
The pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). We recommend installing version 5.18.0:
The pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). For most users, we recommend installing Snakemake via mamba as [described in their documentation](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda-mamba).

However, if your aim is complete reproducibility, we recommend installing version 5.24.2 via this command:
```
conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'
conda create -n snakemake -c conda-forge --no-channel-priority 'bioconda::snakemake==5.24.2'
```
We highly recommend you install [Snakemake via conda](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda) like this so that you can use the `--use-conda` flag when calling `snakemake` to let it [automatically handle all dependencies](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) of the pipeline. Otherwise, you must manually install the dependencies listed in the [env files](envs).
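For reference, the mamba route mentioned above usually looks something like the following (a sketch based on the Snakemake installation docs; the environment name is an assumption):

```
mamba create -c conda-forge -c bioconda -n snakemake snakemake
```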

# execution
1. Activate snakemake via `conda`:
@@ -52,7 +53,7 @@ You must modify [the config.yaml file](configs#configyaml) to specify paths to y
The pipeline is made up of [two subworkflows](rules). These are usually executed together automatically by the master pipeline, but they can also be executed on their own for more advanced usage. See the [rules README](rules/README.md) for execution instructions and a description of the outputs. You will need to execute the subworkflows separately [if you ever want to create your own trained models](rules#training-and-testing-varca).

#### Reproducing our results
We provide the example data so that you may quickly (in ~1 hr, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow [these steps](rules#testing-your-model--reproducing-our-results) to create all of the plots and tables in our paper.
We provide the example data so that you may quickly (in ~15 mins, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow [these steps](rules#testing-your-model--reproducing-our-results) to create all of the plots and tables in our paper.

### If this is your first time using Snakemake
We recommend that you run `snakemake --help` to learn about Snakemake's options. For example, to check that the pipeline will be executed correctly before you run it, you can call Snakemake with the `-n -p -r` flags. This is also a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are inputs to the first rule in each workflow -- ie the `all` rule).
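To make that concrete, a dry run of the pipeline might look like this (a sketch; the core count passed to `-j` is an assumption):

```
conda activate snakemake
# -n: dry run, -p: print shell commands, -r: show the reason for each rule
snakemake -n -p -r --use-conda -j 4
```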
2 changes: 1 addition & 1 deletion Snakefile
@@ -2,7 +2,7 @@ import warnings
from snakemake.utils import min_version

##### set minimum snakemake version #####
min_version("5.18.0")
min_version("5.24.2")

configfile: "configs/config.yaml"
configfile: "configs/callers.yaml"
6 changes: 6 additions & 0 deletions callers/README.md
@@ -40,3 +40,9 @@ By providing a dash character `-` in the caller identifier, the caller script (e
The pipeline will run this separate script first but with the same parameters as the caller script, and the directory containing its output will be passed as an argument to the original caller script. (For example, the directory containing the output of the `gatk` special script will be passed as a parameter to the `gatk-snp` caller script.)

By providing multiple dashes in your caller identifiers using this scheme, you may design complicated caller script hierarchies involving multiple levels of nesting. In a sense, you can create small pipelines within the pipeline.
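As an illustration of this layout (extending the `gatk` example from above; the `gatk-indel` script name mirrors the caller IDs used in the configs):

```
callers/
├── gatk        # special script: runs first, produces a shared output directory
├── gatk-snp    # caller script: receives the gatk output directory as a parameter
└── gatk-indel  # caller script: likewise, for indels
```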

### Caller script dependencies
If your caller script requires specific dependencies, just add them to the [prepare.yml](/envs/prepare.yml) [conda environment file](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually).
Any packages listed in this file will be available to your caller script when it is executed.

If you would like to use a newer version of an existing variant caller, just change the version specified in the [prepare.yml](/envs/prepare.yml) file. Snakemake will automatically update the package upon your next execution of the pipeline.
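For example, adding a hypothetical new caller dependency would just mean appending an entry to prepare.yml in the same `channel::package==version` style used throughout the env files (the package and version here are illustrative, not part of the PR):

```
channels:
  - conda-forge
  - nodefaults
dependencies:
  # ...existing entries...
  - bioconda::freebayes==1.3.2
```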
21 changes: 8 additions & 13 deletions callers/pindel
@@ -14,18 +14,13 @@ samp="$5"


# we sleep between each command to avoid potential problems with filesystem latency
pindel -f "$genome" -i <(echo "$bam" 300 "$samp") -j "$peaks" -o "$output_dir/" -L "$output_dir/log" && \
pindel -f "$genome" -i <(echo "$bam" 300 "$samp") -j "$peaks" -o "$output_dir/" -L "$output_dir/log" >"$output_dir/stdout" && \
sleep 10 && \
for file in _D _INV _SI _TD; do
pindel2vcf -p "$output_dir/$file" -r "$genome" -R x -d 00000000 -v "$output_dir/$file.raw.vcf" && \
sleep 10 && \
bgzip -f "$output_dir/$file.raw.vcf" && \
sleep 10 && \
tabix -p vcf -f "$output_dir/$file.raw.vcf.gz" && \
sleep 10 && \
bcftools norm -d all "$output_dir/$file.raw.vcf.gz" > "$output_dir/$file.vcf" && \
sleep 10 && \
bgzip -f "$output_dir/$file.vcf" && tabix -p vcf -f "$output_dir/$file.vcf.gz"
done && \
pindel2vcf -P "$output_dir/" -r "$genome" -R x -d 00000000 -v "$output_dir/pindel.raw.vcf" && \
sleep 10 && \
bcftools concat -a -d all "$output_dir/"{_D,_INV,_SI,_TD}.vcf.gz > "$output_dir/pindel.vcf"
bgzip -f "$output_dir/pindel.raw.vcf" && \
sleep 10 && \
tabix -p vcf -f "$output_dir/pindel.raw.vcf.gz" && \
sleep 10 && \
bcftools norm -d all "$output_dir/pindel.raw.vcf.gz" > "$output_dir/pindel.vcf"

3 changes: 2 additions & 1 deletion callers/varscan
@@ -13,4 +13,5 @@ output_dir="$4"

samtools mpileup -l "$peaks" -f "$genome" -B "$bam" | \
varscan mpileup2cns --p-value 1 --strand-filter 0 --output-vcf > "$output_dir/varscan.vcf" && \
bgzip -f "$output_dir/varscan.vcf" && tabix -p vcf -f "$output_dir/varscan.vcf.gz"
awk -F $"\t" -v 'OFS=\t' '/^#/ || $4 !~ /^(R|Y|M|W|S|K|V|H|D|B|N)$/' "$output_dir/varscan.vcf" | \
bgzip > "$output_dir/varscan.vcf.gz" && tabix -p vcf -f "$output_dir/varscan.vcf.gz"
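A quick way to sanity-check the IUPAC filter introduced above (the records are hypothetical; the `R` ambiguity code in the REF column should be dropped while the plain `A` record survives):

```
printf '#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tT\nchr1\t200\t.\tR\tA\n' | \
  awk -F '\t' -v 'OFS=\t' '/^#/ || $4 !~ /^(R|Y|M|W|S|K|V|H|D|B|N)$/'
```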
16 changes: 8 additions & 8 deletions configs/classify.yaml
@@ -43,13 +43,13 @@ data:
- 'gatk-indel~DP>10' # we only train on sites with a read depth above 10
# here are some example test samples, but you can replace these with whichever
# samples for which you'd like to predict variants
molt4_chr1:
path: out/merged_indel/molt4_chr1/final.tsv.gz
merged: out/merged_indel/molt4_chr1/merge.tsv.gz
molt4_mini:
path: out/merged_indel/molt4_mini/final.tsv.gz
merged: out/merged_indel/molt4_mini/merge.tsv.gz
filter: *filter # use the same filtering expression as the SRR891269 data
jurkat_chr1:
path: out/merged_indel/jurkat_chr1/final.tsv.gz
merged: out/merged_indel/jurkat_chr1/merge.tsv.gz
jurkat_mini:
path: out/merged_indel/jurkat_mini/final.tsv.gz
merged: out/merged_indel/jurkat_mini/merge.tsv.gz
filter: *filter # use the same filtering expression as the SRR891269 data
# # uncomment the lines below if you want to reproduce our results
# SRR891269_evens:
@@ -87,8 +87,8 @@ tune: false
# You may optionally specify a truth set for each dataset if you want it to
# serve as test data (see "truth" above).
# This config entry is required!
predict: [molt4_chr1, jurkat_chr1]
# predict: [molt4_chr1, jurkat_chr1, SRR891269_evens] # uncomment this line if you want to reproduce our results
predict: [molt4_mini, jurkat_mini]
# predict: [SRR891269_evens] # uncomment this line if you want to reproduce our results

# Callers to use when creating a VCF from the merged .tsv.gz file.
# Supply a list of caller names in the order in which they should be considered
31 changes: 20 additions & 11 deletions configs/config.yaml
@@ -8,6 +8,9 @@
# Note: this file is written in the YAML syntax (https://learnxinyminutes.com/docs/yaml/)


#---------------------------------------- REQUIRED INPUT -----------------------------------------------#
# If you are executing the pipeline on your own data, you MUST edit these config values

# The path to a text file specifying where to find read information for each sample
# Each row in the sample file should represent a different sample.
# The sample file should have 3 columns (each separated by a single tab):
@@ -26,14 +29,27 @@
# this is a required input!
sample_file: data/samples.tsv

# Which samples from the samples_file should we execute the pipeline on?
# Which samples from the sample_file should we execute the pipeline on?
# Comment out this line or set it to a falsey value if you want to run all samples in the sample file
SAMP_NAMES: [molt4_chr1, jurkat_chr1]
SAMP_NAMES: [molt4_mini, jurkat_mini]

# The path to a reference genome
# The BWA index files, samtools index file (.fa.fai), and samtools dictionary file (.dict) must be stored in the same directory
# The BWA index files, samtools index file (.fai), and GATK dictionary file (.dict) must be stored in the same directory
# Note: your fasta file cannot be gzipped!
# required!
genome: data/hg38.mini.fa
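A sketch of creating the required companion files next to the fasta, in case they are missing (standard tool invocations, not part of this PR):

```
bwa index data/hg38.mini.fa                          # BWA index files
samtools faidx data/hg38.mini.fa                     # .fai index
gatk CreateSequenceDictionary -R data/hg38.mini.fa   # .dict file
```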

# The path to trained, ensemble classifiers, which can be used for calling SNVs and
# indels
# Unless you have created your own models, you should just use the ones packaged in the
# example dataset and leave these paths unchanged.
# required!
genome: data/hg38.chr1.fa
snp_model: data/snp.rda
indel_model: data/indel.rda


#------------------------------------------- OPTIONS ---------------------------------------------------#
# If you are executing the pipeline on your own data, you can usually just leave these unchanged.

# The path to the directory in which to place all of the output files
# Defined relative to whatever directory you execute the snakemake command in
@@ -74,10 +90,3 @@ indel_callers: [gatk-indel, varscan-indel, vardict-indel, pindel, illumina-strel
snp_filter: ['gatk-snp~DP>10']
indel_filter: ['gatk-indel~DP>10']

# The path to a trained, ensemble classifier, which can be used for calling SNVs
# required!
snp_model: data/snp.rda

# The path to a trained, ensemble classifier, which can be used for calling indels
# required!
indel_model: data/indel.rda
23 changes: 2 additions & 21 deletions configs/prepare.yaml
@@ -30,7 +30,8 @@ sample_file: data/samples.tsv
SAMP_NAMES: [SRR891269]

# The path to a reference genome
# The BWA index files, samtools index file (.fa.fai), and samtools dictionary file (.dict) must be stored in the same directory
# The BWA index files, samtools index file (.fai), and GATK dictionary file (.dict) must be stored in the same directory
# Note: your fasta file cannot be gzipped!
# required!
genome: /iblm/netapp/data1/external/GRC37/combined/bwa_index_assembly19/Homo_sapiens_assembly19.fasta # replace this with the path to your own reference genome

@@ -62,20 +63,6 @@ indel_callers: [gatk-indel, varscan-indel, vardict-indel, pindel, illumina-strel
# If this value is not provided, it will default to "."
label: "."

# Whether to "normalize" the VCF output of every caller script
# This option is not recommended if you plan to feed the data to a random forest
# Normalization usually involves left-alignment and trimming of variants
# See https://genome.sph.umich.edu/wiki/Variant_Normalization for more info
# In our pipeline, the normalization step also reduces counts of variants that appear at the same position to 1
# If this value is not provided, it will default to false
no_normalize: false

# Whether to replace NA values in the dataset before outputting the final TSV
# This option is not recommended if you plan to feed the data to a random forest
# Unless otherwise specified in the Caller Specific Parameters below, NA values will be replaced with 0
# If this value is not provided, it will default to false
keep_na: false

# A list of filtering expressions for filtering the sites before outputting the final TSV
# A filtering expression consists of the following concatenated together:
# - the caller id
@@ -87,9 +74,3 @@
snp_filter: ['gatk-snp~DP>10']
indel_filter: ['gatk-indel~DP>10']

# Whether to normalize weird numerical columns before outputting the final TSV
# (ex: scientific notation or numbers followed by a percent sign)
# We recommended setting this to false in order to allow normalization if you
# plan to feed the data to a random forest
# If this option is commented out, it will default to false
pure_numerics: false
11 changes: 8 additions & 3 deletions envs/classify.yml
@@ -1,15 +1,20 @@
channels:
- defaults
- conda-forge
- nodefaults
dependencies:
- r-dplyr==0.8.5
- r-r.devices==2.16.1
- r-ranger==0.11.2
- r-earth==5.1.2
- conda-forge::coreutils==8.31
- coreutils==8.31
- r-mlr==2.17.1
- r-parallelmap==1.5.0
- r==3.5.1
- r-readr==1.1.1
- r-data.table==1.12.8
- r-plyr==1.8.4
- conda-forge::gawk==5.1.0
- gawk==5.1.0
- conda-forge::coreutils==8.31
- conda-forge::sed==4.8
- conda-forge::bash==5.0.018
- conda-forge::gzip==1.10
17 changes: 11 additions & 6 deletions envs/prc.yml
@@ -1,11 +1,16 @@
channels:
- conda-forge
- nodefaults
dependencies:
- numpy==1.18.5
- pandas==1.0.4
- scikit-learn==0.23.1
- matplotlib==3.1.2
- coreutils==8.31
- conda-forge::numpy==1.18.5
- conda-forge::pandas==1.0.4
- conda-forge::scikit-learn==0.23.1
- conda-forge::matplotlib==3.1.2
- conda-forge::coreutils==8.31
- bioconda::pysam==0.15.3
- bioconda::bcftools==1.10.2
- gawk==5.1.0
- conda-forge::gawk==5.1.0
- conda-forge::coreutils==8.31
- conda-forge::sed==4.8
- conda-forge::bash==5.0.018
- conda-forge::gzip==1.10
50 changes: 27 additions & 23 deletions envs/prepare.yml
@@ -1,34 +1,38 @@
channels:
- defaults
- conda-forge
- nodefaults
dependencies:
- r-pbapply==1.4_2
- r-stringr==1.4.0
- r-readr==1.1.1
- conda-forge::r-pbapply==1.4_2
- conda-forge::r-stringr==1.4.0
- conda-forge::r-readr==1.1.1
- bioconda::bioconductor-rsamtools==1.34.0
- gawk==5.1.0
- tabix==0.2.6
- bcftools==1.9=ha228f0b_4
- conda-forge::gawk==5.1.0
- bioconda::tabix==0.2.6
- bioconda::bcftools==1.9=ha228f0b_4
- bioconda::delly==0.8.3
- bioconda::bioconductor-bsgenome.hsapiens.ucsc.hg19==1.4.0
- bioconda::bioconductor-biovizbase==1.30.1
- vcftools==0.1.16
- bioconda::vcftools==0.1.16
- conda-forge::r-rmutil==1.1.4
- gatk4==4.1.7.0
- r-lattice==0.20_41
- r-reshape==0.8.8
- vardict-java==1.7.0
- manta==1.6.0
- bwa==0.7.17
- r-data.table==1.12.8
- r-plyr==1.8.4
- samtools==1.9=h8571acd_11
- strelka==2.9.10
- bioconda::gatk4==4.1.7.0
- conda-forge::r-lattice==0.20_41
- conda-forge::r-reshape==0.8.8
- bioconda::vardict-java==1.7.0
- bioconda::manta==1.6.0
- bioconda::bwa==0.7.17
- conda-forge::r-data.table==1.12.8
- conda-forge::r-plyr==1.8.4
- bioconda::samtools==1.9=h8571acd_11
- bioconda::strelka==2.9.10
- bioconda::bioconductor-variantannotation==1.28.3
- pindel==0.2.5b9
- macs2==2.1.2
- r-dplyr==0.8.5
- bioconda::pindel==0.2.5b9
- bioconda::macs2==2.1.2
- conda-forge::r-dplyr==0.8.5
- bioconda::bioconductor-genomicalignments==1.18.1
- bedtools==2.29.2
- varscan==2.4.4
- bioconda::bedtools==2.29.2
- bioconda::varscan==2.4.4
- bioconda::bioconductor-rtracklayer==1.42.1
- conda-forge::coreutils==8.31
- conda-forge::sed==4.8
- conda-forge::bash==5.0.018
- conda-forge::gzip==1.10
2 changes: 1 addition & 1 deletion rules/README.md
@@ -103,7 +103,7 @@ First, split the truth dataset by chromosome parity using `awk` commands like th
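A minimal sketch of such a parity split (the input name and single header line are assumptions; the output names match the paragraph below, and chrX/chrY are ignored):

```
# odd-numbered chromosomes (name ends in an odd digit) for training
zcat indel.tsv.gz | awk -F '\t' 'NR==1 || $1 ~ /^chr[0-9]*[13579]$/' | gzip > indel-odds.tsv.gz
# even-numbered chromosomes for testing
zcat indel.tsv.gz | awk -F '\t' 'NR==1 || $1 ~ /^chr[0-9]*[02468]$/' | gzip > indel-evens.tsv.gz
```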

Then, specify in `classify.yaml` the path to the `indel-odds.tsv.gz` file as your training set and the path to the `indel-evens.tsv.gz` file as your test set. Make sure that both your training and test sets have a [truth caller ID](/configs/classify.yaml#L26) and that you've specified the test set ID in the list of `predict` datasets. Once you've finished, execute the `classify` subworkflow on its own (see [above](#execution-1)).

To truly reproduce our results, we also recommend predicting variants for the ATAC-seq cell lines we tested (Jurkat, Molt-4, K-562, etc) by downloading them from GEO accession GSE129086. Although the example data includes the Jurkat and Molt-4 samples, it only includes chromosome 1 from those cell lines, thereby artificially lowering the VarCA scores of the variants in those samples.
To truly reproduce our results, we also recommend predicting variants for the ATAC-seq cell lines we tested (Jurkat, Molt-4, K-562, etc) by downloading them from GEO accession GSE129086. Although the example data includes the Jurkat and Molt-4 samples, it only includes a small portion of chromosome 1 from those cell lines, thereby artificially lowering the VarCA scores of the variants in those samples.

The `classify` subworkflow can only work with one trained model at a time, so you will need to repeat these steps for SNVs. Just replace every mention of "indel" in `classify.yaml` with "snp". Also remember to use only the SNV callers (ie GATK, VarScan 2, and VarDict).

4 changes: 2 additions & 2 deletions rules/classify.smk
@@ -1,7 +1,7 @@
from snakemake.utils import min_version

##### set minimum snakemake version #####
min_version("5.18.0")
min_version("5.24.2")

if 'imported' not in config:
configfile: "configs/classify.yaml"
@@ -242,7 +242,7 @@ rule tsv2vcf:
output: temp(config['out']+"/{sample}/results.vcf.gz")
conda: "../envs/prc.yml"
shell:
"zcat {input.merge} | scripts/cgrep.bash - -E '^(CHROM|POS|REF)$|.*~(REF|ALT)$' | scripts/2vcf.py -o {output} {params.callers} {input.results}"
"scripts/2vcf.py -o {output} {params.callers} {input.results} {input.merge}"

rule fix_vcf_header:
""" add contigs to the header of the vcf """