Commits (30)

6eb8886  handle varscan iupac ambiguity codes (aryarm, Mar 27, 2021)
1ebe4f6  update install instructions to endorse mamba (aryarm, Apr 6, 2021)
01665bf  docs: new callers must be added to the env file (aryarm, Apr 26, 2021)
bd4e92f  add newline to callers README (aryarm, Apr 26, 2021)
7082e70  temporarily handle 2vcf exit code (see #28) (aryarm, Apr 29, 2021)
86e1453  handle pvals of 1 in random forest output when creating VCF (aryarm, Apr 29, 2021)
753f81a  add a script for analyzing allele conflicts (aryarm, Apr 29, 2021)
9e872de  update to snakemake v5.24.2 (aryarm, Jun 17, 2021)
75ee395  initial dockerfile attempt: untested (aryarm, Jun 17, 2021)
6121d43  Merge pull request #26 from aryarm/patch/varscan-iupac (aryarm, Jul 1, 2021)
af69606  subset columns before import in 2vcf.py; also resolve #28 (aryarm, Jul 2, 2021)
abfc6b7  add header to allele conflicts script (aryarm, Jul 2, 2021)
4167a82  be specific about the channels in each env file and remove default ch… (aryarm, Jul 2, 2021)
cbadb82  remove explicit channel specs in classify env but add to prc env (aryarm, Jul 3, 2021)
ff5bb00  Merge pull request #31 from aryarm/fix/2vcf-exit-code (aryarm, Jul 3, 2021)
366c71d  add sed, gzip, bash, and friends to the env files for reproducibility (aryarm, Jul 4, 2021)
248995b  Merge pull request #32 from aryarm/explicit-envs (aryarm, Jul 4, 2021)
54050bb  update install instructions to endorse mamba (aryarm, Apr 6, 2021)
5149592  docs: new callers must be added to the env file (aryarm, Apr 26, 2021)
7ba7780  add newline to callers README (aryarm, Apr 26, 2021)
786d7aa  update to snakemake v5.24.2 (aryarm, Jun 17, 2021)
5b34036  initial dockerfile attempt: untested (aryarm, Jun 17, 2021)
98e7a8a  Merge branch 'installation' of github.com:aryarm/varCA into installation (aryarm, Jul 6, 2021)
c59fe85  remove for-loop in pindel caller script (aryarm, Jul 11, 2021)
450afdc  create script for generating small test datasets (resolves #33) (aryarm, Jul 12, 2021)
546f012  both jurkat and molt4 test datasets must have at least 3 peaks to sat… (aryarm, Jul 12, 2021)
5b92328  make fillna.bash read from stdin and connect steps via pipes (resolve… (aryarm, Jul 12, 2021)
c936a5b  bump required snakemake version to 5.24.2 (aryarm, Jul 12, 2021)
3ea4d99  convert pipe to temp file (aryarm, Jul 16, 2021)
ac31acc  clarify required input in config (aryarm, Jul 22, 2022)
14 changes: 14 additions & 0 deletions Dockerfile
@@ -0,0 +1,14 @@
FROM snakemake/snakemake:v5.24.2

# work in the home directory
WORKDIR /home

# copy all the files to the container
COPY . .

# install the pipeline's dependencies
RUN snakemake --use-conda --conda-create-envs-only -j

# run varCA
# specify as a CMD, so it can be overridden by the user
CMD ["./run.bash"]
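For context, here is a minimal sketch of how this image might be built and run (the `varca` tag and the mounted host paths are assumptions, not part of the PR):

```
# build the image from the repository root
docker build -t varca .

# run the pipeline; mount your data and output directories into /home,
# since the Dockerfile copies the repository there
docker run --rm -v "$PWD/data":/home/data -v "$PWD/out":/home/out varca
```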
11 changes: 6 additions & 5 deletions README.md
@@ -1,4 +1,4 @@
[![Snakemake](https://img.shields.io/badge/snakemake-5.18.0-brightgreen.svg?style=flat-square)](https://snakemake.readthedocs.io/)
[![Snakemake](https://img.shields.io/badge/snakemake-5.24.2-brightgreen.svg?style=flat-square)](https://snakemake.readthedocs.io/)
[![License](https://img.shields.io/apm/l/vim-mode.svg)](LICENSE)

# varCA
@@ -21,11 +21,12 @@ wget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz
```

# setup
The pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). We recommend installing version 5.18.0:
The pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). For most users, we recommend installing Snakemake via mamba as [described in their documentation](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda-mamba).

However, if your aim is complete reproducibility, we recommend installing version 5.24.2 via this command:
```
conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'
conda create -n snakemake -c conda-forge --no-channel-priority 'bioconda::snakemake==5.24.2'
```
We highly recommend you install [Snakemake via conda](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda) like this so that you can use the `--use-conda` flag when calling `snakemake` to let it [automatically handle all dependencies](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) of the pipeline. Otherwise, you must manually install the dependencies listed in the [env files](envs).
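For reference, the mamba route mentioned above usually looks something like the following (a sketch based on the Snakemake installation docs; the environment name is an assumption):

```
mamba create -c conda-forge -c bioconda -n snakemake snakemake
```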

# execution
1. Activate snakemake via `conda`:
@@ -52,7 +53,7 @@ You must modify [the config.yaml file](configs#configyaml) to specify paths to y
The pipeline is made up of [two subworkflows](rules). These are usually executed together automatically by the master pipeline, but they can also be executed on their own for more advanced usage. See the [rules README](rules/README.md) for execution instructions and a description of the outputs. You will need to execute the subworkflows separately [if you ever want to create your own trained models](rules#training-and-testing-varca).

#### Reproducing our results
We provide the example data so that you may quickly (in ~1 hr, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow [these steps](rules#testing-your-model--reproducing-our-results) to create all of the plots and tables in our paper.
We provide the example data so that you may quickly (in ~15 mins, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow [these steps](rules#testing-your-model--reproducing-our-results) to create all of the plots and tables in our paper.

### If this is your first time using Snakemake
We recommend that you run `snakemake --help` to learn about Snakemake's options. For example, to check that the pipeline will be executed correctly before you run it, you can call Snakemake with the `-n -p -r` flags. This is also a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are inputs to the first rule in each workflow -- ie the `all` rule).
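To make that concrete, a dry run of the pipeline might look like this (a sketch; the core count passed to `-j` is an assumption):

```
conda activate snakemake
# -n: dry run, -p: print shell commands, -r: show the reason for each rule
snakemake -n -p -r --use-conda -j 4
```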
2 changes: 1 addition & 1 deletion Snakefile
@@ -2,7 +2,7 @@ import warnings
from snakemake.utils import min_version

##### set minimum snakemake version #####
min_version("5.18.0")
min_version("5.24.2")

configfile: "configs/config.yaml"
configfile: "configs/callers.yaml"
6 changes: 6 additions & 0 deletions callers/README.md
@@ -40,3 +40,9 @@ By providing a dash character `-` in the caller identifier, the caller script (e
The pipeline will run this separate script first but with the same parameters as the caller script, and the directory containing its output will be passed as an argument to the original caller script. (For example, the directory containing the output of the `gatk` special script will be passed as a parameter to the `gatk-snp` caller script.)

By providing multiple dashes in your caller identifiers using this scheme, you may design complicated caller script hierarchies involving multiple levels of nesting. In a sense, you can create small pipelines within the pipeline.
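As an illustration of this layout (extending the `gatk` example from above; the `gatk-indel` script name mirrors the caller IDs used in the configs):

```
callers/
├── gatk        # special script: runs first, produces a shared output directory
├── gatk-snp    # caller script: receives the gatk output directory as a parameter
└── gatk-indel  # caller script: likewise, for indels
```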

### Caller script dependencies
If your caller script requires specific dependencies, just add them to the [prepare.yml](/envs/prepare.yml) [conda environment file](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually).
Any packages listed in this file will be available to your caller script when it is executed.

If you would like to use a newer version of an existing variant caller, just change the version specified in the [prepare.yml](/envs/prepare.yml) file. Snakemake will automatically update the package upon your next execution of the pipeline.
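For example, adding a hypothetical new caller dependency would just mean appending an entry to prepare.yml in the same `channel::package==version` style used throughout the env files (the package and version here are illustrative, not part of the PR):

```
channels:
  - conda-forge
  - nodefaults
dependencies:
  # ...existing entries...
  - bioconda::freebayes==1.3.2
```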
21 changes: 8 additions & 13 deletions callers/pindel
@@ -14,18 +14,13 @@ samp="$5"


# we sleep between each command to avoid potential problems with filesystem latency
pindel -f "$genome" -i <(echo "$bam" 300 "$samp") -j "$peaks" -o "$output_dir/" -L "$output_dir/log" && \
pindel -f "$genome" -i <(echo "$bam" 300 "$samp") -j "$peaks" -o "$output_dir/" -L "$output_dir/log" >"$output_dir/stdout" && \
sleep 10 && \
for file in _D _INV _SI _TD; do
pindel2vcf -p "$output_dir/$file" -r "$genome" -R x -d 00000000 -v "$output_dir/$file.raw.vcf" && \
sleep 10 && \
bgzip -f "$output_dir/$file.raw.vcf" && \
sleep 10 && \
tabix -p vcf -f "$output_dir/$file.raw.vcf.gz" && \
sleep 10 && \
bcftools norm -d all "$output_dir/$file.raw.vcf.gz" > "$output_dir/$file.vcf" && \
sleep 10 && \
bgzip -f "$output_dir/$file.vcf" && tabix -p vcf -f "$output_dir/$file.vcf.gz"
done && \
pindel2vcf -P "$output_dir/" -r "$genome" -R x -d 00000000 -v "$output_dir/pindel.raw.vcf" && \
sleep 10 && \
bcftools concat -a -d all "$output_dir/"{_D,_INV,_SI,_TD}.vcf.gz > "$output_dir/pindel.vcf"
bgzip -f "$output_dir/pindel.raw.vcf" && \
sleep 10 && \
tabix -p vcf -f "$output_dir/pindel.raw.vcf.gz" && \
sleep 10 && \
bcftools norm -d all "$output_dir/pindel.raw.vcf.gz" > "$output_dir/pindel.vcf"

3 changes: 2 additions & 1 deletion callers/varscan
@@ -13,4 +13,5 @@ output_dir="$4"

samtools mpileup -l "$peaks" -f "$genome" -B "$bam" | \
varscan mpileup2cns --p-value 1 --strand-filter 0 --output-vcf > "$output_dir/varscan.vcf" && \
bgzip -f "$output_dir/varscan.vcf" && tabix -p vcf -f "$output_dir/varscan.vcf.gz"
awk -F $"\t" -v 'OFS=\t' '/^#/ || $4 !~ /^(R|Y|M|W|S|K|V|H|D|B|N)$/' "$output_dir/varscan.vcf" | \
bgzip > "$output_dir/varscan.vcf.gz" && tabix -p vcf -f "$output_dir/varscan.vcf.gz"
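A quick way to sanity-check the IUPAC filter introduced above (the records are hypothetical; the `R` ambiguity code in the REF column should be dropped while the plain `A` record survives):

```
printf '#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tT\nchr1\t200\t.\tR\tA\n' | \
  awk -F '\t' -v 'OFS=\t' '/^#/ || $4 !~ /^(R|Y|M|W|S|K|V|H|D|B|N)$/'
```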
16 changes: 8 additions & 8 deletions configs/classify.yaml
@@ -43,13 +43,13 @@ data:
- 'gatk-indel~DP>10' # we only train on sites with a read depth above 10
# here are some example test samples, but you can replace these with whichever
# samples for which you'd like to predict variants
molt4_chr1:
path: out/merged_indel/molt4_chr1/final.tsv.gz
merged: out/merged_indel/molt4_chr1/merge.tsv.gz
molt4_mini:
path: out/merged_indel/molt4_mini/final.tsv.gz
merged: out/merged_indel/molt4_mini/merge.tsv.gz
filter: *filter # use the same filtering expression as the SRR891269 data
jurkat_chr1:
path: out/merged_indel/jurkat_chr1/final.tsv.gz
merged: out/merged_indel/jurkat_chr1/merge.tsv.gz
jurkat_mini:
path: out/merged_indel/jurkat_mini/final.tsv.gz
merged: out/merged_indel/jurkat_mini/merge.tsv.gz
filter: *filter # use the same filtering expression as the SRR891269 data
# # uncomment the lines below if you want to reproduce our results
# SRR891269_evens:
@@ -87,8 +87,8 @@ tune: false
# You may optionally specify a truth set for each dataset if you want it to
# serve as test data (see "truth" above).
# This config entry is required!
predict: [molt4_chr1, jurkat_chr1]
# predict: [molt4_chr1, jurkat_chr1, SRR891269_evens] # uncomment this line if you want to reproduce our results
predict: [molt4_mini, jurkat_mini]
# predict: [SRR891269_evens] # uncomment this line if you want to reproduce our results

# Callers to use when creating a VCF from the merged .tsv.gz file.
# Supply a list of caller names in the order in which they should be considered
31 changes: 20 additions & 11 deletions configs/config.yaml
@@ -8,6 +8,9 @@
# Note: this file is written in the YAML syntax (https://learnxinyminutes.com/docs/yaml/)


#---------------------------------------- REQUIRED INPUT -----------------------------------------------#
# If you are executing the pipeline on your own data, you MUST edit these config values

# The path to a text file specifying where to find read information for each sample
# Each row in the sample file should represent a different sample.
# The sample file should have 3 columns (each separated by a single tab):
@@ -26,14 +29,27 @@
# this is a required input!
sample_file: data/samples.tsv

# Which samples from the samples_file should we execute the pipeline on?
# Which samples from the sample_file should we execute the pipeline on?
# Comment out this line or set it to a falsey value if you want to run all samples in the sample file
SAMP_NAMES: [molt4_chr1, jurkat_chr1]
SAMP_NAMES: [molt4_mini, jurkat_mini]

# The path to a reference genome
# The BWA index files, samtools index file (.fa.fai), and samtools dictionary file (.dict) must be stored in the same directory
# The BWA index files, samtools index file (.fai), and GATK dictionary file (.dict) must be stored in the same directory
# Note: your fasta file cannot be gzipped!
# required!
genome: data/hg38.mini.fa
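A sketch of creating the required companion files next to the fasta, in case they are missing (standard tool invocations, not part of this PR):

```
bwa index data/hg38.mini.fa                          # BWA index files
samtools faidx data/hg38.mini.fa                     # .fai index
gatk CreateSequenceDictionary -R data/hg38.mini.fa   # .dict file
```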

# The path to trained, ensemble classifiers, which can be used for calling SNVs and
# indels
# Unless you have created your own models, you should just use the ones packaged in the
# example dataset and leave these paths unchanged.
# required!
genome: data/hg38.chr1.fa
snp_model: data/snp.rda
indel_model: data/indel.rda


#------------------------------------------- OPTIONS ---------------------------------------------------#
# If you are executing the pipeline on your own data, you can usually just leave these unchanged.

# The path to the directory in which to place all of the output files
# Defined relative to whatever directory you execute the snakemake command in
@@ -74,10 +90,3 @@ indel_callers: [gatk-indel, varscan-indel, vardict-indel, pindel, illumina-strel
snp_filter: ['gatk-snp~DP>10']
indel_filter: ['gatk-indel~DP>10']

# The path to a trained, ensemble classifier, which can be used for calling SNVs
# required!
snp_model: data/snp.rda

# The path to a trained, ensemble classifier, which can be used for calling indels
# required!
indel_model: data/indel.rda
23 changes: 2 additions & 21 deletions configs/prepare.yaml
@@ -30,7 +30,8 @@ sample_file: data/samples.tsv
SAMP_NAMES: [SRR891269]

# The path to a reference genome
# The BWA index files, samtools index file (.fa.fai), and samtools dictionary file (.dict) must be stored in the same directory
# The BWA index files, samtools index file (.fai), and GATK dictionary file (.dict) must be stored in the same directory
# Note: your fasta file cannot be gzipped!
# required!
genome: /iblm/netapp/data1/external/GRC37/combined/bwa_index_assembly19/Homo_sapiens_assembly19.fasta # replace this with the path to your own reference genome

@@ -62,20 +63,6 @@ indel_callers: [gatk-indel, varscan-indel, vardict-indel, pindel, illumina-strel
# If this value is not provided, it will default to "."
label: "."

# Whether to "normalize" the VCF output of every caller script
# This option is not recommended if you plan to feed the data to a random forest
# Normalization usually involves left-alignment and trimming of variants
# See https://genome.sph.umich.edu/wiki/Variant_Normalization for more info
# In our pipeline, the normalization step also reduces counts of variants that appear at the same position to 1
# If this value is not provided, it will default to false
no_normalize: false

# Whether to replace NA values in the dataset before outputting the final TSV
# This option is not recommended if you plan to feed the data to a random forest
# Unless otherwise specified in the Caller Specific Parameters below, NA values will be replaced with 0
# If this value is not provided, it will default to false
keep_na: false

# A list of filtering expressions for filtering the sites before outputting the final TSV
# A filtering expression consists of the following concatenated together:
# - the caller id
@@ -87,9 +74,3 @@
snp_filter: ['gatk-snp~DP>10']
indel_filter: ['gatk-indel~DP>10']

# Whether to normalize weird numerical columns before outputting the final TSV
# (ex: scientific notation or numbers followed by a percent sign)
# We recommended setting this to false in order to allow normalization if you
# plan to feed the data to a random forest
# If this option is commented out, it will default to false
pure_numerics: false
11 changes: 8 additions & 3 deletions envs/classify.yml
@@ -1,15 +1,20 @@
channels:
- defaults
- conda-forge
- nodefaults
dependencies:
- r-dplyr==0.8.5
- r-r.devices==2.16.1
- r-ranger==0.11.2
- r-earth==5.1.2
- conda-forge::coreutils==8.31
- coreutils==8.31
- r-mlr==2.17.1
- r-parallelmap==1.5.0
- r==3.5.1
- r-readr==1.1.1
- r-data.table==1.12.8
- r-plyr==1.8.4
- conda-forge::gawk==5.1.0
- gawk==5.1.0
- conda-forge::coreutils==8.31
- conda-forge::sed==4.8
- conda-forge::bash==5.0.018
- conda-forge::gzip==1.10
17 changes: 11 additions & 6 deletions envs/prc.yml
@@ -1,11 +1,16 @@
channels:
- conda-forge
- nodefaults
dependencies:
- numpy==1.18.5
- pandas==1.0.4
- scikit-learn==0.23.1
- matplotlib==3.1.2
- coreutils==8.31
- conda-forge::numpy==1.18.5
- conda-forge::pandas==1.0.4
- conda-forge::scikit-learn==0.23.1
- conda-forge::matplotlib==3.1.2
- conda-forge::coreutils==8.31
- bioconda::pysam==0.15.3
- bioconda::bcftools==1.10.2
- gawk==5.1.0
- conda-forge::gawk==5.1.0
- conda-forge::coreutils==8.31
- conda-forge::sed==4.8
- conda-forge::bash==5.0.018
- conda-forge::gzip==1.10
50 changes: 27 additions & 23 deletions envs/prepare.yml
@@ -1,34 +1,38 @@
channels:
- defaults
- conda-forge
- nodefaults
dependencies:
- r-pbapply==1.4_2
- r-stringr==1.4.0
- r-readr==1.1.1
- conda-forge::r-pbapply==1.4_2
- conda-forge::r-stringr==1.4.0
- conda-forge::r-readr==1.1.1
- bioconda::bioconductor-rsamtools==1.34.0
- gawk==5.1.0
- tabix==0.2.6
- bcftools==1.9=ha228f0b_4
- conda-forge::gawk==5.1.0
- bioconda::tabix==0.2.6
- bioconda::bcftools==1.9=ha228f0b_4
- bioconda::delly==0.8.3
- bioconda::bioconductor-bsgenome.hsapiens.ucsc.hg19==1.4.0
- bioconda::bioconductor-biovizbase==1.30.1
- vcftools==0.1.16
- bioconda::vcftools==0.1.16
- conda-forge::r-rmutil==1.1.4
- gatk4==4.1.7.0
- r-lattice==0.20_41
- r-reshape==0.8.8
- vardict-java==1.7.0
- manta==1.6.0
- bwa==0.7.17
- r-data.table==1.12.8
- r-plyr==1.8.4
- samtools==1.9=h8571acd_11
- strelka==2.9.10
- bioconda::gatk4==4.1.7.0
- conda-forge::r-lattice==0.20_41
- conda-forge::r-reshape==0.8.8
- bioconda::vardict-java==1.7.0
- bioconda::manta==1.6.0
- bioconda::bwa==0.7.17
- conda-forge::r-data.table==1.12.8
- conda-forge::r-plyr==1.8.4
- bioconda::samtools==1.9=h8571acd_11
- bioconda::strelka==2.9.10
- bioconda::bioconductor-variantannotation==1.28.3
- pindel==0.2.5b9
- macs2==2.1.2
- r-dplyr==0.8.5
- bioconda::pindel==0.2.5b9
- bioconda::macs2==2.1.2
- conda-forge::r-dplyr==0.8.5
- bioconda::bioconductor-genomicalignments==1.18.1
- bedtools==2.29.2
- varscan==2.4.4
- bioconda::bedtools==2.29.2
- bioconda::varscan==2.4.4
- bioconda::bioconductor-rtracklayer==1.42.1
- conda-forge::coreutils==8.31
- conda-forge::sed==4.8
- conda-forge::bash==5.0.018
- conda-forge::gzip==1.10
2 changes: 1 addition & 1 deletion rules/README.md
@@ -103,7 +103,7 @@ First, split the truth dataset by chromosome parity using `awk` commands like th
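A minimal sketch of such a parity split (the input name and single header line are assumptions; the output names match the paragraph below, and chrX/chrY are ignored):

```
# odd-numbered chromosomes (name ends in an odd digit) for training
zcat indel.tsv.gz | awk -F '\t' 'NR==1 || $1 ~ /^chr[0-9]*[13579]$/' | gzip > indel-odds.tsv.gz
# even-numbered chromosomes for testing
zcat indel.tsv.gz | awk -F '\t' 'NR==1 || $1 ~ /^chr[0-9]*[02468]$/' | gzip > indel-evens.tsv.gz
```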

Then, specify in `classify.yaml` the path to the `indel-odds.tsv.gz` file as your training set and the path to the `indel-evens.tsv.gz` file as your test set. Make sure that both your training and test sets have a [truth caller ID](/configs/classify.yaml#L26) and that you've specified the test set ID in the list of `predict` datasets. Once you've finished, execute the `classify` subworkflow on its own (see [above](#execution-1)).

To truly reproduce our results, we also recommend predicting variants for the ATAC-seq cell lines we tested (Jurkat, Molt-4, K-562, etc) by downloading them from GEO accession GSE129086. Although the example data includes the Jurkat and Molt-4 samples, it only includes chromosome 1 from those cell lines, thereby artificially lowering the VarCA scores of the variants in those samples.
To truly reproduce our results, we also recommend predicting variants for the ATAC-seq cell lines we tested (Jurkat, Molt-4, K-562, etc) by downloading them from GEO accession GSE129086. Although the example data includes the Jurkat and Molt-4 samples, it only includes a small portion of chromosome 1 from those cell lines, thereby artificially lowering the VarCA scores of the variants in those samples.

The `classify` subworkflow can only work with one trained model at a time, so you will need to repeat these steps for SNVs. Just replace every mention of "indel" in `classify.yaml` with "snp". Also remember to use only the SNV callers (ie GATK, VarScan 2, and VarDict).

4 changes: 2 additions & 2 deletions rules/classify.smk
@@ -1,7 +1,7 @@
from snakemake.utils import min_version

##### set minimum snakemake version #####
min_version("5.18.0")
min_version("5.24.2")

if 'imported' not in config:
configfile: "configs/classify.yaml"
@@ -242,7 +242,7 @@ rule tsv2vcf:
output: temp(config['out']+"/{sample}/results.vcf.gz")
conda: "../envs/prc.yml"
shell:
"zcat {input.merge} | scripts/cgrep.bash - -E '^(CHROM|POS|REF)$|.*~(REF|ALT)$' | scripts/2vcf.py -o {output} {params.callers} {input.results}"
"scripts/2vcf.py -o {output} {params.callers} {input.results} {input.merge}"

rule fix_vcf_header:
""" add contigs to the header of the vcf """