From 6eb888613861cac204c9a9ab7a60432819d612fd Mon Sep 17 00:00:00 2001
From: Arya Massarat <23412689+aryarm@users.noreply.github.com>
Date: Sat, 27 Mar 2021 09:23:11 -0700
Subject: [PATCH 01/26] handle varscan iupac ambiguity codes

resolves #25
---
 callers/varscan | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/callers/varscan b/callers/varscan
index 2b28adc..78a9a04 100755
--- a/callers/varscan
+++ b/callers/varscan
@@ -13,4 +13,5 @@ output_dir="$4"
 
 samtools mpileup -l "$peaks" -f "$genome" -B "$bam" | \
 varscan mpileup2cns --p-value 1 --strand-filter 0 --output-vcf > "$output_dir/varscan.vcf" && \
-bgzip -f "$output_dir/varscan.vcf" && tabix -p vcf -f "$output_dir/varscan.vcf.gz"
+awk -F $'\t' -v 'OFS=\t' '/^#/ || $4 !~ /^(R|Y|M|W|S|K|V|H|D|B|N)$/' "$output_dir/varscan.vcf" | \
+bgzip > "$output_dir/varscan.vcf.gz" && tabix -p vcf -f "$output_dir/varscan.vcf.gz"

From 1ebe4f6c54ba67e084d76c66279320b1dddea062 Mon Sep 17 00:00:00 2001
From: Arya Massarat <23412689+aryarm@users.noreply.github.com>
Date: Tue, 6 Apr 2021 11:40:27 -0700
Subject: [PATCH 02/26] update install instructions to endorse mamba

based on comments made in https://github.com/aryarm/varCA/issues/8#issuecomment-814290102
---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 8e06620..3083a68 100644
--- a/README.md
+++ b/README.md
@@ -21,11 +21,12 @@ wget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz
 ```
 
 # setup
-The pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). We recommend installing version 5.18.0:
+The pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). For most users, we recommend installing Snakemake via mamba as [described in their documentation](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda-mamba).
+
+However, if your aim is complete reproducibility, we recommend installing version 5.18.0 via this command:
 ```
 conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'
 ```
-We highly recommend you install [Snakemake via conda](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda) like this so that you can use the `--use-conda` flag when calling `snakemake` to let it [automatically handle all dependencies](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) of the pipeline. Otherwise, you must manually install the dependencies listed in the [env files](envs).
 
 # execution
 1. Activate snakemake via `conda`:

From 01665bf2b791ab5e687097517e81ba670f86bb76 Mon Sep 17 00:00:00 2001
From: Arya Massarat <23412689+aryarm@users.noreply.github.com>
Date: Mon, 26 Apr 2021 12:29:16 -0700
Subject: [PATCH 03/26] docs: new callers must be added to the env file

---
 callers/README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/callers/README.md b/callers/README.md
index 179465a..7cafb03 100644
--- a/callers/README.md
+++ b/callers/README.md
@@ -40,3 +40,8 @@ By providing a dash character `-` in the caller identifier, the caller script (e
 The pipeline will run this separate script first but with the same parameters as the caller script, and the directory containing its output will be passed as an argument to the original caller script.
 (For example, the directory containing the output of the `gatk` special script will be passed as a parameter to the `gatk-snp` caller script.)
 By providing multiple dashes in your caller identifiers using this scheme, you may design complicated caller script hierarchies involving multiple levels of nesting. In a sense, you can create small pipelines within the pipeline.
+
+### Caller script dependencies
+If your caller script requires specific dependencies, just add them to the [prepare.yml](/envs/prepare.yml) [conda environment file](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually).
+Any packages listed in this file will be available to your caller script when it is executed.
+If you would like to use a newer version of an existing variant caller, just change the version specified in the [prepare.yml](/envs/prepare.yml) file. Snakemake will automatically update the package upon your next execution of the pipeline.

From bd4e92f48c4a9b2ccfba0494882c6dea59f0b26b Mon Sep 17 00:00:00 2001
From: Arya Massarat <23412689+aryarm@users.noreply.github.com>
Date: Mon, 26 Apr 2021 12:37:27 -0700
Subject: [PATCH 04/26] add newline to callers README

---
 callers/README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/callers/README.md b/callers/README.md
index 7cafb03..c312831 100644
--- a/callers/README.md
+++ b/callers/README.md
@@ -44,4 +44,5 @@ By providing multiple dashes in your caller identifiers using this scheme, you m
 ### Caller script dependencies
 If your caller script requires specific dependencies, just add them to the [prepare.yml](/envs/prepare.yml) [conda environment file](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually).
 Any packages listed in this file will be available to your caller script when it is executed.
+
 If you would like to use a newer version of an existing variant caller, just change the version specified in the [prepare.yml](/envs/prepare.yml) file. Snakemake will automatically update the package upon your next execution of the pipeline.
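For readers extending the pipeline as patches 03 and 04 describe, a minimal new caller script might look like the sketch below. This is illustrative only: `mycaller` is a hypothetical binary (with made-up flags) that would need its own entry in envs/prepare.yml, and the positional-argument order is an assumption modeled on the existing varscan caller script rather than anything stated in these patches.

```bash
#!/usr/bin/env bash

# hypothetical new caller, e.g. saved as callers/mycaller;
# the argument order below is an assumption based on the varscan caller script
peaks="$1"
genome="$2"
bam="$3"
output_dir="$4"

# "mycaller" is a placeholder; any binary used here must be listed in
# envs/prepare.yml so that `snakemake --use-conda` can provide it
mycaller --targets "$peaks" --reference "$genome" "$bam" > "$output_dir/mycaller.vcf" && \
bgzip -f "$output_dir/mycaller.vcf" && \
tabix -p vcf -f "$output_dir/mycaller.vcf.gz"
```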
From 7082e70e4949d4d4ec8968baeecf4ead52dc532a Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Thu, 29 Apr 2021 11:21:23 -0700 Subject: [PATCH 05/26] temporarily handle 2vcf exit code (see #28) --- rules/classify.smk | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rules/classify.smk b/rules/classify.smk index 49f542d..748d1f6 100644 --- a/rules/classify.smk +++ b/rules/classify.smk @@ -242,7 +242,7 @@ rule tsv2vcf: output: temp(config['out']+"/{sample}/results.vcf.gz") conda: "../envs/prc.yml" shell: - "zcat {input.merge} | scripts/cgrep.bash - -E '^(CHROM|POS|REF)$|.*~(REF|ALT)$' | scripts/2vcf.py -o {output} {params.callers} {input.results}" + "zcat {input.merge} | scripts/cgrep.bash - -E '^(CHROM|POS|REF)$|.*~(REF|ALT)$' | scripts/2vcf.py -o {output} {params.callers} {input.results} || true" rule fix_vcf_header: """ add contigs to the header of the vcf """ From 86e1453f74d46cfba8cff7bf0f4fdbbf59311604 Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Thu, 29 Apr 2021 11:22:17 -0700 Subject: [PATCH 06/26] handle pvals of 1 in random forest output when creating VCF --- scripts/2vcf.py | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/scripts/2vcf.py b/scripts/2vcf.py index d7f19ad..32559ab 100755 --- a/scripts/2vcf.py +++ b/scripts/2vcf.py @@ -14,8 +14,12 @@ def phred(vals): """ apply the phred scale to the vals provided """ - return -10*np.log10(1-vals) - return -10*np.ma.log10(1-vals).filled(-3) # fill all infinite values with a phred scale of 30 + with np.errstate(divide='raise'): + try: + return -10*np.log10(1-vals) + except FloatingPointError: + return np.float64(30) + return -10*np.ma.log10(1-vals).filled(-3) # fill all infinite values with a phred scale of 30 def plot_line(lst, show_discards=False): plt.clf() From 753f81aee816a3396a657a8fd15b7069c0bd95e8 Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Thu, 29 Apr 2021 11:28:34 -0700 Subject: [PATCH 07/26] add a script for analyzing allele conflicts --- scripts/README.md | 3 +++ scripts/allele_conflicts.bash | 49 +++++++++++++++++++++++++++++++++++ 2 files changed, 52 insertions(+) create mode 100755 scripts/allele_conflicts.bash diff --git a/scripts/README.md b/scripts/README.md index 5547750..aaf5afd 100644 --- a/scripts/README.md +++ b/scripts/README.md @@ -7,6 +7,9 @@ All python scripts implement the `--help` argument. For bash, R, and awk scripts ### [2vcf.py](2vcf.py) A python script that uses files from the `prepare` and `classify` pipelines to create a VCF with the final, predicted variants. This script also has a special internal mode, which can be used for recalibrating the QUAL scores output in the VCF. +### [allele_conflicts.bash](allele_conflicts.bash) +A bash script for identifying sites at which the variant callers in our ensemble outputted conflicting alleles. + ### [cgrep.bash](cgrep.bash) A bash script for extracting columns from TSVs via `grep`. Every argument besides the first is passed directly to `grep`. diff --git a/scripts/allele_conflicts.bash b/scripts/allele_conflicts.bash new file mode 100755 index 0000000..a497fe9 --- /dev/null +++ b/scripts/allele_conflicts.bash @@ -0,0 +1,49 @@ +#!/usr/bin/env bash + +# List all positions (to stdout) at which VarCA called a variant but there were conflicts for the alternative allele. +# Note that this script will only work for the chosen subset of variant callers. 

# arg1: merge.tsv.gz file
# arg2: results.tsv.gz file
# arg3: 'snp' or 'indel'

# Ex: scripts/allele_conflicts.bash out/merged_indel/SRR891269/merge.tsv.gz out/new-classify/classify-indel/SRR891269_even_test/results.tsv.gz indel

if [ "$3" = 'indel' ]; then
    zcat "$1" | \
    scripts/cgrep.bash - -E '^(CHROM|POS)$|(gatk|varscan|vardict|pindel|illumina-strelka|pg-indel).*~(ALT)$' | \
    awk -F $'\t' -v 'OFS=\t' '$3 != $4 || $4 != $5 || $5 != $6 || $6 != $7 || $7 != $3' | \
    tail -n+2 | \
    awk -F $'\t' -v 'OFS=\t' '{ { for(i=3; i<=NF-1; i++) count[$i]++; } PROCINFO["sorted_in"] = "@val_num_desc"; { for (val in count) { print $0, val, count[val]; break; } } delete count; }' | \
    sed 's/\t/,/' | \
    LC_ALL=C sort -t $'\t' -k1,1 | \
    LC_ALL=C join -t $'\t' -e '' -j1 -o auto --nocheck-order - <(
        zcat "$2" | \
        awk -F $'\t' -v 'OFS=\t' 'NR == 1 || $NF == 1' | \
        sed 's/\t/,/' | \
        LC_ALL=C sort -t $'\t' -k1,1
    )
else
    zcat "$1" | \
    scripts/cgrep.bash - -E '^(CHROM|POS)$|(gatk|varscan|vardict|pg-snp).*~(ALT)$' | \
    awk -F $'\t' -v 'OFS=\t' '$3 != $4 || $4 != $5 || $5 != $3' | \
    tail -n+2 | \
    awk -F $'\t' -v 'OFS=\t' '{ { for(i=3; i<=NF-1; i++) count[$i]++; } PROCINFO["sorted_in"] = "@val_num_desc"; { for (val in count) { print $0, val, count[val]; break; } } delete count; }' | \
    sed 's/\t/,/' | \
    LC_ALL=C sort -t $'\t' -k1,1 | \
    LC_ALL=C join -t $'\t' -e '' -j1 -o auto --nocheck-order - <(
        zcat "$2" | \
        awk -F $'\t' -v 'OFS=\t' 'NR == 1 || $NF == 1' | \
        sed 's/\t/,/' | \
        LC_ALL=C sort -t $'\t' -k1,1
    )
fi | \
sed 's/,/\t/' | \
LC_ALL=C sort -t $'\t' -k1,1V -k2,2n

# To make a table containing alternative alleles for the following rules (as columns)
# 1) variant caller priority
# 2) majority rule
# 3) platinum genomes
# bcftools query -f '%CHROM\t%POS\t%INFO/CALLER\n' out/new-classify/classify-indel/SRR891269_even_test/final.vcf.gz | sed 's/gatk-indel/1/; s/varscan-indel/2/; s/vardict-indel/3/; s/pindel/4/; s/illumina-strelka/5/' | sed 's/\t/,/' | LC_ALL=C sort -t $'\t' -k1,1 | LC_ALL=C join -t $'\t' -e '' -j1 -o auto --nocheck-order <(zcat out/new-classify/classify-indel/SRR891269_even_test/allele_conflicts.tsv.gz | sed 's/\t/,/' | sort -t $'\t' -k1,1) - | cut -f2- | awk -F $'\t' -v 'OFS=\t' '{print $$NF, $7, $6;}' | less
# bcftools query -f '%CHROM\t%POS\t%INFO/CALLER\n' out/new-classify/classify-snp/SRR891269_even_test/final.vcf.gz | sed 's/gatk-snp/1/; s/varscan-snp/2/; s/vardict-snp/3/' | sed 's/\t/,/' | LC_ALL=C sort -t $'\t' -k1,1 | LC_ALL=C join -t $'\t' -e '' -j1 -o auto --nocheck-order <(zcat out/new-classify/classify-snp/SRR891269_even_test/allele_conflicts.tsv.gz | sed 's/\t/,/' | sort -t $'\t' -k1,1) - | cut -f2- | awk -F $'\t' -v 'OFS=\t' '{print $$NF, $5, $4;}' | less

From 9e872de4155101d16a89eb47dd50476662d9d2f2 Mon Sep 17 00:00:00 2001
From: Arya Massarat <23412689+aryarm@users.noreply.github.com>
Date: Wed, 16 Jun 2021 22:24:31 -0700
Subject: [PATCH 08/26] update to snakemake v5.24.2

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 3083a68..fdb38b9 100644
--- a/README.md
+++ b/README.md
@@ -23,9 +23,9 @@ wget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz
 
 # setup
 The pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). For most users, we recommend installing Snakemake via mamba as [described in their documentation](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda-mamba).
 
-However, if your aim is complete reproducibility, we recommend installing version 5.18.0 via this command:
+However, if your aim is complete reproducibility, we recommend installing version 5.24.2 via this command:
 ```
-conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'
+conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.24.2'
 ```
 
 # execution

From 75ee39586944d761cdc7eafb4ff05ccb3c583592 Mon Sep 17 00:00:00 2001
From: Arya Massarat <23412689+aryarm@users.noreply.github.com>
Date: Wed, 16 Jun 2021 22:33:13 -0700
Subject: [PATCH 09/26] initial dockerfile attempt: untested

---
 Dockerfile | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
 create mode 100644 Dockerfile

diff --git a/Dockerfile b/Dockerfile
new file mode 100644
index 0000000..1284efa
--- /dev/null
+++ b/Dockerfile
@@ -0,0 +1,14 @@
+FROM snakemake/snakemake:v5.24.2
+
+# work in the home directory
+WORKDIR /home
+
+# copy all the files to the container
+COPY . .
+
+# install the pipeline's dependencies
+RUN snakemake --use-conda --conda-create-envs-only -j
+
+# run varCA
+# specify as a CMD, so it can be overridden by the user
+CMD ["./run.bash"]

From af69606981c99424d01dc6a5afa3ccd0e19173c1 Mon Sep 17 00:00:00 2001
From: Arya Massarat <23412689+aryam7@users.noreply.github.com>
Date: Fri, 2 Jul 2021 13:10:00 -0700
Subject: [PATCH 10/26] subset columns before import in 2vcf.py; also resolve #28

---
 rules/classify.smk | 2 +-
 rules/prepare.smk  | 2 +-
 scripts/2vcf.py    | 5 ++++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/rules/classify.smk b/rules/classify.smk
index 748d1f6..41eee60 100644
--- a/rules/classify.smk
+++ b/rules/classify.smk
@@ -242,7 +242,7 @@ rule tsv2vcf:
     output: temp(config['out']+"/{sample}/results.vcf.gz")
     conda: "../envs/prc.yml"
     shell:
-        "zcat {input.merge} | scripts/cgrep.bash - -E '^(CHROM|POS|REF)$|.*~(REF|ALT)$' | scripts/2vcf.py -o {output} {params.callers} {input.results} || true"
+        "scripts/2vcf.py -o {output} {params.callers} {input.results} {input.merge}"

diff --git a/rules/prepare.smk b/rules/prepare.smk
index ce5f93d..f5f0c1b 100644
--- a/rules/prepare.smk
+++ b/rules/prepare.smk
@@ -144,7 +144,7 @@ rule bed_peaks:
         # to convert to BED, we must extract the first three columns (chr, start, stop)
         "cut -f -3 \"{input.peaks}\" | "
         "bedtools getfasta -fi {input.ref} -bedOut -bed stdin | "
-        "sort -t $'\t' -k1,1V -k2,2n > \"{output}\" && "
+        "sort -t $'\\t' -k1,1V -k2,2n > \"{output}\" && "
        "test -s \"{output}\""

diff --git a/scripts/2vcf.py b/scripts/2vcf.py
index 32559ab..64afddb 100755
--- a/scripts/2vcf.py
+++ b/scripts/2vcf.py
@@ -218,7 +218,8 @@ def main(prepared, classified, callers=None, cs=1000, all_sites=False, pretty=Fa
     prepared = get_calls(
         pd.read_csv(
             prepared, sep="\t", header=0, index_col=["CHROM", "POS"],
-            dtype=str, chunksize=cs, na_values="."
+ dtype=str, chunksize=cs, na_values=".", + usecols=lambda col: col in ['CHROM', 'POS', 'REF'] or col.endswith('~REF') or col.endswith('~ALT') ), callers, pretty ) # flush out the first item in the generator: the vartype @@ -338,6 +339,8 @@ def write_vcf(out, records): callers = args.callers.split(",") if not args.internal: + import matplotlib + matplotlib.use('Agg') vcf = write_vcf(args.out, main(args.prepared, args.classified, callers, args.chunksize, args.all, args.pretty, args.type)) else: if not sys.flags.interactive: From abfc6b77716cf12f20bda000e124ce5558f0e07b Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Fri, 2 Jul 2021 13:22:36 -0700 Subject: [PATCH 11/26] add header to allele conflicts script --- scripts/allele_conflicts.bash | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/scripts/allele_conflicts.bash b/scripts/allele_conflicts.bash index a497fe9..a794998 100755 --- a/scripts/allele_conflicts.bash +++ b/scripts/allele_conflicts.bash @@ -10,6 +10,7 @@ # Ex: scripts/allele_conflicts.bash out/merged_indel/SRR891269/merge.tsv.gz out/new-classify/classify-indel/SRR891269_even_test/results.tsv.gz indel if [ "$3" = 'indel' ]; then +# echo -e "CHROM\tPOS\tgatk\tvarscan\tvardict\tpindel\tstrelka\tpg\tmajority\tmajority_idx\tgatk\tvarscan\tvardict\tpindel\tstrelka\tpg\tpg\tprob0\tprob1\tvarca" zcat "$1" | \ scripts/cgrep.bash - -E '^(CHROM|POS)$|(gatk|varscan|vardict|pindel|illumina-strelka|pg-indel).*~(ALT)$' | \ awk -F $'\t' -v 'OFS=\t' '$3 != $4 || $4 != $5 || $5 != $6 || $6 != $7 || $7 != $3' | \ @@ -24,6 +25,7 @@ if [ "$3" = 'indel' ]; then LC_ALL=C sort -t $'\t' -k1,1 ) else +# echo -e "CHROM\tPOS\tgatk\tvarscan\tvardict\tpg\tmajority\tmajority_idx\tgatk\tvarscan\tvardict\tpg\tpg\tprob0\tprob1\tvarca" zcat "$1" | \ scripts/cgrep.bash - -E '^(CHROM|POS)$|(gatk|varscan|vardict|pg-snp).*~(ALT)$' | \ awk -F $'\t' -v 'OFS=\t' '$3 != $4 || $4 != $5 || $5 != $3' | \ @@ -42,8 +44,8 @@ sed 's/,/\t/' | \ LC_ALL=C sort -t $'\t' -k1,1V -k2,2n # To make a table containing alternative alleles for the following rules (as columns) -# 1) variant caller priority +# 1) caller priority rule # 2) majority rule # 3) platinum genomes -# bcftools query -f '%CHROM\t%POS\t%INFO/CALLER\n' out/new-classify/classify-indel/SRR891269_even_test/final.vcf.gz | sed 's/gatk-indel/1/; s/varscan-indel/2/; s/vardict-indel/3/; s/pindel/4/; s/illumina-strelka/5/' | sed 's/\t/,/' | LC_ALL=C sort -t $'\t' -k1,1 | LC_ALL=C join -t $'\t' -e '' -j1 -o auto --nocheck-order <(zcat out/new-classify/classify-indel/SRR891269_even_test/allele_conflicts.tsv.gz | sed 's/\t/,/' | sort -t $'\t' -k1,1) - | cut -f2- | awk -F $'\t' -v 'OFS=\t' '{print $$NF, $7, $6;}' | less -# bcftools query -f '%CHROM\t%POS\t%INFO/CALLER\n' out/new-classify/classify-snp/SRR891269_even_test/final.vcf.gz | sed 's/gatk-snp/1/; s/varscan-snp/2/; s/vardict-snp/3/' | sed 's/\t/,/' | LC_ALL=C sort -t $'\t' -k1,1 | LC_ALL=C join -t $'\t' -e '' -j1 -o auto --nocheck-order <(zcat out/new-classify/classify-snp/SRR891269_even_test/allele_conflicts.tsv.gz | sed 's/\t/,/' | sort -t $'\t' -k1,1) - | cut -f2- | awk -F $'\t' -v 'OFS=\t' '{print $$NF, $5, $4;}' | less +# bcftools query -f '%CHROM\t%POS\t%INFO/CALLER\n' out-original/new-classify/classify-indel/SRR891269_even_test/final.vcf.gz | sed 's/gatk-indel/1/; s/varscan-indel/2/; s/vardict-indel/3/; s/pindel/4/; s/illumina-strelka/5/' | sed 's/\t/,/' | LC_ALL=C sort -t $'\t' -k1,1 | LC_ALL=C join -t $'\t' -e '' -j1 -o auto --nocheck-order 
<(zcat out-original/new-classify/classify-indel/SRR891269_even_test/allele_conflicts.tsv.gz | sed 's/\t/,/' | LC_ALL=C sort -t $'\t' -k1,1) - | cut -f2- | awk -F $'\t' -v 'OFS=\t' '{print $$NF, $7, $6;}' | less +# bcftools query -f '%CHROM\t%POS\t%INFO/CALLER\n' out-original/new-classify/classify-snp/SRR891269_even_test/final.vcf.gz | sed 's/gatk-snp/1/; s/varscan-snp/2/; s/vardict-snp/3/' | sed 's/\t/,/' | LC_ALL=C sort -t $'\t' -k1,1 | LC_ALL=C join -t $'\t' -e '' -j1 -o auto --nocheck-order <(zcat out-original/new-classify/classify-snp/SRR891269_even_test/allele_conflicts.tsv.gz | sed 's/\t/,/' | LC_ALL=C sort -t $'\t' -k1,1) - | cut -f2- | awk -F $'\t' -v 'OFS=\t' '{print $$NF, $5, $4;}' | less From 4167a82d5088ce02807528f9c352772633869f52 Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Fri, 2 Jul 2021 13:30:29 -0700 Subject: [PATCH 12/26] be specific about the channels in each env file and remove default channels --- envs/classify.yml | 3 ++- envs/prepare.yml | 47 ++++++++++++++++++++++++----------------------- 2 files changed, 26 insertions(+), 24 deletions(-) diff --git a/envs/classify.yml b/envs/classify.yml index 5ddb80e..c79f205 100644 --- a/envs/classify.yml +++ b/envs/classify.yml @@ -1,5 +1,6 @@ channels: - - defaults + - conda-forge + - nodefaults dependencies: - r-dplyr==0.8.5 - r-r.devices==2.16.1 diff --git a/envs/prepare.yml b/envs/prepare.yml index 9b92631..eb8849e 100644 --- a/envs/prepare.yml +++ b/envs/prepare.yml @@ -1,34 +1,35 @@ channels: - - defaults + - conda-forge + - nodefaults dependencies: - - r-pbapply==1.4_2 - - r-stringr==1.4.0 - - r-readr==1.1.1 + - conda-forge::r-pbapply==1.4_2 + - conda-forge::r-stringr==1.4.0 + - conda-forge::r-readr==1.1.1 - bioconda::bioconductor-rsamtools==1.34.0 - - gawk==5.1.0 - - tabix==0.2.6 - - bcftools==1.9=ha228f0b_4 + - conda-forge::gawk==5.1.0 + - bioconda::tabix==0.2.6 + - bioconda::bcftools==1.9=ha228f0b_4 - bioconda::delly==0.8.3 - bioconda::bioconductor-bsgenome.hsapiens.ucsc.hg19==1.4.0 - bioconda::bioconductor-biovizbase==1.30.1 - - vcftools==0.1.16 + - bioconda::vcftools==0.1.16 - conda-forge::r-rmutil==1.1.4 - - gatk4==4.1.7.0 - - r-lattice==0.20_41 - - r-reshape==0.8.8 - - vardict-java==1.7.0 - - manta==1.6.0 - - bwa==0.7.17 - - r-data.table==1.12.8 - - r-plyr==1.8.4 - - samtools==1.9=h8571acd_11 - - strelka==2.9.10 + - bioconda::gatk4==4.1.7.0 + - conda-forge::r-lattice==0.20_41 + - conda-forge::r-reshape==0.8.8 + - bioconda::vardict-java==1.7.0 + - bioconda::manta==1.6.0 + - bioconda::bwa==0.7.17 + - conda-forge::r-data.table==1.12.8 + - conda-forge::r-plyr==1.8.4 + - bioconda::samtools==1.9=h8571acd_11 + - bioconda::strelka==2.9.10 - bioconda::bioconductor-variantannotation==1.28.3 - - pindel==0.2.5b9 - - macs2==2.1.2 - - r-dplyr==0.8.5 + - bioconda::pindel==0.2.5b9 + - bioconda::macs2==2.1.2 + - conda-forge::r-dplyr==0.8.5 - bioconda::bioconductor-genomicalignments==1.18.1 - - bedtools==2.29.2 - - varscan==2.4.4 + - bioconda::bedtools==2.29.2 + - bioconda::varscan==2.4.4 - bioconda::bioconductor-rtracklayer==1.42.1 - conda-forge::coreutils==8.31 From cbadb8201319e70b1d9441a24865ca26673aeae3 Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Sat, 3 Jul 2021 16:29:01 -0700 Subject: [PATCH 13/26] remove explicit channel specs in classify env but add to prc env since the classify env is mixed conda-forge/bioconda while prc just uses conda-forge --- envs/classify.yml | 4 ++-- envs/prc.yml | 13 +++++++------ 2 files changed, 9 
insertions(+), 8 deletions(-) diff --git a/envs/classify.yml b/envs/classify.yml index c79f205..c6e0d9d 100644 --- a/envs/classify.yml +++ b/envs/classify.yml @@ -6,11 +6,11 @@ dependencies: - r-r.devices==2.16.1 - r-ranger==0.11.2 - r-earth==5.1.2 - - conda-forge::coreutils==8.31 + - coreutils==8.31 - r-mlr==2.17.1 - r-parallelmap==1.5.0 - r==3.5.1 - r-readr==1.1.1 - r-data.table==1.12.8 - r-plyr==1.8.4 - - conda-forge::gawk==5.1.0 + - gawk==5.1.0 diff --git a/envs/prc.yml b/envs/prc.yml index f695875..d28f4cf 100644 --- a/envs/prc.yml +++ b/envs/prc.yml @@ -1,11 +1,12 @@ channels: - conda-forge + - nodefaults dependencies: - - numpy==1.18.5 - - pandas==1.0.4 - - scikit-learn==0.23.1 - - matplotlib==3.1.2 - - coreutils==8.31 + - conda-forge::numpy==1.18.5 + - conda-forge::pandas==1.0.4 + - conda-forge::scikit-learn==0.23.1 + - conda-forge::matplotlib==3.1.2 + - conda-forge::coreutils==8.31 - bioconda::pysam==0.15.3 - bioconda::bcftools==1.10.2 - - gawk==5.1.0 + - conda-forge::gawk==5.1.0 From 366c71d3552ba1c8b257b5317a046056fda7ceff Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Sat, 3 Jul 2021 17:22:46 -0700 Subject: [PATCH 14/26] add sed, gzip, bash, and friends to the env files for reproducibility --- envs/classify.yml | 4 ++++ envs/prc.yml | 4 ++++ envs/prepare.yml | 3 +++ 3 files changed, 11 insertions(+) diff --git a/envs/classify.yml b/envs/classify.yml index c6e0d9d..7509e17 100644 --- a/envs/classify.yml +++ b/envs/classify.yml @@ -14,3 +14,7 @@ dependencies: - r-data.table==1.12.8 - r-plyr==1.8.4 - gawk==5.1.0 + - conda-forge::coreutils==8.31 + - conda-forge::sed==4.8 + - conda-forge::bash==5.0.018 + - conda-forge::gzip==1.10 diff --git a/envs/prc.yml b/envs/prc.yml index d28f4cf..091e6ea 100644 --- a/envs/prc.yml +++ b/envs/prc.yml @@ -10,3 +10,7 @@ dependencies: - bioconda::pysam==0.15.3 - bioconda::bcftools==1.10.2 - conda-forge::gawk==5.1.0 + - conda-forge::coreutils==8.31 + - conda-forge::sed==4.8 + - conda-forge::bash==5.0.018 + - conda-forge::gzip==1.10 diff --git a/envs/prepare.yml b/envs/prepare.yml index eb8849e..53ea7cb 100644 --- a/envs/prepare.yml +++ b/envs/prepare.yml @@ -33,3 +33,6 @@ dependencies: - bioconda::varscan==2.4.4 - bioconda::bioconductor-rtracklayer==1.42.1 - conda-forge::coreutils==8.31 + - conda-forge::sed==4.8 + - conda-forge::bash==5.0.018 + - conda-forge::gzip==1.10 From 54050bbc905b7ac1bff992c0fa41bdd01b8669c3 Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryarm@users.noreply.github.com> Date: Tue, 6 Apr 2021 11:40:27 -0700 Subject: [PATCH 15/26] update install instructions to endorse mamba based on comments made in https://github.com/aryarm/varCA/issues/8#issuecomment-814290102 --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 8e06620..3083a68 100644 --- a/README.md +++ b/README.md @@ -21,11 +21,12 @@ wget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz ``` # setup -The pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). We recommend installing version 5.18.0: +The pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). For most users, we recommend installing Snakemake via mamba as [described in their documentation](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda-mamba). 
+
+However, if your aim is complete reproducibility, we recommend installing version 5.18.0 via this command:
 ```
 conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'
 ```
-We highly recommend you install [Snakemake via conda](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda) like this so that you can use the `--use-conda` flag when calling `snakemake` to let it [automatically handle all dependencies](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) of the pipeline. Otherwise, you must manually install the dependencies listed in the [env files](envs).
 
 # execution
 1. Activate snakemake via `conda`:

From 51495922936f0b1a1f336e7c789ee361c69b0a06 Mon Sep 17 00:00:00 2001
From: Arya Massarat <23412689+aryarm@users.noreply.github.com>
Date: Mon, 26 Apr 2021 12:29:16 -0700
Subject: [PATCH 16/26] docs: new callers must be added to the env file

---
 callers/README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/callers/README.md b/callers/README.md
index 179465a..7cafb03 100644
--- a/callers/README.md
+++ b/callers/README.md
@@ -40,3 +40,8 @@ By providing a dash character `-` in the caller identifier, the caller script (e
 The pipeline will run this separate script first but with the same parameters as the caller script, and the directory containing its output will be passed as an argument to the original caller script.
 (For example, the directory containing the output of the `gatk` special script will be passed as a parameter to the `gatk-snp` caller script.)
 By providing multiple dashes in your caller identifiers using this scheme, you may design complicated caller script hierarchies involving multiple levels of nesting. In a sense, you can create small pipelines within the pipeline.
+
+### Caller script dependencies
+If your caller script requires specific dependencies, just add them to the [prepare.yml](/envs/prepare.yml) [conda environment file](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually).
+Any packages listed in this file will be available to your caller script when it is executed.
+If you would like to use a newer version of an existing variant caller, just change the version specified in the [prepare.yml](/envs/prepare.yml) file. Snakemake will automatically update the package upon your next execution of the pipeline.

From 7ba7780423259b5ee6ed2cf57fba0d54f4ac77b6 Mon Sep 17 00:00:00 2001
From: Arya Massarat <23412689+aryarm@users.noreply.github.com>
Date: Mon, 26 Apr 2021 12:37:27 -0700
Subject: [PATCH 17/26] add newline to callers README

---
 callers/README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/callers/README.md b/callers/README.md
index 7cafb03..c312831 100644
--- a/callers/README.md
+++ b/callers/README.md
@@ -44,4 +44,5 @@ By providing multiple dashes in your caller identifiers using this scheme, you m
 ### Caller script dependencies
 If your caller script requires specific dependencies, just add them to the [prepare.yml](/envs/prepare.yml) [conda environment file](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually).
 Any packages listed in this file will be available to your caller script when it is executed.
+
 If you would like to use a newer version of an existing variant caller, just change the version specified in the [prepare.yml](/envs/prepare.yml) file. Snakemake will automatically update the package upon your next execution of the pipeline.

From 786d7aaee8a7fa38ee7d5ab4ca1806dabe75a8bf Mon Sep 17 00:00:00 2001
From: Arya Massarat <23412689+aryarm@users.noreply.github.com>
Date: Wed, 16 Jun 2021 22:24:31 -0700
Subject: [PATCH 18/26] update to snakemake v5.24.2

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 3083a68..fdb38b9 100644
--- a/README.md
+++ b/README.md
@@ -23,9 +23,9 @@ wget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz
 
 # setup
 The pipeline is written as a Snakefile which can be executed via [Snakemake](https://snakemake.readthedocs.io). For most users, we recommend installing Snakemake via mamba as [described in their documentation](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda-mamba).
 
-However, if your aim is complete reproducibility, we recommend installing version 5.18.0 via this command:
+However, if your aim is complete reproducibility, we recommend installing version 5.24.2 via this command:
 ```
-conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'
+conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.24.2'
 ```
 
 # execution

From 5b34036b2033a0e600ae503b99934fade5db5591 Mon Sep 17 00:00:00 2001
From: Arya Massarat <23412689+aryarm@users.noreply.github.com>
Date: Wed, 16 Jun 2021 22:33:13 -0700
Subject: [PATCH 19/26] initial dockerfile attempt: untested

---
 Dockerfile | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
 create mode 100644 Dockerfile

diff --git a/Dockerfile b/Dockerfile
new file mode 100644
index 0000000..1284efa
--- /dev/null
+++ b/Dockerfile
@@ -0,0 +1,14 @@
+FROM snakemake/snakemake:v5.24.2
+
+# work in the home directory
+WORKDIR /home
+
+# copy all the files to the container
+COPY . .
+ +# install the pipeline's dependencies +RUN snakemake --use-conda --conda-create-envs-only -j + +# run varCA +# specify as a CMD, so it can be overridden by the user +CMD ["./run.bash"] From c59fe85c99c84a70618e2f4bfc1812df13eed0db Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Sat, 10 Jul 2021 22:52:14 -0700 Subject: [PATCH 20/26] remove for-loop in pindel caller script and redirect chatty stdout to a log file --- callers/pindel | 21 ++++++++------------- 1 file changed, 8 insertions(+), 13 deletions(-) diff --git a/callers/pindel b/callers/pindel index 17fc32e..060becc 100755 --- a/callers/pindel +++ b/callers/pindel @@ -14,18 +14,13 @@ samp="$5" # we sleep between each command to avoid potential problems with filesystem latency -pindel -f "$genome" -i <(echo "$bam" 300 "$samp") -j "$peaks" -o "$output_dir/" -L "$output_dir/log" && \ +pindel -f "$genome" -i <(echo "$bam" 300 "$samp") -j "$peaks" -o "$output_dir/" -L "$output_dir/log" >"$output_dir/stdout" && \ sleep 10 && \ -for file in _D _INV _SI _TD; do - pindel2vcf -p "$output_dir/$file" -r "$genome" -R x -d 00000000 -v "$output_dir/$file.raw.vcf" && \ - sleep 10 && \ - bgzip -f "$output_dir/$file.raw.vcf" && \ - sleep 10 && \ - tabix -p vcf -f "$output_dir/$file.raw.vcf.gz" && \ - sleep 10 && \ - bcftools norm -d all "$output_dir/$file.raw.vcf.gz" > "$output_dir/$file.vcf" && \ - sleep 10 && \ - bgzip -f "$output_dir/$file.vcf" && tabix -p vcf -f "$output_dir/$file.vcf.gz" -done && \ +pindel2vcf -P "$output_dir/" -r "$genome" -R x -d 00000000 -v "$output_dir/pindel.raw.vcf" && \ sleep 10 && \ -bcftools concat -a -d all "$output_dir/"{_D,_INV,_SI,_TD}.vcf.gz > "$output_dir/pindel.vcf" +bgzip -f "$output_dir/pindel.raw.vcf" && \ +sleep 10 && \ +tabix -p vcf -f "$output_dir/pindel.raw.vcf.gz" && \ +sleep 10 && \ +bcftools norm -d all "$output_dir/pindel.raw.vcf.gz" > "$output_dir/pindel.vcf" + From 450afdc755d086a33e82d6d551c97eed4d1fa42d Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Sun, 11 Jul 2021 17:44:11 -0700 Subject: [PATCH 21/26] create script for generating small test datasets (resolves #33) A limit on the size of our datasets is that manta requires that there be at least 100 high-quality reads. So we can't go smaller than that. 
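The size constraint in patch 21's message is easy to check empirically before committing to a shrunken dataset. The command below is a sketch for counting the reads a structural-variant caller like manta would plausibly consider: the `-q 20` cutoff and flag mask are illustrative stand-ins, not manta's exact filtering criteria, and the BAM path is just the example file from this series.

```bash
# count mapped, primary, non-duplicate reads with decent mapping quality;
# -F 0xD04 excludes unmapped (0x4), secondary (0x100), duplicate (0x400),
# and supplementary (0x800) alignments -- a rough proxy for "high-quality"
samtools view -c -q 20 -F 0xD04 data/jurkat.mini.bam
```

If this prints fewer than roughly 100, manta can be expected to fail on the test dataset.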
--- README.md | 2 +- configs/classify.yaml | 16 ++++----- configs/config.yaml | 7 ++-- configs/prepare.yaml | 3 +- scripts/README.md | 5 ++- scripts/create_test_data.bash | 67 +++++++++++++++++++++++++++++++++++ 6 files changed, 86 insertions(+), 14 deletions(-) create mode 100755 scripts/create_test_data.bash diff --git a/README.md b/README.md index fdb38b9..57c87ce 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -[![Snakemake](https://img.shields.io/badge/snakemake-5.18.0-brightgreen.svg?style=flat-square)](https://snakemake.readthedocs.io/) +[![Snakemake](https://img.shields.io/badge/snakemake-5.24.2-brightgreen.svg?style=flat-square)](https://snakemake.readthedocs.io/) [![License](https://img.shields.io/apm/l/vim-mode.svg)](LICENSE) # varCA diff --git a/configs/classify.yaml b/configs/classify.yaml index a8aa9e1..1f8f63e 100644 --- a/configs/classify.yaml +++ b/configs/classify.yaml @@ -43,13 +43,13 @@ data: - 'gatk-indel~DP>10' # we only train on sites with a read depth above 10 # here are some example test samples, but you can replace these with whichever # samples for which you'd like to predict variants - molt4_chr1: - path: out/merged_indel/molt4_chr1/final.tsv.gz - merged: out/merged_indel/molt4_chr1/merge.tsv.gz + molt4_mini: + path: out/merged_indel/molt4_mini/final.tsv.gz + merged: out/merged_indel/molt4_mini/merge.tsv.gz filter: *filter # use the same filtering expression as the SRR891269 data - jurkat_chr1: - path: out/merged_indel/jurkat_chr1/final.tsv.gz - merged: out/merged_indel/jurkat_chr1/merge.tsv.gz + jurkat_mini: + path: out/merged_indel/jurkat_mini/final.tsv.gz + merged: out/merged_indel/jurkat_mini/merge.tsv.gz filter: *filter # use the same filtering expression as the SRR891269 data # # uncomment the lines below if you want to reproduce our results # SRR891269_evens: @@ -87,8 +87,8 @@ tune: false # You may optionally specify a truth set for each dataset if you want it to # serve as test data (see "truth" above). # This config entry is required! -predict: [molt4_chr1, jurkat_chr1] -# predict: [molt4_chr1, jurkat_chr1, SRR891269_evens] # uncomment this line if you want to reproduce our results +predict: [molt4_mini, jurkat_mini] +# predict: [SRR891269_evens] # uncomment this line if you want to reproduce our results # Callers to use when creating a VCF from the merged .tsv.gz file. # Supply a list of caller names in the order in which they should be considered diff --git a/configs/config.yaml b/configs/config.yaml index 1e386f5..8d05d86 100644 --- a/configs/config.yaml +++ b/configs/config.yaml @@ -28,12 +28,13 @@ sample_file: data/samples.tsv # Which samples from the samples_file should we execute the pipeline on? # Comment out this line or set it to a falsey value if you want to run all samples in the sample file -SAMP_NAMES: [molt4_chr1, jurkat_chr1] +SAMP_NAMES: [molt4_mini, jurkat_mini] # The path to a reference genome -# The BWA index files, samtools index file (.fa.fai), and samtools dictionary file (.dict) must be stored in the same directory +# The BWA index files, samtools index file (.fai), and GATK dictionary file (.dict) must be stored in the same directory +# Note: your fasta file cannot be gzipped! # required! 
-genome: data/hg38.chr1.fa +genome: data/hg38.mini.fa # The path to the directory in which to place all of the output files # Defined relative to whatever directory you execute the snakemake command in diff --git a/configs/prepare.yaml b/configs/prepare.yaml index 057d6dd..1e21932 100644 --- a/configs/prepare.yaml +++ b/configs/prepare.yaml @@ -30,7 +30,8 @@ sample_file: data/samples.tsv SAMP_NAMES: [SRR891269] # The path to a reference genome -# The BWA index files, samtools index file (.fa.fai), and samtools dictionary file (.dict) must be stored in the same directory +# The BWA index files, samtools index file (.fai), and GATK dictionary file (.dict) must be stored in the same directory +# Note: your fasta file cannot be gzipped! # required! genome: /iblm/netapp/data1/external/GRC37/combined/bwa_index_assembly19/Homo_sapiens_assembly19.fasta # replace this with the path to your own reference genome diff --git a/scripts/README.md b/scripts/README.md index aaf5afd..6e0c264 100644 --- a/scripts/README.md +++ b/scripts/README.md @@ -19,6 +19,9 @@ A fast awk script for classifying each site in a VCF as DEL, INS, SNP, etc. It a ### [classify.bash](classify.bash) A bash script for converting all REF/ALT columns in a TSV to binary positive/negative labels using `classify.awk`. +### [create_test_data.bash](create_test_data.bash) +A bash script for creating a small test dataset for the pipeline from existing data. + ### [fillna.bash](fillna.bash) A bash script for replacing NA values in a large TSV. @@ -50,7 +53,7 @@ A python script for creating ROC plots. It takes as input the output of `statist A python script for generating points to use in a precision-recall or ROC curve. It takes as input a two column TSV: true labels and prediction p-values. ### [train_RF.R](train_RF.R) -An R script for creating a trained classifier. We recommend using the `Snakefile-classify` pipeline to run this script. +An R script for creating a trained classifier. We recommend using the _classify_ subworkflow to run this script. ### [tune_plot.R](tune_plot.R) An R script for visualizing the results of hyperparameter tuning from the `train_RF.R` script. 
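The create_test_data.bash script documented above (and added in full just below) takes no arguments. The usage sketch here assumes the pipeline has already been run on the original chr1 example data, since the script reads its inputs from the resulting out/ directory.

```bash
# run from the repository root, with samtools, bedtools, bwa, and gatk
# available on the PATH (e.g. inside the prepare.yml conda environment)
scripts/create_test_data.bash
ls data/*.mini.*  # the shrunken reference, BAM, BED, and fastq test files
```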
diff --git a/scripts/create_test_data.bash b/scripts/create_test_data.bash new file mode 100755 index 0000000..cbffacf --- /dev/null +++ b/scripts/create_test_data.bash @@ -0,0 +1,67 @@ +#!/usr/bin/env bash + +# Create a small test dataset from a larger example dataset composed of chr1 from the +# Jurkat and MOLT4 cell lines + + +create_seq() { + # arg1 (string): either "jurkat" or "molt4" + + num_peaks=2 + if [ "$1" == "molt4" ]; then + num_peaks=3 + fi + + comm -12 <( + bedtools intersect -c -bed -a out/peaks/"$1"_chr1/"$1"_chr1_peaks.narrowPeak -b <( + zcat out/merged_indel/"$1"_chr1/merge.tsv.gz | \ + scripts/cgrep.bash - -E '^(CHROM|POS)$|(gatk|varscan|vardict|pindel|illumina-strelka|pg-indel).*~(ALT)$' | \ + awk -F $'\t' -v 'OFS=\t' '$3 == $4 && $4 == $5 && $5 == $6 && $6 == $7 && $7 == $3 && $3 != "."' | \ + tail -n+2 | \ + awk -F $'\t' -v 'OFS=\t' '{print $1, $2, $2+length($3);}' + ) | \ + grep -vE '0$' | cut -f-3 | sort + ) <( + bedtools intersect -c -bed -a out/peaks/"$1"_chr1/"$1"_chr1_peaks.narrowPeak -b <( + zcat out/merged_snp/"$1"_chr1/merge.tsv.gz | \ + scripts/cgrep.bash - -E '^(CHROM|POS)$|(gatk|varscan|vardict|pg-snp).*~(ALT)$' | \ + awk -F $'\t' -v 'OFS=\t' '$3 == $4 && $4 == $5 && $3 != "."' | \ + tail -n+2 | \ + awk -F $'\t' -v 'OFS=\t' '{print $1, $2, $2+length($3);}' + ) | \ + grep -vE '0$' | cut -f-3 | sort + ) | sort -t $'\t' -k1,1 -k2,2n | head -"$num_peaks" > data/"$1".mini.bed + + samtools sort -n out/align/"$1"_chr1/rmdup.bam | \ + bedtools pairtobed -abam - -b data/"$1".mini.bed 2>/dev/null | \ + samtools view -h | sed 's/'"$1"'_chr1/'"$1"'_mini/g' | grep -Ev '^\@PG' | \ + samtools sort -nO bam > data/"$1"_mini.bam + + if [ "$1" == "molt4" ]; then + samtools fastq -1 data/"$1".mini.1.fq.gz -2 data/"$1".mini.2.fq.gz -0 /dev/null -s /dev/null -n data/"$1"_mini.bam + rm data/molt4_mini.bam + fi +} + +create_seq jurkat +create_seq molt4 + +rm data/hg38.mini.* + +cat data/*.bed | sort -k1,1 -k2,2n | sed -e 1b -e '$!d' | tail -n+2 | \ +awk -F $'\t' -v 'OFS=\t' '{print $1, 0, $3;}' | \ +bedtools slop -i - -g <(cut -f-2 out/data-full-chr1/hg38.chr1.fa.fai) -b 500 | \ +bedtools getfasta -fi out/data-full-chr1/hg38.chr1.fa -bed - | \ +sed 's/>chr1:0-.*$/>chr1/' > data/hg38.mini.fa + +samtools faidx data/hg38.mini.fa +gatk CreateSequenceDictionary -R data/hg38.mini.fa +bwa index data/hg38.mini.fa &>/dev/null + +samtools view -h data/jurkat_mini.bam | sed 's/LN:.*$/LN:'"$(cut -f2 data/hg38.mini.fa.fai)"'/' | samtools sort -O bam > data/jurkat.mini.bam +samtools index data/jurkat.mini.bam + +rm data/molt4.mini.bed data/jurkat_mini.bam + +gatk ValidateSamFile -R data/hg38.mini.fa -I data/jurkat.mini.bam --MODE SUMMARY + From 546f012c9adba8f63d9a544643c4e8d5b4c443c6 Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Sun, 11 Jul 2021 17:50:11 -0700 Subject: [PATCH 22/26] both jurkat and molt4 test datasets must have at least 3 peaks to satisfy manta --- scripts/create_test_data.bash | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/scripts/create_test_data.bash b/scripts/create_test_data.bash index cbffacf..2c44145 100755 --- a/scripts/create_test_data.bash +++ b/scripts/create_test_data.bash @@ -7,11 +7,6 @@ create_seq() { # arg1 (string): either "jurkat" or "molt4" - num_peaks=2 - if [ "$1" == "molt4" ]; then - num_peaks=3 - fi - comm -12 <( bedtools intersect -c -bed -a out/peaks/"$1"_chr1/"$1"_chr1_peaks.narrowPeak -b <( zcat out/merged_indel/"$1"_chr1/merge.tsv.gz | \ @@ -30,7 +25,7 @@ create_seq() { awk 
-F $'\t' -v 'OFS=\t' '{print $1, $2, $2+length($3);}' ) | \ grep -vE '0$' | cut -f-3 | sort - ) | sort -t $'\t' -k1,1 -k2,2n | head -"$num_peaks" > data/"$1".mini.bed + ) | sort -t $'\t' -k1,1 -k2,2n | head -3 > data/"$1".mini.bed samtools sort -n out/align/"$1"_chr1/rmdup.bam | \ bedtools pairtobed -abam - -b data/"$1".mini.bed 2>/dev/null | \ From 5b92328a5cebd934bbf8f4704694f54ab4d2445f Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Sun, 11 Jul 2021 18:25:43 -0700 Subject: [PATCH 23/26] make fillna.bash read from stdin and connect steps via pipes (resolves #14) also remove extra config params that nobody uses and mark some extra files as temp --- README.md | 2 +- configs/prepare.yaml | 20 -------------------- rules/README.md | 2 +- rules/prepare.smk | 30 +++++++++++------------------- scripts/fillna.bash | 12 ++++++------ 5 files changed, 19 insertions(+), 47 deletions(-) diff --git a/README.md b/README.md index 57c87ce..ffe06ac 100644 --- a/README.md +++ b/README.md @@ -53,7 +53,7 @@ You must modify [the config.yaml file](configs#configyaml) to specify paths to y The pipeline is made up of [two subworkflows](rules). These are usually executed together automatically by the master pipeline, but they can also be executed on their own for more advanced usage. See the [rules README](rules/README.md) for execution instructions and a description of the outputs. You will need to execute the subworkflows separately [if you ever want to create your own trained models](rules#training-and-testing-varca). #### Reproducing our results -We provide the example data so that you may quickly (in ~1 hr, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow [these steps](rules#testing-your-model--reproducing-our-results) to create all of the plots and tables in our paper. +We provide the example data so that you may quickly (in ~15 mins, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow [these steps](rules#testing-your-model--reproducing-our-results) to create all of the plots and tables in our paper. ### If this is your first time using Snakemake We recommend that you run `snakemake --help` to learn about Snakemake's options. For example, to check that the pipeline will be executed correctly before you run it, you can call Snakemake with the `-n -p -r` flags. This is also a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are inputs to the first rule in each workflow -- ie the `all` rule). diff --git a/configs/prepare.yaml b/configs/prepare.yaml index 1e21932..4fe0024 100644 --- a/configs/prepare.yaml +++ b/configs/prepare.yaml @@ -63,20 +63,6 @@ indel_callers: [gatk-indel, varscan-indel, vardict-indel, pindel, illumina-strel # If this value is not provided, it will default to "." label: "." 
-# Whether to "normalize" the VCF output of every caller script -# This option is not recommended if you plan to feed the data to a random forest -# Normalization usually involves left-alignment and trimming of variants -# See https://genome.sph.umich.edu/wiki/Variant_Normalization for more info -# In our pipeline, the normalization step also reduces counts of variants that appear at the same position to 1 -# If this value is not provided, it will default to false -no_normalize: false - -# Whether to replace NA values in the dataset before outputting the final TSV -# This option is not recommended if you plan to feed the data to a random forest -# Unless otherwise specified in the Caller Specific Parameters below, NA values will be replaced with 0 -# If this value is not provided, it will default to false -keep_na: false - # A list of filtering expressions for filtering the sites before outputting the final TSV # A filtering expression consists of the following concatenated together: # - the caller id @@ -88,9 +74,3 @@ keep_na: false snp_filter: ['gatk-snp~DP>10'] indel_filter: ['gatk-indel~DP>10'] -# Whether to normalize weird numerical columns before outputting the final TSV -# (ex: scientific notation or numbers followed by a percent sign) -# We recommended setting this to false in order to allow normalization if you -# plan to feed the data to a random forest -# If this option is commented out, it will default to false -pure_numerics: false diff --git a/rules/README.md b/rules/README.md index 5106880..1ef2ce9 100644 --- a/rules/README.md +++ b/rules/README.md @@ -103,7 +103,7 @@ First, split the truth dataset by chromosome parity using `awk` commands like th Then, specify in `classify.yaml` the path to the `indel-odds.tsv.gz` file as your training set and the path to the `indel-evens.tsv.gz` file as your test set. Make sure that both your training and test sets have a [truth caller ID](/configs/classify.yaml#L26) and that you've specified the test set ID in the list of `predict` datasets. Once you've finished, execute the `classify` subworkflow on its own (see [above](#execution-1)). -To truly reproduce our results, we also recommend predicting variants for the ATAC-seq cell lines we tested (Jurkat, Molt-4, K-562, etc) by downloading them from GEO accession GSE129086. Although the example data includes the Jurkat and Molt-4 samples, it only includes chromosome 1 from those cell lines, thereby artificially lowering the VarCA scores of the variants in those samples. +To truly reproduce our results, we also recommend predicting variants for the ATAC-seq cell lines we tested (Jurkat, Molt-4, K-562, etc) by downloading them from GEO accession GSE129086. Although the example data includes the Jurkat and Molt-4 samples, it only includes a small portion of chromosome 1 from those cell lines, thereby artificially lowering the VarCA scores of the variants in those samples. The `classify` subworkflow can only work with one trained model at a time, so you will need to repeat these steps for SNVs. Just replace every mention of "indel" in `classify.yaml` with "snp". Also remember to use only the SNV callers (ie GATK, VarScan 2, and VarDict). 
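The chromosome-parity split referenced in the README hunk above can be written as a pair of awk one-liners. This is only a sketch: the truth/indel.tsv.gz path is hypothetical, chromosome names are assumed to look like chr1..chr22, and sex chromosomes (which have no numeric parity and evaluate to 0 in awk) would need separate handling.

```bash
# keep the header (NR == 1) plus all sites on odd-numbered chromosomes;
# substr($1, 4) strips the leading "chr" so the name can be tested numerically
zcat truth/indel.tsv.gz | \
awk -F $'\t' 'NR == 1 || substr($1, 4) % 2 == 1' | gzip > indel-odds.tsv.gz
# the complementary subset for even-numbered chromosomes
zcat truth/indel.tsv.gz | \
awk -F $'\t' 'NR == 1 || substr($1, 4) % 2 == 0' | gzip > indel-evens.tsv.gz
```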
diff --git a/rules/prepare.smk b/rules/prepare.smk index f5f0c1b..16498ce 100644 --- a/rules/prepare.smk +++ b/rules/prepare.smk @@ -385,16 +385,6 @@ rule merge_callers: "{input.all_sites}) {input.caller_output} | " "(read -r head; echo \"$head\"; sort -t $'\\t' -k1,1V -k2,2n) | gzip > {output}" -rule binarize: - """convert REF/ALT columns to binary labels""" - input: rules.merge_callers.output - params: - label = config['label'] if 'label' in config else "." - output: config['out']+"/merged_{type}/{sample}/binary.tsv.gz" - conda: "../envs/prepare.yml" - shell: - "paste <(zcat {input} | scripts/cgrep.bash - -Ev '~(REF|ALT)$') <(scripts/classify.bash {input} '{params.label:q}') | gzip >{output}" - def na_vals(wildcards): """ retrieve all of the NA parameters """ @@ -412,27 +402,29 @@ def na_vals(wildcards): rule fillna: """ prepare the caller for use by the classifier by - 1) extracting the columns desired by the user - 2) filling NA values with the defaults provided + 1) converting REF/ALT columns to binary labels + 2) extracting the columns desired by the user + 3) filling NA values with the defaults provided """ - input: rules.binarize.output + input: rules.merge_callers.output params: na_vals = na_vals, - output: config['out']+"/merged_{type}/{sample}/fillna.tsv.gz" + label = config['label'] if 'label' in config else "." + output: temp(config['out']+"/merged_{type}/{sample}/fillna.tsv.gz") conda: "../envs/prepare.yml" shell: - "scripts/fillna.bash {input} {params.na_vals:q} | " + "paste <(zcat {input} | scripts/cgrep.bash - -Ev '~(REF|ALT)$') <(scripts/classify.bash {input} '{params.label:q}') | " + "scripts/fillna.bash {params.na_vals:q} | " "awk -F $'\\t' -v 'OFS=\\t' '{{for (i=1;i<=NF;i++) if ($i==\".\") $i=0}}1' | " "gzip >{output}" rule apply_filters: """ apply filtering on the data according to the filtering expressions """ - input: rules.fillna.input if check_config('keep_na') \ - else rules.fillna.output + input: rules.fillna.output params: expr = lambda wildcards: config[wildcards.type+"_filter"] - output: config['out']+"/merged_{type}/{sample}/filter.tsv.gz" + output: temp(config['out']+"/merged_{type}/{sample}/filter.tsv.gz") conda: "../envs/prepare.yml" shell: "zcat {input} | scripts/filter.bash {params.expr:q} | gzip > {output}" @@ -452,7 +444,7 @@ rule norm_nums: rule result: """ symlink the result so the user knows what the final output is """ - input: rules.norm_nums.input if check_config('pure_numerics') else rules.norm_nums.output + input: rules.norm_nums.output output: config['out']+"/merged_{type}/{sample}/final.tsv.gz" conda: "../envs/prepare.yml" shell: diff --git a/scripts/fillna.bash b/scripts/fillna.bash index 917b050..9925e9e 100755 --- a/scripts/fillna.bash +++ b/scripts/fillna.bash @@ -2,22 +2,22 @@ # this script replaces NA values in a large TSV with the provided replacements -# param1: the .tsv.gz file containing the large table for which we want to fill NA values -# param2+: the name of a column followed by the NA value replacement for that col (as two adjacent arguments) +# stdin: the .tsv file containing the large table for which we want to fill NA values +# param1+: the name of a column followed by the NA value replacement for that col (as two adjacent arguments) # you can specify multiple column-value pairs # return (to stdout): the same table except where all specified NA values have been replaced -script_dir="$(dirname "$BASH_SOURCE")"; +read -r head pfiles=() -for (( i=2; i<$#; i+=2 )); do +for (( i=1; i<$#; i+=2 )); do j=$((i+1)) reg="${!i}" 
val="${!j}" # retrieve the idx of the column among the columns - reg="$(zcat "$1" | head -n1 | tr '\t' '\n' | grep -Fn "$reg" | cut -f1 -d: | head -n1)" + reg="$(echo "$head" | tr '\t' '\n' | grep -Fn "$reg" | cut -f1 -d: | head -n1)" pfiles+=("{if (\$$reg==\".\") \$$reg=\"$val\"}") done -zcat "$1" | awk -F $'\t' -v 'OFS=\t' "${pfiles[*]} 1" +{ echo "$head"; cat; } | awk -F $'\t' -v 'OFS=\t' "${pfiles[*]} 1" From c936a5bca152499c351eb9eda9981938be290154 Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Sun, 11 Jul 2021 18:42:13 -0700 Subject: [PATCH 24/26] bump required snakemake version to 5.24.2 --- README.md | 2 +- Snakefile | 2 +- rules/classify.smk | 2 +- rules/prepare.smk | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index ffe06ac..b23d674 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ The pipeline is written as a Snakefile which can be executed via [Snakemake](htt However, if your aim is complete reproducbility, we recommend installing version 5.24.2 via this command: ``` -conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.24.2' +conda create -n snakemake -c conda-forge --no-channel-priority 'bioconda::snakemake==5.24.2' ``` # execution diff --git a/Snakefile b/Snakefile index 0844018..50e3af7 100644 --- a/Snakefile +++ b/Snakefile @@ -2,7 +2,7 @@ import warnings from snakemake.utils import min_version ##### set minimum snakemake version ##### -min_version("5.18.0") +min_version("5.24.2") configfile: "configs/config.yaml" configfile: "configs/callers.yaml" diff --git a/rules/classify.smk b/rules/classify.smk index 41eee60..596ab40 100644 --- a/rules/classify.smk +++ b/rules/classify.smk @@ -1,7 +1,7 @@ from snakemake.utils import min_version ##### set minimum snakemake version ##### -min_version("5.18.0") +min_version("5.24.2") if 'imported' not in config: configfile: "configs/classify.yaml" diff --git a/rules/prepare.smk b/rules/prepare.smk index 16498ce..754860f 100644 --- a/rules/prepare.smk +++ b/rules/prepare.smk @@ -5,7 +5,7 @@ import warnings from snakemake.utils import min_version ##### set minimum snakemake version ##### -min_version("5.18.0") +min_version("5.24.2") if 'imported' not in config: configfile: "configs/prepare.yaml" From 3ea4d9945617cbf17fde81433cf8b6aa79f89753 Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Fri, 16 Jul 2021 12:25:12 -0700 Subject: [PATCH 25/26] convert pipe to temp file --- rules/prepare.smk | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rules/prepare.smk b/rules/prepare.smk index 754860f..99d4d2e 100644 --- a/rules/prepare.smk +++ b/rules/prepare.smk @@ -359,7 +359,7 @@ rule join_all_sites: tsv = caller_tsv, prepared_tsv = rules.prepare_merge.output output: - pipe(config['out'] + "/merged_{type}/{sample}/{caller}.tsv") + temp(config['out'] + "/merged_{type}/{sample}/{caller}.tsv") conda: "../envs/prepare.yml" shell: "LC_ALL=C join -t $'\\t' -e. 
-a1 -j1 -o auto --nocheck-order " From ac31accd363cbadbf67c2c70dbeb90685e286b55 Mon Sep 17 00:00:00 2001 From: Arya Massarat <23412689+aryam7@users.noreply.github.com> Date: Thu, 21 Jul 2022 22:18:04 -0700 Subject: [PATCH 26/26] clarify required input in config --- configs/config.yaml | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/configs/config.yaml b/configs/config.yaml index 8d05d86..a20701f 100644 --- a/configs/config.yaml +++ b/configs/config.yaml @@ -8,6 +8,9 @@ # Note: this file is written in the YAML syntax (https://learnxinyminutes.com/docs/yaml/) +#---------------------------------------- REQUIRED INPUT -----------------------------------------------# +# If you are executing the pipeline on your own data, you MUST edit these config values + # The path to a text file specifying where to find read information for each sample # Each row in the sample file should represent a different sample. # The sample file should have 3 columns (each separated by a single tab): @@ -26,7 +29,7 @@ # this is a required input! sample_file: data/samples.tsv -# Which samples from the samples_file should we execute the pipeline on? +# Which samples from the sample_file should we execute the pipeline on? # Comment out this line or set it to a falsey value if you want to run all samples in the sample file SAMP_NAMES: [molt4_mini, jurkat_mini] @@ -36,6 +39,18 @@ SAMP_NAMES: [molt4_mini, jurkat_mini] # required! genome: data/hg38.mini.fa +# The path to trained, ensemble classifiers, which can be used for calling SNVs and +# indels +# Unless you have created your own models, you should just use the ones packaged in the +# example dataset and leave these paths unchanged. +# required! +snp_model: data/snp.rda +indel_model: data/indel.rda + + +#------------------------------------------- OPTIONS ---------------------------------------------------# +# If you are executing the pipeline on your own data, you can usually just leave these unchanged. + # The path to the directory in which to place all of the output files # Defined relative to whatever directory you execute the snakemake command in # Defaults to "out" if this line is commented out or set to a falsey value @@ -75,10 +90,3 @@ indel_callers: [gatk-indel, varscan-indel, vardict-indel, pindel, illumina-strel snp_filter: ['gatk-snp~DP>10'] indel_filter: ['gatk-indel~DP>10'] -# The path to a trained, ensemble classifier, which can be used for calling SNVs -# required! -snp_model: data/snp.rda - -# The path to a trained, ensemble classifier, which can be used for calling indels -# required! -indel_model: data/indel.rda
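With the required inputs grouped at the top of the config as this final patch arranges, a first run of the pipeline follows the standard Snakemake pattern. A sketch, using the flags recommended elsewhere in this series (the `-j 4` core count is arbitrary):

```bash
# activate the snakemake environment created during setup
conda activate snakemake
# dry-run first to preview the steps, then execute for real, letting
# snakemake install each rule's dependencies from the envs/ files via conda
snakemake -n -p -r
snakemake -j 4 --use-conda
```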