Created 19 June 2019
Last updated 9 April 2020
This project aligns fastq files to the genome using STAR, concatenates the SJ.out.tab and Chimeric.out.junction outputs into one file, and creates a class input file based on the STAR output.
The script write_jobs.py is the main wrapper that runs all necessary jobs. To run on a sample, edit the following variables in the main function:
Update the script with information about your sample by inputting the following parameters.
data_path: set equal to the path containing the fastqs. Example:data_path = "/scratch/PI/horence/JuliaO/single_cell/data/SRA/19.05.31.GSE109774/"assembly: set equal to the keyword for the genome assembly you want to use (maybe "mm10" for mouse assembly 10, or hg38 for human assembly 38). Example:assembly = "mm10"gtf_file: The path to the gtf file that should be used to annotate genes. Example:gtf_file = "/share/PI/horence/circularRNApipeline_Cluster/index/mm10_genes.gtf"run_name: The unique name you want to give to the current run. It can be useful to include the date and a signifier for the data. Example:run_name = "GSE109774_colon"r_ends: list of unique endings for the file names of read one (location 0) and read 2 (location 1). Example:r_ends = ["_1.fastq.gz", "_2.fastq.gz"]names: list of unique identifiers for each fastq; e.g. file location for read 1 should be <data_path><r_ends[0]> and file location for read 2 should be <data_path><r_ends[1]>. Example:names = ["SRR65462{}".format(i) for i in range(73,74)]single: Set equal to True if the data is single-end, and False if it is paired-end. Note that currently ifsingle = Trueit is assumed that the single read to be aligned is in the second fastq file (because of the tendancy for droplet protocols to have read 1 contain the barcode and UMI and read 2 contain the sequence that aligns to the genome). This also causes the files to be demultiplexed to create a new fastq file before they're mapped.
These parameters modify which combinations of STAR parameters are run. Be careful about including too many; even if you just have two values for each, you will run the pipeline 16 times. If you have 12 samples, then you're running the pipeline 192 times:
chimSegmentMin: A list containing every value of the--chimSegmentMinSTAR parameter you want to run on. Example:chimSegmentMin = [12,10]chimJunctionOverhangMin: A list containing every value of the--chimJunctionOverhangMinSTAR parameter you want to run on. Example:chimJunctionOverhangMin = [13, 10]alignSJstitchMismatchNmax: For every value<x>in this list, STAR will be run with--alignSJstitchMismatchNmax <x> -1 <x> <x>. Example:alignSJstitchMismatchNmax = [0,1]chimSegmentReadGapMax: A list containing every value of the--chimSegmentReadGapMaxSTAR parameter you want to run on. Example:chimSegmentReadGapMax = [0,3]
These parameters let you decide which portion of the script you want to run (for example, if you're modifying the create_class_input.py script only, so the mapping shouldn't change):
run_whitelist: Set equal to True if you want to run UMI-tools whitelist script to extract cell barcodes and identify the most likely true cell barcodes (will be run only for 10X)run_extract: Set equal to True if you want to run UMI-tools extract script which removes UMIs from fastq reads and append them to read namerun_map: Set equal to True if you want to run the mapping job, and False otherwiserun_star_fusion: Set equal to True if you want to run the STAR-Fusion, and False otherwiserun_ann: Set equal to True if you want to annotate and concatenate the STAR files, False otherwiserun_class: Set equal to True if you want to create the class input file, false otherwiserun_modify_class: Set equal to True if you want to make consistent junction IDs in which reverse strands and discrepancies between reporting alignments in chimeric and aligned sam files are taken care of.run_ensembl: Set equal to True if you want to add gene names to the STAR gene count file and add gene ensembl ids and gene counts to the class input file, False otherwiserun_compare: Set equal to True if you want to comapre the junction in the class inpout file with those in the STAR and STAR-Fusion ouput files, false otherwiserun_GLM: Set equal to True if you want to run the GLM step and assign statistical scores to each junction in the class input file. The output of this step is a file namedGLM_output.txt.
After assigning these variables, run python3 write_jobs.py to submit the jobs.
There will be a separate folder for every combination of STAR parameters that was run. These folders will follow the naming convention output/<run_name>_cSM_<chimSegmentMin value>_cJOM_<chimJunctionOverhangMin value>_aSJMN_<alignSJstitchMismatchNmax value>_cSRGM_<chimSegmentReadGapMax value>. Example: output/GSE109774_colon_cSM_10_cJOM_10_aSJMN_1_cSRGM_3. Each of these folders will contain a folder for each variable in names, so a different folder for each pair of fastq files. Each of these sub-folders contains the following:
Based on the parameters we are running with, this includes 2Aligned.out.sam, 2Chimeric.out.sam, 2Chimeric.out.junction, 2SJ.out.tab, and the same files with 1 instead of 2 if the reads are paired-end (among other STAR-generated files).
The created files are called 1SJ_Chimeric.out (if the data is paired-end) and 2SJ_Chimeric.out. Each of these concatenates the respective SJ.out.tab and Chimeric.out.junction files, and adds columns for the gene names of the donor and acceptor as well as their strands (because we are defining strand to be whatever strand the gene is on, and ? if there is no gene on either strand or a gene on both strands). However, these two files don't have the same columns, so right now the only columns that are shared are "donor_chromosome", "acceptor_chromosome", "donor_gene", and "acceptor_gene". The rest of the columns from the individual file are still present, but they are left blank for rows that belong to the "other" file. A more complete description of the columns can be found in the STAR manual: http://chagall.med.cornell.edu/RNASEQcourse/STARmanual.pdf
The purpose of the class input files is for each row to contain a summary of read 1 and read 2 for every read where read 1 is either Chimeric or contains a gap.
STAR-Fusion is an additional module developed by STAR authors to call fusion junctions using the Chimeric.out.junction file generated by STAR. All STAR-Fusion output files are written in the star-fusion folder which is placed in the same folder that other STAR output files are written. The final fusion calls by STAR-Fusion can be found in star-fusion.fusion_predictions.abridged.tsv. Currently STAR-Fusion is only run for R1 (even for PE data).
For single-end reads, a read is included in the class input file if it is Chimeric (contains the ch flag in the BAM file) or has an N in its cigar string (N codes for a base being skipped). For paired-end reads, if read1 is Chimeric or has an N in its cigar string, the read will be included in the class input file. Its line in the class input file will also include information about read 2, regardless of whether r2 is Chimeric/has an N in its CIGAR string or not. Note that if read 1 is not Chimeric and doesn't have an N in its CIGAR string, then read 2 won't be included in the class input file even if read 2 is Chimeric or has an N in its CIGAR string (this is something we could change).
Each read is included in class_input.tsv at most one time. If a read has a genomic read and a junctional read, the junctional read will be included in the class input file (and the fact that a genomic alignment exists will be marked by a zero in the genomicAlignmentR1 column). However, if there are multiple junctional reads then the read with the lowest value in the HI tag will be included in class_input.tsv. All other junctional alignments can be found in class_input_secondary.tsv.
The class input file is saved in both parquet format and tsv format (class_input.tsv and class_input.pq; same for class_input_secondary). class_input.tsv is modified by the GLM script to include the SICILIAN score, to deduplicate UMIs (by UMI + barcode + junction name), and to add a few columns such as overlap_R1. However, class_input.pq is not modified by the GLM.
A note on the naming convention for the fields of the class input file: id and class are the only fields that are necessarily the same for read 1 and read 2. All other fields have R1 in them if they pertain to read 1, and R2 in them if they pertain to read 2. For single-end reads, the information for the one read will always show up in the R1 columns, even if it's actually from the fastq file labeled 2. Then within read 1 and read 2 the columns are split into A and B. For a read that aligns to two locations (either in the Chimeric file, or with an N in the CIGAR string in the Aligned file), the first portion of the read is referred to as A and the second is referred to as B. Here "first portion" means that if you saw the read in the raw fastq file, the first bases in the read would align to the A location, and the last bases would align to the B location.
Categories of the form <field>R2 rather than <field>R1 follow the same definition but for the second read unless otherwise indicated.
If a read is from the Aligned file rather than the Chimeric file, the following columns will have the value NA: flagB, posB, aScoreB, qualB, seqB. If a read doesn't contain a junction, the following columns will also have the value NA: strandR2B, chrR2B, geneR2B, readClassR2, juncPosR2A, juncPosR2B, cigarR2B, MR2B, SR2B.
The class input file contains the following fields (not necessarily in this order):
Junction classification
linear: acceptor and donor are on the same chromosome and strand, closer than 1 MB to each other, and are based on the reference genome canonical orderingrev: acceptor and donor are on the same chromosome and strand, closer than 1 MB to each other, and are ordered opposite of the reference genome canonical orderingsc: acceptor and donor are on the same chromosome but opposite strands.fusion: acceptor and donor are on different chromosomes, or on the same chromosome and strand but farther than 1MB from each other.
Refnames for negative-strand genes will have the acceptor first and the donor second
-
id: Read name. Example:SRR6546273.367739 -
readLenR1: length of read 1 (including any softclipped portions) -
fileTypeR1: equalsAlignedif the read came from the1Aligned.out.samfile,Chimericif it came from the1Chimeric.out.samfile. -
seqR1: The read sequence -
AT_run_R1: max(the length of the longest run of A's, the length of the longest run of T's). The A's and T's are combined because we could have the sequence or the reverse complement. This is calculated fromseqR1. -
GC_run_R1: max(the length of the longest run of G's, the length of the longest run of C's). This is calculated fromseqR1 -
max_run_R1: max(AT_run_R1,GC_run_R1) -
entropyR1: The entropy of the read calculated based on 5-mers. Let$k_1, \ldots, k_n$ be all the 5-mers in the read sequence (for example, for ACTCCGAGTCCTCCG the 5-mers would be ACTCCGAGTCCTCCG, ACTCCGAGTCCTCCG, ACTCCGAGTCCTCCG, ACTCCGAGTCCTCCG, ACTCCGAGTCCTCCG, ACTCCGAGTCCTCCG, ACTCCGAGTCCTCCG, ACTCCGAGTCCTCCG, ACTCCGAGTCCTCCG, ACTCCGAGTCCTCCG, and ACTCCGAGTCCTCCG). Then let$N(c)$ equal the number of times the kmer c appears in$k_1, \ldots, k_n$ (for example, CTCCG appears twice in ACTCCGAGTCCTCCG). Then the entropy is defined as$\sum_C -\frac{N(c)}{n}\log\left(\frac{N(c)}{n}\right)$ (here C is the set of unique 5-mers in the read) -
refName_ABR1: The refName for R1 will always be of the form<chrR1A>:<geneR1A>:<juncPosR1A>:<gene_strandR1A>|<chrR1B>:<geneR1B>:<juncPosR1B>:<gene_strandR1B>|<readClassR1> -
UMI: for 10X data this is the UMI found through UMI-tools, otherwise this is NA -
barcode: for 10X data this is the barcode found through UMI-tools, otherwise this is NA -
maxA_10merR1: The maximum number of A's in a 10-base stretch in the read -
maxT_10merR1: The maximum number of T's in a 10-base stretch in the read -
maxG_10merR1: The maximum number of G's in a 10-base stretch in the read -
maxC_10merR1: The maximum number of C's in a 10-base stretch in the read -
aScoreR1A: alignment score from the SAM file after theAS. -
aScoreR1B: alignment score from the SAM file after theAS. -
MR1A: The number of M's incigarR1A -
MR1B: The number of M's incigarR1B -
SR1A: The number of S's in the portion of the cigar string corresponding to the first half of the splice -
SR1B: The number of S's in the portion of the cigar string corresponding to the second half of the splice -
nmmR1A: The number of mismatches in the read; calculated by finding the number of times A,C,T or G appears in the MD flag -
nmmR1B: The number of mismatches in the read; calculated by finding the number of times A,C,T or G appears in the MD flag -
qualR1A: Mapping quality of the first portion of read 1. From the manual: "The mapping quality MAPQ (column 5) is 255 for uniquely mapping reads, and int(-10*log10(1-1/Nmap)) for multi-mapping reads" -
qualR1B: Mapping quality for the second portion of read 1. -
NHR1A: Number of reported alignments that contains the query in the current record -
NHR1B: Number of reported alignments that contains the query in the current record -
HIR1A: Query hit index, indicating the alignment record is the i-th one stored in SAM -
HIR1B: Query hit index, indicating the alignment record is the i-th one stored in SAM -
cigarR1A: The cigar string for portion A (for Chimeric, this is without the softclipped portion that corresponds to B; for Aligned, this is without the long N sequence marking the intron and everything after) -
cigarR2B: The cigar string for portion B -
juncPosR1A: The last position part A of read 1 aligns to before the junction. IffileTypeR1 = Chimeric: ifflagR1Ais 0 or 256, this is equal toposR1A +the sum of the M's, N's, and D's in the CIGAR string. IfflagR1Ais 16 or 272 this is equal toposR1A. IffileTypeR1 = Aligned, then this equalsposR1Aplus the sum of the M's, N's, and D's before the largest N value in the CIGAR string. -
juncPosR1B: The first position of part B of read 1 that aligns after the junction. IffileTypeR1 = Chimeric: ifflagR1Bis 0 or 256, this is equal toposR1B. IfflagR1Bis 16 or 272 this is equal toposR1B +the sum of the M's, N's, and D's in the CIGAR string. IffileTypeR1 = Aligned, then this equalsposR1B. -
geneR1A: Gene that read 1 part A was aligned to. If no gene was annotated in that area, it's marked as "unknown". If multiple genes are annotated in this area, it's marked with all of those gene names concatenated with commas in between Example:Ubb,Gm1821. Also see the note on the annotation. -
geneR1B: Gene that read 1 part B was aligned to. -
chrR1A: Chromosome that read 1 part A was aligned to -
chrR1B: Chromosome that read 1 part B was aligned to -
read_strand_compatible: A 1 here indicates that the read strands are compatible, 0 indicates that they're not. This is 1 ifread_strandR1Adoesn't equalread_strandR2Aand 0 otherwise. Note that this is only based on the "A" part of the read; cross-reference withstrand_crossR1andstrand_crossR2to verify that all read strands are in accordance. This is not present for single-end reads. -
location_compatible: A 1 here indicates that the locations are compatible, a 0 indicates that they're not. IfreadClassR1isfus, this is 1. IfreadClassR1issc, this is 0. IfreadClassR1islinandread_strandR1is +, then this is 1 ifposR1A<posR2Aand 0 otherwise. IfreadClassR1islinandread_strandR1is -, then this is 1 ifposR1A>posR2Aand 0 otherwise. ifreadClassR1isrev, then this is 1 ifposR2Ais betweenposR1AandposR1Band 0 otherwise. Note that this is only based on the "A" part of read 2 for right now. This is not present for single-end reads. -
genomicAlignmentR1: Here 1 indicates that there is a genomic alignment for read 1, and 0 indicates that there is not. -
primaryR1A: -
primaryR1B: -
genomic_aScoreR1: the maximum alignment score of all genomic alignments of that read (NA ifgenomicAlignmentR1== 0) -
spliceDist: The absolute difference betweenjuncPosR1AandjuncPosR1B -
refName_newR1: the consistent disambiguated junction id obtained byrun_modify_classstep. All subsequent analyses on junctions are based on the ids in this column. -
gene_strandR1A: The strand that the gene atjuncPosR1Ais on (+or-); if there is no gene at that location, or there is a gene on both strands, this equals?. -
gene_strandR1B: The strand that the gene atjuncPosR1Bis on; if there is no gene at that location, or there is a gene on both strands, this equals?. *geneR1A_uniq: Gene name after disambiguation steps have been run on it (to try to narrow down to just one gene) -
geneR1B_uniq: Gene name after disambiguation steps have been run on it (to try to narrow down to just one gene) -
max_id_priority: The minimumHIR1Avalue for a givenid(used to split alignments betweenclass_inputandclass_input_secondary)
overlap_R1: junction overlap for R1 (the minimum of junction anchors)max_overlap_R1: maximum overlap for R1 (the maximum of junction anchors)median_overlap_R1: median ofoverlap_R1across all aligned reads for each junctionis.zero_nmm: indicates whether the R1 alignment is mismatch freeis.multimapping: indicates whether the alignment is multimappingnjunc_binR1A: junction noise score forR1Anjunc_binR1B: junction noise score forR1Bthreeprime_partner_number_R1: number of distinct 3' splice sites across the class input file for each 5' splice site in the class input filefiveprime_partner_number_R1: number of distinct 5' splice sites across the class input file for each 3' splice site in the class input filelength_adj_AS_R2: length ajusted alignment score for R2glm_per_read_prob: per-read score for each read alignment computed by GLMglm_per_read_prob_corrected: per-read score for each read alignment computed by GLM where per-read scores for anomalous reads are downscaledglmnet_per_read_prob: per-read score for each read alignment computed by GLMnetglmnet_per_read_prob_corrected: per-read score for each read alignment computed by GLMnet and per-read scores for anomalous reads are downscaledglmnet_per_read_prob_constrained: per-read score for each read alignment computed by constrained GLMnet modelglmnet_per_read_prob_corrected_constrained: per-read score for each read alignment computed by constrained GLMnet model and per-read scores for anomalous reads are downscaledglmnet_twostep_per_read_prob: the two-step per-read score for each chimeric readglmnet_twostep_per_read_prob_constrained: per-read score for each read alignment computed by twostep GLMnet with constrained cofficientsp_predicted_glm: aggregated score for each junction based onglm_per_read_probp_predicted_glm_corrected: aggregated score for each junction based onglm_per_read_prob_correctedp_predicted_glmnet: aggregated score for each junction based onglmnet_per_read_probp_predicted_glmnet_corrected: aggregated score for each junction based onglmnet_per_read_prob_correctedp_predicted_glmnet_constrained: aggregated score for each junction based onglmnet_per_read_prob_constrainedp_predicted_glmnet_corrected_constrained: aggregated score for each junction based onglmnet_per_read_prob_corrected_constrainedp_predicted_glmnet_twostep: aggregated score for each junction based onglmnet_twostep_per_read_probp_predicted_glmnet_twostep_constrained: aggregated score for each junction based onglmnet_twostep_per_read_prob_constrainedjunc_cdf_glm: cdf ofp_predicted_glmrelative to its null distribution by randomly assigning reads to junctionsjunc_cdf_glm_corrected: cdf ofp_predicted_glm_correctedrelative to its null distribution by randomly assigning reads to junctionsjunc_cdf_glmnet: cdf ofp_predicted_glmnetrelative to its null distribution by randomly assigning reads to junctionsjunc_cdf_glmnet_corrected: cdf ofp_predicted_glmnet_correctedrelative to its null distribution by randomly assigning reads to junctionsjunc_cdf_glmnet_constrained: cdf ofp_predicted_glmnet_constrainedrelative to its null distribution by randomly assigning reads to junctionsjunc_cdf_glmnet_corrected_constrained: cdf ofp_predicted_glmnet_corrected_constrainedrelative to its null distribution by randomly assigning reads to junctionsjunc_cdf_glmnet_twostep: cdf ofp_predicted_glmnet_twosteprelative to its null distribution by randomly assigning reads to junctionsgenomic_aScoreR1: the maximum alignment score of all genomic alignments of that read (NA ifgenomicAlignmentR1== 0)frac_genomic_reads: fraction of aligned reads for each junction that have also genomic alignmentfrac_anomaly: the fraction of the aligned reads for the junction that are anamolousave_min_junc_14mer: the average ofmin_junc_14meracross the reads aligned to the junctionave_max_junc_14mer: the average ofmax_junc_14meracross the reads aligned to the junctionave_AT_run_R1: the average ofAT_run_R1across the reads aligned to the junctionave_GC_run_R1: the average ofGC_run_R1across the reads aligned to the junctionave_max_run_R1: the average ofmax_run_R1across the reads aligned to the junctionp_val_median_overlap_R1: the p-value of the statistical test for comparing the median overlaps of the aligned reads to the junction against the null of randomly aligned reads and small p-values are desired as they indicate that the median_overlap of the junction is large enough.uniformity_test_pval: the p_value of the uniformity test for the junction overlap of the reads aligned to the junction computed by chisq.test. It is computed only for junctions with at most 15 reads and p-values close to 1 are desired as they indicate reads are uniformly distributed.sd_overlap: the standard deviation of the junction overlap for the reads aligned to the junction
geneR1B_ensembl: the gene ensembl id forgeneR1BgeneR1A_ensembl: the gene ensembl id forgeneR1AgeneR1B_name: the HUGO name forgeneR1BgeneR1A_name: the HUGO name forgeneR1AgeneR1A_expression: the gene counts (htseq counts) forgeneR1Aaccording to column V3 in1ReadsPerGene.out.tab(2ReadsPerGene.out.tabfor PE data)geneR1B_expression: the gene counts (htseq counts) forgeneR1Baccording to column V3 in1ReadsPerGene.out.tab(2ReadsPerGene.out.tabfor PE data)
This file is built by the GLM_script.R and contains all junction level metrics including aggregated p_predicted and junc_cdf scores and other alignment qualities at the junction level. Specifically GLM_output.txt contains emperical p-values computed by comparing the junc_cdf scores obtained via various varioations of the GLM(net) model against the distribution of those scores in "Bad junctios", junctions with at least 10% of reads with genomic alignment. There are currently the following empirical p-values in the GLM output file:
emp.p_glm: obtained based onjunc_cdf_glmemp.p_glmnet: obtained based onjunc_cdf_glmnetemp.p_glmnet_constrained: obtained based onjunc_cdf_glmnet_constrainedemp.p_glm_corrected: obtained based onjunc_cdf_glm_correctedemp.p_glmnet_corrected: obtained based onjunc_cdf_glmnet_correctedemp.p_glmnet_corrected_constrained: obtained based onjunc_cdf_glmnet_corrected_constrained
The following columns from the class input file is written in GLM_output.txt:
refName_newR1, geneR1B_uniq, geneR1A_uniq, geneR1B_ensembl, geneR1A_ensembl, readClassR1, geneR1A, geneR1B, geneR1A_expression_unstranded, geneR1A_expression_stranded, geneR1B_expression_unstranded, geneR1B_expression_stranded, is.STAR_Chim, is.STAR_SJ, is.STAR_Fusion, numReads, median_overlap_R1, sd_overlap, njunc_binR1A, njunc_binR1B, threeprime_partner_number_R1, fiveprime_partner_number_R1,p_predicted_glm, junc_cdf_glm, p_predicted_glm_corrected, junc_cdf_glm_corrected, p_predicted_glmnet, junc_cdf_glmnet, p_predicted_glmnet_corrected, junc_cdf_glmnet_corrected, p_predicted_glmnet_twostep, junc_cdf_glmnet_twostep, frac_genomic_reads, frac_anomaly, ave_min_junc_14mer, ave_max_junc_14mer, ave_AT_run_R1, ave_GC_run_R1, ave_max_run_R1, ave_AT_run_R2, ave_GC_run_R2, ave_max_run_R2, p_val_median_overlap_R1, uniformity_test_pval, p_predicted_glmnet_constrained, junc_cdf_glmnet_constrained, p_predicted_glmnet_corrected_constrained, junc_cdf_glmnet_corrected_constrained, p_predicted_glmnet_twostep_constrained, seqR1, seqR2
The following columsn are added from the STAR output file for "Aligned" junctions (SJ.out.tab):
intron_motif, is.annotated, num_uniq_map_reads, num_multi_map_reads, maximum_SJ_overhang,
If run_compare is set to TRUE, junctions in both class input files are comapred with the junctions in the STAR output files SJ.out.tab and Chimeric.out.junction. Currently, the comaprison is only for junctional R1 reads. This step adds 3 columns to the class input file: is.STAR_Chim, is.STAR_SJ, and is.STAR_Fusion.
Also, the following 4 files will be written at the end of this module:
in_star_chim_not_in_classinput_priority_Align.txt: chimeric junctions in the STARChimeric.out.junctionfile that cannot be found inclass_input_priorityAlign.tsvin_star_chim_not_in_classinput_priority_Chim.txt: chimeric junctions in the STARChimeric.out.junctionfile that cannot be found inclass_input_priorityChimeric.tsvin_star_SJ_not_in_classinput_priority_Align.txt: splice junctions in the STARSJ.out.tabfile that cannot be found inclass_input_priorityAlign.tsvin_star_SJ_not_in_classinput_priority_Chim.txt: chimeric junctions in the STARSJ.out.tabfile that cannot be found in class_input_priorityChimeric.tsv
There is a file called wrapper.log in the folder for every pipeline run, as well as for every sample. The goal of these files it to make it easier to look at the output from the jobs you submit with the pipeline by collecting it all in the same place. For example, the folder output/GSE109774_colon_cSM_10_cJOM_10_aSJMN_0_cSRGM_0 will contain a wrapper.log file which has the .out and .err files concatenated for every job and every sample in that run; these outputs are sorted by job type (so the outputs for the mapping jobs for each sample will be next to each other, etc). There is also a wrapper.log file in each sample sub-folder; for example, output/GSE109774_colon_cSM_10_cJOM_10_aSJMN_0_cSRGM_0/SRR6546273 will contain this file. It contains the output for all .out and .err outputs from all the jobs run on that specific sample. The wrapper.log files are rewritten every time the pipeline is run on a sample.
Should our pipeline be based on independent alignment of both reads, or alignment in the paired-end read mode? This will depend on which is better for detecting junctions.
For the processed files:
splice_ann: both sides of the junction are annotated as exon boundaries and the junction itself is found in the GTF; subset of both_ann
both_ann: Both sides of the junction are annotated as exon boundaries; this is equivalent to exon_annR1A AND exon_annR1B
just_both_ann: Both sides of the junction are annotated as exon boundaries, but the splice is not annotated; just_both_ann UNION splice_ann = both_ann
exon_annR1A: The exon on the first half of the junction is at an annotated boundary
exon_annR1B: The exon on the second half of the junction is at an annotated boundary
one_ann: Exactly one of exon_annR1A and exon_annR1B is true
none_ann: Neither of the boundaries of the splice junction are annotated exon boundaries.
none_ann_known_gene: none_ann is TRUE and geneR1A_uniq is NOT unknown or blank
none_ann_unknown_gene: none_ann is TRUE and geneR1A_uniq IS unknown or blank