This is a pipeline to analyze next-generation sequencing of small RNAs in C. elegans. The pipeline can be broken down into two major parts:
- Generate count matrices. Trims and maps reads to the C. elegans genome, then generates count matrices of the number of reads mapping antisense to each gene. This first part of the pipeline is designed to run in a high-performance computing cluster based on Linux and Slurm.
  - `generate_count_matrices.sh` is the main file for Part 1 and calls all the other Part 1 scripts. On line 6 of `generate_count_matrices.sh`, specify the full pathname of your project directory:

        # Assign a variable to the pathname of the project (this is the main directory). Change this to fit your own path.
        main_dir=<project pathname>

  - Then execute lines 8-16 of `generate_count_matrices.sh` to generate the following directory structure:

        project name
        ├── logs
        ├── meta
        ├── raw_data
        ├── results
        └── scripts

  - Before continuing on with the rest of `generate_count_matrices.sh`, make sure that:
    - You've copied your raw, demultiplexed fastq files into the `raw_data` directory.
    - All Part 1 scripts are in the `scripts` directory.
    - Your metadata file `metadata.txt` is in the `meta` directory. Column 1 of `metadata.txt` must contain the desired output filename, and there must also be a column containing the input filename. See `metadata.txt` in this repository for an example.
  - Note, this pipeline assumes the reads contain a 4-nucleotide-long barcode at the 5' end. If your reads do not contain a 5' barcode and instead begin immediately with the insert, make the following two changes:
    - Change line 36 in `select_5prime_barcode.sh` from:

          grep -B 1 -A 2 -e ^$barcode1 -e ^$barcode2 $input_path$input_file | sed '/^--/d' > $output_path$new_name

      to:

          cp $input_path$input_file $output_path$new_name

      With this change, running `select_5prime_barcode.sh` will simply assign new, meaningful names to the fastq files using `metadata.txt` and place them in a new directory in `results` called `sort_5prime`.
    - Change line 23 in `trim_5prime.sh` from:

          cutadapt -u 4 -o $output $1 > ${2}/logs/trim5/${base}.txt

      to:

          cp $1 $output

      With this change, running `trim_5prime.sh` will simply copy the fastq files into a new directory in `results` called `trim3_trim5` and add "_trim5" to the end of each filename.
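
The Part 1 setup described above (create the directory skeleton, then confirm the prerequisites before running the rest of `generate_count_matrices.sh`) can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of lines 8-16 of the script: the `check_project` helper, the temporary project path, and the `*.fastq*` filename pattern are all hypothetical, and only the three Part 1 scripts named in this README are checked.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical project location; in the pipeline this is the main_dir you set yourself.
main_dir="$(mktemp -d)/my_project"

# Create the five subdirectories of the project skeleton.
for sub in logs meta raw_data results scripts; do
  mkdir -p "$main_dir/$sub"
done

# check_project DIR: print each unmet prerequisite; return 1 if any is unmet.
check_project() {
  dir=$1
  missing=0
  # Assumption: raw files match *.fastq* (for example .fastq or .fastq.gz).
  if ! ls "$dir"/raw_data/*.fastq* > /dev/null 2>&1; then
    echo "missing: fastq files in raw_data"
    missing=1
  fi
  # The metadata file must exist and be non-empty.
  if [ ! -s "$dir/meta/metadata.txt" ]; then
    echo "missing: meta/metadata.txt"
    missing=1
  fi
  # Only the Part 1 scripts named in this README are checked here.
  for script in generate_count_matrices.sh select_5prime_barcode.sh trim_5prime.sh; do
    if [ ! -f "$dir/scripts/$script" ]; then
      echo "missing: scripts/$script"
      missing=1
    fi
  done
  return "$missing"
}

# A freshly created skeleton is not ready yet, so this reports what is missing.
check_project "$main_dir" || echo "project is not ready yet"
```

Running a check like this before submitting jobs to Slurm can catch a misplaced metadata file early, rather than after the first job fails.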
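
To make the 5' barcode handling concrete, the sketch below applies the same `grep ... | sed` filter quoted above to a three-read toy FASTQ file, then removes the first 4 nucleotides of each kept read. The barcode values `AGCG` and `CGTC`, the read contents, and the file names are invented for illustration. The `awk` step is only a stand-in for `cutadapt -u 4` so that the example runs without cutadapt installed; it is not what `trim_5prime.sh` does.

```shell
#!/usr/bin/env bash
set -eu

workdir="$(mktemp -d)"

# Toy FASTQ: read1 and read3 start with a barcode, read2 does not.
cat > "$workdir/toy.fastq" <<'EOF'
@read1
AGCGTTTTGGGGAAAACCCC
+
IIIIIIIIIIIIIIIIIIII
@read2
TTTTAGCGGGGGAAAACCCC
+
IIIIIIIIIIIIIIIIIIII
@read3
CGTCAAAATTTTGGGGCCCC
+
IIIIIIIIIIIIIIIIIIII
EOF

# Hypothetical barcode values.
barcode1=AGCG
barcode2=CGTC

# Keep each sequence line that starts with a barcode, plus the header line
# before it (-B 1) and the "+" and quality lines after it (-A 2).
# grep prints "--" between non-adjacent matches; sed removes those separators.
grep -B 1 -A 2 -e "^$barcode1" -e "^$barcode2" "$workdir/toy.fastq" \
  | sed '/^--/d' > "$workdir/selected.fastq"

# Stand-in for `cutadapt -u 4`: drop the first 4 characters of the sequence
# line (2nd of every 4 lines) and of the quality line (4th of every 4 lines).
awk 'NR % 4 == 2 || NR % 4 == 0 { print substr($0, 5); next } { print }' \
  "$workdir/selected.fastq" > "$workdir/trimmed.fastq"

cat "$workdir/trimmed.fastq"
```

One caveat of this filter, worth keeping in mind when adapting it: `sed '/^--/d'` deletes any line beginning with `--`, which could in principle include a quality string. With GNU grep, the `--no-group-separator` option avoids printing the separators in the first place.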
- Differential analysis and visualization. Uses the count matrices generated in Part 1 to perform a simple wild type vs. mutant analysis to identify genes that are differentially targeted by small RNAs. This part of the pipeline is designed to run as an RStudio project (`DA_and_visualization.Rproj`).
  - `main_script.R` is the main file for Part 2.
  - Before beginning, make sure the count matrices are in the `data` directory and that the metadata file (.csv format) is in the `meta` directory. Examples of these files can be found in the `example_files` directory.
  - Part 2 outputs include the following:
    - A table of normalized counts (median of ratios method)
    - A list of differentially targeted genes and their corresponding log2 fold changes and adjusted p-values
    - A biplot of the top two principal components determined by principal component analysis
    - A volcano plot of log2 fold change vs. significance, with labels for the top 10 significant genes
    - The option to plot normalized counts for any given gene (specified by WormBase Gene ID)
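
For reference, the median of ratios normalization named in the outputs above can be written out as below. This assumes the table comes from DESeq2's default size-factor estimation; the symbols are introduced here for illustration and do not appear in `main_script.R`. Here k_ij is the raw count for gene i in sample j, m is the number of samples, g_i is the geometric mean of gene i across samples, s_j is the size factor of sample j, and the last expression is the normalized count.

```latex
\hat{s}_j = \operatorname*{median}_{i \,:\, g_i > 0} \frac{k_{ij}}{g_i},
\qquad
g_i = \Bigl( \prod_{j'=1}^{m} k_{ij'} \Bigr)^{1/m},
\qquad
\tilde{k}_{ij} = \frac{k_{ij}}{\hat{s}_j}
```

Genes with a zero count in any sample have g_i = 0 and are left out of the median, so they do not influence the size factors.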
| Software | Version | Used in |
|---|---|---|
| gcc | 6.2.0 | Part 1: Generate count matrices |
| python | 2.7.12 | Part 1: Generate count matrices |
| cutadapt | 1.14 | Part 1: Generate count matrices |
| fastqc | 0.11.5 | Part 1: Generate count matrices |
| bowtie | 1.2.2 | Part 1: Generate count matrices |
| samtools | 1.9 | Part 1: Generate count matrices |
| deeptools | 3.0.2 | Part 1: Generate count matrices |
| featureCounts | 2.0.0 | Part 1: Generate count matrices |
| R | 3.5.1 | Part 2: Differential analysis and visualization |
| DESeq2 | 1.22.2 | Part 2: Differential analysis and visualization |
| tidyverse | 1.2.1 | Part 2: Differential analysis and visualization |
| ggrepel | 0.8.1 | Part 2: Differential analysis and visualization |