This pipeline was created as a project for the course "Computational Workflows". It aims to provide minimal functionality for the analysis of RNA-seq data while being reproducible and modular. The pipeline is inspired by, and uses code from, the nf-core/rnaseq pipeline (version 3.16.0).
- The FastQC module of nf-core is used to generate quality reports of the input reads.
- The nf-core module for Trim Galore is used to trim the raw reads.
- HISAT2-build and HISAT2-align are used to index the reference genome (.fasta and .gtf files) and to align the reads to the reference.
- The SAMtools-sort module is used to sort the alignment output (.bam) of HISAT2. This step is required for the subsequent feature counting.
- The nf-core subread/featurecounts module is used to count the alignments per genomic feature.
The pipeline can be run with the following parameters:
nextflow run main.nf \
    --input <PATH_TO_SAMPLESHEET.csv> \
    --fasta <PATH_TO_REFERENCE_GENOME.fa> \
    --gtf <PATH_TO_GTF.gtf> \
    --outdir <PATH_TO_OUT_DIRECTORY> \
    --threads <#_OF_THREADS> \
    --ram <#_OF_RAM_GB> \
    -profile <conda/docker>
- --input: path to the samplesheet. An example samplesheet can be found in the tests folder; it should follow this format:
sample,fastq_1,fastq_2,strandedness
SRR23195516_oxy_sni,./tests/data/reduced_SRX19144486_SRR23195516_1.fastq.gz,./tests/data/reduced_SRX19144486_SRR23195516_2.fastq.gz,auto
SRR23195511_oxy_sham,./tests/data/reduced_SRX19144488_SRR23195511_1.fastq.gz,./tests/data/reduced_SRX19144488_SRR23195511_2.fastq.gz,auto
- --fasta: path to the reference genome FASTA file.
- --gtf: path to the reference genome's feature annotation (GTF) file.
- --outdir: path to the folder where all outputs are stored.
- --threads: maximum number of threads to use for the computation.
- --ram: maximum amount of RAM to use during the computation (for example, 8GB).
- -profile: profile to run the pipeline with (conda or docker).
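A malformed samplesheet is an easy way to break a run, so a quick check before launching can help. The following is a minimal sketch, assuming the four-column header shown above; the function name check_samplesheet is hypothetical and not part of the pipeline, and the check does not verify that the FASTQ paths exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): sanity-check a samplesheet.
# Assumes the header and four-column layout shown in the example above.
check_samplesheet() {
    local sheet="$1"
    local expected="sample,fastq_1,fastq_2,strandedness"

    # The first line must match the expected header exactly
    # (a trailing carriage return from Windows editors is ignored).
    if [ "$(head -n 1 "$sheet" | tr -d '\r')" != "$expected" ]; then
        echo "unexpected header in $sheet" >&2
        return 1
    fi

    # Every non-empty data row must have exactly four comma-separated fields.
    awk -F',' 'NR > 1 && NF > 0 && NF != 4 {
        printf "line %d: expected 4 fields, found %d\n", NR, NF > "/dev/stderr"
        bad = 1
    }
    END { exit bad + 0 }' "$sheet"
}
```

Called as check_samplesheet ./tests/samplesheet_test.csv, the function returns a non-zero status and prints the offending line numbers to stderr if the header or any row deviates from the expected layout.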
Example command to run the pipeline with the test files included in the tests directory. The reference genome files (FASTA and GTF) are not part of the GitHub repository because they are too large. They can be downloaded from NCBI and must be extracted before the run.
nextflow run main.nf \
    --input ./tests/samplesheet_test.csv \
    --fasta ./tests/data/GCF_000001635.27_GRCm39_genomic.fna \
    --gtf ./tests/data/GCF_000001635.27_GRCm39_genomic.gtf \
    --outdir ./tests/out_test/ \
    --threads 12 \
    --ram 8GB \
    -profile docker
The output of all pipeline steps is written to the folder specified with --outdir. Example output files can be found in the tests/out_test/ folder.
- FastQC generates quality reports in HTML format.
- Trim Galore generates FASTQ files with the trimmed reads, as well as trimming report files.
- The HISAT2 alignments and summary files are stored in the hisat2 folder.
- The sorted alignment files generated by SAMtools can be found in the samtools folder.
- The feature counts from Subread featureCounts can be found in the counts folder.
- The versions of all tools used via the nf-core modules are listed in the versions/versions.yml file.
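After a run, it can be useful to confirm that the expected results were actually written. The sketch below checks only the locations named in the list above (hisat2, samtools, counts and versions/versions.yml); the FastQC and Trim Galore folder names are not stated here, so they are left out rather than guessed. The function name check_outputs is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): report which of the
# output locations named above are missing from an --outdir folder.
check_outputs() {
    local outdir="${1%/}"   # strip one trailing slash, if any
    local missing=0
    local path

    for path in hisat2 samtools counts versions/versions.yml; do
        if [ ! -e "$outdir/$path" ]; then
            echo "missing: $outdir/$path" >&2
            missing=1
        fi
    done

    # Return 0 only if every expected location exists.
    return "$missing"
}
```

For the test run shown earlier, this would be called as check_outputs ./tests/out_test/; a non-zero exit status means at least one of the listed locations was not found.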