Add additional inputs to the samplesheet #647

Merged
nvnieuwk merged 19 commits into nf-core:dev from nvnieuwk:bam-input
Apr 16, 2025

Conversation

@nvnieuwk
Contributor

Fixes #644

This PR adds the possibility to supply BAM, CRAM, junctions and splice_junctions files in the samplesheet. When these files are provided, the STAR alignment step is skipped, which makes the pipeline faster and more efficient.

It is still advisable to pass the FASTQ files as well, since some tools depend on them. Would it make sense to add a BAM/CRAM -> FASTQ conversion to the pipeline to eliminate this requirement?
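To make the FASTQ advice concrete, here is a minimal Python sketch (not pipeline code) that reports which requested tools would lack their input for a samplesheet row without FASTQ files. The set of FASTQ-dependent tools is taken from the `docs/usage.md` text reviewed in this PR; the function name, the `bam` column name and the file names are assumptions made for illustration.

```python
import csv
import io

# Tools that need FASTQ input, per the docs/usage.md text in this PR.
FASTQ_DEPENDENT_TOOLS = {"salmon", "fusioninspector", "fusioncatcher"}


def tools_blocked_without_fastq(row, tools):
    """Return the requested tools that cannot run for a row lacking fastq_1.

    Hypothetical helper: the real pipeline may handle this differently.
    """
    has_fastq = bool((row.get("fastq_1") or "").strip())
    if has_fastq:
        return []
    return sorted(FASTQ_DEPENDENT_TOOLS & set(tools))


# Hypothetical samplesheet; the `bam` column name is an assumption.
SAMPLESHEET = """sample,strandedness,fastq_1,fastq_2,bam
with_fastq,forward,s1_R1.fastq.gz,s1_R2.fastq.gz,s1.bam
bam_only,forward,,,s2.bam
"""

rows = list(csv.DictReader(io.StringIO(SAMPLESHEET)))
requested = ["starfusion", "salmon", "fusioninspector"]

for row in rows:
    blocked = tools_blocked_without_fastq(row, requested)
    print(row["sample"], "blocked tools:", blocked)
```

A check of this kind could warn early, before any alignment-free run is started, that a BAM-only row will not produce output for every requested tool.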

I also fixed the flow a bit more here, since VCF_COLLECT didn't actually run before.

@github-actions

github-actions bot commented Apr 10, 2025

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 6536e6a

+| ✅ 219 tests passed       |+
#| ❔   2 tests were ignored |#
!| ❗   4 tests had warnings |!
Details

❗ Test warnings:

  • pipeline_todos - TODO string in nextflow.config: Update the field with the details of the contributors to your pipeline. New with Nextflow version 24.10.0
  • pipeline_todos - TODO string in ro-crate-metadata.json: "description": "…" (the field embeds the pipeline README, which still contains "TODO nf-core:" template text)
  • schema_lint - Input mimetype is missing or empty
  • local_component_structure - fusioninspector_workflow.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md

✅ Tests passed:

Run details

  • nf-core/tools version 3.2.0
  • Run at 2025-04-14 08:10:31

Contributor

@atrigila atrigila left a comment

Great work! I wasn't able to run a test on these changes yet but here are a few questions.

docs/usage.md Outdated
| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). | :white_check_mark: |
| `strandedness` | Strandedness: forward or reverse. | :white_check_mark: |
| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File must exist, has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". It's recommended to always provide the FASTQ file(s) because the pipeline will be able to create any missing files from these. The FASTQ files are required to run `salmon`, `fusioninspector` and `fusioncatcher`. | :grey_question: |
| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File must exist, has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | :x: |
Contributor

Isn't fastq_2 always required, or at least recommended? I am asking because of this comment: #613 (comment)

Contributor Author

Yes, you are right! I'll change it

Contributor

In this section:

```
// Check if the file is not a directory or is a URL and return whether it's empty or not
def is_url = ["https://", "ftp://", "http://"].findAll { it -> path.startsWith(it) }.size() > 0
if(is_url || !path_to_check.toFile().isDirectory()) {
    return !path_to_check.isEmpty()
}
```

I think this fails when you use an s3 directory as input file

Contributor Author

I've tested this and it works with s3 directories :). The reason this doesn't work for http(s) and ftp is because of the way these protocols work

Contributor

I also tested it, but it failed for me. Perhaps I used an incorrect command? Would you mind sharing how you ran it?

Contributor

I tried supplying a fasta from iGenomes, but it failed in this step.

Contributor Author

Strange... did you get an error?

Contributor

Downloading plugin nf-amazon@2.9.2
ERROR ~ Unexpected error [UnsupportedOperationException]

 -- Check script 'nf-core-rnafusion/./workflows/../subworkflows/local/build_references.nf' at line: 240 or see '.nextflow.log' file for more details

Contributor Author

Yeah I get that error too now 😓 Thanks for pointing this out! I'm currently adding some tests for this

Contributor Author

I fixed it and added tests, it should be okay now
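
For readers following this thread, the URL check quoted above can be paraphrased in Python to show which inputs skip the directory test. This is only a sketch of the prefix logic, not the pipeline's Groovy code, and the example paths are made up. An `s3://` path is not in the prefix list, so it reaches the `toFile()` call; in Java, `Path.toFile()` throws `UnsupportedOperationException` for paths outside the default filesystem provider, which would be consistent with the error reported above, although the thread does not confirm the cause.

```python
# Same prefixes as in the quoted Groovy snippet.
URL_PREFIXES = ("https://", "ftp://", "http://")


def is_url(path: str) -> bool:
    """Mirror of the quoted check: True only for http(s) and ftp paths."""
    return path.startswith(URL_PREFIXES)


# Hypothetical example paths, chosen for illustration only.
examples = [
    "https://example.org/genome.fasta",
    "s3://example-bucket/references/",
    "/data/references/",
]

for path in examples:
    if is_url(path):
        branch = "skips the directory test"
    else:
        branch = "reaches toFile().isDirectory()"
    print(f"{path}: {branch}")
```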

@atrigila
Contributor

I tested your code with:

nextflow run nf-core-rnafusion/ -profile test_build,docker --outdir here --input /workspace/nf-core-rnafusion/tests/yml/bam.yml  --references_only false --tools starfusion,fusionreport,salmon,fusioninspector

It failed due to the following error:

Missing output file(s) `*.txt` expected by process `RNAFUSION:QC_WORKFLOW:PICARD_COLLECTINSERTSIZEMETRICS (test)`

If you could open an issue about that error it would be great. Running it again with --skip_qc solved the above issue:

nextflow run nf-core-rnafusion/ -profile test_build,docker --outdir here --input /workspace/nf-core-rnafusion/tests/yml/bam.yml  --references_only false --tools starfusion,fusionreport,salmon,fusioninspector --skip_qc

This worked well, with only a final error related to #649.

The core functionality introduced in this PR (especially the BAM handling steps) worked as expected. It's great to see that the STAR alignment can now be skipped, which should help speed up CI tests considerably. It might be a good idea to eventually add a BAM/CRAM to FASTQ conversion step to the pipeline to remove the dependency on FASTQs. I would love to see CI tests covering these new changes soon.

I would consider this PR approved so it doesn’t block further development, but it’s of course open for comments and feedback from others.

//TODO: unify as if(tools.contains("fusioninspector")) once nextflow bug fixed
def run_fusioninspector = tools.contains("fusioninspector")
if(run_fusioninspector) {
if(run_fusioninspector && !params.skip_vcf) {
Collaborator

Would it make sense to have a dependency matrix in one place instead of spreading out the logic in multiple if statements that are harder to maintain? If you agree, we can open an issue (not for this PR)

Contributor Author

Yes I would love that! But that seems really complicated with this pipeline, so might be better for a next release?
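
As a rough illustration of the dependency matrix idea discussed above, the conditions could live in one table that a single function evaluates. The sketch below is in Python rather than the pipeline's Groovy, and the step names and the `blocked_by` field are hypothetical; only the `fusioninspector` tool and the `skip_vcf` parameter come from the quoted diff.

```python
# One place that records what each step needs.
# Step names are hypothetical; the conditions mirror the quoted
# `run_fusioninspector && !params.skip_vcf` check.
STEP_REQUIREMENTS = {
    "fusioninspector": {"tools": {"fusioninspector"}, "blocked_by": set()},
    "vcf": {"tools": {"fusioninspector"}, "blocked_by": {"skip_vcf"}},
}


def enabled_steps(tools, params):
    """Return the steps whose tool requirements are met and that no skip flag disables."""
    requested = set(tools)
    enabled = []
    for step, requirement in STEP_REQUIREMENTS.items():
        if not requirement["tools"] <= requested:
            continue  # a required tool was not requested
        if any(params.get(flag, False) for flag in requirement["blocked_by"]):
            continue  # a skip parameter disables this step
        enabled.append(step)
    return enabled


print(enabled_steps(["fusioninspector"], {"skip_vcf": False}))
print(enabled_steps(["fusioninspector"], {"skip_vcf": True}))
print(enabled_steps(["starfusion"], {}))
```

Adding a step then means adding one table entry instead of another scattered `if` statement, which is the maintainability gain the comment above is after.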

@nvnieuwk nvnieuwk linked an issue Apr 15, 2025 that may be closed by this pull request
@nvnieuwk nvnieuwk linked an issue Apr 15, 2025 that may be closed by this pull request
@nvnieuwk nvnieuwk merged commit 69f4ac6 into nf-core:dev Apr 16, 2025
11 checks passed
@nvnieuwk nvnieuwk deleted the bam-input branch April 16, 2025 07:57
@atrigila atrigila mentioned this pull request Sep 16, 2025
10 tasks

Development

Successfully merging this pull request may close these issues.

Allow BAM/CRAM input
reads_junction in STARFUSION_WORKFLOW should not join with reads

3 participants