Skip to content

samples instead of conditions in contrast#260

Draft
matbonfanti wants to merge 3 commits intodevfrom
bugfix_mle_from_constrast_with_count_table
Draft

samples instead of conditions in contrast#260
matbonfanti wants to merge 3 commits intodevfrom
bugfix_mle_from_constrast_with_count_table

Conversation

@matbonfanti
Copy link

@matbonfanti matbonfanti commented Oct 9, 2025

When running a MLE analysis from a precomputed count matrix and a contrast file, the pipeline does not work because in the changes of #252 the condition labels in the contrast file should be mapped to sample names using the samplesheet, but the samplesheet when the count matrix is provided can be omitted.

This can be tested with an input like:

count_table: "https://raw.githubusercontent.com/nf-core/test-datasets/crisprseq/testdata/count_table.tsv"
analysis: 'screening'
contrasts: "https://raw.githubusercontent.com/nf-core/test-datasets/crisprseq/testdata/rra_contrasts.txt"
outdir: "output"
mle: true

In this PR, the issue is fixed by assuming that in this case the contrast file contains sample labels instead of condition labels, hence omitting the mapping from conditions to samples.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/crisprseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@matbonfanti matbonfanti added this to the 3.0.0 milestone Oct 9, 2025
@matbonfanti matbonfanti requested a review from mirpedrol October 9, 2025 10:22
@matbonfanti matbonfanti self-assigned this Oct 9, 2025
@matbonfanti matbonfanti added the bug Something isn't working label Oct 9, 2025
@github-actions
Copy link

github-actions bot commented Oct 9, 2025

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 9caf462

+| ✅ 278 tests passed       |+
#| ❔   4 tests were ignored |#
!| ❗  21 tests had warnings |!
Details

❗ Test warnings:

  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • local_component_structure - find_adapters.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - gpt_prepare_query.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - alignment_summary.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - clonality_classifier.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - drugz.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - hitselection.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - orient_reference.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - preprocessing_summary.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - matricescreation.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - venndiagram.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - extract_umis.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - template_reference.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - cigar_parser.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - clustering_summary.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
  • local_component_structure - crisprseq_plotter.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 3.3.2
  • Run at 2025-10-09 11:58:04

@mirpedrol
Copy link
Member

I am not following the story here. Could you explain to me what the files would look like when running the pipelines from fastq files or from a count matrix with an example? Thanks!

@matbonfanti
Copy link
Author

yep, sure...

When you start from fastq files, you have to provide a sample sheet that contains sample labels and condition labels. With the contrast file you specify the condition labels that you want to compare, and the condition labels are then mapped to samples when it is required to specify the model for the differential analysis. An example is the main screening test:

When you start providing directly the count matrix, you do not need to give a sample sheet and so you do not have a place where to provide condition labels. In this case you give:

I totally get that having a file which may be structured differently depending on the type of run is not ideal, but I think that other solutions are not ideal, either. I think that one possibility to make everything consistent would be specifying the contrast by sample names in all the cases, but I guess this would be too radical.

Anyway, I would be happy to discuss this further!

@mirpedrol
Copy link
Member

What do you think if we require to provide a samplesheet too, even if a contrast file is provided? Then, I would make the fastq files not required, so only sample ID and condition will have to be provided, and we can read the mapping of samples form the samplesheet

@matbonfanti
Copy link
Author

I just want to be sure I’m following your reasoning: is the concern mainly about the pipeline being runnable in non-standard cases without a samplesheet, or about the possible inconsistency where the contrast file might contain either sample or condition labels depending on the context?

Anyway, I’m fine with your proposal. It is actually a good time to make the change, since it would be a breaking one and we already have a major release coming up.

My only concern is that making the FASTQ files non-compulsory would also affect the standard (and more common) use case, where the pipeline does require them. We could add an extra ad hoc check for the presence of FASTQ files, but obviously it is nicer to have this check directly at the level of the samplesheet parsing.

@mirpedrol
Copy link
Member

I am more concerned about having a contrast file that contains different columns depending on the run.
For example, if I run the pipeline once, obtain the counts matrix, and want to re-run it with some different parameters but starting from the count matrix, I can't reuse my contrasts file.

I agree it's better for validation to check for fastq files when parsing the samplesheet, but I think it's the best trade off in this situation.
We should anyway try to think of a better way to handle our samplesheet validation, since we are already being very permissive because we have to use the same JSON schema for both subworkflows of the pipeline. We can tackle this later in a different PR.

@matbonfanti
Copy link
Author

I see your point, and I agree.

In that sense, my preference would be to simplify the pipeline by removing the condition column altogether, so that the contrast file always refers to samples rather than conditions. This might make the contrast IDs slightly less readable, but in many real use cases (like the one that prompted this PR), users typically run multiple contrasts on the same dataset, meaning the condition labels end up being one-to-one with the sample IDs anyway.

That said, I’m aware this change could be quite disruptive compared to previous releases. How do you feel about it?

@mirpedrol
Copy link
Member

I am not a user of this analysis, so I rely on your experience for this 😄
It sounds good to me to get rid of the "conditions" column, and modify how the contrast file looks like. If you think it will be too disruptive, we could try posting on the slack channel to get people's opinion (hopefully they receive a notification and read it)

@matbonfanti matbonfanti marked this pull request as draft October 23, 2025 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants