I am not following the story here. Could you explain, with an example, what the files would look like when running the pipeline from FASTQ files or from a count matrix? Thanks! |
|
Yep, sure. When you start from FASTQ files, you have to provide a samplesheet that contains sample labels and condition labels. In the contrast file you specify the condition labels that you want to compare, and these condition labels are then mapped to samples when the model for the differential analysis is specified. An example is the main screening test:
When you provide the count matrix directly, you do not need to give a samplesheet, so there is nowhere to provide condition labels. In this case you give:
I totally get that having a file that may be structured differently depending on the type of run is not ideal, but I think the other solutions are not ideal either. One way to make everything consistent would be to specify the contrast by sample names in all cases, but I guess this would be too radical. Anyway, I would be happy to discuss this further! |
|
What do you think about requiring a samplesheet too, even if a contrast file is provided? I would then make the FASTQ files optional, so only the sample ID and condition would have to be provided, and we could read the mapping of samples from the samplesheet |
|
I just want to be sure I’m following your reasoning: is the concern mainly about the pipeline being runnable in non-standard cases without a samplesheet, or about the possible inconsistency where the contrast file might contain either sample or condition labels depending on the context? Anyway, I’m fine with your proposal. It is actually a good time to make the change, since it would be a breaking one and we already have a major release coming up. My only concern is that making the FASTQ files non-compulsory would also affect the standard (and more common) use case, where the pipeline does require them. We could add an extra ad hoc check for the presence of FASTQ files, but obviously it is nicer to have this check directly at the level of the samplesheet parsing. |
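
The trade-off mentioned here can be sketched roughly as follows. This is only an illustration in Python, not the pipeline's actual samplesheet parsing; the function name `validate_samplesheet`, the column names `sample`, `condition`, and `fastq_1`, and the file name `counts.tsv` are all assumptions made for the example. The idea is that FASTQ paths become optional at the schema level, and a separate check requires them only when no count matrix is given.

```python
from typing import Optional


def validate_samplesheet(rows, count_matrix: Optional[str] = None):
    """Return a list of error messages for the given samplesheet rows.

    Hypothetical check: 'sample' and 'condition' are always required,
    while 'fastq_1' is required only when no count matrix is provided.
    """
    errors = []
    for i, row in enumerate(rows, start=1):
        for column in ("sample", "condition"):
            if not row.get(column):
                errors.append(f"row {i}: missing '{column}'")
        # The extra ad hoc check: FASTQ files are compulsory for a FASTQ run.
        if count_matrix is None and not row.get("fastq_1"):
            errors.append(f"row {i}: missing 'fastq_1' (no count matrix provided)")
    return errors


rows = [{"sample": "s1", "condition": "control"}]
errors_fastq_run = validate_samplesheet(rows)
errors_matrix_run = validate_samplesheet(rows, count_matrix="counts.tsv")
```

With this split, the same samplesheet is accepted for a count-matrix run and rejected for a FASTQ run, at the cost of one check that lives outside the schema.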
|
I am more concerned about having a contrast file that contains different columns depending on the run. I agree it's better for validation to check for FASTQ files when parsing the samplesheet, but I think it's the best trade-off in this situation. |
|
I see your point, and I agree. In that sense, my preference would be to simplify the pipeline by removing the condition column altogether, so that the contrast file always refers to samples rather than conditions. This might make the contrast IDs slightly less readable, but in many real use cases (like the one that prompted this PR), users typically run multiple contrasts on the same dataset, meaning the condition labels end up being one-to-one with the sample IDs anyway. That said, I’m aware this change could be quite disruptive compared to previous releases. How do you feel about it? |
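
As a rough way to see how often the one-to-one situation described here occurs in a given dataset, one could check a samplesheet along these lines. This is a hypothetical Python helper, not part of the pipeline; the function name and the column names `sample` and `condition` are assumptions.

```python
from collections import defaultdict


def conditions_are_one_to_one(samplesheet):
    """Return True if every condition label is used by exactly one sample."""
    samples_per_condition = defaultdict(set)
    for row in samplesheet:
        samples_per_condition[row["condition"]].add(row["sample"])
    return all(len(samples) == 1 for samples in samples_per_condition.values())


one_to_one = conditions_are_one_to_one(
    [
        {"sample": "s1", "condition": "day0"},
        {"sample": "s2", "condition": "day14"},
    ]
)
with_replicates = conditions_are_one_to_one(
    [
        {"sample": "s1", "condition": "control"},
        {"sample": "s2", "condition": "control"},
    ]
)
```

When the check returns True, dropping the condition column loses nothing except the more readable labels.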
|
I am not a user of this analysis, so I rely on your experience for this 😄 |
When running an MLE analysis from a precomputed count matrix and a contrast file, the pipeline fails. Since the changes in #252, the condition labels in the contrast file are mapped to sample names using the samplesheet, but the samplesheet can be omitted when the count matrix is provided.
This can be tested with an input like:
In this PR, the issue is fixed by assuming that in this case the contrast file contains sample labels instead of condition labels, hence omitting the mapping from conditions to samples.
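
The behaviour described above can be sketched roughly as follows. This is a minimal Python illustration, not the pipeline's actual Nextflow code; the function name `resolve_contrast`, the column names `sample` and `condition`, the contrast keys `reference` and `treatment`, and the comma-separated label format are all assumptions made for the example.

```python
from typing import Optional


def resolve_contrast(
    contrast: dict[str, str],
    samplesheet: Optional[list[dict[str, str]]] = None,
) -> dict[str, list[str]]:
    """Return the sample names behind each side of a contrast.

    All column and key names here are hypothetical.
    """
    resolved = {}
    for side in ("reference", "treatment"):
        labels = contrast[side].split(",")
        if samplesheet is None:
            # Count-matrix run without a samplesheet:
            # the labels are assumed to be sample names already.
            resolved[side] = labels
        else:
            # Run with a samplesheet: the labels are conditions,
            # mapped to samples through the samplesheet.
            resolved[side] = [
                row["sample"] for row in samplesheet if row["condition"] in labels
            ]
    return resolved


samplesheet = [
    {"sample": "s1", "condition": "control"},
    {"sample": "s2", "condition": "control"},
    {"sample": "s3", "condition": "treated"},
]

with_sheet = resolve_contrast(
    {"reference": "control", "treatment": "treated"}, samplesheet
)
without_sheet = resolve_contrast({"reference": "s1,s2", "treatment": "s3"})
```

Both calls resolve to the same samples; the only difference is whether the contrast names conditions or samples, which is the inconsistency discussed in the conversation.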
PR checklist
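- Make sure your code lints (`nf-core pipelines lint`).
- Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- Usage documentation in `docs/usage.md` is updated.
- Output documentation in `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).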