v2.0.0: Refactor for sample-wise parameterisation#171

Open
nschan wants to merge 145 commits into nf-core:dev from nschan:refactor-assemblers

Conversation

@nschan (Collaborator) commented Jul 9, 2025

Updated Feb 04 2026

As suggested here, this is a full refactor of genomeassembler to support sample-level parameterisation of everything.

Why?
Often when doing genome assembly, we do not know what works best. With this change, the pipeline can run different settings on the same set of reads and compare the resulting assemblies. Samples that share the same value in group will be combined during reporting and preprocessing to facilitate comparisons of strategies on the same input(s).

Per-sample parameterisation

Initially, I implemented this poorly, based largely on join()-ing various maps back to other maps. Nextflow does not have a join operator for maps, so this was a big mess that turned out to be hard to read, annoying to write, and constantly blocking. To summarise: bad idea, cannot recommend. This attempt contributes to the large number of commits in this branch.

While contemplating my failure to implement sample-level parameterisation, I realised that the solution to this problem is "meta-stuffing", also referred to as "meta-smuggling" by @prototaxites, who seems to have arrived at a similar conclusion at around the same time.
In the purest form, this would implement everything in a single meta-map, which is in slot 0 of the channel traveling through the pipeline. I will use meta to refer to [0] of list-channels from here on.
Note: A pure meta-stuffing implementation would have required additional refactoring of some subworkflows, in particular of QC which takes more than one input channel, something I did not want to do.

How this works:
Everything goes into meta: params are turned into key/value pairs for each sample, unless the samplesheet contains a different value for that sample under the same key, in which case the samplesheet value is used. The result is a massive meta-map. Values are pulled from meta as required for channel inputs, and meta is recreated or updated from channel outputs. Combined with branch, filter and related operators, this largely enables flow control at the sample level. In some cases, joins cannot be avoided (or would need to be traded for concurrent processes), but I tried to minimise them.
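
To illustrate the meta construction described above, here is a minimal sketch in plain Groovy (no Nextflow channels). The parameter names, the samplesheet rows and the `buildMeta` helper are hypothetical and are not taken from the pipeline; the sketch only assumes that a non-null samplesheet value wins over the pipeline-wide default.

```groovy
// Hypothetical pipeline-wide defaults (stand-in for Nextflow `params`)
def params = [assembler: 'flye', polish: false, scaffold: false]

// Hypothetical samplesheet rows; null or missing means "not set for this sample"
def samplesheet = [
    [sample: 'A', assembler: 'hifiasm', polish: null],
    [sample: 'B', polish: true],
]

// Build one meta map per sample: start from params, let non-null row values win
def buildMeta = { Map defaults, Map row ->
    defaults + row.findAll { key, value -> value != null }
}

def metas = samplesheet.collect { row -> buildMeta(params, row) }

metas.each { meta -> println meta }
```

Every sample ends up with a value for every key, so downstream steps can read settings from meta without consulting global params.
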
This means that the pipeline is essentially free of if { } statements, since flow control is done via channels. This also means that the pipeline DAG is always rendered in full, irrespective of whether the nodes will actually be visited by a sample. It might be possible to optimise DAG rendering by inspecting all created meta-maps and creating some global variables based on their content, to again conditionally include subworkflows, but this is beyond the scope of this refactor.
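
The idea of replacing `if { }` with channel-level routing can be sketched with plain Groovy collections. Here `findAll` stands in for branch/filter and list concatenation stands in for mix; the `polish` key and the `polish` closure are hypothetical stand-ins, not the pipeline's actual names.

```groovy
// Hypothetical per-sample meta maps, as they might look after construction
def samples = [
    [meta: [id: 'A', polish: true]],
    [meta: [id: 'B', polish: false]],
    [meta: [id: 'C', polish: true]],
]

// Stand-in for a polishing step: marks the sample as polished
def polish = { sample -> [meta: sample.meta + [polished: true]] }

// "branch": split on a meta value instead of a global `if (params.polish)`
def toPolish = samples.findAll { sample -> sample.meta.polish }
def skipPolish = samples.findAll { sample -> !sample.meta.polish }

// "mix": recombine both routes so downstream steps see every sample
def result = toPolish.collect(polish) + skipPolish

result.each { sample -> println sample }
```

Both routes always exist in the code, which mirrors why the DAG is rendered in full even when no sample takes a given route.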

I made an effort to provide flexibility in combining assemblers, polishers, or scaffolding tools, as I thought it was reasonable, but this does not offer full-factorial combinations, which become especially tricky if things should happen in order.

Generally, I am very happy with this approach: it offers great flexibility and is surprisingly nice to write once meta has been constructed. Since global parameters can still be set, the samplesheet may or may not be wide. For a single sample, everything could be done via params, and the only column in the samplesheet would be sample. In the most ridiculous case of setting all parameters differently for each sample, the samplesheet could grow to around 50 columns.

Grouping

Grouping is implemented as an extension of "meta-smuggling", essentially smuggling multiple meta maps through a channel (inside meta), replacing the value of meta.id with the group id.
Here is the relevant code for grouping and un-grouping while maintaining meta-maps:

    some_channel
        // Move group information into channel, if it exists
        .filter { it -> it.meta.group }
        .map { it -> [it.meta, it.meta.group, it.meta.ontreads] }
        // Group by group
        .groupTuple(by: 1)
        // Collect all sample-meta into a group meta slot named metas
        // Use unique reads; user responsible to group correctly
        .map {
            it ->
                [
                    [
                        id: it[1], // the group
                        metas: it[0]
                    ],
                    it[2].unique()[0] // Ontreads
                ]
        }

After this input channel has been processed, the samples are recreated from meta[metas]:

    process.OUT
        // Take samples with metas in slot [0]
        .filter { it -> it[0].metas }
        .flatMap { it ->
            // $it looks like [meta, output_path]
            // recreate meta from metas and update path.
            it[0].metas
                  .collect { meta -> [
                                meta: meta - meta.subMap("ontreads") + [ontreads: it[1]]
                                ]
                            }
        }
        .mix(
            process.OUT
                .filter { it -> !it[0].metas }
        )
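
As a rough way to check the round trip outside of Nextflow, the same grouping and un-grouping logic can be sketched with plain Groovy lists. `groupBy` stands in for groupTuple and `collectMany` for flatMap; the sample values and the "trimmed_" prefix are made up for illustration.

```groovy
// Hypothetical input: two samples share reads via group 'g1', one is ungrouped
def samples = [
    [meta: [id: 's1', group: 'g1', assembler: 'flye',    ontreads: 'reads.fq']],
    [meta: [id: 's2', group: 'g1', assembler: 'hifiasm', ontreads: 'reads.fq']],
    [meta: [id: 's3', group: null, assembler: 'flye',    ontreads: 'other.fq']],
]

// Grouping: one entry per group, sample metas stored under `metas`
def grouped = samples
    .findAll { sample -> sample.meta.group }
    .groupBy { sample -> sample.meta.group }
    .collect { group, members ->
        [[id: group, metas: members*.meta], members*.meta*.ontreads.unique()[0]]
    }

// Stand-in for a preprocessing process that runs once per group
def processed = grouped.collect { meta, reads -> [meta, "trimmed_${reads}".toString()] }

// Un-grouping: recreate each sample meta and point it at the new reads
def restored = processed.collectMany { entry ->
    def (groupMeta, reads) = entry
    groupMeta.metas.collect { sampleMeta ->
        [meta: sampleMeta - sampleMeta.subMap(['ontreads']) + [ontreads: reads]]
    }
}

restored.each { sample -> println sample }
```

The per-sample settings (here `assembler`) survive the trip through the group, while the reads are processed only once.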

More

Since switching to meta-stuffing made things much easier, I have also added a HiC scaffolding subworkflow, and support for dorado polish in the polishing subworkflow (experimental, as dorado polish does not work reliably).

@nf-core-bot (Member) commented Jul 9, 2025

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@nschan changed the title from "Refactor for sample-wise parameterisation" to "v2.0.0: Refactor for sample-wise parameterisation" on Sep 4, 2025
@nschan (Collaborator, Author) commented Sep 4, 2025

Since starting this refactor, I noticed that when doing multiple assemblies from the same set of reads, it is kind of a waste to send those reads through preprocessing multiple times. I also figured that assemblies from the same set of reads are likely to be compared to each other. To reduce redundant work and make comparisons easier, there is now a group parameter that can be used to put samples using the same reads into a group. Note: putting samples with different inputs into the same group will give wrong results.
I also noticed over the course of refactoring that there is some information that simply needs to be passed to processes (mostly for config purposes), since fetching those values from params kind of defeats the idea of parameterising per-sample. Those get stuffed into meta as needed. Generally, I am storing all information in the channel-map during "transit" between processes. Overall, this results in a lot of not super-concise channel manipulation.
I have now run some more extensive tests, and overall the logic seems to work as intended.

@nvnieuwk (Contributor) left a comment

Hi, I've done my best, but I found it pretty hard to read the code in this pipeline, so I can't approve this at this point. I've left a few comments. Here are some more tips to help with readability:

  1. Don't use it in closures; set a variable name for each item in the channel entry instead (e.g. instead of .map { it -> ...} do .map { meta, file1, file2 -> ...}). This makes it easier for me to understand what is in the channel at that point and will make it easier for future you (and others) to work on the pipeline later.
  2. Use clearer variable names.
  3. Try to put some more comments above big code blocks with a short explanation of what this piece of code is for. (Especially on harder to understand pieces of code).
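
A small plain-Groovy sketch of tip 1, using a list of tuples as a stand-in for channel entries (the entry layout and file names are hypothetical):

```groovy
// Hypothetical channel content: [meta, reads, reference] tuples
def entries = [
    [[id: 'A'], 'a_reads.fq', 'a_ref.fa'],
    [[id: 'B'], 'b_reads.fq', 'b_ref.fa'],
]

// Harder to read: positions must be remembered
def withIt = entries.collect { it -> [it[0].id, it[2]] }

// Easier to read: each element of the entry gets a name
def withNames = entries.collect { meta, reads, reference -> [meta.id, reference] }

assert withIt == withNames
```

Both closures produce the same result; the named version documents the shape of the entry at that point in the code.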

But anyways, I'm still really impressed with what you've done here and this really will be a massive improvement to the pipeline!

Contributor

This file is not needed in nf-core pipelines, you can find all parameters on the website: https://nf-co.re/genomeassembler/dev/parameters/

docs/usage.md Outdated
> [!NOTE]
> The parameter names will be used in subsequent sections. Since all parameters can be provided per-sample or pipeline wide, no examples will be given.

The list of all parameters that can be provided globally is available [here](params.md), parameters that can be set per sample are provided at the [end of this page](#sample-parameters).
Contributor

Suggested change
- The list of all parameters that can be provided globally is available [here](params.md), parameters that can be set per sample are provided at the [end of this page](#sample-parameters).
+ The list of all parameters that can be provided globally is available [here](https://nf-co.re/genomeassembler/parameters/), parameters that can be set per sample are provided at the [end of this page](#sample-parameters).

nschan and others added 7 commits February 13, 2026 09:38
* Template update for nf-core/tools version 3.2.1

* Template update for nf-core/tools version 3.3.1

* merge template 3.3.1 - fix linting

* update pre-commit

* merge template 3.3.1 - fix linting

* pre-commit config?

* pre-commit config?

* reinstall links

* try larger runner

* smaller run, disable bloom filter for hifiasm test

* updated test snapshot

* updated test snapshot

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* Update .github/actions/nf-test/action.yml

Co-authored-by: Matthias Hörtenhuber <mashehu@users.noreply.github.com>

* Update docs/output.md

Co-authored-by: Matthias Hörtenhuber <mashehu@users.noreply.github.com>

* remove .nf-test.log

---------

Co-authored-by: Niklas Schandry <niklas@bio.lmu.de>
Co-authored-by: Matthias Hörtenhuber <mashehu@users.noreply.github.com>
@nschan nschan force-pushed the refactor-assemblers branch from 4e4e735 to 9bb1748 Compare February 13, 2026 08:40
4 participants