[FR] Allow inequality conditions in block_by #100

@etiennebacher

Is your feature request related to a problem? Please describe.
It is nice to be able to use block_by to filter out some comparisons before computing the string similarity. Currently, it is limited to equality conditions (e.g., rows must have the same year to be considered for matching). In my setting, I don't want to compare rows if year.y is larger than year.x, i.e., I'd like to block on the condition year.x > year.y.

I don't know how hard that would be to implement. I was able to trace block_by (renamed salt in the internals) as far as the new() method of the Shingleset struct in Rust, but I don't understand exactly how it works from there.

Describe the solution you'd like
I'm not sure what syntax this should take. It could be something similar to dplyr::join_by(), so that one could write block_by = block_by(year.x > year.y). Note that this only works if the variable has a different name in each dataset.

Describe alternatives you've considered
Currently, I don't use block_by. Instead, I match on the full data and then filter the results by my condition.
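
To make the difference concrete, here is a minimal Python sketch of the two approaches. It is not zoomerjoin code: every function name is hypothetical, the data is made up, and similarity is a plain Jaccard score over character bigrams computed on all pairs rather than the LSH scheme the package uses. It only illustrates that checking the inequality before scoring gives the same matches with fewer comparisons.

```python
from itertools import product


def shingles(text, width=2):
    """Return the set of character n-grams of `text` (lower-cased)."""
    text = text.lower()
    return {text[i:i + width] for i in range(len(text) - width + 1)}


def jaccard(a, b):
    """Jaccard similarity of two strings over their character bigrams."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def match_then_filter(left, right, threshold, keep):
    """Current workaround: score every pair, then drop pairs failing `keep`."""
    compared, matches = 0, []
    for x, y in product(left, right):
        compared += 1  # similarity is computed for every pair
        if jaccard(x["name"], y["name"]) >= threshold and keep(x, y):
            matches.append((x["name"], y["name"]))
    return matches, compared


def block_then_match(left, right, threshold, keep):
    """Requested behaviour: skip pairs failing `keep` before scoring them."""
    compared, matches = 0, []
    for x, y in product(left, right):
        if not keep(x, y):
            continue  # blocked: no similarity computed for this pair
        compared += 1
        if jaccard(x["name"], y["name"]) >= threshold:
            matches.append((x["name"], y["name"]))
    return matches, compared


def year_condition(x, y):
    """Hypothetical blocking condition, mirroring year.x > year.y."""
    return x["year"] > y["year"]


# Made-up example data.
left = [
    {"name": "Acme Corp", "year": 2005},
    {"name": "Globex Inc", "year": 1999},
]
right = [
    {"name": "Acme Corp.", "year": 2001},
    {"name": "Globex Inc.", "year": 2003},
]

slow, n_slow = match_then_filter(left, right, 0.6, year_condition)
fast, n_fast = block_then_match(left, right, 0.6, year_condition)

print(slow == fast)    # True
print(n_slow, n_fast)  # 4 2
```

Both approaches return the same single match here, but the second one scores only two of the four candidate pairs. On large datasets with a selective condition, that saving is what I am after.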

Additional context
/
