Description
Is your feature request related to a problem? Please describe.
It is nice to be able to use block_by to filter out some comparisons before computing the string similarity. Currently, it is limited to equality conditions (e.g. rows must have the same year to be considered for matching). In my setting, I don't want to compare rows if year.y is larger than year.x, i.e. I'd like to block matching by the condition year.x > year.y.
I don't know how hard that would be to implement. I could trace the usage of block_by (renamed salt in the internals) as far as the new() method of the Shingleset struct in Rust, but I don't understand exactly how it works from there.
Describe the solution you'd like
I'm not sure what syntax this should take. It could be something similar to dplyr::join_by(), so that one could write block_by = block_by(year.x > year.y). Note that this form only works if the variable has a different name in each dataset.
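To illustrate the behaviour I have in mind, here is a rough sketch in Rust. It is not how the package works internally: it uses a naive all-pairs loop with bigram Jaccard similarity instead of the hashing done in Shingleset, and the names Record, blocked_match and keep are hypothetical, made up for this example. The only point is that the blocking predicate is checked before the similarity is computed, so excluded pairs cost nothing.

```rust
use std::collections::HashSet;

/// Hypothetical record type: a name to fuzzy-match on and a year to block on.
#[derive(Debug, Clone)]
pub struct Record {
    pub name: String,
    pub year: i32,
}

/// Character bigrams of a string, lowercased.
fn bigrams(s: &str) -> HashSet<(char, char)> {
    let chars: Vec<char> = s.to_lowercase().chars().collect();
    chars.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Jaccard similarity between two bigram sets (0.0 when both are empty).
fn jaccard(a: &HashSet<(char, char)>, b: &HashSet<(char, char)>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Returns index pairs (i, j) such that `keep(&xs[i], &ys[j])` holds and the
/// names have Jaccard similarity >= `threshold`.
/// Assumption: a brute-force loop stands in for the real matching algorithm.
pub fn blocked_match<F>(
    xs: &[Record],
    ys: &[Record],
    threshold: f64,
    keep: F,
) -> Vec<(usize, usize)>
where
    F: Fn(&Record, &Record) -> bool,
{
    // Precompute the bigram sets once per record.
    let x_sets: Vec<_> = xs.iter().map(|r| bigrams(&r.name)).collect();
    let y_sets: Vec<_> = ys.iter().map(|r| bigrams(&r.name)).collect();

    let mut out = Vec::new();
    for (i, x) in xs.iter().enumerate() {
        for (j, y) in ys.iter().enumerate() {
            // Blocking step: skip the pair before any similarity is computed.
            if !keep(x, y) {
                continue;
            }
            if jaccard(&x_sets[i], &y_sets[j]) >= threshold {
                out.push((i, j));
            }
        }
    }
    out
}
```

With this sketch, the condition from the example above would be passed as `|x, y| x.year > y.year`, and an equality block as `|x, y| x.year == y.year`.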
Describe alternatives you've considered
Currently I don't use block_by. Instead, I match on the full data and then filter the results by my condition.
Additional context
/