Refactor comparison feature generation #214

@riley-harper

There is a really big if statement in hlink.linking.core.comparison_feature.generate_comparison_feature. This is another good opportunity to break each branch out into its own piece of code to make things simpler. Like #205, this could lead to custom comparison feature types, but let's start with just refactoring. Comparison features are more complicated than column mappings, though: generating one requires more than one operation.

My first thought is that to generate a comparison feature we will need to define three operations.

  1. Extract sub-features. Some comparison features have sub-features which need to be recursively generated first.
  2. Compute SQL. Compute the SQL string to output for this comparison feature. Right now we do a lot of string interpolation here. We could use something more similar to an expression API instead. But that's probably best left as a separate issue.
  3. Determine input columns. This step is necessary for a bit of code in matching's explode step. In my mind we should just carry along all of the columns each time, and so we shouldn't need to explicitly specify these columns. Spark should handle that for us. But as it stands we will need to have this third operation to support matching.

Operation 3 is questionable. Right now matching has its own code to inspect the comparison features and determine the input columns itself. So it may be best to leave that be for now, then come back later to try to remove that code from matching. Operations 1 and 2 feel solid.
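To make operations 1 and 2 concrete, here is a rough sketch of what the split might look like. Everything in it is hypothetical and for illustration only: the `ComparisonType` protocol, the `Equals`, `AbsDiff`, and `AllOf` classes, the `REGISTRY` lookup, the `generate_sql` function, and the `column_name` / `sub_features` config keys are all made-up stand-ins, not hlink's actual comparison types or config format. The `a.` and `b.` table aliases in the SQL are likewise an assumption.

```python
from typing import Any, Protocol

# Hypothetical: a comparison feature config is just a dictionary here.
FeatureConfig = dict[str, Any]


class ComparisonType(Protocol):
    """Hypothetical interface covering operations 1 and 2."""

    def sub_features(self, feature: FeatureConfig) -> list[FeatureConfig]:
        """Operation 1: return the sub-features to generate first."""
        ...

    def sql(self, feature: FeatureConfig, sub_sql: list[str]) -> str:
        """Operation 2: build the SQL, given the SQL of each sub-feature."""
        ...


class Equals:
    """Leaf feature: is the column equal in datasets a and b?"""

    def sub_features(self, feature: FeatureConfig) -> list[FeatureConfig]:
        return []

    def sql(self, feature: FeatureConfig, sub_sql: list[str]) -> str:
        col = feature["column_name"]
        return f"a.{col} = b.{col}"


class AbsDiff:
    """Leaf feature: absolute difference of the column between a and b."""

    def sub_features(self, feature: FeatureConfig) -> list[FeatureConfig]:
        return []

    def sql(self, feature: FeatureConfig, sub_sql: list[str]) -> str:
        col = feature["column_name"]
        return f"abs(a.{col} - b.{col})"


class AllOf:
    """Composite feature: true when every sub-feature is true."""

    def sub_features(self, feature: FeatureConfig) -> list[FeatureConfig]:
        return list(feature["sub_features"])

    def sql(self, feature: FeatureConfig, sub_sql: list[str]) -> str:
        return " AND ".join(f"({s})" for s in sub_sql)


# The lookup table replaces the big if statement: one entry per branch.
REGISTRY: dict[str, ComparisonType] = {
    "equals": Equals(),
    "abs_diff": AbsDiff(),
    "all_of": AllOf(),
}


def generate_sql(feature: FeatureConfig) -> str:
    """Recursively generate sub-features first, then this feature's SQL."""
    comparison_type = REGISTRY[feature["comparison_type"]]
    sub_sql = [generate_sql(sub) for sub in comparison_type.sub_features(feature)]
    return comparison_type.sql(feature, sub_sql)


feature = {
    "comparison_type": "all_of",
    "sub_features": [
        {"comparison_type": "equals", "column_name": "sex"},
        {"comparison_type": "equals", "column_name": "bpl"},
    ],
}
print(generate_sql(feature))  # (a.sex = b.sex) AND (a.bpl = b.bpl)
```

The driver function stays tiny because the recursion lives in one place, and supporting a custom comparison feature type later would amount to adding an entry to the lookup table. Operation 3 is left out here on purpose, since matching currently works out the input columns on its own.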
