Refactor comparison feature generation

There is a very large if statement in `hlink.linking.core.comparison_feature.generate_comparison_feature`. This is another good opportunity to break each branch out into its own unit to make this code simpler. Like #205, this could lead to custom comparison feature types, but let's start with just refactoring. Comparison features are more complicated than column mappings and require more than one operation.
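As a rough illustration of what breaking out each branch could look like, here is a minimal sketch that swaps an if/elif chain for a lookup table of small functions. Everything in it is hypothetical: the `register` decorator, the `generate_sql` function, the `abs_diff` and `equals` type names, the `a.`/`b.` table aliases, and the config keys are assumptions made for the example, not hlink's actual code.

```python
from typing import Callable, Dict

# Hypothetical: a comparison feature config is treated as a plain dict.
FeatureConfig = Dict[str, str]
SqlGenerator = Callable[[FeatureConfig], str]

# Maps a comparison type name to the function that handles it.
_GENERATORS: Dict[str, SqlGenerator] = {}


def register(comparison_type: str) -> Callable[[SqlGenerator], SqlGenerator]:
    """Register a function as the handler for one comparison type."""

    def decorator(func: SqlGenerator) -> SqlGenerator:
        _GENERATORS[comparison_type] = func
        return func

    return decorator


@register("abs_diff")
def _abs_diff(feature: FeatureConfig) -> str:
    col = feature["column_name"]
    return f"abs(a.{col} - b.{col})"


@register("equals")
def _equals(feature: FeatureConfig) -> str:
    col = feature["column_name"]
    return f"a.{col} = b.{col}"


def generate_sql(feature: FeatureConfig) -> str:
    """Look up the handler for this feature's type instead of branching."""
    comparison_type = feature["comparison_type"]
    try:
        generator = _GENERATORS[comparison_type]
    except KeyError:
        raise ValueError(f"Unknown comparison type: {comparison_type}") from None
    return generator(feature)
```

With this shape, adding a comparison type means adding one decorated function rather than another branch, which is also what would make custom types possible later.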

My first thought is that generating a comparison feature will require three operations.

1. Extract sub-features. Some comparison features have sub-features which need to be recursively generated first.
2. Compute SQL. Compute the SQL string to output for this comparison feature. Right now we do a lot of string interpolation here. We could use something more similar to an expression API instead. But that's probably best left as a separate issue.
3. Determine input columns. This step is necessary for a bit of code in matching's explode step. In my mind we should just carry along all of the columns each time, and so we shouldn't need to explicitly specify these columns. Spark should handle that for us. But as it stands we will need to have this third operation to support matching.

Operation 3 is questionable. Right now matching has its own code to inspect the comparison features and determine the input columns itself. So it may be best to leave that be for now, then come back later to try to remove that code from matching. Operations 1 and 2 feel solid.
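To make the three operations concrete, here is a minimal sketch of one possible interface. The class and method names (`ComparisonFeature`, `sub_features`, `to_sql`, `input_columns`), the `Equals` and `AnyOf` example types, and the `a.`/`b.` table aliases are all hypothetical and only serve the illustration. Operation 3 has a default implementation here so that it could be dropped later without touching the other two.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Set


class ComparisonFeature(ABC):
    """Hypothetical base class covering the three proposed operations."""

    def sub_features(self) -> List["ComparisonFeature"]:
        # Operation 1: sub-features that must be generated first.
        return []

    @abstractmethod
    def to_sql(self) -> str:
        # Operation 2: the SQL string for this feature.
        ...

    def input_columns(self) -> Set[str]:
        # Operation 3: by default, the union of the sub-features' columns.
        columns: Set[str] = set()
        for sub in self.sub_features():
            columns |= sub.input_columns()
        return columns


@dataclass
class Equals(ComparisonFeature):
    """A leaf feature that reads a single column."""

    column: str

    def to_sql(self) -> str:
        return f"a.{self.column} = b.{self.column}"

    def input_columns(self) -> Set[str]:
        return {self.column}


@dataclass
class AnyOf(ComparisonFeature):
    """A composite feature built recursively from its sub-features."""

    parts: List[ComparisonFeature]

    def sub_features(self) -> List[ComparisonFeature]:
        return list(self.parts)

    def to_sql(self) -> str:
        return " OR ".join(f"({part.to_sql()})" for part in self.parts)


feature = AnyOf([Equals("namefrst"), Equals("namelast")])
sql = feature.to_sql()
columns = feature.input_columns()
```

The composite type never names its own input columns; it inherits them from its sub-features, which keeps operation 3 out of most implementations.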

Refactor comparison feature generation #214
