Added EM iterations to repair process, allow multiple init values, and select best init value as current via co-occurrence probability #32
Open
richardwu wants to merge 14 commits into HoloClean:dev from
How on earth are F1 and Recall > 1.0? See repairing F1 and repairing recall.
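For reference, a minimal sanity check for the repairing metrics, assuming the standard definitions (precision = correct repairs / repairs made, recall = correct repairs / total errors). The function name `repair_metrics` and its arguments are hypothetical and not taken from the HoloClean evaluation code. Under these definitions a value above 1.0 can only occur when the numerator and denominator are counting different sets of cells.

```python
def repair_metrics(num_correct_repairs, num_repairs, num_errors):
    """Compute repairing precision, recall and F1 from raw counts.

    Hypothetical helper: raises ValueError if any metric falls outside
    [0, 1], which signals inconsistent counting upstream.
    """
    precision = num_correct_repairs / num_repairs if num_repairs else 0.0
    recall = num_correct_repairs / num_errors if num_errors else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    for name, value in (("precision", precision), ("recall", recall), ("f1", f1)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value:.3f} is outside [0, 1]; check the counts")
    return precision, recall, f1
```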
Pushed up a patch to update single/co-occur stats after each EM iteration for … I attempted to do more iterations, but there is an issue with how we use …
Sounds good.
32ae5ef to fb01e01
(singular value, old 'init_value').
iterations for repair.
multiple init values by specifying init values in raw data separated by |||.
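A minimal sketch of how a raw cell carrying several init values might be split on the `|||` separator mentioned above. The helper name `split_init_values` and the whitespace handling are assumptions for illustration, not the code in this PR.

```python
INIT_VALUE_SEPARATOR = "|||"  # separator named in the commit message


def split_init_values(raw_cell):
    """Split a raw cell into its candidate init values.

    Hypothetical helper: trims whitespace and drops empty pieces, so a
    cell with a single value yields a one-element list.
    """
    if raw_cell is None:
        return []
    pieces = (piece.strip() for piece in raw_cell.split(INIT_VALUE_SEPARATOR))
    return [piece for piece in pieces if piece]
```

For example, `split_init_values("Birmingham|||Birminghm")` yields two candidates, while a cell without the separator yields a single one.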
fb01e01 to 39088f4
Newest results with this patch with fix to … Latest changes:
Ready for another review 👀
This PR introduces EM iterations to the repair process, with the current values updated after every iteration, and adds support for multiple init values:

- `current_value` and renamed from e.g. `InitFeaturizer` to `CurrentFeaturizer`
- `current_value`s in `cell_domain` with inferred values from `inf_vals_dom`
- `current_value`s (featurizers such as `CurrentAttrFeaturizer` or `CurrentXFeaturizer`) can take advantage of the updated current values
- `InitSimFeaturizer` where it wasn't computing the similarity metrics correctly between the `init_value` and values in the domain
- `NULL` values in `NullDetector`
- `current_value` is initialized with the value from `init_values` with the highest sum of co-occurrence probabilities with the other `init_values` in the tuple

I've tested this with 3 iterations on the hospital dataset. On the second iteration we see an improvement in recall (with a slight hit to precision) due to the increased number of repairs made. It seems to converge after the 2nd iteration.
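The initialization rule described above (pick the init value with the highest sum of co-occurrence probabilities with the other init values in the tuple) can be sketched roughly as follows. The function name `select_current_value`, the dict-based tuple layout, and the `(attr, value, other_attr, other_value)` key shape for the probability table are all assumptions made for this illustration; the PR's actual data structures may differ.

```python
def select_current_value(attr, init_values_by_attr, cooccur_prob):
    """Pick the init value of `attr` that best co-occurs with the rest of the tuple.

    init_values_by_attr: dict mapping attribute -> list of init values (hypothetical layout).
    cooccur_prob: dict mapping (attr, value, other_attr, other_value) -> probability;
                  missing pairs are treated as probability 0.0.
    """
    candidates = init_values_by_attr[attr]
    if not candidates:
        return None

    def score(candidate):
        # Sum co-occurrence probabilities against every init value
        # of every other attribute in the same tuple.
        total = 0.0
        for other_attr, other_values in init_values_by_attr.items():
            if other_attr == attr:
                continue
            for other_value in other_values:
                total += cooccur_prob.get((attr, candidate, other_attr, other_value), 0.0)
        return total

    # max() keeps the first candidate on ties.
    return max(candidates, key=score)


row = {"city": ["Birminghm", "Birmingham"], "state": ["AL"], "zip": ["35233"]}
probs = {
    ("city", "Birmingham", "state", "AL"): 0.6,
    ("city", "Birmingham", "zip", "35233"): 0.9,
    ("city", "Birminghm", "state", "AL"): 0.01,
    ("city", "Birminghm", "zip", "35233"): 0.05,
}
best_city = select_current_value("city", row, probs)
```

With these made-up probabilities, `Birmingham` scores 1.5 against 0.06 for `Birminghm`, so it becomes the current value.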
NB: this PR does not currently include the detection process in the EM iterations: this might be worth considering.
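As a rough illustration of the iteration scheme in this description, the loop below repeatedly overwrites current values with inferred ones and stops early once nothing changes. `run_em_iterations` and the `infer` callable are hypothetical stand-ins for the repair model; detection is left outside the loop, matching the note above.

```python
def run_em_iterations(current_values, infer, max_iterations=3):
    """Repeat inference, feeding inferred values back in as current values.

    current_values: dict mapping cell id -> current value (hypothetical layout).
    infer: callable taking the current values and returning a dict of
           cell id -> inferred value for the cells it chooses to repair.
    Returns the final values and the number of iterations that were run.
    """
    for iteration in range(1, max_iterations + 1):
        inferred = infer(current_values)
        updated = {**current_values, **inferred}
        if updated == current_values:
            # No cell changed in this pass: treat as converged.
            return current_values, iteration
        current_values = updated
    return current_values, max_iterations


# Toy stand-in for the repair model: always proposes the same single fix.
def toy_infer(values):
    return {"c1": "Birmingham"}


final_values, iterations_run = run_em_iterations(
    {"c1": "Birminghm", "c2": "AL"}, toy_infer
)
```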