Fix issue with alignments being discarded after retokenization #15
This PR primarily fixes the issue we observed with alignments being discarded whenever a user retokenizes either the source text or the target text. Now, if a user makes changes to either the source text or the target text, we do the following:
- Using the longest common subsequence (LCS) of the old and new token sequences, token `i` in the new text gets mapped to token `j` in the old text. If token `j` was involved in any alignments, then token `i` will now be involved in those same alignments. This will be true regardless of changes we make to other tokens.
- If (1) no token in the new text maps to token `j` in the original text, and (2) `j` was previously involved in an alignment, all alignments to/from that token will be removed.

The LCS computation is done using the standard dynamic programming solution, which is `O(|S1| x |S2|)`, where `S1` and `S2` are the original and new text sequences. My impression from testing these changes is that they result in behavior that is probably close to the most natural and labor-saving we can hope for, short of running an ML-based aligner in the app.

This PR also fixes an issue I discovered while testing, in which the interface errors if `srcPos` is set to an index that is deleted when the user edits the source text. Now, it tries to recompute `srcPos` based on the LCS. In the case where the user deletes the `srcPos` token in their edits, `srcPos` simply defaults to the last token in the new sequence.

Lastly, my editor made some non-functional formatting changes to some of the source files. I'm hoping that's okay and that we can just ignore them, but let me know if not.
I would appreciate some additional testing by someone else before these changes are merged.
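
For anyone testing, here is a minimal sketch of the behavior described above. It is written in Python purely for illustration and is not the code in this PR. The function names, the representation of alignments as a set of `(source_index, target_index)` pairs, and the example tokens are all assumptions made for the sketch.

```python
def lcs_mapping(old, new):
    """Return {new_index: old_index} for the tokens on one LCS of old and new."""
    n, m = len(old), len(new)
    # dp[a][b] holds the LCS length of old[a:] and new[b:]; O(n * m) time and space.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for a in range(n - 1, -1, -1):
        for b in range(m - 1, -1, -1):
            if old[a] == new[b]:
                dp[a][b] = dp[a + 1][b + 1] + 1
            else:
                dp[a][b] = max(dp[a + 1][b], dp[a][b + 1])
    # Walk the table from the start to recover which tokens were matched.
    mapping = {}
    a = b = 0
    while a < n and b < m:
        if old[a] == new[b]:
            mapping[b] = a
            a += 1
            b += 1
        elif dp[a + 1][b] >= dp[a][b + 1]:
            a += 1  # old[a] is not on this LCS (treated as deleted)
        else:
            b += 1  # new[b] is not on this LCS (treated as inserted)
    return mapping


def remap_source_alignments(alignments, new_to_old):
    """Carry (source, target) alignments over to the new source indices.

    Alignments whose old source token has no counterpart in the new text are dropped.
    """
    old_to_new = {j: i for i, j in new_to_old.items()}
    return {(old_to_new[s], t) for (s, t) in alignments if s in old_to_new}


def remap_src_pos(src_pos, new_to_old, new_len):
    """Follow src_pos through the mapping, or fall back to the last new token."""
    for i, j in new_to_old.items():
        if j == src_pos:
            return i
    return new_len - 1  # assumes the new sequence is non-empty


# Hypothetical example: a token is inserted before "cat".
old_tokens = ["the", "cat", "sat"]
new_tokens = ["the", "big", "cat", "sat"]
mapping = lcs_mapping(old_tokens, new_tokens)
alignments = {(1, 0), (2, 1)}  # (source index, target index) pairs
new_alignments = remap_source_alignments(alignments, mapping)
new_src_pos = remap_src_pos(1, mapping, len(new_tokens))
```

In this example, "cat" and "sat" keep their alignments at their shifted indices, and a `srcPos` that pointed at "cat" follows it to index 2. If "cat" were deleted instead, its alignment would be dropped and `srcPos` would fall back to the last token. The sketch does not handle an empty new sequence.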