Skip to content

Conversation

@wgantt
Copy link

@wgantt wgantt commented Mar 22, 2023

This PR primarily fixes the issue we observed with alignments being discarded whenever a user retokenizes either the source text or the target text. Now, if a user makes changes to either the source text or the target text, we do the following:

  1. Compute the longest common subsequence (LCS) between the original text and the new text. This will yield a mapping from tokens in the new text to tokens in the old text (assuming at least some of the original tokens are preserved).
  2. Suppose token i in the new text gets mapped to token j in the old text. If token j was involved in any alignments, then token i will now be involved in those same alignments. This will be true regardless of changes we make to other tokens.
  3. In cases where (1) the LCS cannot establish a mapping to a token j in the original text, and (2) j was previously involved in an alignment, all alignments to/from that token will be removed.

The LCS computation is done using the standard dynamic programming solution, which is O(|S1| x |S2|), where S1 and S2 are the original and new text sequences. My impression from testing these changes is that they result in behavior that is probably close to the most natural and labor-saving we can hope for, short of running an ML-based aligner in the app.

This PR also fixes an issue I discovered while testing in which the interface will error if the srcPos is set to an index that is deleted when the user edits the source text. Now, it tries to recompute srcPos based on the LCS. In the case where the user deletes the srcPos token in their edits, the srcPos simply defaults to the last token in the new sequence.

Lastly, my editor made some non-functional formatting changes to some of the source files. I'm hoping that's okay and that we can just ignore them, but let me know if not.

I would appreciate some additional testing by someone else before these changes are merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant