[add] integrity checks for columns #8

Open
gpetho wants to merge 2 commits into nytud:master from gpetho:integrity

Conversation

gpetho (Contributor) commented Jan 18, 2022

I propose adding two integrity checks to tsvhandler.py which would make debugging various errors in emtsv significantly easier.

  1. Check whether a target field of the current module is already present in the input.
    Duplicate column names in a module's output cause a problem in the next module downstream because they serve as keys in the field_names dict, cf. row 22: field_names = {name: i for i, name in enumerate(fields)}. Normally, after row 23 field_names contains exactly twice as many elements as the number of columns. This is not true if there are duplicate column names.
    This problem might sound far-fetched, but it does happen, for example if you tried to run emStanza according to the earlier official instructions ("If emstanza-lem is run after emstanza-pos, those three columns are duplicated.").
    Because of this, a module should not be allowed to output a column with the same name as its input.
    This is solved in rows 17 to 20.

  2. Check whether the output line for a token contains exactly the right number of columns.
    If a module is buggy, it happens very often that some or all of its output tokens contain an incorrect number of columns, i.e. not exactly as many columns as the header. Ideally it is of course the modules' authors who should guarantee that their implementation of process_sentence() always outputs the correct number of columns, but this is easier said than done. For this reason, tsvhandler.process() should check every token line that it yields to verify that it contains exactly as many columns as it should, except when the module explicitly specifies that it wants to return free-format output.
    Note that just checking whether the length of the list returned by process_sentence() is equal to the expected number of columns is not sufficient, since the elements of this list might themselves contain tab characters, as in the case of the bug I have just referred to.
    The changes in rows 72 to 86 in commit c13450c handle this issue. I have adjusted the tracking of line numbers so that in case of an incorrect number of columns xtsv will report the actual line number of the offending token, not the number of the empty line after the sentence that the offending token appears in, like it normally does. A side effect of this change is that when an exception is raised by process_sentence(), what will be reported by process() is not the line number of the end of the current sentence, but of the line at the end of the previous sentence. Since this is exactly as uninformative as the current exception handling messages, but at least provides an exact pointer to the problematic line for the ValueError raised when the column numbers are incorrect, this is still much better than leaving the tracking of lines as it is now.
    The condition in lines 78-79 of commit 3938961 allows the module to skip this integrity check and return whatever it wants. This is meant for finalizers so that they can return free-format output.
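The field_names collapse described in point 1 can be reproduced in isolation. A minimal sketch, assuming the bidirectional mapping is built the way the comment describes (the column names are made up):

```python
# Rebuild the bidirectional field_names mapping the way the comment
# describes ("field_names = {name: i for i, name in enumerate(fields)}"
# plus the reverse entries); the column names are made up.
def make_field_names(fields):
    field_names = {name: i for i, name in enumerate(fields)}
    field_names.update({i: name for i, name in enumerate(fields)})
    return field_names

# Unique column names: twice as many entries as columns, as expected.
assert len(make_field_names(['form', 'lemma', 'pos'])) == 2 * 3

# A duplicated column name silently overwrites its earlier key,
# breaking the invariant the downstream module relies on:
assert len(make_field_names(['form', 'lemma', 'lemma'])) == 5  # not 6
```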

dlazesz (Collaborator) commented Jan 25, 2022

For (1), I propose the following minor modifications:

```python
def process_header(fields, source_fields, target_fields, track_stream):
    fields_set = set(fields)
    if not source_fields.issubset(fields_set):
        raise HeaderError('Input ({0}) does not have the required field names ({1}). '
                          'The following field names found: {2}'.
                          format(track_stream['file_name'], sorted(source_fields), fields))
    duplicate_fields = fields_set.intersection(target_fields)
    if len(duplicate_fields) > 0:
        raise HeaderError('Input ({0}) already contains one or more target field names ({1}).'.
                          format(track_stream['file_name'], sorted(duplicate_fields)))
```

For (2) I have concerns:

  1. There are a number of decisions which can and should be made once, before processing the input, and reused afterwards:
    • getattr(internal_app, 'free_format_output', False)
    • len(field_names) // 2
  2. I would prefer a way to disable the extra checking (disabled by default, enabled when there are concerns), e.g. an extra parameter for process()
  3. IMO, adding free-format output would need a major redesign of the function, and it is a separate issue. So I would extract it from this commit and do it separately. It would be better to add a new static property after the redesign.
  4. I do not like the idea of joining something just to split it again. I would count instead: joined_tok.count('\t') == len(field_names) // 2
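Point 1 might look roughly like the sketch below; InternalApp and field_names are hypothetical stand-ins for the objects available inside tsvhandler.process():

```python
# Hypothetical stand-ins for objects available inside tsvhandler.process():
class InternalApp:  # a module that does not define the opt-out attribute
    pass

internal_app = InternalApp()
field_names = {'form': 0, 0: 'form', 'lemma': 1, 1: 'lemma'}

# Decide once, before iterating over the sentences, and reuse per line:
free_format = getattr(internal_app, 'free_format_output', False)
expected_field_count = len(field_names) // 2

assert free_format is False
assert expected_field_count == 2
```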

What do you think? Could you make the requested modifications in this PR?

gpetho (Contributor, Author) commented Jan 26, 2022

Thank you for checking my suggestions.

The modification for (1): sure, why not. There's no point in casting fields to a set twice, so this makes perfect sense.

For (2): sorry, I don't quite get what you mean, and I think you might have misunderstood what I had in mind.

  1. is absolutely no problem; sure, if you prefer to put it this way, I really don't mind. You are off by one in 4. though: it should be tok.count('\t') == len(field_names) // 2 - 1 if you're counting tabs instead of columns.
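A quick sanity check of the corrected formula (the column names and values are illustrative):

```python
# field_names maps both name -> index and index -> name, so it has twice
# as many entries as there are columns (here: 3 columns, names made up).
field_names = {'form': 0, 0: 'form', 'lemma': 1, 1: 'lemma', 'pos': 2, 2: 'pos'}

joined_tok = 'alma\talma\tNOUN'  # a well-formed 3-column token line
# n columns are separated by n - 1 tabs, hence the "- 1":
assert joined_tok.count('\t') == len(field_names) // 2 - 1
# The formula without the "- 1" would reject this valid line:
assert joined_tok.count('\t') != len(field_names) // 2
```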

I don't think I agree with your other points (but maybe I misunderstood something):

I'm not sure what you mean by the "decisions" in 1. Whether the number of columns (fields) that a module returns in process_sentence() is equal to len(field_names) // 2 should not really be a matter of decision (unless you mean the decision whether you are writing an internal module or a finaliser). Since field_names contains exactly the field names in the input plus those added by the module (i.e. exactly the names in the output file's tsv header) plus the list indices of these field names, and we know that the field names are necessarily unique (which is guaranteed by the check proposed in (1)), the number of columns in the return value of process_sentence() is necessarily len(field_names) // 2. Otherwise the output would not be a valid tsv file, as some rows would contain either more or fewer columns than the header.

The only scenario where this should become a matter of decision is a module the output of which does not feed into any further modules, i.e. which is strictly speaking a finaliser. The exact reason why I added the free_format_output attribute for the internal_app object is to allow a module that is always run at the end of a pipeline to disable the checking, normally because it's a finaliser that wants to return free format output in the strict sense of the word.

I very much disagree with your remark in 3. that "Adding free format output would need a major redesign of the function". This is absolutely not the case; almost entirely free-format output is already possible. If a) the output header can be suppressed, b) the return value of process_sentence() can be any string whatsoever (instead of a list), i.e. it can contain any number of tab-separated columns (including one), and c) the newlines after each sentence can be suppressed, then you have free-format output; there is nothing more to it. All three prerequisites are currently met: a) there's already an internal_app attribute for suppressing the header, b) xtsv does not enforce the number of output columns, so process_sentence() can easily return something like [['This is the processed sentence.']], i.e. a single "token" containing a one-element list, which is "joined" by process() to a single "column", and c) you have just merged the option to suppress the newlines after sentences using the add_newline_after_sentence attribute (thank you).

The output isn't really entirely free-format because of the newline (i.e. not the empty line) that is still added after the sentence in '{0}\n'.format('\t'.join(tok)). So in the output the return value of each call of process_sentence() appears on its own line. This is why I added

```python
if getattr(internal_app, 'free_format_output', False):
    yield tokens
```

in 3938961. process_sentence() returns just the string 'This is the processed sentence.'; no joining, adding tabs and newlines, etc. is necessary, end of story. This return value is simply yielded to build_pipeline(), which in turn returns it, and it is written to the output, assuming that it's the final module of the pipeline.

In addition, this if condition skips the column number check. As I have explained, currently tsvhandler.process() can already yield free-format output (apart from the added newline) if that's what process_sentence() happens to return. Now if you were to add the checking of the number of columns in the output without an option to disable this check, then this situation would change, as this policy would make it unreasonably difficult to generate free-format output from a module. If the module's input had, say, 3 fields, then [['This is the processed sentence.']] would now be an invalid return value, since the token would have to have 3 elements (since obviously no target field is added), so it would have to look like this: [['This is the processed sentence.', '', '']], and not just a newline would be added at the end of the sentence, but a sequence of tabs as well.
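Concretely, with a hypothetical 3-field input the forced padding would look like this:

```python
# With 3 input fields and no target fields added, every token row must
# have 3 elements, so free-format text has to be padded (values invented):
expected_field_count = 3
free_text_token = ['This is the processed sentence.', '', '']
assert len(free_text_token) == expected_field_count
# Joined for output, trailing tabs end up after the text:
assert '\t'.join(free_text_token) == 'This is the processed sentence.\t\t'
```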

(You could still generate free-format output in principle if you really want to. The internal_app object would have to have an attribute that is initialized to the empty string, process_sentence() would append its free-form output to this object attribute for each sentence, and would always return an empty list of tokens and suppress the newline. You would implement the final_output() method to return this aggregated output; if it exists, this method is always called by tsvhandler.process() when the input stream ends. This is an awful hack though.)

So the reason why I have added the free_format_output condition is to make it possible to disable the checking of number of columns and return any output you want for each input sentence. This is not just a theoretical idea. I wrote an emtsv finaliser module last week that uses the proposed functionality to return free-form output, so it clearly works.
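A minimal finaliser along these lines might look like the sketch below. Only the free_format_output attribute and the process_sentence() interface come from this discussion; the class name, the exact method signature and the sample data are invented for illustration:

```python
# Sketch of a finaliser that opts out of the column-count check.
class PlainTextFinaliser:
    free_format_output = True  # tells tsvhandler.process() to yield as-is

    def process_sentence(self, sen, field_values):
        # Return an arbitrary string instead of a list of token rows:
        words = [tok[0] for tok in sen]
        return 'This is the processed sentence: {0}\n'.format(' '.join(words))

finaliser = PlainTextFinaliser()
out = finaliser.process_sentence([['Hello'], ['world']], None)
assert getattr(finaliser, 'free_format_output', False) is True
assert out == 'This is the processed sentence: Hello world\n'
```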

Finally, if my interpretation of your point 2. is correct, you effectively mean that instead of getattr(internal_app, 'free_format_output', False), you would prefer getattr(internal_app, 'free_format_output', True), since this is what would disable the extra checking by default. I strongly disagree, since this would defeat the whole idea of enforcing the validity of an xtsv module's output. I can only quote what I wrote earlier: "Ideally it is of course the modules' authors who should guarantee that their implementation of process_sentence() always outputs the correct number of columns, but this is easier said than done." Invalid output is usually not intentional but the result of some bug, so it should be intercepted right where it's generated in order to make it reasonably easy to debug, and should not be allowed to propagate until it eventually breaks the pipeline two modules downstream, as in the example I referred to above. Whether this is done as an output check like I proposed, i.e. verifying whether the output of a module contains the correct number of columns, or whether it is done instead as an input check in the next module is something that might be worth considering. But in my opinion it's crucial for correct inter-module communication, which is the very point of xtsv, that the integrity of the tsv data format is guaranteed one way or the other.

sassbalint (Contributor) commented:

@dlazesz @gpetho
Balázs and Gergő, I think we should organize a meeting for the three of us to resolve the above concerns.
