|
For (1), I propose the following minor modifications:
For (2), I have concerns:
What do you think? Could you make the requested modifications in this PR? |
|
Thank you for checking my suggestions.
The modification for (1): sure, why not. There's no point in casting
For (2): sorry, I don't quite get what you mean, and I think you might have misunderstood what I had in mind.
I don't think I agree with your other points (but maybe I misunderstood something):
I'm not sure what you mean by the "decisions" in 1. Whether the number of columns (fields) that a module returns in
The only scenario where this should become a matter of decision is a module whose output does not feed into any further modules, i.e. which is strictly speaking a finalizer. The exact reason why I added the
I very much disagree with your remark in 3. that "Adding free format output would need a major redesign of the function". This is absolutely not the case. At the moment, almost entirely free-format output is already possible. If a) the output header can be suppressed, and b) the return value of
The output isn't really entirely free-format because of the newline (i.e. not the empty line) that is still added after the sentence in in 3938961. In addition, this
(You could still generate free-format output in principle if you really want to. The internal_app object would have to have an attribute that is initialized to the empty string, process_sentence would append its free-form output to this object attribute for each sentence, and would always return an empty list of tokens and suppress the newline. You would implement the
So the reason why I have added the
Finally, if my interpretation of your point 2. is correct, you effectively mean that instead of |
I propose adding two integrity checks to `tsvhandler.py` which would make debugging various errors in emtsv significantly easier.

1. Check whether a target field of the current module is already present in the input.

Duplicate column names in a module's output cause a problem in the next module downstream because they serve as keys in the `field_names` dict, cf. row 22: `field_names = {name: i for i, name in enumerate(fields)}`. Normally, after row 23 `field_names` contains exactly twice as many elements as the number of columns. This is not true if there are duplicate column names.

This problem might sound far-fetched, but it does happen, for example if you tried to run emStanza according to the earlier official instructions ("If emstanza-lem is run after emstanza-pos, those three columns are duplicated.").

Because of this, a module should not be allowed to output a column with the same name as its input. This is solved in rows 17 to 20.
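
The failure mode behind this first check can be reproduced in isolation. The following is only a minimal sketch, not the code in `tsvhandler.py`: the helper names `build_field_names` and `check_target_fields` are hypothetical, and the reverse (index to name) mapping is an assumption inferred from the "twice as many elements" remark above.

```python
def build_field_names(fields):
    """Map column names to indices and (assumed) indices back to names."""
    field_names = {name: i for i, name in enumerate(fields)}
    # Assumption: the next step also registers the index -> name direction,
    # which is why the dict normally has twice as many keys as columns.
    field_names.update({i: name for i, name in enumerate(fields)})
    return field_names


def check_target_fields(input_fields, target_fields):
    """Raise ValueError if a target field already exists among the input columns."""
    duplicates = sorted(set(input_fields) & set(target_fields))
    if duplicates:
        raise ValueError(f'Target fields already present in the input: {duplicates}')


unique = build_field_names(['form', 'lemma', 'xpostag'])
duplicated = build_field_names(['form', 'lemma', 'xpostag', 'lemma'])
print(len(unique))          # 6: twice the number of columns
print(len(duplicated))      # 7: one key short of 8, 'lemma' was overwritten
print(duplicated['lemma'])  # 3: the first 'lemma' column is no longer reachable by name
```

Rejecting the header up front, as `check_target_fields` does here, fails at the module that introduces the duplicate rather than at whichever downstream module first looks up the shadowed column.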
2. Check whether the output line for a token contains exactly the right number of columns.

If a module is buggy, it happens very often that some or all of its output tokens contain an incorrect number of columns, i.e. not exactly as many columns as the header. Ideally it is of course the modules' authors who should guarantee that their implementation of `process_sentence()` always outputs the correct number of columns, but this is easier said than done. For this reason, `tsvhandler.process()` should verify that every token line it yields contains exactly as many columns as it should, except when the module explicitly specifies that it wants to return free-format output.

Note that just checking whether the length of the list returned by `process_sentence()` is equal to the expected number of columns is not sufficient, since the elements of this list might contain tab characters, like in the case of the bug I have just referred to.

The changes in rows 72 to 86 in commit c13450c handle this issue. I have adjusted the tracking of line numbers so that in case of an incorrect number of columns xtsv will report the actual line number of the offending token, not the number of the empty line after the sentence that the offending token appears in, like it normally does. A side effect of this change is that when an exception is raised by `process_sentence()`, what will be reported by `process()` is not the line number of the end of the current sentence, but of the line at the end of the previous sentence. Since this is exactly as uninformative as the current exception handling messages, but at least provides an exact pointer to the problematic line for the `ValueError` raised when the column numbers are incorrect, this is still much better than leaving the tracking of lines as it is now.

The condition in lines 78-79 of commit 3938961 allows the module to skip this integrity check and return whatever it wants. This is meant for finalizers so that they can return free-format output.
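
To illustrate why the column count has to be taken from the joined line rather than from the length of the token list, here is a small sketch. It is not the code from the commits mentioned above; the function name `check_token_columns` and its parameters are hypothetical.

```python
def check_token_columns(token, expected_columns, line_number):
    """Join one token's fields with tabs and verify the resulting column count.

    Returns the joined line, or raises ValueError naming the offending line.
    """
    line = '\t'.join(token)
    # Count columns on the joined line: a stray tab inside a field adds a
    # column that len(token) alone would not reveal.
    actual_columns = len(line.split('\t')) if token else 0
    if actual_columns != expected_columns:
        raise ValueError(
            f'Line {line_number}: expected {expected_columns} columns, '
            f'got {actual_columns}: {line!r}'
        )
    return line


good = ['alma', 'alma', '[/N][Nom]']
bad = ['alma', 'alma\t[/N]', '[/N][Nom]']  # three list elements, four columns

print(check_token_columns(good, 3, line_number=2))
try:
    check_token_columns(bad, 3, line_number=3)
except ValueError as err:
    print(err)
```

Both `good` and `bad` have `len(token) == 3`, so a check based on the list length alone would accept the second token even though it produces a four-column line.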