[JEP-206] Define API for gathering command output in a local encoding #61
svanoort merged 15 commits into jenkinsci:master from
Conversation
private static final String COOKIE = "JENKINS_SERVER_COOKIE";

/**
 * Charset name to use for transcoding, or the empty string for node system default, or null for no transcoding.
Wait, empty string and null have different handling?
Uh..... 🐛 - if the behavior is different, that should be a custom string, e.g. 'DEFAULT' or 'PRESERVE', because otherwise this is going to trip us up somewhere.
It is internal only, but sure a constant could be introduced for clarity.
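Such a constant might look like the following sketch. This is illustrative only: the class and constant name `NODE_DEFAULT` are hypothetical, not part of the actual change, which keeps the field internal.

```java
// Illustrative sketch: a named sentinel instead of a bare "" for the
// "node system default" case. NODE_DEFAULT is a hypothetical name.
public class CharsetSentinels {
    /** Sentinel meaning "use the node's system default charset". */
    static final String NODE_DEFAULT = "";

    static String describe(String charsetName) {
        if (charsetName == null) {
            return "no transcoding";       // null: bytes pass through untouched
        } else if (charsetName.equals(NODE_DEFAULT)) {
            return "node system default";  // "": resolve the charset on the agent
        } else {
            return "transcode from " + charsetName;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(null));          // no transcoding
        System.out.println(describe(""));            // node system default
        System.out.println(describe("ISO-8859-1"));  // transcode from ISO-8859-1
    }
}
```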
if (cs.equals(StandardCharsets.UTF_8)) { // transcoding unnecessary as output was already UTF-8
    return null;
} else { // decode output in specified charset and reëncode in UTF-8
    return StandardCharsets.UTF_8.encode(cs.decode(ByteBuffer.wrap(data)));
🐛 We should do everything we can to allocate and reuse a buffer here -- i.e. create a transient byte[] / Buffer that is used for storing encoded output, rather than StandardCharsets.UTF_8.encode -- otherwise we're going to generate massive amounts of memory garbage.
Well, hardly “massive” as this is typically dwarfed by unrelated overhead related to launching and joining processes, but I can check if there is some premature optimization possible here if you care.
Separate paths for optimization - process launch/join has some fixed overheads and some scaling issues that are addressed by changes in the algorithm.
If we're injecting new APIs though we should ensure the API is friendly to efficient implementations -- coupling directly to newly-allocated-and-not-reused byte[] will bite us.
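One buffer-reusing alternative, sketched below under the assumption of streaming output (this is not the PR's code): push the bytes through an `InputStreamReader`/`OutputStreamWriter` pair with a single fixed-size `char[]`, so each chunk reuses the same buffer rather than allocating fresh `ByteBuffer`s per call.

```java
import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamingTranscoder {
    /** Transcode bytes from cs to UTF-8 using one fixed, reusable buffer. */
    static void transcode(InputStream in, OutputStream out, Charset cs) throws IOException {
        Reader reader = new InputStreamReader(in, cs);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buf = new char[8192]; // reused across chunks; no per-chunk garbage
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(latin1), utf8, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.toString("UTF-8")); // café
    }
}
```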
@Override public byte[] getOutput(FilePath workspace, Launcher launcher) throws IOException, InterruptedException {
    return getOutputFile(workspace).act(new MasterToSlaveFileCallable<byte[]>() {
        @Override public byte[] invoke(File f, VirtualChannel channel) throws IOException, InterruptedException {
            byte[] buf = FileUtils.readFileToByteArray(f);
🐛 Allocating and reading the whole file rather than doing streaming handling (or doing a chunk at a time with the buffer). This is a problem because it will generate tons of memory garbage and may create excessive memory demand spikes that impact stability.
Also, streaming APIs are just a more natural fit with our normal logging model if possible.
This is only for returnStdout: true, where we are anyway returning a String to the program. There is no reason to even attempt to stream anything.
+1 to @jglick. It would be reasonable if the operation supported any parameters (e.g. tail or so), but currently the change does not make the code worse
@@ -149,7 +175,12 @@ private static class WriteLog extends MasterToSlaveFileCallable<Long> {
// TODO is this efficient for large amounts of output? Would it be better to stream data, or return a byte[] from the callable?
I agree with this comment... this really is the place to stream it.
See the other set of PRs which deprecate this code path anyway.
svanoort
left a comment
Main issue: allocating and using large byte arrays rather than streaming data. This was bad when we were doing it once, but now we're doing it potentially multiple times. I have no objection if we use limited buffers/byte arrays to process a chunk at a time in a streaming fashion.
That happens in #60. Prior to that, the current output processing mode of
So to summarize, if this is incorporated into #62, which performs streaming transcoding where necessary, then memory buffers are allocated only in case you use the relatively uncommon mode
oleg-nenashev
left a comment
LGTM, excepting the test code. Some refactoring would be helpful there.
🐝 anyway
 * @param cs the character set in which process output is expected to be
 */
public void charset(@Nonnull Charset cs) {
    // by default, ignore
Add some FINE/DEBUG logging to indicate that the method is not implemented?
Could do that. In practice there are only three implementations anyway, all in this plugin.
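The suggested FINE logging could look something like this sketch; the class name, logger, and message format are hypothetical, not the plugin's actual code.

```java
import java.nio.charset.Charset;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of FINE logging in the no-op default implementation.
public class ControllerSketch {
    private static final Logger LOGGER = Logger.getLogger(ControllerSketch.class.getName());

    /** No-op by default; implementations that support transcoding override this. */
    public void charset(Charset cs) {
        LOGGER.log(Level.FINE, "charset({0}) not implemented by {1}",
                new Object[] {cs.name(), getClass().getName()});
    }

    public static void main(String[] args) {
        // Logs at FINE (invisible unless the logger level is lowered), then returns.
        new ControllerSketch().charset(Charset.forName("ISO-8859-1"));
    }
}
```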
 * If not called, no translation is performed.
 */
public void defaultCharset() {
    // by default, ignore
}

@Issue("JENKINS-31096")
@Test public void encoding() throws Exception {
This test is unreadable IMHO. Would it be possible to split it into several tests, or at least explicitly indicate test stages? The assert logic could also be refactored into a separate method.
Yes, it could be refactored to use a helper assertion method with parameters.
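A possible shape for such a helper, sketched without the Jenkins test harness (the real test drives a DurableTask on a Docker agent; the class and method names here are hypothetical): decode the raw process output in a given charset and compare it against the expected text, so each stage of the big test becomes one self-describing call.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingAssertions {
    /** Decode raw output in the given charset and assert it matches the expected text. */
    static void assertOutput(byte[] rawOutput, Charset cs, String expected) {
        String actual = new String(rawOutput, cs);
        if (!actual.equals(expected)) {
            throw new AssertionError("expected <" + expected + "> but was <" + actual + "> using " + cs.name());
        }
    }

    public static void main(String[] args) {
        // Each test stage becomes one parameterized call:
        assertOutput("caf\u00e9".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1, "caf\u00e9");
        assertOutput("caf\u00e9".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8, "caf\u00e9");
        System.out.println("ok");
    }
}
```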
@jglick Will
Well, with a combined total of <200 installations, and probably those plugins are mistakes that should be deleted, but yes.
Sure.
@Issue("JENKINS-31096")
@Test public void encoding() throws Exception {
    JavaContainer container = dockerUbuntu.get();
    DumbSlave s = new DumbSlave("docker", "/home/test", new SSHLauncher(container.ipBound(22), container.port(22), "test", "test", "", "-Dfile.encoding=ISO-8859-1"));
Note that this test can run without jenkinsci/docker-fixtures#19 since it is only concerned with file contents, not name. (Passing -Dsun.jnu.encoding=… to the SSH launcher does not seem to work for that.)
In jenkinsci/jep#128 (comment) @svanoort asked what it would take to ensure that newly produced non-ASCII content from durable tasks is correctly transcoded into builds which were started using a non-UTF-8 character set but continued running across the upgrade. AFAICT
svanoort
left a comment
As far as I can tell this is fine -- I just need to double-check the interaction with the CountingOutputStream and the transcoding and then it'll be blessed.
Although I know we're going to have some blowback from the incompatible aspects of this change no matter what we do (so I want to release after the CountingOutputStream fix).
if (transcoded == null) {
    sink.write(buf);
} else {
    Channels.newChannel(sink).write(transcoded);
This may be writing a number of bytes to CountingOutputStream which is different than toRead (when the full write succeeds). Probably the transcoding needs to be done on the master side, which will be a little trickier since any defaultCharset call needs to happen on the agent side.
Note that the bug, if there is one, is rendered obsolete by #62 which renders this dead code.
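The count mismatch is easy to demonstrate in isolation: transcoding a 1-byte encoding to UTF-8 changes the byte count whenever non-ASCII characters are present, so the bytes written to the sink differ from the bytes read from the log file. A minimal standalone demo:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteCountMismatch {
    public static void main(String[] args) {
        // "café" is 4 bytes in ISO-8859-1 but 5 bytes once transcoded to UTF-8.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteBuffer utf8 = StandardCharsets.UTF_8.encode(
                StandardCharsets.ISO_8859_1.decode(ByteBuffer.wrap(raw)));
        System.out.println(raw.length);       // 4: bytes read (toRead)
        System.out.println(utf8.remaining()); // 5: bytes actually written to the sink
    }
}
```

So any offset bookkeeping keyed to `toRead` drifts by one byte per non-ASCII character, which matches the "some characters missing" symptom reproduced later in the thread.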
svanoort
left a comment
I think we do indeed need to handle differences in byte counts based on transcoding or we may have some odd and complex-to-solve bugs here.
Thanks to Jesse for confirming my understanding there is correct.
I suspect so, but plan to try to reproduce such a problem in a test to confirm.
…transcoding must be done on the master side.
Reproduced the suspected problem (some characters missing from output when doing transcoding) in the test and fixed it. I now suspect that #62 suffers from an analogous bug, but that is at least further down the road.
svanoort
left a comment
@jglick Correct me if I'm wrong here, but if we do the transcoding master-side, won't this run into issues with external log storage that expects all content in UTF-8?
I may be misinterpreting something here (reviewing these already takes longer than it ought to due to tracing all the transformations) but I think we can resolve this in the following way:
- Keep the CountingOutputStream on the remote and still report its byte counts
- Do the transcoding on the remote agent (wrapping or wrapped by the CountingOutputStream) and still send UTF-8 back.
Alternately, we can keep a count of remote byte offsets + local character or byte offset after transcoding.
Encodings are hard.
No, because as mentioned before (#61 (comment)), the affected code is no longer used (a deprecated code path) with external log storage.
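For reference, the option of counting raw bytes on the agent while shipping UTF-8 downstream could be sketched as below. This is a hypothetical illustration, not the plugin's code: the count tracks offsets into the raw agent-side log file, while the downstream consumer receives UTF-8.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class CountingTranscode {
    /** Counts bytes read from the raw (agent-side) stream. */
    static class CountingInputStream extends FilterInputStream {
        long count;
        CountingInputStream(InputStream in) { super(in); }
        @Override public int read() throws IOException {
            int b = super.read();
            if (b != -1) count++;
            return b;
        }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            int n = super.read(b, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        // "café olé" is 8 bytes in ISO-8859-1 but 10 bytes in UTF-8.
        byte[] raw = "caf\u00e9 ol\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        CountingInputStream counted = new CountingInputStream(new ByteArrayInputStream(raw));
        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        Reader r = new InputStreamReader(counted, StandardCharsets.ISO_8859_1);
        Writer w = new OutputStreamWriter(utf8, StandardCharsets.UTF_8);
        char[] buf = new char[4096];
        for (int n; (n = r.read(buf)) != -1; ) {
            w.write(buf, 0, n);
        }
        w.flush();
        System.out.println(counted.count); // 8: offset into the raw agent log
        System.out.println(utf8.size());   // 10: UTF-8 bytes actually shipped
    }
}
```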
svanoort
left a comment
I think it's okay after taking a look at that, but honestly don't have 100% confidence because there's such a maze of code changes in different places now and so many different plugin combinations possible.
On the plus side, we can trust that if this results in regressions, @jglick will make it his highest priority to resolve them.
@jglick @oleg-nenashev You guys absolutely MUST have test cases for character encoding situations around external log storage though if you don't already (enough to exercise different UTF-8, UTF-16, and 1-byte encodings well enough to generate Mojibake if something isn't kosher in encoding handling). Hopefully you already have that though.
The interesting logic is already covered by tests here; the external logging systems just assume they are getting UTF-8, and keep it that way. Anyway, as we develop “live” tests for such implementations we can include some sanity checks for encoding handling. Certainly it is a far lower priority since any problems would not be regression risks.
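The kind of mojibake such a sanity check would catch is easy to produce: decode UTF-8 bytes with the wrong (1-byte) charset and the multi-byte sequences split apart. A minimal demo:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8); // 5 bytes
        // Misinterpreting UTF-8 bytes as ISO-8859-1 yields the classic mojibake:
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // cafÃ©
    }
}
```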
JEP-206
Extracted from #29. Downstream of #78.