[JEP-206] Define API for gathering command output in a local encoding #61
svanoort merged 15 commits into jenkinsci:master from
Conversation
private static final String COOKIE = "JENKINS_SERVER_COOKIE";

/**
 * Charset name to use for transcoding, or the empty string for node system default, or null for no transcoding.
Wait, empty string and null have different handling?
Uh..... 🐛 - if the behavior is different, that should be a custom string, e.g. 'DEFAULT' or 'PRESERVE', because otherwise this is going to trip us up somewhere.
It is internal only, but sure a constant could be introduced for clarity.
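Such a constant might look like the following sketch. This is illustrative only: the class and constant name `NODE_DEFAULT` are hypothetical, not part of the actual change, which keeps the field internal.

```java
// Illustrative sketch: a named sentinel instead of a bare "" for the
// "node system default" case. NODE_DEFAULT is a hypothetical name.
public class CharsetSentinels {
    /** Sentinel meaning "use the node's system default charset". */
    static final String NODE_DEFAULT = "";

    static String describe(String charsetName) {
        if (charsetName == null) {
            return "no transcoding";       // null: bytes pass through untouched
        } else if (charsetName.equals(NODE_DEFAULT)) {
            return "node system default";  // "": resolve the charset on the agent
        } else {
            return "transcode from " + charsetName;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(null));          // no transcoding
        System.out.println(describe(""));            // node system default
        System.out.println(describe("ISO-8859-1"));  // transcode from ISO-8859-1
    }
}
```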
if (cs.equals(StandardCharsets.UTF_8)) { // transcoding unnecessary as output was already UTF-8
    return null;
} else { // decode output in specified charset and reëncode in UTF-8
    return StandardCharsets.UTF_8.encode(cs.decode(ByteBuffer.wrap(data)));
🐛 We should do everything we can to allocate and reuse a buffer here -- i.e. create a transient byte[] / Buffer that is used for storing encoded output, rather than StandardCharsets.UTF_8.encode -- otherwise we're going to generate massive amounts of memory garbage.
Well, hardly “massive” as this is typically dwarfed by unrelated overhead related to launching and joining processes, but I can check if there is some premature optimization possible here if you care.
Separate paths for optimization - process launch/join has some fixed overheads and some scaling issues that are addressed by changes in the algorithm.
If we're injecting new APIs though we should ensure the API is friendly to efficient implementations -- coupling directly to newly-allocated-and-not-reused byte[] will bite us.
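One buffer-reusing alternative, sketched below under the assumption of streaming output (this is not the PR's code): push the bytes through an `InputStreamReader`/`OutputStreamWriter` pair with a single fixed-size `char[]`, so each chunk reuses the same buffer rather than allocating fresh `ByteBuffer`s per call.

```java
import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamingTranscoder {
    /** Transcode bytes from cs to UTF-8 using one fixed, reusable buffer. */
    static void transcode(InputStream in, OutputStream out, Charset cs) throws IOException {
        Reader reader = new InputStreamReader(in, cs);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buf = new char[8192]; // reused across chunks; no per-chunk garbage
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(latin1), utf8, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.toString("UTF-8")); // café
    }
}
```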
@Override public byte[] getOutput(FilePath workspace, Launcher launcher) throws IOException, InterruptedException {
    return getOutputFile(workspace).act(new MasterToSlaveFileCallable<byte[]>() {
        @Override public byte[] invoke(File f, VirtualChannel channel) throws IOException, InterruptedException {
            byte[] buf = FileUtils.readFileToByteArray(f);
🐛 Allocating and reading the whole file rather than doing streaming handling (or doing a chunk at a time with the buffer). This is a problem because it will generate tons of memory garbage and may create excessive memory demand spikes that impact stability.
Also, streaming APIs are just a more natural fit with our normal logging model if possible.
This is only for returnStdout: true, where we are anyway returning a String to the program. There is no reason to even attempt to stream anything.
+1 to @jglick. It would be reasonable if the operation supported any parameters (e.g. tail or so), but currently the change does not make the code worse
@@ -149,7 +175,12 @@ private static class WriteLog extends MasterToSlaveFileCallable<Long> {
// TODO is this efficient for large amounts of output? Would it be better to stream data, or return a byte[] from the callable?
I agree with this comment... this really is the place to stream it.
See the other set of PRs which deprecate this code path anyway.
svanoort
left a comment
Main issue: allocating and using large byte arrays rather than streaming data. This was bad when we were doing it once, but now we're doing it potentially multiple times. I have no objection if we use limited buffers/byte arrays to process a chunk at a time in a streaming fashion.
That happens in #60. Prior to that, the current output processing mode of
So to summarize, if this is incorporated into #62, which performs streaming transcoding where necessary, then memory buffers are allocated only in case you use the relatively uncommon mode
oleg-nenashev
left a comment
LGTM, excepting the test code. Some refactoring would be helpful there.
🐝 anyway
 * @param cs the character set in which process output is expected to be
 */
public void charset(@Nonnull Charset cs) {
    // by default, ignore
Add some FINE/DEBUG logging to indicate that the method is not implemented?
Could do that. In practice there are only three implementations anyway, all in this plugin.
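The suggested FINE logging could look something like this sketch; the class name, logger, and message format are hypothetical, not the plugin's actual code.

```java
import java.nio.charset.Charset;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of FINE logging in the no-op default implementation.
public class ControllerSketch {
    private static final Logger LOGGER = Logger.getLogger(ControllerSketch.class.getName());

    /** No-op by default; implementations that support transcoding override this. */
    public void charset(Charset cs) {
        LOGGER.log(Level.FINE, "charset({0}) not implemented by {1}",
                new Object[] {cs.name(), getClass().getName()});
    }

    public static void main(String[] args) {
        // Logs at FINE (invisible unless the logger level is lowered), then returns.
        new ControllerSketch().charset(Charset.forName("ISO-8859-1"));
    }
}
```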
 * If not called, no translation is performed.
 */
public void defaultCharset() {
    // by default, ignore
}

@Issue("JENKINS-31096")
@Test public void encoding() throws Exception {
This test is unreadable IMHO. Would it be possible to split it into several tests, or at least explicitly indicate test stages? The assert logic could also be refactored into a separate method.
Yes, it could be refactored to use a helper assertion method with parameters.
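A possible shape for such a helper, sketched without the Jenkins test harness (the real test drives a DurableTask on a Docker agent; the class and method names here are hypothetical): decode the raw process output in a given charset and compare it against the expected text, so each stage of the big test becomes one self-describing call.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingAssertions {
    /** Decode raw output in the given charset and assert it matches the expected text. */
    static void assertOutput(byte[] rawOutput, Charset cs, String expected) {
        String actual = new String(rawOutput, cs);
        if (!actual.equals(expected)) {
            throw new AssertionError("expected <" + expected + "> but was <" + actual + "> using " + cs.name());
        }
    }

    public static void main(String[] args) {
        // Each test stage becomes one parameterized call:
        assertOutput("caf\u00e9".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1, "caf\u00e9");
        assertOutput("caf\u00e9".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8, "caf\u00e9");
        System.out.println("ok");
    }
}
```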
@jglick Will
Well, with a combined total of <200 installations, and probably those plugins are mistakes that should be deleted, but yes.
Sure.
@Issue("JENKINS-31096")
@Test public void encoding() throws Exception {
    JavaContainer container = dockerUbuntu.get();
    DumbSlave s = new DumbSlave("docker", "/home/test", new SSHLauncher(container.ipBound(22), container.port(22), "test", "test", "", "-Dfile.encoding=ISO-8859-1"));
Note that this test can run without jenkinsci/docker-fixtures#19 since it is only concerned with file contents, not name. (Passing -Dsun.jnu.encoding=… to the SSH launcher does not seem to work for that.)
In jenkinsci/jep#128 (comment) @svanoort asked what it would take to ensure that newly produced non-ASCII content from durable tasks is correctly transcoded into builds which were started using a non-UTF-8 character set but continued running across the upgrade. AFAICT
svanoort
left a comment
As far as I can tell this is fine -- I just need to double-check the interaction with the CountingOutputStream and the transcoding and then it'll be blessed.
Although I know we're going to have some blowback from the incompatible aspects of this change no matter what we do (so I want to release after the CountingOutputStream fix).
if (transcoded == null) {
    sink.write(buf);
} else {
    Channels.newChannel(sink).write(transcoded);
This may be writing a number of bytes to CountingOutputStream which is different than toRead (when the full write succeeds). Probably the transcoding needs to be done on the master side, which will be a little trickier since any defaultCharset call needs to happen on the agent side.
Note that the bug, if there is one, is rendered obsolete by #62 which renders this dead code.
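The count mismatch is easy to demonstrate in isolation: transcoding a 1-byte encoding to UTF-8 changes the byte count whenever non-ASCII characters are present, so the bytes written to the sink differ from the bytes read from the log file. A minimal standalone demo:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteCountMismatch {
    public static void main(String[] args) {
        // "café" is 4 bytes in ISO-8859-1 but 5 bytes once transcoded to UTF-8.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteBuffer utf8 = StandardCharsets.UTF_8.encode(
                StandardCharsets.ISO_8859_1.decode(ByteBuffer.wrap(raw)));
        System.out.println(raw.length);       // 4: bytes read (toRead)
        System.out.println(utf8.remaining()); // 5: bytes actually written to the sink
    }
}
```

So any offset bookkeeping keyed to `toRead` drifts by one byte per non-ASCII character, which matches the "some characters missing" symptom reproduced later in the thread.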
svanoort
left a comment
I think we do indeed need to handle differences in byte counts based on transcoding or we may have some odd and complex-to-solve bugs here.
Thanks to Jesse for confirming my understanding there is correct.
I suspect so, but plan to try to reproduce such a problem in a test to confirm.
…transcoding must be done on the master side.
Reproduced the suspected problem (some characters missing from output when doing transcoding) in the test and fixed it. I now suspect that #62 suffers from an analogous bug, but that is at least further down the road.
svanoort
left a comment
@jglick Correct me if I'm wrong here, but if we do the transcoding master-side, won't this run into issues with external log storage that expects all content in UTF-8?
I may be misinterpreting something here (reviewing these already takes longer than it ought to due to tracing all the transformations) but I think we can resolve this in the following way:
- Keep the CountingOutputStream on the remote and still report its byte counts
- Do the transcoding on the remote agent (wrapping or wrapped by the CountingOutputStream) and still send UTF-8 back.
Alternately, we can keep a count of remote byte offsets + local character or byte offset after transcoding.
Encodings are hard.
No, because as mentioned before (#61 (comment)), the affected code is no longer used (a deprecated code path) with external log storage.
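For reference, the option of counting raw bytes on the agent while shipping UTF-8 downstream could be sketched as below. This is a hypothetical illustration, not the plugin's code: the count tracks offsets into the raw agent-side log file, while the downstream consumer receives UTF-8.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class CountingTranscode {
    /** Counts bytes read from the raw (agent-side) stream. */
    static class CountingInputStream extends FilterInputStream {
        long count;
        CountingInputStream(InputStream in) { super(in); }
        @Override public int read() throws IOException {
            int b = super.read();
            if (b != -1) count++;
            return b;
        }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            int n = super.read(b, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        // "café olé" is 8 bytes in ISO-8859-1 but 10 bytes in UTF-8.
        byte[] raw = "caf\u00e9 ol\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        CountingInputStream counted = new CountingInputStream(new ByteArrayInputStream(raw));
        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        Reader r = new InputStreamReader(counted, StandardCharsets.ISO_8859_1);
        Writer w = new OutputStreamWriter(utf8, StandardCharsets.UTF_8);
        char[] buf = new char[4096];
        for (int n; (n = r.read(buf)) != -1; ) {
            w.write(buf, 0, n);
        }
        w.flush();
        System.out.println(counted.count); // 8: offset into the raw agent log
        System.out.println(utf8.size());   // 10: UTF-8 bytes actually shipped
    }
}
```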
svanoort
left a comment
I think it's okay after taking a look at that, but honestly don't have 100% confidence because there's such a maze of code changes in different places now and so many different plugin combinations possible.
On the plus side, we can trust that if this results in regressions, @jglick will make it his highest priority to resolve them.
@jglick @oleg-nenashev You guys absolutely MUST have test cases for character encoding situations around external log storage though if you don't already (enough to exercise different UTF-8, UTF-16, and 1-byte encodings well enough to generate Mojibake if something isn't kosher in encoding handling). Hopefully you already have that though.
The interesting logic is already covered by tests here; the external logging systems just assume they are getting UTF-8, and keep it that way. Anyway, as we develop “live” tests for such implementations we can include some sanity checks for encoding handling. Certainly it is a far lower priority since any problems would not be regression risks.
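The kind of mojibake such a sanity check would catch is easy to produce: decode UTF-8 bytes with the wrong (1-byte) charset and the multi-byte sequences split apart. A minimal demo:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8); // 5 bytes
        // Misinterpreting UTF-8 bytes as ISO-8859-1 yields the classic mojibake:
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // cafÃ©
    }
}
```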
JEP-206
Extracted from #29. Downstream of #78.