[JENKINS-38223] Introduced FlowNode.isActive #45
Conversation
This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process, please see this explanation.
I'd like to review this before merge if that's okay, @jglick? (Holding a slot because I'm pretty heavily booked today and it might get merged before I can take a look at it.)

Sure. There are multiple plausible implementations for this, including the original unoptimized version, which seemed to be a critical performance problem; listening for the end node seemed the most straightforward optimization to me.
Downstream test failures, checking them. |
…d not recognize GraphListener as an extension point.
    return getExecution().isCurrentHead(this);
}

private static final Map<FlowExecution, Set<BlockStartNode>> unclosedBlocks = new WeakHashMap<>();
So, how will this behave if the master restarts? Normally the FlowExecution would be safe to use as a weak key because we only care about running flows, where a strong reference will be held. But won't this be unpopulated when we restart, meaning we can't refer to it for information on running flows?
and by 'information on running flows' I mean that this cache won't have the current unclosed blocks present I think?
Right, this case needs to be tested.
@Override public void onNewHead(FlowNode node) {
    if (node instanceof BlockStartNode || node instanceof BlockEndNode) {
        FlowExecution exec = node.getExecution();
        synchronized (unclosedBlocks) {
Ehhhh, any chance we could find a way to avoid all flows with BlockStartNodes and BlockEndNodes contending for this lock? Maybe make a cache per FlowExecution or something?
The work done inside the monitor is so trivial that contention is not likely to be an issue: HotSpot should just switch to a spin lock.
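The bookkeeping under discussion can be sketched in miniature. This is a toy model, not the plugin's code: plain strings stand in for the real FlowExecution and BlockStartNode types, and the class and method names are illustrative.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.WeakHashMap;

// Toy model of tracking unclosed blocks per execution. A weak key lets
// the entry vanish once the execution itself is no longer referenced.
class UnclosedBlocksSketch {
    private static final Map<String, Set<String>> unclosedBlocks = new WeakHashMap<>();

    // Called for every new head node. The work inside the monitor is a
    // single set add/remove, so contention on the shared lock is minimal.
    static void onNewHead(String execution, String nodeId, boolean isBlockStart, String matchingStartId) {
        synchronized (unclosedBlocks) {
            Set<String> blocks = unclosedBlocks.computeIfAbsent(execution, k -> new HashSet<>());
            if (isBlockStart) {
                blocks.add(nodeId);
            } else {
                blocks.remove(matchingStartId); // a BlockEndNode closes its start
            }
        }
    }

    static boolean isActive(String execution, String startId) {
        synchronized (unclosedBlocks) {
            Set<String> blocks = unclosedBlocks.get(execution);
            return blocks != null && blocks.contains(startId);
        }
    }
}
```

The lock is shared across all executions, which is the contention concern raised above; the counterargument is that each critical section is a single hash operation.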
private static final Map<FlowExecution, Set<BlockStartNode>> unclosedBlocks = new WeakHashMap<>();

@Restricted(DoNotUse.class)
@Extension public static class BlockListener implements GraphListener.Synchronous {
Overall this isn't a bad way to approach this. I'd considered the possibility of storing some information on nesting of blocks to facilitate other lookups (with a more complex structure though) but this approach will get the job done nicely.
        return blocks.contains((BlockStartNode) this);
    } else {
        // Need workflow-cps 2.33+ to use the optimization.
        LOGGER.log(Level.FINE, "falling back to old isActive implementation for {0}", this);
I like that you did this, a lot. The implementation is now in workflow-api, which is much more reasonable for this.
 * Unlike {@link #isRunning}, this behaves intuitively for a {@link BlockStartNode}:
 * it will be considered active until the {@link BlockEndNode} is added.
 */
public final boolean isActive() {

synchronized (unclosedBlocks) {
    Set<BlockStartNode> blocks = unclosedBlocks.get(exec);
    if (blocks != null) {
        return blocks.contains((BlockStartNode) this);
Note that while initially it would seem that this check fully covers the case where you have restarted the master with an in-progress flow and lost the map, there's a potential to be inside a long-running block (for example a parallel running automation)... and then begin another block, which populates the map... but now you're missing the block start for the parallel start.
Which would generate a false negative for the parallel start node being active... if my chain of logic holds up here, and it may not (bearing in mind it's presently 2300 hours).
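The restart gap described above can be made concrete with a toy model (names and method shapes are illustrative, not the plugin's API): the in-memory map is rebuilt only from events observed after the restart, so a block opened before the restart looks inactive.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy reproduction of the false negative: state that lived only in
// memory before the restart is gone, and only post-restart block-start
// events repopulate the map.
class RestartGapSketch {
    static boolean looksActiveAfterRestart(List<String> openedBeforeRestart,
                                           List<String> openedAfterRestart,
                                           String blockId) {
        Set<String> unclosed = new HashSet<>();
        // openedBeforeRestart is intentionally ignored here: that state
        // was lost with the old JVM.
        unclosed.addAll(openedAfterRestart);
        return unclosed.contains(blockId);
    }
}
```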
List<FlowNode> headNodes = exec.getCurrentHeads();
AbstractFlowScanner scanner = (headNodes.size() > 1) ? new DepthFirstScanner() : new LinearBlockHoppingScanner();
return scanner.findFirstMatch(headNodes, Predicates.equalTo(this)) != null;
}
Future enhancement if we do something like this: if we gotta scan the whole flow, fully populate the cache of the block structure along the way, so we never have to do that full scan again.
Blocking for now until the question about the persistence of the map of block structures is answered, since it looks like a bug. The excessive synchronization concern would be nice to address too, but it is not a full blocker. Likely both can be solved at once by the right data structure, or by moving the map of block starts somewhere specific to a FlowExecution.
jglick left a comment:
Needs test coverage particularly across restarts.
When running inside parallel, it misidentified block start nodes as active when they were not.
}
if (this instanceof BlockStartNode) {
    for (FlowNode headNode : currentHeads) {
        if (new LinearBlockHoppingScanner().findFirstMatch(headNode, Predicates.equalTo(this)) != null) {
Not as fast as the GraphListener-based approach I tried first, but still far faster than the original LogActionImpl.isRunning, and more correct in the face of parallel as well. With the core and workflow-durable-task-step patches plus this, the test case at n=1000 completes in ~9m; not as impressive as the ~4m with the cache of BlockStartNodes, but compare to the 33m baseline. As noted in JENKINS-45553, most of the calls to isActive would go away with fixes of JENKINS-38381 & JENKINS-36547, reducing pressure to optimize this further.
Would be nice to get more detail on how you're doing these performance tests. The improvements sound good, but I would like to know what specifically is being measured.
@jglick 🐛 If you use this implementation you're potentially scanning all the nodes before a parallel multiple times, remember? So if you have this:

    stage('some long stuff') {
        for (int i = 0; i < 1000; i++) {
            echo("say something for the $i-th time")
        }
    }
    parallel withNBranches

then you're scanning 'some long stuff' up to N times for N*1000+4 nodes. In this specific test case you're seeing faster build times, but you're optimizing just for that case.
Better solution: go back to the cache-like method, but make the cache persisted within the FlowExecution or something like that (and rebuild it if not present).
Long term we need to support this within either the FlowNodeStorage, or provide a lookup object with graph-structure information that we can use to accelerate lookups (AKA Bismuth Level 3, the tree-structured view of blocks).
Ref: jenkinsci/workflow-support-plugin#19 (comment) for the same class of problem cropping up again.
If we don't provide a comprehensive lookup for block info, the next best thing we can do is provide something that scans along nodes within a branch before going beyond the ParallelStartNode. Basically like a ForkScanner, but without enforcing iteration order so strictly. Might be easier now that we have trivial ways to check whether a node is a ParallelStartNode.
"So if you have this": your example is unclear. Is that stage before the parallel, or repeated in each branch? If the former, the contents of the long stage should be irrelevant, because for each active head we will quickly hit the StepEndNode for the stage and jump right over the long sequence of StepAtomNodes for the echos. If the latter, then yes, we will traverse ~1000n nodes per call to isActive. Not great, but probably no worse than the previous behavior.
I am not following why you are blocking this PR. Yes, there could perhaps be some more highly optimized version using caches, at the expense of more complex code. Persisting the cache is a bad idea (DRY). I considered recreating the cache on demand after load, which would work fine if you could assume a workflow-job recent enough that FlowExecutionListener is implemented. The tricky part was handling the case of an old workflow-job, just as my main error in the original PR was neglecting to handle the case of an old workflow-cps, and thus no GraphListener. The code here is at least simple enough to read and test, and in practice seems to perform much better than what was there before. Better to get it out there and make improvements later.
As noted previously, individual calls to isActive are actually pretty fast and further optimization may not really be needed—the main remaining issues seem to stem from the fact that there are far too many calls, for stupid reasons (copyLogs and a limited Task API).
"Would be nice to get more detail on how you're doing these tests for performance?" See here; just measuring total build time.
@jglick You need to ensure that at least the parallel is not generating abusive performance when there are a lot of steps preceding it, since this reverts the previous fix that avoided that. If that's solved, this is ready to go forward.
public abstract class FlowNode extends Actionable implements Saveable {
    private transient List<FlowNode> parents;
-   private List<String> parentIds;
+   private final List<String> parentIds;
</pluginRepositories>
<properties>
-   <jenkins.version>1.642.3</jenkins.version>
+   <jenkins.version>2.7.3</jenkins.version>
test against workflow-cps 2.33
// TODO this should probably also be _anime in case this is a step node with a body and the body is still running (try FlowGraphTable for example)
-   if (isRunning()) c = c.anime();
+   if (isActive()) {
+       c = c.anime();
Visible in flow graph table. You can now see at a glance which blocks are still running.
Possible optimization: implement

Yes @jglick, that's exactly along the lines I was thinking: it avoids one case of abusive performance with a long run of nodes followed by a dense parallel.
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-cps</artifactId>
-   <version>2.27</version>
+   <version>2.33</version>
Also tried FlowNodeTest with workflow-cps 2.32 (no GraphListener); workflow-job 2.9 (no FlowExecutionListener); and both together. Passed in all cases, albeit without optimization.
New version brought test project time down to under 4m, the best yet.
As noted in f58028a, the original implementation was not correct, so I am not going back to it. Theoretically there could be degraded performance in a weird case like

    [0..9999].each {
        echo 'what a dumb idea'
    }
    def branches = [:]
    [0..9999].each {x ->
        branches["b$x"] = {sleep 999}
    }
    parallel branches

and then restarted during the sleeps, since checking liveness of each branch
This is a really clever approach and should be high-performance in the face of large flow graphs. 👍 👍
Noted what I think is one problematic bug in an edge case. Mainly since this is performance-critical code, I'd like to see the hot path optimized more and see the Listener logic include some explanatory comments -- while you and I can follow it right now with all the context in memory, I'm skeptical that'll be true in the future.
In general though this is The Right Thing To Do (tm) and done well, just needs explanation and minor tuning.
 */
@Exported(name="running")
public final boolean isActive() {
    if (this instanceof FlowEndNode) { // cf. JENKINS-26139
});
}

private static void assertActiveSteps(WorkflowRun b, String... expected) {
    List<String> actual = new ArrayList<>();
🐜 Did you mean to omit the second type? I only ask because you made a point of adding it for the other ones.
The “second type”? This is -source 7, so I am using diamond inference.
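For reference, the "diamond" inference being alluded to works like this (a generic illustration, not the PR's test code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DiamondSketch {
    static List<String> collectIds() {
        // Since Java 7, <> infers <String> from the declared type, so
        // there is no "second type" to write; this is equivalent to
        // new ArrayList<String>().
        List<String> actual = new ArrayList<>();
        actual.addAll(Arrays.asList("3", "5"));
        return actual;
    }

    static Map<String, Boolean> emptyClosedMap() {
        // Works for nested generic types too.
        return new HashMap<>();
    }
}
```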
}

private static void assertActiveSteps(WorkflowRun b, String... expected) {
    List<String> actual = new ArrayList<>();
    for (FlowNode n : new FlowGraphWalker(b.getExecution())) {
🐜 DepthFirstScanner, not FlowGraphWalker ;)
Either works fine, and this is test code so it does not really matter.
@Rule public LoggerRule logging = new LoggerRule().record(FlowNode.class, Level.FINER);

@Issue("JENKINS-38223")
@Test public void isActive() {
This is a really well-written testcase. 👍 👍
Though it would be useful to have more test coverage going forward.
}

@Override public void onRunning(FlowExecution execution) {
    LOGGER.finer("FlowExecutionListener working");
    assert !startNodesAreClosedByFlow.containsKey(execution.getOwner());
Mixed feelings about the assert here -- would rather we log something than accept an AssertionError.
Again, assertions are for tests, not production.
final Map<FlowExecutionOwner, Map<String, Boolean>> startNodesAreClosedByFlow = new HashMap<>();

static Map<FlowExecutionOwner, Map<String, Boolean>> startNodesAreClosedByFlow() {
    FlowL flowL = ExtensionList.lookup(FlowExecutionListener.class).get(FlowL.class);
    return flowL != null ? flowL.startNodesAreClosedByFlow : /* ? */ new HashMap<FlowExecutionOwner, Map<String, Boolean>>();
A bit confused about the intent here?
🐛 I think this may generate incorrect results if we somehow call this method prematurely -- the map can be written to by the GraphListener I think and that write will be lost.
Would be very hard to test I think though -- also I think the ExtensionList lookup should ensure the extension is loaded.
Would prefer to perhaps throw an exception if extension not loaded.
If flowL == null? Should never happen unless Jenkins is basically trashed anyway (in which case we would see all sorts of errors earlier). But FindBugs forced me to put in the check.
LOGGER.finer("GraphListener working");
if (node instanceof BlockStartNode || node instanceof BlockEndNode) {
    Map<String, Boolean> startNodesAreClosed = FlowL.startNodesAreClosedByFlow().get(node.getExecution().getOwner());
    if (startNodesAreClosed != null) {
Mention this is safe to do because the FlowExecutionListener has ensured that the FlowExecution is only null if flow is done (otherwise always set for all running builds).
Not sure what you are saying. That what is safe to do?
It will be null if we have an old workflow-job, FWIW.
Map<String, Boolean> startNodesAreClosed = FlowL.startNodesAreClosedByFlow().get(node.getExecution().getOwner());
if (startNodesAreClosed != null) {
    if (node instanceof BlockStartNode) {
        assert !startNodesAreClosed.containsKey(node.getId());
Again would prefer not to risk AssertionErrors here -- instead log something. Best not to risk breaking builds in flight if something wacky happens with the maps.
Assertions will normally be disabled in production systems. These are here to make sure mistakes will be caught in tests.
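The point about asserts can be illustrated with a hypothetical helper (not from the PR): an assert only fires when the JVM runs with -ea, which test harnesses typically enable and production JVMs typically do not, whereas the log-and-continue style never throws.

```java
class AssertStyleSketch {
    // Contrasts the two styles discussed above: an assert documents an
    // invariant and fails loudly under -ea, while the logging style
    // never interrupts a build in flight.
    static String check(boolean invariantHolds, boolean useAssert) {
        if (useAssert) {
            assert invariantHolds : "node id unexpectedly already present";
            return "ok";
        }
        if (!invariantHolds) {
            return "logged: node id unexpectedly already present";
        }
        return "ok";
    }
}
```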
 * it will be considered active until the {@link BlockEndNode} is added.
 */
@Exported(name="running")
public final boolean isActive() {
Because this method is an extremely hot path and performance-critical for Pipeline, I would strongly prefer to remove all the logging and order the conditionals to put the common case first. If you want logging, have two versions of this method, one of which includes logging and one which doesn't, and use a static boolean to decide which version to invoke (toggleable via the script console).
Specifically: I think the most common case is a node either being in the current heads (active=true) or NOT being a block start/end node AND not being in the current heads (active=false). I think putting that conditional first may make the hottest path perform well.
(I'm aware there's all sorts of speculative branch prediction and whatnot happening at the hardware level, and maybe some reordering in the JIT compilation where allowed, but it can't hurt.)
Happy to take a crack at that ordering myself in an add-on PR if you like.
"remove all the logging": if you look more carefully you will see that all of the logging calls use a message format with arguments that require no method calls, so the cost of the statement comes down to:
- create a small array
- log checks that the logger is enabled; it is not, so returns
This is cheap enough to leave in.
"put the common case first": read more carefully. I am putting the cheap checks first. Calling CpsFlowExecution.getCurrentHeads requires construction of a list and a map traversal: pretty cheap, but more work than the cases being handled before it.
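The cost argument can be demonstrated with plain java.util.logging (a generic sketch, not the plugin code): with the parameterized {0} form, a disabled level means the call reduces to a level check plus one small array allocation, and the arguments are never formatted.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class LogCostSketch {
    private static final Logger LOGGER = Logger.getLogger("LogCostSketch");

    static int formatCalls = 0;

    // An argument whose toString() counts how often it is rendered.
    static final Object COUNTING_ARG = new Object() {
        @Override public String toString() {
            formatCalls++;
            return "node";
        }
    };

    static void logFiner() {
        // With the message-format style, {0} is only rendered if the
        // record is actually published; no eager string concatenation.
        LOGGER.log(Level.FINER, "falling back for {0}", new Object[] {COUNTING_ARG});
    }
}
```

Contrast with `LOGGER.finer("falling back for " + node)`, which pays for the concatenation (and any toString() calls) even when FINER is disabled.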
    LOGGER.log(Level.FINER, "quick closed={0}", closed);
    return !closed;
} else {
    LOGGER.log(Level.FINER, "no record of {0} in {1}, presumably GraphListener not working", new Object[] {this, exec});
Worth mentioning that this will detect cases where GraphListeners as extensions can't be invoked for all FlowNodes.
    Method m = hudson.util.ReflectionUtils.findMethod(CpsFlowExecution.class, "getListenersToRun", null);
    if (m != null) {
        return true;
    }
    return true;

(Not suggesting you use this code, just that I had a case where it was desirable to have it myself.)
Not sure what you are getting at here. Either workflow-cps is 2.33+ and it will be called on all FlowNodes; or it is 2.32- and it will never be called.
@reviewbybees done
JENKINS-38223 and also JENKINS-26139: a more useful API than isRunning for some purposes. @reviewbybees