[JENKINS-38223] Introduced FlowNode.isActive #45
Conversation
This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process, please see this explanation.
I'd like to review this before merge if that's okay, @jglick? (Holding a slot because I'm pretty heavily booked today and it might get merged before I can take a look at it.)

Sure. There are multiple plausible implementations for this, including the original unoptimized version, which seemed to be a critical performance problem; listening for the end node seemed the most straightforward optimization to me.
Downstream test failures, checking them. |
…d not recognize GraphListener as an extension point.
    return getExecution().isCurrentHead(this);
}

private static final Map<FlowExecution, Set<BlockStartNode>> unclosedBlocks = new WeakHashMap<>();
So, how will this behave if the master restarts? Normally the FlowExecution would be safe to use as a weak key because we only care about running flows, where a strong reference will be held. But won't this be unpopulated when we restart, meaning we can't refer to it for information on running flows?
and by 'information on running flows' I mean that this cache won't have the current unclosed blocks present I think?
Right, this case needs to be tested.
@Override public void onNewHead(FlowNode node) {
    if (node instanceof BlockStartNode || node instanceof BlockEndNode) {
        FlowExecution exec = node.getExecution();
        synchronized (unclosedBlocks) {
Ehhhh, any chance we could find a way to avoid all flows with BlockStartNodes and BlockEndNodes contending for this lock? Maybe make a cache per FlowExecution or something?
The work done inside the monitor is so trivial that contention is not likely to be an issue: HotSpot should just switch to a spin lock.
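The bookkeeping under discussion can be sketched in miniature. This is a toy model, not the plugin's code: plain strings stand in for the real FlowExecution and BlockStartNode types, and the class and method names are illustrative.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.WeakHashMap;

// Toy model of tracking unclosed blocks per execution. A weak key lets
// the entry vanish once the execution itself is no longer referenced.
class UnclosedBlocksSketch {
    private static final Map<String, Set<String>> unclosedBlocks = new WeakHashMap<>();

    // Called for every new head node. The work inside the monitor is a
    // single set add/remove, so contention on the shared lock is minimal.
    static void onNewHead(String execution, String nodeId, boolean isBlockStart, String matchingStartId) {
        synchronized (unclosedBlocks) {
            Set<String> blocks = unclosedBlocks.computeIfAbsent(execution, k -> new HashSet<>());
            if (isBlockStart) {
                blocks.add(nodeId);
            } else {
                blocks.remove(matchingStartId); // a BlockEndNode closes its start
            }
        }
    }

    static boolean isActive(String execution, String startId) {
        synchronized (unclosedBlocks) {
            Set<String> blocks = unclosedBlocks.get(execution);
            return blocks != null && blocks.contains(startId);
        }
    }
}
```

The lock is shared across all executions, which is the contention concern raised above; the counterargument is that each critical section is a single hash operation.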
private static final Map<FlowExecution, Set<BlockStartNode>> unclosedBlocks = new WeakHashMap<>();

@Restricted(DoNotUse.class)
@Extension public static class BlockListener implements GraphListener.Synchronous {
Overall this isn't a bad way to approach this. I'd considered the possibility of storing some information on nesting of blocks to facilitate other lookups (with a more complex structure though) but this approach will get the job done nicely.
        return blocks.contains((BlockStartNode) this);
    } else {
        // Need workflow-cps 2.33+ to use the optimization.
        LOGGER.log(Level.FINE, "falling back to old isActive implementation for {0}", this);
I like that you did this, a lot. The implementation is now in workflow-api, which is much more reasonable for this.
 * Unlike {@link #isRunning}, this behaves intuitively for a {@link BlockStartNode}:
 * it will be considered active until the {@link BlockEndNode} is added.
 */
public final boolean isActive() {

synchronized (unclosedBlocks) {
    Set<BlockStartNode> blocks = unclosedBlocks.get(exec);
    if (blocks != null) {
        return blocks.contains((BlockStartNode) this);
Note that while initially it would seem that this check fully covers the case where you have restarted the master with an in-progress flow and lost the map, there's a potential to be inside a long-running block (for example a parallel running automation)... and then begin another block, which populates the map... but now you're missing the block start for the parallel start.
Which would generate a false negative for the parallel start node being active... if my chain of logic holds up here, and it may not (bearing in mind it's presently 2300 hours).
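The restart gap described above can be made concrete with a toy model (names and method shapes are illustrative, not the plugin's API): the in-memory map is rebuilt only from events observed after the restart, so a block opened before the restart looks inactive.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy reproduction of the false negative: state that lived only in
// memory before the restart is gone, and only post-restart block-start
// events repopulate the map.
class RestartGapSketch {
    static boolean looksActiveAfterRestart(List<String> openedBeforeRestart,
                                           List<String> openedAfterRestart,
                                           String blockId) {
        Set<String> unclosed = new HashSet<>();
        // openedBeforeRestart is intentionally ignored here: that state
        // was lost with the old JVM.
        unclosed.addAll(openedAfterRestart);
        return unclosed.contains(blockId);
    }
}
```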
List<FlowNode> headNodes = exec.getCurrentHeads();
AbstractFlowScanner scanner = (headNodes.size() > 1) ? new DepthFirstScanner() : new LinearBlockHoppingScanner();
return scanner.findFirstMatch(headNodes, Predicates.equalTo(this)) != null;
}
Future enhancement if we do something like this: if we gotta scan the whole flow, fully populate the cache of the block structure along the way, so we never have to do that full scan again.
Blocking for now until the question about the persistence of the map of block structures is answered, since it looks like a bug. The excessive synchronization concern would be nice to address too, but it is not a full blocker. Likely both can be solved at once by the right data structure, or by moving the map of block starts somewhere specific to a FlowExecution.
jglick left a comment:
Needs test coverage particularly across restarts.
When running inside parallel, it misidentified block start nodes as active when they were not.
}
if (this instanceof BlockStartNode) {
    for (FlowNode headNode : currentHeads) {
        if (new LinearBlockHoppingScanner().findFirstMatch(headNode, Predicates.equalTo(this)) != null) {
Not as fast as the GraphListener-based approach I tried first, but still far faster than the original LogActionImpl.isRunning, and more correct in the face of parallel as well. With the core and workflow-durable-task-step patches plus this, the test case at n=1000 completes in ~9m; not as impressive as the ~4m with the cache of BlockStartNodes, but compare to the 33m baseline. As noted in JENKINS-45553, most of the calls to isActive would go away with fixes of JENKINS-38381 & JENKINS-36547, reducing pressure to optimize this further.
Would be nice to get more detail on how you're doing these performance tests. The improvements sound good, but I would like to know what specifically is being measured.
@jglick 🐛 If you use this implementation you're potentially scanning all the nodes before a parallel multiple times, remember? So if you have this:

    stage('some long stuff') {
        for (int i = 0; i < 1000; i++) {
            echo("say something for the $i-th time")
        }
    }
    parallel withNBranches

then you're scanning 'some long stuff' up to N times for N*1000+4 nodes. In this specific test case you're seeing faster build times, but you're optimizing just for that case.
Better solution: go back to the cache-like method, but make the cache persisted within the FlowExecution or something like that (and rebuild it if not present).
Long term we need to support this within either the FlowNodeStorage, or provide a lookup object with graph-structure information that we can use to accelerate lookups (AKA Bismuth Level 3, the tree-structured view of blocks).
Ref: jenkinsci/workflow-support-plugin#19 (comment) for the same class of problem cropping up again.
If we don't provide a comprehensive lookup for block info, the next best thing we can do is provide something that scans along nodes within a branch before going beyond the ParallelStartNode. Basically like a ForkScanner, but without enforcing iteration order so strictly. Might be easier now that we have trivial ways to check whether a node is a ParallelStartNode.
"So if you have this": your example is unclear. Is that stage before the parallel, or repeated in each branch? If the former, the contents of the long stage should be irrelevant, because for each active head we will quickly hit the StepEndNode for the stage and jump right over the long sequence of StepAtomNodes for the echos. If the latter, then yes, we will traverse ~1000n nodes per call to isActive. Not great, but probably no worse than the previous behavior.
I am not following why you are blocking this PR. Yes, there could perhaps be some more highly optimized version using caches, at the expense of more complex code. Persisting the cache is a bad idea (DRY). I considered recreating the cache on demand after load, which would work fine if you could assume a workflow-job recent enough that FlowExecutionListener is implemented. The tricky part was handling the case of an old workflow-job, just as my main error in the original PR was neglecting to handle the case of an old workflow-cps, and thus no GraphListener. The code here is at least simple enough to read and test, and in practice seems to perform much better than what was there before. Better to get it out there and make improvements later.
As noted previously, individual calls to isActive are actually pretty fast and further optimization may not really be needed—the main remaining issues seem to stem from the fact that there are far too many calls, for stupid reasons (copyLogs and a limited Task API).
"Would be nice to get more detail on how you're doing these tests for performance?" See here; just measuring total build time.
@jglick You need to ensure that at least the parallel is not generating abusive performance when there are a lot of steps preceding it, since this reverts the previous fix that avoided that. If that's solved, this is ready to go forward.
public abstract class FlowNode extends Actionable implements Saveable {
    private transient List<FlowNode> parents;
-   private List<String> parentIds;
+   private final List<String> parentIds;
</pluginRepositories>
<properties>
-   <jenkins.version>1.642.3</jenkins.version>
+   <jenkins.version>2.7.3</jenkins.version>
test against workflow-cps 2.33
// TODO this should probably also be _anime in case this is a step node with a body and the body is still running (try FlowGraphTable for example)
-   if (isRunning()) c = c.anime();
+   if (isActive()) {
+       c = c.anime();
Visible in flow graph table. You can now see at a glance which blocks are still running.
Possible optimization: implement

Yes @jglick, that's exactly along the lines I was thinking: it avoids one case of abusive performance with a long run of nodes followed by a dense parallel.
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-cps</artifactId>
-   <version>2.27</version>
+   <version>2.33</version>
Also tried FlowNodeTest with workflow-cps 2.32 (no GraphListener); workflow-job 2.9 (no FlowExecutionListener); and both together. Passed in all cases, albeit without optimization.
New version brought test project time down to under 4m, the best yet.
As noted in f58028a, the original implementation was not correct, so I am not going back to it. Theoretically there could be degraded performance in a weird case like

    [0..9999].each {
        echo 'what a dumb idea'
    }
    def branches = [:]
    [0..9999].each {x ->
        branches["b$x"] = {sleep 999}
    }
    parallel branches

and then restarted during the sleeps, since checking liveness of each branch
This is a really clever approach and should be high-performance in the face of large flow graphs. 👍 👍
Noted what I think is one problematic bug in an edge case. Mainly since this is performance-critical code, I'd like to see the hot path optimized more and see the Listener logic include some explanatory comments -- while you and I can follow it right now with all the context in memory, I'm skeptical that'll be true in the future.
In general though this is The Right Thing To Do (tm) and done well, just needs explanation and minor tuning.
 */
@Exported(name="running")
public final boolean isActive() {
    if (this instanceof FlowEndNode) { // cf. JENKINS-26139
});
}

private static void assertActiveSteps(WorkflowRun b, String... expected) {
    List<String> actual = new ArrayList<>();
🐜 Did you mean to omit the second type? I only ask because you made a point of adding it for the other ones.
The “second type”? This is -source 7, so I am using diamond inference.
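For reference, the "diamond" inference being alluded to works like this (a generic illustration, not the PR's test code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DiamondSketch {
    static List<String> collectIds() {
        // Since Java 7, <> infers <String> from the declared type, so
        // there is no "second type" to write; this is equivalent to
        // new ArrayList<String>().
        List<String> actual = new ArrayList<>();
        actual.addAll(Arrays.asList("3", "5"));
        return actual;
    }

    static Map<String, Boolean> emptyClosedMap() {
        // Works for nested generic types too.
        return new HashMap<>();
    }
}
```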
}

private static void assertActiveSteps(WorkflowRun b, String... expected) {
    List<String> actual = new ArrayList<>();
    for (FlowNode n : new FlowGraphWalker(b.getExecution())) {
🐜 DepthFirstScanner, not FlowGraphWalker ;)
Either works fine, and this is test code so it does not really matter.
@Rule public LoggerRule logging = new LoggerRule().record(FlowNode.class, Level.FINER);

@Issue("JENKINS-38223")
@Test public void isActive() {
This is a really well-written testcase. 👍 👍
Though it would be useful to have more test coverage going forward.
}

@Override public void onRunning(FlowExecution execution) {
    LOGGER.finer("FlowExecutionListener working");
    assert !startNodesAreClosedByFlow.containsKey(execution.getOwner());
Mixed feelings about the assert here -- would rather we log something than accept an AssertionError.
Again, assertions are for tests, not production.
final Map<FlowExecutionOwner, Map<String, Boolean>> startNodesAreClosedByFlow = new HashMap<>();

static Map<FlowExecutionOwner, Map<String, Boolean>> startNodesAreClosedByFlow() {
    FlowL flowL = ExtensionList.lookup(FlowExecutionListener.class).get(FlowL.class);
    return flowL != null ? flowL.startNodesAreClosedByFlow : /* ? */ new HashMap<FlowExecutionOwner, Map<String, Boolean>>();
A bit confused about the intent here?
🐛 I think this may generate incorrect results if we somehow call this method prematurely -- the map can be written to by the GraphListener I think and that write will be lost.
Would be very hard to test I think though -- also I think the ExtensionList lookup should ensure the extension is loaded.
Would prefer to perhaps throw an exception if extension not loaded.
If flowL == null? Should never happen unless Jenkins is basically trashed anyway (in which case we would see all sorts of errors earlier). But FindBugs forced me to put in the check.
LOGGER.finer("GraphListener working");
if (node instanceof BlockStartNode || node instanceof BlockEndNode) {
    Map<String, Boolean> startNodesAreClosed = FlowL.startNodesAreClosedByFlow().get(node.getExecution().getOwner());
    if (startNodesAreClosed != null) {
Mention this is safe to do because the FlowExecutionListener has ensured that the FlowExecution is only null if flow is done (otherwise always set for all running builds).
Not sure what you are saying. That what is safe to do?
It will be null if we have an old workflow-job, FWIW.
Map<String, Boolean> startNodesAreClosed = FlowL.startNodesAreClosedByFlow().get(node.getExecution().getOwner());
if (startNodesAreClosed != null) {
    if (node instanceof BlockStartNode) {
        assert !startNodesAreClosed.containsKey(node.getId());
Again would prefer not to risk AssertionErrors here -- instead log something. Best not to risk breaking builds in flight if something wacky happens with the maps.
Assertions will normally be disabled in production systems. These are here to make sure mistakes will be caught in tests.
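The point about asserts can be illustrated with a hypothetical helper (not from the PR): an assert only fires when the JVM runs with -ea, which test harnesses typically enable and production JVMs typically do not, whereas the log-and-continue style never throws.

```java
class AssertStyleSketch {
    // Contrasts the two styles discussed above: an assert documents an
    // invariant and fails loudly under -ea, while the logging style
    // never interrupts a build in flight.
    static String check(boolean invariantHolds, boolean useAssert) {
        if (useAssert) {
            assert invariantHolds : "node id unexpectedly already present";
            return "ok";
        }
        if (!invariantHolds) {
            return "logged: node id unexpectedly already present";
        }
        return "ok";
    }
}
```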
 * it will be considered active until the {@link BlockEndNode} is added.
 */
@Exported(name="running")
public final boolean isActive() {
Because this method is an extremely hot path and performance-critical for Pipeline, I would strongly prefer to remove all the logging and order the conditionals to put the common case first. If you want logging, have two versions of this method, one of which includes logging and one which doesn't, and use a static boolean to decide which version to invoke (toggleable via the script console).
Specifically: I think the most common case is a node either being in the current heads (active=true) or NOT being a block start/end node AND not being in the current heads (active=false). I think putting that conditional first may make the hottest path perform well.
(I'm aware there's all sorts of speculative branch prediction and whatnot happening at the hardware level, and maybe some reordering in the JIT compilation where allowed, but it can't hurt.)
Happy to take a crack at that ordering myself in an add-on PR if you like.
"remove all the logging": if you look more carefully you will see that all of the logging calls use a message format with arguments that require no method calls, so the cost of the statement comes down to:
- create a small array
- log checks that the logger is enabled; it is not, so returns
This is cheap enough to leave in.
"put the common case first": read more carefully. I am putting the cheap checks first. Calling CpsFlowExecution.getCurrentHeads requires construction of a list and a map traversal: pretty cheap, but more work than the cases being handled before it.
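The cost argument can be demonstrated with plain java.util.logging (a generic sketch, not the plugin code): with the parameterized {0} form, a disabled level means the call reduces to a level check plus one small array allocation, and the arguments are never formatted.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class LogCostSketch {
    private static final Logger LOGGER = Logger.getLogger("LogCostSketch");

    static int formatCalls = 0;

    // An argument whose toString() counts how often it is rendered.
    static final Object COUNTING_ARG = new Object() {
        @Override public String toString() {
            formatCalls++;
            return "node";
        }
    };

    static void logFiner() {
        // With the message-format style, {0} is only rendered if the
        // record is actually published; no eager string concatenation.
        LOGGER.log(Level.FINER, "falling back for {0}", new Object[] {COUNTING_ARG});
    }
}
```

Contrast with `LOGGER.finer("falling back for " + node)`, which pays for the concatenation (and any toString() calls) even when FINER is disabled.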
    LOGGER.log(Level.FINER, "quick closed={0}", closed);
    return !closed;
} else {
    LOGGER.log(Level.FINER, "no record of {0} in {1}, presumably GraphListener not working", new Object[] {this, exec});
Worth mentioning that this will detect cases where GraphListeners as extensions can't be invoked for all FlowNodes.
    Method m = hudson.util.ReflectionUtils.findMethod(CpsFlowExecution.class, "getListenersToRun", null);
    if (m != null) {
        return true;
    }
    return true;

(Not suggesting you use this code, just that I had a case where it was desirable to have it myself.)
Not sure what you are getting at here. Either workflow-cps is 2.33+ and it will be called on all FlowNodes; or it is 2.32- and it will never be called.
@reviewbybees done
JENKINS-38223 and also JENKINS-26139: a more useful API than isRunning for some purposes. @reviewbybees