
Conversation

@Sanchit2662

Summary

This PR fixes a silent error propagation bug in the ArrayNode worker output-gathering path (flytepropeller/pkg/controller/nodes/array/worker.go).
Due to a hardcoded nil being returned on the response channel, errors encountered while reading sub-node outputs were silently discarded, leading to workflows completing successfully with missing or corrupted outputs.

This is a one-word fix (nil → err) that restores correct error propagation and aligns the behavior with the existing node execution handling logic.


Why are the changes needed?

ArrayNodes (map tasks) rely on gatherOutputs to read outputs from all sub-executions.
Under normal production conditions (object store latency, transient network failures, rate limiting), output reads can legitimately fail.

However, the current implementation captures the error internally but never returns it to the caller, causing:

  • Silent data loss
  • Incorrect workflow success states
  • Downstream tasks receiving nil literals
  • Extremely hard-to-debug production failures

This violates Flyte’s execution correctness guarantees.
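The failure mode above can be reproduced with a minimal sketch. This is a hypothetical, simplified model of the buggy pattern, not the real flytepropeller types: the names gatherResponse, readOutputs, and gather are illustrative only. The handler captures err but sends a hardcoded nil, so the caller sees success and proceeds with a nil map.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-in for the worker's response struct.
type gatherResponse struct {
	literalMap map[string]string
	err        error
}

// readOutputs simulates an output read that fails under transient
// conditions (object store latency, network disruption, rate limiting).
func readOutputs() (map[string]string, error) {
	return nil, errors.New("transient object store failure")
}

// gather models the buggy handler: the error is captured but a
// hardcoded nil is sent on the response channel.
func gather(respCh chan<- gatherResponse) {
	literalMap, err := readOutputs()
	_ = err // bug: error silently discarded
	respCh <- gatherResponse{literalMap, nil}
}

func main() {
	respCh := make(chan gatherResponse, 1)
	gather(respCh)
	resp := <-respCh
	if resp.err == nil {
		// Caller believes the read succeeded and continues with no data.
		fmt.Println("workflow proceeds; outputs:", resp.literalMap)
	}
}
```

Running this prints a "success" path even though every output read failed, which is exactly the silent-data-loss behavior described above.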


What changes were proposed in this pull request?

Root issue

In worker.run(), the gatherOutputsRequest handler correctly sets err when failures occur, but always sends nil on the response channel.

// Before
gatherOutputsRequest.responseChannel <- struct {
    literalMap map[string]*idlcore.Literal
    error
}{literalMap, nil}

Fix

Return the actual captured error instead of hardcoding nil:

// After
gatherOutputsRequest.responseChannel <- struct {
    literalMap map[string]*idlcore.Literal
    error
}{literalMap, err}

This mirrors the already-correct behavior used for nodeExecutionRequest earlier in the same worker loop and ensures consistent error handling.
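The corrected pattern can be sketched in isolation. The field names (responseChannel, literalMap) follow the PR description, but the surrounding types here are simplified stand-ins, not flytepropeller's real structs:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified response and request types; illustrative only.
type gatherResponse struct {
	literalMap map[string]string
	err        error
}

type gatherOutputsRequest struct {
	responseChannel chan gatherResponse
}

// readOutputs simulates a failing sub-node output read.
func readOutputs() (map[string]string, error) {
	return nil, errors.New("transient object store failure")
}

// handleGather models the fixed handler: it forwards the captured err
// instead of a hardcoded nil, matching how nodeExecutionRequest
// responses already propagate errors.
func handleGather(req gatherOutputsRequest) {
	literalMap, err := readOutputs()
	req.responseChannel <- gatherResponse{literalMap, err}
}

func main() {
	req := gatherOutputsRequest{responseChannel: make(chan gatherResponse, 1)}
	handleGather(req)
	resp := <-req.responseChannel
	if resp.err != nil {
		// The caller now sees the failure and can fail the node explicitly.
		fmt.Println("error propagated:", resp.err)
	}
}
```

With the error on the channel, the caller can surface an explicit, actionable failure rather than completing with missing outputs.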


Steps to reproduce (realistic)

  1. Run a workflow using a map task (ArrayNode) with moderate to high parallelism (e.g. 50+ sub-tasks)
  2. Introduce transient object store latency or network disruption during the ArrayNodePhaseSucceeding phase
  3. One or more output reads fail
  4. Before the fix: the workflow completes successfully with missing outputs.
     After the fix: the workflow fails with a clear, actionable error.

Impact

Before

  • Output read failures silently ignored
  • Workflows succeed with incorrect or missing data
  • Downstream task behavior becomes nondeterministic
  • Production issues are extremely difficult to diagnose

After

  • Output gathering errors are properly propagated
  • Workflows fail explicitly and predictably
  • Retry / failure semantics behave as expected
  • No silent data corruption

This affects all Flyte deployments using ArrayNodes, which is a core execution primitive.


How was this patch tested?

  • Manually verified error propagation path by inducing output read failures
  • Confirmed consistency with nodeExecutionRequest error handling logic
  • Ensured workerErrorCollector correctly captures and reports the error after the fix

No new tests were added due to the minimal and localized nature of the change, but the behavior now matches the established and already-tested execution path.
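If a regression test were added later, it could look roughly like this table-driven sketch. It uses the same hypothetical simplified types as above (gatherResponse, gatherOutputs), not the real flytepropeller test harness, and checks both the success and failure paths:

```go
package main

import (
	"errors"
	"fmt"
)

type gatherResponse struct {
	literalMap map[string]string
	err        error
}

// okRead and failingRead are illustrative stand-ins for output reads.
func okRead() (map[string]string, error) {
	return map[string]string{"o0": "value"}, nil
}

func failingRead() (map[string]string, error) {
	return nil, errors.New("object store timeout")
}

// gatherOutputs models the fixed handler shape: it forwards whatever
// error the read produced instead of a hardcoded nil.
func gatherOutputs(read func() (map[string]string, error), ch chan<- gatherResponse) {
	m, err := read()
	ch <- gatherResponse{m, err}
}

func main() {
	cases := []struct {
		name    string
		read    func() (map[string]string, error)
		wantErr bool
	}{
		{"success", okRead, false},
		{"read failure", failingRead, true},
	}
	for _, tc := range cases {
		ch := make(chan gatherResponse, 1)
		gatherOutputs(tc.read, ch)
		resp := <-ch
		// The response error must mirror the read's outcome exactly.
		if (resp.err != nil) != tc.wantErr {
			panic(tc.name + ": error propagation mismatch")
		}
		fmt.Println(tc.name, "ok")
	}
}
```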


Labels

  • fixed

Checklist

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
@welcome

welcome bot commented Jan 16, 2026

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@Sanchit2662
Author

Hi @machichima ,
This PR fixes a bug where ArrayNode output-gathering errors were silently swallowed, causing workflows to succeed with missing outputs.
Whenever you have time, I’d really appreciate it if you could take a look and review the PR.
Thank you!

