Should job status poller or executors stop scaling blocks when an executor is in a bad state? #3992

Currently (0d4a261c00ce2e78d6f11786f1772f652dc18d93) and I think since its introduction in #1519, the job status poller continues to scale out blocks even when an executor is in a bad state (in the sense of `set_bad_state_and_fail_all`):

The job status poller doesn't stop scaling up if the executor reports active tasks, and the HighThroughputExecutor continues to count bad-state-failed tasks as active in `htex.outstanding`, so those tasks continue to contribute to scaling load.
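To make the mechanism concrete, here is a toy model of it. This is a sketch only: `ToyExecutor` and `blocks_to_launch` are hypothetical stand-ins written for this issue, not Parsl's real classes or scaling strategy, and the "two tasks per block" rule is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyExecutor:
    """Hypothetical stand-in for an executor; not Parsl's real class."""
    tasks: dict = field(default_factory=dict)  # task id -> "pending" or "failed"
    bad_state: bool = False

    def submit(self, task_id):
        self.tasks[task_id] = "pending"

    def set_bad_state_and_fail_all(self):
        # Models the behaviour described above: every task is failed,
        # but nothing removes it from the tracking table.
        self.bad_state = True
        for task_id in self.tasks:
            self.tasks[task_id] = "failed"

    @property
    def outstanding(self):
        # Counts every tracked task, including ones that already failed.
        return len(self.tasks)


def blocks_to_launch(executor, tasks_per_block=2):
    """Toy scaling rule (an assumption): enough blocks to cover reported load."""
    return -(-executor.outstanding // tasks_per_block)  # ceiling division


ex = ToyExecutor()
for i in range(4):
    ex.submit(i)
ex.set_bad_state_and_fail_all()

# The executor is in a bad state, yet the toy poller still wants blocks.
requested = blocks_to_launch(ex)
print(requested)
```

In this model the poller still asks for blocks after the executor has failed everything, because the reported load never drops.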

Generally this would go mostly unnoticed, except for occasional reports of spurious block launches.

PR #3991 experienced a race condition in testing which is sensitive to this, though, and which I was able to recreate in my dev environment with a suitably placed 10ms delay -- see https://github.com/Parsl/parsl/pull/3991#issuecomment-3401053135

The main point of this issue is to think about what the behaviour should be here:

If htex is in a bad state, perhaps it should not be reporting any outstanding tasks: either by removing them as they are cancelled, or by reporting 0 tasks.

If an executor is in a bad state, perhaps the job status poller should never try to scale up a block, no matter what spurious scaling load is reported. After all, the executor is in a "bad state" now, so perhaps we should be ignoring its state as much as possible?
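The two options above could look roughly like this. Again this is only a sketch under assumptions: the function names, the `bad_state` flag and the "two tasks per block" rule are all hypothetical and are not taken from Parsl's code.

```python
def reported_outstanding(tasks, bad_state):
    """Option 1 (executor side): a bad-state executor reports zero load."""
    return 0 if bad_state else len(tasks)


def naive_blocks_to_launch(load, tasks_per_block=2):
    """Toy scaling rule (an assumption): ceiling of load / tasks_per_block."""
    return -(-load // tasks_per_block)


def guarded_blocks_to_launch(load, bad_state, tasks_per_block=2):
    """Option 2 (poller side): never scale out for a bad-state executor."""
    if bad_state:
        return 0
    return naive_blocks_to_launch(load, tasks_per_block)


tasks = ["t0", "t1", "t2", "t3"]

# Option 1: the poller is unchanged, but the executor stops reporting load.
option_1 = naive_blocks_to_launch(reported_outstanding(tasks, bad_state=True))

# Option 2: the executor still reports stale load, but the poller ignores it.
option_2 = guarded_blocks_to_launch(len(tasks), bad_state=True)

# A healthy executor is unaffected by either change.
healthy = guarded_blocks_to_launch(reported_outstanding(tasks, bad_state=False),
                                   bad_state=False)
```

The options are not exclusive. Option 2 protects against any executor that reports stale load, while option 1 keeps the reported number meaningful for anything else that reads it.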
