Description
As of 0d4a261, and I think since its introduction in #1519, the job status poller continues to scale out blocks even when an executor is in a bad state (in the sense of set_bad_state_and_fail_all):
the job status poller doesn't stop scaling up while the executor reports active tasks, and the HighThroughputExecutor continues to count bad-state-failed tasks as active in htex.outstanding, so those tasks continue to contribute to scaling load.
Generally this would go mostly unnoticed, except for occasional reports of spurious block launches.
PR #3991 experienced a race condition in testing which is sensitive to this, though, and which I was able to recreate in my dev environment with a suitably placed 10ms delay -- see #3991 (comment)
The main point of this issue is to think about what the behaviour should be here:
- If htex is in a bad state, perhaps it should not be reporting any outstanding tasks: perhaps by removing them as they are cancelled, or perhaps by reporting 0 tasks.
- If an executor is in a bad state, perhaps the job status poller should never try to scale up a block, no matter what spurious scaling load is reported: after all, the executor is in a "bad state" now, so perhaps we should be ignoring its state as much as possible?
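
To make the two options concrete, here is a minimal toy sketch. It is not Parsl's actual code: `ToyExecutor`, its `bad_state` field, `submit` and `wants_scale_out` are hypothetical names invented for illustration, and only `outstanding` and `set_bad_state_and_fail_all` echo names mentioned above. The real poller and htex internals will differ.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyExecutor:
    """Hypothetical stand-in for an executor; not Parsl's real class."""
    tasks: Dict[int, str] = field(default_factory=dict)
    bad_state: Optional[Exception] = None

    def submit(self, task_id: int) -> None:
        self.tasks[task_id] = "pending"

    def set_bad_state_and_fail_all(self, exc: Exception) -> None:
        self.bad_state = exc
        # Executor-side option, variant A: drop tasks as they are failed,
        # so they stop counting towards scaling load.
        self.tasks.clear()

    @property
    def outstanding(self) -> int:
        # Executor-side option, variant B: report zero load whenever
        # the executor is in a bad state.
        if self.bad_state is not None:
            return 0
        return len(self.tasks)


def wants_scale_out(executor: ToyExecutor) -> bool:
    # Poller-side option: ignore whatever load is reported once the
    # executor is in a bad state.
    if executor.bad_state is not None:
        return False
    return executor.outstanding > 0
```

In this sketch either guard alone is enough to stop scale-out after a bad state is set; having both is redundant but would protect against an executor that keeps reporting stale load.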