Should job status poller or executors stop scaling blocks when an executor is in a bad state? #3992

@benclifford

Currently (0d4a261) and I think since its introduction in #1519, the job status poller continues to scale out blocks even when an executor is in a bad state (in the sense of set_bad_state_and_fail_all):

the job status poller doesn't stop scaling up as long as the executor reports active tasks, and the HighThroughputExecutor continues to count tasks that were failed by the bad state as active in htex.outstanding, so those tasks continue to contribute to the scaling load.
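To make the interaction concrete, here is a minimal toy model of the behaviour described above. It is a sketch, not Parsl's actual code: `ToyExecutor`, `blocks_wanted` and `tasks_per_block` are hypothetical names, and the scaling rule is deliberately simplified to "one block per N outstanding tasks".

```python
from concurrent.futures import Future


class ToyExecutor:
    """Hypothetical stand-in for an executor; not a real Parsl class."""

    def __init__(self):
        self.tasks = {}
        self.bad_state_is_set = False

    def submit(self):
        fut = Future()
        self.tasks[len(self.tasks)] = fut
        return fut

    def set_bad_state_and_fail_all(self, exc):
        self.bad_state_is_set = True
        for fut in self.tasks.values():
            if not fut.done():
                fut.set_exception(exc)
        # The futures are now failed, but their entries stay in self.tasks.

    @property
    def outstanding(self):
        # Counts every tracked task, including ones already failed above.
        return len(self.tasks)


def blocks_wanted(executor, tasks_per_block=2):
    """Simplified scaling rule: ceil(outstanding / tasks_per_block)."""
    return -(-executor.outstanding // tasks_per_block)


executor = ToyExecutor()
futures = [executor.submit() for _ in range(3)]
executor.set_bad_state_and_fail_all(RuntimeError("executor is broken"))

print(all(f.done() for f in futures))  # every task has already failed
print(executor.outstanding)            # but all are still counted as load
print(blocks_wanted(executor))         # so the toy poller still wants blocks
```

In this model every future is already done, yet the outstanding count is unchanged, so the scaling rule keeps asking for blocks on behalf of an executor that can no longer run anything.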

Generally this would go mostly unnoticed, except for occasional reports of spurious block launches.

However, PR #3991 hit a race condition in testing that is sensitive to this behaviour, and I was able to recreate it in my dev environment with a suitably placed 10ms delay -- see #3991 (comment)

The main point of this issue is to think about what the behaviour should be here:

If htex is in a bad state, perhaps it should not report any outstanding tasks, either by removing them as they are cancelled or by reporting 0 tasks.

If an executor is in a bad state, perhaps the job status poller should never try to scale up a block, no matter what spurious scaling load is reported. After all, the executor is in a "bad state" now, so perhaps we should ignore its state as much as possible.
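The two options can be compared in a small toy model. This is only a sketch under stated assumptions: `SketchExecutor`, `blocks_to_launch`, `drop_failed_tasks` and `skip_bad_executors` are hypothetical names invented for illustration, and the scaling rule is a simplified stand-in, not Parsl's real scaling strategy.

```python
from concurrent.futures import Future


class SketchExecutor:
    """Hypothetical executor model; not a real Parsl class."""

    def __init__(self, drop_failed_tasks):
        self.tasks = {}
        self.bad_state_is_set = False
        self.drop_failed_tasks = drop_failed_tasks  # executor-side option
        self._next_id = 0

    def submit(self):
        fut = Future()
        self.tasks[self._next_id] = fut
        self._next_id += 1
        return fut

    def set_bad_state_and_fail_all(self, exc):
        self.bad_state_is_set = True
        for task_id, fut in list(self.tasks.items()):
            if not fut.done():
                fut.set_exception(exc)
            if self.drop_failed_tasks:
                # Executor-side option: stop counting failed tasks as load.
                del self.tasks[task_id]

    @property
    def outstanding(self):
        return len(self.tasks)


def blocks_to_launch(executor, tasks_per_block=2, skip_bad_executors=False):
    """Simplified scaling rule with an optional poller-side guard."""
    if skip_bad_executors and executor.bad_state_is_set:
        # Poller-side option: ignore whatever load a bad executor reports.
        return 0
    return -(-executor.outstanding // tasks_per_block)  # ceiling division


def failed_executor(drop_failed_tasks):
    ex = SketchExecutor(drop_failed_tasks)
    for _ in range(3):
        ex.submit()
    ex.set_bad_state_and_fail_all(RuntimeError("executor is broken"))
    return ex


unchanged = failed_executor(drop_failed_tasks=False)
print(blocks_to_launch(unchanged))                           # spurious load
print(blocks_to_launch(unchanged, skip_bad_executors=True))  # poller guard
print(blocks_to_launch(failed_executor(drop_failed_tasks=True)))
```

In this model either change alone is enough to stop the spurious scale-out. The poller-side guard has the property that it does not depend on each executor reporting its load correctly once it is in a bad state.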
