The root cause of #9569 was twofold:
- A tokio task in trust quorum could, under unusual-but-expected conditions, go into an infinite loop due to not handling `read()` returning 0 (a sketch of this bug pattern follows this list). This is fixed by TQ: Fix potential infinite read loop and tokio hang #9612.
- This single task's behavior caused the entire tokio runtime to hang. That's the topic of this issue.
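
For reference, here's a minimal sketch of that bug pattern. This is illustrative only, not the actual trust quorum code (see #9612 for the real code and fix); it just shows the general shape of a read loop that never checks for EOF:

```rust
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

// Illustrative only -- not the trust quorum code. `read()` signals EOF by
// returning Ok(0); once the peer closes the connection, every subsequent
// call returns Ok(0) immediately, so a loop that doesn't check for it
// never terminates.
async fn read_loop(mut stream: TcpStream) {
    let mut buf = [0u8; 1024];
    loop {
        match stream.read(&mut buf).await {
            // BUG: missing `Ok(0) => break,` for the EOF case.
            Ok(n) => {
                // ... handle buf[..n] ...
                let _ = &buf[..n];
            }
            Err(_) => return,
        }
    }
}
```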
The relevant tokio issues are tokio-rs/tokio#4730, which is closed but whose title exactly matches the behavior we've observed here, and tokio-rs/tokio#6315, an open issue about better tolerating badly-behaved tasks. tokio-rs/tokio#6315 (comment) describes three ways that long polls (in our case, an infinitely long poll) can degrade performance (in our case, hang the runtime entirely), and notes that that issue is about the third way:
> There's a single blocking future on the thread that is currently responsible for receiving incoming IO/timer events.
The description of how to trigger this is:
- All threads are idle. Thread A is waiting for IO/timer events.
- Thread A receives an IO or timer event and wakes up to handle it.
- A single task is woken up by the IO/timer event, and thread A polls it.
- During the call to poll, a new IO (or timer) event becomes available.
- Thread A does not see the IO event since it's polling a future. The other worker threads don't see it either, since none of them are currently registered to receive IO events.
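
As an illustration of that sequence, here's a hedged repro sketch (hypothetical, not our code): one task never returns from poll, and whether anything else stalls depends on whether the worker that picks it up is the one currently responsible for driving IO/timer events, so it may take several runs to observe a hang:

```rust
use std::time::Duration;

fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        // A badly-behaved task: it never yields back to the scheduler, so
        // the worker thread that polls it is stuck in this poll forever.
        tokio::spawn(async {
            loop {
                std::hint::spin_loop();
            }
        });

        // A well-behaved task that needs the timer driver. If the stuck
        // worker is the one responsible for IO/timer events and the other
        // worker is parked, this sleep never completes and the program
        // hangs; otherwise it prints and exits.
        tokio::time::sleep(Duration::from_secs(1)).await;
        println!("timer fired; the runtime was not wedged this run");
    });
}
```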
That description was extremely helpful in narrowing down what was happening in #9569. The same comment also notes that it's possible to unwedge the runtime in this case by spawning a separate thread that periodically injects a do-nothing task, forcing the runtime to wake up:
> None of this happens if there are other active workers, because they will take over IO events when they next become idle. This is why you can make your own monitor thread by regularly spawning tasks on the runtime.
I'm strongly tempted to suggest we add this pattern to all of our tokio-based binaries, although it has tradeoffs: we have to choose an interval at which the monitor thread injects a task. Any hang will recover at the next injection point, so if the interval is too long we can still have large stalls, and if it's too short we're burning CPU on otherwise-worthless work. (That's technically true regardless of how long or short the interval is - this thread does nothing useful except "try to make sure tokio isn't wedged".)
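
Concretely, the pattern would look something like this. This is only a sketch; the `spawn_tokio_watchdog` name and the 1-second interval in the usage example are placeholders, not a decided design:

```rust
use std::time::Duration;

/// Sketch of the monitor-thread pattern described above. The caller hands us
/// a handle to the runtime it wants kept unwedged; the interval is the
/// tradeoff discussed above (longer = bigger possible stalls, shorter = more
/// useless wakeups).
fn spawn_tokio_watchdog(handle: tokio::runtime::Handle, interval: Duration) {
    std::thread::spawn(move || loop {
        // Spawning a do-nothing task wakes an idle worker, which can then
        // take over the IO driver if the current driver thread is stuck
        // polling a badly-behaved future.
        handle.spawn(async {});
        std::thread::sleep(interval);
    });
}

// Usage, e.g. early in main, from inside the runtime:
//     spawn_tokio_watchdog(tokio::runtime::Handle::current(), Duration::from_secs(1));
```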
I'll post a comment below with some notes about the particular debugging we did to confirm this is the situation we were in and some signs to look for if this comes up again.