I think we've seen a case where a huge dataclip has triggered an OOMKill by Kubernetes.
Now, a Run executes in a child process with a memory limit (usually 1GB). If that child process exceeds its limit, the Run should be OOMKilled, and nothing else is lost.
But that OOMKill does require memory allocation to run. A blocking operation, like JSON.stringify, could prevent the node process from blowing itself up cleanly.
So what can happen is this:
- A step completes and writes a huge dataclip to state - 1GB of JSON, why not
- The runtime serializes the state at the end of each step.
- For a large object, this serialization could use a lot of memory (I think JSON.stringify needs 3-4x the memory of the object it's serializing)
- And if this serialization is blocking, it'll just chew up available memory without the child process ever blowing itself up (see the sketch after this list)
- And if the pod uses too much memory, it'll be instantly and gracelessly killed by Kubernetes
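
To make the blocking step concrete, here's an illustrative snippet (not the runtime's actual code) that stringifies a large object synchronously. The event loop is pinned for the whole call, and RSS climbs well past the size of the object itself:

```ts
// Illustrative only: a synchronous stringify of a large object blocks the
// event loop and multiplies memory use while it runs
const mb = (bytes: number) => Math.round(bytes / 1024 / 1024);

// ~5 million array entries; the resulting JSON is a few hundred mb
const big = { data: new Array(5_000_000).fill('some-longish-payload-string') };
console.log(`before stringify: ${mb(process.memoryUsage().rss)}mb RSS`);

const start = Date.now();
// Nothing else can run on this thread until stringify returns - including
// any code that might notice we're approaching the memory limit
const json = JSON.stringify(big);
console.log(`stringify took ${Date.now() - start}ms for ${mb(json.length)}mb of JSON`);
console.log(`after stringify: ${mb(process.memoryUsage().rss)}mb RSS`);
```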
So I think I need to look closely at that serialization code. I think I use fast-safe-stringify right now - but it's synchronous, so it blocks, and maybe I need to borrow the streaming serializer from the worker.
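
For illustration, a streaming approach might look like the sketch below. This assumes the big-json package (not necessarily what the worker actually uses); the point is that state gets written out chunk by chunk, yielding to the event loop between chunks instead of in one blocking call:

```ts
import { createWriteStream } from 'node:fs';
// Assumption: using the big-json package as a stand-in streaming serializer
import * as json from 'big-json';

// Serialize state to a file incrementally, keeping the event loop responsive
const writeState = (state: object, path: string): Promise<void> =>
  new Promise((resolve, reject) => {
    json
      .createStringifyStream({ body: state })
      .on('error', reject)
      .pipe(createWriteStream(path))
      .on('finish', () => resolve())
      .on('error', reject);
  });
```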
Something else we might consider: if the state object exceeds a certain size limit, don't bother serializing at all, just pass it through - in a similar way to the exemption I want for streams on state.
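
A very rough sketch of that guard, assuming the object-sizeof package for a cheap in-memory size estimate (both the package and the threshold are illustrative, not a proposal for specific numbers):

```ts
// Hypothetical sketch of the pass-through guard. `sizeof` is the
// object-sizeof npm package (an assumption): it estimates in-memory byte
// size without stringifying anything
import sizeof from 'object-sizeof';

const MAX_SERIALIZABLE_BYTES = 256 * 1024 * 1024; // illustrative 256mb cap

// Returns a JSON string for normal state, or the live object untouched
// when it's too big to serialize safely
export const serializeState = (state: object): string | object =>
  sizeof(state) > MAX_SERIALIZABLE_BYTES
    ? state // too big: pass through, like the planned exemption for streams
    : JSON.stringify(state);
```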