Sometimes, running many tasks takes a long time. If the graph fails when it's almost done, you currently need to rerun everything.
Solution: When execution of a graph fails, serialize the state and data of the task graph so that a later run can resume execution from that point. Note: It should be possible to change the code of tasks between failure and retry, since bugs are often the cause of task failure.
This needs to be well thought through with respect to multiprocessing and sharing of data.
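
A minimal single-process sketch of the idea, to make the proposal concrete. Everything here is hypothetical and not an existing API: the `run_graph` function, the `tasks`/`deps` dictionaries, and the pickle checkpoint file are assumptions made for illustration. The checkpoint stores only finished results keyed by task name, never the task functions themselves, so task code can be edited between failure and retry. It does not address multiprocessing, cycle detection, or results that cannot be pickled.

```python
import os
import pickle
import tempfile


def run_graph(tasks, deps, checkpoint_path):
    """Run tasks in dependency order, reusing results saved by a failed run.

    tasks: dict mapping task name -> callable that receives its
           dependencies' results as positional arguments (hypothetical shape)
    deps:  dict mapping task name -> list of dependency names
    """
    # Load results from a previous failed run, if a checkpoint exists.
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            results = pickle.load(f)

    def run(name):
        # Skip tasks whose result was already computed or restored.
        if name in results:
            return results[name]
        args = [run(dep) for dep in deps.get(name, [])]
        results[name] = tasks[name](*args)
        return results[name]

    try:
        for name in tasks:
            run(name)
    except Exception:
        # Persist only the finished results, keyed by task name.
        with open(checkpoint_path, "wb") as f:
            pickle.dump(results, f)
        raise

    # The whole graph succeeded, so the checkpoint is no longer needed.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
    return results


calls = []  # records which tasks actually executed


def load():
    calls.append("load")
    return [1, 2, 3]


def total_buggy(xs):
    raise ValueError("bug in task")


def total_fixed(xs):
    return sum(xs)


deps = {"total": ["load"]}
path = os.path.join(tempfile.mkdtemp(), "graph.ckpt")

try:
    run_graph({"load": load, "total": total_buggy}, deps, path)
except ValueError:
    pass  # first run fails; the checkpoint now holds the result of "load"

# Retry with the fixed task code: "load" is restored, not executed again.
results = run_graph({"load": load, "total": total_fixed}, deps, path)
```

Keying the checkpoint by task name is what allows the code change between runs, but it also means a retry cannot tell whether an already finished task was edited. A real design would probably need some invalidation rule for that case.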