The first step should be to profile the end-to-end evaluation to see where most of the time is spent,
so that only the slow parts need to be parallelized.
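As a rough sketch of what that profiling step could look like, the snippet below uses Python's built-in `cProfile` and `pstats`. The functions `load_data`, `score`, and `evaluate` are hypothetical stand-ins for the real evaluation pipeline, which is not shown here; only the profiling wrapper is the point.

```python
import cProfile
import io
import pstats


def load_data(n):
    # Hypothetical cheap stage: stands in for loading the evaluation inputs.
    return list(range(n))


def score(x):
    # Hypothetical expensive stage: stands in for the per-item evaluation.
    return sum(i * i for i in range(2000)) + x


def evaluate(n=200):
    # Hypothetical end-to-end evaluation: load everything, then score each item.
    data = load_data(n)
    return [score(x) for x in data]


def profile_evaluation(n=200, top=5):
    # Run the whole evaluation under cProfile and return the result together
    # with a text report of the `top` entries sorted by cumulative time.
    profiler = cProfile.Profile()
    profiler.enable()
    result = evaluate(n)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    _, report = profile_evaluation()
    print(report)
```

Sorting by cumulative time shows which stage dominates the run. If one function accounts for most of it, that function is the candidate for parallelization and the rest can probably stay serial.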
Then we can decide on the best approach. MPI is powerful and scales to large clusters with several nodes, but it may not be worth the added complexity, and something simpler, such as multiprocessing on a single machine, may be enough.
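To give a feel for the simpler option, here is a minimal sketch that parallelizes a per-item step with the standard library's `multiprocessing.Pool`. It assumes the slow part is an independent, picklable function applied to each item; `score` and `evaluate_items` are hypothetical names, not part of any existing code.

```python
from multiprocessing import Pool


def score(x):
    # Hypothetical slow, independent per-item step found by profiling.
    return sum(i * i for i in range(2000)) + x


def evaluate_items(items, workers=1):
    # With a single worker, run serially. This gives a baseline to compare
    # against and avoids process start-up cost for small inputs.
    items = list(items)
    if workers <= 1:
        return [score(x) for x in items]
    # Otherwise spread the items over a pool of worker processes.
    # Pool.map preserves the input order in its result.
    with Pool(processes=workers) as pool:
        return pool.map(score, items)


if __name__ == "__main__":
    # The __main__ guard matters: on platforms that start workers by
    # re-importing this module, it prevents recursive pool creation.
    serial = evaluate_items(range(8), workers=1)
    parallel = evaluate_items(range(8), workers=4)
    print(serial == parallel)
```

If this turns out not to be enough and the work has to spread over several nodes, the same map-over-items structure should carry over to MPI with modest changes, for example through `mpi4py.futures.MPIPoolExecutor`, though that needs an MPI installation and a launcher and is not shown here.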