When the number of already-processed URLs is huge (e.g. 200M), loading them takes a long time. Investigate whether a better storage and loading method is available.
(pyahocorasick comes to mind, but if memory serves, it doesn't noticeably speed up loading from disk.)
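
One possible direction, sketched here only as a starting point: store a fixed-size hash of each URL in a sorted binary file and memory-map it, so "loading" is just opening the file and lookups are a binary search. This is not what the project currently does. The names `build_index`, `SeenUrls`, `url_digest` and `DIGEST_SIZE` are hypothetical, and the sketch assumes that a very small false-positive rate from 8-byte hash collisions is acceptable and that the index is rebuilt offline rather than appended to.

```python
import hashlib
import mmap
import os

# Assumption: 8 bytes per URL (about 1.6 GB for 200M URLs); collisions are
# possible but rare, so a lookup may very occasionally report a false positive.
DIGEST_SIZE = 8


def url_digest(url: str) -> bytes:
    """Return a fixed-size hash of the URL."""
    return hashlib.blake2b(url.encode("utf-8"), digest_size=DIGEST_SIZE).digest()


def build_index(urls, path):
    """Write the sorted, de-duplicated digests to a flat binary file.

    This sorts in memory for simplicity; at 200M entries an external
    sort or chunked merge would probably be needed instead.
    """
    digests = sorted({url_digest(u) for u in urls})
    with open(path, "wb") as f:
        f.write(b"".join(digests))


class SeenUrls:
    """Membership test over a memory-mapped, sorted digest file."""

    def __init__(self, path):
        self._file = open(path, "rb")
        size = os.path.getsize(path)
        # mmap cannot map an empty file, so fall back to empty bytes.
        self._map = (
            mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
            if size
            else b""
        )
        self._count = size // DIGEST_SIZE

    def _entry(self, i):
        return self._map[i * DIGEST_SIZE:(i + 1) * DIGEST_SIZE]

    def __contains__(self, url):
        target = url_digest(url)
        lo, hi = 0, self._count
        # Standard lower-bound binary search over fixed-width records.
        while lo < hi:
            mid = (lo + hi) // 2
            if self._entry(mid) < target:
                lo = mid + 1
            else:
                hi = mid
        return lo < self._count and self._entry(lo) == target

    def close(self):
        if self._count:
            self._map.close()
        self._file.close()
```

With this layout the operating system pages the file in on demand, so startup cost does not grow with the number of URLs; the trade-off is that adding new URLs means rewriting or merging the file. A typical use would be `seen = SeenUrls("seen.bin")` followed by `if url in seen: ...`.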