A user named John reported having trouble loading a large tree sequence, generated by msprime I guess, into SLiM; it would just sit and sit (for more than a week!). I investigated, and the end result is this issue. The upshot is that loading very large tree sequences could be improved, in at least three ways that I've thought of so far:
- The number of mutations in each haplosome could be calculated ahead of time, in a "pre-flight" phase, and buffers could be allocated up front. This would avoid a lot of reallocs. For a really big model, the buffers are big enough that madvise is involved in their allocation, which gets extremely slow, so avoiding reallocs would pay off big time (see the sketch after this list).
- The number of mutation runs per haplosome could be adjusted upward from 1, according to some heuristic. For a long chromosome, the optimal number is almost certainly not going to be 1, and it would help a lot to choose a reasonable number up front rather than slowly grinding towards that optimum post-load with the runtime mutrun experiments. I have no idea what heuristic we might want to use, and it might interact with the next idea.
- Most importantly, mutation runs are not being used at load time to optimize for haplotype structure at all; we're basically representing all the mutations in the model as a big genotype matrix. It seems to me that we could leverage the structure of the tree sequence to find runs that are completely shared between two or more haplosomes, and then create that information as a single shared MutationRun (the sketch after this list gives a naive flavor of this too). I have no idea what the algorithm would be for that, but it sounds like an excellent PhD research project for somebody with a computational and tree-sequence bent! I think the payoff might be very large, in both runtime and memory usage.
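To make the first and third ideas a bit more concrete, here's a rough Python sketch against the tskit API – this is not SLiM's actual load path (that is C++, in Species::__AddMutationsFromTreeSequenceToHaplosomes()), just an illustration of the counting and run-sharing logic, and it ignores complications like multiple mutations at a site:

from collections import defaultdict
import tskit

ts = tskit.load("recap.trees")
counts = defaultdict(int)              # sample node id -> total mutations carried

for tree in ts.trees():
    carried = defaultdict(list)        # sample node id -> mutation ids in this tree
    for site in tree.sites():
        for mut in site.mutations:
            # every sample below the mutation's node carries it
            for s in tree.samples(mut.node):
                counts[s] += 1
                carried[s].append(mut.id)
    # idea (3), naively: haplosomes with identical mutation lists over this
    # tree's interval could share a single MutationRun for that interval
    shared = defaultdict(list)
    for s, ids in carried.items():
        shared[tuple(ids)].append(s)
    # entries of `shared` with more than one sample are candidate shared runs

# idea (1): these counts would let the per-haplosome buffers be allocated at
# their final sizes up front, avoiding the realloc churn described below
print("max mutations in one haplosome:", max(counts.values(), default=0))

In SLiM itself this would presumably happen in C++ during the load, and the run-sharing would need to respect mutation-run boundaries rather than tree intervals, but it gives the flavor.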
Maybe there are more ideas here too, if we delve into what is happening now and how it could work better. We should do some thinking on that. Tagging @petrelharp and @philippmesser.
OK. The rest of the text here is from an email I wrote to John and Peter. It's very verbose and is provided just for background, so that I remember the context this all came from:
I got curious about this, because I've never tried loading a large model back into SLiM after recapitation, so I did a little experimenting. The first thing I did, since I happen to have a set of tree sequences sitting around that represent a full-genome simulation of human evolutionary history (the Gravel model) with recapitation and neutral mutation overlay, was to load that simulation into SLiM. Not a chance. Loading chromosome 1 already took up about 40 GB of memory (from a .trees file of 2.38 GB); I think loading the full set into SLiM would've taken roughly 1 TB of memory. So that's interesting. :->
Then I made a new model with more or less the parameter values you described, John, although I didn't look at your model closely – I just wanted something simple for testing purposes. The SLiM model:
initialize() {
    initializeTreeSeq(simplificationRatio=INF, timeUnit="generations");
    initializeMutationRate(0);
    initializeMutationType("m2", 0.5, "f", 1.0);
    m2.convertToSubstitution = F;
    initializeGenomicElementType("g1", m2, 1);
    initializeGenomicElement(g1, 0, 124976000 - 1);
    initializeRecombinationRate(3e-10);
}
1 early() {
    sim.addSubpop("p1", 20000);
}
1 late() {
    sim.treeSeqOutput("decap.trees");
    sim.simulationFinished();
}
So 20,000 individuals, chromosome length 124976000 from you, no forward-simulated mutations, just one generation of forward simulation. Not sure why the recombination rate is 3e-10 – that's left over from somewhere, whoops – but anyway it doesn't matter since there's only one generation forward-simulated. Write out a tree sequence, done.
Then recapitation and neutral mutation overlay:
import tskit, pyslim, msprime

print("Loading tree sequence...")
ts = tskit.load("decap.trees")
print("")
print("Loaded tree sequence information:")
print(ts)

# recapitate: finish coalescence of the first-generation roots
ts = pyslim.recapitate(ts, ancestral_Ne=20000, recombination_rate=1e-8, random_seed=1)
print("")
print("Recapitated tree sequence information:")
print(ts)

# overlay neutral mutations, with SLiM-compatible metadata and non-colliding mutation IDs
next_id = pyslim.next_slim_mutation_id(ts)
ts = msprime.sim_mutations(ts, rate=1e-7, model=msprime.SLiMMutationModel(type=0, next_id=next_id), random_seed=1, keep=True)
ts.dump("recap.trees")
print("")
print("Overlaid tree sequence information:")
print(ts)
Ancestral Ne of 20,000, recombination rate 1e-8, mutation rate 1e-7, pretty vanilla. Here we're maybe at a similar state to what you produced with msprime, but like I said, I didn't look at your code closely.
That outputs this for the final tree sequence:
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 1,052,168║
╟───────────────┼───────────╢
║Sequence Length│1.24976e+08║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 40,000║
╟───────────────┼───────────╢
║Total Size │ 1.0 GiB║
╚═══════════════╧═══════════╝
╔═══════════╤══════════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪══════════╪═════════╪════════════╣
║Edges │ 4,027,356│122.9 MiB│ No║
╟───────────┼──────────┼─────────┼────────────╢
║Individuals│ 20,000│ 1.9 MiB│ Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼──────────┼─────────┼────────────╢
║Mutations │11,234,857│644.9 MiB│ Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Nodes │ 690,470│ 19.6 MiB│ Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Populations│ 3│ 2.4 KiB│ Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Provenances│ 3│ 4.7 KiB│ No║
╟───────────┼──────────┼─────────┼────────────╢
║Sites │10,737,346│245.8 MiB│ No║
╚═══════════╧══════════╧═════════╧════════════╝
So, we've got 11,234,857 mutations, which is a lot, although we know of more than 1 billion SNPs present in humans, apparently (!). Each mutation object takes 96 bytes in SLiM at present (sure would be nice if they were smaller, but there's a lot of data to track!), so just the mutation objects themselves take about 1 GB of memory: 11,234,857 × 96 bytes ≈ 1.08 GB. Representing all the haplosomes – which mutations are in which individuals – is going to take a lot more! So then I tried loading that tree sequence back in:
initialize() {
    initializeTreeSeq(simplificationRatio=INF, timeUnit="generations");
    initializeMutationRate(0);
    // m0 corresponds to the type=0 mutations overlaid by msprime above
    initializeMutationType("m0", 0.5, "f", 1.0);
    initializeMutationType("m2", 0.5, "f", 1.0);
    m2.convertToSubstitution = F;
    initializeGenomicElementType("g1", m2, 1);
    initializeGenomicElement(g1, 0, 124976000 - 1);
    initializeRecombinationRate(1e-8);
}
10 late() {
    start = clock();
    sim.readFromPopulationFile("/Users/bhaller/Desktop/recap.trees");
    catn("Elapsed: " + (clock() - start));
}
I watched it as it ran. The Species::__AddMutationsFromTreeSequenceToHaplosomes() method is the workhorse that does all the mutation addition, and profiling indicated that that's where it was spending its time as it ran. That method could probably stand to be optimized for big cases like this; it spends a lot of time in realloc() growing the buffers that hold the mutations, because it doesn't pre-flight how large those buffers are going to be. The current approach is probably optimal for smaller models, but for really big models like this, doing a pre-flight and allocating buffers of the correct size to begin with might be a win; or maybe not, since the pre-flight itself would have taken several minutes for this model.
Anyhow, memory usage grew to 65 GB, and my laptop has 64 GB, so it was getting to the point where macOS was probably compressing memory blocks dynamically, and swapping memory blocks out to disk, to try to fit the simulation into the available memory. 71 GB... 72 GB... 73 GB... 74 GB... I'm typing as I watch the memory usage grow. 75 GB... some time soon macOS will probably kill the process for taking too much memory. 76 GB... 77 GB... 78 GB... 79 GB... 80 GB... and, wow! Now it's down to 30 GB, 29 GB, so clearly it broke through some kind of barrier. Now it's down to 13 GB, 12 GB, 11 GB. It's still in Species::__AddMutationsFromTreeSequenceToHaplosomes(), but it seems to have resolved something. Looking at the code, I'm not really sure what that would be, so the behavior here is a bit of a head-scratcher.
But the take-home point is that the memory usage high-water mark for this model is quite high at load time, and depending on your OS and your available RAM, you might get well into the zone where the OS is doing things like swapping memory blocks to disk to make things fit. That tends to make things run VERY slowly. macOS prefers to compress memory blocks rather than swapping them to disk, since it can compress and decompress memory blocks very quickly. That is one possible explanation for the sudden drop from 80 GB to 11 GB; I think macOS might have just decided "OK, screw this, I'm gonna compress everything". So now it's still running, but presumably much more slowly because it's dynamically decompressing and re-compressing memory pages as it needs them, maybe? Hard to say, really; a lot of this OS-level stuff is pretty opaque.
The top command does tell me that right now 36 GB of memory pages are compressed, which is definitely not normal. So my guess is that SLiMgui is not actually down to 11 GB, but is still somewhere north of 80 GB and has just been massively compressed to fit into memory. top also tells me that about 30 million memory pages have been swapped out so far this session, and about 20,000 more swap-outs are happening every second or two; so macOS is also using swapping to make things fit. Probably it's being smart: swapping to disk the pages that are infrequently used, compressing the pages that it thinks it will need again relatively soon, and keeping pages uncompressed if they're in active use. Modern operating systems are amazing.
Anyway, I suspect that this overall picture is the problem for you, John. Big models like this just take a huge amount of memory to forward-simulate, because all of the mutations, and (even worse) all of the information about which mutations are present in which individuals, have to be kept in memory, and that's just a TON of information for a large model like this. Represented as a tree sequence, it is not that unreasonable, because the tree-sequence data structure is extremely compact; the recap.trees file here is about 1 GB. But SLiM's in-memory representation of the same information is MUCH larger, which is necessary for performance in various ways; we haven't figured out how to efficiently run a forward simulation directly on the tree-sequence data structure, although it'd be awesome if we could. From this run, it looks like the model scale you ended up at is probably about the biggest model my machine would be capable of handling. Beyond that, you don't necessarily get a crash or an out-of-memory message; you might just run EXTREMELY slowly because the operating system is trying to make things work through swapping and compression. You really don't want to be in that zone; once you start swapping and compressing, you're pretty much dead in the water.
So, long story short, there are certainly limits to how big forward simulations can be. :-> If you want to go further, you'll need to buy more memory – a LOT more memory. Given that, SLiM should be capable of continuing to grow in scale quite a bit; memory is the main limitation on model scale in this sense. (For models that run for many generations, runtime is typically the limiting factor, not memory; but here we're not even trying to run for a single tick, we're just trying to load the data in!)
OK, just now macOS put up a panel saying "your system has run out of application memory", asking me which process I'd like to kill. So that's the end of the road. SLiM's memory usage had grown back up to about 36 GB even despite the compression and the swapping, so it was overwhelming the ability of macOS to deal with the situation. I just killed it, and now my system is recovering back to normal; I've now got only 1.7 GB of memory compressed, 0 swapouts happening per second. It's frankly quite impressive that macOS was able to take things as far as it did, but there are limits. :-> My guess is that my other machine, with 128 GB RAM, maybe could have completed this model load, but I'm not going to bother trying; the overall picture is clear.
I hope that sheds light on things, John! I think Peter and I would still be curious to know how many mutations are present in the tree sequence you're trying to load, and how much memory the SLiM process takes once it is loaded.
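(For reference, a couple of lines of Python with tskit will report the mutation count; here your.trees is just a placeholder for whatever the file is actually named:)

import tskit

ts = tskit.load("your.trees")   # placeholder filename
print(ts.num_mutations, "mutations at", ts.num_sites, "sites,", ts.num_samples, "sample haplosomes")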
Some final thoughts. The Ne of the coalescent phase might matter quite a bit, since that will (I would think?) strongly influence how many mutations are present, especially for a model like this where the coalescent process runs right up to the present day. If I had run a long burn-in period in SLiM instead of saving after one tick, I'd expect the ancestral Ne to be fairly unimportant, and the population size in SLiM to be the key factor instead. This is because I'd expect most of the segregating mutations to be low-frequency alleles that arose close to the present, so the population size close to the present is going to be the main thing. Peter will have better instincts on this than me, though, perhaps.
The mutation rate and chromosome length are also key parameters, obviously; the total number of segregating mutations will presumably scale linearly with both.
When doing forward simulation, you're just going to be limited in the scale you can attain, so you need to decide which parameters you can compromise on – do you really need all those mutations, or can you simulate a lower mutation rate and extrapolate to what you would expect the results to be with more mutations? That sort of thing. When runtime is the limiting factor, model rescaling (see section 5.5 of the SLiM manual) can be quite helpful (at the price of approximations that can significantly distort the model's results); it might help here too. Even though the expected number of mutations would be the same after rescaling, you would at least have fewer individuals in which to track mutation presence/absence, so I'd expect the memory usage of the model to go down roughly in proportion to the rescaling factor Q, maybe? Dabi and Schrider, or Ferrari et al., might have looked at memory usage in their papers about model rescaling, so that might be worth a look.
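As a rough sanity check on that scaling intuition: under the standard neutral coalescent, the expected number of segregating sites is E[S] = theta * a, where theta = 4 * Ne * mu * L and a is the sum of 1/i for i from 1 to n-1, for n sampled haplosomes (Watterson 1975). Plugging in the test model's parameters in Python:

# expected segregating sites under the neutral coalescent (Watterson 1975)
Ne = 20000        # ancestral Ne used in recapitation
mu = 1e-7         # mutation rate per base per generation
L = 124976000     # chromosome length in bases
n = 40000         # sample nodes (2 haplosomes x 20,000 individuals)

theta = 4 * Ne * mu * L
a = sum(1.0 / i for i in range(1, n))
print(theta * a)  # ~1.12e7, close to the 11,234,857 mutations observed above

That comes out to about 11.2 million, right in line with the table above, which supports the idea that the coalescent-phase Ne, the mutation rate, and the chromosome length are the knobs that control the mutation count for a model like this one.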