the existing file backend in block/file.rs gets us to pretty thrilling throughput (iops/bandwidth) figures (given enough worker threads), but is really pushing against the system to get that throughput. given files/devices/raw zvols that support async I/O, we can probably get much better throughput without needing several hundred threads by "just" doing async I/O from propolis. port_associate(3C) talks about exactly what we might want to do, under "Bind AIO transaction to a specific port".
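roughly, the pattern that man page section describes looks like this (a standalone C sketch of the AIO + event-port flow, not propolis code; the zvol path and the minimal error handling are just for illustration):

```c
/*
 * sketch: submit an aio_read(3C) whose completion is delivered to an
 * event port, per "Bind AIO transaction to a specific port" in
 * port_associate(3C). the device path is made up.
 */
#include <aio.h>
#include <port.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>

int
main(void)
{
    int port = port_create();
    int fd = open("/dev/zvol/rdsk/rpool/example", O_RDONLY);
    char buf[4096];

    if (port < 0 || fd < 0) {
        perror("setup");
        return (1);
    }

    /* deliver the completion to the port instead of raising a signal */
    port_notify_t pn = {
        .portnfy_port = port,
        .portnfy_user = NULL,   /* per-I/O cookie, e.g. a request pointer */
    };
    struct aiocb cb = { 0 };
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof (buf);
    cb.aio_offset = 0;
    cb.aio_sigevent.sigev_notify = SIGEV_PORT;
    cb.aio_sigevent.sigev_value.sival_ptr = &pn;

    if (aio_read(&cb) != 0) {
        perror("aio_read");
        return (1);
    }

    /* one thread can reap completions for many in-flight aiocbs */
    port_event_t pe;
    if (port_get(port, &pe, NULL) == 0 && pe.portev_source == PORT_SOURCE_AIO) {
        struct aiocb *done = (struct aiocb *)pe.portev_object;
        if (aio_error(done) == 0)
            printf("read returned %zd bytes\n", aio_return(done));
    }
    return (0);
}
```

the interesting part is the last step: one (or a few) reaper threads blocked in port_get() can service completions for an arbitrary number of outstanding I/Os, instead of one blocked thread per I/O.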
the math
we have a whole bunch of Propolis threads whose only purpose is to consume stack space, go into the kernel, wait, come back, and maybe sleep. the context switching from all this ends up on the order of 10% of total CPU time (or closer to 20% of non-idle CPU time). each I/O ends up at biowait() in the kernel (twice!), which is especially egregious for writes that complete in single-digit microseconds to the hardware. we're just offering up millions of opportunities to context switch too eagerly.
a different problem is that to plumb all the throughput hardware might support, we may need a truly astounding number of threads. with some relatively conservative figures, assume a disk can support 2M reads/sec at 100 microseconds per read. that's 200 thread-seconds of waiting every second, and if each thread has one I/O outstanding at a time, that implies at least 200 threads per disk. context switching gets even worse.
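spelling that arithmetic out:

$$
2{,}000{,}000\ \tfrac{\text{reads}}{\text{s}} \times 100\ \mu\text{s waiting per read} = 200\ \tfrac{\text{thread-seconds}}{\text{s}}
$$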
then there's the number of queues. we can tune the number of NVMe queues up, and that's great, but even at 64 queues that's ~4 threads constantly fighting over each SQ/CQ state lock. not great.
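for reference, that ~4 is roughly the thread count from above spread across the queues:

$$
\frac{\text{200+ threads}}{\text{64 queues}} \approx \text{3–4 threads contending per SQ/CQ pair}
$$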
so, the math doesn't look good for getting much better with synchronous I/O!