39 changes: 37 additions & 2 deletions bin/propolis-server/src/lib/initializer.rs
@@ -481,6 +481,7 @@ impl MachineInitializer<'_> {
backend_spec: &StorageBackend,
backend_id: &SpecKey,
nexus_client: &Option<NexusClient>,
wanted_heap: &mut usize,
) -> Result<StorageBackendInstance, MachineInitError> {
match backend_spec {
StorageBackend::Crucible(spec) => {
@@ -504,6 +505,17 @@ impl MachineInitializer<'_> {
}
};

// Wild guess: we might collect up to 1MB (assuming we're
// limited by NVMe MDTS) of data in each Crucible worker. That
// is accumulated into a BytesMut, which is backed by a
// Vec::with_capacity. With a power of two capacity it's
// *probably* not rounded up further.
const PER_WORKER_HEAP: usize = MB;
// And Crucible workers are not currently tunable, so this is
// how many there are
// (see propolis::block::crucible::Crucible::WORKER_COUNT)
*wanted_heap += 8 * PER_WORKER_HEAP;
Comment on lines +516 to +517

Member: nit, sorry: could we perhaps import that WORKER_COUNT constant here?

Member Author: I'd thought about it, but I think we really should do the // TODO there and make it tunable. Then the number here would be that tunable, defaulted to DEFAULT_WORKER_COUNT (also 8), and it'd be basically the same as nworkers for the file backend.

Member: fine with me!
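
For illustration only (not part of the PR): a minimal sketch of what the reviewer's suggestion might look like, assuming the worker count were exposed as an importable (or tunable) constant. MB, WORKER_COUNT, and per_crucible_heap_estimate are stand-in names; the diff notes the real constant lives at propolis::block::crucible::Crucible::WORKER_COUNT and is currently 8.

// Sketch only: assumes a visible worker-count constant rather than a
// hard-coded 8. Values mirror the estimate in the diff above.
const MB: usize = 1024 * 1024;
const WORKER_COUNT: usize = 8; // stand-in for Crucible::WORKER_COUNT today

fn per_crucible_heap_estimate() -> usize {
    // Up to ~1 MiB (NVMe MDTS-limited) buffered per Crucible worker.
    const PER_WORKER_HEAP: usize = MB;
    WORKER_COUNT * PER_WORKER_HEAP
}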


let be = propolis::block::CrucibleBackend::create(
vcr,
propolis::block::BackendOpts {
@@ -560,6 +572,13 @@ impl MachineInitializer<'_> {
}
None => NonZeroUsize::new(DEFAULT_WORKER_COUNT).unwrap(),
};

// Similar to Crucible backends above: we might collect up to
// 1MB (assuming we're limited by NVMe MDTS) of data in each
// worker. This is a hack in its own right, see Propolis#985.
const PER_WORKER_HEAP: usize = MB;
*wanted_heap += nworkers.get() * PER_WORKER_HEAP;

let be = propolis::block::FileBackend::create(
&spec.path,
propolis::block::BackendOpts {
@@ -617,7 +636,9 @@ impl MachineInitializer<'_> {
&mut self,
chipset: &RegisteredChipset,
nexus_client: Option<NexusClient>,
) -> Result<(), MachineInitError> {
) -> Result<usize, MachineInitError> {
let mut wanted_heap = 0usize;

enum DeviceInterface {
Virtio,
Nvme,
@@ -642,13 +663,27 @@ impl MachineInitializer<'_> {
}
};

// For all storage devices we'll have a QueueMinder connecting
// each emulated device queue to storage backends. The minder and
// structures in its supporting logic don't have much state, but may
// do some dynamic allocation. Assume they won't need more than 1KiB
// of state (`in_flight` has at most nworkers entries currently and
// will need to grow only once or twice to a small capacity. The
// number of outstanding boxed requests and responses is at most
// nworkers. Might be more, but not much).
//
// 64 * 1K is a wild over-estimate while we support 1-15 queues
// across virtio-block and nvme.
wanted_heap += 64 * 1024;
Comment on lines +675 to +677

Member: 64k ought to be enough for anybody?
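
As a rough check on that number (a sketch, not code from the PR): with at most 15 queues across virtio-block and NVMe and roughly 1 KiB of QueueMinder state per queue, the real need is on the order of 15 KiB, so padding to 64 KiB per disk leaves ample slack. The helper below just restates that arithmetic together with the per-worker buffers; the name and shape are illustrative.

// Illustrative only: the per-disk heap estimate the initializer accumulates,
// written as a pure function.
const KB: usize = 1024;
const MB: usize = 1024 * KB;

fn per_disk_heap_estimate(nworkers: usize) -> usize {
    // Queue plumbing (QueueMinder and friends): ~1 KiB per queue, padded
    // to a deliberate over-estimate of 64 KiB.
    let queue_plumbing = 64 * KB;
    // Each backend worker may buffer up to ~1 MiB (NVMe MDTS-sized) of data.
    let worker_buffers = nworkers * MB;
    queue_plumbing + worker_buffers
}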


let bdf: pci::Bdf = pci_path.into();

let StorageBackendInstance { be: backend, crucible } = self
.create_storage_backend_from_spec(
&disk.backend_spec,
backend_id,
&nexus_client,
&mut wanted_heap,
)
.await?;

@@ -751,7 +786,7 @@ impl MachineInitializer<'_> {
};
}
}
Ok(())
Ok(wanted_heap)
}

/// Initialize network devices, add them to the device map, and attach them
35 changes: 32 additions & 3 deletions bin/propolis-server/src/lib/vm/ensure.rs
@@ -573,7 +573,8 @@ async fn initialize_vm_objects(
init.initialize_9pfs(&chipset);
}

init.initialize_storage_devices(&chipset, options.nexus_client.clone())
let wanted_heap = init
.initialize_storage_devices(&chipset, options.nexus_client.clone())
.await?;

let ramfb =
@@ -630,7 +631,7 @@ async fn initialize_vm_objects(
devices, block_backends, crucible_backends, ..
} = init;

Ok(InputVmObjects {
let res = InputVmObjects {
instance_spec: spec.clone(),
vcpu_tasks,
machine,
@@ -640,7 +641,35 @@ async fn initialize_vm_objects(
com1,
framebuffer: Some(ramfb),
ps2ctrl,
})
};

// Another really terrible hack. As we've found in Propolis#1008, brk()
// can end up starved by a long stream of page faults. We allocate
// and deallocate often enough in the hot parts of Propolis at runtime,
// but the working set is expected to be relatively tiny; we'd brk()
// once or twice, get up to a reasonable size, and never think twice
// about it.
//
// In practice that later allocation may be blocked for tens of
// *minutes*, and might be while holding locks for device state (in
// 1008 we hold a queue state lock, for example). Everything goes off
// the rails from there, as vCPUs can easily get dragged into the mess,
// guest NMIs aren't acknowledged because vCPUs can't run, etc.
//
// So, the awful hack. Most of the runtime growth that we can think of
// comes from the storage backend implementations. We guess at that amount
// in `initialize_storage_devices`. Now we'll alloc at least that much,
// free it, and try to avoid brk() when the OS won't be able to get to it.
//
// This should be able to be removed when we are confident we can
// actually brk() at runtime without starving ourselves out.
//
// As one last step, include an extra 16 MiB of heap as space we might want
// when running `propolis-server`, handling HTTP requests, etc.
let balloon = vec![0u8; wanted_heap + 16 * propolis::common::MB];
std::mem::drop(balloon);
Comment on lines +669 to +670

Member: question: rather than allocating a balloon, zero-initializing it, dropping it, and hoping that it correctly sets the max heap size such that the allocator won't try to brk to get more heap despite us having allocated wanted_heap and change... why not just call brk ourselves here?

Member Author: We know in practice that growing the heap ended up at brk(), but the allocator might not use brk/sbrk to actually manage the storage (there's an mmap backend for umem, for example, which we're not using today and probably won't tomorrow, but..).

So there's still the iffy bit that maybe the allocations have weird alignment constraints, or the allocator gets fragmented, such that 320 contiguous MiB or whatever doesn't cut it later on. To be more confident about that I think we'd want the kind of pooled buffers you were talking about earlier, but also I really, really want to fix the OS so we don't need this or the file backend buffers lol

Member: agreed wholeheartedly with... all of this.
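
To make the hack discussed above concrete in isolation (a sketch under the same assumptions as the PR, not the actual code): allocate and zero a balloon of the estimated size plus headroom, then drop it, hoping the allocator keeps the grown heap around so it never needs to brk() while holding device locks at runtime. prewarm_heap and HEADROOM are made-up names; the 16 MiB headroom mirrors the PR.

// Sketch of the balloon pre-allocation hack; effectiveness depends on the
// allocator retaining the freed space rather than returning it to the OS.
const MB: usize = 1024 * 1024;
const HEADROOM: usize = 16 * MB; // extra room for HTTP handling, etc.

fn prewarm_heap(wanted_heap: usize) {
    // vec![0u8; n] zero-initializes, so the pages are actually touched and
    // the allocation is fully backed before we free it.
    let balloon = vec![0u8; wanted_heap + HEADROOM];
    std::mem::drop(balloon);
}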


Ok(res)
}

/// Create an object used to sample kstats.
Expand Down