Network simulation, integration tests and improved startup#13
Network simulation, integration tests and improved startup#13
Conversation
…tion in SrtlaConnection
…with utility method
Increase SWITCH_THRESHOLD from 1.02 to 1.10 so a new connection must be 10% better before traffic is moved to it. With the shorter 15ms cooldown, hysteresis is now the primary stability mechanism; 2% was too small to prevent noise-driven flip-flopping between connections with similar scores. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iral Before REG3, forward_via_connection() tracks data packets in packet_log, creating phantom in-flight counts that never get ACKed. Early NAKs from these packets also penalize quality scoring. This cascading penalty starves connections of traffic, causing the system to get stuck at low throughput after startup. Reset packet_log, in_flight_packets, congestion state, and quality cache when REG3 is received so every connection starts with a clean slate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eling - Introduced a new workspace with a `network-sim` crate for simulating network conditions. - Implemented `ImpairmentConfig` for configuring network impairments using `tc netem`. - Added `GemodelConfig` for modeling bursty packet loss with Gilbert-Elliott model. - Created `Scenario` and `ScenarioConfig` for generating deterministic random-walk impairment scenarios. - Implemented `Namespace` management for Linux network namespaces, including veth link creation. - Added tests for impairment application and scenario generation to ensure functionality.
- Introduced a new `harness.rs` module for managing integration tests. - Added common utilities for integration tests in `mod.rs`. - Implemented basic connectivity tests in `netns_basic.rs` to validate registration and data forwarding. - Created failure and recovery tests in `netns_failure.rs` to ensure link failure detection and recovery. - Developed impairment tests in `netns_impairment.rs` to validate adaptation to network conditions. - Added scenario-driven tests in `netns_scenario.rs` to assess stability under evolving impairments. - Updated `Cargo.toml` and `Cargo.lock` to include `tempfile` dependency. - Enhanced `lib.rs` to expose new test harness functionalities.
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughAdds a new network-sim crate (namespaces, impairment, scenarios, harness), refactors RTT tracking to use EWMA with dual sliding windows, consolidates startup-grace configuration, increases enhanced-mode hysteresis to 10%, exposes queued packet counts, updates keepalive timestamping, and adds many integration tests and test utilities. Changes
Sequence Diagram(s)sequenceDiagram
participant TestRunner as Test Runner
participant Topology as Namespace Manager (Namespace)
participant Process as srtla_send / srtla_rec / srt-live-transmit
participant Tc as tc/netem (via sudo)
participant Injector as UDP Injector
participant Collector as Output Collector
TestRunner->>+Topology: create namespaces, veths, assign IPs
Topology-->>-TestRunner: namespaces ready
TestRunner->>+Process: start processes inside namespaces
Process-->>-TestRunner: processes started (stdout/stderr streams)
TestRunner->>+Tc: apply impairment (apply_impairment)
Tc-->>-TestRunner: tc qdisc applied
TestRunner->>+Injector: inject UDP packets/stream
Injector-->>-Process: packets arrive -> processing
Process->>+Collector: emit logs, metrics
Collector-->>-TestRunner: collect outputs (on stop)
TestRunner->>Topology: cleanup namespaces (Drop)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/sender/selection/mod.rs (1)
16-16:⚠️ Potential issue | 🟡 MinorStale doc comment: hysteresis is now 10%, not 2%.
Per the PR objectives,
SWITCH_THRESHOLDwas changed from 1.02 to 1.10. This line still says "Minimal hysteresis (2%)".Proposed fix
-//! - Minimal hysteresis (2%) to prevent flip-flopping +//! - Hysteresis (10%) to prevent flip-flopping
🤖 Fix all issues with AI agents
In `@crates/network-sim/src/harness.rs`:
- Around line 170-197: kill() currently signals the sudo wrapper
(self.pid()/self.child) which may not forward SIGTERM to the inner process;
change kill() to either send the signal to the entire process group or discover
and signal the inner PID before falling back to killing the sudo wrapper: locate
the inner PID from /proc/<sudo_pid> (e.g., inspect /proc/<pid>/task/*/children
or /proc/<pid>/status NSpid field) and call sudo kill on that PID, or send to
the process group by using a negative PID (sudo kill -TERM -- -<pgid>) so the
inner process receives it; ensure the same approach is used for the force-kill
path and keep behavior compatible with Namespace::drop / SrtlaTestStack::drop
cleanup.
- Around line 543-544: Guard against division by zero on packets_per_sec before
computing interval_us: check packets_per_sec (the value used to compute
interval_us) and return an error or handle the zero case (e.g., bail with a
clear message or treat zero as invalid) instead of performing 1_000_000 /
packets_per_sec; alternatively use checked_div on 1_000_000 and handle the None
result. Ensure the change updates the function that computes interval_us so
callers get a clear error when packets_per_sec == 0.
- Around line 537-556: The Python one-liner built in inject_udp_stream creates a
while loop whose body lines are not indented, causing an IndentationError;
update the formatted script (the script string passed to ns.exec_checked in
inject_udp_stream) so that the lines for s.sendto(...), i+=1 and time.sleep(...)
are indented (e.g., prefix with consistent spaces) inside the while block and
keep s.close() and the print statement placed after the loop if intended;
regenerate the script string accordingly so python3 -c receives a syntactically
valid, properly indented multi-line script.
In `@crates/network-sim/src/impairment.rs`:
- Around line 74-78: The build fails due to use of let-chains (e.g., the
if-let-and condition in crates/network-sim/src/impairment.rs around the
self.jitter_ms check and similar at the later block), so update the workspace
Cargo.toml's rust-version field from "1.87" to "1.88" (or any later stable
version) so the let-chain syntax is supported; you can keep edition = "2024"
as-is—just change the rust-version key in the workspace Cargo.toml so the
project compiles with stable Rust 1.88+.
In `@crates/network-sim/src/scenario.rs`:
- Around line 80-82: The calculation of total_steps in frames() can panic when
self.cfg.step is Duration::ZERO (division by zero producing infinity and casting
to u64), so add an explicit guard at the start of frames(): assert or
debug_assert that self.cfg.step != Duration::ZERO and, for runtime safety,
handle the zero-case by returning an empty Vec (or panicking with a clear
message) before computing total_steps; update references to cfg.step and
total_steps in the function (frames(), cfg.step, cfg.duration, total_steps)
accordingly so the function no longer attempts the division when step is zero.
In `@crates/network-sim/src/test_util.rs`:
- Around line 25-33: The function unique_ns_name currently truncates the entire
formatted name which can cut off the uniqueness suffix; change it so the suffix
is always preserved: compute seq via NS_COUNTER.fetch_add(1, Ordering::Relaxed)
and pid, build the uniqueness suffix (e.g., format!("_{pid:x}_{seq}")), then if
you must enforce a max length (or keep the existing 15-char limit) truncate only
the prefix to max_len - suffix.len() and concat prefix + suffix; alternatively
remove the 15-char truncation here and let callers that need IFNAMSIZ-sized
names (e.g., Namespace::new usage in harness.rs) perform interface-specific
truncation. Ensure the function unique_ns_name always returns a name that
includes the full _{pid:x}_{seq} suffix.
In `@src/connection/mod.rs`:
- Around line 343-365: clear_pre_registration_state currently clears packet_log,
in_flight_packets, highest_acked_seq, congestion, and quality_cache but does not
reset the batch queue; update clear_pre_registration_state to call
self.batch_sender.reset() (similar to reset_state's use of batch_sender.reset())
— place the call alongside the other resets (e.g., after self.packet_log.clear()
or with self.congestion.reset()) to ensure queued batches from pre-registration
are discarded.
In `@tests/common/mod.rs`:
- Line 2: The module-level attribute #![allow(dead_code)] conflicts with the
#[expect(dead_code)] on the inject_stream item causing an
unfulfilled_lint_expectations warning; fix by removing the redundancy—either
delete the module-level #![allow(dead_code)] and mark specific items with
#[allow(dead_code)] as needed, or remove the #[expect(dead_code)] on
inject_stream (or change it to #[allow(dead_code)]) so only one suppression
remains; locate the attributes in tests/common/mod.rs (module-level
#![allow(dead_code)] and the #[expect(dead_code)] on inject_stream) and apply
one of those changes.
In `@tests/netns_scenario.rs`:
- Around line 91-95: The loop over frame.configs currently swallows errors from
stack.impair_link by only eprintln!-ing them; change this so failures are
surfaced to the test: capture errors (e.g., accumulate into a Vec or increment a
failure counter) when calling stack.impair_link(i, cfg.clone()) and after the
loop assert that there were zero failures (or call panic!/assert!(false) with a
joined error message) so the test fails if any impairment failed; update the
code around frame.configs.iter().enumerate() / stack.impair_link to record and
report the concrete errors rather than continuing silently.
🧹 Nitpick comments (13)
crates/network-sim/Cargo.toml (1)
11-11: Inconsistentrandversion across workspace.The root crate depends on
rand = "0.9"whilenetwork-simusesrand = "0.10". This pulls two separate versions ofrandinto the dependency tree. Sincenetwork-simis dev-only, it won't bloat the release binary, but aligning versions would reduce compile times and avoid potential type incompatibilities if rand types are ever passed across crate boundaries..coderabbit.yaml (1)
78-80:base_branchesmay be redundant ifmainis already the default branch.Per the schema,
base_branchesis for branches other than the default branch. Ifmainis the repository's default branch, this entry has no effect. Consider removing it or adding only non-default target branches.src/ewma.rs (1)
24-31: No validation onalpharange in constructors.Passing
alpha <= 0oralpha > 1silently produces incorrect smoothing behavior (e.g.,alpha = 0freezes after the first sample, negative alpha diverges). Since this is a shared utility, adebug_assert!would catch misuse early without runtime cost.🛡️ Optional: add debug assertions
pub fn new(alpha: f64) -> Self { + debug_assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0.0, 1.0]"); Self { value: 0.0, alpha_up: alpha,Similarly for
asymmetric():pub fn asymmetric(alpha_up: f64, alpha_down: f64) -> Self { + debug_assert!(alpha_up > 0.0 && alpha_up <= 1.0, "alpha_up must be in (0.0, 1.0]"); + debug_assert!(alpha_down > 0.0 && alpha_down <= 1.0, "alpha_down must be in (0.0, 1.0]"); Self {src/sender/mod.rs (1)
59-70: UseSchedulingMode'sDisplayimplementation to simplify the log formatting.
SchedulingModeimplementsDisplay(src/mode.rs:64-72), which already produces the exact same strings ("classic", "enhanced", "rtt-threshold"). This is consistent with howstats.rsformats the mode (line 140:format!("{}", config.mode)). The manual match is redundant.♻️ Simplify with Display
info!( - "starting srtla_send: local_srt_port={}, receiver={}:{}, ips_file={}, mode={}", - local_srt_port, - receiver_host, - receiver_port, - ips_file, - match config.mode() { - crate::mode::SchedulingMode::Classic => "classic", - crate::mode::SchedulingMode::Enhanced => "enhanced", - crate::mode::SchedulingMode::RttThreshold => "rtt-threshold", - } + "starting srtla_send: local_srt_port={}, receiver={}:{}, ips_file={}, mode={}", + local_srt_port, + receiver_host, + receiver_port, + ips_file, + config.mode() );tests/netns_failure.rs (2)
47-51: Panic detection via string search is fragile.Checking for
"PANIC"/"panic"in stderr won't catch all failure modes (e.g., a segfault or silent abort), and could false-positive on legitimate log messages containing the word "panic" in a descriptive context (e.g.,"anti-panic guard"). Consider additionally asserting on the process exit code (e.g., thatsrtla_sendexited with status 0) and, where feasible, checking for positive evidence of failover/recovery (e.g., log lines indicating link switchover or re-registration).Also applies to: 93-97
54-98:test_link_recoverydoesn't inject traffic before or during the failure phase.Unlike
test_link_failure_failover, this test kills link 0 without any prior data flow. If the intent is to verify recovery under load, consider injecting packets before and during the impairment phase so the sender has in-flight state to reconcile on recovery. If the intent is purely to test idle recovery, a brief comment clarifying that would help.tests/netns_impairment.rs (1)
60-95: Test name overpromises relative to assertions.
test_loss_triggers_window_reductionimplies it verifies that congestion window reduction occurs, but the only assertion is the absence of "panic" in stderr. Consider either renaming to something liketest_loss_no_crashor adding an assertion that checks log output for window reduction indicators (e.g., a log line mentioning window decrease).tests/netns_scenario.rs (1)
56-84: Duplicated UDP injection logic bypasses thenetwork_simhelper.The inline
python3one-liner duplicates the injection logic fromnetwork_sim::inject_udp_packets(seecrates/network-sim/src/harness.rs, lines 524-533). If the packet format or injection method changes, this code path won't be updated.Consider making the
Namespacetype cheaply clonable (or providing afrom_nameconstructor) so the injection helper can be used from the spawned thread, or extract a function innetwork_simthat accepts a namespace name string.crates/network-sim/src/topology.rs (2)
32-44: Optional: extract shared argument-building logic.Both
execandexec_checkedconstructfull_argsidentically. A small private helper would reduce duplication.♻️ Example extraction
+ fn ns_args<'a>(&'a self, cmd: &'a str, args: &'a [&'a str]) -> Vec<&'a str> { + let mut full_args = vec!["ip", "netns", "exec", &self.name, cmd]; + full_args.extend_from_slice(args); + full_args + } + pub fn exec(&self, cmd: &str, args: &[&str]) -> Result<Output> { - let mut full_args = vec!["ip", "netns", "exec", &self.name, cmd]; - full_args.extend_from_slice(args); - sudo(&full_args).with_context(|| format!("exec '{cmd}' in ns '{}'", self.name)) + sudo(&self.ns_args(cmd, args)) + .with_context(|| format!("exec '{cmd}' in ns '{}'", self.name)) } pub fn exec_checked(&self, cmd: &str, args: &[&str]) -> Result<Output> { - let mut full_args = vec!["ip", "netns", "exec", &self.name, cmd]; - full_args.extend_from_slice(args); - sudo_checked(&full_args) + sudo_checked(&self.ns_args(cmd, args)) .with_context(|| format!("exec '{cmd}' in ns '{}'", self.name)) }
46-104: Consider validating interface name length.The doc on Line 49 states interface names must be ≤ 15 chars, but there's no enforcement here. Callers like
SrtlaTestTopology::newtruncate, but a defensive check would prevent silent failures from theipcommands.🛡️ Proposed guard
pub fn add_veth_link( &self, peer: &Namespace, local_iface: &str, peer_iface: &str, local_ip: &str, peer_ip: &str, ) -> Result<()> { + if local_iface.len() > 15 || peer_iface.len() > 15 { + bail!("interface names must be <= 15 chars (got '{}', '{}')", local_iface, peer_iface); + } + // Clean up stale veth (idempotent)crates/network-sim/src/impairment.rs (1)
186-196: Unused_ifaceparameter intc_checked.The
_ifaceparameter is never used in the function body. Either remove it or incorporate it into the error message for better diagnostics.♻️ Option A: Remove the parameter
-fn tc_checked(ns: &Namespace, _iface: &str, args: &[&str], ctx: &str) -> Result<()> { +fn tc_checked(ns: &Namespace, args: &[&str], ctx: &str) -> Result<()> {Update all call sites accordingly.
♻️ Option B: Use it in the error message
-fn tc_checked(ns: &Namespace, _iface: &str, args: &[&str], ctx: &str) -> Result<()> { +fn tc_checked(ns: &Namespace, iface: &str, args: &[&str], ctx: &str) -> Result<()> { let output = ns.exec("tc", args)?; if !output.status.success() { bail!( - "{ctx}: tc {}\n{}", + "{ctx} on {iface}: tc {}\n{}", args.join(" "), String::from_utf8_lossy(&output.stderr).trim() ); }crates/network-sim/src/harness.rs (2)
508-515:Dropdelegates toNamespaceProcess::drop— subtle but correct.
.take()moves theNamespaceProcessout of theOptionand the returned value is immediately dropped, which triggersNamespaceProcess::drop→kill(). Worth adding a brief comment for clarity.📝 Add explanatory comment
impl Drop for SrtlaTestStack { fn drop(&mut self) { - // Ensure all processes are killed even if stop() wasn't called + // Ensure all processes are killed even if stop() wasn't called. + // Each .take() drops the NamespaceProcess, triggering its Drop → kill(). self.srtla_send.take(); self.srtla_rec.take(); self.srt_server.take(); } }
562-595: Binary discovery relies onunwrap()for parent directory traversal.Lines 568-569 and 574-575 call
.parent().unwrap().parent().unwrap(). IfCARGO_MANIFEST_DIRhas an unexpected structure (e.g., root path), these will panic. Since this is test infrastructure andCARGO_MANIFEST_DIRis always set by cargo to a well-structured path, the risk is low — but a.context()chain would be more consistent with the rest of the codebase's error-handling style.
…andling in random walk test
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@crates/network-sim/src/harness.rs`:
- Around line 553-558: The generated Python script string in the `script`
binding (used by `inject_udp_stream`) contains a literal backslash followed by a
source newline (`"\\\n"`), producing an invalid Python line-continuation;
replace that sequence so the Python code actually contains a newline.
Concretely, locate the fragment inside the format! where `d=b'\\x00'*188\\\n
nstart=time.time(); i=0` is produced and change the `"\\\n nstart"` to a
proper newline escape like `"\nstart"` (i.e.,
`d=b'\\x00'*188\nstart=time.time(); i=0`) so the Python code has a real newline
between the `*188` and `start`. Ensure the surrounding format! placeholders
(`{dur_secs}`, `{target_ip}`, `{port}`, `{interval_us}`) remain unchanged.
🧹 Nitpick comments (7)
src/registration/probing.rs (1)
154-175: Note: test accessors use#[cfg(test)]instead of#[cfg(feature = "test-internals")].Per the coding guidelines, internal fields for testing should be exposed via
#[cfg(feature = "test-internals")]. These accessors use#[cfg(test)]which limits them to unit tests within the crate. If integration tests or other crates need access, consider switching to the feature flag. Since this is pre-existing code, this can be addressed separately.As per coding guidelines, "Use conditional compilation with #[cfg(feature = "test-internals")] to expose internal fields for testing".
tests/netns_scenario.rs (2)
57-84: Background injector errors are silently swallowed.Both the
Commandoutput (line 66) and the thread join result (line 99) are discarded withlet _ = .... Ifpython3isn't available or the namespace name becomes stale, zero packets are injected and the test still passes — it just validates the impairment framework, not srtla_send's behavior under load.Consider at minimum unwrapping the
join()to surface thread panics:Suggested change
- let _ = inject_handle.join(); + inject_handle.join().expect("injection thread panicked");
117-193: Test validates crash-freedom but not convergence.The test name
test_step_change_convergencesuggests it validates that srtla_send adapts to changed link conditions, but the only assertion is the absence of "panic" in stderr. Consider adding a TODO or renaming totest_step_change_no_crashto set expectations, or adding a lightweight convergence check in a follow-up (e.g., verify traffic eventually flows on the non-degraded link).crates/network-sim/src/scenario.rs (1)
188-204: Bounds test doesn't cover delay.
values_stay_within_boundsvalidates rate and loss bounds but doesn't verify that delay stays within[1, base_delay_ms + delay_jitter_ms]. Consider adding a delay assertion for completeness.Suggested addition
let loss = config.loss_percent.unwrap(); assert!(loss >= 0.0, "negative loss"); assert!(loss <= link_cfg.max_loss_percent, "loss {loss} > max"); + + let delay = config.delay_ms.unwrap(); + assert!(delay >= 1, "delay {delay} < 1"); + let max_delay = link_cfg.base_delay_ms + link_cfg.delay_jitter_ms; + assert!(delay <= max_delay, "delay {delay} > max {max_delay}"); }crates/network-sim/src/harness.rs (3)
149-168:stdout_lines()andstderr_lines()silently return empty on second call.Both methods use
.take()on the child's pipes, so a second invocation always returnsvec![]with no warning. This is fine given current usage (called once instop()), but it's a footgun for future callers.Consider either documenting this as
/// **Note:** consumes the pipe; returnsvec![]on subsequent calls.or buffering the lines internally.
511-518:SrtlaTestStack::droprelies on implicitNamespaceProcess::dropfor cleanup.The
.take()calls extract theOption<NamespaceProcess>values without binding them, so theNamespaceProcessvalues are dropped immediately, triggering theirDrop::drop → kill(). This works but is subtle — a comment would help future readers understand the intent.Suggested clarification
impl Drop for SrtlaTestStack { fn drop(&mut self) { - // Ensure all processes are killed even if stop() wasn't called - self.srtla_send.take(); - self.srtla_rec.take(); - self.srt_server.take(); + // Ensure all processes are killed even if stop() wasn't called. + // Dropping NamespaceProcess triggers its Drop impl which calls kill(). + drop(self.srtla_send.take()); + drop(self.srtla_rec.take()); + drop(self.srt_server.take()); } }
24-31:whichmay not be available on all Linux systems.Minimal containers or some distros don't ship
which.command -vis POSIX-mandated and more portable, but requires invoking a shell. For a test-only utility this is low-risk.More portable alternative
pub fn check_binary(name: &str) -> Option<PathBuf> { - Command::new("which") - .arg(name) + Command::new("sh") + .args(["-c", &format!("command -v {name}")]) .output() .ok() .filter(|o| o.status.success()) .map(|o| PathBuf::from(String::from_utf8_lossy(&o.stdout).trim().to_string())) }
Summary by CodeRabbit
New Features
Improvements
Tests