feat: add replay support to runner group and fix replay duration overrun#235
Open
JasonXuDeveloper wants to merge 3 commits intoAzure:unstable-replayfrom
Open
feat: add replay support to runner group and fix replay duration overrun#235JasonXuDeveloper wants to merge 3 commits intoAzure:unstable-replayfrom
JasonXuDeveloper wants to merge 3 commits intoAzure:unstable-replayfrom
Conversation
f644ffd to
f5a367b
Compare
There was a problem hiding this comment.
Pull request overview
Adds “replay mode” execution to kperf, spanning runner-group deployment changes (indexed Jobs + replay entrypoint), a new replay engine/package (profile loading, partitioning, scheduling, runner), and CLI wiring for both local replay runs and distributed runner pods.
Changes:
- Add replay profile types + loader, scheduler, runner, partitioning, and request builder under
replay/. - Extend runnergroup deployment to support replay mode (skip configmap upload, mount PVC optionally, run
/run_replay.shin indexed Jobs). - Add CLI commands for replay (
kperf replay runlocal mode,kperf runner replayfor runner pods) and refactor latency percentile reporting.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| testdata/sample_replay_runnergroup.yaml | Example RunnerGroupSpec for replay mode (URL/PVC profile sources). |
| testdata/sample_replay.yaml | Sample replay profile YAML with realistic request sequence. |
| scripts/run_replay.sh | New runner-pod entrypoint for replay mode + result upload loop. |
| runner/group/handler.go | Replay-aware job building (script, env, PVC mount) + skip configmap upload in replay mode. |
| replay/schedule_test.go | Tests for result aggregation and config warning validation behavior. |
| replay/schedule.go | Local multi-runner scheduler + single-runner entry for distributed mode + aggregation/warnings. |
| replay/runner_test.go | Unit tests/benchmarks for runner internals (bucket sizing, indexing). |
| replay/runner.go | Replay runner implementation (worker pool, WATCH handling, metrics). |
| replay/partition_test.go | Tests for deterministic partitioning and per-object ordering. |
| replay/partition.go | Partitioning logic (FNV-1a by object key) + distribution analysis helpers. |
| replay/loader_test.go | Tests for loading profiles from file/gzip and validation errors. |
| replay/loader.go | Profile loader supporting local paths and HTTP(S) + gzip auto-detection. |
| replay/builder_test.go | Tests for request building, masking, method mapping, query handling. |
| replay/builder.go | Request builder/executor using rest.Interface + masking for metric aggregation. |
| metrics/utils.go | New helper to build percentile latency reports (aggregate + per-URL) with optional raw data. |
| cmd/kperf/commands/runnergroup/run.go | Avoid clobbering nodeAffinity unless CLI flags are provided. |
| cmd/kperf/commands/runner/runner.go | Add runner replay subcommand + reuse percentile-report helper + config validation. |
| cmd/kperf/commands/root.go | Register top-level replay command. |
| cmd/kperf/commands/replay/run_test.go | Tests for local replay report building (with/without raw data). |
| cmd/kperf/commands/replay/run.go | Implement kperf replay run local-mode command and JSON reporting. |
| cmd/kperf/commands/replay/root.go | Add replay CLI root with run subcommand. |
| api/types/runner_group.go | Add replay fields to RunnerGroupSpec + IsReplayMode(). |
| api/types/replay_test.go | Tests for replay types validation and helpers. |
| api/types/replay.go | Define replay profile/request/spec types + validation + duration helper. |
| api/types/load_traffic.go | Fix typo in comment (“target”). |
| Dockerfile | Include /run_replay.sh and ensure scripts are executable. |
dd4ffe3 to
d03afca
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
scripts/run_replay.sh:33
- The log message
Uploaded itis not very actionable when debugging uploads (it doesn’t include which file/runner or where it was uploaded). Consider logging the runner identity and/or target URL (and possibly the HTTP status) to make successful uploads traceable in pod logs.
echo "Uploaded it"
exit 0
;;
scripts/run_replay.sh:40
- The message
Leaking pod? skipis ambiguous/colloquial and makes it hard to understand the actual failure mode (404 from the result upload endpoint). Consider rewording to explicitly state that the runner is not recognized by the server (404) and that the pod is exiting as a result.
404)
echo "Leaking pod? skip"
exit 1;
Distributed replay mode integration: - Replay-aware job building: skip configmap upload for replay mode, use indexed Jobs for runner assignment, custom replay entrypoint script - run_replay.sh: entrypoint script for replay runner pods that downloads the replay profile and invokes kperf runner replay - Dockerfile: chmod +x for scripts directory Signed-off-by: JasonXuDeveloper - 傑 <jason@xgamedev.net>
d03afca to
6c132ee
Compare
…ring herd Treat context cancellation as success for all verbs (not just WATCH) to prevent mutex contention at shutdown. Remove unnecessary atomic operations on per-worker metrics and simplify WATCH goroutines to fire-and-forget. Update worker recommendation formula to be QPS-based. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Runner Both Schedule and ScheduleSingleRunner were called with context.Background() and never enforced a hard deadline. The replay would run until every request completed naturally (up to 60s timeout each), causing 15-min profiles to run 20+ minutes. Now both functions create a context.WithTimeout based on profile.Duration() plus a 30s grace period. When the deadline fires, in-flight requests get context.Canceled (treated as success per the previous commit), and WATCH connections are torn down immediately. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
buildBatchJobObjectto support replay mode — skip configmap upload for replay, mount PVC for local profiles, setREPLAY_PROFILE_SOURCEenv var, use/run_replay.shentrypointkperf runner replayand uploads results (with proper quoting and--data-binary)run_replay.shandchmod +xscriptsprofile.Duration() + 30sin bothScheduleandScheduleSingleRunnerso replays can't exceed the profile timecontext.Canceled/DeadlineExceededas success for all verbs — when the deadline fires, in-flight requests are cancelled intentionally (not failures), and this avoids theObserveFailure()mutex thundering-herd at shutdownint32/atomic to plainint(each goroutine owns its instance)recommendedWorkersfromconns * 3to QPS-based calculationrestClisinstartWorkersrunnerIndexbounds inScheduleSingleRunnerTest plan
go build ./...passesgo vet ./...passesgo test ./replay/...passes🤖 Generated with Claude Code