This artifact accompanies the ICPE 2026 paper:
perfaid: A Declarative Framework for Composable Performance Evaluation of System Software
It provides all scripts, configurations, and instructions required to reproduce the experimental results presented in the paper and to inspect the underlying data and artifacts produced during execution.
During the ICPE review process, the paper used codenames for some software components in order to preserve reviewer anonymity. These components are already publicly available open-source projects and will be referenced exclusively by their public names in the camera-ready version.
For clarity, this artifact may use both names interchangeably.
| Submission codename | Public name | Repository |
|---|---|---|
perfaid |
benchkit | https://github.com/open-s4c/benchkit |
libmutrep |
tilt | https://github.com/open-s4c/tilt |
UserPlace |
schedkit | https://github.com/open-s4c/schedkit |
After camera-ready submission, all remaining codename references will be removed and only public names will remain.
perfaid / benchkit is a Python framework for systematic, composable performance experimentation. Rather than focusing on a single benchmark or workload, the artifact demonstrates how experiments can be assembled declaratively from reusable components:
- benchmarks (LevelDB, RocksDB, KyotoCabinet, SPEC CPU 2017, microbenchmarks)
- locking mechanisms (Tilt, wrapping spinlock implementations like CAS, MCS, CNA, HMCS, etc.)
- scheduling policies (schedkit, implementing CLOSE, FAR, SAS, SAM, etc.)
- profiling tools (
perf stat,perf record, flame graphs) - execution environments (host, Docker)
The artifact contains three main case studies from the paper:
-
Hybrid-core variability analysis (Section 3) Understanding performance variability on asymmetric processors (P/E cores) and the impact of affinity and placement control.
-
Locks and schedulers study (Section 4) Composing locking mechanisms, scheduling policies, and profiling tools on a large NUMA system to study scalability and interaction effects.
-
Framework overhead analysis (Section 5) Quantifying the runtime overhead introduced by the perfaid/benchkit framework, comparing native shell-based execution with framework-managed execution, both on the host and inside containers.
artifact/
├── README.md # This file
├── pyproject.toml # Artifact Python package
├── requirements.txt # Python dependencies
│
├── examples/ # Runnable code examples (paper listings)
│ ├── fig01_leveldb.py # Figure 1: Basic LevelDB campaign
│ ├── fig02_spec.py # Figure 2: SPEC baseline variability
│ ├── fig03_heater.py # Figure 3: Sequential heater sweep
│ ├── fig04_leveldb_placement.py # Figure 4: TasksetWrap (LevelDB)
│ ├── fig04_spec_placement.py # Figure 4: TasksetWrap (SPEC CPU)
│ ├── fig06_leveldb_locks.py # Figure 6: Lock campaign with Tilt
│ ├── fig08_sched_hooks.py # Figure 8: Scheduler hook demo
│ ├── fig09_leveldb_schedulers.py # Figure 9: Scheduler campaign with schedkit
│ ├── fig11_leveldb_perfstat.py # Figure 11: perf-stat integration
│ └── fig13_leveldb_flamegraph.py # Figure 13: Flame graph generation
│
├── experiments/ # Full experiments producing paper figures
│ ├── fig05_heater.py # Figure 5 (right): Per-CPU heater
│ ├── fig05_placement_spec.py # Figure 5 (left): Placement (SPEC)
│ ├── fig05_placement_leveldb.py # Figure 5 (left): Placement (LevelDB, alternative)
│ ├── fig07_locks.py # Figure 7: Lock sweep (5 panels)
│ ├── fig10_schedulers.py # Figure 10: Scheduler sweep (5 panels)
│ ├── fig12_leveldb_perfstat.py # Figure 12: perf-stat + schedulers
│ ├── fig14_leveldb_flamegraph.py # Figure 14: Differential flame graphs
│ └── fig15_overhead/ # Figure 15: Framework overhead
│ ├── fig15_leveldb_overhead.py # benchkit campaigns (host + Docker)
│ ├── shell_host.sh # shell baseline (host)
│ ├── shell_docker.sh # shell baseline (Docker)
│ └── plot_overhead.py # final comparison figure
│
├── lib/ # Shared helpers (installed via pip -e .)
│ ├── __init__.py
│ ├── locks.py # Lock configuration and Tilt integration
│ ├── schedulers.py # Scheduler hooks (schedkit integration)
│ ├── panels.py # Panel dataclass for multi-panel figures
│ ├── platforms.py # Platform detection
│ ├── flame.py # Flame graph utilities
│ └── lockgen/ # Lock code generation for Tilt
│
├── locks/ # Custom lock implementations (CNA)
├── generatedlocks/ # Generated locks (HMCS)
│
└── deps/ # External dependencies (git submodules)
├── benchkit/ # Core framework
├── schedkit/ # User-space scheduler (UserPlace)
├── tilt/ # Lock interposition library (libmutrep)
├── libvsync/ # Spinlock implementations
└── pythainer/ # Docker image builders
Rule of thumb for reviewers:
examples/are short, fast, illustrative — they correspond to the code listings in the paper.experiments/are full paper results — they produce the figures reported in the paper.
This artifact assumes a Linux-based system. All experiments were developed and validated on recent Linux distributions, including Ubuntu (20.04 LTS, 22.04 LTS, 24.04 LTS) and Manjaro (26.0.1).
No distribution-specific kernel patches or vendor tools are required beyond
standard Linux facilities (e.g., perf, taskset, cpupower), and the artifact
is expected to run on other modern Linux distributions with equivalent tooling.
The following capabilities are required:
- availability of hardware performance counters (
perf) - support for CPU affinity control (
taskset,sched_setaffinity) - support for NUMA placement (for Section 4 experiments)
Other Unix-like systems (e.g., macOS) may execute some scripts but will not produce correct or meaningful results for experiments involving CPU affinity, NUMA placement, or Linux scheduling behavior.
The experiments were run on two representative platforms:
- CPU: AMD Ryzen AI 9 HX 370 (8 P-cores + 16 E-cores, 24 total)
- RAM: 32 GiB
- OS: Manjaro 26.0.1 with Linux 6.18+
- Use case: Hybrid-core variability study
- CPU: 2x Kunpeng 920-4826 (96 cores across 4 NUMA nodes)
- RAM: 539 GiB
- Architecture: aarch64
- OS: Ubuntu 20.04.6 LTS, kernel 5.4.0-200-generic
- Use case: Lock/scheduler studies on many-core ARM
The overhead study (Section 5, Figure 15) was conducted on a third platform (Intel Core i7-13800H, Ubuntu 22.04.5) but can be replicated on any Linux machine with Docker installed. The specific hardware is not important for this experiment, as it measures the relative overhead of the framework compared to hand-written shell scripts on the same machine.
Results may differ on other hardware. The experiments are designed to demonstrate perfaid's capabilities; absolute performance numbers are platform-specific.
The artifact does not require identical hardware to run. Absolute performance numbers will differ, but qualitative trends and experiment structure remain reproducible. In particular, the experiments' main conclusions (e.g., that NUMA-aware locks outperform flat locks under contention, or that thread placement significantly affects variability on hybrid-core CPUs) should hold on any machine with similar architectural characteristics (multi-socket NUMA or heterogeneous cores, respectively).
# Ubuntu/Debian
sudo apt update
sudo apt install -y \
build-essential \
cmake \
git \
python3 \
python3-pip \
python3-venv \
linux-tools-generic \
linux-tools-common \
linux-tools-$(uname -r) \
numactl \
stress-ng \
fuseiso \
libbz2-dev \
libgflags-dev \
liblz4-dev \
libsnappy-dev \
libzstd-dev \
zlib1g-dev \
fzf \
curl \
perlNote: On a fresh Ubuntu installation, a reboot may be required after installing
linux-tools-*forperfto be recognized in yourPATH. We recommend rebooting after installing all packages above before proceeding.
Some experiments rely on Linux perf for profiling. By default, modern Linux
kernels restrict access to performance counters to privileged users. To allow
running perf without sudo (as expected by the artifact scripts), you must
lower the perf_event_paranoid level:
sudo sysctl -w kernel.perf_event_paranoid=-1Note: This setting is temporary and must be reapplied after each reboot. It affects only the current session and is sufficient for running the artifact without elevated privileges.
| Tool | Purpose | Installation |
|---|---|---|
| perf | Performance profiling | Included in linux-tools-* |
| Docker | Container experiments (Figure 15 only) | See below |
We recommend installing Docker using the official convenience script rather than Ubuntu packages:
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
# Log out and back in for group membership to take effect| Component | Repository | Purpose |
|---|---|---|
| benchkit | open-s4c/benchkit | Core framework |
| tilt | open-s4c/tilt | Lock interposition (libmutrep) |
| schedkit | open-s4c/schedkit | User-space scheduler (UserPlace) |
| libvsync | open-s4c/libvsync | Spinlock implementations (CAS, Ticket, MCS, TTAS, Hemlock) |
| pythainer | apaolillo/pythainer | Composable Docker builders |
libvsync is a C library of verified synchronization primitives, including spinlock algorithms such as CAS, Ticket, MCS, TTAS, and Hemlock. In this artifact, libvsync provides the lock implementations that are injected into benchmarks via tilt (LD_PRELOAD) for the locking experiments (Figures 7, 14). libvsync is included in the artifact as a git submodule and will be installed with the instructions below.
Pythainer is a Python library for programmable, composable Docker image building. It is used by perfaid to manage container-based experiments (Figure 15). Pythainer is included in the artifact as a git submodule and will be installed with the instructions below.
git clone --recursive https://github.com/softwarelanguageslab/icpe26-perfaid-ae.git
cd icpe26-perfaid-ae/# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# Install the artifact library in editable mode
pip install -e .
# Install the framework components in editable mode
pip3 install --editable deps/benchkit
pip3 install --editable deps/schedkit
pip3 install --editable deps/pythainerNote: Benchmark sources (LevelDB, RocksDB, etc.) are fetched automatically by benchkit during the first run of each campaign via the
fetch()method. No manual build step is required. The benchmark sources are stored in~/.benchkit/benches/.
Each experiment is expressed as a campaign. When a campaign is executed, benchkit automatically creates:
- one aggregated CSV file that contains both the tabular data (columns for
each variable and metric) and run metadata (platform, kernel version,
commit hash, timestamps, etc.) embedded as
#-prefixed comment lines at the top (and bottom) of the file. - one directory containing all per-run artifacts
Both are stored under:
~/.benchkit/results/
They have the same name:
benchmark_<hostname>_<campaign_name>_<timestamp>.csv
benchmark_<hostname>_<campaign_name>_<timestamp>/
Example:
benchmark_proton_fig04_taskset_placement_20260127_192431_194508.csv
benchmark_proton_fig04_taskset_placement_20260127_192431_194508/
The directory contains:
- exact commands executed (
commands.sh) - generated figures (PNG/PDF)
- hierarchical subdirectories reflecting the parameter space
- in each of the "leaves" of the hierarchy:
- per-run logs,
- JSON results,
- profiler outputs
For example, a flame-graph campaign (Figures 13 and 14) produces the following structure:
benchmark_vm_fig13_leveldb_flamegraph_20260128_205003_953967
├── benchkit-20260128-205034-01.pdf
├── benchkit-20260128-205034-01.png
├── commands.sh
├── diff_cas_vs_mcs.svg
├── diff_mcs_vs_cas.svg
└── nb_threads-32
└── bench_name-readrandom
└── scheduler-CLOSE
├── lock-caslock
│ └── run-1
│ ├── experiment_results.json
│ ├── flamegraph.svg
│ ├── perf.data
│ ├── perf.folded
│ ├── perf.report
│ ├── schedkit.err
│ └── schedkit.out
└── lock-mcslock
└── run-1
├── experiment_results.json
├── flamegraph.svg
├── perf.data
├── perf.folded
├── perf.report
├── schedkit.err
└── schedkit.out
At the campaign directory level, benchkit stores:
- the generated paper figures (PDF/PNG),
- differential flame graphs (
diff_*.svg), - and the full list of executed commands (
commands.sh).
At the per-run level, benchkit stores:
- raw profiling data (
perf.data,perf.report,perf.folded), - derived (individual) flame graphs (
flamegraph.svg), - scheduler logs (
schedkit.out,schedkit.err), - and structured metrics (
experiment_results.json).
This structure makes it possible to inspect both the final figures used in the paper and the raw profiler traces they were derived from, supporting independent validation and post-processing. Each "recorded configuration" has its own leaf directory in the hierarchy, allowing for easy comparison across experiments.
We strongly encourage reviewers to explore this directory — it is designed to make experiments transparent and inspectable.
To find your latest results after running a campaign:
ls -larth ~/.benchkit/results/The most recent campaigns are listed last.
About /tmp/benchkit.sh (command trace and reproducibility aid)
During execution, benchkit records every external command it executes
(e.g., benchmark runs, git clone, build commands, scheduler control scripts)
into a temporary shell script located at:
/tmp/benchkit.sh
This file is append-only and serves two purposes:
- Live inspection: users can monitor progress in real time using
tail -f /tmp/benchkit.shto see exactly which commands are being executed. - Post-mortem debugging and reproducibility: in case of a crash or interruption, the last executed commands can be inspected and re-run manually by copy-pasting them from this file.
This mechanism is intentionally simple and transparent: it provides reviewers with a concrete, low-level trace of what the framework does under the hood, without requiring any internal instrumentation or logging format.
During execution, benchkit provides live progress information in the console output, including an estimate of the remaining execution time for long-running campaigns. This estimate is updated as runs complete and is especially useful for experiments that sweep large parameter spaces (e.g., Figures 7 and 10).
Almost every script exposes two parameters that control execution time:
nb_runs: number of repetitions per configuration (affects statistical confidence)duration_s: duration in seconds of each individual benchmark run (affects measurement stability)
Reviewers who want to quickly verify that a script runs correctly can reduce
these values (e.g., nb_runs=1, duration_s=3). This is especially useful for
the long-running experiments (Figures 7 and 10). The parameters are defined as
constants near the top of each script and are straightforward to edit.
Trade-off: Reducing
nb_runsincreases variance in the plotted results; reducingduration_smay cause short-lived benchmarks to be dominated by startup overhead rather than steady-state behavior. The default values used in the scripts are those that produced the paper's figures. For a quick smoke test,nb_runs=1andduration_s=5are reasonable choices that still produce meaningful (if noisier) plots.
These scripts are runnable versions of the code listings in the paper. They are designed to run quickly on any Linux machine and illustrate specific perfaid features. All examples are self-contained and require only the basic system packages listed above (no NUMA server, no Docker).
Each example script contains a detailed docstring with paper references, expected timing, hardware context, prerequisites, and output description.
| Script | Paper Figure | What it demonstrates | Est. time |
|---|---|---|---|
fig01_leveldb.py |
Fig. 1 | Basic campaign (benchmark + parameter space + plot) | ~5 min |
fig02_spec.py |
Fig. 2 | SPEC baseline variability under default scheduling (requires license) | ~5-30 min |
fig03_heater.py |
Fig. 3 | Per-CPU sequential heater sweep | ~3 min (24 cores) |
fig04_leveldb_placement.py |
Fig. 4 | TasksetWrap for CPU placement (LevelDB) | ~20 min |
fig04_spec_placement.py |
Fig. 4 | TasksetWrap for CPU placement (SPEC CPU 2017, requires license) | ~5-120 min |
fig06_leveldb_locks.py |
Fig. 6 | Lock interposition via Tilt (LD_PRELOAD) | ~15 min |
fig08_sched_hooks.py |
Fig. 8 | Scheduler hook mechanism (illustrative, no benchmark run) | instant |
fig09_leveldb_schedulers.py |
Fig. 9 | Scheduler sweep with pre/post-run hooks | ~15 min |
fig11_leveldb_perfstat.py |
Fig. 11 | perf stat + scheduling policies | ~10 min |
fig13_leveldb_flamegraph.py |
Fig. 13 | Flame graph and differential flame graph generation | ~2 min |
cd examples/
# Figure 1: Basic LevelDB campaign
python fig01_leveldb.py
# Figure 2: SPEC baseline variability (requires license, only on x86)
python fig02_spec.py /path/to/cpu2017-1.1.9.iso
# Figure 3: Sequential heater (per-CPU characterization)
python fig03_heater.py
# Figure 4: TasksetWrap for CPU placement (LevelDB variant)
python fig04_leveldb_placement.py
# Figure 4: TasksetWrap for CPU placement (SPEC CPU 2017, requires license, only on x86)
python fig04_spec_placement.py /path/to/cpu2017-1.1.9.iso
# Figure 6: Lock campaign with Tilt
python fig06_leveldb_locks.py
# Figure 8: Scheduler hook demo (no benchmark run)
python fig08_sched_hooks.py
# Figure 9: Scheduler campaign
python fig09_leveldb_schedulers.py
# Figure 11: perf-stat integration
python fig11_leveldb_perfstat.py
# Figure 13: Flame graph generation
python fig13_leveldb_flamegraph.pyEach script in experiments/ reproduces one or more figures from the paper.
Below is a detailed tutorial for each figure.
Paper reference: Section 3 (Drilldown Case Study on Hybrid-Core Variability)
Hardware used in the paper: Platform A — hybrid-core laptop (AMD Ryzen AI 9 HX 370, 8 P-cores + 16 E-cores, 24 total, 32 GiB RAM).
What it produces: A two-panel figure showing (left) runtime/throughput variability under different CPU placements and (right) per-CPU throughput from the sequential heater.
Scripts involved (run in order):
| Step | Script | What it does | Est. time |
|---|---|---|---|
| 1 | fig05_heater.py |
Per-CPU heater sweep to identify P/E cores (Fig. 5 right) | ~3 min |
| 2 | fig05_placement_spec.py |
Placement experiment with SPEC CPU 2017 (Fig. 5 left, requires license) | ~5-120 min |
| 2 (alt.) | fig05_placement_leveldb.py |
Open-source alternative using LevelDB (same methodology, no license needed) | ~5 min |
Procedure:
cd experiments/
# Step 1: Characterize your CPUs (identify P vs E cores)
python fig05_heater.py
# -> Inspect the bar plot to determine which cores are fast (P) vs slow (E)
# Step 2: SPEC placement experiment (as used in the paper, requires license)
# Edit P_CORES and E_CORES in fig05_placement_spec.py based on Step 1
python fig05_placement_spec.py /path/to/cpu2017-1.1.9.iso
# Step 2 (alternative): Open-source placement experiment (LevelDB, no license needed)
# Edit P_CORES and E_CORES in fig05_placement_leveldb.py based on Step 1
python fig05_placement_leveldb.pyNotes:
- The paper uses SPEC CPU 2017 for Figure 5 (left). If you have a SPEC
license, use
fig05_placement_spec.pyto reproduce the exact experiment. - If you do not have a SPEC license,
fig05_placement_leveldb.pydemonstrates the same perfaid features and methodology using LevelDB as a drop-in replacement. - On homogeneous machines (no P/E asymmetry), all three placement conditions will produce similar results — this is expected.
Paper reference: Section 4.1 (Studying Locking Impact with perfaid)
Hardware used in the paper: Platform B — NUMA server (2x Kunpeng 920-4826, 96 cores, 4 NUMA nodes, 539 GiB RAM, aarch64).
What it produces: A 5-panel line plot showing throughput vs. thread count for 8 lock implementations across KyotoCabinet, LevelDB/readrandom, LevelDB/seekrandom, RocksDB/readrandom, and RocksDB/seekrandom.
Expected execution time: ~96 minutes on the paper's 96-core server
(measured: real 95m40s). Thread counts are automatically filtered to
available CPUs, so it runs faster on smaller machines.
Procedure:
cd experiments/
python fig07_locks.pyNotes:
- The Tilt shared library (lock interposition) is built automatically.
- On non-NUMA machines, the NUMA-aware locks (CNA, HMCS) may not show the performance advantages reported in the paper.
- RocksDB requires additional system packages:
libgflags-dev liblz4-dev libzstd-dev zlib1g-dev.
Paper reference: Section 4.2 (Studying Thread-Placement Impact with perfaid)
Hardware used in the paper: Platform B — NUMA server (2x Kunpeng 920-4826, 96 cores, 4 NUMA nodes, 539 GiB RAM, aarch64).
What it produces: A 5-panel line plot showing throughput vs. thread count for 6 scheduling policies across the same benchmarks as Figure 7.
Expected execution time: ~96 minutes on the paper's 96-core server
(measured: real 95m26s).
Procedure:
cd experiments/
python fig10_schedulers.pyNotes:
- The schedkit daemon (UserPlace) is built and managed automatically via pre/post-run hooks.
- On non-NUMA machines, NUMA-aware policies (CLOSE, FAR, etc.) may not produce meaningful differences.
Paper reference: Section 4.3 (Using perf for Profiling and Run-Time Statistics)
Hardware used in the paper: Platform B — NUMA server (2x Kunpeng 920-4826, 96 cores, aarch64).
What it produces: A 5-panel bar chart for LevelDB readrandom at 24 threads: throughput, context-switches, cpu-migrations, page-faults, and cache-misses, under 6 scheduling policies.
Expected execution time: ~10 minutes (measured: real 9m58s).
Prerequisites: sudo sysctl -w kernel.perf_event_paranoid=-1
Procedure:
cd experiments/
python fig12_leveldb_perfstat.pyPaper reference: Section 4.4 (Visualizing Performance with Flame Graphs in perfaid)
Hardware used in the paper: Platform B — NUMA server (2x Kunpeng 920-4826, 96 cores, aarch64).
What it produces: Individual flame graphs for CAS and MCS locks, plus two
differential flame graphs showing where each lock spends more or less time
than the other. The paper highlights that CAS spends significantly more time
in pthread_mutex_lock compared to MCS.
Expected execution time: ~1 minute (measured: real 0m56s).
Prerequisites: sudo sysctl -w kernel.perf_event_paranoid=-1
Procedure:
cd experiments/
python fig14_leveldb_flamegraph.pyOutput: Look for flamegraph.svg files in per-run directories and
diff_*.svg files at the campaign root (see "How results are produced
and stored" above).
Paper reference: Section 5 (Overhead of perfaid)
Hardware used in the paper: Intel Core i7-13800H laptop, Ubuntu 22.04.5, Linux 6.8.0.
What it produces: A 3-panel strip plot (one per thread count: 2, 4, 8) comparing throughput of perfaid vs. hand-written shell scripts, on host and inside Docker. The paper shows <2.2% overhead on host and <0.7% inside Docker.
Expected execution time: ~20 minutes total (broken down below).
Prerequisites: Docker installed and current user in docker group.
Procedure (4 steps, must be run in order):
cd experiments/fig15_overhead/
# Step 1: Run benchkit campaigns (host + Docker)
# This also builds the Docker image and clones LevelDB.
# Est. time: ~12 minutes (image build + 2 campaigns)
python fig15_leveldb_overhead.py
# Step 2: Run the shell baseline on the host
# Est. time: ~5 minutes
./shell_host.sh
# Step 3: Run the shell baseline inside Docker
# Est. time: ~5 minutes
./shell_docker.sh
# Step 4: Generate the final comparison figure
python plot_overhead.py
# -> Output: ~/.benchkit/results/fig15_overhead.{pdf,png}
# -> Console: overhead summary tableNotes:
- Step 1 must run before Steps 2-3 because it builds the Docker image and
clones LevelDB into
~/.benchkit/benches/. - The shell scripts reuse the LevelDB source cloned by benchkit and the Docker image built by pythainer.
| Script | Figure | Section | Platform | Est. time | Needs |
|---|---|---|---|---|---|
fig05_heater.py |
5 (right) | 3.2 | A (laptop) | ~3 min | — |
fig05_placement_spec.py |
5 (left) | 3.1-3.3 | A (laptop) | 5-120 min | SPEC license |
fig05_placement_leveldb.py |
5 (left) | 3.3 | A (laptop) | ~5 min | — (alternative) |
fig07_locks.py |
7 | 4.1 | B (server) | ~96 min | — |
fig10_schedulers.py |
10 | 4.2 | B (server) | ~96 min | — |
fig12_leveldb_perfstat.py |
12 | 4.3 | B (server) | ~10 min | perf access |
fig14_leveldb_flamegraph.py |
14 | 4.4 | B (server) | ~1 min | perf access |
fig15_overhead/ |
15 | 5 | A (laptop) | ~20 min | Docker |
If an experiment crashes, inspecting the last lines of /tmp/benchkit.sh
often reveals the exact command that failed (useful to try again and
troubleshoot the error).
Benchkit automatically fetches and builds benchmark sources (e.g., LevelDB,
RocksDB, KyotoCabinet, SPEC CPU, microbenchmarks) into ~/.benchkit/benches/.
This directory acts as a local cache to avoid re-downloading and rebuilding
benchmarks across runs. In rare cases (e.g., interruption during a fetch or build),
this cache may become inconsistent. For example, a directory may exist even
though the benchmark source was not fully cloned or built.
If a benchmark fails unexpectedly during the fetch or build phase, a simple
workaround is to remove the corresponding directory (or the entire
~/.benchkit/benches/ directory) and re-run the experiment. Benchkit will
then re-fetch and rebuild all required benchmarks from scratch.
| Symptom | Cause | Fix |
|---|---|---|
perf permission error |
perf_event_paranoid too high |
sudo sysctl -w kernel.perf_event_paranoid=-1 |
| Docker permission denied | User not in docker group |
sudo usermod -aG docker $USER then re-login |
cmake build failure |
Missing dev libraries | Install all packages from the "System Packages" section |
ModuleNotFoundError |
Virtual environment not activated | source .venv/bin/activate |
| Inconsistent benchmark cache | Interrupted fetch/build | rm -rf ~/.benchkit/benches/<benchmark> and re-run |
| Flame graph scripts missing or broken | Interrupted clone of FlameGraph tools | rm -rf ~/.benchkit/tools/ and re-run |
This artifact is submitted for:
- Artifacts Available: The artifact is hosted on a public GitHub repository with all source code, scripts, and documentation. In addition, all frameworks and tools introduced in the paper (benchkit, tilt, and schedkit) are publicly available as open-source projects under the MIT license.
- Artifacts Evaluated -- Functional: All scripts are documented, consistent with the paper, complete (open-source alternatives provided where proprietary software is needed), and exercisable on any Linux machine.
- Artifacts Evaluated -- Reusable: Each script is thoroughly documented with paper references, hardware context, timing expectations, prerequisites, and step-by-step instructions. The modular structure (benchmarks, locks, schedulers, profilers as independent components) facilitates repurposing for new studies.