
Ideal 2-Taken & 2-Fetch #736

Open

Yakkhini wants to merge 3 commits into xs-dev from 2-taken-ideal

Conversation

Collaborator

@Yakkhini Yakkhini commented Jan 26, 2026

Change-Id: I39d54a0621d139cc00a156b02a6d7d888d9b15f0

Summary by CodeRabbit

  • New Features

    • Optional two-fetch mode: perform up to two predictions per cycle when enabled.
    • New configuration: toggle two-fetch and set max fetch bytes per cycle.
  • Refactor

    • Fetch and prediction flow reworked to support in-cycle two-fetch extension and updated fetch-loop/termination semantics.
  • Chores

    • Example presets updated to enable two-fetch and reduce queue sizes.

@coderabbitai

coderabbitai bot commented Jan 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


DecoupledBPUWithBTB gains a two-fetch mode and a max-fetch-bytes config; tick() can now produce up to two predictions per cycle. The fetch path supports an in-cycle 2-fetch extension, keeps the next FSQ entry buffered, and changes the fetch-stop semantics.

Changes

Cohort / File(s) Summary
Prediction core
src/cpu/pred/btb/decoupled_bpred.cc
Adds batching in tick() to run up to 2 prediction iterations per cycle (controlled by new flags); moves per-iteration request/finalize/clear/dry-run/FSQ-enqueue logic into loop; introduces tempNumOverrideBubbles.
Predictor interface / flags
src/cpu/pred/btb/decoupled_bpred.hh, src/cpu/pred/BranchPredictor.py
Adds configuration members enable2Fetch and maxFetchBytesPerCycle, enableTwoTaken flag, FTQ navigation/accessor helpers (getTarget, ftqHasNext, ftqNext, is2FetchEnabled, getMaxFetchBytesPerCycle).
Fetch logic
src/cpu/o3/fetch.cc, src/cpu/o3/fetch.hh
Adds a 2-fetch extension path: conditionally performs do_2fetch, keeps the next FSQ entry buffered, changes the lookupAndUpdateNextPC return value to reflect stop-this-cycle semantics, forces an I-cache reissue on buffer-edge PCs, and propagates stopFetchThisCycle.
Configs
configs/example/kmhv3.py, configs/example/idealkmhv3.py
Enable cpu.branchPred.enable2Fetch = True for DecoupledBPUWithBTB and reduce FTQ/FSQ sizes from 256 → 64 in example configs.
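The preset edits in the config cohort can be sketched as follows. The attribute names (`enable2Fetch`, `ftq_size`, `fsq_size`) follow the names used elsewhere in this review; the SimpleNamespace is a stand-in for the real gem5 `cpu.branchPred` object, so this is an illustration, not the actual config code.

```python
# Illustrative sketch of the example-preset changes described above.
# SimpleNamespace stands in for the gem5 branch-predictor SimObject.
from types import SimpleNamespace

def apply_two_fetch_preset(branch_pred):
    """Enable 2-fetch and shrink the FTQ/FSQ from 256 to 64 entries."""
    branch_pred.enable2Fetch = True
    branch_pred.ftq_size = 64  # was 256
    branch_pred.fsq_size = 64  # was 256
    return branch_pred

bp = apply_two_fetch_preset(SimpleNamespace())
```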

Sequence Diagram(s)

sequenceDiagram
    participant Core as DecoupledBPUWithBTB
    participant Predictor as BTB/TAGE
    participant FTQ as FTQ
    participant FSQ as FSQ
    participant Fetch as FetchUnit

    Note over Core: Up to N = (enable2Fetch ? 2 : 1) iterations per tick
    loop per-prediction
        Core->>Predictor: requestPrediction()
        Predictor-->>Core: provisionalPrediction
        Core->>Predictor: generateFinalPrediction()
        Predictor-->>Core: finalPrediction (+overrideBubbles)
        Core->>FTQ: ftqNext / getTarget()
        Core->>FSQ: enqueue(finalPrediction)
        alt do 2-fetch
            Core->>Fetch: signal do_2fetch (keep next FSQ entry buffered)
            Fetch-->>Core: continue fetch without consuming FTQ entry
        else single fetch
            Core->>Fetch: consume FTQ target / mark used
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested reviewers

  • jensen-yan
  • tastynoob
  • CJ362ff

Poem

🐰 I hop in pairs where branches bend,

I stash the next and gently send,
Two little peeks in one swift beat,
Buffer tucked and pipeline neat,
— coderabbit 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately reflects the main changes, the implementation of 2-taken and 2-fetch capabilities in the branch predictor and fetch logic.




@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/cpu/pred/btb/decoupled_bpred.cc`:
- Line 140: The local variable tempNumOverrideBubbles is declared but never
used; remove the unused declaration of tempNumOverrideBubbles from the scope
where it's defined (the function containing the line "unsigned
tempNumOverrideBubbles = 0;") to clean up dead code and avoid compiler warnings,
ensuring no other references to tempNumOverrideBubbles remain in the function.
🧹 Nitpick comments (2)
src/cpu/pred/btb/decoupled_bpred.hh (1)

162-162: Consider making enableTwoTaken configurable via params.

This feature flag is hardcoded to true with no way to disable it through simulation parameters. Other similar options like fetchStreamQueueSize, predictWidth, and resolveBlockThreshold are initialized from the Params object in the constructor. For flexibility during experimentation and for consistency with the existing pattern, consider adding this to the DecoupledBPUWithBTBParams.
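A minimal model of the pattern the review suggests: read the flag from the Params object in the constructor, the way fetchStreamQueueSize and predictWidth are initialized. The class and field names below are stand-ins, not gem5's actual DecoupledBPUWithBTBParams.

```python
# Model of the suggested fix: initialize enableTwoTaken from a Params
# object instead of hardcoding it to true. Params is a stand-in class;
# in gem5 this would be a Param.Bool declared in BranchPredictor.py.
class Params:
    enable_two_taken = True

class DecoupledBPUWithBTBModel:
    def __init__(self, params):
        # Read from params rather than hardcoding, so experiments can
        # toggle the feature without editing the source.
        self.enableTwoTaken = params.enable_two_taken

bpu = DecoupledBPUWithBTBModel(Params())
```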

src/cpu/pred/btb/decoupled_bpred.cc (1)

142-184: Add documentation clarifying the multi-prediction loop behavior.

The loop logic for producing up to 2 predictions per tick is non-trivial. Consider adding a comment explaining:

  1. The intended behavior when both predictions succeed (no bubbles)
  2. What happens when the first prediction generates override bubbles (second iteration essentially becomes a no-op)
  3. The interaction between numOverrideBubbles being set inside the loop but decremented only once outside

This will help future maintainers understand the "ideal 2-taken" semantics.

📝 Suggested documentation
     int predsRemainsToBeMade = enableTwoTaken ? 2 : 1;
-    unsigned tempNumOverrideBubbles = 0;

+    // "Ideal 2-taken" mode: attempt up to 2 predictions per tick.
+    // - If the first prediction generates override bubbles, the second iteration
+    //   will be blocked by validateFSQEnqueue() until bubbles are consumed.
+    // - If no bubbles, both predictions can be enqueued in a single tick.
+    // - Bubble decrement happens once per tick after the loop.
     while (predsRemainsToBeMade > 0) {


⚠️ Potential issue | 🟡 Minor

Remove unused variable tempNumOverrideBubbles.

This variable is declared but never used anywhere in the function. It appears to be leftover code from development.

🧹 Proposed fix
     int predsRemainsToBeMade = enableTwoTaken ? 2 : 1;
-    unsigned tempNumOverrideBubbles = 0;
 
     while (predsRemainsToBeMade > 0) {

@github-actions

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.1706 📉 -0.0021 (-0.10%)

✅ Difftest smoke test passed!

@Yakkhini Yakkhini added the perf label Jan 27, 2026
@github-actions

🚀 Performance test triggered: spec06-0.8c

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 0d664b4
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.64 20.27 +1.86 🟢

@github-actions

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.2052 📈 +0.0325 (+1.50%)

✅ Difftest smoke test passed!

@Yakkhini Yakkhini added perf and removed perf labels Jan 29, 2026
@github-actions

🚀 Performance test triggered: spec06-0.8c

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: ebf7386
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.64 20.27 +1.86 🟢

@github-actions

🚀 Performance test triggered: spec06-0.8c

@github-actions

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.2052 📈 +0.0325 (+1.50%)

✅ Difftest smoke test passed!

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 9406572
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.65 20.27 +1.92 🟢

@Yakkhini Yakkhini added perf and removed perf labels Jan 30, 2026
@github-actions

🚀 Performance test triggered: spec06-0.8c

@github-actions

🚀 Performance test triggered: spec06-0.8c

@github-actions

🚀 Performance test triggered: spec06-0.8c

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/cpu/pred/btb/decoupled_bpred.cc`:
- Around line 135-179: When blockPredictionPending is true the loop clears it
and continues so a second prediction can be issued in the same tick; change that
behavior to exit the prediction loop immediately to honor the "prioritize
resolve update" backpressure. In the while loop that uses predsRemainsToBeMade
(and enableTwoTaken), update the branch that currently does DPRINTF(Override) /
dbpBtbStats.predictionBlockedForUpdate++ / blockPredictionPending = false so
that after logging and incrementing the stat you break out of the loop (or set
predsRemainsToBeMade = 0) instead of clearing blockPredictionPending; this
ensures requestNewPrediction(), requestNewPrediction() / bpuState transitions,
generateFinalPredAndCreateBubbles(), validateFSQEnqueue(), and
processNewPrediction() cannot run for subsequent iterations when a block is
pending.

Comment on lines +135 to +179

    int predsRemainsToBeMade = enableTwoTaken ? 2 : 1;
    unsigned tempNumOverrideBubbles = 0;

    while (predsRemainsToBeMade > 0) {
        // 1. Request new prediction if FSQ not full and we are idle
        if (bpuState == BpuState::IDLE && !targetQueueFull()) {
            if (blockPredictionPending) {
                DPRINTF(Override, "Prediction blocked to prioritize resolve update\n");
                dbpBtbStats.predictionBlockedForUpdate++;
                blockPredictionPending = false;
            } else {
                requestNewPrediction();
                bpuState = BpuState::PREDICTOR_DONE;
            }
        }

        // 2. Handle pending prediction if available
        if (bpuState == BpuState::PREDICTOR_DONE) {
            DPRINTF(Override, "Generating final prediction for PC %#lx\n", s0PC);
            numOverrideBubbles = generateFinalPredAndCreateBubbles();
            bpuState = BpuState::PREDICTION_OUTSTANDING;

            // Clear each predictor's output
            for (int i = 0; i < numStages; i++) {
                predsOfEachStage[i].btbEntries.clear();
            }
        }

        if (bpuState == BpuState::PREDICTION_OUTSTANDING && numOverrideBubbles > 0) {
            tage->dryRunCycle(s0PC);
        }

        // check if:
        // 1. FSQ has space
        // 2. there's no bubble
        // 3. PREDICTION_OUTSTANDING
        if (validateFSQEnqueue()) {
            // Create new FSQ entry with the current prediction
            processNewPrediction();

            DPRINTF(Override, "FSQ entry enqueued, prediction state reset\n");
            bpuState = BpuState::IDLE;
        }

        predsRemainsToBeMade--;

⚠️ Potential issue | 🟠 Major

Block prediction should skip all iterations when resolve backpressure is set.
In two‑taken mode, blockPredictionPending is cleared in the first iteration, so the second iteration can still issue a prediction in the same tick. That undermines the intended “prioritize resolve update” block.

Proposed fix (exit the loop when prediction is blocked)
         if (bpuState == BpuState::IDLE && !targetQueueFull()) {
             if (blockPredictionPending) {
                 DPRINTF(Override, "Prediction blocked to prioritize resolve update\n");
                 dbpBtbStats.predictionBlockedForUpdate++;
                 blockPredictionPending = false;
+                break; // block predictions for the rest of this tick
             } else {
                 requestNewPrediction();
                 bpuState = BpuState::PREDICTOR_DONE;
             }
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    int predsRemainsToBeMade = enableTwoTaken ? 2 : 1;
    unsigned tempNumOverrideBubbles = 0;

    while (predsRemainsToBeMade > 0) {
        // 1. Request new prediction if FSQ not full and we are idle
        if (bpuState == BpuState::IDLE && !targetQueueFull()) {
            if (blockPredictionPending) {
                DPRINTF(Override, "Prediction blocked to prioritize resolve update\n");
                dbpBtbStats.predictionBlockedForUpdate++;
                blockPredictionPending = false;
                break; // block predictions for the rest of this tick
            } else {
                requestNewPrediction();
                bpuState = BpuState::PREDICTOR_DONE;
            }
        }

        // 2. Handle pending prediction if available
        if (bpuState == BpuState::PREDICTOR_DONE) {
            DPRINTF(Override, "Generating final prediction for PC %#lx\n", s0PC);
            numOverrideBubbles = generateFinalPredAndCreateBubbles();
            bpuState = BpuState::PREDICTION_OUTSTANDING;

            // Clear each predictor's output
            for (int i = 0; i < numStages; i++) {
                predsOfEachStage[i].btbEntries.clear();
            }
        }

        if (bpuState == BpuState::PREDICTION_OUTSTANDING && numOverrideBubbles > 0) {
            tage->dryRunCycle(s0PC);
        }

        // check if:
        // 1. FSQ has space
        // 2. there's no bubble
        // 3. PREDICTION_OUTSTANDING
        if (validateFSQEnqueue()) {
            // Create new FSQ entry with the current prediction
            processNewPrediction();

            DPRINTF(Override, "FSQ entry enqueued, prediction state reset\n");
            bpuState = BpuState::IDLE;
        }

        predsRemainsToBeMade--;
🧰 Tools
🪛 Cppcheck (2.19.0)

[error] 138-138: Shifting 64-bit value by 64 bits is undefined behaviour

(shiftTooManyBits)


@github-actions

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.2397 📈 +0.0670 (+3.08%)

✅ Difftest smoke test passed!

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 0542507
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.56 20.37 +0.93 🟢

@Yakkhini Yakkhini changed the title cpu-o3: ideal 2-taken implementation Ideal 2-Taken & 2-Fetch Feb 4, 2026
@Yakkhini Yakkhini added perf and removed perf labels Feb 5, 2026
@github-actions

github-actions bot commented Feb 5, 2026

🚀 Performance test triggered: spec06-0.8c

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 0721720
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.94 20.37 +2.84 🟢

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
configs/example/idealkmhv3.py (1)

81-83: Consider pinning maxFetchBytesPerCycle explicitly in this ideal profile.

This profile currently depends on the global default (64). Setting it here avoids silent benchmark drift if the default changes later.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cpu/o3/fetch.cc`:
- Around line 821-836: The 2-fetch extension can be re-enabled multiple times in
a single cycle; add a per-cycle one-shot guard so do_2fetch is only allowed once
per cycle: introduce a boolean flag (e.g., twoFetchTakenThisCycle) that is
cleared at the start of the fetch cycle and checked before setting do_2fetch in
the block that currently tests predict_taken && dbpbtb->is2FetchEnabled() &&
dbpbtb->ftqHasNext(), then set the flag when you set do_2fetch = true; apply the
same guard update to the other symmetric 2-fetch site around the loop (the block
referenced at lines ~2004-2025) so the extension cannot chain beyond one extra
FTQ entry per cycle and still obey the per-cycle byte budget.

---

Nitpick comments:
In `@configs/example/idealkmhv3.py`:
- Around line 81-83: The branch predictor settings rely on the global default
for maxFetchBytesPerCycle and should pin it explicitly to prevent silent drift;
in the block where cpu.branchPred.ftq_size, cpu.branchPred.fsq_size, and
cpu.branchPred.enable2Fetch are set, add an explicit assignment
cpu.branchPred.maxFetchBytesPerCycle = 64 so the ideal profile fixes the fetch
width regardless of global defaults.
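The per-cycle one-shot guard that the fetch.cc comment asks for can be modeled in a few lines. Everything below (class, method, and flag names) is an illustrative sketch of the intended behavior, not the actual gem5 fetch code: the 2-fetch extension may fire at most once per cycle, even if the enabling condition holds repeatedly.

```python
# Toy model of the requested guard: do_2fetch is allowed at most once
# per fetch cycle. The flag is cleared at the top of each cycle and
# checked before the extension is taken.
class FetchModel:
    def __init__(self):
        self.two_fetch_taken_this_cycle = False

    def start_cycle(self):
        # Cleared once at the start of every fetch cycle.
        self.two_fetch_taken_this_cycle = False

    def try_two_fetch(self, predict_taken, ftq_has_next):
        # Mirrors the predict_taken && is2FetchEnabled() && ftqHasNext()
        # condition, plus the one-shot guard.
        if predict_taken and ftq_has_next and not self.two_fetch_taken_this_cycle:
            self.two_fetch_taken_this_cycle = True
            return True  # take the 2-fetch extension
        return False

f = FetchModel()
f.start_cycle()
results = [f.try_two_fetch(True, True) for _ in range(3)]
# Only the first attempt in the cycle succeeds.
```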

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0721720 and 51bfe68.

📒 Files selected for processing (7)
  • configs/example/idealkmhv3.py
  • configs/example/kmhv3.py
  • src/cpu/o3/fetch.cc
  • src/cpu/o3/fetch.hh
  • src/cpu/pred/BranchPredictor.py
  • src/cpu/pred/btb/decoupled_bpred.cc
  • src/cpu/pred/btb/decoupled_bpred.hh
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/cpu/o3/fetch.hh
  • configs/example/kmhv3.py
  • src/cpu/pred/btb/decoupled_bpred.hh

@github-actions

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.2133 📈 +0.0406 (+1.87%)

✅ Difftest smoke test passed!

@Yakkhini Yakkhini added perf and removed perf labels Mar 3, 2026
@github-actions

github-actions bot commented Mar 3, 2026

🚀 Performance test triggered: spec06-0.8c

@Yakkhini
Collaborator Author

Yakkhini commented Mar 4, 2026

Effective benchmark count is 12 (SPECint rows with numeric data); SPECfp rows are empty in both files.

  • Overall performance is up: IPC geomean speedup is ~1.0288x (about +2.9% avg IPC, -2.8% avg CPI), and all 12 benchmarks improve IPC.
  • Biggest IPC gains: sjeng +8.3%, perlbench +6.6%, gcc +6.4%, xalancbmk +5.3%, gobmk +3.6%.
  • Smallest gains: hmmer +0.05%, libquantum +0.08%, omnetpp +0.13%.

Counter-level behavior (most relevant to your FSQ-size change):

  • Frontend utilization is much better:
    • frontendBound avg -47.4%
    • fetchBubbles avg -48.1%
  • FTQ starvation dropped:
    • ftqNotValid avg -38.3% (11/12 benches improved; mcf worsened).
  • But FSQ pressure sharply increased (expected with smaller FSQ):
    • fsqFullCannotEnq avg +726.7% (all 12 worsened)
    • resolveQueueFull avg +311%
  • Prediction/squash pressure slightly worsened:
    • controlSquashFromCommit avg +2.4%
    • controlSquashFromDecode avg +4.8%
    • mispredict_rate avg +0.77%
  • I-cache pressure is mixed/slightly better:
    • icacheStallCycles avg -3.3%
    • icacheWaitRetryStallCycles avg -7.5%
    • tlbCycles avg +12.5% (some outliers; median rise is small).

Interpretation:

  • The change appears to trade more FSQ/resolve-path pressure for better frontend delivery efficiency, and net performance is positive.
  • The data says the primary win comes from reduced frontend bubbles/starvation, not from improved prediction quality.

Yakkhini and others added 3 commits March 5, 2026 10:50
Change-Id: I39d54a0621d139cc00a156b02a6d7d888d9b15f0
Co-authored-by: Xu Boran <xuboran@bosc.ac.cn>
Change-Id: Ic203a9694c093034744986309e796b9d66d6f826
Change-Id: I3f0f686000b610c3bf842e62c9b9e91e7188a028
@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 51bfe68
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.95 20.38 +2.79 🟢

@Yakkhini
Collaborator Author

Yakkhini commented Mar 5, 2026

Re-analyzed using the 2026-03-05 pair:

  • out/gem5/parallel-2026-03-05-enable-2-fetch-on-idealkmhv3/spec_all/perf-weighted.csv
  • out/gem5/parallel-2026-03-05-enable-2-fetch-on-idealkmhv3/ref/spec_all/perf-weighted-ref.csv

Now against the newer ref with the same 64-entry FTQ setting. See: #769

Results:

  • 12 valid SPECint benchmarks compared.
  • IPC geomean (new/ref) = 1.0279x (~+2.8%).
  • 11/12 benchmarks improved IPC; omnetpp is the only slight regression (-0.21%).

Key counter trends vs new ref:

  • Frontend improved strongly
    • frontendBound: -48.5% avg
    • fetchBubbles: -49.2% avg
    • ftqNotValid: -65.8% avg (all 12 improved)
  • Queue pressure increased
    • fsqFullCannotEnq: +305.6% avg (all 12 worse)
    • resolveQueueFull: +340.5% avg (large spread; many outliers)
  • Prediction/squash pressure slightly worse overall
    • mispredict_rate: +1.86% avg
    • controlSquashFromCommit: +3.50%
    • controlSquashFromDecode: +5.69%
    • overrideCount / overrideBubbleNum: about +27%
  • Memory frontend mixed
    • icacheStallCycles: -2.68% avg (slightly better)
    • icacheWaitRetryStallCycles: +16.3% avg (mixed/outlier-driven)
    • tlbCycles: roughly flat (+1.34% avg, median ~0)

Best IPC gains:

  • sjeng +8.53%
  • perlbench +6.52%
  • gcc +6.38%
  • xalancbmk +5.06%
  • gobmk +3.55%

Interpretation with your corrected baseline:

  • The gain is still clearly from frontend supply improvements (fewer bubbles, less FTQ-not-valid).
  • Compared with the old 256-entry ref analysis, the “FSQ pressure explosion” now looks less exaggerated (still real, but more apples-to-apples).
  • Net: The 2-taken/2-fetch path remains a positive tradeoff at 64-entry setting, with main watchpoints being resolveQueueFull and squash/mispredict side effects.

@Yakkhini
Collaborator Author

Yakkhini commented Mar 5, 2026

2026-03-05 Frontend/Topdown Analysis

1. Scope and Question

This report analyzes the newest 2026-03-05 weighted data, focusing on:

  1. Whether (near-)perfect 2Taken / 2Fetch has eliminated frontend bandwidth bound.
  2. Whether bottlenecks have shifted to branch misprediction and backend bound.
  3. Whether meaningful performance headroom still exists.

2. Data Source

Primary data file:

  • out/gem5/parallel-2026-03-05-enable-2-fetch-on-idealkmhv3/spec_all/perf-weighted.csv

Only rows with valid numeric cycles are used (12 SPECint workloads).
SPECfp rows in this CSV are empty and are excluded from quantitative summaries.

3. Method

3.1 Metrics inspected

Frontend:

  • frontendBound
  • frontendLatencyBound
  • frontendBandwidthBound
  • fetch_nisn_mean
  • decodeStallRate

Speculation/branch:

  • badSpecBound
  • branchMissPrediction
  • mispredict_rate
  • controlSquashFromCommit
  • controlSquashFromDecode
  • btbMiss

Backend:

  • backendBound
  • coreBound
  • memoryBound

Structural pressure:

  • fsqFullCannotEnq
  • resolveQueueFull
  • ftqNotValid

3.2 Simple headroom model (for intuition only)

To estimate residual frontend-bandwidth opportunity, a rough upper-bound is used:

  • IPC_if_no_frontend_bw ~= IPC / (1 - frontendBandwidthBound)

This is intentionally optimistic and assumes independent bottlenecks.
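Concretely, the model evaluates as below. The 0.0490 input is the suite-mean frontendBandwidthBound reported in the results; note that the report's ~5.3% average is computed per workload and then averaged, so feeding the suite mean into the formula gives a slightly different figure (~5.2%).

```python
# Rough upper bound: IPC if the FE-bandwidth slot fraction were removed
# entirely. Optimistic by construction (assumes independent bottlenecks).
def fe_bw_headroom(ipc, frontend_bandwidth_bound):
    return ipc / (1.0 - frontend_bandwidth_bound)

# Applied to a normalized IPC of 1.0 with the suite-mean bandwidth fraction.
gain_pct = (fe_bw_headroom(1.0, 0.0490) - 1.0) * 100  # ~5.2
```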

4. Key Results

4.1 Frontend bandwidth bound is reduced, but NOT zero

Across 12 valid workloads:

  • Mean frontendBound: 0.0663
  • Mean frontendBandwidthBound: 0.0490
  • Median frontendBandwidthBound: 0.0413
  • Mean frontendLatencyBound: 0.0174

Interpretation:

  • Frontend bound is no longer dominant at suite level.
  • But frontend bandwidth bound still contributes non-trivially (~5% average slot fraction).

Within frontend bound, bandwidth part remains major:

  • Mean frontendBandwidthBound / frontendBound: 0.68
  • Median frontendBandwidthBound / frontendBound: 0.67

So, for remaining FE loss, bandwidth is still the larger sub-component.

4.2 Bottleneck has shifted, but not fully to one place

Topdown dominant category count (12 workloads):

  • baseRetiring: 7
  • badSpecBound: 3
  • backendBound: 2
  • frontendBound: 0

This indicates:

  • FE is no longer the top-1 limiter.
  • Bottleneck is now distributed mainly among retiring quality, speculation quality, and backend constraints.
  • It is not "all shifted to branch misprediction + backend" in every workload, but that trend is clearly stronger.

4.3 Branch/speculation pressure is significant on some workloads

Suite-level central tendency:

  • Mean badSpecBound: 0.2432
  • Mean branchMissPrediction: 0.2511
  • Mean mispredict_rate: 0.0258 (workload spread is large)

High-speculation-pressure examples:

  • astar: badSpecBound ~= 0.586, branchMissPrediction ~= 0.608
  • gobmk: badSpecBound ~= 0.508, branchMissPrediction ~= 0.527
  • sjeng: badSpecBound ~= 0.361, branchMissPrediction ~= 0.368

4.4 Backend remains a major limiter for selected workloads

Examples:

  • omnetpp: backendBound ~= 0.554 (dominant), memoryBound ~= 0.617
  • mcf: backendBound ~= 0.448 (dominant)
  • gcc: backendBound ~= 0.349 (substantial)

Hence future gain cannot rely on FE-only tuning.

4.5 There is still frontend-bandwidth headroom

Using the rough model IPC/(1-fbw):

Estimated additional IPC room from removing only FE-bandwidth term:

  • Suite-average rough gain: ~5.3%
  • Larger potential workloads:
    • libquantum: ~+12.6%
    • astar: ~+9.8%
    • perlbench: ~+9.5%
    • sjeng: ~+8.6%
  • Small potential workloads:
    • hmmer: ~+0.2%
    • omnetpp: ~+0.5%

Conclusion:

  • FE bandwidth is not gone; tail opportunities are still meaningful in some benchmarks.

5. Additional Structural Signals

Normalized queue/squash counters (per 1k committed instructions) still show notable pressure in several workloads:

  • fsqFullCannotEnq / kInst examples:
    • mcf: ~939.5
    • omnetpp: ~752.9
    • gcc: ~341.7
    • libquantum: ~254.5
  • resolveQueueFull / kInst is relatively high in:
    • perlbench: ~11.1
    • xalancbmk: ~7.1
    • gcc: ~6.6

These counters suggest structural queue contention still exists, even when FE is no longer the top-level dominant bound.
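The per-1k-instruction normalization used above is plain counter scaling. The raw inputs in this sketch are placeholders chosen to reproduce the reported mcf ratio, not actual counter values from the run:

```python
# Normalize a raw event counter to events per 1k committed instructions.
def per_kinst(counter, committed_insts):
    return counter * 1000.0 / committed_insts

# Placeholder inputs that reproduce the reported mcf fsqFullCannotEnq ratio.
mcf_ratio = per_kinst(9_395_000, 10_000_000)  # 939.5 per kInst
```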

6. Direct Answer to the Main Question

Q: Has perfect 2Taken/2Fetch already removed frontend bandwidth bound entirely?
A: No. It is greatly reduced as a top-level bottleneck, but residual FE bandwidth bound is still present (~5% average, larger in some workloads).

Q: Has bottleneck fully shifted to branch misprediction and backend?
A: Partially yes, but not fully. Dominance has shifted away from FE; now bad speculation and backend are often more important, while many workloads are already retire-dominant.

Q: Is there still performance headroom?
A: Yes.

  • FE bandwidth-only residual upper bound is still ~5% on average (higher on specific workloads).
  • Additional room likely exists from reducing speculation loss and backend stalls (especially memory-bound workloads).

7. Recommended Next Optimization Focus

  1. Speculation quality first for astar/gobmk/sjeng (badSpec-heavy).
  2. Backend/memory path for omnetpp/mcf/gcc.
  3. Residual FE-bandwidth tuning for libquantum/astar/perlbench/sjeng where frontendBandwidthBound remains relatively high.
  4. Watch queue-side counters (fsqFullCannotEnq, resolveQueueFull) to avoid hidden structural saturation in future tuning.

8. Notes and Limitations

  • This report is intentionally no-ref (absolute reading only).
  • Headroom estimates are coarse and non-orthogonal; removing one bound may expose another.
  • Negative badSpecBound/branchMissPrediction in some rows indicates potential metric fitting/noise artifacts and should be interpreted carefully.

@github-actions

github-actions bot commented Mar 5, 2026

🚀 Coremark Smoke Test Results

  Branch         IPC     Change
  Base (xs-dev)  2.2203  -
  This PR        2.2481  📈 +0.0278 (+1.25%)

✅ Difftest smoke test passed!

@Yakkhini
Collaborator Author

Yakkhini commented Mar 5, 2026

2026-03-05 Co-Analysis (New vs Ref)

1) Goal

This report re-analyzes the latest 2026-03-05 results against the reference run to answer:

  1. Has (near-perfect) 2Taken/2Fetch already removed frontend bandwidth bound?
  2. Has bottleneck fully shifted to branch misprediction and backend?
  3. Is there still practical performance headroom?

Also included as supplemental findings:

  • TopDown dominant-category migration.
  • Structural queue-pressure signals.
  • Workload-priority suggestions.

2) Data Sources

  • New: out/gem5/parallel-2026-03-05-enable-2-fetch-on-idealkmhv3/spec_all/perf-weighted.csv
  • Ref: out/gem5/parallel-2026-03-05-enable-2-fetch-on-idealkmhv3/ref/spec_all/perf-weighted-ref.csv

Only rows with valid numeric cycles are used (12 SPECint workloads).

3) Executive Summary

  • IPC geomean (new/ref) is 1.0279x; the arithmetic-average IPC delta is about +2.83%.
  • Frontend bottleneck is strongly reduced:
    • frontendBound: -48.53% avg
    • frontendBandwidthBound: -42.82% avg
    • frontendLatencyBound: -63.20% avg
  • But frontend bandwidth is not zero in absolute terms:
    • New absolute mean frontendBandwidthBound = 0.0490
    • New median frontendBandwidthBound = 0.0413
  • Bottleneck does shift toward speculation/backend/retiring mix:
    • New dominant categories: baseRetiring 7, badSpecBound 3, backendBound 2, frontendBound 0.
  • There is still headroom:
    • Rough upper-bound from removing only FE bandwidth term in new data: average potential about +5.3% IPC.
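The geomean figure quoted above is the geometric mean of per-workload new/ref IPC ratios; a minimal sketch (the ratio list is illustrative, not the actual 12-workload data):

```python
import math

def ipc_geomean(ratios):
    """Geometric mean of per-workload new/ref IPC ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Illustrative ratios only; the real analysis uses all 12 SPECint rows.
sample = [1.0853, 1.0652, 1.0638, 1.0506, 1.0355, 1.0005]
print(round(ipc_geomean(sample), 4))
```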

4) Core Question Answers

Q1. Is frontend bandwidth bound already gone?

No. It is much smaller, but not eliminated.

Evidence:

  • Relative to ref, frontendBandwidthBound improves for all 12 workloads.
  • Absolute new mean remains 0.0490.
  • Workloads with noticeable residual FE bandwidth in new data:
    • libquantum: frontendBandwidthBound=0.1120
    • astar: 0.0896
    • perlbench: 0.0869
    • sjeng: 0.0793

Q2. Has bottleneck fully shifted to branch misprediction + backend?

Partially, but not fully.

TopDown dominant category counts:

  • Ref: baseRetiring 8, badSpecBound 2, backendBound 2, frontendBound 0
  • New: baseRetiring 7, badSpecBound 3, backendBound 2, frontendBound 0

Interpretation:

  • FE is clearly no longer dominant.
  • More pressure appears in badSpecBound/backendBound on some workloads.
  • But many workloads are still retire-dominant; not all moved to bad-spec/backend.

Q3. Is there still room for improvement?

Yes.

Using a rough bound for new data: IPC_if_no_FE_bw ~= IPC / (1 - frontendBandwidthBound):

  • Average estimated additional room: ~5.3% IPC.
  • Higher residual FE-bandwidth opportunities:
    • libquantum: ~+12.6%
    • astar: ~+9.8%
    • perlbench: ~+9.5%
    • sjeng: ~+8.6%

This is optimistic and non-orthogonal, but confirms FE bandwidth is not yet fully exhausted.

5) What Changed (New vs Ref)

5.1 Throughput and frontend supply

  • ipc: +2.83% avg (+1.64% median), 11 improved / 1 regressed (omnetpp, slightly).
  • fetch_nisn_mean: +10.15% avg.
  • ftqNotValid: -65.79% avg (all 12 improved).

These strongly indicate 2Taken/2Fetch improves frontend supply and feed continuity.

5.2 Trade-offs and side-effects

  • badSpecBound: +8.18% avg.
  • branchMissPrediction: +8.17% avg.
  • backendBound: +18.09% avg.
  • controlSquashFromDecode: +5.69% avg.
  • controlSquashFromCommit: +3.50% avg.
  • overrideCount: +27.23% avg; overrideBubbleNum: +27.11% avg.

Interpretation:

  • Frontend gain is real, but more aggressive/denser fetch stream also increases correction/override activity.
  • Once FE pressure drops, bad-spec and backend limits become more visible.

5.3 Queue-pressure signals (important)

  • fsqFullCannotEnq: +305.63% avg.
  • resolveQueueFull: +340.47% avg (with outlier-heavy distribution).

Normalized examples (new, per 1k committed insts):

  • fsqFullCannotEnq/kInst:
    • mcf ~939.5
    • omnetpp ~752.9
    • gcc ~341.7
    • libquantum ~254.5
  • resolveQueueFull/kInst:
    • perlbench ~11.1
    • xalancbmk ~7.1
    • gcc ~6.6

This suggests structural queue pressure remains a likely next limiter.

6) Workload-Level Patterns

High IPC gain and strong FE reduction:

  • sjeng (+8.53%), perlbench (+6.52%), gcc (+6.38%), xalancbmk (+5.06%), gobmk (+3.55%).

Low/near-flat gains:

  • hmmer (+0.05%), libquantum (+0.11%), mcf (+0.30%), h264ref (+0.43%).

Slight regression:

  • omnetpp (-0.21%), where backend/memory pressure is already very strong.

Correlation hint:

  • IPC gain vs FE-bound reduction shows moderate positive trend (corr ~= 0.48), i.e., FE relief explains a meaningful part of gains.
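The correlation check above can be reproduced with a plain Pearson coefficient; a pure-Python sketch (the paired samples in the example are synthetic, not the actual workload data):

```python
import math

# Pearson correlation coefficient between two equal-length samples,
# as used for the "IPC gain vs FE-bound reduction" trend check above.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linear pairs give a coefficient of 1.0
print(round(pearson([1, 2, 3], [2, 4, 6]), 6))  # -> 1.0
```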

7) Practical Conclusions

  1. 2Taken/2Fetch works: frontend bound is significantly reduced and IPC increases overall.
  2. Frontend bandwidth bound is not gone: absolute residual FE bandwidth is still visible, with benchmark-specific tails.
  3. Bottleneck migration is real but mixed: more bad-spec/backend pressure appears, but many workloads remain retire-dominant.
  4. There is still upside: residual FE bandwidth, speculation quality, and queue-structure tuning can all deliver further gains.

8) Recommended Next Tuning Priorities

  1. Speculation-quality path (astar, gobmk, sjeng): reduce wrong-path/override penalties.
  2. Backend/memory path (omnetpp, mcf, gcc): memory/core backend constraints dominate.
  3. Queue-structure path: target fsqFullCannotEnq and resolveQueueFull hot workloads.
  4. Residual FE-bandwidth path (libquantum, astar, perlbench, sjeng): measurable headroom remains.

9) Notes

  • This report uses weighted aggregate CSVs; empty SPECfp rows are excluded.
  • Headroom estimates are intentionally rough and should be interpreted as directional bounds.
  • Some TopDown fields can be noisy (including occasional negative components); trend-level interpretation is recommended.
