perf(evm): optimize interpreter dispatch with computed goto and opcode inlining #367

starwarfan wants to merge 1 commit into DTVMStack:main
Conversation
…e inlining

Replace the switch-based opcode dispatch in the interpreter fast path with computed goto (the GCC/Clang labels-as-values extension, `&&label` / `goto *`) for better branch prediction and reduced dispatch overhead.

Key changes:
- Add a computed-goto dispatch table (256 entries) with per-opcode label targets
- Inline hot opcode logic directly in dispatch targets: arithmetic (ADD, SUB, MUL, DIV, etc.), logic (AND, OR, XOR, NOT), comparison (LT, GT, EQ, etc.), shifts (SHL, SHR, SAR), stack ops (PUSH0-32, DUP1-16, SWAP1-16, POP), and control flow (JUMP, JUMPI, JUMPDEST)
- Use local variables (`Pc`, `sp`) for the program counter and stack pointer to encourage register allocation, syncing back to the frame only for complex handlers
- Delegate complex opcodes (memory, storage, calls, creates, logs) to existing handler implementations via the `HANDLER_CALL` macro
- Retain the original switch-based dispatch as a fallback for non-GCC compilers (`#else`)

Benchmark results (evmone-bench, Release mode, vs evmone baseline interpreter):
- 165 of 167 synth tests faster than the evmone baseline
- Average speedup: 5.25x (excluding loop/startup tests)
- Peak speedups: MUL 8.2x, SUB 8.1x, ISZERO/NOT 7.9x, SWAP 6.9x, ADD 6.0x
- Only loop_v1/v2 (tiny, 5.8k gas) remain slower due to per-call overhead

Made-with: Cursor
Pull request overview
This PR optimizes the EVM interpreter’s gas-chunk (“no gas per opcode”) fast path by replacing the switch-based opcode dispatch with GCC/Clang computed-goto, inlining hot opcodes and delegating complex ones to existing handlers.
Changes:
- Adds a computed-goto dispatch table (256 entries) and a new dispatch loop using local `Pc`/`sp`.
- Inlines selected hot opcodes (arithmetic, stack ops, and control flow) directly in dispatch targets.
- Delegates complex opcodes (memory/storage/env/call/create/log/return/revert/selfdestruct) to the existing `*Handler::doExecute()` via a helper macro, retaining the switch-based chunk dispatcher as fallback.
```cpp
TARGET_PUSH0 : {
    if (INTX_UNLIKELY(sp >= MAXSTACK)) {
        Context.setStatus(EVMC_STACK_OVERFLOW);
        goto cgoto_error;
    }
    Frame->Stack[sp++] = 0;
    ++Pc;
    DISPATCH_NEXT;
```
TARGET_PUSH0 doesn’t guard on Revision, and the computed-goto path doesn’t perform the NamesTable-based undefined-opcode check that the switch fast path relies on. This changes semantics for pre-Shanghai revisions where opcode 0x5f must be treated as EVMC_UNDEFINED_INSTRUCTION, not executed as PUSH0.
```cpp
static void *cgoto_table[256] = {};
static bool cgoto_initialized = false;
if (!cgoto_initialized) {
    for (int i = 0; i < 256; i++)
        cgoto_table[i] = &&TARGET_UNDEFINED;
    cgoto_table[0x00] = &&TARGET_STOP;
    cgoto_table[0x01] = &&TARGET_ADD;
    cgoto_table[0x02] = &&TARGET_MUL;
    cgoto_table[0x03] = &&TARGET_SUB;
    cgoto_table[0x04] = &&TARGET_DIV;
    cgoto_table[0x05] = &&TARGET_SDIV;
    cgoto_table[0x06] = &&TARGET_MOD;
    cgoto_table[0x07] = &&TARGET_SMOD;
    cgoto_table[0x08] = &&TARGET_ADDMOD;
    cgoto_table[0x09] = &&TARGET_MULMOD;
    cgoto_table[0x0a] = &&TARGET_EXP;
    cgoto_table[0x0b] = &&TARGET_SIGNEXTEND;
    cgoto_table[0x10] = &&TARGET_LT;
    cgoto_table[0x11] = &&TARGET_GT;
    cgoto_table[0x12] = &&TARGET_SLT;
    cgoto_table[0x13] = &&TARGET_SGT;
    cgoto_table[0x14] = &&TARGET_EQ;
    cgoto_table[0x15] = &&TARGET_ISZERO;
    cgoto_table[0x16] = &&TARGET_AND;
    cgoto_table[0x17] = &&TARGET_OR;
    cgoto_table[0x18] = &&TARGET_XOR;
    cgoto_table[0x19] = &&TARGET_NOT;
    cgoto_table[0x1a] = &&TARGET_BYTE;
    cgoto_table[0x1b] = &&TARGET_SHL;
    cgoto_table[0x1c] = &&TARGET_SHR;
    cgoto_table[0x1d] = &&TARGET_SAR;
    cgoto_table[0x1e] = &&TARGET_CLZ;
    cgoto_table[0x20] = &&TARGET_KECCAK256;
    cgoto_table[0x30] = &&TARGET_ADDRESS;
    cgoto_table[0x31] = &&TARGET_BALANCE;
    cgoto_table[0x32] = &&TARGET_ORIGIN;
    cgoto_table[0x33] = &&TARGET_CALLER;
    cgoto_table[0x34] = &&TARGET_CALLVALUE;
    cgoto_table[0x35] = &&TARGET_CALLDATALOAD;
    cgoto_table[0x36] = &&TARGET_CALLDATASIZE;
    cgoto_table[0x37] = &&TARGET_CALLDATACOPY;
    cgoto_table[0x38] = &&TARGET_CODESIZE;
    cgoto_table[0x39] = &&TARGET_CODECOPY;
    cgoto_table[0x3a] = &&TARGET_GASPRICE;
    cgoto_table[0x3b] = &&TARGET_EXTCODESIZE;
    cgoto_table[0x3c] = &&TARGET_EXTCODECOPY;
    cgoto_table[0x3d] = &&TARGET_RETURNDATASIZE;
    cgoto_table[0x3e] = &&TARGET_RETURNDATACOPY;
    cgoto_table[0x3f] = &&TARGET_EXTCODEHASH;
    cgoto_table[0x40] = &&TARGET_BLOCKHASH;
    cgoto_table[0x41] = &&TARGET_COINBASE;
    cgoto_table[0x42] = &&TARGET_TIMESTAMP;
    cgoto_table[0x43] = &&TARGET_NUMBER;
    cgoto_table[0x44] = &&TARGET_PREVRANDAO;
    cgoto_table[0x45] = &&TARGET_GASLIMIT;
    cgoto_table[0x46] = &&TARGET_CHAINID;
    cgoto_table[0x47] = &&TARGET_SELFBALANCE;
    cgoto_table[0x48] = &&TARGET_BASEFEE;
    cgoto_table[0x49] = &&TARGET_BLOBHASH;
    cgoto_table[0x4a] = &&TARGET_BLOBBASEFEE;
    cgoto_table[0x50] = &&TARGET_POP;
    cgoto_table[0x51] = &&TARGET_MLOAD;
    cgoto_table[0x52] = &&TARGET_MSTORE;
    cgoto_table[0x53] = &&TARGET_MSTORE8;
    cgoto_table[0x54] = &&TARGET_SLOAD;
    cgoto_table[0x55] = &&TARGET_SSTORE;
    cgoto_table[0x56] = &&TARGET_JUMP;
    cgoto_table[0x57] = &&TARGET_JUMPI;
    cgoto_table[0x58] = &&TARGET_PC;
    cgoto_table[0x59] = &&TARGET_MSIZE;
    cgoto_table[0x5a] = &&TARGET_GAS;
    cgoto_table[0x5b] = &&TARGET_JUMPDEST;
    cgoto_table[0x5c] = &&TARGET_TLOAD;
    cgoto_table[0x5d] = &&TARGET_TSTORE;
    cgoto_table[0x5e] = &&TARGET_MCOPY;
    cgoto_table[0x5f] = &&TARGET_PUSH0;
    for (int i = 0x60; i <= 0x7f; i++)
        cgoto_table[i] = &&TARGET_PUSHX;
    for (int i = 0x80; i <= 0x8f; i++)
        cgoto_table[i] = &&TARGET_DUPX;
    for (int i = 0x90; i <= 0x9f; i++)
        cgoto_table[i] = &&TARGET_SWAPX;
    for (int i = 0xa0; i <= 0xa4; i++)
        cgoto_table[i] = &&TARGET_LOGX;
    cgoto_table[0xf0] = &&TARGET_CREATEX;
    cgoto_table[0xf1] = &&TARGET_CALLX;
    cgoto_table[0xf2] = &&TARGET_CALLX;
    cgoto_table[0xf3] = &&TARGET_RETURN;
    cgoto_table[0xf4] = &&TARGET_CALLX;
    cgoto_table[0xf5] = &&TARGET_CREATEX;
    cgoto_table[0xfa] = &&TARGET_CALLX;
    cgoto_table[0xfd] = &&TARGET_REVERT;
    cgoto_table[0xfe] = &&TARGET_INVALID;
    cgoto_table[0xff] = &&TARGET_SELFDESTRUCT;
    cgoto_initialized = true;
}
```
cgoto_table is initialized via a non-atomic static bool cgoto_initialized check. If BaseInterpreter::interpret() can be entered concurrently, this is a data race (undefined behavior) and could also leave the table partially initialized. Prefer thread-safe initialization (e.g., a function-local static table built by a lambda, or std::once_flag/std::call_once).
```cpp
// Dispatch to next opcode or exit if chunk boundary reached
#define DISPATCH_NEXT                                          \
    do {                                                       \
        if (INTX_UNLIKELY(Pc >= ChunkEnd))                     \
            goto cgoto_chunk_done;                             \
        goto *cgoto_table[static_cast<uint8_t>(Code[Pc])];     \
    } while (0)
```
The computed-goto fast path bypasses the existing NamesTable[Op] == NULL check used in the switch-based chunk dispatcher to enforce revision-specific opcode availability. As a result, opcodes introduced in later revisions (e.g. PUSH0/TLOAD/TSTORE/MCOPY/BLOB* etc.) can be executed in earlier revisions instead of raising EVMC_UNDEFINED_INSTRUCTION. Consider adding an opcode-availability check in the dispatch path (initial + DISPATCH_NEXT), or building a dispatch table per Revision that maps unsupported opcodes to TARGET_UNDEFINED.
Replace the switch-based opcode dispatch in the interpreter fast path with computed goto (the GCC/Clang labels-as-values extension, `&&label` / `goto *`) for better branch prediction and reduced dispatch overhead.

Key changes:
1. Does this PR affect any open issues? (Y/N) and add issue references (e.g. "fix #123", "re #123"):

2. What is the scope of this PR (e.g. component or file name):

evm, interpreter

3. Provide a description of the PR (e.g. more details, effects, motivations or doc link):

The EVM interpreter's main dispatch loop uses a `switch` statement over 256 opcode values. Modern CPUs struggle to predict indirect branches from large switch tables, causing pipeline stalls on every opcode dispatch. This PR replaces the switch with GCC/Clang computed goto (`&&label` / `goto *dispatch_table[opcode]`), which gives each opcode its own indirect branch site and allows the branch predictor to specialize per-opcode.

Dispatch mechanism: A 256-entry `dispatch_table` maps each opcode to a label address. After executing an opcode, the code jumps directly to the next handler via `goto *dispatch_table[*Pc]` without returning to a central switch.

Opcode inlining: Hot opcodes (arithmetic, logic, comparison, shifts, stack manipulation, control flow) are inlined directly in their dispatch targets with local `Pc`/`sp` variables to encourage register allocation. Complex opcodes (memory, storage, calls, creates, logs) delegate to existing handler methods via a `HANDLER_CALL` macro that syncs local state before/after the call.

Portability: The computed-goto path is guarded by `#if defined(__GNUC__) || defined(__clang__)`. A fallback `#else` branch retains the original switch-based dispatch for non-GCC compilers.

Benchmark results (evmone-bench, Release mode, vs evmone baseline interpreter):

4. Are there any breaking changes? (Y/N) and describe the breaking changes (e.g. more details, motivations or doc link):

5. Are there test cases for these changes? (Y/N) select and add more details, references or doc links:

Benchmark results using evmone-bench (Release mode, vs evmone baseline interpreter):

6. Release note

Made with Cursor