PolymorPIC: Embedding Polymorphic Processing-in-Cache in RISC-V based Processor for Full-stack Efficient AI Inference
Official Manual of PolymorPIC deployment on FPGA, including module tests on both the simulator and Linux running on the FPGA.
PolymorPIC: Embedding Polymorphic Processing-in-Cache in RISC-V based Processor for Full-stack Efficient AI Inference
Cheng Zou, Ziling Wei, Jun Yan Lee, Chen Nie, Kang You, Zhezhi He
Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2025
Please cite this work if this manual or work helps your research or project:
@inproceedings{zou2025polymorpic,
author = {Zou, Cheng and Wei, Ziling and Lee, Jun Yan and Nie, Chen and You, Kang and He, Zhezhi},
title = {PolymorPIC: Embedding Polymorphic Processing-in-Cache in RISC-V based Processor for Full-stack Efficient AI Inference},
year = {2025},
isbn = {9798400715730},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3725843.3756066},
doi = {10.1145/3725843.3756066},
pages = {1609–1624},
numpages = {16},
keywords = {Processing-in-Cache, Neural Network Acceleration, AIoT, Area-Efficient SoC, RISC-V, Full-stack},
location = {},
series = {MICRO '25}
}
We introduce it in the following order:
- Brief introduction of the RTL structure.
- Add modified/customized code to Chipyard, including the controller/scheduler modules and the modified inclusive cache modules (rocket-chip-inclusive-cache). We pack all the design code within rocket-chip-inclusive-cache and package it as a new folder, rocket-chip-inclusive-cache-pic-virtual.
- Module tests in Chipyard via Verilator or VCS in bare-metal mode.
- Deployment on FPGA with Linux running.
- Chipyard12
- Vivado2022.2
The following figure shows an overall view of the RTL modules from the DigitalTop module.
The DigitalTop module is an SoC module with certain peripherals excluded, mainly comprising various buses, the Rocket core, and the LLC module.
The orange lines mark the modules modified in PolymorPIC, and the red boxes mark the added modules.
- RoCC interface in RocketTile module: based on the RoCC interface specification, it receives instructions from the CPU and forwards them to the PolymorPIC main (PolymorPIC_main_in_cache) module.
- Added modules in the PolymorPIC main module
- Modified LLC
PolymorPIC_main_in_cache is the PolymorPIC main module, including the CMD_resolver and other data-processing modules.
CMD_resolver receives commands from the bus (sent from the RoCC interface in the core). It then decodes the instructions and sends execution requests to the respective processing units. It also returns operational status data through the bus back to the RoCC interface, which in turn returns it to the user program.
Other modules such as DMA, P2S, and im2col read/write and process data from/to the PIC part of the LLC (InclusiveCache).
For easy communication with the LLC, the PolymorPIC_main_in_cache module is placed inside the LLC module.
The SwitchCtl_pic module is added to handle PIC allocation/release requests.
The FlushReqRouter module is added to handle direct cache flushes for PIC allocation.
The Directory module is modified to 1) handle direct queries from the FlushReqRouter and 2) keep PIC-mode cache ways isolated from normal CPU programs.
The BankStore module is modified to add Bit-serial Process Engines under the arrays to support near-memory computing.
In this demo, LLC is configured as a 1 MB, 16-way set-associative cache. Each sub-array is sized 512$\times$64, and each Mat contains four sub-arrays. The i-th Mat of each Bank together forms a Level. Level 0 is reserved exclusively for use as the CPU cache, while the other levels can be switched to PIC mode.
The following figure shows the structure.
Only chipyard12 is verified and supported for this demo.
Follow the steps below to add the PolymorPIC code to Chipyard and modify some configuration files. Each step title indicates the operation to be performed, and the content beneath the title explains the reasons and details of that step.
- Copy the folder `rocket-chip-inclusive-cache-pic-virtual` in this repo to the folder `generators` in chipyard.
- For Rocket, replace generators/rocket-chip/src/main/scala/tile/LazyRoCC.scala with the `LazyRoCC.scala` provided in this repo under the folder `Rocket`.
- For Boom, replace /root/chipyard/generators/boom/src/main/scala/common/tile.scala with `tile.scala` under the folder `Boom`. Replace /root/chipyard/generators/boom/src/main/scala/exu/execution-units/rocc.scala with `rocc.scala` under the folder `Boom`.
- Replace the original `build.sbt` in chipyard.
- Add the Chipyard generator config code `PolymorPIC_Configs.scala` in this repo to the chipyard folder /root/chipyard/generators/chipyard/src/main/scala/config.
- Delete the original `benchmarks` under /root/chipyard/toolchains/riscv-tools/riscv-tests/. Add `benchmarks` in this repo to the chipyard folder /root/chipyard/toolchains/riscv-tools/riscv-tests/.
The PolymorPIC code is located alongside the inclusive cache for easy access to the LLC parameters.
The folder rocket-chip-inclusive-cache-pic-virtual contains the code of the modified InclusiveCache and our PolymorPIC_main_in_cache module.
The virtual in the name means it supports an operating system.
This aims to add a TileLink slave port in RoCC and connect it to the bus when initialized.
This code in line 68 adds the port:

```scala
val tlSlaveNode : TLNode = TLIdentityNode()
```

This code in line 85 adds the connection:

```scala
roccs.map(_.tlSlaveNode).foreach { tl_slave => tl_slave := TLFragmenter(8, 64) :*= tlSlaveXbar.node }
```

Besides connecting the extra slave port in RoCC, the modification mainly aims to fix the bug that some uninitialized RoCC ports cannot pass compilation in chipyard12.
The modification in tile.scala includes the slave port connection and FPU port initialization.
Connect the slave port with RoCC:

```scala
DisableMonitors { implicit p => tlSlaveXbar.node :*= slaveNode } // line 99
roccs.map(_.tlSlaveNode).foreach { tl_slave => tl_slave := TLFragmenter(8, 64) :*= tlSlaveXbar.node } // line 149
```

Initialize the FPU port:

```scala
val fp_ios = outer.roccs.map(r => {
  val roccio = r.module.io
  roccio.fpu_req.ready := true.B
  roccio.fpu_resp.valid := false.B
  roccio.fpu_resp.bits := DontCare
})
```

The modification in rocc.scala fixes the uninitialized RoCC ports:
```scala
// line 68
io.req.ready := true.B
io.core.rocc.exception := false.B
io.core.rocc.mem.req.ready := false.B
io.core.rocc.mem.s2_nack := false.B
io.core.rocc.mem.s2_nack_cause_raw := false.B
io.core.rocc.mem.s2_uncached := false.B
io.core.rocc.mem.s2_paddr := DontCare
io.core.rocc.mem.resp.valid := false.B
io.core.rocc.mem.resp.bits := DontCare
io.core.rocc.mem.replay_next := false.B
io.core.rocc.mem.s2_xcpt.ma.ld := false.B
io.core.rocc.mem.s2_xcpt.ma.st := false.B
io.core.rocc.mem.s2_xcpt.pf.ld := false.B
io.core.rocc.mem.s2_xcpt.pf.st := false.B
io.core.rocc.mem.s2_xcpt.gf.ld := false.B
io.core.rocc.mem.s2_xcpt.gf.st := false.B
io.core.rocc.mem.s2_xcpt.ae.ld := false.B
io.core.rocc.mem.s2_xcpt.ae.st := false.B
io.core.rocc.mem.s2_gpa := DontCare
io.core.rocc.mem.s2_gpa_is_pte := false.B
io.core.rocc.mem.uncached_resp.map(r => {
  r.valid := false.B
  r.bits := DontCare
})
io.core.rocc.mem.ordered := false.B
io.core.rocc.mem.perf.acquire := false.B
io.core.rocc.mem.perf.release := false.B
io.core.rocc.mem.perf.grant := false.B
io.core.rocc.mem.perf.tlbMiss := false.B
io.core.rocc.mem.perf.blocked := false.B
io.core.rocc.mem.perf.canAcceptStoreThenLoad := false.B
io.core.rocc.mem.perf.canAcceptStoreThenRMW := false.B
io.core.rocc.mem.perf.canAcceptLoadThenLoad := false.B
io.core.rocc.mem.perf.storeBufferEmptyAfterLoad := false.B
io.core.rocc.mem.perf.storeBufferEmptyAfterStore := false.B
io.core.rocc.mem.clock_enabled := false.B
// line 174
io.core.rocc.cmd.bits := DontCare
// line 247
io.resp.bits := DontCare
```

The following code, which is the original LLC cache build definition, should be removed:
```scala
// lazy val rocketchip_inclusive_cache = (project in file("generators/rocket-chip-inclusive-cache"))
//   .settings(
//     commonSettings,
//     Compile / scalaSource := baseDirectory.value / "design/craft")
//   .dependsOn(rocketchip)
//   .settings(libraryDependencies ++= rocketLibDeps.value)
```

And the following build definition of LLC+PolymorPIC is added:
```scala
lazy val rocketchip_inclusive_cache = (project in file("generators/rocket-chip-inclusive-cache-pic-virtual"))
  .settings(
    commonSettings,
    Compile / scalaSource := baseDirectory.value / "design/craft")
  .dependsOn(rocketchip)
  .settings(libraryDependencies ++= rocketLibDeps.value)
```

This adds the configuration for generating the SoC with PolymorPIC. The example config PICRocket1024 gives a sample of BigRocket + 1 MB LLC with PIC.
Under benchmarks, besides the originally existing common, some extra tests are provided.
For example, ACC_test is to test the functionality of accumulator.
The C code of these tests is generated by a Python script so that the test parameters can be changed easily.
Each test contains ISA.c/h, which are automatically generated during hardware (Scala code) compilation.
To control where ISA.c/h are put during hardware (Scala code) compilation, see Chipyard/rocket-chip-inclusive-cache-pic-virtual/design/craft/PolymorPIC/src/main/scala/SysConfig.scala line 220: header_cpy_Paths.
The Makefile specifies which tests are compiled when running make.
To verify the correctness of the RTL functionality, the design should first be validated using VCS or Verilator.
Before this step, the step Embed PolymorPIC code into Chipyard should have been finished.
Make sure benchmarks contains these tests.
Under chipyard/, run:
source env.sh
Go to chipyard/toolchains/riscv-tools/riscv-tests/benchmarks/ACC_test. Then go to the folder gen, where the Python scripts are located. gen.py contains the main entry. In testSet, two tests are provided:
```python
testSet = {
    "a1": {"len_64b": 256, "srcArrayID": 8*4, "srcNum": 5, "destArrayID": 5*4, "bitWidth": "_32b_"},
    "a2": {"len_64b": 16, "srcArrayID": 16*4, "srcNum": 4, "destArrayID": 60*4, "bitWidth": "_32b_"},
}
```

a1 and a2 are test IDs. Multiple tests can be run one by one, which checks whether registers are reset correctly after each execution.
len_64b is the number of rows used in each sub-array.
srcArrayID is the first sub-array ID from which to accumulate.
destArrayID is the sub-array where the result is saved.
bitWidth currently only supports _32b_.
The test also contains the mode switch part.
The diagram of what test a1 does is shown in the following figure:

Start the simulation:
- Compile C program
- Compile hardware and run simulation
Go back to chipyard/toolchains/riscv-tools/riscv-tests/benchmarks, make sure ACC_test is in Makefile:
bmarks = \
ACC_test \
Then run make. Afterwards, ACC_test.riscv will be generated under the folder.
Go to chipyard/sims/verilator (or chipyard/sims/vcs if using VCS), then run the following command:
make run-binary LOADMEM=1 CONFIG=PICRocket1024 BINARY=/root/chipyard/toolchains/riscv-tools/riscv-tests/benchmarks/ACC_test.riscv VERILATORTHREADS=10 -j 10
The first run needs to compile the hardware first. Then the simulation will print the C test program's output, like:
Switch successfully!
...
...
...
################ Runtime State ###############
Available Cache Volumn (KB) = 1024
Available PIC Volumn (KB) = 0
Available Mats = 0
Available Mats Range = 63~63
##############################################
ACC check start!
This Acc result is all correct!
The FPGA deployment implementation is based on https://github.com/eugene-tarassov/vivado-risc-v from eugene-tarassov.
The modified project is vivado_fpga_pic.
We further modify the process to support the zcu102 and customized boards.
This step is elaborated based on the zcu102, directly using our modified script that supports it.
For supporting customized boards not from Xilinx and creating the script, a tutorial is also provided in another section.
- SD card slot preparation

For an easy implementation that follows the method in eugene-tarassov/vivado-risc-v, we choose to use an external memory card.
The implementation only uses the PL; the PS is not used, so the SD slot connected to the PS will not work.
Therefore, an external storage card is needed.
It uses a PMOD connector, which is supported by the PL side of the zcu102. Connect it to the board like this:
Go to the folder vivado_fpga_pic. Then input command:
```shell
make vivado-tcl BOARD=zcu102 CONFIG=BigRocketPIC1024KB ROCKET_FREQ_MHZ=72.0
```

Then the compilation begins. Afterwards, a folder named workspace is generated.
It should be copied to a machine with Vivado 2022.2. For easy transfer, input the pack command:

```shell
make pack PACK_NAME=zcu102_bigrocket_PIC_1M_72mhz
```

Then the necessary files will be packed into zcu102_bigrocket_PIC_1M_72mhz.tar.
Copy zcu102_bigrocket_PIC_1M_72mhz.tar to a machine with Vivado 2022.2 installed.
Extract the files and open Vivado.

```shell
source /nvme/zcu102_bigRocket_pic_1024_72mhz/workspace/BigRocketPIC1024KB/system-zcu102.tcl
```

Input the command above in the following position of the GUI:
The system-zcu102.tcl is the script that can setup the whole vivado project.
Then, the Vivado project is generated. The block design looks like:
The .xdc file defines the connections to peripheral devices such as SD, UART, and the reset button.
For example, to reset, use the button in the figure:
This constraint for the reset is defined in top.xdc:
set_property PACKAGE_PIN AG13 [get_ports rst]
set_property IOSTANDARD LVCMOS33 [get_ports rst]
Then, run Synthesis, Implementation, and Generate Bitstream. The resource usage on the zcu102 PL is:
This step prepares the Linux image, which contains the programs to run on the FPGA. Take ACC_test as an example here.
The Accumulator has been tested on RTL simulators like VCS and Verilator; however, those simulations are bare-metal.
To run at the system level on Linux, the pages need to be locked. The code looks like:
```c
#ifdef LINUX
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
    {
        perror("mlockall failed");
        exit(1);
    }
#endif
#ifdef PIC_MODE
    conduct_alloc(15);
#endif
    printf("Test a1 begin.\n");
```

The pre-built image is available under the folder debian_img. Steps to compile and move the executable to the image:
- Compile the RISC-V executable using cross-compilation
- Copy file to the Linux image
The host machine should install:

```shell
apt install gcc-riscv64-linux-gnu g++-riscv64-linux-gnu
```

Then go to the ACC_test folder and run:

```shell
riscv64-linux-gnu-gcc -DLINUX -o ACC main.c ISA.c
```

Then we get the ELF file ACC.
The image file should be mounted first. The file system is the second partition of the provided debian-riscv64.sd.img.
You can use the provided script to mount it:

```shell
./mount.sh debian-riscv64.sd.img 2 /root/img_mount/
```

It is mounted to the folder /root/img_mount/:
Directly copy the elf ACC to root or any other folders.
Then, unmount:

```shell
./umount.sh debian-riscv64.sd.img /root/img_mount/
```

Then, flash debian-riscv64.sd.img to the SD card using balenaEtcher.
Connect the zcu102 JTAG and UART to the host machine. Insert the SD card into the zcu102. Use Vivado to program the device. Open a serial monitor to watch the UART output (baud rate 115200):
Then, there is login step:
debian login: root
Password:
Linux debian 6.9.6-dirty #1 SMP Fri Jul 19 23:29:15 CST 2024 riscv64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@debian:~#
The username is root and the password is 1.
Run ACC.