
PolymorPIC: Embedding Polymorphic Processing-in-Cache in RISC-V based Processor for Full-stack Efficient AI Inference


Official manual for PolymorPIC deployment on FPGA, including module tests on both the simulator and Linux running on the FPGA.

PolymorPIC: Embedding Polymorphic Processing-in-Cache in RISC-V based Processor for Full-stack Efficient AI Inference
Cheng Zou, Ziling Wei, Jun Yan Lee, Chen Nie, Kang You, Zhezhi He
Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2025

Please cite this work if this manual or the work helps your research or project:

@inproceedings{zou2025polymorpic,
author = {Zou, Cheng and Wei, Ziling and Lee, Jun Yan and Nie, Chen and You, Kang and He, Zhezhi},
title = {PolymorPIC: Embedding Polymorphic Processing-in-Cache in RISC-V based Processor for Full-stack Efficient AI Inference},
year = {2025},
isbn = {9798400715730},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3725843.3756066},
doi = {10.1145/3725843.3756066},
pages = {1609–1624},
numpages = {16},
keywords = {Processing-in-Cache, Neural Network Acceleration, AIoT, Area-Efficient SoC, RISC-V, Full-stack},
location = {},
series = {MICRO '25}
}

We introduce it in the following order:

  1. A brief introduction to the RTL structure.
  2. Adding the modified/customized code to Chipyard, including the controller/scheduler modules and the modified inclusive cache modules (rocket-chip-inclusive-cache). We pack all the design code within rocket-chip-inclusive-cache and place it in a new folder named rocket-chip-inclusive-cache-pic-virtual.
  3. Module tests in Chipyard via Verilator or VCS in bare-metal mode.
  4. Deployment on FPGA with Linux running.

Software Prerequisites

  1. Chipyard12
  2. Vivado 2022.2

1. RTL Structure

The following figure shows an overview of the RTL modules from the DigitalTop module.

Overview

The DigitalTop module is an SoC module with certain peripherals excluded, mainly comprising various buses, the Rocket core, and the LLC module. The orange lines mark the modules modified in PolymorPIC, and the red boxes mark the added modules.

  1. RoCC interface in the RocketTile module

    Based on the RoCC interface specification, we add functionality to receive instructions from the CPU and forward them to the PolymorPIC main (PolymorPIC_main_in_cache) module.

  2. Added modules in the PolymorPIC main module

    PolymorPIC_main_in_cache is the PolymorPIC main module, including CMD_resolver and other data-processing modules. CMD_resolver receives commands from the bus (sent from the RoCC interface in the core). It then decodes the instructions and sends execution requests to the respective processing units. It also returns operational status data through the bus back to the RoCC, which in turn returns it to the user program.

    Other modules such as DMA, P2S and im2col read/write and process data from/to the PIC part of the LLC (InclusiveCache). For easy communication with the LLC, the PolymorPIC_main_in_cache module is placed inside the LLC module.

  3. Modified LLC

    The SwitchCtl_pic module is added to handle PIC allocation/release requests.

    The FlushReqRouter module is added to handle direct cache flushes for PIC allocation.

    The Directory module is modified to 1) handle direct queries from FlushReqRouter and 2) keep PIC-mode cache ways isolated from normal CPU programs.

    The BankStore module is modified to add a Bit-serial Process Engine under the arrays to support near-memory computing.

BankStore

In this demo, the LLC is configured as a 1 MB, 16-way set-associative cache. Each sub-array is sized 512×64, and each Mat contains four sub-arrays. The i-th Mat of each Bank together forms a Level. Level 0 is reserved exclusively for use as the CPU cache, while the other Levels can be switched to PIC mode.

The following figure shows the structure.

Bank

2. Embed PolymorPIC code into Chipyard.

Only chipyard12 is verified and supported for this demo.

Follow these steps to add the PolymorPIC code to Chipyard and modify some configuration files. Each step title indicates the direct operation to be performed, and the content beneath the title explains the reasons and details of that step.

  1. Copy the folder rocket-chip-inclusive-cache-pic-virtual in this repo to the folder generators in Chipyard.

    The PolymorPIC code is located alongside the inclusive cache for easy access to the LLC parameters.

    The folder rocket-chip-inclusive-cache-pic-virtual contains the code of the modified InclusiveCache and our PolymorPIC_main_in_cache module.

    The virtual in the name means it supports an operating system.

  2. For Rocket, replace generators/rocket-chip/src/main/scala/tile/LazyRoCC.scala with the LazyRoCC.scala provided in this repo under the folder Rocket.

    This adds a TileLink slave port in RoCC and connects it to the bus when initialized.

    This code in line 68 adds the port:

    val tlSlaveNode : TLNode = TLIdentityNode()

    This code in line 85 adds the connection:

    roccs.map(_.tlSlaveNode).foreach { tl_slave => tl_slave := TLFragmenter(8, 64) :*= tlSlaveXbar.node }
  3. For BOOM, replace /root/chipyard/generators/boom/src/main/scala/common/tile.scala with the tile.scala under the folder Boom. Replace /root/chipyard/generators/boom/src/main/scala/exu/execution-units/rocc.scala with the rocc.scala under the folder Boom.

    Besides connecting the extra slave port in RoCC, the modification mainly fixes a bug where some uninitialized RoCC ports fail compilation in chipyard12.

    The modification in tile.scala includes the slave port connection and FPU port initialization.

    Connect the slave port with RoCC:

    DisableMonitors { implicit p => tlSlaveXbar.node :*= slaveNode }  // line 99
    roccs.map(_.tlSlaveNode).foreach { tl_slave => tl_slave := TLFragmenter(8, 64) :*= tlSlaveXbar.node }  // line 149

    Initialize the FPU port:

    val fp_ios = outer.roccs.map(r => {
            val roccio = r.module.io
            roccio.fpu_req.ready := true.B
            roccio.fpu_resp.valid := false.B
            roccio.fpu_resp.bits := DontCare
          })

    The modification in rocc.scala initializes the previously uninitialized RoCC ports:

      // line 68
      io.req.ready := true.B
      io.core.rocc.exception := false.B
      io.core.rocc.mem.req.ready := false.B
      io.core.rocc.mem.s2_nack := false.B
      io.core.rocc.mem.s2_nack_cause_raw := false.B
      io.core.rocc.mem.s2_uncached := false.B
      io.core.rocc.mem.s2_paddr := DontCare
      io.core.rocc.mem.resp.valid := false.B
      io.core.rocc.mem.resp.bits := DontCare
      io.core.rocc.mem.replay_next := false.B
      io.core.rocc.mem.s2_xcpt.ma.ld := false.B
      io.core.rocc.mem.s2_xcpt.ma.st := false.B
      io.core.rocc.mem.s2_xcpt.pf.ld := false.B
      io.core.rocc.mem.s2_xcpt.pf.st := false.B
      io.core.rocc.mem.s2_xcpt.gf.ld := false.B
      io.core.rocc.mem.s2_xcpt.gf.st := false.B
      io.core.rocc.mem.s2_xcpt.ae.ld := false.B
      io.core.rocc.mem.s2_xcpt.ae.st := false.B
      io.core.rocc.mem.s2_gpa := DontCare
      io.core.rocc.mem.s2_gpa_is_pte := false.B
      io.core.rocc.mem.uncached_resp.map(r => {
        r.valid := false.B
        r.bits := DontCare
      })
      io.core.rocc.mem.ordered := false.B
      io.core.rocc.mem.perf.acquire := false.B
      io.core.rocc.mem.perf.release := false.B
      io.core.rocc.mem.perf.grant := false.B
      io.core.rocc.mem.perf.tlbMiss := false.B
      io.core.rocc.mem.perf.blocked := false.B
      io.core.rocc.mem.perf.canAcceptStoreThenLoad := false.B
      io.core.rocc.mem.perf.canAcceptStoreThenRMW := false.B
      io.core.rocc.mem.perf.canAcceptLoadThenLoad := false.B
      io.core.rocc.mem.perf.storeBufferEmptyAfterLoad := false.B
      io.core.rocc.mem.perf.storeBufferEmptyAfterStore := false.B
      io.core.rocc.mem.clock_enabled := false.B

      // line 174
      io.core.rocc.cmd.bits  := DontCare

      // line 247
      io.resp.bits           := DontCare
  4. Replace the original build.sbt in Chipyard.

    The following code, which builds the original LLC cache, should be removed:

    // lazy val rocketchip_inclusive_cache = (project in file("generators/rocket-chip-inclusive-cache"))
    //   .settings(
    //     commonSettings,
    //     Compile / scalaSource := baseDirectory.value / "design/craft")
    //   .dependsOn(rocketchip)
    //   .settings(libraryDependencies ++= rocketLibDeps.value)

    And the following build info of LLC+PolymorPIC is added:

    lazy val rocketchip_inclusive_cache = (project in file("generators/rocket-chip-inclusive-cache-pic-virtual"))
      .settings(
        commonSettings,
        Compile / scalaSource := baseDirectory.value / "design/craft")
      .dependsOn(rocketchip)
      .settings(libraryDependencies ++= rocketLibDeps.value)
  5. Add the Chipyard generator config code PolymorPIC_Configs.scala in this repo to the Chipyard folder /root/chipyard/generators/chipyard/src/main/scala/config.

    This adds the configuration for generating an SoC with PolymorPIC. The example config PICRocket1024 gives a sample of BigRocket + 1 MB LLC with PIC.

  6. Delete the original benchmarks under /root/chipyard/toolchains/riscv-tools/riscv-tests/. Add the benchmarks in this repo to the Chipyard folder /root/chipyard/toolchains/riscv-tools/riscv-tests/.

    Under benchmarks, besides the originally existing common, some extra tests are provided.

    For example, ACC_test tests the functionality of the accumulator.

    The C code of these tests is generated by a Python script for easy changing of the test parameters.

    In each test there is ISA.c/h, which is automatically generated during the hardware (Scala code) compile.

    To control where ISA.c/h is put during the hardware (Scala code) compile, see Chipyard/rocket-chip-inclusive-cache-pic-virtual/design/craft/PolymorPIC/src/main/scala/SysConfig.scala line 220: header_cpy_Paths.

    The Makefile controls which tests are compiled when running make.

3. Simulation on Chipyard

To verify the correctness of the RTL functionality, the design should first be validated using VCS or Verilator. Before this step, the step Embed PolymorPIC code into Chipyard should have been finished. Make sure the benchmarks folder contains these tests.

⚠️ Before executing commands in Chipyard, don't forget to set up the Chipyard environment. Go to chipyard/ and run:

source env.sh

Example: Run ACC test

Go to chipyard/toolchains/riscv-tools/riscv-tests/benchmarks/ACC_test, then to the folder gen, where the Python scripts are located. gen.py contains the main entry. In testSet, two tests are provided:

testSet={
        "a1":{"len_64b":256,"srcArrayID":8*4,"srcNum":5,"destArrayID":5*4,"bitWidth":"_32b_"},
        "a2":{"len_64b":16,"srcArrayID":16*4,"srcNum":4,"destArrayID":60*4,"bitWidth":"_32b_"},
        }

a1 and a2 are test IDs. Multiple tests can be run one by one, which checks whether registers are reset correctly after one execution. len_64b is the number of rows in each sub-array, srcArrayID is the first sub-array ID to accumulate from, srcNum is the number of source sub-arrays, and destArrayID is the sub-array where the result is saved. bitWidth currently only supports _32b_. The test also contains the mode-switch part.

The diagram of what test a1 does is shown in the following figure: ACC_test

Start the simulation:

  1. Compile the C program

    Go back to chipyard/toolchains/riscv-tools/riscv-tests/benchmarks and make sure ACC_test is in the Makefile:

    bmarks = \
    	ACC_test \

    Then run make. Afterwards, ACC_test.riscv will be generated under the folder.

  2. Compile the hardware and run the simulation

    Go to chipyard/sims/verilator (or chipyard/sims/vcs to use VCS instead), then run:

    make run-binary LOADMEM=1 CONFIG=PICRocket1024 BINARY=/root/chipyard/toolchains/riscv-tools/riscv-tests/benchmarks/ACC_test.riscv VERILATORTHREADS=10 -j 10
    

The first run needs to compile the hardware first. Then the simulation will show the C test program's printf output, like:

    Switch successfully!
    ...
    ...
    ...
    ################ Runtime State ###############
    Available Cache Volumn (KB) = 1024
    Available PIC Volumn (KB) = 0
    Available Mats = 0
    Available Mats Range = 63~63
    ##############################################
    ACC check start!
    This Acc result is all correct!
    

4. FPGA deployment (Run Linux)

The FPGA deployment is based on https://github.com/eugene-tarassov/vivado-risc-v by eugene-tarassov. The modified project is vivado_fpga_pic. We further modify the flow to support the zcu102 and customized boards. This step is elaborated for the zcu102, directly using our modified script that supports it. A tutorial on supporting customized non-Xilinx boards and creating the script is provided in another section.

4.1 Hardware Preparation (zcu102 as example)

  1. SD card slot preparation

    For easy implementation, following the method in eugene-tarassov/vivado-risc-v, we use an external memory card.

    The implementation only uses the PL; the PS is not used, so the SD slot connected to the PS will not work.

    Therefore, an external storage card is needed.

    SD-Slot

    It uses a PMOD connector, which is supported by the PL side of the zcu102. Connect it to the board like this:

    connect

    Go to the folder vivado_fpga_pic, then run:

    make vivado-tcl BOARD=zcu102 CONFIG=BigRocketPIC1024KB ROCKET_FREQ_MHZ=72.0

    Then the compile begins. Afterwards, a folder named workspace is generated. It should be copied to a machine with Vivado 2022.2. For easy transfer, run the pack command:

    make pack PACK_NAME=zcu102_bigrocket_PIC_1M_72mhz

    Then the necessary files will be packed into zcu102_bigrocket_PIC_1M_72mhz.tar.

    Copy zcu102_bigrocket_PIC_1M_72mhz.tar to a machine with Vivado 2022.2 installed.

    Extract the files and open Vivado.

    source /nvme/zcu102_bigRocket_pic_1024_72mhz/workspace/BigRocketPIC1024KB/system-zcu102.tcl

    Enter the above command at the following position of the GUI:

    connect

    system-zcu102.tcl is the script that sets up the whole Vivado project.

    Then the Vivado project is generated. The block design looks like:

    connect

    The .xdc file defines the connections to peripheral devices such as the SD card, UART, and the reset button. For example, to reset, use the button shown in the figure:

    connect

    The constraint for reset is defined in top.xdc:

    set_property PACKAGE_PIN AG13 [get_ports rst]
    set_property IOSTANDARD LVCMOS33 [get_ports rst]
    

    Then run Synthesis, Implementation, and Generate Bitstream. The resource utilization on the zcu102 PL is:

    connect

4.2 Software Preparation

This step prepares the Linux image, which contains the programs to run on the FPGA. Take ACC_test as an example here.

Example: ACC test

The accumulator has been tested on RTL simulators like VCS and Verilator; however, those simulations are bare-metal. To run at system level on Linux, the pages need to be locked (kept resident in physical memory). The code is like:

#include <sys/mman.h>   /* at the top of main.c, for mlockall */

#ifdef LINUX
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
    {
        perror("mlockall failed");
        exit(1);
    }
#endif

#ifdef PIC_MODE
    conduct_alloc(15);
#endif
    printf("Test a1 begin.\n");

The pre-built image is ready under the folder debian_img. Steps to compile and move the executable into the image:

  1. Compile the RISC-V executable using cross-compilation

    The host machine should install:

    apt install gcc-riscv64-linux-gnu g++-riscv64-linux-gnu

    Then go to the ACC_test folder and run:

    riscv64-linux-gnu-gcc -DLINUX -o ACC main.c ISA.c

    Then we get the ELF file ACC.

  2. Copy the file to the Linux image

    The image file should be mounted first. The file system is the second partition of the provided debian-riscv64.sd.img.

    You can use the provided script to mount it:

    ./mount.sh debian-riscv64.sd.img 2 /root/img_mount/

    It is mounted at the folder /root/img_mount/:

    connect

    Directly copy the ELF ACC to root or any other folder.

    Then, unmount:

    ./umount.sh debian-riscv64.sd.img /root/img_mount/

    Then flash debian-riscv64.sd.img to the SD card using balenaEtcher.

4.3 Start FPGA

Connect the zcu102 JTAG and UART to the machine. Insert the SD card into the zcu102. Use Vivado to program the device. Open a serial monitor to watch the UART output (baud rate 115200):

connect

Then comes the login step:

    debian login: root
    Password: 
    Linux debian 6.9.6-dirty #1 SMP Fri Jul 19 23:29:15 CST 2024 riscv64
    
    The programs included with the Debian GNU/Linux system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.
    
    Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
    permitted by applicable law.
    root@debian:~#
    

    The username is root and the password is 1.

    Run ACC.
