PercyHayes/operation_compare

Performance Comparison: W4A4 vs. W4ASpike

This project compares two arithmetic implementations for neural network operations:

  1. W4A4: Standard 4-bit integer quantization using conventional multiply-accumulate (MAC) units.
  2. W4ASpike: A novel spike-based accumulation method that decomposes activations into bit-planes and performs shift-add operations, eliminating the need for hardware multipliers.
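
To illustrate why bit-plane decomposition can replace multipliers, here is a minimal software sketch (not the repository's RTL). It assumes signed 4-bit weights in [-8, 7] and unsigned 4-bit activations in [0, 15]; the README does not state the signedness, and the function names `dot_w4a4` and `dot_bitplane` are hypothetical. For each activation bit position, the weights whose activation bit is set are summed, and that partial sum is shifted by the bit position.

```python
import random

def dot_w4a4(weights, activations):
    """Conventional MAC reference: one multiplication per element."""
    return sum(w * a for w, a in zip(weights, activations))

def dot_bitplane(weights, activations, act_bits=4):
    """Multiplier-free variant: shift-add over activation bit-planes.

    Assumes activations are unsigned integers below 2**act_bits.
    """
    total = 0
    for b in range(act_bits):
        # Sum the weights whose activation has bit b set (a "spike" in plane b).
        plane_sum = sum(w for w, a in zip(weights, activations) if (a >> b) & 1)
        # Weight the plane by 2**b using a shift instead of a multiply.
        total += plane_sum << b
    return total

# Workload shape from this README: a 128-dimensional dot product, 1024 times.
rng = random.Random(0)
mismatches = 0
for _ in range(1024):
    w = [rng.randint(-8, 7) for _ in range(128)]   # assumed signed 4-bit weights
    a = [rng.randint(0, 15) for _ in range(128)]   # assumed unsigned 4-bit activations
    if dot_w4a4(w, a) != dot_bitplane(w, a):
        mismatches += 1
```

Under these assumptions both functions return the same value for every vector pair, so `mismatches` stays at zero. The hardware trade-off is that the bit-plane form needs one adder pass per activation bit.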

The comparison evaluates performance, area, and power consumption across both FPGA (Xilinx Zynq UltraScale+) and ASIC (TSMC 28nm) platforms.

Workload

  • Vector dot product: a 128-dimensional vector dot product executed 1024 times

Comparison metrics

  • Each functional unit is implemented using high-level synthesis (HLS).
  • The target part is xczu9eg-ffvb1156-2-e. The target clock is 200MHz with a clock_uncertainty of 27%. Resource usage is taken from the post-synthesis results in Vitis HLS. Latency, resource usage, and timing are reported.
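
| design | clock | co-sim latency | post_syn timing (ns) | LUT | FF | DSP | BRAM | SRL | CLB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| naive dot-dsp | 200MHz | 1138 | 2.299 | 6731 | 10131 | 58 | 30 | 407 | 1523 |
| naive dot-fabric | 200MHz | 1131 | 2.210 | 8050 | 8650 | 0 | 30 | 287 | 1665 |
| s-bin-par | 200MHz | 1170 | 2.430 | 16919 | 14940 | 0 | 30 | 1847 | 3380 |
| s-bin-ser | 200MHz | 9301 | 2.353 | 7985 | 10863 | 1 | 30 | 1344 | 1759 |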

Note: these results will be updated.
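
As a rough check on the HLS timing numbers, the sketch below works out the timing budget implied by a 200MHz clock with 27% clock uncertainty. It assumes the uncertainty is a fraction of the clock period that is subtracted from it to give the effective target, and that the "post_syn timing (ns)" column is the achieved clock period; the helper name `hls_timing_budget` is hypothetical.

```python
def hls_timing_budget(clock_mhz, uncertainty_fraction):
    """Return (period_ns, uncertainty_ns, effective_target_ns)."""
    period_ns = 1000.0 / clock_mhz
    uncertainty_ns = period_ns * uncertainty_fraction
    return period_ns, uncertainty_ns, period_ns - uncertainty_ns

period_ns, uncertainty_ns, budget_ns = hls_timing_budget(200.0, 0.27)

# Post-synthesis timing values copied from the HLS results in this README.
post_syn_ns = {
    "naive dot-dsp": 2.299,
    "naive dot-fabric": 2.210,
    "s-bin-par": 2.430,
    "s-bin-ser": 2.353,
}

# True for each design whose reported timing fits inside the effective target.
meets_budget = {name: t <= budget_ns for name, t in post_syn_ns.items()}
```

With these inputs the period is 5 ns, the uncertainty margin is about 1.35 ns, and the effective target is about 3.65 ns, which all four reported timings fit within.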

  • Implemented in Verilog and compiled with Vivado 2024.
  • Synthesized at 200MHz. The target part is xczu9eg-ffvb1156-2-e. The adder tree designs all use a seven-stage pipeline. Resource usage and timing are taken from the implementation report.
  • Note: before synthesis and implementation, we run `set_property -name {STEPS.SYNTH_DESIGN.ARGS.MORE OPTIONS} -value {-mode out_of_context} -objects [get_runs synth_1]` in the Tcl Console so that I/O utilization is left out of the report.
  • Startup latency:
    • naive_dot: 7 cycles
    • spike_add_dot: 9 cycles
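
| design | clock | WNS (ns) | sim_latency | CLB LUTs | CLB Registers | CARRY8 | F7 Muxes | F8 Muxes | Startup latency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| naive_dot | 200MHz | 4.021 | 1034 | 4142 | 1286 | 254 | 896 | 128 | 7 cycles |
| spike_add_dot | 200MHz | 3.738 | 1034 | 4647 | 3570 | 314 | 0 | 0 | 9 cycles |
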
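
The worst negative slack (WNS) values can be turned into a rough upper bound on clock frequency. The sketch below uses the common estimate Fmax ≈ 1 / (T − WNS), where T is the 5 ns period of the 200MHz target. It assumes the critical path stays the same at a tighter constraint and ignores device clocking limits, so treat the results as optimistic estimates rather than measured frequencies; `estimate_fmax_mhz` is a hypothetical helper.

```python
def estimate_fmax_mhz(target_mhz, wns_ns):
    """Estimate Fmax from the target clock and the reported WNS (positive = slack left)."""
    period_ns = 1000.0 / target_mhz
    critical_path_ns = period_ns - wns_ns
    return 1000.0 / critical_path_ns

# WNS values copied from the Vivado implementation results in this README.
wns_ns = {"naive_dot": 4.021, "spike_add_dot": 3.738}

fmax_mhz = {name: estimate_fmax_mhz(200.0, wns) for name, wns in wns_ns.items()}

for name, f in fmax_mhz.items():
    print(f"{name}: ~{f:.0f} MHz (estimated)")
```

Under these assumptions the critical paths are about 0.98 ns for naive_dot and 1.26 ns for spike_add_dot, so both designs have large timing margin at 200MHz.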
  • Implemented using Verilog and synthesized using Synopsys Design Compiler (DC) 2022.
  • Target technology: TSMC 28nm.
  • Target Clock: 200MHz.
  • All designs are implemented with a single pipeline stage (Latency = 1 cycle).
  • Power results are reported from DC synthesis (Total, Static, Dynamic).

| Design | Clock | Slack (ns) | Total Cell Area (µm²) | Pipeline | Dynamic Power (mW) | Static Power (mW) | Total Power (mW) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| naive_dot | 200MHz | 3.33 | 12088.44 | 1 | 1.4000 | 0.2430 | 1.6431 |
| gated_dot | 200MHz | 3.30 | 12531.96 | 1 | 1.4289 | 0.2514 | 1.6804 |
| spike_add | 200MHz | 3.31 | 18311.00 | 1 | 2.2921 | 0.3543 | 2.6464 |
| spike_add_optimized | 200MHz | 3.30 | 14866.15 | 1 | 1.7823 | 0.2958 | 2.0781 |
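
To make the ASIC numbers easier to compare, the following sketch normalizes area and total power to the naive_dot baseline. The values are copied from the results above; the choice of naive_dot as the baseline and the variable names are illustrative assumptions.

```python
# (total cell area in um^2, total power in mW), copied from the DC results above.
asic_results = {
    "naive_dot": (12088.44, 1.6431),
    "gated_dot": (12531.96, 1.6804),
    "spike_add": (18311.00, 2.6464),
    "spike_add_optimized": (14866.15, 2.0781),
}

base_area, base_power = asic_results["naive_dot"]

# Ratios relative to naive_dot: 1.0 means equal to the baseline.
relative = {
    name: (area / base_area, power / base_power)
    for name, (area, power) in asic_results.items()
}

for name, (area_ratio, power_ratio) in relative.items():
    print(f"{name}: area x{area_ratio:.2f}, power x{power_ratio:.2f}")
```

On these numbers, spike_add uses roughly 1.5 times the area and 1.6 times the total power of naive_dot, while spike_add_optimized narrows this to roughly 1.2 times the area and 1.3 times the power.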
