This project compares two arithmetic implementations for neural network operations:
- W4A4: Standard 4-bit integer quantization using conventional multiply-accumulate (MAC) units.
- W4ASpike: A novel spike-based accumulation method that decomposes activations into bit-planes and performs shift-add operations, eliminating the need for hardware multipliers.
The comparison evaluates performance, area, and power consumption across both FPGA (Xilinx Zynq UltraScale+) and ASIC (TSMC 28nm) platforms.
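
The bit-plane idea behind W4ASpike can be illustrated with a small software model. This is only a sketch, not the project's RTL or HLS code. It assumes signed 4-bit weights and unsigned 4-bit activations, and the function names `mac_dot` and `bitplane_dot` are hypothetical. Each activation bit-plane selects which weights are added, and the per-plane sum is then shifted by the bit position, so no multiplication is needed.

```python
ACT_BITS = 4  # assumption: unsigned 4-bit activations in [0, 15]


def mac_dot(weights, activations):
    """Reference model: conventional multiply-accumulate dot product."""
    return sum(w * a for w, a in zip(weights, activations))


def bitplane_dot(weights, activations, act_bits=ACT_BITS):
    """Multiplier-free dot product using bit-plane (spike) decomposition."""
    total = 0
    for b in range(act_bits):
        # Plane b: accumulate the weight wherever bit b of the activation is 1.
        plane_sum = sum(w for w, a in zip(weights, activations) if (a >> b) & 1)
        # Weight the plane by 2**b with a shift instead of a multiply.
        total += plane_sum << b
    return total


if __name__ == "__main__":
    w = [3, -2]
    a = [5, 6]
    print(mac_dot(w, a), bitplane_dot(w, a))
```

Both functions return the same value for any inputs in these ranges, because each activation equals the sum of its bits weighted by powers of two.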
- Vector dot product: a 128-dimensional vector dot product executed 1024 times.
- Each functional unit is implemented using high-level synthesis (HLS).
- The target part is xczu9eg-ffvb1156-2-e. The target clock is 200MHz with a clock_uncertainty of 27%. Resource usage is taken from the post-synthesis results reported by Vitis HLS. Latency, resource usage, and timing are reported.
| design | clock | co-sim latency (cycles) | post_syn timing (ns) | LUT | FF | DSP | BRAM | SRL | CLB |
|---|---|---|---|---|---|---|---|---|---|
| naive dot-dsp | 200MHz | 1138 | 2.299 | 6731 | 10131 | 58 | 30 | 407 | 1523 |
| naive dot-fabric | 200MHz | 1131 | 2.210 | 8050 | 8650 | 0 | 30 | 287 | 1665 |
| s-bin-par | 200MHz | 1170 | 2.430 | 16919 | 14940 | 0 | 30 | 1847 | 3380 |
| s-bin-ser | 200MHz | 9301 | 2.353 | 7985 | 10863 | 1 | 30 | 1344 | 1759 |
Note: these results will be updated.
- Implemented in Verilog and compiled with Vivado 2024.
- Synthesized at 200MHz. The target part is xczu9eg-ffvb1156-2-e. The adder tree designs all use a seven-stage pipeline. Resource usage and timing are taken from the implementation report.
- Note: before synthesis and implementation, we run `set_property -name {STEPS.SYNTH_DESIGN.ARGS.MORE OPTIONS} -value {-mode out_of_context} -objects [get_runs synth_1]` in the Tcl Console so that I/O is kept out of the utilization report.
- Startup latency:
- naive_dot: 7 cycles
- spike_add_dot: 9 cycles
| design | clock | WNS (ns) | sim latency (cycles) | CLB LUTs | CLB Registers | CARRY8 | F7 Muxes | F8 Muxes | Startup latency |
|---|---|---|---|---|---|---|---|---|---|
| naive_dot | 200MHz | 4.021 | 1034 | 4142 | 1286 | 254 | 896 | 128 | 7 cycles |
| spike_add_dot | 200MHz | 3.738 | 1034 | 4647 | 3570 | 314 | 0 | 0 | 9 cycles |
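
A seven-stage pipeline is what a binary adder tree over 128 inputs would need if one register were placed after each tree level, since 128 = 2^7. The source does not state how the stages are placed, so this is an assumption. The sketch below is a software model with a hypothetical function name `adder_tree_sum`, not the Verilog used here, and it only counts the reduction levels.

```python
def adder_tree_sum(values):
    """Reduce values pairwise; return (total, number_of_tree_levels)."""
    level = list(values)
    levels = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-length levels with a zero operand
        # One tree level: add adjacent pairs, halving the operand count.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels


if __name__ == "__main__":
    total, depth = adder_tree_sum(range(128))
    print(total, depth)  # prints "8128 7"
```

Under the one-register-per-level assumption, the level count equals the pipeline depth, which is consistent with the 7-cycle startup latency listed for naive_dot.
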
- Implemented using Verilog and synthesized using Synopsys Design Compiler (DC) 2022.
- Target technology: TSMC 28nm.
- Target Clock: 200MHz.
- All designs are implemented with a single pipeline stage (Latency = 1 cycle).
- Power results are reported from DC synthesis (Total, Static, Dynamic).
| Design | Clock | Slack (ns) | Total Cell Area (um^2) | Pipeline | Dynamic Power (mW) | Static Power (mW) | Total Power (mW) |
|---|---|---|---|---|---|---|---|
| naive_dot | 200MHz | 3.33 | 12088.44 | 1 | 1.4000 | 0.2430 | 1.6431 |
| gated_dot | 200MHz | 3.30 | 12531.96 | 1 | 1.4289 | 0.2514 | 1.6804 |
| spike_add | 200MHz | 3.31 | 18311.00 | 1 | 2.2921 | 0.3543 | 2.6464 |
| spike_add_optimized | 200MHz | 3.30 | 14866.15 | 1 | 1.7823 | 0.2958 | 2.0781 |