Performance Comparison: W4A4 vs. W4ASpike

This project compares two arithmetic implementations for neural network operations:

W4A4: Standard 4-bit integer quantization using conventional multiply-accumulate (MAC) units.
W4ASpike: A novel spike-based accumulation method that decomposes activations into bit-planes and performs shift-add operations, eliminating the need for hardware multipliers.
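To make the difference between the two schemes concrete, here is a minimal software sketch of both. It is not the repository's RTL or HLS code: the function names `mac_dot` and `bitplane_dot` are hypothetical, and it assumes signed 4-bit weights (-8..7) and unsigned 4-bit activations (0..15), which the README does not state.

```python
def mac_dot(weights, activations):
    """Conventional multiply-accumulate dot product (W4A4 style)."""
    return sum(w * a for w, a in zip(weights, activations))


def bitplane_dot(weights, activations, act_bits=4):
    """Dot product using only conditional adds and shifts (W4ASpike style).

    Each unsigned activation is split into act_bits one-bit planes. For every
    plane, the weights whose activation bit is 1 are summed (no multiplier),
    and the partial sum is shifted left by the plane index.
    """
    total = 0
    for bit in range(act_bits):
        plane_sum = 0
        for w, a in zip(weights, activations):
            if (a >> bit) & 1:       # the "spike" for this bit-plane
                plane_sum += w
        total += plane_sum << bit    # weight the plane by 2**bit
    return total


# Assumed value ranges: signed 4-bit weights, unsigned 4-bit activations.
weights = [3, -2, 7, -8]
activations = [5, 15, 0, 9]
assert bitplane_dot(weights, activations) == mac_dot(weights, activations)
```

Both functions return the same value because an activation equals the sum of its bits weighted by powers of two, so the multiply can be replaced by one conditional add per bit-plane.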

The comparison evaluates performance, area, and power consumption across both FPGA (Xilinx Zynq UltraScale+) and ASIC (TSMC 28nm) platforms.

Workload

Vector dot product: a 128-dimensional vector dot product executed 1024 times.
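A software golden model of this workload might look like the sketch below. The dimensions come from the description above; the random stimulus, the seed, the helper names (`make_workload`, `golden_dot`), and the operand ranges (signed 4-bit weights, unsigned 4-bit activations) are assumptions made for illustration only.

```python
import random

DIM = 128     # vector length, from the workload description
RUNS = 1024   # number of dot products, from the workload description


def make_workload(seed=0):
    """Yield RUNS pairs of (weights, activations) with assumed 4-bit ranges."""
    rng = random.Random(seed)
    for _ in range(RUNS):
        w = [rng.randint(-8, 7) for _ in range(DIM)]   # assumed signed 4-bit
        a = [rng.randint(0, 15) for _ in range(DIM)]   # assumed unsigned 4-bit
        yield w, a


def golden_dot(w, a):
    """Reference result that a hardware design could be checked against."""
    return sum(wi * ai for wi, ai in zip(w, a))


results = [golden_dot(w, a) for w, a in make_workload()]

total_macs = DIM * RUNS                     # multiply-accumulates in the whole run
worst_case = DIM * 8 * 15                   # largest possible |result| under the assumed ranges
acc_bits = worst_case.bit_length() + 1      # magnitude bits plus one sign bit

print(total_macs, worst_case, acc_bits)
```

Under these assumptions the accumulator width follows from the worst-case magnitude; a different operand encoding would change that figure.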

Comparison metrics

Vitis HLS

Each functional unit is implemented using high-level synthesis (HLS).
The target part is xczu9eg-ffvb1156-2-e. The target clock is 200MHz with a clock_uncertainty of 27%. Resource usage is taken from the post-synthesis results reported by Vitis HLS. Latency, resource usage, and timing are reported.

design	clock	co-sim latency	post_syn timing (ns)	LUT	FF	DSP	BRAM	SRL	CLB
naive dot-dsp	200MHz	1138	2.299	6731	10131	58	30	407	1523
naive dot-fabric	200MHz	1131	2.210	8050	8650	0	30	287	1665
s-bin-par	200MHz	1170	2.430	16919	14940	0	30	1847	3380
s-bin-ser	200MHz	9301	2.353	7985	10863	1	30	1344	1759
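The short script below works through the arithmetic behind these settings and results. It assumes that the clock uncertainty is subtracted from the clock period to give the timing budget, and it uses naive dot-fabric as the baseline for ratios; that baseline is an arbitrary choice for this sketch, not something the README specifies. The values are copied from the table above.

```python
CLOCK_MHZ = 200
UNCERTAINTY = 0.27   # clock_uncertainty of 27%

period_ns = 1000 / CLOCK_MHZ                 # 5 ns at 200MHz
budget_ns = period_ns * (1 - UNCERTAINTY)    # period minus uncertainty margin

# Values copied from the Vitis HLS table.
hls = {
    "naive dot-dsp":    {"latency": 1138, "timing_ns": 2.299, "lut": 6731,  "dsp": 58},
    "naive dot-fabric": {"latency": 1131, "timing_ns": 2.210, "lut": 8050,  "dsp": 0},
    "s-bin-par":        {"latency": 1170, "timing_ns": 2.430, "lut": 16919, "dsp": 0},
    "s-bin-ser":        {"latency": 9301, "timing_ns": 2.353, "lut": 7985,  "dsp": 1},
}

base = hls["naive dot-fabric"]   # assumed baseline for the ratios
ratios = {}
for name, row in hls.items():
    ratios[name] = {
        "latency_x": row["latency"] / base["latency"],
        "lut_x": row["lut"] / base["lut"],
        "meets_budget": row["timing_ns"] <= budget_ns,
    }
    print(f"{name:18s} latency x{ratios[name]['latency_x']:.2f} "
          f"LUT x{ratios[name]['lut_x']:.2f} "
          f"within {budget_ns:.2f} ns budget: {ratios[name]['meets_budget']}")
```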

Vivado Design Suite

Note: results will be updated.

Implemented in Verilog and compiled with Vivado 2024.
Synthesized at 200MHz. The target part is xczu9eg-ffvb1156-2-e. The adder tree designs all use a seven-stage pipeline. Resource usage and timing are taken from the implementation report.
Note: before synthesis and implementation, we run set_property -name {STEPS.SYNTH_DESIGN.ARGS.MORE OPTIONS} -value {-mode out_of_context} -objects [get_runs synth_1] in the Tcl Console so that I/O buffers are not inserted and I/O utilization is not reported.
Startup latency:
- naive_dot: 7 cycles
- spike_add_dot: 9 cycles

design	clock	WNS (ns)	sim_latency	CLB LUTs	CLB Registers	CARRY8	F7 Muxes	F8 Muxes	Startup latency
naive_dot	200MHz	4.021	1034	4142	1286	254	896	128	7 cycles
spike_add_dot	200MHz	3.738	1034	4647	3570	314	0	0	9 cycles

Design Compiler (ASIC)

Implemented using Verilog and synthesized using Synopsys Design Compiler (DC) 2022.
Target technology: TSMC 28nm.
Target Clock: 200MHz.
All designs are implemented with a single pipeline stage (Latency = 1 cycle).
Power results are reported from DC synthesis (Total, Static, Dynamic).

Design	Clock	Slack (ns)	Total Cell Area (um^2)	Pipeline	Dynamic Power (mW)	Static Power (mW)	Total Power (mW)
naive_dot	200MHz	3.33	12088.44	1	1.4000	0.2430	1.6431
gated_dot	200MHz	3.30	12531.96	1	1.4289	0.2514	1.6804
spike_add	200MHz	3.31	18311.00	1	2.2921	0.3543	2.6464
spike_add_optimized	200MHz	3.30	14866.15	1	1.7823	0.2958	2.0781
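As a reading aid for the table above, the following sketch normalizes area and total power to naive_dot and checks that dynamic plus static power matches the reported total. The values are copied from the table; the dictionary layout, the variable names, and the choice of naive_dot as baseline are assumptions of this sketch.

```python
# (area um^2, dynamic mW, static mW, total mW), copied from the ASIC table.
asic = {
    "naive_dot":           (12088.44, 1.4000, 0.2430, 1.6431),
    "gated_dot":           (12531.96, 1.4289, 0.2514, 1.6804),
    "spike_add":           (18311.00, 2.2921, 0.3543, 2.6464),
    "spike_add_optimized": (14866.15, 1.7823, 0.2958, 2.0781),
}

base_area, _, _, base_power = asic["naive_dot"]   # assumed baseline

relative = {}
for name, (area, dynamic, static, total) in asic.items():
    # Sanity check: the reported total should equal dynamic + static,
    # up to rounding in the last printed digit.
    consistent = abs((dynamic + static) - total) < 2e-4
    relative[name] = {
        "area_x": area / base_area,
        "power_x": total / base_power,
        "consistent": consistent,
    }
    print(f"{name:20s} area x{relative[name]['area_x']:.3f} "
          f"power x{relative[name]['power_x']:.3f} "
          f"sum check: {consistent}")
```

With these numbers, spike_add comes out at roughly 1.51x the area and 1.61x the total power of naive_dot, and spike_add_optimized at roughly 1.23x and 1.26x.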
