A pipelined systolic array AI accelerator designed for FP16 matrix multiplication, achieving 1 output per cycle throughput.
The core is a parameterizable systolic array — the SystemVerilog generator can produce any N×N configuration with automatic data forwarding between processing elements. FP16 multiply-accumulate blocks were generated using Vitis HLS.
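The data forwarding between processing elements can be illustrated with a small cycle-level software model. This is only a sketch under stated assumptions, not the project's RTL: it assumes an output-stationary dataflow (each PE keeps its own accumulator), skewed operand injection at the left and top edges, and a non-fused FP16 multiply-accumulate that rounds after both the multiply and the add. The source does not specify any of these details, and the names `to_fp16` and `systolic_matmul` are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a value to the nearest IEEE 754 half-precision number."""
    return struct.unpack("<e", struct.pack("<e", float(x)))[0]


def systolic_matmul(a, b):
    """Model an N x N output-stationary systolic array computing a @ b.

    Assumed dataflow (not taken from the original design):
    row i of `a` enters from the left, delayed by i cycles;
    column j of `b` enters from the top, delayed by j cycles.
    Each PE multiplies the operands it holds, accumulates locally,
    and forwards the operands right / down on the next cycle.
    Returns (result matrix, number of cycles simulated).
    """
    n = len(a)
    acc = [[0.0] * n for _ in range(n)]    # per-PE accumulators
    a_reg = [[0.0] * n for _ in range(n)]  # operands moving left to right
    b_reg = [[0.0] * n for _ in range(n)]  # operands moving top to bottom
    cycles = 3 * n - 2                     # fill + compute + drain for this skew

    for t in range(cycles):
        # Forward operands to the neighbouring PE. Iterating from the far
        # edge inward means each PE reads its neighbour's previous value.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]

        # Inject skewed inputs at the edges; feed zeros outside the valid window.
        for i in range(n):
            k = t - i
            a_reg[i][0] = float(a[i][k]) if 0 <= k < n else 0.0
        for j in range(n):
            k = t - j
            b_reg[0][j] = float(b[k][j]) if 0 <= k < n else 0.0

        # Every PE performs one FP16 multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                product = to_fp16(a_reg[i][j] * b_reg[i][j])
                acc[i][j] = to_fp16(acc[i][j] + product)

    return acc, cycles
```

With this skew, PE (i, j) sees `a[i][k]` and `b[k][j]` in the same cycle `t = i + j + k`, so no PE needs global routing; operands only ever come from the immediate left and upper neighbours. A model like this can also serve as a golden reference when checking RTL simulation output, provided its rounding behaviour is matched to the actual MAC block.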
The accelerator connects to on-chip BRAM through a Xilinx AXI BRAM Controller, bridging the systolic array and memory over a standard AXI4 interface.
Verilator, Yosys, and OpenSTA were used for rapid iteration and early timing analysis during development. Final synthesis and timing closure were done in Vivado, targeting Xilinx FPGAs.
The design sustains 1 output per cycle for matrix multiply operations, and its parameterization lets the array size be scaled to fit the resource budget of the target FPGA.