Parallel Accurate Minifloat MACCs for Neural Network Inference on Versal FPGAs
Damsgaard, Hans Jakob; Hossfeld, Konstantin J.; Nurmi, Jari; Preusser, Thomas B. (2024-12-04)
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Permanent address of the publication:
https://urn.fi/URN:NBN:fi:tuni-202501131336
Description
Peer reviewed
Abstract
Machine Learning (ML) is ubiquitous in contemporary applications. Its need for efficient acceleration has driven vast research efforts into the quantization of neural networks with low-precision numerical formats. Models quantized with minifloat formats of eight or fewer bits have proven capable of outperforming models quantized into same-size integers. However, unlike integers, minifloats require accurate accumulation to prevent the introduction of rounding errors. We explore the design space of parallel accurate minifloat Multiply-Accumulators (MACCs) targeting the AMD Versal FPGA fabric. We experiment with three variations of the multiply-and-shift and adder tree components of a minifloat MACC. For comparison, we apply similar alterations to a parallel integer MACC. Our results show that custom compressor trees with external sign-inversion gates reduce the mean area of the minifloat MACCs by 17.7% and increase their clock frequency by 16.2%. In comparison, custom compressor trees with absorbed partial product generation gates reduce the mean area of integer MACCs by 28.1% and increase their clock frequency by 3.60%. Comparing the best-performing designs, we observe that minifloat MACCs consume 20% to 180% more resources than integer ones with same-size operands without accounting for a conversion back into a floating-point format, and 60% to 300% more resources when including it. Our data enable engineers to make informed decisions in their designs of deeply integrated embedded ML solutions when trading off training and fine-tuning effort vs. resource cost.
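To illustrate the accurate-accumulation scheme the abstract refers to, the sketch below models a multiply-and-shift stage feeding an exact integer accumulation, which is what allows minifloat products to be summed without intermediate rounding. It is a minimal bit-exact software model in Python, assuming a hypothetical 8-bit E4M3-style format (1 sign, 4 exponent, 3 mantissa bits); the format parameters, function names, and the omission of special values are illustrative assumptions, not details taken from the paper.

```python
# Bit-exact sketch of an accurate minifloat MACC (software model).
# Assumed format: 8-bit, E4M3-style -- 1 sign, 4 exponent, 3 mantissa bits.
# Special encodings (e.g., NaN) are ignored for brevity.

EXP_BITS, MAN_BITS = 4, 3
BIAS = (1 << (EXP_BITS - 1)) - 1          # exponent bias, 7 for E4M3

def decode(bits: int) -> tuple[int, int, int]:
    """Decode to (sign, integer significand, exponent) with
    value = sign * significand * 2**exponent."""
    sign = -1 if (bits >> (EXP_BITS + MAN_BITS)) & 1 else 1
    exp = (bits >> MAN_BITS) & ((1 << EXP_BITS) - 1)
    man = bits & ((1 << MAN_BITS) - 1)
    if exp == 0:                           # subnormal: no implicit leading 1
        return sign, man, 1 - BIAS - MAN_BITS
    return sign, man | (1 << MAN_BITS), exp - BIAS - MAN_BITS

def accurate_macc(a_vec: list[int], b_vec: list[int]) -> float:
    """Sum products of encoded minifloat pairs without intermediate rounding.

    Each multiply-and-shift stage aligns its exact integer product to a
    common fixed-point scale; an integer adder tree (here a running sum)
    then accumulates losslessly."""
    min_exp = 2 * (1 - BIAS - MAN_BITS)    # smallest possible product exponent
    acc = 0
    for a, b in zip(a_vec, b_vec):
        sa, ma, ea = decode(a)
        sb, mb, eb = decode(b)
        # exact product, left-shifted into the common fixed-point scale
        acc += (sa * sb * ma * mb) << ((ea + eb) - min_exp)
    # single conversion back to floating point at the very end
    return acc * 2.0 ** min_exp
```

For example, 0x38 encodes 1.0 in this assumed format, so accurate_macc([0x38, 0x38], [0x38, 0x38]) returns exactly 2.0. Because both the mantissa widths and the exponent range of an 8-bit format are tightly bounded, every product fits exactly in a modestly wide fixed-point accumulator, and only one conversion back to a floating-point format is needed at the end; the resource comparison in the abstract quantifies the cost of including that conversion.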
Collections
- TUNICRIS publications [20161]