Parallel Accurate Minifloat MACCs for Neural Network Inference on Versal FPGAs

Damsgaard, Hans Jakob; Hossfeld, Konstantin J.; Nurmi, Jari; Preusser, Thomas B. (2024-12-04)

 

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
doi:10.1109/TCAD.2024.3511343
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-202501131336

Description

Peer reviewed
Abstract
Machine Learning (ML) is ubiquitous in contemporary applications. Its need for efficient acceleration has driven vast research efforts into the quantization of neural networks with low-precision numerical formats. Models quantized with minifloat formats of eight or fewer bits have proven capable of outperforming models quantized into same-size integers. However, unlike integers, minifloats require accurate accumulation to prevent the introduction of rounding errors. We explore the design space of parallel accurate minifloat Multiply-Accumulators (MACCs) targeting the AMD Versal FPGA fabric. We experiment with three variations of the multiply-and-shift and adder tree components of a minifloat MACC. For comparison, we apply similar alterations to a parallel integer MACC. Our results show that custom compressor trees with external sign-inversion gates reduce the mean area of the minifloat MACCs by 17.7% and increase their clock frequency by 16.2%. In comparison, custom compressor trees with absorbed partial product generation gates reduce the mean area of integer MACCs by 28.1% and increase their clock frequency by 3.60%. Comparing the best-performing designs, we observe that minifloat MACCs consume 20% to 180% more resources than integer ones with same-size operands without accounting for a conversion back into a floating-point format, and 60% to 300% more resources when including it. Our data enable engineers to make informed decisions in their designs of deeply-integrated embedded ML solutions when trading off training and fine-tuning effort vs. resource cost.
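The idea of accurate accumulation mentioned in the abstract can be illustrated in software. The following is a minimal Python sketch, not the authors' hardware design: it assumes an E4M3-like 8-bit layout (1 sign, 4 exponent, 3 mantissa bits, IEEE-style bias, subnormals supported, every exponent code treated as finite), and the names `decode` and `exact_macc` are hypothetical. Each product is shifted onto a common fixed-point grid and summed in a wide integer, so no rounding occurs during accumulation.

```python
from fractions import Fraction

# Assumed E4M3-like layout (illustration only): 1 sign, 4 exponent, 3 mantissa bits.
E_BITS, M_BITS = 4, 3
BIAS = (1 << (E_BITS - 1)) - 1   # assumed IEEE-style bias (7 here)
MIN_EXP = 1 - BIAS - M_BITS      # weight of the least significant subnormal bit


def decode(bits):
    """Split an 8-bit code into (signed integer significand, power-of-two exponent)."""
    sign = (bits >> (E_BITS + M_BITS)) & 1
    e = (bits >> M_BITS) & ((1 << E_BITS) - 1)
    m = bits & ((1 << M_BITS) - 1)
    if e == 0:
        # Subnormal: no hidden one, fixed minimum exponent.
        sig, exp = m, MIN_EXP
    else:
        # Normal: hidden one; special values are not modelled in this sketch.
        sig, exp = (1 << M_BITS) | m, e - BIAS - M_BITS
    return (-sig if sign else sig), exp


def exact_macc(a_codes, b_codes):
    """Multiply, shift onto a common fixed-point grid, and accumulate without rounding."""
    acc = 0  # Python integers are unbounded, standing in for a wide accumulator
    for a, b in zip(a_codes, b_codes):
        sa, ea = decode(a)
        sb, eb = decode(b)
        shift = (ea + eb) - 2 * MIN_EXP   # never negative: 2 * MIN_EXP is the smallest product exponent
        acc += (sa * sb) << shift
    return Fraction(acc) * Fraction(2) ** (2 * MIN_EXP)


# 1.0 * 1.0 + 2.0 * (-1.0) + 2**-9 * 2**-9
result = exact_macc([0x38, 0x40, 0x01], [0x38, 0xB8, 0x01])
print(result)
```

In this example the smallest product contributes only 2^-18 next to a partial sum of magnitude 1. A running sum rounded back into the same 8-bit format after each addition could not retain that contribution, whereas the wide integer accumulator keeps it.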
Collections
  • TUNICRIS publications [20161]
Kalevantie 5
P.O. Box 617
33014 Tampere University
oa[@]tuni.fi | Data protection | Accessibility statement
 

 
