Neural Network Accelerator Design for System on Chip
Ammari, Samaneh (2022)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2022-11-02
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202210267912
Abstract
Machine learning (ML) is widely used in applications such as voice recognition, computer vision, image classification, and object detection. Processing data locally, without transferring it to the cloud, protects data privacy, reduces latency, and saves network bandwidth; this approach is called edge computing. Designing a deep learning accelerator suitable for edge devices is challenging, and two units dominate the chip design: on-chip memory, which consumes the most power and area, and the multipliers. This thesis focuses on the latter.
Most machine learning algorithms rely on convolution, which is computed by multiplying input feature maps by weights and accumulating the products. Most deep learning accelerators therefore build on a precision-scalable Multiply-and-Accumulate (MAC) architecture organized as an array of MAC units. This MAC array, and in particular its multipliers, occupies most of the chip area and consumes a large share of the power.
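As a minimal behavioral sketch of this building block (not the RTL developed in the thesis), a single MAC unit multiplies an input activation by a weight and adds the product to a running partial sum each clock cycle; the port names and bit widths below are illustrative assumptions:

```systemverilog
// Minimal behavioral MAC unit: acc <= acc + a * w on every clock edge.
// Widths (8-bit operands, 32-bit accumulator) are illustrative assumptions,
// not the parameters of the thesis design.
module mac_unit #(
    parameter int IN_W  = 8,
    parameter int ACC_W = 32
) (
    input  logic                    clk,
    input  logic                    rst_n,
    input  logic signed [IN_W-1:0]  a,    // input feature map value (multiplicand)
    input  logic signed [IN_W-1:0]  w,    // weight (multiplier)
    output logic signed [ACC_W-1:0] acc   // accumulated partial sum
);
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            acc <= '0;
        else
            acc <= acc + a * w;  // multiply-accumulate step
    end
endmodule
```

A systolic array replicates units like this in a grid and streams activations and weights through them, which is why the multiplier's area and power dominate the array-level totals.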
This master's thesis consists of two parts. First, a new deep learning accelerator architecture is proposed. Second, different multiplier algorithms are explored. The algorithms were implemented in SystemVerilog and synthesized with Cadence tools, with the aim of finding a multiplier with smaller area, lower power consumption, and higher performance. The Braun multiplier, the Booth multiplier, the Baugh-Wooley array multiplier, the Wallace multiplier, the parallel-prefix Vedic multiplier, and the Modified Booth multiplier are implemented. The power consumption, chip area, and performance of each multiplier at different clock frequencies are measured and used to select the optimal one. A precision-flexibility feature is then added to the selected multiplier so that it can perform one 8-bit × 8-bit, two 4-bit × 4-bit, or four 2-bit × 2-bit multiplications. Notably, both the data (multiplicand) and the weight (multiplier) can have different bit widths, such as 1, 2, 4, or 8 bits. For the proposed deep learning accelerator, the area and power of the systolic array are measured and reported. Among all the candidates, the signed flexible Modified Booth multiplier, which supports 2-, 4-, and 8-bit operands, is selected: it occupies 866.461 µm² and consumes 0.653 mW at 1 GHz. The bit-precision-flexible systolic array occupies 283.564 mm² and consumes 223.156 mW at 1 GHz.
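The radix-4 Modified Booth algorithm halves the number of partial products by recoding overlapping 3-bit groups of the multiplier into digits in {-2, -1, 0, +1, +2}, each of which selects 0, ±1×, or ±2× the multiplicand. The following SystemVerilog sketch shows only that recoding rule, with assumed port names; it is not the thesis' precision-scalable implementation:

```systemverilog
// Radix-4 Modified Booth recoding of one overlapping 3-bit multiplier group
// {y[i+1], y[i], y[i-1]} into a partial-product selector. Illustrative sketch
// of the standard recoding table only, under assumed port names.
module booth_r4_encoder (
    input  logic [2:0] group,  // {y[i+1], y[i], y[i-1]}
    output logic       neg,    // negate the selected partial product
    output logic       one,    // select 1x multiplicand
    output logic       two     // select 2x multiplicand
);
    always_comb begin
        unique case (group)
            3'b000, 3'b111: {neg, one, two} = 3'b000; // digit  0
            3'b001, 3'b010: {neg, one, two} = 3'b010; // digit +1
            3'b011:         {neg, one, two} = 3'b001; // digit +2
            3'b100:         {neg, one, two} = 3'b101; // digit -2
            3'b101, 3'b110: {neg, one, two} = 3'b110; // digit -1
        endcase
    end
endmodule
```

Because each encoder covers two multiplier bits, an 8-bit multiplier needs only four such partial products, which is one reason the Modified Booth design compares favorably in area and power against the array-style multipliers evaluated above.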