Design of hardware accelerators for embedded multimedia applications
Brunelli, C. (2008)
Brunelli, C.
Tampere University of Technology
2008
Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tty-200908056870
Abstract
The subject of this work is the design and implementation of hardware components that accelerate computation in a microprocessor-based digital system controlled by a RISC (Reduced Instruction Set Computer) core. A RISC core alone cannot deliver the computational capability needed to meet the requirements of modern applications, especially demanding ones such as audio/video compression, image processing and 3D graphics.
Additional dedicated resources are therefore needed to provide the required performance boost; such resources are referred to as accelerators and come in various forms. In particular, this work focuses on a coprocessor-based approach, which aims at maximizing modularity and portability by adding such accelerators as additional components coupled with a programmable microprocessor.
Moreover, especially in the embedded systems domain, limited area and energy budgets are key design constraints which make the design of such accelerators even more challenging. Such limitations call for innovative and effective solutions. Over the last few years, configurable and reconfigurable architectures have grown in popularity because they are flexible and reusable. An important advantage they provide is that they reuse the same physical portion of a chip at run time for different functionalities (provided that these can be time-multiplexed). This makes it possible to fit additional functionalities in the same chip without increasing the area occupation. Above all, they are flexible in that they can meet the fast-changing needs of the market.
In order of increasing flexibility, the spectrum of computer architectures can be briefly summarized as follows: parameterized, configurable, reconfigurable, dynamically/run-time reconfigurable, programmable. The term reconfigurable has been used quite loosely, sometimes identifying concepts and architectures that differ significantly from each other. One possible interpretation indicates machines which can be reconfigured only statically (and not at run time). Those machines are not real reconfigurable architectures, but rather configurable or customizable architectures which can be changed at design time only. Reconfigurable machines, instead, are characterized by dedicated memory cells that store special bits (the so-called configware) used for the run-time or off-line reconfiguration of the architecture. Strictly speaking, the configware of reconfigurable machines is changed (possibly) off-line in the field, while in dynamically reconfigurable architectures the configware can be changed at run time, selecting dynamically which parts of the circuits must be used by the current calculations and which ones are unused, how the hardware resources are connected to each other, and so on.
Reconfigurable hardware can be further split into coarse-grain and fine-grain, depending typically on the bit-width of the datapath and on the complexity of the elementary resources which can be selected and interconnected to build a physical circuit implementing a given logical function. The simpler the building logic blocks, the finer the grain. A typical example of fine-grain reconfigurable machines is given by FPGAs (Field Programmable Gate Arrays): their elementary building blocks are usually relatively simple and depend on the manufacturer and on the model. Typical elementary blocks found inside FPGA devices are small dedicated memories used as lookup tables (LUTs) to implement simple logical functions (usually the locations inside those memories are just a few bits wide), small multiplexers, and carry-chain devices. FPGAs oriented to high-end performance also contain dedicated components such as entire multipliers and adders (the so-called Digital Signal Processing (DSP) blocks). Such components represent an attempt to bridge the gap between coarse-grain and fine-grain reconfigurable machines (hybrid granularity). Unlike fine-grain reconfigurable architectures, there is no typical example of coarse-grain machines. In this thesis the author tries to give a comprehensive report of such architectures, highlighting the ones which turned out to be particularly meaningful and successful.
Fine-grain architectures usually make more efficient use of the available logic resources when implementing a given logic function. Coarse-grain architectures, on the other hand, show several advantages such as energy efficiency and ease of programming, since they match the data abstraction present in high-level programming languages. Another key point which distinguishes them from fine-grain machines is that, thanks to the large and complex logic blocks inside their architecture, they are especially suitable for a very efficient implementation of number-crunching applications and computationally heavy tasks. Such applications would incur a significant overhead in terms of interconnection usage and latency when mapped onto fine-grain machines, which are typically better suited to other application domains.
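The role of a lookup table as a fine-grain building block can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the helper names make_lut and lut_read are hypothetical and do not come from the thesis, and real FPGA LUTs differ in size and wiring between manufacturers. The sketch shows that the same small memory implements different logic functions depending only on the bits stored in it.

```python
def make_lut(func, n_inputs):
    """Return the truth table (the stored bits) of an n-input Boolean function.

    Hypothetical helper: input i of the function is bit i of the address.
    """
    table = []
    for address in range(2 ** n_inputs):
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_read(table, *inputs):
    """Evaluate the function by using the inputs as a memory address."""
    address = 0
    for i, bit in enumerate(inputs):
        address |= (bit & 1) << i
    return table[address]


# The same 3-input LUT structure, loaded with two different contents,
# yields the sum and carry outputs of a one-bit full adder.
sum_lut = make_lut(lambda a, b, c: a ^ b ^ c, 3)
carry_lut = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)
```

Changing the function requires rewriting only the eight stored bits, not the surrounding circuitry, which is the sense in which such a block is reconfigurable.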
The research presented in this thesis consists mainly of the design space exploration of two coprocessors which can be used as accelerators for RISC microprocessors.
The first accelerator introduced is a floating-point unit (FPU), while the second one is a coarse-grain reconfigurable machine. Their design space was thoroughly explored using a parametric, synthesizable VHDL (Very High Speed Integrated Circuits Hardware Description Language) model, which was implemented on ASIC (Application Specific Integrated Circuit) standard-cell technologies and on FPGA devices. The FPGA implementation served at first as a fast prototyping platform, and later also became a platform for running real-world applications such as MP3 and H.264 decoders.
When this work started, the main specification was to create an open source VHDL description. This led to specific choices such as adopting the IEEE-754 (Institute of Electrical and Electronics Engineers) standard for floating-point arithmetic, which is nowadays widely accepted. For the same reason the code is as plain and straightforward as possible. A thorough exploration of the possible architectural choices was made, eased also by the parametric nature of the VHDL code created. A series of different implementations led to a comprehensive description of the trade-off between area and performance, especially in relation to architectural variations. A tool based on a Graphical User Interface (GUI) for the simulation and debugging of the execution within the FPU was also developed.
The second part of this work describes the architecture of a coarse-grain reconfigurable machine named Butter. This machine was initially meant to be an accelerator for running multimedia applications, such as audio/video processing, on a digital system based on a RISC processor. Next, the range was broadened to other application domains such as image processing, Global Positioning System (GPS) signal acquisition and tracking, and 3D graphics, enabled by the introduction of special architectural features such as support for subword and floating-point operations, which represent an absolute novelty in the field of coarse-grain reconfigurable machines. A GUI-based tool was developed for the automatic generation of the configware used to configure Butter. An entire System on Chip (SoC) featuring a 32-bit RISC core, the Milk FPU and Butter takes 65173 Adaptive Look-Up Tables (ALUTs) on a Stratix II EP2S180 FPGA device, and runs at 34 megahertz (MHz). Using Butter and Milk, some algorithms achieve a speed-up of one to two orders of magnitude compared to their corresponding software implementation.
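As a reminder of what adopting IEEE-754 implies for the data format, the following Python sketch splits a number into the sign, exponent and fraction fields of the single-precision (binary32) encoding. It assumes single precision purely for illustration, since the abstract does not state which precisions the FPU supports, and the function name decode_float32 is hypothetical.

```python
import struct


def decode_float32(x):
    """Split x into the (sign, biased exponent, fraction) fields of binary32.

    Assumption for illustration: single precision, i.e. 1 sign bit,
    8 exponent bits (bias 127) and 23 fraction bits.
    """
    # Pack as a big-endian 32-bit float, then reinterpret as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction
```

For example, -2.5 equals -1.25 x 2^1, so its biased exponent is 128 and its fraction field holds 0.25 x 2^23.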
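The idea of a subword operation can be sketched in Python as follows. This is a generic illustration, not a description of Butter's actual datapath: the function name add_subwords, the lane widths and the wrap-around behaviour on overflow are all assumptions made for the example. A 32-bit word is treated as several independent narrower lanes, and no carry propagates from one lane to the next.

```python
def add_subwords(a, b, lane_bits=8, word_bits=32):
    """Add two words lane by lane, with no carry crossing lane boundaries.

    Assumption for illustration: each lane wraps around on overflow.
    """
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, word_bits, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        # Discard the carry out of the lane before placing it in the result.
        result |= (lane_sum & mask) << shift
    return result


# Four 8-bit additions packed in one 32-bit word.
packed = add_subwords(0x01FF10F0, 0x01012020)
```

With lane_bits=16 the same function performs two 16-bit additions instead, which mirrors how one wide functional unit can serve several narrow data types common in multimedia code.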