Approximation Opportunities in Edge Computing Hardware: A Systematic Literature Review

HANS JAKOB DAMSGAARD, ALEKSANDR OMETOV, and JARI NURMI, Tampere University, Finland

With the increasing popularity of the Internet of Things and massive Machine Type Communication technologies, the number of connected devices is rising. However, while enabling valuable effects on our lives, bandwidth and latency constraints challenge Cloud processing of their associated data amounts. A promising solution to these challenges is the combination of Edge and approximate computing techniques that allows for data processing nearer to the user. This paper aims to survey the potential benefits of these paradigms' intersection. We provide a state-of-the-art review of circuit-level and architecture-level hardware techniques and popular applications. We also outline essential future research directions.

CCS Concepts: · General and reference → Surveys and overviews; · Computing methodologies; · Computer systems organization → Reconfigurable computing; Architectures; Distributed architectures; · Hardware;

Additional Key Words and Phrases: approximate computing, edge computing

The authors gratefully acknowledge funding from the European Union's Horizon 2020 Research and Innovation programme under the Marie Skłodowska-Curie grant agreement No. 956090 (APROPOS: Approximate Computing for Power and Energy Optimisation, http://www.apropos-itn.eu/). Authors' address: Hans Jakob Damsgaard, hans.damsgaard@tuni.fi; Aleksandr Ometov; Jari Nurmi, Tampere University, Electrical Engineering Unit, Korkeakoulunkatu 1, Tampere, Finland, 33720.

1 INTRODUCTION

Historically, the invention of the smartphone has enabled near-immediate connection of people across the globe and vastly increased the amount of data being produced [13, 26]. The growing interest in the Internet of Things (IoT) and massive Machine Type Communication (mMTC) technologies indicates that this trend is not slowing down. It is key to note that aggregating these data implies considerable computing demands that were not met outside the Cloud until recently, when rapid developments in computing hardware have distributed high-performance devices at the Edge of the Internet [197]. This distribution enables new types of data processing away from the Cloud, nearer to the users. The fact that more computations imply greater energy consumption is troublesome in small battery-driven devices such as smartphones and wearables [121]. Many applications in this domain also fail to optimally exploit their error resilience. Instead, they utilize high-precision computations that cause additional energy consumption. Here lies an evident improvement opportunity, as users expect low-latency operation and good enough quality content in a trade-off for long enough battery life [129]. There is, thus, an important balance of time, quality, and energy to be identified. Computing, as we know it, is based on previous research efforts that have focused on the time-energy trade-off and led to advances in high-capacity battery technologies, efficient (application-specific) processor design, algorithms, and low-overhead communication. Time has, nevertheless, shown that these techniques alone do not suffice. Moreover, we are reaching the end of helpful historic trends like Moore's law and Dennard scaling [46]. Thus, radical changes to the computing paradigm are needed to address future increases in the number of
connected devices and computing demands [169]. We expect two techniques to be crucial in this transition: offloading and Approximate Computing (AxC). While the former of these, offloading, is well-known from Cloud computing, in which computations are offloaded from constrained, often battery-driven devices to powerful, always-connected servers, the latter, AxC, in which computations are performed approximately to save power, energy, or latency, is yet to be fully explored. Combining the two is expected to enable the time-quality-energy trade-off mentioned above. Let us first understand these techniques in more detail.

The offloading technique was historically motivated by the low performance but good connectivity of battery-driven devices. These features effectively forced these devices to focus their energy consumption on capturing sensor data. Thus, rather than processing data locally, the devices would offload these tasks to another, more powerful computer [8]. Today, offloading is essential in three computing paradigms: Cloud, Fog, and Edge computing, which differ in where data is stored and processed. Cloud computing manages all data in large, centralized data centers. Fog computing is a term coined by CISCO describing an intermediate step between Cloud and local computing utilizing so-called Fog nodes [22]. Finally, in Edge computing, data is processed on the nearest Edge node [155]. The Fog and Edge computing terms are often used synonymously, and we will consider them as one. We provide an overview of the computing paradigms and their relations in Fig. 1. It displays examples of devices and their current and projected connections, as well as the compute power and expected approximation benefits in each paradigm. We motivate this expectation by the fact that as more data is efficiently processed (approximately) near the End or Edge layers, less data will incur the communication overheads arising from routing through many network layers, reducing overall system energy consumption [51, 134]. The Cloud can be considered a near-infinite source of computational power relative to data-generating end devices, but communicating with it comes at a high cost of latency and power consumption. The longer networking distances also imply that more links are shared with other devices, causing network backbone contention. Hence, this paradigm does not scale with the increasing number of connected devices and data amounts [22]. Yet, as offloading remains beneficial for nearly all compute-heavy tasks [6, 51], research has recently focused on the Edge computing paradigm as a potential solution. While there exist many definitions of which devices constitute the Edge, the common factor is that they possess better computational capabilities than end devices such as wearables, sensor nodes, etc. As such, smartphones, laptops, desktop PCs, and distributed servers (e.g., at base stations) can be considered Edge devices [18], as shown in Fig. 1.
The transition to Edge computing, and its effect of moving computations and content closer to end users, is interesting for multiple reasons. Firstly, the data-generating devices retain the benefits of offloading, including shorter-range communication. Secondly, communication passes fewer links as less traffic will reach the Cloud, reducing network backbone contention. Finally, data security becomes more easily manageable as data resides in fewer shared links and devices [155].

Fig. 1. Overview of the different computing domains and some of their characteristics.

In parallel to this transition, applications have become more user-centric and solve more complex, less well-defined problems. They challenge the historical notion of application output correctness by instead producing acceptable outputs. Examples of such are multimedia applications, whose outputs are consumed by humans with inherently low sensitivity to noise [44], and Recognition, Mining, and Synthesis (RMS) applications, which aggregate large amounts of data and produce some useful output [33]. Both are characterized by being insensitive or resilient to computational errors [169]. This characteristic has motivated and remains at the core of the AxC paradigm, enabling energy or latency savings. In AxC, approximations are applied at different levels of a system to achieve reductions in computational complexity, memory demands, or communication bandwidth [169]. Traditional techniques fall into two classes: computing on approximate data and computing on unreliable hardware [14]. Computing on approximate data essentially means applying approximations at the application or architecture level; the textbook example is quantization of floating-point operations to simpler fixed-point operations. Although quantization is a powerful technique, it should be supplemented by orthogonal techniques such as loop perforation, message skipping, neural approximation, etc. to take full advantage of an application's error resilience [59, 105]. Contrarily, computing on unreliable hardware means introducing errors at the circuit level. This can be done using faulty circuits that would traditionally be discarded due to their inherently non-deterministic behavior but may be used in approximate systems instead, assuming that the detected faults introduce acceptable errors. Alternatively, designers can manually introduce faulty behavior by over-scaling operating voltage or frequency or by using reduced-voltage Static RAM (SRAM) and reduced refresh-rate Dynamic RAM (DRAM) [10, 105]. Recent research efforts have focused on developing cross-layer techniques to maximize the benefits from approximation while satisfying given quality constraints [120]. Granted that this is still a relatively young research field, existing results are promising [145, 171]. Unfortunately, despite their common goals of achieving energy savings and latency reductions, research into offloading and AxC has mostly proceeded in separate fields, underlined by extensive works on either topic but a lack of works on their intersection. The techniques are orthogonal and expected to complement one another well. Specifically, their combination allows for the execution of applications with varying accuracy at different levels of the network. The potential benefits of this include improved computing efficiency and infrastructure flexibility.
In this paper, we focus on the combination of offloading and AxC and provide a systematic literature review of circuit-level and architecture-level approximation techniques and applications of them to areas relevant to Edge computing. We note that several surveys on AxC already exist, but none have a particular focus on Edge computing. The interested reader can refer to [10, 32, 33, 105, 169, 187] for introductions to the topic. Our main goals and contributions are:

G1 to give an overview of recent works that implement and evaluate circuit-level and architecture-level AxC techniques,
G2 to survey works that utilize these techniques in Edge-related applications, and
G3 to identify open challenges and research directions in the intersection of the AxC and Edge computing paradigms.

The remainder of the paper is structured as follows (also shown in Fig. 2). The next three sections present the reviewed works according to our classification, detailed in Appendix A. First, Sec. 2 covers works on fundamental AxC techniques; second, Sec. 3 summarizes works on AxC-enabled hardware architectures; and third, Sec. 4 presents applications which can benefit from offloading and AxC. In Sec. 5, we contribute by highlighting challenges and future research directions that have become apparent through the review. Sec. 6 concludes the review. Appendix B presents tabulated, short summaries of all works from Secs. 2, 3, and 4.

Fig. 2. Overviews of the topics covered in this paper, the number of papers in each category (parenthesized), and the structure of the remaining sections.

2 FUNDAMENTAL APPROXIMATE COMPUTING TECHNIQUES

This section reviews fundamental AxC techniques. We first consider circuit-level techniques before moving to designs of arithmetic circuits, stochastic computing, function approximation, and other general techniques, as shown in Fig. 3. The techniques presented here serve as a basis for the architectures and application-specific designs reviewed in the following sections. We provide an overview of commonly used benchmark applications in Tab. 2 and short summaries of all works in the appendix, see Tab. 6. Note that we do not report any numerical results in text due to the difficulty of comparing different evaluation strategies, yet we display select metrics in plots. Regardless, we attempt to provide a qualitative comparison based on the numerical data found in the literature.

Fig. 3. Coarse classification of publications on fundamental AxC techniques.

2.1 Circuit-level Techniques

Traditional digital hardware design focuses on circuits meant to operate in an error-free manner, which is ensured by following process-specific design parameters determining, among others, operating voltage and frequency ranges and their respective safety ranges. However, the error-free operating points are not necessarily the most energy efficient. Recall that power consumption in digital circuits consists of two parts: static power consumption $P_{stat} \propto I_0 V_{DD}$ from leakage current $I_0$ through open transistors and dynamic power consumption $P_{dyn} = \alpha C V_{DD}^2 f$ from charging and discharging of a circuit's intrinsic capacitances [10]. For most designs operating at nominal conditions, dynamic power consumption is dominant [43, 53], and, thus, the greatest gains can be achieved by reducing the operating voltage $V_{DD}$ or frequency $f$, the switched capacitance $C$, or the circuit activity $\alpha$.
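To make these scaling relations concrete, the short sketch below evaluates both power terms for a hypothetical circuit; all constants (leakage current, capacitance, activity, frequency) are illustrative placeholders, not process data.

```python
# Illustrative evaluation of the CMOS power model above; all constants
# are made up for demonstration and do not describe any real process.

def static_power(i_leak, v_dd):
    # P_stat grows linearly with supply voltage.
    return i_leak * v_dd

def dynamic_power(alpha, c, v_dd, f):
    # P_dyn grows quadratically with supply voltage.
    return alpha * c * v_dd ** 2 * f

nominal = static_power(1e-3, 1.0) + dynamic_power(0.2, 1e-9, 1.0, 1e9)
scaled = static_power(1e-3, 0.8) + dynamic_power(0.2, 1e-9, 0.8, 1e9)
print(f"power at 80% V_DD: {scaled / nominal:.1%} of nominal")
# The quadratic V_DD dependence makes voltage the most attractive knob,
# which is what motivates the over-scaling techniques reviewed next.
```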
Deviating from these conditions in a controlled way enables operating with a low error rate but a high impact on power consumption; for example, reducing voltage leads to linear and quadratic reductions in static and dynamic power consumption, respectively. We review six works on this topic.

2.1.1 Voltage over-scaling. Two works consider so-called Voltage Over-Scaling (VOS), a technique in which the operating voltage is reduced below its safety margin. Moons and Verhelst [109] propose combining truncation with non-destructive voltage scaling. By truncating operands to a circuit, only parts of its critical paths will be in use, effectively decreasing its latency, an effect that lowering the operating voltage can counter. However, the technique does not trivially extend to pipelined designs as their critical paths are not necessarily shortened by truncation. The authors provide two solutions: implementing additional bypassable pipeline registers or multiplexing different circuit paths to the existing pipeline registers. Either solution introduces an overhead proportional to the number of voltage-accuracy configurations wanted. Evaluated on a Discrete Cosine Transform (DCT) algorithm for Joint Photographic Experts Group (JPEG) compression, the technique can halve energy consumption with negligible impact on image quality. Tziantzioulis et al. [166] focus on approximate voltage over-scaled Function Units (FUs) in traditional processor architectures. They describe how FUs are often under-utilized and propose to exploit this feature to let operations issued to the approximate units execute beyond their nominal delay, collecting their results only once equivalent operations are issued to them. This way, the results have more time to converge towards their exact values, reducing experienced error magnitudes in low-activity code regions. They control the approximation with a custom instruction and implement the hardware necessary for lazy write-back and forwarding. The authors demonstrate reduced error rates in three multimedia applications. A related work by Tagliavini et al. [163] proposes a cache architecture implemented partly in SRAM and partly in Standard Cell Memories (SCMs) designed to operate at Near-Threshold Voltage (NTV), an extreme case of VOS. Operating circuits at NTV reduces both dynamic and static power consumption but makes them more vulnerable to erroneous effects from process and temperature variations [10]. The authors illustrate this with their proposed design, exploiting that while SCMs are more costly in terms of area, they are more reliable at lower voltages than SRAM, enabling compiler-guided placement of MSBs in reliable SCM and LSBs in unreliable SRAM. They demonstrate reduced energy consumption in the cache and the System on Chip (SoC) that comprises it across several applications.

2.1.2 Approximate memories. Another two works also explore the use of unreliable memories. Shoushtari et al. [156] describe that typical SRAMs utilize costly redundancy techniques and are kept at high voltage to guarantee error-free data retention, leading to over-design and high static power consumption. Eliminating some of the redundancy and selectively lowering the operating voltage can lead to reduced power consumption at the cost of some randomly induced errors. They illustrate their proposal with a cache architecture that implements some unreliable ways, a programmer-controlled criticality-aware replacement policy, and hardware to disable ways that introduce too many errors. Evaluated on two media applications, they show greatly reduced leakage energy at little quality and throughput degradation.
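As a rough behavioural model of this significance-aware placement, the sketch below stores a word's MSBs in reliable cells and lets its LSBs flip with a small probability, in the spirit of [156, 163]; the word width, split point, and flip probability are illustrative assumptions.

```python
import random

def read_word(word, n_reliable, p_flip, width=16):
    """Model a read of one word whose (width - n_reliable) LSBs sit in
    unreliable (e.g., voltage-scaled) cells that each flip with p_flip."""
    noisy = word
    for bit in range(width - n_reliable):  # only the LSBs are at risk
        if random.random() < p_flip:
            noisy ^= 1 << bit
    return noisy

random.seed(0)
errors = [abs(read_word(0x1234, n_reliable=8, p_flip=0.01) - 0x1234)
          for _ in range(10_000)]
print("mean absolute error:", sum(errors) / len(errors))
# Because only the 8 LSBs can flip, the worst-case error is bounded by
# 255 regardless of how aggressively the unreliable cells are scaled.
```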
Orthogonally, Ganapathy et al. [53] consider embedded DRAMs that are gaining popularity as a higher-density alternative to SRAMs, but whose predictability and data retention time scale poorly with smaller process technologies. To counter these effects, the DRAMs require frequent, costly refresh operations to function reliably. The authors propose relaxing refresh requirements while accepting a low probability of errors, reducing dynamic power consumption and improving availability (as the memory is less busy refreshing itself). Using this technique, they show how different error probabilities affect the output quality of two Machine Learning (ML) applications.

2.1.3 Alternative logic. Lastly, Yang et al. [192] propose implementing approximate designs in adiabatic logic, a clock-powered alternative to Complementary Metal-Oxide Semiconductor (CMOS) logic that lengthens signal transitions to reduce dynamic power consumption. The authors present two approximate adders and evaluate them at different operating frequencies, reporting an order of magnitude lower power consumption compared to CMOS equivalents.

2.2 Arithmetic

Having reviewed circuit-level techniques, it is relevant also to get an overview of AxC applied at a higher level of abstraction. As such, we review and classify 27 works on approximate arithmetic into three main groups based on the operations they focus on. Whenever relevant, we provide pointers to the aforementioned circuit-level techniques.

2.2.1 Adders. We first review nine works that present approximate adders, distinguishing designs based solely on inexact full adders from segmented, partially-exact designs with or without error compensation. The works presenting inexact full adder designs all focus on Ripple-Carry Adders (RCAs), which are characterized by being slow but area-efficient, both traits being results of their single-bit carry propagation [7]. Recall that a full adder takes three inputs: two operand bits $a$ and $b$ and a carry-in bit $c_{in}$, and calculates a sum bit $s = a \oplus b \oplus c_{in}$ and a carry-out bit $c_{out} = a \cdot b + a \cdot c_{in} + b \cdot c_{in}$. Approximate full adders can, thus, be classified by their outputs being A1) exact sum $s$ and inexact carry $\hat{c}_{out}$, A2) inexact sum $\hat{s}$ and exact carry $c_{out}$, or A3) inexact sum $\hat{s}$ and inexact carry $\hat{c}_{out}$. Yang et al. [194] first propose three different full adder designs: two of type A2 and one of type A3, all implemented with pass transistor logic, another alternative to CMOS. Later [193], they propose another two designs of type A3 implemented with transmission gates. Allen et al. [7] present four designs: two of type A1 and two of type A2. Dutt et al. [42] present four designs of type A1, two of which coincide with [7]. Lastly, in preliminary work to their more recent one [192], Yang and Thapliyal [191] present two designs, one A1 and one A2, implemented in adiabatic logic. Both [7] and [42] motivate their design decisions by combining knowledge about logic gate design parameters and signal probabilities derived from extensive analyses. The five works evaluate their designs differently, but all achieve improvements in one or more key metrics: power [7, 42, 191, 193, 194], area [42, 194], or error [42, 193]. Unfortunately, the results do not clarify whether approximating the sum bit $s$ or the carry-out bit $c_{out}$ is most beneficial.
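To see how a single inexact cell propagates through a word-level adder, the following sketch models a hypothetical type-A1 cell (exact sum, carry approximated as a single AND gate) inside an 8-bit RCA and measures its mean error distance over all operand pairs; the cell is a deliberately simple stand-in, not one of the reviewed designs.

```python
def exact_fa(a, b, cin):
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)

def approx_fa_a1(a, b, cin):
    # Type A1: exact sum, approximate carry. Here c_out ~= a AND b,
    # dropping the propagate terms; an illustrative choice only.
    return a ^ b ^ cin, a & b

def rca(x, y, fa, bits=8):
    carry, out = 0, 0
    for i in range(bits):       # single-bit carry propagation
        s, carry = fa((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out

# Mean error distance over all 8-bit operand pairs.
total = sum(abs(rca(x, y, approx_fa_a1) - ((x + y) & 0xFF))
            for x in range(256) for y in range(256))
print("mean ED:", total / 256**2)
```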
Focusing on segmented adder architectures, we find that works fall into two categories depending on the type of adder they implement. Dutt et al. [42] retain their focus on RCAs, proposing to split the adder into two parts, approximate the least-significant part, but recover quality using error compensation logic. Dalloo [35] proposes a feedback-based error-correction scheme with the same idea in mind. Kim et al. [83] propose a carry-lookahead adder whose intermediate carries are approximated by considering only some less significant bits. Yang et al. [189] also present a carry-lookahead adder that supports selective masking of certain carries. All four works show improved error metrics, but often at the cost of increased area and delay [42, 83, 189]. Nomani et al. [116] propose several approximate adder designs targeting Field-Programmable Gate Array (FPGA) implementation. All designs map several bits of a sum to a pair of 6-input LUTs, common to most modern FPGAs, thereby removing nearly all carry propagation and achieving single-Lookup Table (LUT) logic depth. Like [83], they recover some accuracy by overlapping adders and show how this has minimal impact on the adder's power consumption, area, and delay.

2.2.2 Multipliers and multiply-accumulators. The second-most popular arithmetic unit to approximate is the multiplier or its combination with an adder, i.e., the Multiply-Accumulate (MAC) unit. We identify ten works that focus on these, with proposals ranging from an inexact compressor to reduced-complexity partial product encodings. Another three works consider larger multiplication-related circuits.

Like adders, multipliers can be implemented in many ways to achieve specific area and delay characteristics. The simplest designs are sequential: they generate partial products one by one and accumulate them over some cycles. Others generate partial sub-products and add them together in a tree-like architecture using compressors [106, 144]. More advanced designs generate all partial products at once and accumulate (compress) them concurrently, typically employing different encodings to reduce the number of partial products needed. The latter category is prominent in current research, as we will see. Zanandrea et al. [199] explore both accurate and approximate versions of these designs, showing that array-based multipliers achieve the lowest power consumption, tree-based multipliers have the lowest delay, and approximate Booth-encoded array-based multipliers produce the smallest errors. Following this classification, Mannepalli et al. [101] propose utilizing approximate RCAs for partial product accumulation in a sequential multiplier. They report reduced area, delay, and power, with no noticeable impact on output quality in an edge detection application. Focusing on array-based multipliers, Moaiyeri et al. [106] and Salmanpour et al. [144] each propose an approximate compressor for partial product compression, one being simplified to only use majority gates and the other being approximated without full adders. Orthogonally, Baba et al. [17] present a compression scheme implemented with incomplete adders whose carries are not propagated within the partial product array. Instead, they are extracted to form an error recovery vector that is later added to the reduced partial product to form the final result using an RCA with maskable carries.
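As a simple software stand-in for such partial-product techniques, the sketch below truncates the k least significant columns of an unsigned multiplier's partial-product array and adds a fixed compensation constant; both the truncation depth and the compensation are illustrative choices rather than any reviewed design.

```python
def truncated_mul(x, y, k, bits=8):
    """Unsigned multiply that discards partial-product bits in the k
    least significant columns, plus a fixed compensation constant."""
    acc = 0
    for i in range(bits):                  # one partial product per bit of y
        if (y >> i) & 1:
            pp = x << i
            acc += pp & ~((1 << k) - 1)    # drop columns 0..k-1
    return acc + (1 << k) // 2             # crude additive error compensation

errs = [abs(truncated_mul(x, y, k=4) - x * y) / (x * y)
        for x in range(1, 256) for y in range(1, 256)]
print(f"mean relative error: {sum(errs) / len(errs):.4%}")
```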
Piri et al. [132] propose input-aware approximation of a multiplier, truncating its partial product array to produce satisfactory errors for a particular input distribution. The three works have somewhat different objectives, but all report improvements in power, area, and error. Another four works consider Booth-encoded array-based multipliers. Shao and Li [154] describe a fixed-width multiplier with exact partial product generation but truncated reduction. Rather than simply discarding the truncated bits, they design a low-overhead error compensation unit that approximates the sum of the bits and generates a signature from certain partial product bits to identify how the error should be compensated. Along the same lines, He et al. [62] propose a probability analysis-based error compensation scheme for approximate partial product compression. Instead of a signature-based solution, they add particular, truncated partial product bits to the LSBs of the accurate partial product array. Both designs achieve reduced area and power at minor errors. While the above two works use exact Booth encoding, it is also possible to approximate the partial product generation. Liu et al. [97] explore this and present two approximate radix-4 encodings that they apply column-wise. Zhu et al. [204] propose a radix-256 encoding that avoids the generation of so-called hard multiples (e.g., $\pm 3x$ requiring a carry-propagate adder) and apply it to the least-significant partial product rows. Approximating only the Booth encoders does not give rise to significant area reductions, but using a higher radix encoding does, as it decreases the number of partial products. As a result, [204] report reduced area, delay, and power compared to both exact designs and that of [154]. The approximate multipliers described above may also be applied in larger arithmetic circuits. Specifically, Imani et al. [64] propose an approximate floating-point multiplier that approximates the product of the mantissas by their sum, selectively re-executing the multiplication exactly if the MSBs of the sum show certain patterns. Osorio and Rodríguez [122] present an approximate Single Instruction Multiple Data (SIMD)-capable multiplier whose partial product array is truncated in a way that ensures reasonable operation in vectorized configurations. Like [62, 154], the multiplier uses exact partial products, but certain bits are implemented with maskable carries so as to enable SIMD operation. Lastly, Xiao et al. [183] implement a MAC unit with partial product compression and product reduction based on approximate compressors. Like [132], they apply approximations matching expected input patterns. All three works report improved energy efficiency with little impact on output quality in different applications.

2.2.3 Others. A handful of works consider other arithmetic operators. Firstly, in addition to their segmented adder, Kim et al. [83] also propose an approximate comparator utilizing their scheme for reduced carry propagation. Compared to its exact counterparts, it achieves a very low error rate but vast improvements in delay and energy consumption. Secondly, two works propose different approximate squarers, i.e., symmetric multipliers whose operands are identical. Shao and Li [154] re-use their array-based design technique and propose a design with signature-based error compensation. More recently, Reddy et al. [136] propose three approximate squarers based on approximate Booth encoding.
Their first design only applies this encoding, the second combines the encoding with approximate adders, and the third adds an error recovery module to the second design. While Shao and Li [154] report a reduced mean error at the same hardware costs as comparable designs, Reddy et al. [136] achieve reductions in area, delay, and error. Thirdly, another two works present approximate dividers. Both propose several inexact subtractors (very similar to full adders) and utilize them in array-based dividers. Chen et al. [29] first present three different subtractors based on pass-transistor logic. Afterward, they compare the effects of different schemes for replacing exact subtractor cells with their approximate ones, reporting that triangle replacement, i.e., column-wise substitution of subtractor cells in the LSBs, leads to the best error characteristics. Jha and Mekie [66] compare CMOS-based subtractors with ones based on pass-transistor logic and conclude that the former perform the best, leading them to propose four CMOS-based subtractors. Both works show reductions in power and energy compared to an accurate divider. Lyu et al. [99] describe how traditional implementations of $n$th roots based on the Coordinate Rotation Digital Computer (CORDIC) algorithm are costly in terms of delay and area. As a solution, they propose using a piece-wise linear approximation combined with segmented and quantized, tabulated approximations of the comprised sub-functions. This proposal enables implementation in a simple, pipelined architecture, which achieves similar error magnitudes as state-of-the-art CORDIC-based architectures, but at a much lower area and power consumption. Finally, most reviewed works fail to account for aggregate errors arising from concatenated approximate arithmetic circuits with the same error polarity. Mazahir et al. [103] propose the concept of self-compensation, canceling out approximation errors by evenly mixing circuits with opposite error polarities, to solve this. They demonstrate their proposal on circuits comprising several adders and multipliers, showing decreased errors at negligible overhead.

2.3 Stochastic Computing

In some instances, resource constraints might prohibit implementing large arithmetic units operating on binary numbers like the ones reviewed above. Bit-serial processing is an alternative design strategy for such cases that drastically reduces area at the expense of longer execution time. Stochastic computing combines bit-serial architectures with AxC by operating on pseudo-randomly generated, finite-length bitstreams, reducing additions and multiplications to simple OR and AND operations. We find six works presenting stochastic computing techniques. Firstly, Seva et al. [153] and Pamidimukkala et al. [127] propose further approximating stochastic computing architectures by truncating the operands used for bitstream generation, reducing the size of the (often costly) Linear-Feedback Shift Registers typically used for this. In their first work, they statically truncate operands, keeping only some MSBs, while in their second work, they dynamically select the bits following the most significant asserted bit. Both techniques achieve good results when evaluated on an edge detection algorithm.
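Before continuing, the following sketch shows the core mechanic all of these designs build on: in the unipolar encoding, multiplying two values in [0, 1] reduces to ANDing their pseudo-random bitstreams. Stream length and seed are arbitrary choices.

```python
import random

def to_stream(p, length, rng):
    # Encode a value p in [0, 1] as a bitstream with P(bit = 1) = p.
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(p, q, length=1024, seed=42):
    rng = random.Random(seed)
    a = to_stream(p, length, rng)
    b = to_stream(q, length, rng)
    # In the unipolar encoding, multiplication is a single AND gate.
    ones = sum(x & y for x, y in zip(a, b))
    return ones / length

print(sc_multiply(0.5, 0.8))   # ~0.4, up to stochastic error
# Longer streams trade latency for accuracy, which is exactly the
# area-versus-time balance bit-serial stochastic designs exploit.
```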
Secondly, two works describe how stochastic computing does not allow for simple implementations of subtractors and dividers. As such, Bharathi et al. [20] propose a subtractor unit for so-called scaled population arithmetic, an approach in which the bitstreams are paired with an exponent to allow for greater numerical range. Their implementation uses only well-established negation and addition operations. Yu et al. [198] describe a counting-based divider that combines deterministic bitstream generation and early termination. These features enable it to significantly outperform similar designs in a contrast stretch application. Both works report reductions in area, delay, power, and error. Lastly, two works combine stochastic computing with traditional binary designs. Specifically, Joe and Kim [70] propose implementing approximate adders split in two, performing exact binary addition on the upper part and inexact stochastic addition on the lower part. The resulting design achieves much smaller errors than a lower-part OR-based approximate adder. Faraji et al. [48] apply a similar technique to constant-coefficient multipliers, utilizing a unary thermometer encoding for the lower part and a multiplexer for selecting an appropriate bias for the upper part before performing traditional binary addition. The thermometer encoding enables approximate multiplication using only wires. The authors demonstrate reduced area for most of their benchmarked configurations.

2.4 Other Techniques

Another eight works present fundamental techniques which do not fall under the above three categories. Regardless, we find they can be grouped into two classes based on their underlying idea: memoization or function approximation.

2.4.1 Memoization. Memoization is a technique often used in software to remember prior computations and later reuse their results. The same technique is applicable in hardware and provides opportunities for approximation. We find four works presenting related techniques: two are static, the others dynamic. While the static techniques tend to have lower overheads, they lack the adaptability of their dynamic counterparts. Later in Sec. 3.1.2, we will see dynamic memoization applied in General-Purpose Processor (GPP) architectures. We first explore the static memoization proposals. Parhami [128] describes hardware complexity and difficulty of error analysis as the main drawbacks of existing AxC techniques. Motivated by advances in memory density, the author proposes implementing table-based approximate arithmetic circuits, which are both area- and power-efficient and enable easy error analysis; an interesting alternative to [103]. Jordan et al. [71] suggest using K-means to produce clusterings of input-output pairs for approximable code regions at compile-time and selecting the best matching instance using nearest-centroid classification at run-time. Both works describe how hierarchical tables allow for fine-grained quality control. The latter reports lower hardware overhead yet similar error metrics compared to a neural approximation approach. Two contrasting proposals focus primarily on dynamic memoization techniques. They both describe how most circuit-level AxC techniques fail to achieve their expected benefits when used in FPGAs, suggesting that memory-based techniques can more efficiently utilize the hardened primitives common to FPGAs. Echavarria et al. [43] explore this idea and report benefits in terms of power and resource utilization.
Sinha and Zhang [157] propose a High-Level Synthesis (HLS) post-processing step that generates and inserts static or dynamic memoization wrappers on given logic blocks. The static wrapper is configured at synthesis time, while the dynamic wrapper stores input-output pairs at regular intervals at run-time. Despite resource overheads, they also report power savings over an exact baseline.

2.4.2 Function approximation. The remaining four works present different approaches to implementing complex function approximation. The iterative nature of software-based implementations often slows down applications that rely heavily on such approximations. Hence, hardware acceleration can provide significant speedup. The reviewed implementations are rather diverse and, thus, difficult to group. del Campo et al. [38] notice that the accuracy of quantized non-linear functions is difficult to manage yet highly impactful. To resolve this, they propose using statically memoized, quantized Taylor expansions produced by an error analysis flow to achieve given error bounds. They demonstrate the efficacy of their proposal by training Neural Networks (NNs) with approximated sigmoid activations. Rust et al. [142] describe a lack of design methods for multi-variable function approximation. Their technique segments a function's input space, performs linear regression within each segment, quantizes the identified weights, and implements the result in a binary tree-style architecture without multipliers. The authors report large area savings for most benchmarked functions compared to similar works.
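A minimal software analogue of this segment-and-fit idea is sketched below, fitting one least-squares line per equal-width input segment; the segment count, sample count, and target function are illustrative, and weight quantization is not modeled.

```python
import numpy as np

def fit_segments(f, lo, hi, n_seg, samples=64):
    """Fit one linear model (slope, intercept) per equal-width segment."""
    edges = np.linspace(lo, hi, n_seg + 1)
    models = []
    for a, b in zip(edges[:-1], edges[1:]):
        x = np.linspace(a, b, samples)
        slope, intercept = np.polyfit(x, f(x), 1)  # least-squares line
        models.append((a, b, slope, intercept))
    return models

def evaluate(models, x):
    for a, b, slope, intercept in models:
        if a <= x <= b:
            return slope * x + intercept  # multiplier-free in hardware if
                                          # the weights are quantized to
                                          # powers of two, as in [142]
    raise ValueError("x outside approximated domain")

models = fit_segments(np.exp, 0.0, 4.0, n_seg=8)
print(evaluate(models, 2.5), np.exp(2.5))  # approximation vs. exact
```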
Similarly, Luong et al. [98] combine Lagrange interpolation with piece-wise linear approximations. Their hardware implementations combine statically memoized weights in LUTs with stochastic computing techniques for addition and multiplication, as seen in Sec. 2.3. As a result, they are comparable to binary implementations in terms of errors while being smaller and consuming less power. Lastly and most recently, Kan et al. [72] propose using bisection NNs, networks with binary tree-like connections between neurons, for regression-based function approximation and present a reconfigurable hardware accelerator for this. They highlight their proposal's flexibility and area savings over other designs, albeit at the cost of greater errors.

Table 1. Select characteristics of the reviewed works on fundamental AxC techniques: class (circuit-level, arithmetic, stochastic, others), evaluation platform, reported KPIs (power, energy, area, delay), and error metrics. AE = Absolute Error, SE = Square Error, ED = Error Distance, RED = Relative Error Distance. Prefixes: (M) = Mean, (R) = Root, and (N) = Normalized.

Table 2. Applications used for evaluating at least two reviewed, fundamental AxC techniques: media applications (edge detection, image smoothing, sharpening, and multiplication, background removal, change detection, DCT for JPEG), ML applications (k-means, k-nearest neighbors, NN training and inference), and DSP tasks, with their references and quality metrics. SNR = Signal to Noise Ratio, SSIM = Structural Similarity, RED = Relative Error Distance, SE = Square Error. Prefixes: (P) = Peak, (M) = Mean, and (N) = Normalized.

In this section, we have reviewed works on fundamental AxC techniques and highlighted how and with which metrics these techniques are evaluated in Tab. 1. The normalized, reported power and area results are summarized in Fig. 4. The entries are ordered from the oldest to the most recent publication. Most of the techniques reduce overall power consumption, with some showing slight increases over comparable systems. Similarly, most reduce overall area, while [142, 163] trade increases in area for improved performance and energy efficiency.

3 APPROXIMATE COMPUTING-ENABLED HARDWARE ARCHITECTURES

Having provided an overview of fundamental AxC techniques, we now review general hardware architectures that implement some of them. We continue to provide a coarse, yet different, classification of works, now into three categories: GPPs, reconfigurable architectures, and Networks on Chip (NoCs), as shown in Fig. 5. The works presented in this section emphasize hardware design, while in the subsequent section, the emphasis is instead on algorithms or one or more specific applications. Nonetheless, many techniques frequently re-appear. We provide an overview of commonly used benchmarks in Tab. 4 and short summaries of all works in the appendix, see Tab. 7. As before, we do not report any numerical results in text due to the difficulty of comparison.

3.1 Approximate General Purpose Processors

Modern GPPs are complex circuits and one of the most commonly used computing platforms, offering a variety of opportunities for approximation. Therefore, hardware and tool support in GPPs is a requirement for a broad adoption of AxC [82, 170]. The reviewed works apply five approximations: inexact arithmetic, memoization, approximate caches, neural approximation, and unreliable control flow. Several works combine these with fundamental AxC techniques.
Fig. 4. Normalized power and area metrics from the works on fundamental AxC techniques, in chronological order from left to right.

Fig. 5. Coarse classification of publications on AxC-enabled architectures.

3.1.1 Arithmetic. The first approximation opportunity in GPPs is their arithmetic and logic operations. This aspect is explored in two works, both exposing quality control in the Instruction Set Architecture (ISA). Venkataramani et al. [170] attribute the limited adoption of AxC in GPPs to the lack of error bounds in typically applied approximations. They present a vector processor comprising a mesh of approximate Processing Elements (PEs) and linear arrays of mixed-precision PEs, all implementing dynamic truncation. Evaluated on a set of compute-intensive applications, they show significant energy reductions. Focusing instead on a traditional Reduced Instruction Set Computer-style processor, Ndour et al. [113] explore the energy-error trade-off from using dynamically truncated integer instructions. Based on results from a test chip, they develop a mathematical model of a core's energy consumption analogous to Amdahl's law, indicating that only small savings can be expected in applications containing little approximable code.

3.1.2 Memoization. Energy consumption in GPPs is often dominated by memory and control instructions rather than arithmetic [82]. Thus, the benefits of approximating only arithmetic instructions are limited, calling for exploring alternatives. One such alternative is memoization, which, in the context of GPPs, exploits approximate similarity between computing instances (e.g., procedure calls or loop iterations) to skip executing one or more instructions. Chandrasekharan et al. [27] bridge the gap between approximate arithmetic and memoization by targeting costly floating-point instructions. They equip the Floating-Point Unit with dynamic memoization tables and enable/disable approximations using a custom instruction that adjusts the number of LSBs to ignore when performing lookups. With these additions, they report a noticeable speedup. Approximately reusing results from individual instructions also only leads to limited performance gains. It is generally more beneficial to memoize and skip multi-instruction instances. This is explored in three other works implementing more involved schemes, all requiring ISA support.
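Before turning to those, the following behavioural sketch captures the lookup idea of [27]: operand mantissa LSBs are masked before probing a memoization table, so nearby inputs share an entry. The table is simplified to a plain dictionary, and the mask width plays the role of the custom instruction's knob; both are illustrative.

```python
import math
import struct

TABLE = {}

def key(x, drop_bits):
    # Reinterpret the float's bits and mask off mantissa LSBs so that
    # nearby operands map to the same table entry.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & ~((1 << drop_bits) - 1)

def approx_call(fn, x, drop_bits=40):
    k = (fn.__name__, key(x, drop_bits))
    if k not in TABLE:           # miss: execute exactly and memoize
        TABLE[k] = fn(x)
    return TABLE[k]              # hit: reuse a nearby exact result

print(approx_call(math.sin, 1.0000), approx_call(math.sin, 1.0001))
# Both calls share one entry; the second skips the computation at the
# cost of a small, mask-width-controlled error.
```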
Firstly, Sato et al. [146] aim to solve two issues: a lack of frameworks for applying AxC to different GPP applications and the large memory-related overheads of software-based memoization. Based on the design in [164], they propose an auto-memoizing processor that automatically memoizes function calls' inputs and results with minimal additions to the original architecture's tables. The approximations are enabled/disabled as in [27]. Secondly, He et al. [61] propose a significance-driven scheme consisting of two modules: a regression module of MAC-based PEs that manages linear models for input significance and logistic models for conditional branches; and a ternary Content Addressable Memory (CAM)-based cache of previous computation instances. Lookups in the CAM use exact branch vectors and significance-weighted, truncated distance measures between computing instances. They manage accuracy by maintaining multiple regression models for different input value domains. Lastly, Kim et al. [82] propose a full ISA extension, Value Similarity eXtensions, with corresponding micro-architectural changes for in-order pipelines. The extension allows for skipping entire sequences of instructions, loop perforation, and memoization, enabled by an extended data cache that compares cache line entries and generates similarity bits accordingly, and an instruction skip controller that extends the core's program counter logic. The skip controller also manages the core's operand selection logic to enable result reuse. Despite its broader scope, the auto-memoizing processor of [146] only achieves speedup similar to the memoized floating-point instructions of [27]. Both the other designs [61, 82] report many times higher speedup with comparable error bounds, albeit in different applications; this is expected, since they reuse larger computing instances.

3.1.3 Memory. Three other works explore approximating cores' accompanying memory hierarchies in different ways. Miguel et al. [104] approximate the last-level cache, exploiting approximate similarity between data cache lines to reduce cache size. Their architecture measures similarity using hashes of cache lines stored in the tag array and allows for associating multiple tag entries with a cache block. The degree of approximation is, thus, controlled by the number of bits used in the hashes; the fewer, the more cache lines are reused. These changes make reads, writes, allocations, and replacements more problematic, but the authors present solutions to them all. Nongpoh et al. [117] instead approximate the cache coherence policy in many-core GPPs by relaxing coherence on approximable, shared data identified by automated sensitivity analysis. With the protocol implemented, a core can avoid reading updated cache lines owned by other cores and invalidating shared cache lines on writes. The proposal is supported in hardware by a custom store instruction, extensions to the GPP's directory-based coherence scheme, and extra cache(s) for recently invalidated cache lines. Zhang et al. [201] suggest solving degrading restore time challenges in future DRAM by approximation. They propose three precision-aware scheduling techniques that dynamically adjust different row segments' restore times, ensuring that some segments operate reliably while others may give rise to random errors. Their schemes can improve performance by cleverly mapping low-significance bits to unreliable segments but require Operating System (OS) support and introduce hardware overheads for each row in the DRAM controller and the memory management unit. Though distinct, the three proposals all trade off other metrics to reduce overall energy consumption: the similarity cache increases execution time [104], which is reduced by the approximated DRAM that in turn introduces randomness [201], and the results of [117] are obscured by unclear implementation details of the extra cache(s).
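A toy model of the similarity detection underlying [104] is given below: cache lines are reduced to short hashes of their high-order bits, and lines whose hashes collide share one data block. The hash construction and line contents are illustrative assumptions, not the reviewed design.

```python
def line_hash(line, bits):
    """Map a cache line (a list of byte values) to a short hash; fewer
    hash bits means more lines collide and are deduplicated."""
    h = 0
    for byte in line:
        h = (h * 31 + (byte >> 4)) & ((1 << bits) - 1)  # keep MSBs only
    return h

lines = [
    [100, 101, 102, 103],   # two approximately similar image lines...
    [101, 100, 103, 102],
    [200, 17, 34, 99],      # ...and one unrelated line
]
blocks = {}                  # hash -> one canonical data block
for line in lines:
    blocks.setdefault(line_hash(line, bits=8), line)
print(f"{len(lines)} lines stored in {len(blocks)} data blocks")
```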
3.1.4 Neural approximation. Yet another three works consider neural approximation as an alternative to memoization, substituting compute-heavy kernels with NNs implemented in small, programmable accelerators [47]. Moreau et al. [110] attribute the limited adoption of NN-based approximation to the complexity and tight integration of historic accelerators. Instead, they propose a decoupled, FPGA-based NN accelerator based on systolic arrays for heterogeneous SoCs. The accelerator is memory-mapped, comprises a set of PEs, and implements microcode-based evaluation for low-overhead run-time reconfiguration over two Advanced eXtensible Interface-based processor-FPGA interconnects. Grigorian et al. [57] aim to improve the reliability of neural approximations in RMS applications. They propose to ensure output quality at run-time through an iterative flow with increasingly complex NNs whose outputs are evaluated by application-specific Light-Weight Checks (LWCs) (like in [67]). To support their technique, the authors describe a simple accelerator architecture comprising a number of PEs with global configuration logic. This architecture is designed particularly for NoC-based accelerator-rich processors. Song et al. [161] target maximizing invocations of approximations given some error bound. Rather than using LWCs, the authors use tiny NNs to classify computing instances as approximable and other NNs to approximate them, increasing their technique's coverage by using multiple of the latter. They present two strategies: one employing cascaded pairs of classifiers and approximators, and one with a multi-class classifier and a corresponding number of approximators. The latter is emphasized for avoiding the multi-classifier overheads of the former and the multi-approximator overheads of [57]. Like the above works, the authors also describe a parallel, PE-based accelerator that supports having both a classifier and several approximators loaded simultaneously. Despite their implementation diversity, all three works use small Multi-Layer Perceptrons (MLPs), all rely on developer annotations, and two implement run-time quality management that re-computes approximations of insufficient quality exactly [57, 161]. They all report significant speedups and energy reductions, albeit evaluated differently.

3.1.5 Control flow. Returning to the core, Nongpoh et al. [118] also explore approximating control flow, using custom instructions to skip roll-back on branch and load-value mispredictions in speculative pipelines. Like in [117], they use automated sensitivity analysis to identify approximation opportunities: first, individually approximable data and branches, and then, jointly approximable branches. They provide the hardware extensions needed to support the custom instructions, and their results show energy and execution time reductions at acceptable error bounds.
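To make the pattern reviewed in Sec. 3.1.4 concrete, the sketch below replaces an "expensive" kernel with a tiny, randomly-featured regressor and re-executes exactly whenever a cheap, LWC-style sanity bound is violated, loosely in the spirit of [57, 161]; the model, training data, and bound are all illustrative placeholders rather than any reviewed design.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(x):                       # the "expensive" exact kernel
    return np.sin(3 * x) + 0.5 * x

# Train a tiny one-hidden-layer approximator (random tanh features with
# a least-squares readout, standing in for the small trained MLPs).
X = rng.uniform(-1, 1, (500, 1))
W = rng.normal(size=(1, 16))
b = rng.normal(size=16)
H = np.tanh(X @ W + b)
w, *_ = np.linalg.lstsq(H, kernel(X[:, 0]), rcond=None)

def approx(x):
    return float(np.tanh(np.array([[x]]) @ W + b) @ w)

def invoke(x, bound=1.6):
    y = approx(x)
    # Light-weight check: |sin(3x) + 0.5x| <= 1.5 on [-1, 1], so any
    # larger prediction is clearly wrong and triggers exact fallback.
    if abs(y) > bound:
        return kernel(x)             # re-execute exactly
    return y

print(invoke(0.3), kernel(0.3))      # approximation vs. exact result
```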
3.2 Approximate Reconfigurable Accelerators

We now turn our attention to generic, reconfigurable accelerator architectures. We review three works that present Coarse-Grained Reconfigurable Array (CGRA)-like designs that balance reconfigurability overheads, performance, and energy efficiency in highly parallel workloads. This renders them suitable for computing environments where bit-level reconfiguration is mostly unnecessary. The three works have many similarities: they consider the CGRA a processor-controlled accelerator, use a centralized configuration unit, support exact operation, and include extended memory architectures for efficiency. Their differences lie primarily in how they integrate AxC. Specifically, Chippa et al. [33] propose a CGRA with run-time quality control, denoted effort scaling, implemented using clock gating-based truncation and VOS (refer to Sec. 2.1.1) of its PEs. A global feedback-driven controller manages the effort scaling by estimating the current output quality and tuning approximations towards given error bounds. The authors use a sensitivity analysis flow for detecting approximable kernels of programs, as in [32]. The two other works implement PEs whose FUs comprise both exact and approximate adders and multipliers (refer to Secs. 2.2.1 and 2.2.2) and select between these using configuration bits. They both rely on developer annotation of approximable functions and their error bounds, and both provide tools for searching for optimal approximation configurations. Nevertheless, the works differ in that Akbari et al. [2] implement PEs with only one approximate adder and multiplier and control the output quality by enabling/disabling subsequent correction units, further specified in [3], whereas Dickerson et al. [41] implement several approximate adders and multipliers with different error characteristics and select between them at run-time. Both works also implement dynamic quality control circuitry that monitors the execution and selects between predetermined operating modes. All three works report significant energy savings and/or power reductions. Notably, the benefits of cross-layer approximations, and VOS in particular, are highlighted in [33]. The works also indicate the necessity of automated tooling for sensitivity analysis.
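A minimal model of such feedback-driven quality control (the effort scaling of [33]) is sketched below: a controller observes an estimated output quality and tightens or relaxes a truncation knob to track a target. The step policy, thresholds, and sampled quality values are illustrative assumptions.

```python
def effort_controller(quality_estimate, knob, target=0.95, max_knob=12):
    """One control step: approximate less when quality is below target,
    approximate more (saving energy) when there is slack."""
    if quality_estimate < target and knob > 0:
        return knob - 1          # back off the approximation
    if quality_estimate > target + 0.02 and knob < max_knob:
        return knob + 1          # more truncation, more savings
    return knob

knob = 6
for q in [0.99, 0.98, 0.93, 0.94, 0.96]:   # sampled quality estimates
    knob = effort_controller(q, knob)
    print(f"quality={q:.2f} -> truncate {knob} LSBs")
# The knob converges towards the most aggressive setting that still
# satisfies the quality target, mirroring the feedback loop of [33].
```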
3.3 Approximate Networks-on-Chips

Modern multi-core processors and SoCs often implement high-speed NoCs for core-to-core and core-to-memory communication. Traditionally, these networks have been wired, but increasing core frequencies have made the interconnect a bottleneck, leading to proposals of wireless on-chip networks for when ultra-low latency operation is needed, e.g., for coherence. Either NoC type offers approximation opportunities, yet we distinguish between the two here, noting that despite the main focus on wireless NoCs in [15] and [49], all six reviewed works include wired NoCs in their proposals. Moreover, the works all agree on their main motivation: poor scaling, both in terms of wire delay and power consumption relative to logic and in terms of packet latency in many-core chips. We first consider techniques for wired NoCs.

3.3.1 Wired NoCs. Five works present AxC techniques for wired NoCs. Firstly, Reza and Ampadu [139] propose truncating packets containing developer-annotated data before transmission and zero-padding them on arrival. The involved network interfaces employ a per-packet error bound that enables low-complexity resilience evaluation, disregarding accumulated errors. Secondly, Ascia et al. [15] propose to selectively voltage over-scale (refer to Sec. 2.1.1) the interconnect links when performing stores and loads of developer-annotated data (listing automatic sensitivity analysis as future work in [16]). Thirdly, Momeni and Shahhoseini [107] extend this concept to 3D NoCs. In addition to the links, they also apply VOS to the routers and truncate packets based on current network congestion. Fourthly, orthogonal to [15] and [107], Najafi et al. [111] aim to improve the reliability of wired links under VOS while avoiding the overheads of Forward Error Correction (FEC) and Automatic Repeat Request (ARQ). They propose two integer value coding schemes that swap and invert wires and eliminate undesirable bit patterns to reduce crosstalk. Doing so makes neighboring wires less likely to stabilize at wrong logic values. Finally, Xiao et al. [185] extend upon [139]. They propose applying dynamic packet dropping and approximation according to offline error and congestion modeling from developer annotations and managing them at run-time using a set of quality controllers. The controllers sample the network congestion and approximation impact of specific packets to drop or approximate low-impact packets. Dropped packets are estimated as the mean of their predecessor and successor packets, while approximated packets are truncated before transmission and zero-padded on arrival. Four of the five works report potential for energy and latency savings [15, 107, 139, 185]; the remaining work [111] instead highlights reduced errors in a set of applications. Neither [139] nor [107] evaluate the errors introduced by their schemes, while [15] demonstrate small application output errors, and the thorough error evaluations of [185] show large savings by nearing, but always satisfying, given error bounds. Finally, we note that [139, Fig. 4] is wrong: latency is shown to be consistently lower when using approximation, yet it has a greater mean than the exact system.
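The truncate-and-pad scheme of [139, 185] can be illustrated in a few lines: approximable payload words lose their LSBs at the sender and are zero-padded at the receiver, shrinking the payload at a bounded per-word error. The word width and truncation depth below are illustrative choices.

```python
def send(words, drop_bits):
    # Sender: strip LSBs from each approximable 32-bit payload word.
    return [w >> drop_bits for w in words]

def receive(payload, drop_bits):
    # Receiver: zero-pad to restore alignment; the dropped LSBs are
    # the bounded, per-word approximation error.
    return [w << drop_bits for w in payload]

words = [0x00C0FFEE, 0x12345678]
drop = 8
rebuilt = receive(send(words, drop), drop)
for orig, approx in zip(words, rebuilt):
    print(f"{orig:#010x} -> {approx:#010x} (error {orig - approx})")
# With 8 of 32 bits dropped, each packet carries 25% fewer payload
# bits while the error per word stays below 2**8.
```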
Four of the five works report potential for energy and latency savings [15, 107, 139, 185]; the remaining work [111] instead highlights reduced errors in a set of applications. Neither [139] nor [107] evaluate the errors introduced by their schemes, while [15] demonstrate small application output errors, and the thorough error evaluations of [185] show large savings by nearing, but always satisfying, given error bounds. Finally, we note that [139, Fig. 4] is erroneous: latency is shown as consistently lower when using approximation, yet its mean is reported as greater than that of the exact system.

3.3.2 Wireless NoCs. Two works focus on wireless NoCs. Ascia et al. [15] propose adapting their scheme for VOS wired links to instead dynamically select between transmission power levels on wireless links (explored in-depth in [16]). Fernando et al. [49] present a more involved technique that extends the architecture of [1], exposing globally-synchronized broadcast memories to the software to enable data sharing via these and their wireless links. Their design measures network contention and uses it to dynamically adapt the transmission protocol to different traffic patterns and enable/disable approximations on developer-annotated data. They support several approximations, including packet dropping, approximate locks, and skipping negligible updates. The two works report significant savings over exact wireless NoCs and approximate wired NoCs; the latter also demonstrates noticeable speedup in several applications [49].

In this section, we have reviewed general AxC-enabled architectures, specifically GPPs, CGRAs, and NoCs. Like in Sec. 2, we have highlighted typically used benchmark applications in Tab. 4 and performance metrics in Tab. 3. Fig. 6 summarizes normalized, reported results in energy and run-time. Essentially all works report improved energy consumption, the exception being [104], which trades off small run-time overheads for reduced energy.

Fig. 6. Normalized energy and run-time metrics from the works on AxC-enabled architectures, in chronological order.

4 APPLICATIONS

After reviewing fundamental AxC techniques and AxC-enabled architectures, we now focus on applications of AxC, classifying works into five categories: Machine Learning, image processing, video processing, reliability, and other applications (see Fig. 7). When relevant, we reference works from the prior sections. We give short summaries of all works in Tab. 8 and continue to refrain from reporting numerical results in text.

Fig. 7. Coarse classification of publications on applications of AxC.

4.1 Machine Learning

The reviewed works reveal that AxC is most frequently applied to ML. This is no surprise, as ML applications are known to be error-resilient and compute-heavy and, thus, provide apparent opportunities for optimization [171]. We also notice a trend toward designing architectures and techniques for the low-power acceleration of ML suitable for Edge computing. In this section, we divide works into three sub-classes: generic accelerators for NNs, accelerators for other ML models, and application-specific accelerators. When possible, we relate these to works in Secs. 2 and 3.1.4.

4.1.1 Generic NN architectures. Different NN models require specific architectures to be efficiently accelerated. As such, we distinguish between accelerators for MLPs or Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), and Spiking Neural Networks (SNNs), of which the former three are well-known algorithms but the latter less so.
As such, we also provide a brief introduction to SNNs before covering related papers.

Multi-Layer Perceptrons and Deep Neural Networks. Kim et al. [80] propose approximating the often unnecessarily costly MACs used in MLP accelerators. They present a design (like [110, 137]) that combines this with on-chip training, synapse skipping, and dynamic bit-width adaptation of small-gradient synapses. Beyond showing significant power and area savings, they describe how training MLPs at lower precision enables inference with more quantized weights.

Dataflow is a crucial concept for DNN accelerators. It determines in which order memory and arithmetic operations occur, often greatly affecting achievable throughput. Three works explore this in different ways, describing PE-based accelerators like in [110, 161]: Tu et al. [165] suggest accelerating individual DNN layers with different dataflow scheduling, specifically full paralleling, neural extension, or computation extension; Reddy et al. [137] focus on optimizing memory accesses to minimize initialization latency and maximize bandwidth utilization, targeting matrix-matrix operations implemented in FPGAs; and Wu and Miguel [182] propose supporting different dataflows with extra multiplexers in the PEs. The former two describe in detail how their architectures implement multi-level memory hierarchies, streaming-interfaced external memories, quantized MACs, and global control units. They evaluate their designs with neural approximation workloads and show improved power efficiency [137, 165]. The latter only present their quantized PE design and evaluate their architecture for DNN-based function approximation, showing improved energy efficiency at the expense of accuracy [182].

Sarwar et al. [145] explore the benefits of cross-layer AxC techniques. They combine pruning with inexact multipliers and VOS embedded SRAMs, giving rise to drastic energy savings. Venkataramani et al. [171] present an industrial perspective on the topic, outlining techniques explored while developing their accelerator (building upon [170] and akin to [33, 165]) and its corresponding tool flow. Their accelerator supports reduced-precision floating-point, integer, and binary operations. Their tool flow enables precision scaling and gradient compression during training.

Convolutional Neural Networks. Another five works focus on CNNs. Castro-Godínez et al. [24] consider the im2col function used to convert convolutions into matrix multiplications. They propose using approximate adders for re-ordering the matrix entries and present an architecture for the operation. Klemmer et al. [84] propose an approximate dot product encoding for binary input operands. They split dot products into a crossbar-style encoding of positive and negative accumulation parts, whose connections are determined by network weights. Multiplying inputs by the two binary-entry accumulators and summing the products gives the final result. Like [80, 145, 171], they apply re-training to minimize accuracy degradation. Both report significant speedup at negligible accuracy degradation. Yang et al. [190] propose dynamically masking some input pixel LSBs to reduce switching activity, passing in only segments whose MSB is the most significant asserted bit when enabled. This allows for maintaining accuracy while saving some energy.
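As a rough model of this masking, the sketch below keeps a fixed-width window of bits starting at the most significant asserted bit of a pixel and zeroes the rest; the window width and function name are our assumptions, not the parameters of [190]:

```python
def mask_pixel(pixel: int, window: int) -> int:
    """Keep a `window`-bit segment starting at the most significant asserted
    bit of `pixel`; zero the remaining LSBs to reduce switching activity."""
    if pixel == 0:
        return 0
    msb = pixel.bit_length() - 1        # position of the leading 1
    low = max(0, msb - window + 1)      # lowest retained bit position
    return pixel & (~0 << low)

print(bin(mask_pixel(0b0011_0111, window=3)))   # -> 0b110000
```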
Lastly, Wang et al. [178] and Gong et al. [56] present architectures for high-throughput CNN acceleration: the former being a SIMD-style architecture with support for sub-word operations, and the latter implementing PEs with approximate adders and LUT-based multipliers. In addition to their architecture, the former present an offline tool flow for adapting networks to quality constraints by exploring filter resizing, pruning, and precision scaling. The latter propose early termination (like [67]) in combination with model compression through reduced filter sizes, weight thresholding, quantization, and optimized convolutions. Both works report improved energy efficiency over other accelerators, albeit without implementation details or when evaluated on a relatively smaller technology node.

Spiking Neural Networks. SNNs are widely considered the third generation of NNs, replacing first-generation threshold-based and second-generation non-linear, continuous activation-based networks such as the above [100]. They are event-based, self-triggered networks whose (leaky) Integrate-and-Fire (IF) neurons model the integration of incoming, asynchronous spike events over time, producing outgoing spikes once a certain threshold is reached, well illustrated in [141, Fig. 3]. Owing to their event-based nature, SNNs are expected to enable highly energy-efficient implementations, yet the most popular hardware platform for ML, Graphics Processing Units, fails to leverage these benefits [141]. Instead, researchers and large corporations have proposed bespoke, integrated hardware implementations; arguably the most famous examples are TrueNorth [4] (a chip of parallel neurosynaptic cores) and Loihi [36] (a chip of configurable neuromorphic cores). Both are highly parallel and capable of simulating millions of neurons at once. Despite encoding spikes in an address event representation, the spikes' binary nature reduces multiplications to simple additions, enabling the architectures to avoid costly multipliers at the expense of more complex control logic [131]. More recent efforts have focused on emerging technology-based designs, particularly in-memory and mixed-signal analog accelerators, both of which are inherently approximate [141].

The architectures of [4] and [36] and their predecessors have undoubtedly inspired new designs, including those reviewed here. Wang et al. [174] present a PE-based accelerator that supports online training of a liquid state machine SNN. Their architecture implements the leaky IF neuron model, uses approximate adders, and power-gates low-activity PEs to reduce switching power. Sen et al. [151] problematize parallelizing acceleration of SNNs owing to their irregular memory access patterns and many negligible-impact neuron updates, inflexibly solved by implementing one PE per neuron in [174]. They propose evaluating neurons in a time-multiplexed manner and assigning each of them an approximation level (high levels meaning normal operation and low levels leading to synapse skipping, as in [80]) that is updated periodically depending on spike activity, enabling early termination in the case of just one active output neuron remaining. Both works report significant energy savings at negligible accuracy degradation over exact baselines.
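As background for the IF model underlying both designs, a discrete-time leaky IF neuron is easy to sketch; the leak factor, threshold, and weights below are arbitrary illustration values. Note how binary spikes turn the weighted integration into conditional additions, mirroring the multiplier-free hardware described above:

```python
def lif_step(v, spikes_in, weights, leak=0.9, threshold=1.0):
    """One time step of a leaky integrate-and-fire neuron: leak the membrane
    potential, integrate weighted incoming spikes, fire and reset on threshold."""
    v = leak * v + sum(w for w, s in zip(weights, spikes_in) if s)  # adds only
    if v >= threshold:
        return 0.0, 1        # reset membrane potential, emit an output spike
    return v, 0

v = 0.0
for t, spikes in enumerate([[1, 0], [1, 1], [0, 1]]):  # binary input spike trains
    v, fired = lif_step(v, spikes, weights=[0.4, 0.5])
    print(t, round(v, 2), fired)                       # fires at t = 1
```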
4.1.2 Other generic ML architectures. NNs are not the only ML models that can benefit from AxC; the feature also applies to Hyperdimensional Computing (HDC) and Support Vector Machines (SVMs). Neither model type appears as frequently as NNs, yet the former is currently gaining traction due to its one-shot learning capabilities and potential for in-memory acceleration [76]. Like for SNNs, we provide brief introductions to the models before covering related works.

Hyperdimensional Computing. As we described before, the underlying idea of SNNs is to accurately model the neuronal activity in biological brains. HDC abstracts away neurons and models distinct concepts as points in a high-dimensional space [73]. Originally motivated by its tolerance against noise from unreliable memory cells, it also has benefits in terms of training time and operation complexity, yet suffers from lower achievable accuracy [54]. In HDC, a basis of random hypervectors is first selected to represent all symbols in a dataset and stored in an item memory. Classifying an input datum involves encoding it by combining basis vectors with three simple operations, namely bundling (element-wise addition), binding (element-wise multiplication), and permutation (shuffling), and comparing the result against learned prototype vectors, each representing an entire class, stored in an associative memory [54]. The high dimensionality and randomness enable learning these prototype vectors from relatively few training examples [73]. As with SNNs, recent efforts have explored in-memory accelerators for classification [76].

The process of searching through an associative memory of prototype vectors is costly, requiring many memory and comparison operations. Imani et al. [65] consider this and propose three CAM-like architectures designed to calculate the distances between a search vector and stored vectors. Their designs are fully digital, partially memristive, and fully analog, and their results reveal the fully analog design as the most area- and energy-efficient. Khaleghi et al. [78] instead propose approximating the distance calculations in FPGA-based designs using inexact adders and majority operators. Moreover, they reduce the encoding complexity by using rotated versions of a single vector instead of multiple unique vectors. The authors show significantly reduced area and energy consumption, even when disregarding this optimization.
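To ground these operations, a toy bipolar-hypervector implementation illustrates bundling, binding, permutation, and the associative-memory lookup; the dimensionality, seed, and bigram-style encoding are arbitrary choices of ours for illustration:

```python
import random

D = 1000                                      # hypervector dimensionality
random.seed(0)
item_memory = {s: [random.choice((-1, 1)) for _ in range(D)] for s in "abc"}

bind = lambda x, y: [i * j for i, j in zip(x, y)]                 # element-wise multiply
bundle = lambda vs: [1 if sum(c) >= 0 else -1 for c in zip(*vs)]  # element-wise majority
permute = lambda x: x[-1:] + x[:-1]                               # cyclic shift by one

def encode(seq):
    """Bind each permuted symbol vector to its successor, then bundle the bigrams."""
    return bundle([bind(permute(item_memory[x]), item_memory[y])
                   for x, y in zip(seq, seq[1:])])

def classify(query, prototypes):
    """Associative memory: return the class with the most similar prototype."""
    return max(prototypes,
               key=lambda c: sum(q * p for q, p in zip(query, prototypes[c])))

prototypes = {"AB": encode("ab"), "BC": encode("bc")}   # learned class vectors
print(classify(encode("ab"), prototypes))               # -> "AB"
```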
Support Vector Machines. SVMs are another type of classifier, designed to work with two classes [34]. They model examples as points in a high-dimensional space and train a soft maximum-margin hyperplane that best separates the two classes [115]. The plane is represented by a combination of weighted training examples known as support vectors [34]. Classification involves computing a decision function using the support vectors and a predefined kernel that is commonly either linear, polynomial, or Gaussian [25]. SVMs generally require only a few support vectors to achieve reasonable accuracy [34] but suffer from long training times to select these as dataset size grows [25].

van Leussen et al. [168] note that SVMs are popular in battery-driven devices that demand high energy efficiency. They propose using quantization and approximate multipliers (like [103]) and present an accelerator architecture that implements these. Interestingly, they notice a greater accuracy degradation from quantization than from approximate multiplications. Zhou et al. [203] extend upon this and utilize approximate adders and multipliers. While the former supports linear and higher-order polynomial kernels, the latter's architecture is limited to Gaussian kernels. Lastly, Ferretti et al. [50] support several kernels and combine approximations by quantizing data and weights and reducing the number of features and support vectors. All three works report energy and area savings over accurate baselines.

4.1.3 Application-specific ML architectures. One application domain appears particularly popular among the reviewed papers: speech recognition. As such, we first cover works in this domain.

Speech recognition. Liu et al. are motivated by the low-power requirements of always-on Voice Activity Detection (VAD) and present two different accelerator designs targeting this. In [92], they propose a DNN accelerator that uses two types of PEs: some implementing approximate multipliers (like in [145]) and some implementing delay-based adders for accumulating partial results. However, their more recent work [89] proposes an alternative digital accelerator operating on features extracted from an analog signal processing flow. They avoid implementing costly multipliers by utilizing Binarized Weight Networks (BWNs). They further approximate their evaluation using dynamic-precision approximate adders and adaptive bit-widths determined by an SNR estimate derived from the inputs. The authors report improved power efficiency and improved classification accuracy over related works.

In addition to their works on VAD, Liu et al. also present several accelerators for Keyword Spotting (KWS). In [93], they propose a PE-based CNN accelerator. They adaptively quantize both inputs and weights during training to find the minimal number of bits needed to achieve satisfactory accuracy. To support these in hardware, they design two types of PEs (like [92]), implementing either digital or approximate voltage-based multipliers. In line with this, they propose a SIMD-style multiplier for quantized CNNs [94]. Their design combines approximate Booth encoding (like in [62, 154]) with approximate adders for partial product reduction. In their other works, they present PE-based accelerators for BWNs. Specifically, in [91], they present a fully digital design that uses fixed-width data and the inexact adder of [89]. The architecture in [86] is very similar, yet it replaces inexact digital adders with analog delay-based ones, mitigating the corresponding effects through a noisy training flow. Lastly, in [87], they extend upon their design from [91] by adding support for dual-voltage operation with error-biased inexact adders. In all five works, the authors report greatly reduced power consumption at the same classification accuracy as related works.

Despite the chronology of these publications, the last two works somewhat combine elements of the above, more recent publications. Firstly, Liu et al. [90] propose a full speech recognition flow implementing both a BWN-based VAD/KWS module and a Long Short-Term Memory (LSTM)-based speech recognition module. They quantize all data and implement both modules, each with its own style of PEs. Specifically, they use the analog delay-based adders of [92] for the KWS module's PEs and digital approximate multipliers with error recovery for the speech recognition module. Lastly, Jo et al. [69] also propose a hardware accelerator for LSTMs. In addition to using approximate multipliers like [90], they exploit the temporal similarity between inputs to the LSTM cells to determine which updates to approximate or completely skip. Their accelerator implements sparsity-awareness, making it capable of skipping memory accesses to zero-weighted inputs. The two works present significant improvements in power efficiency. We note that the results in most of Liu et al.'s works [87, 89–93] are biased by the use of a smaller technology node than in their related works.
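A minimal sketch conveys the input-similarity gating of [69]; the tolerance value and function name are our assumptions, not the mechanism's exact parameters. An LSTM-cell input whose value barely changed since the previous time step simply reuses its previous contribution instead of being recomputed:

```python
def gate_updates(x_prev, x_now, tol):
    """Flag which inputs changed enough to warrant recomputation; the rest
    reuse their previously computed contributions (update skipped)."""
    return [1 if abs(a - b) > tol else 0 for a, b in zip(x_prev, x_now)]

x_prev = [0.20, 0.51, 0.92]
x_now  = [0.21, 0.70, 0.92]
print(gate_updates(x_prev, x_now, tol=0.05))   # -> [0, 1, 0]: one update runs
```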
Other ML applications. The remaining six works focus on applications or techniques that do not naturally fit into the above categories. First, Kim and Shanbhag [81] propose an AxC-enabled stereo image matching (used in machine vision) accelerator with Algorithmic Noise Tolerance (ANT), an alternative to, e.g., self-compensation [103]. Their message-passing architecture uses ANT to mitigate errors from approximate addition and voltage over-scaling. The authors show potential power reductions at the cost of (significant) overheads for implementing ANT-related estimation modules.

Second, Kar et al. [75] consider low-power always-on anomaly detection. They propose a hardware accelerator for an ensemble of one-class classifiers, i.e., single-layer NNs with a single output neuron. Their architecture supports online training, replaces weight memories in the first layer with pre-determined pseudo-random number generators, and evaluates neurons in time-division multiplexed MAC units. Furthermore, it can activate only a subset of classifiers and quantize their computations when inputs are easily distinguishable from anomalies. The authors show that the approximations increase energy efficiency at no noticeable drop in detection accuracy.

Third, Chen et al. [30] consider communication in NoCs used for image classification. They propose a scheme that scales the contrast of input images to reduce the bit-width of pixels and quantizes floating-point weights and activations to 8-bit integers. They support these approximations in the NoC's network interfaces, thus reducing data transfers. An online quality manager keeps track of output quality and adjusts approximations accordingly. The authors evaluate several image classification NNs and show reduced network latency and power consumption at little accuracy loss.

Fourth, Li et al. [85] propose using stochastic computing to reduce the hardware cost of NN accelerators. Specifically, they propose a neuron model that accumulates several incoming bit-streams concurrently and needs no activation function due to how weights are encoded in the stochastic domain. They evaluate their proposal on an MLP and a CNN and show greatly reduced area and power consumption at no accuracy degradation.

Fifth, Wang et al. [179] present a PE-based processor architecture designed to accelerate transformer NNs. The design combines self-adapting multipliers with inexact Booth encoding (like [62, 154]) and inexact compressors (like [17, 106]), speculation on weight sparsity, and out-of-order scheduling to maximize PE utilization and fully exploit the transformer's error resilience. These optimizations enable the architecture to achieve improved energy efficiency over related works.

Last, and orthogonal to aiming for power/energy and latency savings through AxC, Guesmi et al. [58] propose applying it to reduce CNNs' vulnerability to adversarial attacks. Such attacks involve manufactured inputs that closely resemble real inputs but intentionally cause inference failure. AxC introduces input-dependent noise and makes it more difficult to manufacture adversarial inputs efficiently. Specifically, the authors apply an approximate-mantissa floating-point multiplier and find that it greatly reduces attack transferability and success rate.
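As a rough model of such an approximate-mantissa multiplier (an illustrative stand-in, not the specific design of [58]), one can zero the low mantissa bits of binary32 operands before multiplying, which injects exactly the kind of input-dependent noise described above:

```python
import struct

def trunc_mantissa(x: float, drop: int) -> float:
    """Zero the `drop` least significant bits of a binary32 mantissa (23 bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << drop) - 1)          # mask mantissa LSBs; exponent untouched
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def approx_fmul(a: float, b: float, drop: int = 12) -> float:
    """Multiply with truncated mantissas, injecting small input-dependent error."""
    return trunc_mantissa(trunc_mantissa(a, drop) * trunc_mantissa(b, drop), drop)

print(approx_fmul(3.14159, 2.71828), 3.14159 * 2.71828)   # approximate vs. exact
```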
4.2 Image Processing

As seen in the prior sections, highlighted in Tabs. 2 and 4, image processing applications are frequently approximated. Ten works fall into this category, the first four presenting approximate DCT implementations, common to image compression algorithms. As these algorithms are inherently lossy, it is natural to consider approximating parts of them. Almurib et al. [9] describe a three-step framework for developing high-efficiency implementations of the DCT, namely: 1) select a low-complexity algorithm, 2) filter out high-frequency components, and 3) apply AxC techniques such as truncation and inexact adders. They report an extensive error analysis of the algorithm using different approximate adders. Similar ideas motivate Snigdha et al. [159], who focus on a multiplier-less DCT algorithm. Provided with a quality factor and an error variance budget, they find the optimal number of LSBs to approximate using an ILP solver. Xing et al. [186] similarly approximate the constant DCT weights, apply thresholding to intermediate pixel values, and implement inexact adders. Wang et al. [180] instead propose replacing the DCT algorithm with neural approximation (refer to Sec. 3.1.4). To accelerate inferences, they present a PE-based architecture (like in [137, 165]) with distributed memories and activation function approximation [38], all optimized for operation at NTV (refer to Sec. 2.1.1). All four works report that significant energy savings are achievable with little visual impact on compressed images.

Another four works apply AxC to edge detection. de Oliveira et al. [37] first explore the use of inexact adders in the Gaussian and gradient filters common to many edge detection algorithms, highlighting the potential savings this enables. Their subsequent work, Soares et al. [160], specifies an algorithm to assign approximation levels to adders in the filters based on estimated input magnitudes. The authors also provide a full datapath that approximates gradient magnitude and direction computations to be multiplier-free. Their most recent work, Monteiro et al. [108], combines these proposals with varying filter sizes to enable area-power-quality trade-offs. Orthogonally, Usami et al. [167] propose using a simple memoization scheme, a single-entry version of [43], that skips filter convolutions by reusing the most recently computed value if the inputs are approximately similar. The former three works report area and power reductions, while the latter shows some hardware overheads but reduced energy consumption.

The last two works focus on different algorithms. Siva and Jayakumar [158] present a deeply pipelined architecture that implements a bi-linear interpolation scaling algorithm. They approximate the algorithm's edge detection and sharpening filters by using approximate multipliers but fail to achieve improvements over related works. Yao et al. [195] consider an architecture for bilateral denoising filters. They estimate filter convolutions using piece-wise linear models and normalize pixel values with an inexact divider, significantly improving throughput compared to related works.
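Returning to the memoization scheme of [167], it boils down to a tiny fuzzy cache in front of the filter; the tolerance value and class name below are our illustration choices:

```python
class SingleEntryMemo:
    """Single-entry fuzzy memoization: reuse the most recent filter result
    when the new inputs are approximately similar to the previous ones."""
    def __init__(self, tol: int):
        self.tol, self.key, self.val = tol, None, None

    def convolve(self, window, kernel):
        if self.key is not None and all(abs(a - b) <= self.tol
                                        for a, b in zip(window, self.key)):
            return self.val                            # hit: skip the convolution
        self.val = sum(p * k for p, k in zip(window, kernel))
        self.key = list(window)
        return self.val

memo = SingleEntryMemo(tol=2)
print(memo.convolve([10, 20, 30], [1, 2, 1]))   # computed: 80
print(memo.convolve([11, 19, 31], [1, 2, 1]))   # reused: 80 (inputs within tol)
```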
4.3 Video Processing

Another seven works present AxC applied to video processing applications. Palumbo and Sau [126] motivate research in this direction by varying user or system requirements regarding quality, power consumption, etc. Having surveyed related works, they argue for flexible software and hardware implementations of video-related algorithms, specifically for Advanced Video Coding (AVC) and High-Efficiency Video Coding (HEVC), enabling run-time reconfigurability.

In line with this, four works present AxC applied to the HEVC algorithm. The first two consider its motion estimation encoding step. El-Harouni et al. [44] label this step as cumbersome owing to its many Sum-of-Absolute-Differences (SAD) computations between sub-frames. They exploit its error resilience in a heterogeneous accelerator architecture that implements tiles of unrolled SAD operations with different approximate adders used for some LSBs. A controller switches between the tiles and manages their memory accesses according to user-provided quality constraints, power-gating any unused tiles. Paltrinieri et al. [125] present a more static design and evaluate the benefits of replacing subsets of its adders/subtractors with approximate alternatives. Both works report power and area savings, although those of [44] are not matched by [125]. The other two consider the HEVC algorithm's fractional pixel interpolation decoding step that involves compute-heavy Finite Impulse Response (FIR) filters for both chroma (color) and luma (brightness) channels. Sau et al. [147] propose a reconfigurable architecture in which a number of filter taps may be skipped. They insert multiplexers and clock-gating logic into an existing interpolation circuit targeting FPGA, enabling low-overhead reconfiguration without reprogramming the FPGA. Preatto et al. [133] propose an alternative architecture implementing reconfigurable filters with adapted weights for different numbers of filter taps. Moreover, they use approximate adders in their approximate luma filter implementation. Both works report that power consumption is proportional to the number of filter taps used and note some overheads from the introduced reconfiguration logic.

The remaining two works are much more diverse. Schaffner et al. [149] describe how most video processing algorithms boil down to least-squares problems. With this in mind, they present an approximate direct solver architecture based on Cholesky decomposition that prunes insignificant intermediate computations, giving rise to significant run-time savings. Qiao et al. [135] instead propose a hybrid cache architecture for approximating chrominance pixel values in AVC video algorithms. The approximate cache maps several indices to the same entry (like [104]), increasing effective cache size and reducing external memory accesses for chrominance data.

4.4 Reliability

Yet another seven works consider applying AxC to fault tolerance, five of which trade off fault detection guarantees for reduced duplication overheads in N-Modular Redundancy (NMR) techniques. Chen et al. [28] aim to resolve fail-silent faults in majority-voting-based NMR systems, i.e., the acceptance of a majority of incorrect results. They propose two approximate voting schemes for Double Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) systems, both involving comparing some most-significant input bits to a given threshold to determine their validity. If no input is classified as valid, the schemes flag an error; otherwise, an approximate result is produced by averaging or voting.
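The following sketch captures the flavor of such a scheme for the DMR case; the bit widths and threshold are illustrative assumptions of ours, not the exact parameters of [28]. Two module results are accepted, and averaged, when their most-significant bit fields agree closely; otherwise, an error is flagged:

```python
def approx_dmr(a: int, b: int, width: int = 16, msbs: int = 4,
               threshold: int = 1) -> int:
    """Approximate DMR voting: accept when the modules' MSB fields differ by
    at most `threshold`, returning their average; otherwise flag an error."""
    shift = width - msbs
    if abs((a >> shift) - (b >> shift)) <= threshold:
        return (a + b) // 2                       # approximate, averaged result
    raise RuntimeError("DMR mismatch: fault detected")

print(hex(approx_dmr(0x1234, 0x1236)))    # -> 0x1235 (MSB fields agree)
```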
The other four works approximate the redundant modules. Rodrigues et al. [140] precision-scale the modules, reducing the width of their datapaths. In addition to reducing their area, doing so also reduces the number of critical wires that lead to output errors if affected by soft errors. Anajemba et al. [12] propose replacing the two redundant modules in TMR systems with over- and under-approximated versions of the original. They introduce more 1s or 0s in their truth tables, simplifying the minterms required to represent them in SOP form while satisfying given error constraints. Combining over- and under-approximated modules implies that at least two modules produce exact outputs in the presence of no faults. Deveautour et al. [39] propose an approximate Quadruple Modular Redundancy (QMR) scheme. They randomly split a design's outputs into four equally-sized subsets, removing one subset and its fan-in cone from each to generate four approximate circuits. The modules' outputs are combined using four majority voters: one for each subset of outputs. The resulting system guarantees single-fault coverage. Lastly, Nazar et al. [112] propose approximating all modules in a DMR system and utilizing the saved area for implementing more modules, improving system throughput. They mitigate the effects of such broad approximations through a heuristic optimization algorithm and by using the voting schemes of [28].

Four works report positive results: the voting schemes of [28] almost consistently achieve higher output quality than an exact TMR scheme; precision scaling is shown to improve reliability against accumulated faults [140]; reduced area and improved energy efficiency are reported in [12]; and decreased error effects from faults are reported by [112]. Contrarily, [39] reveal uncertain benefits of approximate QMR, improving fault tolerance only for some circuits.

The two remaining works present orthogonal ideas. Verdeja and Li [172] focus on redundant general-purpose computing systems and propose to exploit application fault tolerance by skipping crashes in non-critical code regions, avoiding the long run-time overheads of restarting the application. Their scheme requires OS and hardware support but achieves significant performance and energy efficiency improvements. Wang et al. [177] combine AxC with soft error tolerance by proposing a data format consisting of an error-correctable control part and a parity-protected data part. They suggest truncating 32-bit numbers to their two most significant, non-zero 4-bit segments and assigning them pre-computed control signals. The authors report improved output quality under the influence of soft errors.

4.5 Other Applications

Lastly, we review thirteen works that do not naturally fall into any of the above categories. Instead, we group these works into those relating to wireless communication, Digital Signal Processing (DSP) applications, machine vision, and others.

4.5.1 Wireless communication. Six of the thirteen works focus on different aspects of wireless communication. Sen et al. [152] notice a systematic nature of errors introduced at the physical layer of wireless channels and aim to reduce these without resorting to FEC and ARQ. They propose making the transmitter aware of the errors and the relative importance of bit positions in transmitted packets to transmit important bits in better-protected positions. To support this, they present extended transmitter and receiver architectures.
Table 5. Select characteristics (class, reference, and evaluation platform) of the reviewed works on applications of AxC.

ML – NNs:                  [174] FPGA; [80] Sim. (ASIC); [178] Sim. (ASIC); [151] Sim. (ASIC); [145] Sim. (ASIC); [165] Sim. (ASIC); [56] Sim. (ASIC); [24] FPGA; [171] ASIC; [84] FPGA; [137] FPGA; [182] Sim. (anal.); [190] Sim. (ASIC)
ML – SVMs & HDC:           [168] Sim. (ASIC); [203] Sim. (ASIC); [50] Sim. (ASIC); [65] Sim. (ASIC); [78] FPGA
ML – VAD:                  [92] Sim. (ASIC); [89] Sim. (ASIC)
ML – KWS:                  [93] Sim. (ASIC); [91] Sim. (ASIC); [86] Sim. (ASIC); [94] Sim. (ASIC); [87] ASIC
ML – Speech recog.:        [90] Sim. (ASIC); [69] Sim. (ASIC)
ML – Others:               [81] Sim. (ASIC); [85] Sim. (ASIC); [75] ASIC; [58] Sim. (anal.); [30] Sim. (anal.); [179] ASIC
Image proc. – DCT:         [159] Sim. (ASIC); [9] Sim. (anal.); [186] Sim. (ASIC); [180] ASIC
Image proc. – Edge detect.: [37] Sim. (anal.); [160] Sim. (ASIC); [167] FPGA; [108] Sim. (ASIC)
Image proc. – Others:      [158] Sim. (ASIC); [195] FPGA
Video proc. – General:     [126] FPGA
Video proc. – HEVC:        [44] Sim. (ASIC); [147] FPGA; [125] Sim. (ASIC); [133] Sim. (ASIC)
Video proc. – Others:      [149] Sim. (anal.); [135] Sim. (anal.)
Reliability – NMR:         [28] Sim. (ASIC); [140] FPGA; [12] Sim. (anal.); [39] Sim. (anal.); [112] Sim. (ASIC)
Reliability – Others:      [172] Sim. (ASIC); [177] Sim. (ASIC)
Others – Wireless comm.:   [152] Sim. (anal.); [23] FPGA; [202] FPGA; [60] Unspec.; [184] Sim. (anal.); [63] FPGA
Others – DSP:              [102] Sim. (ASIC); [19] Sim. (ASIC); [88] ASIC; [40] FPGA
Others – Machine vision:   [162] ASIC; [55] FPGA
Others – Others:           [68] FPGA

Hao et al. [60] explore the use of approximate adders in the Fast Fourier Transform (FFT) and Inverse FFT steps in a conventional wireless communication system. Idrees et al. [63] consider a hypothetical 6G downlink. They propose using FIR filters with approximate fixed-point MACs to minimize the effects of channel noise over a wireless channel used for approximable data. They evaluate several combinations of multipliers and adders applied to different modulation schemes. Sen et al. [152] report improved output quality, Hao et al. [60] highlight the suitability of fail-rare approximations to their application, and Idrees et al. [63] demonstrate potential for significant power savings with little impact on quality.
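A small fixed-point FIR with a truncated MAC conveys the idea behind these filters; the Q8 format and the number of discarded partial-product bits below are our illustration choices, not those evaluated in [63]:

```python
FRAC = 8                                    # Q8 fixed-point: 8 fractional bits

def to_fix(x: float) -> int:
    return round(x * (1 << FRAC))

def approx_mac(acc: int, a: int, b: int, drop: int = 4) -> int:
    """Multiply-accumulate that discards `drop` LSBs of each partial product,
    mimicking a truncated hardware multiplier."""
    return acc + (((a * b) >> (FRAC + drop)) << drop)

def fir(samples, taps):
    acc = 0
    for s, t in zip(samples, taps):
        acc = approx_mac(acc, to_fix(s), to_fix(t))
    return acc / (1 << FRAC)

print(fir([0.5, 1.0, 0.25], [0.2, 0.5, 0.3]))   # 0.675 exact, slightly lower here
```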
The three other works focus on more particular technologies. Castaneda et al. [23] develop an approximate algorithm for data detection in Multiple-Input Multiple-Output systems. They apply semidefinite relaxation to reduce the complexity of the otherwise exponential-scale problem, quantize all intermediate values to fixed-point formats, and provide a PE-based accelerator targeting FPGAs. The accelerator achieves comparable throughput and improved error rates relative to related works. Xiao et al. [184] present three mathematical approximations of the expectation propagation step in Sparse Code Multiple Access detection: 1) Jacobi approximation, 2) fewer multiplications and divisions, and 3) max operations instead of some divisions and logarithms. Their algorithm nearly retains the performance of its exact counterpart yet permits simpler hardware implementation. Zhou et al. [202] consider the inherently serial successive cancellation decoding algorithm for polar codes an error-resilient DSP task. They propose using approximate comparators, invert-and-add-one modules, and adders, further truncating and quantizing variables to suitable fixed-point formats. The resulting architecture improves throughput at the expense of some error correction performance.

4.5.2 Digital Signal Processing. In line with [60, 63, 202], another four works focus on DSP-related applications. Martina et al. [102] propose a multiplier-less architecture for approximately computing the Discrete Wavelet Transform. They utilize inexact adders and describe a result-biasing technique for balancing out errors in different stages. Their designs achieve similar output quality to comparable designs. Basu et al. [19] focus on bio-DSP applications, describing an architecture that combines task parallelism, vectorization, and approximation. It implements several VOS processor cores and a SIMD-capable CGRA co-processor used only for exact acceleration. An execution monitoring system manages the control, data flow, and reconfiguration, ensuring resets upon non-recoverable memory errors. This approximate acceleration leads to significant energy reductions.

In parallel to their works on speech recognition [86, 87, 89–93], Liu et al. [88] present an approximate architecture for pre-processing audio into Mel-scale Frequency Cepstral Coefficients. They implement a reduced-complexity FFT, quantize all computations, utilize inexact adders and multipliers, and apply a dual-voltage supply scheme. The architecture is shown to drastically reduce the power consumption of a VAD system with negligible accuracy degradation. Unrelated to this, Dhaou [40] presents a hardware architecture for fuel consumption estimation in cars, approximating the engine's rotational velocity to save area for memories with negligible loss of accuracy.

4.5.3 Machine vision. Two works consider vision-related applications. Tagliavini et al. [162] present an NTV parallel architecture for motion detection, closely related to [163]. By dynamically adjusting the voltage of the data memory during different kernels, the authors clearly demonstrate the energy-quality trade-off that their architecture enables. Gkeka et al. [55] consider compute-heavy algorithms used by unmanned aerial vehicles to navigate the world safely. Specifically, they target throughput improvements and propose an extensive list of exact and approximate code and hardware optimizations for accelerating localization and mapping kernels on FPGAs. Their resulting architectures achieve significant speedup over exact baselines at acceptable errors.

4.5.4 Others. The work of Jiang et al. [68] is not easily grouped with the above. Targeting imprecise mixed-criticality systems, they present an AxC-equipped processor architecture implementing an inexact floating-point unit. Supplementing this, they propose a medium-criticality mode for executing otherwise low-criticality applications approximately. With low software and hardware overheads, this mode enables executing more tasks within a given time quantum.
To recap, we have now reviewed fundamental AxC techniques, AxC-enabled hardware architectures, and application-specific designs profiting from AxC. This section clearly shows that ML is the most popular application domain. Yet, AxC has also been applied to image and video processing workloads, reliability and fault detection systems, wireless communication, DSP, and other Edge-related applications. Like before, we summarize frequently used performance metrics in Tab. 5 and reported results in terms of power and area in Fig. 8. Clearly, most works report power savings, with only two exceptions instead reporting improved throughput [165] or fault tolerance [68]. The trend is similar for reported area: exceptions improve power consumption [81], energy efficiency [179, 180], or resource efficiency [23].

Fig. 8. Normalized power and area metrics from the works on applications of AxC, in chronological order.

5 CHALLENGES AND FUTURE WORK

While our technical review reveals a vast amount of work on AxC techniques and their applications, only two explicitly apply to Edge computing [19, 84]. This observation is shared by Barua and Mondal [18], who suggest that Edge computing may be better utilized by appropriately applying AxC techniques, indicating that open research directions remain. We cover the main part of these in the following three sections. Later, we present overarching research directions needed for the mainstream adoption of AxC.

Fig. 9. Distribution of evaluation platforms used across the three classes of publications.

5.1 Fundamental Approximate Computing Techniques

The first part of our review revealed a plethora of works on fundamental techniques useful in a wide range of architectures and applications. Despite the variation of works, there are still open research directions related to this field. For example, although VOS already appears to be the most effective circuit-level technique, enabling quadratic power reductions, Amanollahi et al. [10] highlight it as a direction of future research, especially in combination with dynamic threshold voltage adjustments.
In parallel to this, researchers may explore hybrid or emerging technology-based memory architectures [163] or non-volatile memories as alternatives to volatile ones [10]. Although there exist many inexact arithmetic units [96, 105], related authors suggest further research into this area [124]. Some call for units designed with specific applications in mind, which assert either "fail small" or "fail rare" behavior [32]. The former is necessary for ANT [83], while the latter can be used for ML applications [32]. Others call for the design of reconfigurable units [80, 116, 200], potentially extended with low-overhead error compensation [199].

Architectures subject to strict area requirements have been shown to benefit from stochastic computing techniques. However, while stochastic arithmetic circuits are small, the logic required to generate (pseudo-)random bitstreams is often costly. Future research may consider approximating such circuits further by truncating them and the bitstreams they generate [153] or by using memristive devices in their place [5]. Researchers may also study how to tame memristive architectures' erroneous nature to effectively parallelize stochastic computations, including sequential divisions [5].

Lastly, our review has shown memoization as a low-complexity alternative to neural approximation [71]. Future research may focus on improving its use of local memories [71] and reducing its hardware overheads and lookup and update complexity [61, 157]. Moreover, concerning function approximation, stochastic computing implementations are particularly interesting because of their low area and power consumption. Yet, they require further research to be fully optimized and explored in relevant applications [98].

5.2 Approximate Computing-enabled Hardware Architectures

We have also shown how AxC is mostly used for application-specific architectures, likely due to its easier adaptation and potentially greater benefits. Yet, research into generic AxC-enabled architectures is also important, and many techniques are applicable to both [105]. Using AxC efficiently in GPPs is interesting as it offers diverse workloads and widespread application. Yet, its need for ISA and tool flow changes limits its adoption [124, 187], and the lack of automatic sensitivity analysis tools limits its effectiveness [27]. Other research directions include the approximation of other instruction classes than arithmetic, load/store, and branches [113, 118]; approximation-aware cache replacement and insertion policies [104]; low-overhead memoization techniques [146]; and solutions to DRAM scaling challenges [201].

We find a need for further development of reconfigurable architectures that achieve a good balance of performance and low reconfiguration overhead. The current works focusing on CGRAs [2, 33, 41] fail to report reconfiguration overheads, making it unclear whether they have struck this balance. Furthermore, the potential benefits and implications of SIMD-style extensions to these architectures (like those of [19]) have yet to be fully explored. Future works may also consider mixed-signal implementations (like [86, 90, 92]) to achieve even greater energy efficiency [124]. Regarding NoCs, research may focus on packet prediction and compression [139], value coding in wired NoCs [111], methods for approximate communication in wireless NoCs [49], as well as more optimal workload mapping and packet compression even for exact operation [107].
Unfortunately, related works mostly avoid hardware simulations, opting for simpler, analytical alternatives, as shown in Fig. 9, and they typically fail to report power and area estimates, which are often relevant in AxC-equipped systems.

5.3 Applications of Approximate Computing

In our review, we saw that AxC techniques have already been applied to a variety of applications. Although many of the reviewed works report improvements, they also outline many directions for future research. Note that the technical sections of our review have not covered many works related to security, as the included works are mainly surveys. Instead, we cover their suggestions here.

ML is undoubtedly the most popular application of AxC techniques. With the potential benefits arising from combining AxC and Edge acceleration and the increasing computational demands of larger models, we expect it to remain so for the near future. Our review covers many works on hardware accelerators for NNs with AxC techniques applied, e.g., [80, 85, 90, 91, 93, 171]. Yet, as the field is evolving rapidly, further research remains interesting. Such research may, for example, seek to understand the synergies between AxC and ML [171] and extend existing techniques to more NN layer types [24, 69]. Attaining the desired power savings with negligible impact on network accuracy requires fully understanding the benefits and drawbacks of AxC techniques applied to ML [80]. Understanding this also enables deterministically evaluating per-layer approximations [200] and supports the evaluation of dynamic AxC as a controlled source of noise, improving robustness against adversarial attacks [58].

In this direction, reviewing the training algorithms used is also relevant. For example, one work calls for application-specific ensemble learning methods [75], while another suggests developing general training algorithms with simplifications for improved execution time [171]. Concurrently, researchers may explore alternative ML model types. The review reveals SNNs as the most likely successor to current DNNs. Initial results are promising, but the networks have yet to see mainstream use and be applied to more complex learning tasks than image recognition [151, 174]. HDC and SVMs also appear as intriguing alternatives allowing for efficient acceleration. SNNs, HDC, and SVMs all offer many open research directions, particularly in emerging technology-based hardware architectures [25, 54, 65, 141].

Multimedia applications are also often approximated, as highlighted by the number of reviewed works in Secs. 4.2 and 4.3, the frequently used benchmark applications in Tabs. 2 and 4, and by Palem and Lingamneni [124]. Most of these works target a specific step in an algorithm (e.g., DCT for JPEG or motion estimation for HEVC). However, some suggest extending their approaches to other steps and evaluating them more carefully in different environments [37, 125, 167].

The reviewed works on reliability and fault tolerance offer many opportunities for improvement. Reducing the overhead of traditional NMR schemes while maintaining a satisfactory level of fault detection is both intriguing and challenging. We find that while some existing works present promising results, there is still a need for further research into, for example, the potential of further approximations in complex DMR, TMR, and QMR systems [12, 39, 112], the benefits arising from truncating module outputs [140], and how to model errors from skipping crashes in non-critical code regions [172].
Moreover, it is perhaps even more interesting to delve into the intersection of these techniques.

Hardware security is becoming increasingly important as more personal data is being stored and processed at a distance from the users. While Edge processing reduces the number of (shared) storage and processing locations, it does not guarantee complete security [119]. Therefore, Edge devices must, apart from being reliable, also contain security features that ensure no attackers can detect and propagate secret information about their users. The reviewed works outline many open challenges in this direction, of which some recur in multiple works. These topics include AxC for generating physically unclonable functions [95, 96] usable for authentication and identification; insertion, activation, and detection of hardware trojans [95, 96, 196]; and distinguishing errors caused by AxC from those caused by trojans [95]. It is also relevant to analyze using AxC against fault injection and side-channel attacks [96, 138]. Regrettably, although AxC can obfuscate logic circuits and hide their attack surfaces, it may also bring new attack surfaces whose detection and mitigation are crucial, especially in the context of compound attacks [176]. Applying AxC to post-quantum cryptography algorithms is another interesting research direction yet to be explored [138].

5.4 Overarching Research Topics

As we have seen, there are numerous possibilities for further research into the three main technical directions presented in our review. However, some research directions, such as quality management and cross-layer AxC techniques, span a multitude of them. A common notion in related works is that both require more work to enable wider adoption of AxC. We present some ideas for such work here.

Most of the reviewed works on AxC-enabled architectures and applications of AxC statically apply approximations with little to no regard for their effects on output quality. This approach is unlikely to maximize potential benefits from AxC, as it implies using techniques conservatively. Instead, quality management systems are required to control approximations effectively [187], with existing techniques differing in whether they are applied offline or at run-time. In the first category, Chen et al. [31] propose using developer annotations and static code analysis to determine variables' error sensitivity to truncation, applying their technique to the NoC of [139]. More works fall into the second category, using either LWCs to make circuits self-tunable [188] (like [103]) or various regression or NN models to predict errors and adapt approximations accordingly [67, 77, 79, 175]. Unfortunately, most of these techniques are either too conservative [31] or fail to ensure quality constraints [67, 79, 175, 188]. One work stands out, giving thresholds as mean relative error values that are almost fully utilized yet guaranteed [77]. Moreover, although some report promising results, they are often accompanied by high hardware/software overheads that limit the savings arising from AxC. Preferably, an application should need minimal overheads to communicate its quality constraints to the underlying hardware, which in turn should monitor and adapt approximations accordingly to guarantee output quality. Research into low-overhead quality monitoring and management at the circuit and architecture levels is needed [14].
Techniques may likely utilize LWCs and NNs as in the reviewed works [79, 188], perhaps in combination with iterative classification and minimal-error SOP approximations [67, 77]. Monitors could also implement traditional fault detection hardware configured to flag only faults known to cause unsatisfactory output quality [14]. Orthogonally, some authors call for tools to formally verify that approximate systems satisfy their quality constraints [169, 187]. Designers of such tools can benefit from traditional verification tools extended to support quality-oriented properties or approximate equivalence checks [124, 169]. Finally, guaranteeing quality in non-deterministic approximate chips requires developments in in-circuit test flows to ignore acceptable approximation errors [95].

Taking full advantage of AxC requires further development of cross-layer techniques, tools, and frameworks. Mature tools can perform design space exploration and hardware/software co-design, combining various techniques to satisfy user-specified quality constraints while maximizing energy savings or minimizing latency [114, 124, 200]. Such exploration, however, implies a need for further development of techniques and analysis of their effects. Although some works already combine techniques, e.g., [145, 171], they mostly use truncation with another of the reviewed fundamental techniques, as shown in Tabs. 1, 3, and 5. The tools must also be able to reason about non-deterministic behavior in approximate hardware, for example, through special software libraries or new programming languages [187].

Once well-developed, designers can use these tools for optimizations, a process quickly necessitating computer-aided tools for complex designs. Historically, this has been solved by raising the abstraction level, most recently to HLS. Several works follow this idea and propose extensions to such tools [84, 112, 114, 157, 188]. Alternative tools could be designed specifically to target circuits implemented using probabilistic Boolean logic [124] or memoization [71, 157]. In general, approximate logic synthesis, as surveyed by Scarabottolo et al. [148], is very promising but requires great efforts to overcome its current drawbacks in terms of scaling, support for run-time reconfigurability, cross-layer technique exploitation, and dependence on synthesis and simulation [114]. In this space, ML-based design space exploration has recently gained traction. Especially evolutionary algorithms seem well-suited for optimization problems with very large search spaces due to their randomized generational evolution [173]. Such algorithms can efficiently identify near-optimal design parameter configurations for larger designs, or they can be used to optimize smaller components [150]. However, they still need significant research efforts to become functional in practice.

We generally find that techniques should be easily applicable and cause minimal strain on developers, which implies updates to several steps of the development process, from compilers, verification, simulation, and synthesis tools, and the languages they operate on, to tools for sensitivity analysis and formal reasoning. Furthermore, although results arising from AxC applied to various workloads are promising, the combined benefits of AxC and Edge computing are yet to be explored. Lastly, we note that this list is not comprehensive but comprises the ideas we find necessary for a broad adoption of AxC within the Edge computing scope.
6 REVIEW SUMMARY

As the number of connected devices that generate data increases, so do the computational demands to aggregate the data. However, many of the devices are battery-driven and, thus, need to perform these computations at minimal energy consumption. Traditionally, they have completely offloaded their computations to far-away centralized Cloud data centers at great costs in terms of communication energy and latency, but the expected future data amounts render this solution unfit. Moreover, this method fails to exploit the inherent error resilience of the applications. This calls for alternatives to be explored. One such alternative is the intriguing combination of the AxC and Edge computing domains. Doing so allows for exploiting the benefits of approximation and offloading at once. This also opens new optimization opportunities by extending the traditional time-energy trade-off with an accuracy aspect.

In this paper, we have presented a systematic literature review of publications about AxC and Edge computing. As such, we have considered works in three classes: fundamental AxC techniques, AxC-enabled architectures, and applications of AxC. From this, we observed that many different techniques, from the circuit level to the architecture level, have found application in GPPs, CGRAs, and NoCs alike. However, we also noticed a lack of work exploring cross-layer techniques. We expect that combining them will lead to greater savings but that it also requires further developments in low-overhead quality management.

Our investigation into applications revealed several trends. First, AxC has primarily been utilized for accelerating ML and multimedia workloads, which exhibit the error resilience expected of them. However, we noticed the surprising application of AxC to reduce overheads in fault-tolerant systems, a somewhat unexpected though interesting direction. Secondly, despite many works presenting application-specific accelerators, very few apply more than one or two AxC techniques simultaneously, once again implying a need for research into cross-layer techniques. And lastly, most publications report savings in power, energy, area, or delay/run-time, while some trade off one for the other. The distribution of reported results, previously presented in Figs. 4, 6, and 8, is shown in Fig. 10, clearly demonstrating the benefits one can expect from using the reviewed techniques.

Fig. 10. Box-plots of results in the two most frequently reported performance metrics of each publication class. The savings axis is limited to [−50%, 100%] for clarity, excluding two fundamental AxC technique area outliers (from [142, 163]) and three applications of AxC area outliers (from [81, 179, 180]). Negative values mean increased metrics. Boxes are colored, indicating recurring metrics.

Despite the amount of work that already exists in the domains of AxC and Edge computing, only relatively few explicitly consider their intersection. We believe that there is still room for significant improvements in these fields. We have outlined ideas for further research to motivate researchers to advance them towards mainstream adoption.

REFERENCES
[2] Omid Akbari, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram, and Muhammad Shafique. 2018. Toward Approximate Computing for Coarse-Grained Reconfigurable Architectures. IEEE Micro 38, 6 (2018), 63–72.
[3] Omid Akbari, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram, and Muhammad Shafique. 2019. X-CGRA: An Energy-Efficient Approximate Coarse-Grained Reconfigurable Architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 10 (2019), 2558–2571.
[4] Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon Nam, et al. 2015. TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34, 10 (2015), 1537–1557.
[5] Mohsen Riahi Alam, Mohammadreza Hassan Najafi, Nima Taherinejad, Mohsen Imani, and Raju Gottumukkala. 2022. Stochastic Computing in Beyond Von-Neumann Era: Processing Bit-Streams in Memristive Memory. IEEE Transactions on Circuits and Systems II: Express Briefs 69, 5 (2022), 2423–2427.
[6] Daria Alekseeva, Aleksandr Ometov, Otso Arponen, and Elena Simona Lohan. 2022. The Future of Computing Paradigms for Medical and Emergency Applications. Computer Science Review 45 (2022), 100494.
[7] Christopher Allen, Derrick Langley, and James Lyke. 2014. Inexact Computing with Approximate Adder Application. In Proc. of National Aerospace and Electronics Conference (NAECON). IEEE, 21–28.
[8] Mario Almeida, Stefanos Laskaridis, Stylianos Venieris, Ilias Leontiadis, and Nicholas Lane. 2022. DynO: Dynamic Onloading of Deep Neural Networks from Cloud to Device. ACM Transactions on Embedded Computing Systems (TECS) 21, 6 (2022), 1–24.
[9] Haider Almurib, Thulasiraman Nandha Kumar, and Fabrizio Lombardi. 2017. Approximate DCT Image Compression Using Inexact Computing. IEEE Trans. Comput. 67, 2 (2017), 149–159.
[10] Saba Amanollahi, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram. 2020. Circuit-Level Techniques for Logic and Memory Blocks in Approximate Computing Systems. Proc. IEEE 108, 12 (2020), 2150–2177.
[11] Rida Amjad, Rehan Hafiz, Muhammad Usman Ilyas, Muhammad Shahzad Younis, and Muhammad Shafique. 2019. m-SAAC: Multi-Stage Adaptive Approximation Control to Select Approximate Computing Modes for Vision Applications. Microelectronics Journal 91 (2019), 84–91.
[12] Joseph Henry Anajemba, James Adu Ansere, Frederick Sam, Celestine Iwendi, and Gautam Srivastava. 2021. Optimal Soft Error Mitigation in Wireless Communication Using Approximate Logic Circuits. Sustainable Computing: Informatics and Systems 30 (2021), 100521.
[13] Cisco and/or its affiliates. 2020. Cisco Annual Internet Report (2018–2023). Technical Report C11-741490-01. Cisco. https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html
[14] Lorena Anghel, Mounir Benabdenbi, Alberto Bosio, Marcello Traiola, and Elena Ioana Vatajelu. 2018. Test and Reliability in Approximate Computing. Journal of Electronic Testing 34, 4 (2018), 375–387.
[15] Giuseppe Ascia, Vincenzo Catania, Salvatore Monteleone, Maurizio Palesi, Davide Patti, and John Jose. 2018. Approximate Wireless Networks-on-Chip. In Proc. of Conference on Design of Circuits and Integrated Systems (DCIS). IEEE, 1–6.
[16] Giuseppe Ascia, Vincenzo Catania, Salvatore Monteleone, Maurizio Palesi, Davide Patti, John Jose, and Valerio Mario Salerno. 2020. Exploiting Data Resilience in Wireless Network-on-Chip Architectures. ACM Journal on Emerging Technologies in Computing Systems (JETC) 16, 2 (2020), 1–27.
[17] Hiroyuki Baba, Tongxin Yang, Masahiro Inoue, Kaori Tajima, Tomoaki Ukezono, and Toshinori Sato. 2018. A Low-Power and Small-Area Multiplier for Accuracy-Scalable Approximate Computing. In Proc. of Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 569–574.
[18] Hrishav Bakul Barua and Kartick Chandra Mondal. 2019. Approximate Computing: A Survey of Recent Trends—Bringing Greenness to Computing and Communication. Journal of The Institution of Engineers (India): Series B 100, 6 (2019), 619–626.
[19] Soumya Basu, Loris Duch, Miguel Peón-Quirós, David Atienza, Giovanni Ansaloni, and Laura Pozzi. 2018. Heterogeneous and Inexact: Maximizing Power Efficiency of Edge Computing Sensors for Health Monitoring Applications. In Proc. of International Symposium on Circuits and Systems (ISCAS). IEEE, 1–5.
[20] Kunal Bharathi, Jiang Hu, and Sunil Khatri. 2020. Scaled Population Subtraction for Approximate Computing. In Proc. of 38th International Conference on Computer Design (ICCD). IEEE, 348–355.
[21] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proc. of 17th International Conference on Parallel Architectures and Compilation Techniques (PACT). ACM, 72–81.
[22] Flavio Bonomi, Rodolfo Milito, Jiang Zhu, and Sateesh Addepalli. 2012. Fog Computing and Its Role in the Internet of Things. In Proc. of 1st Edition of the MCC Workshop on Mobile Cloud Computing. ACM, 13–16.
[23] Oscar Castaneda, Tom Goldstein, and Christoph Studer. 2016. FPGA Design of Approximate Semidefinite Relaxation for Data Detection in Large MIMO Wireless Systems. In Proc. of International Symposium on Circuits and Systems (ISCAS). IEEE, 2659–2662.
[24] Jorge Castro-Godínez, Deykel Hernández-Araya, Muhammad Shafique, and Jörg Henkel. 2020. Approximate Acceleration for CNN-based Applications on IoT Edge Devices. In Proc. of 11th Latin American Symposium on Circuits & Systems (LASCAS). IEEE, 1–4.
[25] Jair Cervantes, Farid Garcia-Lamont, Lisbeth Rodríguez-Mazahua, and Asdrubal Lopez. 2020. A Comprehensive Survey on Support Vector Machine Classification: Applications, Challenges and Trends. Neurocomputing 408 (2020), 189–215.
[26] Patrik Cerwal et al. 2021. Ericsson Mobility Report. Technical Report EAB-21:010887. Ericsson. https://www.ericsson.com/en/reports-and-papers/mobility-report/reports/november-2021
[27] Arun Chandrasekharan, Daniel Große, and Rolf Drechsler. 2017. ProACt: A Processor for High Performance On-Demand Approximate Computing. In Proc. of Great Lakes Symposium on VLSI (GLSVLSI). ACM, 463–466.
[28] Ke Chen, Jie Han, and Fabrizio Lombardi. 2017. Two Approximate Voting Schemes for Reliable Computing. IEEE Trans. Comput. 66, 7 (2017), 1227–1239.
[29] Linbin Chen, Jie Han, Weiqiang Liu, and Fabrizio Lombardi. 2015. Design of Approximate Unsigned Integer Non-Restoring Divider for Inexact Computing. In Proc. of Great Lakes Symposium on VLSI (GLSVLSI). ACM, 51–56.
[30] Yuechen Chen, Shanshan Liu, Fabrizio Lombardi, and Ahmed Louri. 2022. A Technique for Approximate Communication in Network-on-Chips for Image Classification. IEEE Transactions on Emerging Topics in Computing (2022). Early access.
[31] Yuechen Chen and Ahmed Louri. 2020. Learning-based Quality Management for Approximate Communication in Network-on-Chips. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 11 (2020), 3724–3735.
[32] Vinay Kumar Chippa, Srimat Tirumala Chakradhar, Kaushik Roy, and Anand Raghunathan. 2013. Analysis and Characterization of Inherent Application Resilience for Approximate Computing. In Proc. of 50th Design Automation Conference (DAC). ACM, 1–9.
[33] Vinay Kumar Chippa, Swagath Venkataramani, Srimat Tirumala Chakradhar, Kaushik Roy, and Anand Raghunathan. 2013. Approximate Computing: An Integrated Hardware Approach. In Proc. of Asilomar Conference on Signals, Systems and Computers. IEEE, 111–117.
[34] Corinna Cortes and Vladimir Vapnik. 1995. Support-Vector Networks. Machine Learning 20, 3 (1995), 273–297.
[35] Ayad Dalloo. 2018. Enhance the Segmentation Principle in Approximate Computing. In Proc. of International Conference on Circuits and Systems in Digital Enterprise Technology (ICCSDET). IEEE, 1–7.
[36] Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, et al. 2018. Loihi: A Neuromorphic Manycore Processor with On-Chip Learning. IEEE Micro 38, 1 (2018), 82–99.
[37] Julio de Oliveira, Leonardo Soares, Eduardo Costa, and Sergio Bampi. 2016. Exploiting Approximate Adder Circuits for Power-Efficient Gaussian and Gradient Filters for Canny Edge Detector Algorithm. In Proc. of 7th Latin American Symposium on Circuits & Systems (LASCAS). IEEE, 379–382.
[38] Ines del Campo, Javier Echanobe, Estibaliz Asua, and Raul Finker. 2015. Controlled-Accuracy Approximation of Nonlinear Functions for Soft Computing Applications: A High Performance Co-Processor for Intelligent Embedded Systems. In Proc. of Symposium Series on Computational Intelligence (SSCI). IEEE, 609–616.
[39] Bastien Deveautour, Marcello Traiola, Arnaud Virazel, and Patrick Girard. 2021. Reducing Overprovision of Triple Modular Redundancy Owing to Approximate Computing. In Proc. of 27th International Symposium on On-Line Testing and Robust System Design (IOLTS). IEEE, 1–7.
[40] Imed Ben Dhaou. 2022. Implementation of a Fuel Estimation Algorithm Using Approximated Computing. Journal of Low Power Electronics and Applications 12, 1 (2022), 17.
[41] Jonathan Dickerson, Ioannis Galanis, Zois-Gerasimos Tasoulas, Lincoln Kinley, and Iraklis Anagnostopoulos. 2020. Adaptive Approximate Computing on Hardware Accelerators Targeting Internet-of-Things. In Proc. of 6th World Forum on Internet of Things (WF-IoT). IEEE, 1–6.
[42] Sunil Dutt, Sukumar Nandi, and Gaurav Trivedi. 2017. Analysis and Design of Adders for Approximate Computing. ACM Transactions on Embedded Computing Systems (TECS) 17, 2 (2017), 1–28.
[43] Jorge Echavarria, Katja Schütz, Andreas Becher, Stefan Wildermann, and Jürgen Teich. 2018. Can Approximate Computing Reduce Power Consumption on FPGAs?. In Proc. of 25th International Conference on Electronics, Circuits and Systems (ICECS). IEEE, 841–844.
[44] Walaa El-Harouni, Semeen Rehman, Bharath Srinivas Prabakaran, Akash Kumar, Rehan Hafiz, and Muhammad Shafique. 2017. Embracing Approximate Computing for Energy-Efficient Motion Estimation in High Efficiency Video Coding. In Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1384–1389.
[45] Elsevier. 2022. Scopus. https://www.scopus.com/home.uri.
[46] Hadi Esmaeilzadeh, Emily Blem, Renée St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark Silicon and the End of Multicore Scaling. In Proc. of 38th International Symposium on Computer Architecture (ISCA). ACM, 365–376.
[47] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Neural Acceleration for General-Purpose Approximate Programs. In Proc. of 45th International Symposium on Microarchitecture (MICRO). IEEE, 449–460.
[48] Sayed Rasoul Faraji, Pierre Abillama, and Kia Bazargan. 2021. Approximate Constant-Coefficient Multiplication Using Hybrid Binary-Unary Computing for FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 15, 3 (2021), 1–25.
[49] Vimuth Fernando, Antonio Franques, Sergi Abadal, Sasa Misailovic, and Josep Torrellas. 2019. Replica: A Wireless Manycore for Communication-Intensive and Approximate Data. In Proc. of 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 849–863.
[50] Lorenzo Ferretti, Giovanni Ansaloni, Laura Pozzi, Amir Aminifar, David Atienza, Leila Cammoun, and Philippe Ryvlin. 2019. Tailoring SVM Inference for Resource-Efficient ECG-based Epilepsy Monitors. In Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 948–951.
[51] Farnaz Forooghifar, Amir Aminifar, and David Atienza. 2019. Resource-aware Distributed Epilepsy Monitoring Using Self-Awareness from Edge to Cloud. IEEE Transactions on Biomedical Circuits and Systems 13, 6 (2019), 1338–1350.
[52] Davide Gadioli, Emanuele Vitali, Gianluca Palermo, and Cristina Silvano. 2018. mARGOt: A Dynamic Autotuning Framework for Self-Aware Approximate Computing. IEEE Trans. Comput. 68, 5 (2018), 713–728.
[53] Shrikanth Ganapathy, Adam Teman, Robert Giterman, Andreas Burg, and Georgios Karakonstantis. 2015. Approximate Computing with Unreliable Dynamic Memories. In Proc. of 13th International New Circuits and Systems Conference (NEWCAS). IEEE, 1–4.
[54] Lulu Ge and Keshab Parhi. 2020. Classification Using Hyperdimensional Computing: A Review. IEEE Circuits and Systems Magazine 20, 2 (2020), 30–47.
[55] Maria Rafaela Gkeka, Alexandros Patras, Christos Antonopoulos, Spyros Lalis, and Nikolaos Bellas. 2021. FPGA Architectures for Approximate Dense SLAM Computing. In Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 828–833.
[56] Yu Gong, Bo Liu, Wei Ge, and Longxing Shi. 2019. ARA: Cross-Layer Approximate Computing Framework based Reconfigurable Architecture for CNNs. Microelectronics Journal 87 (2019), 33–44.
[57] Beayna Grigorian, Nazanin Farahpour, and Glenn Reinman. 2015. BRAINIAC: Bringing Reliable Accuracy into Neurally-Implemented Approximate Computing. In Proc. of 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 615–626.
[58] Amira Guesmi, Ihsen Alouani, Khaled Khasawneh, Mouna Baklouti, Tarek Frikha, Mohamed Abid, and Nael Abu-Ghazaleh. 2021. Defensive Approximation: Securing CNNs Using Approximate Computing. In Proc. of 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 990–1003.
[59] Jie Han and Michael Orshansky. 2013. Approximate Computing: An Emerging Paradigm for Energy-Efficient Design. In Proc. of 18th European Test Symposium (ETS). IEEE, 1–6.
[60] Mingjie Hao, Ardalan Najafi, Alberto García-Ortiz, Ludwig Karsthoff, Steffen Paul, and Jochen Rust. 2019. Reliability of an Industrial Wireless Communication System using Approximate Units. In Proc. of 29th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS). IEEE, 87–90.
[61] Xin He, Shuhao Jiang, Wenyan Lu, Guihai Yan, Yinhe Han, and Xiaowei Li. 2016. Exploiting the Potential of Computation Reuse through Approximate Computing. IEEE Transactions on Multi-Scale Computing Systems 3, 3 (2016), 152–165.
[62] Yajuan He, Xilin Yi, Ziji Zhang, Bin Ma, and Qiang Li. 2020. A Probabilistic Prediction-based Fixed-Width Booth Multiplier for Approximate Computing. IEEE Transactions on Circuits and Systems I: Regular Papers 67, 12 (2020), 4794–4803.
[63] Maryam Idrees, Mohammed Manzar Maqbool, Muhammad Khurram Bhatti, Muhammad Mahboob Ur Rahman, Rehan Hafiz, and Muhammad Shafique. 2021. An Approximate-Computing Empowered Green 6G Downlink. Physical Communication 49 (2021), 101444.
[64] Mohsen Imani, Ricardo Garcia, Saransh Gupta, and Tajana Rosing. 2018. RMAC: Runtime Configurable Floating Point Multiplier for Approximate Computing. In Proc. of International Symposium on Low Power Electronics and Design (ISLPED). ACM, 1–6.
[65] Mohsen Imani, Abbas Rahimi, Deqian Kong, Tajana Rosing, and Jan Rabaey. 2017. Exploring Hyperdimensional Associative Memory. In Proc. of 23rd International Symposium on High Performance Computer Architecture (HPCA). IEEE, 445–456.
[66] Chandan Jha and Joycee Mekie. 2019. Design of Novel CMOS based Inexact Subtractors and Dividers for Approximate Computing: An in-Depth Comparison with PTL based Designs. In Proc. of 22nd Euromicro Conference on Digital System Design (DSD). IEEE, 174–181.
[67] Shuhao Jiang, Jiajun Li, Xin He, Guihai Yan, Xuan Zhang, and Xiaowei Li. 2018. RiskCap: Minimizing Effort of Error Regulation for Approximate Computing. In Proc. of 27th Asian Test Symposium (ATS). IEEE, 133–138.
[68] Zhe Jiang, Xiaotian Dai, and Neil Audsley. 2021. HIART-MCS: High Resilience and Approximated Computing Architecture for Imprecise Mixed-Criticality Systems. In Proc. of 42nd Real-Time Systems Symposium (RTSS). IEEE, 290–303.
[69] Junseo Jo, Jaeha Kung, and Youngjoo Lee. 2020. Approximate LSTM Computing for Energy-Efficient Speech Recognition. Electronics 9, 12 (2020), 2004.
[70] Hounghun Joe and Youngmin Kim. 2019. Efficient Approximate Image Processor with Low-Part Stochastic Computing. In Proc. of Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics (PrimeAsia). IEEE, 29–32.
[71] Michael Jordan, Marcelo Brandalero, Guilherme Malfatti, Geraldo Oliveira, Arthur Lorenzon, Bruno da Silva, Luigi Carro, Mateus Rutzig, and Antonio Carlos Beck. 2020. Data Clustering for Efficient Approximate Computing. Design Automation for Embedded Systems 24, 1 (2020), 3–22.
[72] Yirong Kan, Man Wu, Renyuan Zhang, and Yasuhiko Nakashima. 2020. A Multi-grained Reconfigurable Accelerator for Approximate Computing. In Proc. of Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 90–95.
[73] Pentti Kanerva. 1992. Sparse Distributed Memory and Related Models. Technical Report. NASA.
[74] Mingu Kang, Sujan Gonugondla, and Naresh Shanbhag. 2020. Deep in-Memory Architectures in SRAM: An Analog Approach to Approximate Computing. Proc. IEEE 108, 12 (2020), 2251–2275.
[75] Bapi Kar, Pradeep Kumar Gopalakrishnan, Sumon Kumar Bose, Mohendra Roy, and Arindam Basu. 2020. ADIC: Anomaly Detection Integrated Circuit in 65-nm CMOS Utilizing Approximate Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 12 (2020), 2518–2529.
[76] Geethan Karunaratne, Manuel Le Gallo, Giovanni Cherubini, Luca Benini, Abbas Rahimi, and Abu Sebastian. 2020. In-Memory Hyperdimensional Computing. Nature Electronics 3, 6 (2020), 327–337.
[77] Taylor Kemp, Yao Yao, and Younghyun Kim. 2021. MIPAC: Dynamic Input-Aware Accuracy Control for Dynamic Auto-Tuning of Iterative Approximate Computing. In Proc. of 26th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 248–253.
[78] Behnam Khaleghi, Sahand Salamat, Anthony Thomas, Fatemeh Asgarinejad, Yeseong Kim, and Tajana Rosing. 2020. SHEARer: Highly-Efficient Hyperdimensional Computing by Software-Hardware Enabled Multifold Approximation. In Proc. of International Symposium on Low Power Electronics and Design (ISLPED). ACM, 241–246.
[79] Daya Shanker Khudia, Babak Zamirai, Mehrzad Samadi, and Scott Mahlke. 2015. RUMBA: An Online Quality Management System for Approximate Computing. In Proc. of 42nd International Symposium on Computer Architecture (ISCA). ACM, 554–566.
[80] Duckhwan Kim, Jaeha Kung, and Saibal Mukhopadhyay. 2017. A Power-Aware Digital Multilayer Perceptron Accelerator with on-Chip Training based on Approximate Computing. IEEE Transactions on Emerging Topics in Computing 5, 2 (2017), 164–178.
[81] Eric Kim and Naresh Shanbhag. 2014. Energy-Efficient Accelerator Architecture for Stereo Image Matching Using Approximate Computing and Statistical Error Compensation. In Proc. of Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 55–59.
[82] Younghoon Kim, Swagath Venkataramani, Sanchari Sen, and Anand Raghunathan. 2021. Value Similarity Extensions for Approximate Computing in General-Purpose Processors. In Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 481–486.
[83] Yongtae Kim, Yong Zhang, and Peng Li. 2014. Energy Efficient Approximate Arithmetic for Error Resilient Neuromorphic Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 23, 11 (2014), 2733–2737.
[84] Lucas Klemmer, Saman Froehlich, Rolf Drechsler, and Daniel Große. 2021. XbNN: Enabling CNNs on Edge Devices by Approximate On-Chip Dot Product Encoding. In Proc. of International Symposium on Circuits and Systems (ISCAS). IEEE, 1–5.
[85] Bingzhe Li, Yaobin Qin, Bo Yuan, and David Lilja. 2017. Neural Network Classifiers Using Stochastic Computing with a Hardware-Oriented Approximate Activation Function. In Proc. of 35th International Conference on Computer Design (ICCD). IEEE, 97–104.
[86] Bo Liu, Hao Cai, Yu Gong, Wentao Zhu, Yan Li, Wei Ge, and Zhen Wang. 2020. Binarized Weight Neural-Network Inspired Ultra-Low Power Speech Recognition Processor with Time-Domain Based Digital-Analog Mixed Approximate Computing. In Proc. of International Symposium on Circuits and Systems (ISCAS). IEEE, 1–5.
[87] Bo Liu, Hao Cai, Xuan Zhang, Haige Wu, Anfeng Xue, Zilong Zhang, Zhen Wang, and Jun Yang. 2022. A Target-Separable BWN Inspired Speech Recognition Processor with Low-Power Precision-Adaptive Approximate Computing. In Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 196–201.
[88] Bo Liu, Xiaoling Ding, Hao Cai, Wentao Zhu, Zhen Wang, Weiqiang Liu, and Jun Yang. 2021. Precision Adaptive MFCC based on R2SDF-FFT and Approximate Computing for Low-Power Speech Keywords Recognition. IEEE Circuits and Systems Magazine 21, 4 (2021), 24–39.
[89] Bo Liu, Yan Li, Lepeng Huang, Hao Cai, Wentao Zhu, Shisheng Guo, Yu Gong, and Zhen Wang. 2020. A Background Noise Self-adaptive VAD Using SNR Prediction Based Precision Dynamic Reconfigurable Approximate Computing. In Proc. of Great Lakes Symposium on VLSI (GLSVLSI). ACM, 271–275.
[90] Bo Liu, Hai Qin, Yu Gong, Wei Ge, Mengwen Xia, and Longxing Shi. 2018. EERA-ASR: An Energy-Efficient Reconfigurable Architecture for Automatic Speech Recognition with Hybrid DNN and Approximate Computing. IEEE Access 6 (2018), 52227–52237.
[91] Bo Liu, Yuhao Sun, Hao Cai, Zeyu Shen, Yu Gong, Lepeng Huang, and Zhen Wang. 2020. An Ultra-low Power Keyword-Spotting Accelerator Using Circuit-Architecture-System Co-design and Self-adaptive Approximate Computing Based BWN. In Proc. of Great Lakes Symposium on VLSI (GLSVLSI). ACM, 193–198.
[92] Bo Liu, Zhen Wang, Shisheng Guo, Huazhen Yu, Yu Gong, Jun Yang, and Longxing Shi. 2019. An Energy-Efficient Voice Activity Detector Using Deep Neural Networks and Approximate Computing. Microelectronics Journal 87 (2019), 12–21.
[93] Bo Liu, Zhen Wang, Wentao Zhu, Yuhao Sun, Zeyu Shen, Lepeng Huang, Yan Li, Yu Gong, and Wei Ge. 2019. An Ultra-Low Power Always-on Keyword Spotting Accelerator Using Quantized Convolutional Neural Network and Voltage-Domain Analog Switching Network-based Approximate Computing. IEEE Access 7 (2019), 186456–186469.
[94] Bo Liu, Zilong Zhang, Hao Cai, Renyuan Zhang, Zhen Wang, and Jun Yang. 2022. Self-Compensation Tensor Multiplication Unit for Adaptive Approximate Computing in Low-Power CNN Processing. Science China Information Sciences 65, 4 (2022), 1–2.
[95] Weiqiang Liu, Chongyan Gu, Máire O'Neill, Gang Qu, Paolo Montuschi, and Fabrizio Lombardi. 2020. Security in Approximate Computing and Approximate Computing for Security: Challenges and Opportunities. Proc. IEEE 108, 12 (2020), 2214–2231.
[96] Weiqiang Liu, Chongyan Gu, Gang Qu, and Máire O'Neill. 2018. Approximate Computing and Its Application to Hardware Security. In Cyber-Physical Systems Security. Springer, 43–67.
[97] Weiqiang Liu, Liangyu Qian, Chenghua Wang, Honglan Jiang, Jie Han, and Fabrizio Lombardi. 2017. Design of Approximate Radix-4 Booth Multipliers for Error-Tolerant Computing. IEEE Trans. Comput. 66, 8 (2017), 1435–1441.
[98] Tieu-Khanh Luong, Van-Tinh Nguyen, Anh-Thai Nguyen, and Emanuel Popovici. 2019. Efficient Architectures and Implementation of Arithmetic Functions Approximation based Stochastic Computing. In Proc. of 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 281–287.
[99] Fei Lyu, Xiaoqi Xu, Yu Wang, Yuanyong Luo, Yuxuan Wang, and Hongbing Pan. 2020. Ultralow-Latency VLSI Architecture based on a Linear Approximation Method for Computing Nth Roots of Floating-Point Numbers. IEEE Transactions on Circuits and Systems I: Regular Papers 68, 2 (2020), 715–727.
[100] Wolfgang Maass. 1997. Networks of Spiking Neurons: The Third Generation of Neural Network Models. Neural Networks 10, 9 (1997), 1659–1671.
[101] Yashaswi Mannepalli, Viraj Bharadwaj Korede, and Madhav Rao. 2021. Novel Approximate Multiplier Designs for Edge Detection Application. In Proc. of Great Lakes Symposium on VLSI (GLSVLSI). ACM, 371–377.
[102] Maurizio Martina, Guido Masera, Massimo Ruo Roch, and Gianluca Piccinini. 2015. Result-Biased Distributed-Arithmetic-based Filter Architectures for Approximately Computing the DWT. IEEE Transactions on Circuits and Systems I: Regular Papers 62, 8 (2015), 2103–2113.
[103] Sana Mazahir, Osman Hasan, and Muhammad Shafique. 2019. Self-Compensating Accelerators for Efficient Approximate Computing. Microelectronics Journal 88 (2019), 9–17.
[104] Joshua San Miguel, Jorge Albericio, Andreas Moshovos, and Natalie Enright Jerger. 2015. Doppelgänger: A Cache for Approximate Computing. In Proc. of 48th International Symposium on Microarchitecture (MICRO). ACM, 50–61.
[105] Sparsh Mittal. 2016. A Survey of Techniques for Approximate Computing. ACM Computing Surveys (CSUR) 48, 4 (2016), 1–33.
[106] Mohammad Hossein Moaiyeri, Farnaz Sabetzadeh, and Shaahin Angizi. 2018. An Efficient Majority-based Compressor for Approximate Computing in the Nano Era. Microsystem Technologies 24, 3 (2018), 1589–1601.
[107] Masoomeh Momeni and Hadi Shahriar Shahhoseini. 2022. Energy Efficient 3D Network-on-Chip based on Approximate Communication. Computer Networks 203 (2022), 108652.
[108] Marcio Monteiro, Ismael Seidel, Mateus Grellert, José Luis Güntzel, Leonardo Soares, and Cristina Meinhardt. 2022. Exploring the Impacts of Multiple Kernel Sizes of Gaussian Filters Combined to Approximate Computing in Canny Edge Detection. In Proc. of 13th Latin America Symposium on Circuits and System (LASCAS). IEEE, 1–4.
[109] Bert Moons and Marian Verhelst. 2015. DVAS: Dynamic Voltage Accuracy Scaling for Increased Energy-Efficiency in Approximate Computing. In Proc. of International Symposium on Low Power Electronics and Design (ISLPED). IEEE/ACM, 237–242.
[110] Thierry Moreau, Mark Wyse, Jacob Nelson, Adrian Sampson, Hadi Esmaeilzadeh, Luis Ceze, and Mark Oskin. 2015. SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration. In Proc. of 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 603–614.
[111] Amir Najafi, Lennart Bamberg, Ardalan Najafi, and Alberto Garcia-Ortiz. 2019. Integer-Value Encoding for Approximate on-Chip Communication. IEEE Access 7 (2019), 179220–179234.
[112] Gabriel Luca Nazar, Pedro Kopper, Marcos Leipnitz, and Ben Juurlink. 2021. Lightweight Dual Modular Redundancy through Approximate Computing. In Proc. of XI Brazilian Symposium on Computing Systems Engineering (SBESC). IEEE, 1–8.
[113] Geneviève Ndour, Tiago Trevisan Jost, Anca Molnos, Yves Durand, and Arnaud Tisserand. 2019. Evaluation of Variable Bit-Width Units in a RISC-V Processor for Approximate Computing. In Proc. of 16th International Conference on Computing Frontiers (CF). ACM, 344–349.
[114] Kumud Nepal, Soheil Hashemi, Hokchhay Tann, Ruth Iris Bahar, and Sherief Reda. 2016. Automated High-Level Generation of Low-Power Approximate Computing Circuits. IEEE Transactions on Emerging Topics in Computing 7, 1 (2016), 18–30.
[115] William Stafford Noble. 2006. What is a Support Vector Machine? Nature Biotechnology 24, 12 (2006), 1565–1567.
[116] Tuaha Nomani, Mujahid Mohsin, Zahid Pervaiz, and Muhammad Shafique. 2020. xUAVs: Towards Efficient Approximate Computing for UAVs—Low Power Approximate Adders with Single LUT Delay for FPGA-based Aerial Imaging Optimization. IEEE Access 8 (2020), 102982–102996.
[117] Bernard Nongpoh, Rajarshi Ray, and Ansuman Banerjee. 2019. Approximate Computing for Multithreaded Programs in Shared Memory Architectures. In Proc. of 17th International Conference on Formal Methods and Models for System Design (MEMOCODE). ACM, 1–9.
[118] Bernard Nongpoh, Rajarshi Ray, Moumita Das, and Ansuman Banerjee. 2019. Enhancing Speculative Execution With Selective Approximate Computing. ACM Transactions on Design Automation of Electronic Systems (TODAES) 24, 2 (2019), 1–29.
[119] Aleksandr Ometov, Oliver Liombe Molua, Mikhail Komarov, and Jari Nurmi. 2022. A Survey of Security in Cloud, Edge, and Fog Computing. Sensors 22, 3 (2022), 927.
[120] Aleksandr Ometov and Jari Nurmi. 2022. Towards Approximate Computing for Achieving Energy vs. Accuracy Trade-offs. In Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 632–635.
[121] Aleksandr Ometov, Viktoriia Shubina, Lucie Klus, Justyna Skibińska, Salwa Saafi, Pavel Pascacio, Laura Flueratoru, Darwin Quezada Gaibor, Nadezhda Chukhno, Olga Chukhno, Asad Ali, Asma Channa, Ekaterina Svertoka, Waleed Bin Qaim, Raúl Casanova-Marqués, Sylvia Holcer, Joaquín Torres-Sospedra, Sven Casteleyn, Giuseppe Ruggeri, Giuseppe Araniti, Radim Burget, Jiri Hosek, and Elena Simona Lohan. 2021. A Survey on Wearable Technology: History, State-of-the-Art and Current Challenges. Computer Networks 193 (2021), 108074.
[122] Roberto Osorio and Gabriel Rodriguez. 2019. Truncated SIMD Multiplier Architecture for Approximate Computing in Low-Power Programmable Processors. IEEE Access 7 (2019), 56353–56366.
[123] Matthew Page, Joanne McKenzie, Patrick Bossuyt, Isabelle Boutron, Tammy Hoffmann, Cynthia Mulrow, Larissa Shamseer, Jennifer Tetzlaff, Elie Akl, Sue Brennan, et al. 2021. The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews. BMJ 372 (2021), 11.
[124] Krishna Palem and Avinash Lingamneni. 2013. Ten Years of Building Broken Chips: The Physics and Engineering of Inexact Computing. ACM Transactions on Embedded Computing Systems (TECS) 12, 2s (2013), 1–23.
[125] Alberto Paltrinieri, Riccardo Peloso, Guido Masera, Muhammad Shafique, and Maurizio Martina. 2019. On the Effect of Approximate-Computing in Motion Estimation. Journal of Low Power Electronics 15, 1 (2019), 40–50.
[126] Francesca Palumbo and Carlo Sau. 2021. Reconfigurable and Approximate Computing for Video Coding. arXiv preprint arXiv:2103.03712 (2021), 33.
[127] Keerthana Pamidimukkala, Kyung Ki Kim, Yong-Bin Kim, and Minsu Choi. 2018. Generalized Adaptive Variable Bit Truncation Method for Approximate Stochastic Computing. In Proc. of 15th International SoC Design Conference (ISOCC). IEEE, 218–219.
[128] Behrooz Parhami. 2018. A Case for Table-Based Approximate Computing. In Proc. of 9th Information Technology, Electronics and Mobile Communication Conference (IEMCON). IEEE, 650–653.
[129] Jongse Park, Emmanuel Amaro, Divya Mahajan, Bradley Thwaites, and Hadi Esmaeilzadeh. 2016. AxGames: Towards Crowdsourcing Quality Target Determination in Approximate Computing. In Proc. of 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 623–636.
[130] Zhenghao Peng, Xuyang Chen, Chengwen Xu, Naifeng Jing, Xiaoyao Liang, Cewu Lu, and Li Jiang. 2018. AXNet: ApproXimate Computing Using an End-to-End Trainable Neural Network. In Proc. of 37th International Conference on Computer-Aided Design (ICCAD). ACM, 1–8.
[131] Michael Pfeiffer and Thomas Pfeil. 2018. Deep Learning with Spiking Neurons: Opportunities and Challenges. Frontiers in Neuroscience 12 (2018), 774.
[132] Ali Piri, Sepide Saeedi, Mario Barbareschi, Bastien Deveautour, Stefano Di Carlo, Ian O'Connor, Alessandro Savino, Marcello Traiola, and Alberto Bosio. 2022. Input-Aware Approximate Computing. In Proc. of International Conference on Automation, Quality and Testing, Robotics (AQTR). IEEE, 1–6.
[133] Stefania Preatto, Andrea Giannini, Luca Valente, Guido Masera, and Maurizio Martina. 2020. Optimized VLSI Architecture of HEVC Fractional Pixel Interpolators with Approximate Computing. Journal of Low Power Electronics and Applications 10, 3 (2020), 24.
[134] Waleed Bin Qaim, Aleksandr Ometov, Claudia Campolo, Antonella Molinaro, Elena Simona Lohan, and Jari Nurmi. 2021. Understanding the Performance of Task Offloading for Wearables in a Two-Tier Edge Architecture. In Proc. of 13th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT). IEEE, 1–9.
[135] Fei Qiao, Ni Zhou, Yuanchang Chen, and Huazhong Yang. 2015. Approximate Computing in Chrominance Cache for Image/Video Processing. In Proc. of International Conference on Multimedia Big Data (BigMM). IEEE, 180–183.
[136] Karri Manikantta Reddy, Moodabettu Harishchandra Vasantha, Yernad Balachandra Nithin Kumar, and Devesh Dwivedi. 2020. Design of Approximate Booth Squarer for Error-Tolerant Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 5 (2020), 1230–1241.
[137] Karri Manikantta Reddy, Moodabettu Harishchandra Vasantha, Yernad Balachandra Nithin Kumar, Ch Keshava Gopal, and Devesh Dwivedi. 2021. Quantization Aware Approximate Multiplier and Hardware Accelerator for Edge Computing of Deep Learning Applications. Integration 81 (2021), 268–279.
[138] Francesco Regazzoni and Ilia Polian. 2020. Side Channel Attacks vs Approximate Computing. In Proc. of Great Lakes Symposium on VLSI (GLSVLSI). ACM, 321–326.
[139] Md Farhadur Reza and Paul Ampadu. 2019. Approximate Communication Strategies for Energy-Efficient and High Performance NoC: Opportunities and Challenges. In Proc. of Great Lakes Symposium on VLSI (GLSVLSI). ACM, 399–404.
[140] Gennaro Severino Rodrigues, Juan Fonseca, Fabio Benevenuti, Fernanda Kastensmidt, and Alberto Bosio. 2019. Exploiting Approximate Computing for Low-Cost Fault Tolerant Architectures. In Proc. of 32nd Symposium on Integrated Circuits and Systems Design (SBCCI). IEEE, 1–6.
[141] Kaushik Roy, Akhilesh Jaiswal, and Priyadarshini Panda. 2019. Towards Spike-Based Machine Intelligence with Neuromorphic Computing. Nature 575, 7784 (2019), 607–617.
[142] Jochen Rust, Nils Heidmann, and Steffen Paul. 2017. Approximate Computing of Two-Variable Numeric Functions Using Multiplier-Less Gradients. Microprocessors and Microsystems 48 (2017), 48–55.
[143] Christos Sakalis, Carl Leonardsson, Stefanos Kaxiras, and Alberto Ros. 2016. Splash-3: A Properly Synchronized Benchmark Suite for Contemporary Research. In Proc. of International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 101–111.
[144] Ferdos Salmanpour, Mohammad Hossein Moaiyeri, and Farnaz Sabetzadeh. 2021. Ultra-Compact Imprecise 4:2 Compressor and Multiplier Circuits for Approximate Computing in Deep Nanoscale. Circuits, Systems, and Signal Processing 40, 9 (2021), 4633–4650.
[145] Syed Shakib Sarwar, Gopalakrishnan Srinivasan, Bing Han, Parami Wijesinghe, Akhilesh Jaiswal, Priyadarshini Panda, Anand Raghunathan, and Kaushik Roy. 2018. Energy Efficient Neural Computing: A Study of Cross-Layer Approximations. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 8, 4 (2018), 796–809.
[146] Yuuki Sato, Takanori Tsumura, Tomoaki Tsumura, and Yasuhiko Nakashima. 2015. An Approximate Computing Stack based on Computation Reuse. In Proc. of 3rd International Symposium on Computing and Networking (CANDAR). IEEE, 378–384.
[147] Carlo Sau, Francesca Palumbo, Maxime Pelcat, Julien Heulot, Erwan Nogues, Daniel Menard, Paolo Meloni, and Luigi Raffo. 2017. Challenging the Best HEVC Fractional Pixel FPGA Interpolators with Reconfigurable and Multifrequency Approximate Computing. IEEE Embedded Systems Letters 9, 3 (2017), 65–68.
[148] Ilaria Scarabottolo, Giovanni Ansaloni, George Anthony Constantinides, Laura Pozzi, and Sherief Reda. 2020. Approximate Logic Synthesis: A Survey. Proc. IEEE 108, 12 (2020), 2195–2213.
[149] Michael Schaffner, Frank Kagan Gürkaynak, Aljosa Smolic, Hubert Kaeslin, and Luca Benini. 2014. An Approximate Computing Technique for Reducing the Complexity of a Direct-Solver for Sparse Linear Systems in Real-Time Video Processing. In Proc. of 51st Design Automation Conference (DAC). ACM, 1–6.
[150] Lukas Sekanina. 2021. Evolutionary Algorithms in Approximate Computing: A Survey. arXiv preprint arXiv:2108.07000 (2021), 12.
[151] Sanchari Sen, Swagath Venkataramani, and Anand Raghunathan. 2017. Approximate Computing for Spiking Neural Networks. In Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 193–198.
[152] Sayandeep Sen, Tan Zhang, Syed Gilani, Shreesha Srinath, Suman Banerjee, and Sateesh Addepalli. 2012. Design and Implementation of an "Approximate" Communication System for Wireless Media Applications. IEEE/ACM Transactions on Networking 21, 4 (2012), 1035–1048.
[153] Ramu Seva, Prashanthi Metku, Kyung Ki Kim, Yong-Bin Kim, and Minsu Choi. 2016. Approximate Stochastic Computing (ASC) for Image Processing Applications. In Proc. of 13th International SoC Design Conference (ISOCC). IEEE, 31–32.
[154] Botang Shao and Peng Li. 2015. Array-based Approximate Arithmetic Computing: A General Model and Applications to Multiplier and Squarer Design. IEEE Transactions on Circuits and Systems I: Regular Papers 62, 4 (2015), 1081–1090.
[155] Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. 2016. Edge Computing: Vision and Challenges. IEEE Internet of Things Journal 3, 5 (2016), 637–646.
[156] Majid Shoushtari, Abbas BanaiyanMofrad, and Nikil Dutt. 2015. Exploiting Partially-Forgetful Memories for Approximate Computing. IEEE Embedded Systems Letters 7, 1 (2015), 19–22.
[157] Sharad Sinha and Wei Zhang. 2016. Low-Power FPGA Design Using Memoization-based Approximate Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 24, 8 (2016), 2665–2678.
[158] Midde Venkata Siva and EP Jayakumar. 2020. Approximated Algorithm and Low Cost VLSI Architecture for Edge Enhanced Image Scaling. In Proc. of International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT). IEEE, 125–130.
[159] Farhana Sharmin Snigdha, Deepashree Sengupta, Jiang Hu, and Sachin Sapatnekar. 2016. Optimal Design of JPEG Hardware under the Approximate Computing Paradigm. In Proc. of 53rd Design Automation Conference (DAC). ACM, 1–6.
[160] Leonardo Bandeira Soares, Julio Oliveira, Eduardo Antonio César da Costa, and Sergio Bampi. 2020. An Energy-Efficient and Approximate Accelerator Design for Real-Time Canny Edge Detection. Circuits, Systems, and Signal Processing 39 (2020), 6098–6120.
[161] Haiyue Song, Chengwen Xu, Qiang Xu, Zhuoran Song, Naifeng Jing, Xiaoyao Liang, and Li Jiang. 2018. Invocation-Driven Neural Approximate Computing with a Multiclass-Classifier and Multiple Approximators. In Proc. of 37th International Conference on Computer-Aided Design (ICCAD). ACM, 1–8.
[162] Giuseppe Tagliavini, Andrea Marongiu, Davide Rossi, and Luca Benini. 2016. Always-on Motion Detection with Application-Level Error Control on a Near-Threshold Approximate Computing Platform. In Proc. of 23rd International Conference on Electronics, Circuits and Systems (ICECS). IEEE, 552–555.
[163] Giuseppe Tagliavini, Davide Rossi, Andrea Marongiu, and Luca Benini. 2016. Synergistic HW/SW Approximation Techniques for Ultralow-Power Parallel Computing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 5 (2016), 982–995.
[164] Tomoaki Tsumura, Ikuma Suzuki, Yasuki Ikeuchi, Hiroshi Matsuo, Hiroshi Nakashima, and Yasuhiko Nakashima. 2007. Design and Evaluation of an Auto-Memoization Processor. In Proc. of 25th International Multi-Conference: Parallel and Distributed Computing and Networks (IASTED). ACM, 245–250.
[165] Fengbin Tu, Shouyi Yin, Peng Ouyang, Leibo Liu, and Shaojun Wei. 2018. Reconfigurable Architecture for Neural Approximation in Multimedia Computing. IEEE Transactions on Circuits and Systems for Video Technology 29, 3 (2018), 892–906.
[166] Georgios Tziantzioulis, Ali Murat Gok, SM Faisal, Nikolaos Hardavellas, Seda Ogrenci-Memik, and Srinivasan Parthsarathy. 2016. Lazy Pipelines: Enhancing Quality in Approximate Computing. In Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1381–1386.
[167] Kimiyoshi Usami, Hajime Ochi, and Yoshinori Ono. 2020. Approximate Computing based on Latest-result Reuse for Image Edge Detection. In Proc. of 35th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC). IEEE, 234–239.
[168] Martin Van Leussen, Jos Huisken, Lei Wang, Hailong Jiao, and Jose Pineda De Gyvez. 2017. Reconfigurable Support Vector Machine Classifier with Approximate Computing. In Proc. of Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 13–18.
[169] Swagath Venkataramani, Srimat Tirumala Chakradhar, Kaushik Roy, and Anand Raghunathan. 2015. Approximate Computing and the Quest for Computing Efficiency. In Proc. of 52nd Design Automation Conference (DAC). ACM, 1–6.
[170] Swagath Venkataramani, Vinay Kumar Chippa, Srimat Tirumala Chakradhar, Kaushik Roy, and Anand Raghunathan. 2013. Quality Programmable Vector Processors for Approximate Computing. In Proc. of 46th International Symposium on Microarchitecture (MICRO). IEEE, 1–12.
[171] Swagath Venkataramani, Xiao Sun, Naigang Wang, Chia-Yu Chen, Jungwook Choi, Mingu Kang, Ankur Agarwal, Jinwook Oh, Shubham Jain, Tina Babinsky, et al. 2020. Efficient AI System Design with Cross-Layer Approximate Computing. Proc. IEEE 108, 12 (2020), 2232–2250.
[172] Yan Verdeja Herms and Yanjing Li. 2019. Crash Skipping: A Minimal-Cost Framework for Efficient Error Recovery in Approximate Computing Environments. In Proc. of Great Lakes Symposium on VLSI (GLSVLSI). ACM, 129–134.
[173] Pradnya Vikhar. 2016. Evolutionary Algorithms: A Critical Review and Its Future Prospects. In Proc. of International Conference on Global Trends in Signal Processing, Information Computing and Communication (ICGTSPICC). IEEE, 261–265.
[174] Qian Wang, Youjie Li, and Peng Li. 2016. Liquid State Machine based Pattern Recognition on FPGA with Firing-Activity Dependent Power Gating and Approximate Computing. In Proc. of International Symposium on Circuits and Systems (ISCAS). IEEE, 361–364.
[175] Ting Wang, Qian Zhang, Nam Sung Kim, and Qiang Xu. 2016. On Effective and Efficient Quality Management for Approximate Computing. In Proc. of International Symposium on Low Power Electronics and Design (ISLPED). ACM, 156–161.
[176] Ye Wang, Jian Dong, Qian Xu, Zhaojun Lu, and Gang Qu. 2020. Is It Approximate Computing or Malicious Computing?. In Proc. of Great Lakes Symposium on VLSI (GLSVLSI). ACM, 333–338.
[177] Ye Wang, Jian Dong, Qian Xu, and Gang Qu. 2021. FTApprox: A Fault-Tolerant Approximate Arithmetic Computing Data Format. In Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1548–1551.
[178] Ying Wang, Huawei Li, and Xiaowei Li. 2017. Real-Time Meets Approximate Computing: An Elastic CNN Inference Accelerator with Adaptive Trade-Off between QoS and QoR. In Proc. of 54th Design Automation Conference (DAC). ACM, 1–6.
[179] Yang Wang, Yubin Qin, Dazheng Deng, Jingchuan Wei, Yang Zhou, Yuanqi Fan, Tianbao Chen, Hao Sun, Leibo Liu, Shaojun Wei, et al. 2022. A 28nm 27.5 TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing. In Proc. of International Solid-State Circuits Conference (ISSCC). IEEE, 1–3.
[180] Zhihui Wang, Shouyi Yin, Fengbin Tu, Leibo Liu, and Shaojun Wei. 2018. An Energy Efficient JPEG Encoder with Neural Network Based Approximation and Near-Threshold Computing. In Proc. of International Symposium on Circuits and Systems (ISCAS). IEEE, 1–5.
[181] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. 1995. The SPLASH-2 Programs: Characterization and Methodological Considerations. ACM SIGARCH Computer Architecture News 23, 2 (1995), 24–36.
[182] Di Wu and Joshua San Miguel. 2021. Special Session: When Dataflows Converge: Reconfigurable and Approximate Computing for Emerging Neural Networks. In Proc. of 39th International Conference on Computer Design (ICCD). IEEE, 9–12.
[183] Hang Xiao, Haobo Xu, Xiaoming Chen, Yujie Wang, and Yinhe Han. 2021. Fast and High-Accuracy Approximate MAC Unit Design for CNN Computing. IEEE Embedded Systems Letters 14, 3 (2021), 155–158.
[184] Jie Xiao, Jianhao Hu, and Kaining Han. 2019. Low Complexity Expectation Propagation Detection for SCMA Using Approximate Computing. In Proc. of Global Communications Conference (GLOBECOM). IEEE, 1–6.
[185] Siyuan Xiao, Xiaohang Wang, Maurizio Palesi, Amit Kumar Singh, Liang Wang, and Terrence Mak. 2020. On Performance Optimization and Quality Control for Approximate-Communication-Enabled Networks-on-Chip. IEEE Trans. Comput. 70, 11 (2020), 1817–1830.
[186] Yan Xing, Ziji Zhang, Yiduan Qian, Qiang Li, and Yajuan He. 2018. An Energy-Efficient Approximate DCT for Wireless Capsule Endoscopy Application. In Proc. of International Symposium on Circuits and Systems (ISCAS). IEEE, 1–4.
[187] Qiang Xu, Todd Mytkowicz, and Nam Sung Kim. 2015. Approximate Computing: A Survey. IEEE Design & Test 33, 1 (2015), 8–22.
[188] Siyuan Xu and Benjamin Carrion Schafer. 2018. Toward Self-Tunable Approximate Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 4 (2018), 778–789.
[189] Tongxin Yang, Tomoaki Ukezono, and Toshinori Sato. 2018. A Low-Power yet High-Speed Configurable Adder for Approximate Computing. In Proc. of International Symposium on Circuits and Systems (ISCAS). IEEE, 1–5.
[190] Tongxin Yang, Tomoaki Ukezono, and Toshinori Sato. 2022. Reducing Power Consumption using Approximate Encoding for CNN Accelerators at the Edge. In Proc. of Great Lakes Symposium on VLSI (GLSVLSI). ACM, 229–235.
[191] Wu Yang and Himanshu Thapliyal. 2020. Low-Power and Energy-Efficient Full Adders with Approximate Adiabatic Logic for Edge Computing. In Proc. of Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 312–315.
[192] Wu Yang and Himanshu Thapliyal. 2021. Approximate Adiabatic Logic for Low-Power and Secure Edge Computing. IEEE Consumer Electronics Magazine 11, 1 (2021), 88–94.
[193] Zhixi Yang, Jie Han, and Fabrizio Lombardi. 2015. Transmission Gate-based Approximate Adders for Inexact Computing. In Proc. of International Symposium on Nanoscale Architectures (NANOARCH). IEEE, 145–150.
[194] Zhixi Yang, Ajaypat Jain, Jinghang Liang, Jie Han, and Fabrizio Lombardi. 2013. Approximate XOR/XNOR-based Adders for Inexact Computing. In Proc. of 13th International Conference on Nanotechnology (IEEE-NANO). IEEE, 690–693.
[195] Ruoheng Yao, Lei Chen, Pingcheng Dong, Zhuoyu Chen, and Fengwei An. 2022. A Compact Hardware Architecture for Bilateral Filter with the Combination of Approximate Computing and Look-up Table. IEEE Transactions on Circuits and Systems II: Express Briefs 69, 7 (2022), 3324–3328.
[196] Pruthvy Yellu, Landon Buell, Miguel Mark, Michel A. Kinsy, Dongpeng Xu, and Qiaoyan Yu. 2021. Security Threat Analyses and Attack Models for Approximate Computing Systems: From Hardware and Micro-architecture Perspectives. ACM Transactions on Design Automation of Electronic Systems (TODAES) 26, 4 (2021), 1–31.
[197] Ashkan Yousefpour, Caleb Fung, Tam Nguyen, Krishna Kadiyala, Fatemeh Jalali, Amirreza Niakanlahiji, Jian Kong, and Jason Jue. 2019. All One Needs to Know about Fog Computing and Related Edge Computing Paradigms: A Complete Survey. Journal of Systems Architecture 98 (2019), 289–330.
[198] Shuyuan Yu, Yibo Liu, and Sheldon Tan. 2021. Approximate Divider Design Based on Counting-Based Stochastic Computing Division. In Proc. of 3rd Workshop on Machine Learning for CAD (MLCAD). IEEE, 1–6.
[199] Vinícius Zanandrea, Douglas Borges, Vagner Santos da Rosa, and Cristina Meinhardt. 2021. Exploring Approximate Computing and Near-Threshold Operation to Design Energy-Efficient Multipliers. In Proc. of 34th Symposium on Integrated Circuits and Systems Design (SBCCI). IEEE, 1–6.
[200] Georgios Zervakis, Hassaan Saadat, Hussam Amrouch, Andreas Gerstlauer, Sri Parameswaran, and Jörg Henkel. 2021. Approximate Computing for ML: State-of-the-art, Challenges and Visions. In Proc. of 26th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 189–196.
[201] Xianwei Zhang, Youtao Zhang, Bruce Childers, and Jun Yang. 2017. DrMP: Mixed Precision-Aware DRAM for High Performance Approximate and Precise Computing. In Proc. of 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 53–63.
[202] Yangcan Zhou, Zhiyu Chen, Jun Lin, and Zhongfeng Wang. 2018. A High-Speed Successive-Cancellation Decoder for Polar Codes Using Approximate Computing. IEEE Transactions on Circuits and Systems II: Express Briefs 66, 2 (2018), 227–231.
[203] Yangcan Zhou, Jun Lin, and Zhongfeng Wang. 2017. Energy Efficient SVM Classifier Using Approximate Computing. In Proc. of 12th International Conference on ASIC (ASICON). IEEE, 1045–1048.
[204] Feiyu Zhu, Shaowei Zhen, Xilin Yi, Haoran Pei, Bowen Hou, and Yajuan He. 2022. Design of Approximate Radix-256 Booth Encoding for Error-Tolerant Computing. IEEE Transactions on Circuits and Systems II: Express Briefs 69, 4 (2022), 2286–2290.
A METHODOLOGY

For this systematic literature review, we followed the PRISMA guidelines [123]. As such, we initially set out to identify a set of appropriate search terms formed by keywords and their synonyms, allowing us to form a comprehensive search expression. A scan of the most frequently cited, relevant survey papers led to the following expression:

( approximat* OR inexact OR inaccurate OR "good enough" ) AND ( computing OR edge OR cloud OR fog OR communic* OR wireless )

A search was performed using Scopus [45]. We applied filters limiting results to English publications from 2013 till 2022, whose main topic is in computing or engineering. This resulted in a total of 1309 potentially relevant publications (as of August 1, 2022). We filtered these results in three rounds: 1) a coarse filtering based primarily on paper titles; 2) a filtering based on abstracts and conclusions; and 3) a fine filtering of the remaining papers based on a brief overview of their full contents. We applied the following general exclusion criteria:
C1 works not related to AxC or Edge computing;
C2 works on software-only or emerging transistor technology-based AxC techniques, including, e.g., memristive devices;
C3 works on AxC frameworks whose main contributions are not hardware but rather related to synthesis, compilation, significance analysis, or tooling;
C4 invited works with no technical content; and
C5 full text not available.
Criteria C2 and C3 ensure the review's relevancy to most digital hardware design researchers but imply the exclusion of articles such as [11, 52, 74, 130]. We further exclude the majority of survey and review works, including only those considered fundamental to the AxC field and those covering related topics not comprised by other publications. We also filter clearly iterative works, excluding all but the most recent related publication. Lastly, to avoid unreasonably broadening the scope of the review, we limit inclusions to direct search results and, thus, do not include works cited by the selected publications. All three authors have participated in the filtering to ensure its fairness.

The above resulted in a set of 167 relevant publications, distributed over the included years as shown in Fig. 11. The number of publications is on an upward trend with, so far, most publications in 2019, indicating an increasing interest in the field.

Fig. 11. Distribution of selected publications per year with lines for mean +/- one standard deviation.

We have classified 140 of the selected publications into the following main categories (as previously highlighted in Fig. 2): fundamental AxC techniques: general approximation methods that can be applied in architectures for different applications; AxC-enabled hardware architectures: larger-scale architectures incorporating one or more fundamental AxC techniques; and applications of AxC: examples of AxC techniques applied to one or more specific applications. The remaining 27 works are surveys or survey-style papers that motivate the use of AxC. Thus, the unclassified works span several of the listed categories.
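For illustration, the boolean expression above translates directly into an executable pre-screening filter, e.g., for records exported from a bibliographic database. The short Python sketch below is our own hypothetical example; the record list and field names are invented, and the trailing wildcards (approximat*, communic*) become bare prefixes in the regular expressions.

import re

# Two keyword groups mirroring the review's search expression.
GROUP_A = re.compile(r"approximat|inexact|inaccurate|good enough", re.IGNORECASE)
GROUP_B = re.compile(r"computing|edge|cloud|fog|communic|wireless", re.IGNORECASE)

# Hypothetical exported records; only title and year are used here.
records = [
    {"title": "Approximate Multipliers for Edge Inference", "year": 2020},
    {"title": "Exact Verification of Cloud Schedulers", "year": 2019},
]

def matches(title):
    # A record qualifies only if it hits at least one term from each group (AND).
    return bool(GROUP_A.search(title) and GROUP_B.search(title))

hits = [r["title"] for r in records
        if 2013 <= r["year"] <= 2022 and matches(r["title"])]
print(hits)   # ['Approximate Multipliers for Edge Inference']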
B OVERVIEW TABLES

Table 6. A summary of reviewed fundamental AxC techniques by class and publication year.

Circuit-level:
[53] 2015. Exploration of availability and power gains from reducing refresh rate in embedded DRAMs. Merits: Availability gains are inversely proportional to transistor size. Demerits: N/A.
[109] 2015. A scheme for dynamic voltage-accuracy scaling in pipelined designs. Merits: Energy savings with run-time reconfigurability. Demerits: Some area and leakage power overheads.
[156] 2015. Using caches with faulty ways due to manufacturing imperfections in approximate systems. Merits: Improved yield and leakage energy savings at negligible errors. Demerits: Relies on developer annotation.
[163] 2016. An ultra-low power processing system for NTV operation utilizing a hybrid SRAM/standard cell memory architecture. Merits: Reduced energy consumption with compiler support. Demerits: Relies on developer annotation.
[166] 2016. Lazy write-back of results from voltage over-scaled FUs allowing for their results to converge closer to exact ones. Merits: Reduced bit error rate at low FU utilization. Demerits: Hardware overheads for extra issue and write-back logic.
[192] 2021. Adiabatic logic as an alternative to CMOS, specifically for approximate adders (like [191]). Merits: Improved power consumption and resilience to side-channel attacks. Demerits: Costly clock net distribution required for adiabatic logic.

Arithmetic:
[194] 2013. Three approximate full adders based on XOR/XNOR logic. Merits: Reduced area and power consumption. Demerits: Missing error analysis when applied to n-bit adders.
[7] 2014. Four approximate full adders based on CMOS logic. Merits: Reduced area and power consumption. Demerits: N/A.
[83] 2014. Approximate segmented adder and comparator with reduced-length carry propagation. Merits: Improved area, power, delay, and error metrics. Demerits: N/A.
[29] 2015. Three approximate subtractors and an approximate divider utilizing them. Merits: Reduced power consumption at negligible quality degradation. Demerits: N/A.
[154] 2015. A technique for designing approximate array-based multipliers and squarers with signature-based error compensation. Merits: Reduced error metrics. Demerits: Area and power overheads from compensation logic.
[193] 2015. Two approximate full adders based on transmission gates. Merits: Reduced delay and power consumption. Demerits: N/A.
[42] 2017. Four approximate full adders based on CMOS logic and a segmented adder with error compensation. Merits: Reduced delay and error metrics. Demerits: Area and power overheads from compensation logic.
[97] 2017. Two approximate Booth encodings for multiplier design. Merits: Reduced area, power, delay, and error metrics. Demerits: N/A.
[17] 2018. A technique for approximate partial product reduction in array-based multipliers with maskable carries. Merits: Reduced area and power consumption. Demerits: Worsened error metrics.
[35] 2018. Approximate segmented adder with feedback-based error compensation. Merits: Reduced area, power, delay, and error metrics. Demerits: N/A.
[64] 2018. Approximate floating-point MAC unit with OR-based mantissa addition, error compensation, and exact re-computation. Merits: Improved energy efficiency and system speedup. Demerits: Area overhead from error compensation.
[106] 2018. An approximate compressor based on majority gates for partial product reduction. Merits: Reduced area, power, delay, and error metrics. Demerits: N/A.
[189] 2018. Approximate segmented adder with maskable carry propagation. Merits: Reduced power consumption with run-time reconfigurability. Demerits: Area overhead from carry masking logic.
[66] 2019. Four approximate subtractors and an approximate divider utilizing them. Merits: Reduced area, delay, and power at negligible quality degradation. Demerits: Missing error analysis when applied to n-bit dividers.
[103] 2019. Design of concatenated arithmetic circuit-based accelerators which cancel out errors through use of complementary circuits with opposite error polarity. Merits: Reduced error with no effect on power consumption. Demerits: Requires deterministic error distribution.
[122] 2019. Design of approximate SIMD-style multiplier with adapted Booth encoding and truncated partial product array. Merits: Reduced area and energy consumption. Demerits: N/A.
[62] 2020. A technique for designing fixed-width multipliers with Booth-encoded partial products with low errors based on probability analysis. Merits: Reduced area, power, delay, and error metrics. Demerits: N/A.
[99] 2020. Piece-wise linear approximation of Nth roots of floating-point numbers with corresponding hardware architecture. Merits: Reduced area, power, and latency at no quality degradation. Demerits: N/A.
[116] 2020. Approximate adder designs with limited carry propagation targeting FPGAs. Merits: Reduced power consumption and error probability. Demerits: High mean error distance from approximate carry propagation.
[136] 2020. Design of squarers with approximate Booth encoding and partial product reduction. Merits: Reduced area, power, delay, and error metrics. Demerits: N/A.
[191] 2020. Two approximate full adders based on adiabatic logic. Merits: Reduced area and power consumption. Demerits: Costly clock net distribution required for adiabatic logic.
[101] 2021. Sequential multiplier with approximate partial product accumulation. Merits: Reduced area, power, delay, and error metrics. Demerits: Unclear implementation details.
[144] 2021. An approximate compressor based on CMOS logic for partial product reduction. Merits: Reduced area, power, and delay. Demerits: Substantially worsened output quality.
[183] 2021. Approximate MAC unit with inexact partial product reduction and accumulation. Merits: Reduced area, power, and delay. Demerits: Slightly worsened output quality.
[199] 2021. Exploration of using inexact adders in different multipliers in NTV operation. Merits: Different multipliers show different benefits of approximation. Demerits: Unclear power consumption benefits.
[132] 2022. A technique for designing approximate multipliers for particular ranges of input values. Merits: N/A. Demerits: Mismatching conclusions and reported results.
[204] 2022. Approximate high-radix Booth encoding for multiplier design. Merits: Improved error metrics at low area. Demerits: N/A.

Stochastic:
[153] 2016. Truncated seeds for bitstream generation compensated for by increased run-time. Merits: Reduced overall bitstream generation run-time. Demerits: Lacks run-time reconfigurability.
[127] 2018. Generalization of [153] to dynamically truncate operands based on most significant asserted bit. Merits: Improved output quality. Demerits: Only works for unsigned operands.
[70] 2019. Approximate adder design with the least-significant bits computing stochastically. Merits: Reduced area and run-time. Demerits: Uncertain output quality results.
[20] 2020. Extended stochastic number representation supporting subtraction and division operations. Merits: Reduced area, power, delay, and error metrics. Demerits: N/A.
[48] 2021. Approximate unary constant coefficient multiplier design. Merits: Reduced area. Demerits: Complex implementation of unary-binary conversions.
[198] 2021. Counter-based approximate divider with deterministic bitstream generation and early termination. Merits: Reduced area, delay, and energy consumption. Demerits: N/A.
Others:
- [38] (2015) Piece-wise Taylor expansion-based approximation of non-linear functions and corresponding co-processor architecture. Merits: Low-overhead with arbitrary accuracy. Demerits: Limited applicability given popular alternatives, e.g., ReLU.
- [157] (2016) HLS post-processing step for generating and inserting static and dynamic memoization wrappers on specific logic blocks. Merits: Significant power savings at low resource consumption overheads. Demerits: No reconfigurability at run-time.
- [142] (2017) Segmented linear approximation of complex two-variable functions implemented in multiplier-free binary trees. Merits: Low hardware complexity and efficient generation flow. Demerits: Potentially high memory area.
- [43] (2018) Evaluation of approximate adders and memoization on FPGAs. Merits: Clear benefits of memoization on FPGAs. Demerits: Limited evaluation.
- [128] (2018) Table-based approximate arithmetic circuits. Merits: High power efficiency and easy error analysis. Demerits: Potentially high memory area.
- [98] (2019) Stochastic computing-based function approximation architectures. Merits: Low area, power, and error metrics. Demerits: Long latency per function evaluation.
- [71] (2020) Table-based technique based on pre-computed K-means clusterings and run-time nearest-centroid classification. Merits: Low hardware complexity. Demerits: No reconfigurability at run-time.
- [72] (2020) CGRA-like accelerator for bi-section NNs used for neural approximation of arbitrary input functions. Merits: Low-complexity architecture with run-time reconfigurability. Demerits: Bi-section NNs are (not yet) popular.
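The table-based and memoization entries above ([128], [71], [157], [43]) largely reduce to quantizing an input and reusing one precomputed result per bucket. A hypothetical, software-only sketch of that pattern; the class name and midpoint bucket policy are our own choices, not taken from the cited papers:

```python
import math

class ApproxMemo:
    """Table-based approximate memoization: quantize the input and reuse a
    precomputed result for the whole bucket. `bits` sets the quality/size
    trade-off (larger table, smaller quantization error)."""

    def __init__(self, fn, lo: float, hi: float, bits: int = 8):
        self.fn, self.lo, self.hi, self.size = fn, lo, hi, 1 << bits
        step = (hi - lo) / self.size
        # Precompute one representative output per bucket (its midpoint).
        self.table = [fn(lo + (i + 0.5) * step) for i in range(self.size)]

    def __call__(self, x: float) -> float:
        i = int((x - self.lo) / (self.hi - self.lo) * self.size)
        return self.table[min(max(i, 0), self.size - 1)]

approx_sin = ApproxMemo(math.sin, 0.0, math.pi, bits=8)
print(approx_sin(1.0), math.sin(1.0))  # close, at the cost of quantization error
```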
Table 7. A summary of reviewed AxC-enabled architectures by class and publication year.

GPPs:
- [170] (2013) Quality-programmable ISA and corresponding vector processor micro-architecture supporting dynamic precision scaling. Merits: Fine-grained quality control and monitoring at run-time. Demerits: Missing compilation tool flow.
- [57] (2015) Heterogeneous computing platform with approximate/exact neural accelerators and dynamic error control through LWC-based multi-stage acceleration. Merits: Improved power consumption, quality control, and reconfigurability. Demerits: Limited NN flexibility, and invocation of multiple approximators.
- [104] (2015) Cache architecture exploiting approximate similarity between cache lines and extra indirection to reduce data storage area. Merits: Significant on-chip area and power reductions. Demerits: Complex hashing hardware, and no run-time quality control.
- [110] (2015) Systolic array-based NN accelerator targeting FPGA-equipped SoCs. Merits: Significant power savings and better performance than HLS. Demerits: Limited flexibility for NN topologies, and no run-time quality control.
- [146] (2015) Developer-controlled approximate auto-memoization capable GPP architecture with compiler support. Merits: High degree of automation and fine-grained error control. Demerits: Great initial adaptation effort and hardware overheads.
- [61] (2016) Significance-aware memoization scheme based on regressions and dynamically truncated ternary CAM lookups. Merits: Significant speedup with fine-grained quality control. Demerits: Great hardware complexity of ternary CAM.
- [27] (2017) Approximate memoization-extended Floating-Point Unit of a RISC-style GPP with custom instruction-based control. Merits: Significant speedup and dynamic approximation control. Demerits: Missing information on insertion of SXL instructions.
- [201] (2017) Three schemes for fine-grained data management in DRAM for trading off performance and energy-efficiency for errors. Merits: Significant speedup and energy savings. Demerits: Requires both OS and hardware support.
- [161] (2018) Multi-class classifier and multiple approximator architecture for NN-based approximation with corresponding training algorithm and hardware accelerator. Merits: Increased number of invocations, speedup and energy reduction. Demerits: No quality guarantee.
- [113] (2019) Energy-error trade-off exploration with truncated integer arithmetic/logic and load/store units in a RISC-V core. Merits: Easy evaluation of AxC in GPPs. Demerits: Limited potential energy savings from integer arithmetic.
- [117] (2019) Relaxed coherence requirements on shared approximable data with an approximate store instruction in many-core GPPs. Merits: Improved application run-time and energy consumption. Demerits: High hardware overhead from extra cache.
- [118] (2019) Selectively disabling roll-back on branch and load value mispredictions in speculative GPPs based on sensitivity analysis. Merits: Extensive sensitivity analysis without developer interference. Demerits: No quality guarantee from Bayesian analysis.
- [82] (2021) ISA and micro-architecture extension to detect value similarity and subsequently skip entire instruction sequences in GPPs. Merits: Significant speedup and dynamic approximation control. Demerits: Difficult to adapt to out-of-order cores.

CGRAs:
- [33] (2013) Dynamic effort scaling-based processor for RMS applications with error resiliency estimation flow. Merits: Reduced energy consumption, quality control and reconfigurability. Demerits: Limited applicability beyond RMS kernels.
- [2] (2018) Quality-scalable CGRA enabling run-time adaptation of hardware approximations to application quality constraints. Merits: Improved power consumption and adjustable quality constraints. Demerits: Limited quality controllability due to static AxC implementation.
- [41] (2020) CGRA with run-time adaptation of hardware approximations to quality constraints from predetermined operating points. Merits: Reduced power consumption with dynamic adaptation. Demerits: Limited applicability due to simple PEs.

NoCs:
- [15] (2018) Developer-controlled, variable transmission power-based approximate wireless NoC enabling unreliable loads and stores to approximable data structures in multicores. Merits: Significant power savings with little impact on application output quality. Demerits: Manual annotation of approximable data structures.
- [49] (2019) Approximate wired/wireless NoC architecture with software-exposed broadcast memories enabling fast synchronization and data sharing. Merits: Significant speedup, reduced energy, and low-effort adaptation. Demerits: Hardware overhead (particularly in broadcast memories).
- [111] (2019) Integer-value coding schemes based on swap and inversion techniques and extended by crosstalk-avoidance coding for approximate wired on-chip interconnects. Merits: Reduced mean squared error of received data. Demerits: Hardware overheads and brute-force crosstalk-avoidance coding search.
- [139] (2019) Truncation-based approximate network interface for wired NoCs. Merits: Simple hardware extension. Demerits: No cumulative error control.
- [185] (2020) Dynamic quality control in approximate NoCs supporting packet dropping and compression/decompression. Merits: Reduced energy consumption and guaranteed error constraints. Demerits: Relies on developer annotation and compile-time analysis.
- [107] (2022) Truncation- and VOS-based network interface and router for wired NoCs. Merits: Improved latency and energy efficiency. Demerits: No quantitative error evaluation.
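The truncation-based network interfaces above ([139], [107]) drop low-order payload bits at the sender and zero-fill at the receiver, shrinking flits at a bounded per-value error. A schematic model of that end-to-end effect, with function names of our own choosing:

```python
def send_approx(value: int, drop: int) -> int:
    """Sender-side truncation: the drop least-significant bits of an
    approximable payload are removed, so the flit carries fewer bits."""
    return value >> drop

def receive_approx(flit: int, drop: int) -> int:
    """Receiver-side expansion: lost bits come back as zeros, giving a
    bounded per-value error below 2**drop."""
    return flit << drop

flit = send_approx(0xDEADBEEF, drop=8)
print(hex(receive_approx(flit, drop=8)))  # 0xdeadbe00
```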
Table 8. A summary of reviewed applications of AxC by class and publication year.

ML:
- [81] (2014) Stereo image matching accelerator with ANT mitigating approximate additions and voltage over-scaling. Merits: Reduced voltage and power consumption. Demerits: Significant hardware overhead.
- [174] (2016) Liquid state machine hardware accelerator for FPGA with activity-based power gating and quality-configurable adders. Merits: Reduced power consumption. Demerits: No SNN topology flexibility.
- [65] (2017) Exploration of efficient digital, memristive, and analog memory architectures for hyperdimensional computing. Merits: Analog memory revealed as the most efficient architecture. Demerits: N/A.
- [80] (2017) Approximate hardware accelerator for MLPs with on-chip training, synapse skipping, and dynamic precision control. Merits: Reduced power consumption with on-chip training. Demerits: No run-time inference details.
- [85] (2017) Approximate hardware accelerator for NN with stochastic neurons. Merits: Greatly reduced area at negligible accuracy degradation. Demerits: Inflexible hardware implementation with physical neurons.
- [151] (2017) SNN framework and hardware accelerator with activity-based synapse skipping and early termination. Merits: Reduced energy consumption with run-time quality control. Demerits: N/A.
- [168] (2017) Hardware accelerator for linear and polynomial kernel SVM models with approximate multipliers. Merits: Reduced area and power consumption. Demerits: No run-time reconfigurability and quality control.
- [178] (2017) Design flow and reconfigurable hardware architecture for quality-trade-off-aware CNN acceleration. Merits: Dynamic quality (or performance/energy) trade-off. Demerits: Software-level approximations only.
- [203] (2017) Hardware accelerator for Gaussian kernel SVM models with approximate adders and multipliers. Merits: Reduced area and power consumption. Demerits: No run-time reconfigurability and quality control.
- [90] (2018) Mixed-signal approximate hardware accelerator for hybrid BWN and LSTM networks for speech recognition. Merits: Improved power efficiency. Demerits: Limited benefits beyond processing scaling.
- [145] (2018) Comprehensive study of effects arising from applying cross-layer approximation techniques to NN inference. Merits: Significant power savings with little impact on NN accuracy. Demerits: No reconfigurability at run-time.
- [165] (2018) NN accelerator with scheduler-supported dataflow reconfiguration at run-time and corresponding performance model. Merits: Significant speedup and increased power efficiency. Demerits: No quality control.
- [50] (2019) Algorithmic and hardware approximations for low-power SVM acceleration. Merits: Vastly reduced area and energy consumption. Demerits: Unclear implementation details.
- [56] (2019) Hardware accelerator for CNNs with approximate adders and multipliers. Merits: Improved throughput and energy efficiency. Demerits: Complex design space exploration and mapping flows.
- [92] (2019) Mixed-signal hardware accelerator for DNN-based VAD with dynamic alphabet-set multipliers. Merits: Improved power efficiency with run-time quality reconfigurability. Demerits: Low throughput.
- [93] (2019) Hardware accelerator for KWS based on quantized CNNs and voltage-based approximate multiplication. Merits: Reduced word error rate. Demerits: N/A.
- [24] (2020) Approximate im2col architecture for IoT edge CNN acceleration. Merits: Speedup with little quality degradation. Demerits: Significant area overheads.
- [89] (2020) Hardware accelerator for BWN-based VAD with dynamic SNR-based arithmetic precision scaling. Merits: Improved classification accuracy. Demerits: N/A.
- [86] (2020) Hardware accelerator for BWN-based KWS with delay-based approximate addition. Merits: Improved resilience toward noise. Demerits: N/A.
- [91] (2020) Hardware accelerator using CNNs quantized into BWNs and approximate self-adaptive adders. Merits: Reduced power per operation. Demerits: N/A.
- [69] (2020) Hardware accelerator for LSTMs using similarity-based cell skipping/approximation and sparsity-aware memory accesses. Merits: Number of MACs and memory accesses halved. Demerits: Works only for bi-directional LSTMs.
- [75] (2020) Hardware accelerator for ensemble-based anomaly detection with on-chip training and quantized approximate evaluation. Merits: Improved energy efficiency with run-time adaptability. Demerits: No throughput comparisons with related works.
- [78] (2020) Hardware accelerator for hyper-dimensional classification tasks with approximate encoding and accumulation. Merits: Significant speedup at reduced resource and energy consumption. Demerits: No run-time reconfigurability.
- [171] (2020) Cross-layer approximation-aware DNN design flow, hardware accelerator, and model-to-hardware mapping. Merits: Run-time reconfigurability and near-optimal resource utilization. Demerits: No system-wide evaluation results.
- [30] (2021) Technique combining contrast reduction and selective truncation of pixels in image classification applications on NoCs. Merits: Overall speedup, reduced NoC latency and power. Demerits: Narrow scope of application.
- [58] (2021) Approximate multipliers for mitigating the effectiveness of adversarial attacks on CNNs. Merits: Reduced transferability, resource and power consumption. Demerits: No guaranteed protection.
- [84] (2021) Custom encoding and hardware implementation of binary-operand dot products for CNN acceleration. Merits: Increased operating frequency and throughput. Demerits: No run-time reconfigurability.
- [137] (2021) Approximate radix-4 multiplier and GEMM-style accelerator architecture targeting FPGAs. Merits: High throughput, reduced resource and power consumption. Demerits: Needs weight re-ordering.
- [182] (2021) Accelerator architecture for NNs combining matrix-matrix and linear approximation functionalities. Merits: Reduced area and power consumption. Demerits: Large quality degradation in some configurations.
- [94] (2022) Approximate SIMD-like multiplier for CNN acceleration with approximate partial product generation and reduction. Merits: Reduced area, power, and delay. Demerits: N/A.
- [87] (2022) Hardware accelerator using CNNs quantized into BWNs and approximate dual-voltage capable adders (extends [91]). Merits: Reduced power per operation. Demerits: N/A.
- [179] (2022) Accelerator architecture for transformer NNs with approximate arithmetic, sparsity speculation, and out-of-order scheduling. Merits: Greatly improved energy efficiency. Demerits: N/A.
- [190] (2022) Approximate maskable input encoding for CNN accelerators. Merits: Low-overhead and non-intrusive technique. Demerits: Limited benefits without significant accuracy degradation.
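Many of the CNN accelerators above ([56], [94], [137]) place inexact multipliers inside the MAC loop. As a crude software stand-in for such a multiplier (operand truncation rather than the partial-product pruning of the cited designs; names and truncation depth are illustrative), the following sketch shows where the approximation sits in a dot product:

```python
def approx_mul(a: int, b: int, trunc: int = 4) -> int:
    """Approximate fixed-point multiply: drop the trunc LSBs of each
    operand before multiplying, then rescale to the exact magnitude."""
    return ((a >> trunc) * (b >> trunc)) << (2 * trunc)

def approx_dot(xs, ws, trunc: int = 4) -> int:
    """MAC loop as found in CNN/NN accelerators, with inexact multiplies."""
    return sum(approx_mul(x, w, trunc) for x, w in zip(xs, ws))

xs, ws = [103, 250, 87, 14], [33, 91, 120, 7]
print(approx_dot(xs, ws), sum(x * w for x, w in zip(xs, ws)))
```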
Image processing:
- [37] (2016) Canny edge detection accelerator with approximate Gaussian and gradient filters. Merits: Reduced area and energy per image. Demerits: No run-time reconfigurability.
- [159] (2016) ILP-based optimization of a Loeffler-style DCT implementation with approximate adders and multipliers. Merits: Reduced power consumption and area with bounded error variance. Demerits: Only arithmetic is approximated.
- [9] (2017) Framework for error analysis of multiplier-less approximate adder-based DCT implementations. Merits: Shows clear approximation potential. Demerits: All additions are subject to the same approximation.
- [180] (2018) CGRA-like accelerator for MLP-based approximation of the DCT algorithm. Merits: Significantly reduced latency and energy-delay product metrics. Demerits: Increased area and energy per pixel.
- [186] (2018) DCT implementation with approximate weights, thresholding of intermediate values, and inexact adders. Merits: Improved output quality at low energy consumption. Demerits: N/A.
- [158] (2020) Image scaling accelerator based on bi-linear interpolation and approximate edge detection and sharpening filters. Merits: N/A. Demerits: No run-time reconfigurability.
- [160] (2020) Canny edge detection accelerator with approximate Gaussian filters, gradient filters, and gradient direction and magnitude (like [37]). Merits: Reduced area and energy per image. Demerits: No run-time reconfigurability.
- [167] (2020) Memory-free, most-recent-result reuse technique applied to Kirsch edge detection. Merits: N/A. Demerits: Limited benefits due to hardware overheads.
- [108] (2022) Canny edge detection accelerator with reduced-size Gaussian and gradient filters. Merits: Large area and power savings. Demerits: Substantially worsened output quality.
- [195] (2022) Bilateral filter accelerator for image denoising with approximated filter weights and inexact division. Merits: Large area savings with negligible output quality degradation. Demerits: N/A.
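The Canny accelerators above ([37], [160], [108]) approximate the Gaussian smoothing stage; a common hardware simplification is restricting kernel weights to powers of two so that every multiplication and the normalizing division become shifts. A sketch using the 3x3 binomial kernel (our choice of kernel for illustration, not necessarily the one used in the cited works):

```python
def approx_gaussian3x3(img, r, c):
    """3x3 Gaussian blur with power-of-two weights (the 1-2-1 binomial
    kernel), so multiplies are shifts and the divide-by-16 is a shift."""
    k = [(1, 2, 1), (2, 4, 2), (1, 2, 1)]  # weights sum to 16
    acc = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            acc += img[r + dr][c + dc] * k[dr + 1][dc + 1]
    return acc >> 4  # divide by 16 via right shift

img = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]
print(approx_gaussian3x3(img, 1, 1))  # smoothed center pixel
```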
Video processing:
- [149] (2014) Approximate hardware accelerator for Cholesky decomposition of sparse linear systems, useful in video coding. Merits: Great throughput improvements. Demerits: N/A.
- [135] (2015) Extended cache hierarchy for approximately reusing chrominance data for image and video coding. Merits: Reduced number of off-chip memory accesses. Demerits: Intrusive technique with unclear hardware overhead.
- [44] (2017) Heterogeneous SAD accelerator with different approximation tiles and power-gating. Merits: Power savings from run-time reconfigurability. Demerits: All additions are subject to the same approximation.
- [147] (2017) Reconfigurable fractional pixel interpolation circuit with approximate filters and clock-gating targeting FPGAs. Merits: Reduced power and energy with run-time reconfigurability. Demerits: Unclear approximation error impact.
- [125] (2019) Application of approximate additions/subtractions in an optimized SAD accelerator. Merits: N/A. Demerits: No reconfigurability, insignificant power and area savings.
- [133] (2020) Reconfigurable approximate adder-based accelerator architecture for fractional pixel interpolation targeting ASICs. Merits: Reduced power consumption with run-time reconfigurability. Demerits: Area overhead.
- [126] (2021) Exploration of various AxC techniques applicable to HEVC decoding. Merits: Potentially great savings. Demerits: Some techniques require purpose-built hardware.

Reliability:
- [28] (2017) Two approximate voting schemes for DMR and TMR systems. Merits: Reduced power consumption and increased fault tolerance. Demerits: N/A.
- [140] (2019) Approximate TMR scheme based on precision-scaling of converted fixed-point modules. Merits: Reduced area and increased reliability. Demerits: Only considers precision-scaling.
- [172] (2019) Skipping crashes in non-critical code regions instead of restarting. Merits: Great run-time benefits with limited hardware overhead. Demerits: Requires both OS and hardware support.
- [12] (2021) Approximate TMR scheme using probability analysis-based over- and under-approximated modules. Merits: Reduced detection energy and algorithmic complexity. Demerits: Single fault detection not guaranteed.
- [39] (2021) An approximate QMR scheme based on output subsetting and majority voting. Merits: Potentially improved multiple fault resilience. Demerits: Potentially increased area overhead.
- [112] (2021) Exploiting area savings from approximating DMR systems to implement extra hardware, improving throughput. Merits: Enables trading off accuracy for throughput. Demerits: N/A.
- [177] (2021) A truncated integer data format with built-in fault tolerance. Merits: Significance-relative errors and reduced power consumption. Demerits: Unclear encoding of negative numbers.
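The voting schemes above ([28], [140], [12]) build on the classic bit-wise majority voter; approximate variants pair it with cheaper, reduced-precision replicas. A sketch of that combination, where the precision-scaled replica is our illustrative stand-in for the cited module designs:

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bit-wise majority voter at the core of TMR: each output bit takes
    the value at least two of the three replicas agree on."""
    return (a & b) | (a & c) | (b & c)

def precision_scaled(x: int, drop: int) -> int:
    """A cheap replica that only computes the upper bits, in the spirit of
    the precision-scaled approximate TMR schemes above."""
    return (x >> drop) << drop

exact = 0xCAFEBABE
r1 = precision_scaled(exact, 8)           # approximate replica
r2 = exact ^ (1 << 30)                    # replica hit by a bit flip
print(hex(majority_vote(exact, r1, r2)))  # fault masked: 0xcafebabe
```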
Other applications:
- [152] (2013) Approximate wireless communication from importance-based data interleaving and non-trivial symbol encoding. Merits: Increased application signal-to-noise ratio with no re-transmissions. Demerits: Hardware and communication overheads.
- [102] (2015) Distributed arithmetic-based approximate DWT accelerator. Merits: Improved performance at reduced hardware complexity. Demerits: N/A.
- [23] (2016) Approximate algorithm and systolic array-based accelerator for data detection in MIMO-based wireless communication. Merits: Reduced latency and improved SNR. Demerits: Convergence not guaranteed.
- [162] (2016) Ultra-low power architecture for always-on motion detection with approximate hybrid memory architecture at NTV. Merits: Improved energy efficiency. Demerits: Large hardware overheads from standard cell memory.
- [19] (2018) Parallel and CGRA-equipped computing system for DSP in IoT edge devices. Merits: Run-time quality control and reconfigurability. Demerits: N/A.
- [202] (2018) Optimizations, limitations, and approximations in a successive cancellation decoder for polar codes. Merits: Increased operating frequency and throughput. Demerits: N/A.
- [60] (2019) Evaluation of approximate adders for FFT and IFFT in wireless communication. Merits: Clear approximation potential. Demerits: No power analysis.
- [184] (2019) Applying mathematical approximations to reduce hardware complexity of expectation propagation in SCMA systems. Merits: Near-optimal performance with reduced hardware complexity. Demerits: N/A.
- [55] (2021) Optimized approximate architecture for localization and mapping accelerated on FPGA. Merits: Improved throughput and reduced energy per frame. Demerits: Increased resource utilization.
- [63] (2021) Evaluation of approximate MAC units in FIR filters used for wireless communication. Merits: Significantly reduced dynamic power consumption. Demerits: Degrading performance in multi-channel systems.
- [68] (2021) Improving mixed-criticality system throughput by approximating low- to medium-criticality tasks. Merits: Increased low-criticality task survivability. Demerits: Requires OS and hardware support.
- [88] (2021) Low-power Mel-frequency cepstral component accelerator with dual voltage-capable inexact adders and multipliers. Merits: Greatly reduced power consumption. Demerits: N/A.
- [40] (2022) Algorithmic approximations of a fuel estimation algorithm for cars with corresponding hardware architecture. Merits: Significant area savings. Demerits: Incorrectly pipelined architecture.
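For the FIR evaluations of [63], the approximation sits inside the MAC unit. A rough software model in which each product is truncated before accumulation; the truncation depth and filter taps are illustrative, not taken from the cited evaluation:

```python
def approx_fir(samples, taps, drop: int = 6):
    """Integer FIR filter with inexact MACs: each product loses its drop
    least-significant bits before accumulation, trading output SNR for
    narrower (cheaper) accumulator datapaths."""
    out = []
    for n in range(len(taps) - 1, len(samples)):
        acc = sum((samples[n - k] * h) >> drop for k, h in enumerate(taps))
        out.append(acc << drop)  # rescale to the exact filter's magnitude
    return out

sig = [100, 120, 90, 80, 130, 110, 95]
taps = [50, 120, 50]  # crude low-pass
exact = [sum(sig[n - k] * h for k, h in enumerate(taps))
         for n in range(len(taps) - 1, len(sig))]
print(approx_fir(sig, taps))
print(exact)  # per-sample error bounded by the dropped product bits
```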