SELF-TIMED LENGTH-ADAPTIVE ARITHMETIC

A Thesis Presented to the Faculty of the Graduate School of Cornell University In Partial Fulfillment of the Requirements for the Degree of Doctor of Engineering

by Edward Arthur Bingham
December 2020

© 2020 Edward Arthur Bingham
ALL RIGHTS RESERVED

SELF-TIMED LENGTH-ADAPTIVE ARITHMETIC
Edward Arthur Bingham, Ph.D.
Cornell University, 2020

Diminishing returns in technology scaling have motivated a resurgence of exploration into new computer architectures. While Coarse Grained Reconfigurable Arrays show promise in accelerating commonly used complex operations, their overall capacity remains fairly limited. And while there is pressure on general purpose systems to support wide operations, the typical workload mostly exercises the lower 10 to 15 bits, leaving most of the array powered on but unused during normal operation. This thesis presents adaptive digit-serial arithmetic as a plug-and-play method to support a variety of bitwidth requirements, showing decreased energy and area alongside increased throughput.

BIOGRAPHICAL SKETCH

Ned Bingham is a Computer Engineering Ph.D. specializing in self-timed circuits. He received his B.S. (2013), M.S. (2017), and Ph.D. (2020) from Cornell with significant time spent at Yale. During his Master's, he designed a set of tools for working with self-timed systems using a control-flow specification called Handshaking Expansions. During his Ph.D., he researched self-timed systems as a method of leveraging average workload characteristics in reconfigurable architectures for general compute. Between his studies, he has worked at Intel on Pre-Silicon Validation (Hudson MA, 2011, 2012), at Qualcomm on arithmetic architectures (San Diego CA, 2014), at Google on self-timed systems (Mountain View CA, 2016), and at Google, again, on Hangouts Chat (Sunnyvale CA, 2017). In his free time, he maintains a variety of interests in the field, working on Compilers, Computer Graphics, and Natural Language Processing. (www.nedbingham.com)

This thesis represents a relatively small selection from 7 long years of work, none of which would have been possible without the steadfast anchor of love, support, and companionship. Through the excitement, abjection, stress, and uncertainty, you have been there to keep me going. I dedicate this thesis to you, Analeigha Olivia Ortega.

ACKNOWLEDGEMENTS

I'd like to acknowledge the diligent work and support from my advisor, Rajit Manohar, toward my completion of this program, along with the invaluable feedback from my committee as a whole: Rajit Manohar, Christopher Batten, and Zhiru Zhang. I'd further like to acknowledge my wife, Analeigha Olivia Ortega, my parents, Lisa Blomgren Amsler and Geoffrey Bingham, and my brother, Daniel Bingham, for their love and support. No one is self-made, and I wouldn't be the person I am today without my mentors, my family, and my friends.

TABLE OF CONTENTS

Preface 19
1. Technology, Architecture, and the Clock 20
   1. Process Technology 20
   2. Accelerator Architectures 22
   3. Digit-Serial Arithmetic 25
   4. Self-Timed Circuits 26
   5. Contributions 27
   6. Collaboration, Previous Publications, and Funding 29
2. Workload Characterization 30
   1. Parallelism 31
   2. Locality 37
   3. Memory and Bandwidth 40
   4. Bitwidth 45
   5. Lessons Learned 48
3. Design Methodology 50
   1. Digit-Serial Adaptivity 50
   2. Circuit Topology 51
   3. QDI Control Circuits 52
   4. Synthesis Strategy 56
   5. Microarchitectural Optimizations 57
   6. Half-Cycle Timing Assumption 68
   7. QDI Treatment for Pass Transistor Logic 69
   8. Example: Single Bit Register 75
   9. Integrated QDI/BD Circuits 79
   10. Toolset and Circuit Evaluation 86
4. Counters 90
   1. Behavioral Specification 90
   2. Increment and Decrement 95
   3. Clear 99
   4. Read 101
   5. Write 106
   6. Evaluation 115
5. Stream Manipulation 118
   1. Sign Extension 118
   2. Sign Compress by N 122
   3. Sign Compress by One 127
   4. Serial to Parallel 135
   5. Evaluation 141
6. Bitwise Operations 154
   1. AND, OR, XOR 154
   2. Shift Left 154
   3. Shift Right 160
   4. Evaluation 167
7. Addition and Subtraction 177
   1. Addition 177
   2. Subtraction 181
   3. Evaluation 181
8. Comparison and Conditionals 188
   1. Compare to Zero 188
   2. Conditional Sink 191
   3. Evaluation 191
9. Multiplication 197
   1. Behavioral Specification 198
   2. Datapath 207
   3. Control 210
   4. Evaluation 214
10. Example Array Architecture 220
   1. Sum Process 221
   2. Mul Process 222
   3. Evaluation 222
11. Conclusion 227
A. CHP Notation 231
   1. Examples 232
B. PRS Notation 234
   1. Examples 234
References 235
   1. History 235
   2. Process Technology 235
   3. Processor Performance 236
   4. Program Workload Analysis 236
   5. Asynchronous Design 238
   6. Counters 240
   7. Adders 241
   8. Multipliers 241
   9. Computer Arithmetic 243
   10. Architecture Surveys 244
   11. Micro Processors 244
   12. Reconfigurable Arrays 250

LIST OF FIGURES

1. History of the clock frequency of Intel's processors. 21
2. History of the power density in Intel's processors. Frequency, Thermal Design Point (TDP), and Die Area were scraped for all Intel processors. Frequency and TDP/Die Area were then averaged over all processors in each technology. Switching Energy was roughly estimated from [19] and [14] and combined with Frequency and Die Area to compute Power Density. 21
3. Wire and Gate Delay across process technology nodes. These were roughly estimated from [14] and [21]. 21
4. History of SpecINT base mean, with benchmarks scaled appropriately [26]. 22
5. History of Intel process technology defect density. Intel's defect density trends were very roughly estimated from [12][13][14][15][16] and [17]. 22
6. History of transistor count in Intel chips. Transistor density was averaged over all Intel processors developed in each technology. 22
7. Instruction address locality in Spec Benchmarks. 24
8. Average bitwidth distribution in Spec Benchmarks. 24
9. Example dependency graph. 33
10. Trace of available parallelism for perlbench.0. 35
11. Distribution of available parallelism with respect to cycles in the Spec benchmark programs. 35
12. Distribution of available parallelism with respect to operations in the Spec benchmark programs. 35
13. Attainable speedup in the Spec benchmark programs. 37
14. Distribution of workloads for runs of cycles. 37
15. Reported application speedup with GPU. bzip2[46], wrf[47], h264ref and lbm[48], dealII[49], gromacs[50], GemsFDTD[51], zeusmp[52], namd[53], bwaves[54], gamess[55], mcf[56], astar[57], milc[58], libquantum[59], hmmer[60]. 38
16. Microarchitectural breakdown of energy usage in an out of order superscalar processor [44]. 39
17. Microarchitectural breakdown of energy usage in an embedded RISC processor [45]. 39
18. Instruction address locality in Spec Benchmarks. 39
19. Distribution of maximum local correlation of parallelism within 100 cycles. 41
20. Distribution of instruction usage as a percent of total execution averaged over all programs. 41
21. Distribution of instruction usage categories as a percent of total execution for all programs in Spec2006. 41
22. Trace of memory requirements for perlbench.0. 43
23. Distribution of total storage requirements for each program. 43
24. Distribution of Storage Reuse Distance for an average cycle in each program. 43
25. Distribution of Read Bandwidth over all of the cycles in each program. 44
26. Read Reuse Distance for an average cycle in each program. 44
27. Read Lifetime Distance for an average cycle in each program. 45
28. Distribution of Write Bandwidth over all of the cycles in each program. 45
29. Write Reuse Distance for an average cycle in each program. 45
30. Write Lifetime Distance for an average cycle in each program. 46
31. Distribution of ALU operation usage across all programs. 47
32. Average bitwidth distribution of integer operations across all programs. 47
33. Average bitwidth of integer operations in each program. 47
34. Alignment of integer addition operations. This is ultimately the location of the first non-zero bit. 48
35. Distribution of run-length vs start-bit for the run. 48
36. Wires required to represent a bit using a specific MofN code (top), relative energy required to communicate each bit (middle), and transistors per bit required to implement a validity gate for a specific MofN code assuming simple transistor sharing trees (bottom). Each curve shows a single selection of M while sweeping N. 53
37. Channel protocols for QDI buffers. 55
38. A basic template for QDI control with bundled data. 80
39. The channel protocol for the input and output channels of a pipeline stage evaluated over two packets of data. 80
40. Circuit diagram for an asymmetric delay line. The upgoing transition is delayed by six inverters while the downgoing by only two. 80
41. Communicating a QDI request signal to the bundled datapath using an SR latch. 82
42. Communicating a QDI internal memory to the bundled datapath using an n-latch. 82
43. Breaking a communication cycle with a p-latch on the output request. 84
44. Clocking a memory internal to the datapath. 84
45. Clocking a memory internal to the datapath. 85
46. The p-and-p latch (left) passes the value when both clocks are high, the n-or-n latch (right) passes the value when either clock is low. 85
47. QDI internal memory to datapath memory communication. 87
48. Clocking an exchange channel. 87
49. The p-or-p latch (left) passes the value when either clock is high, the n-and-n latch (right) passes the value when both clocks are low. 87
50. The basic template for datapath memory with exchange channels (left) vs after all of the previously discussed memory optimizations (right). 88
51. The idzn counter decomposed into processes. 94
52. Read counter components. 103
53. Write counter components. 107
54. Bundled-Data write counter interface. 111
55. Chunked Bundled-Data write counter interface. 113
56. The distribution of carry chain lengths for increment and decrement commands in the SPEC2006 benchmark. 117
57. Measured Performance and Energy for an array of counters. 117
58. The architecture of the integrated QDI/BD stream full compression unit. 128
59. The architecture of the integrated QDI/BD Stream Compress One unit. 134
60. Structure of multi-node operations for first approach for serial to parallel conversion. 137
61. Structure of multi-node operations for second approach for serial to parallel conversion. 137
62. Overview of the sign-extension unit performance. 142
63. Combined probability distribution of the input bitwidths to the addition, subtraction, and bitwise operations. 144
64. Overview of the sign-extension unit performance. 146
65. Overview of the compression unit performance. 146
66. Combined probability distribution of the output run lengths for a given start bit for the addition, subtraction, multiplication, and bitwise operations. 148
67. Average bitwidth distribution of the output for the addition, subtraction, multiplication, and bitwise operations. 148
68. Average number of redundant bits introduced into the encoding by the addition, subtraction, multiplication, and bitwise operations. 149
69. Overview of the compress unit performance. 152
70. Throughput and energy per serial token of the upflow and downflow serial to parallel units as pipeline length increases. 153
71. Block diagram of the shift left operator. 156
72. Block diagram of the flop driving X. 156
73. Block diagram of the shift right operator. 162
74. Probability distribution for the bitwidth of the left operand A and right operand B for AND, OR, and XOR. 168
75. Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth. 170
76. Probability distribution for the number of redundant bits introduced per operation by this implementation of the bitwise operators. 170
77. Performance and energy averaged over the distribution in Fig. 78 and Fig. 80 vs Transistor Count. 170
78. Probability distribution for the bitwidth of the shifted value A and the shift amount B for the left shift operator. 172
79. Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth. 174
80. Probability distribution for the bitwidth of the shifted value A and the shift amount B for the left shift operator. 174
81. Throughput and energy metrics scaled by maximum bitwidth. 176
82. The architecture of the Adaptive Adder. 179
83. Transistor diagram of LSB adder control circuitry. 181
84. The architecture of the Adaptive Adder/Subtractor. 182
85. Joint probability distribution for the two input bitwidths. 184
86. Performance and energy averaged over the distribution in Fig. 85 vs Transistor Count. 186
87. Each point corresponds to the simulated energy per add averaged for multiple adds over the distribution in Fig. 85 for a given maximum bitwidth. 187
88. Probability distribution for the number of redundant bits introduced per operation by the adder. 187
89. Probability distribution for the input bitwidth. 194
90. Performance and energy averaged over the distribution in Fig. 89 vs Transistor Count. 196
91. Each point corresponds to the simulated energy per compare averaged for multiple compare operations over the distribution in Fig. 89 for a given maximum bitwidth. 196
92. The underlying multiplier architecture used in this chapter. 199
93. Datapath architecture for each digit unit of the multiplier. 209
94. Performance and energy averaged over the distribution in Fig. 95 vs Transistor Count. 215
95. Probability distribution for the bitwidth of the left operand A and right operand B for multiplication. 217
96. Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth. 218
97. Probability distribution for the number of redundant bits introduced per operation by the multiplier. 219
98. Operation implemented by the arithmetic cube. 221
99. Architecture of the Arithmetic Cube. 221
100. Forwarding B. 223
101. Sum process with bypassing multiplexers. 223
102. Mul process with bypassing multiplexers. 223
103. Walk through of simulation of channels along the first column of the Arithmetic Cube. 224
104. Waveform (left) of channels along the first column (right) of the Arithmetic Cube. 226
105. Resulting gate for asymmetric C-element (left) and pass-transistor XOR (right). 234

LIST OF TABLES

1. The state space of the dual differential pass transistor XOR. Rows are highlighted when a and _a are the same value. 72
2. The state space of the pass transistor AND. 73
3. The state space of the pass transistor OR. 74
4. Raw performance measurements for the sign extension units. 143
5. Utilization of each condition for the addition, subtraction, AND, OR, and XOR operators. 146
6. Raw performance measurements for the compressN units. 147
7. Raw performance measurements for the compress1 units. 147
8. Utilization of each compressN condition for the addition, subtraction, multiplication, and bitwise operators. 151
9. Utilization of each compress1 condition for the addition, subtraction, multiplication, and bitwise operators. 152
10. Utilization of each condition for the AND, OR, and XOR operators. 168
11. Performance measurements for the bit-parallel bitwise operators. 168
12. Raw performance measurements for the digit-serial bitwise operators. 169
13. Performance measurements for the bit-parallel shift operators. 170
14. Raw performance measurements for the shift operations. 171
15. Utilization of each condition for left shift. 174
16. Utilization of each condition for right shift. 176
17. Performance measurements for the bit-parallel addition operators. 183
18. Raw performance measurements for the digit-serial addition operators. 183
19. Utilization of each condition for the addition operator. 186
20. Raw performance measurements for the bit-parallel comparison operators. 193
21. Raw performance measurements for the digit-serial comparison operator. 193
22. Utilization of each condition for the comparison operator. 196
23. Raw performance measurements for the digit-serial conditional sink operator. 196
24. Short history of multiplication algorithms and their complexity. 198
25. Booth encoding for a four-bit digit multiply. 208
26. Performance measurements for the bit-parallel bitwise operators. 215
27. Raw performance measurements for the digit-serial multiply operator. 216
28. Utilization of each condition for a multiply. 217
29. Utilization of each condition for the least significant digit of the multiply circuit. 218
30. Comparison of average performance and energy for all operators against their closest clocked counterparts. The multiplier throughput numbers depend on the scheduling efficiency of the dynamic approach, and the compare numbers depend upon the clock overhead associated with the clocked operator comparison point. 228
31. Comparison of effective sequential execution throughput of all operators against their closest clocked counterparts. The comparison operator is an average of 444 ps from not-equal comparisons 73% of the time and 791 ps from the other comparisons the other 27% of the time. 229
LIST OF ABBREVIATIONS

• ALU Arithmetic Logic Unit
• API Application Programming Interface
• ASIC Application Specific Integrated Circuit
• BD Bundled-Data
• CGRA Coarse Grained Reconfigurable Array
• CHP Communicating Hardware Processes
• CISC Complex Instruction Set Computer
• CMOS Complementary Metal Oxide Semiconductor
• CPU Central Processing Unit
• CSP Communicating Sequential Processes
• DI Delay-Insensitive
• DSA Dynamic Single Assignment
• DSP Digital Signal Processing
• FIFO First-In First-Out queue
• FPGA Field Programmable Gate Array
• GCC GNU C Compiler
• GPU Graphics Processing Unit
• HCTA Half-Cycle Timing Assumption
• ILP Instruction Level Parallelism
• IPC Instructions Per Cycle
• ISA Instruction Set Architecture
• LSB Least Significant Bit
• LSD Least Significant Digit
• MSB Most Significant Bit
• NMOS N-type Metal Oxide Semiconductor
• NoC Network on Chip
• PCFB Pre-Charge Full Buffer
• PCHB Pre-Charge Half Buffer
• PMOS P-type Metal Oxide Semiconductor
• PRS Production Rule Set
• QDI Quasi Delay-Insensitive
• RISC Reduced Instruction Set Computer
• SIMD Single Instruction Multiple Data
• SR Set-Reset
• TDP Thermal Design Point
• WCHB Weak-Condition Half Buffer
• uP Micro-Processor

PREFACE

While technology scaling has driven progress in computational power since 1970, that trend has slowed to a halt over the last 15 years. This is motivating research into alternative architectures that explore parallelism, specialization, and reconfigurability. In particular, Coarse Grained Reconfigurable Arrays (CGRA) seem to show great promise as general purpose accelerators. However, their efficacy, and therefore their adoption, have been hindered by capacity limitations, and current approaches to rectify this still fall short. In this thesis, I propose to leverage self-timed circuits to implement length-adaptive digit-serial arithmetic operators for use in configurable array architectures. These circuits significantly reduce the area requirements while maintaining support for arbitrary bitwidths, greatly increasing the capacity of the CGRA at no extra cost. They are implemented in completely self-contained modules that require no extra considerations in larger architectural contexts while elegantly avoiding unnecessary computation. I elaborate on the work I have already done in this space [75][76] and contribute the remaining arithmetic operations, showing significant improvements on all metrics. Finally, I show how these operators may be used in a simple CGRA.

CHAPTER 1
TECHNOLOGY, ARCHITECTURE, AND THE CLOCK

The concepts introduced by Von Neumann in 1945 [214] remain the centerpiece of computer architectures to this day. His programmable model for general purpose compute, combined with a relentless march toward increasingly efficient devices, cultivated significant long-term advancement in the performance and power-efficiency of general-purpose computers. For a long time, chip area was the limiting factor and raw instruction throughput was the goal, leaving energy largely ignored. However, technology scaling has demonstrated diminishing returns, and the technology landscape has shifted quite a bit over the last 15 years.

1.1 Process Technology

Around 2007, three things happened. First, Apple released the iPhone, opening a new industry for mobile devices with limited access to power. Second, chips produced with technology nodes following Intel's 90nm process ceased scaling frequency (Fig. 1) as the power density collided with the limitations of air-cooling (Fig. 2).
For the first time in the industry, a chip could not possibly run all transistors at full throughput without exceeding the thermal limits imposed by standard cooling technology. By 2011, up to 80% of transistors had to remain off at any given time [22]. Third, the growth in wire delay relative to frequency introduced new difficulties in clock distribution. Specifically, around the introduction of the 90nm process, global wire delay was just long enough relative to the clock period to prevent reliable distribution across the whole chip (Fig. 3). As a result of these factors, the throughput of sequential programs stopped scaling after 2005 (Fig. 4). The industry adapted, turning its focus toward parallelism. In 2006, Intel's Spec Benchmark scores jumped by 135% with the transition from NetBurst to the Core microarchitecture, which dropped the base clock speed to optimize energy and doubled the width of the issue queue from two to four, targeting Instruction Level Parallelism (ILP) instead of raw execution speed of sequential operations [9]. Afterward, performance grew steadily as architectures continued to optimize for ILP. While Spec2000 focused on sequential tasks, Spec2006 introduced more parallel tasks [43]. By 2012, Intel had pushed most other competitors out of the Desktop CPU market, and chips following Intel's 32nm process ceased scaling total transistor counts. While smaller feature sizes supported higher transistor density, they also brought higher defect density (Fig. 5), causing yield losses that make larger chips significantly more expensive (Fig. 6).

Fig. 1: History of the clock frequency of Intel's processors.
Fig. 2: History of the power density in Intel's processors. Frequency, Thermal Design Point (TDP), and Die Area were scraped for all Intel processors. Frequency and TDP/Die Area were then averaged over all processors in each technology. Switching Energy was roughly estimated from [19] and [14] and combined with Frequency and Die Area to compute Power Density.
Fig. 3: Wire and Gate Delay across process technology nodes. These were roughly estimated from [14] and [21].

Today, energy has superseded area as the limiting factor, and architects must balance throughput against energy per operation. Furthermore, improvements in parallel programs have slowed due to a combination of factors (Fig. 4). First, all available parallelism has already been exploited for many applications. Second, limitations in power density and device counts have put an upper bound on the amount of computation that can be performed at any given time. And third, memory bandwidth has lagged behind compute throughput, introducing a bottleneck that limits the amount of data that can be communicated at any given time [24].

Fig. 4: History of SpecINT base mean, with benchmarks scaled appropriately [26].
Fig. 5: History of Intel process technology defect density. Intel's defect density trends were very roughly estimated from [12][13][14][15][16] and [17].
Fig. 6: History of transistor count in Intel chips. Transistor density was averaged over all Intel processors developed in each technology.

1.2 Accelerator Architectures

These new constraints have rekindled exploration into alternative architectures that reduce energy requirements in order to increase total throughput. In light of diminishing returns exploiting parallelism, the industry has been exploring specialization and configurability for potential improvements in energy.
While these approaches have increased performance by orders of magnitude for specific applications [25], they have remained largely separated from general compute. However, one approach shows particular promise in bridging this gap. The Von-Neumann architecture carries a significant energy overhead that specialization and configurability seem well suited to address. Reconfiguring the chip to execute each new dynamic instruction requires quite a bit of energy. However, the vast majority of dynamic instructions often come from a relatively small selection of the program specification (Fig. 7). For information processing applications like search, simulation, and compression, generally fewer than 100 static instructions are required to account for 50% or more of the program execution, though others, like compilation, require upwards of 3000 static instructions. It is well known that Field Programmable Gate Arrays (FPGA) do not introduce the same type of overhead because the configuration remains static. However, this particular feature also limits FPGAs to smaller programs with a different programming model. Coarse Grained Reconfigurable Arrays (CGRA) [154] specialize the circuitry at each node, replacing the lookup tables in FPGAs with minimal Arithmetic Logic Units (ALU) or Micro-Processors (uP) implementing boolean operations, addition, subtraction, and sometimes even multiplication. This increases overall capacity and allows them to be used in conjunction with standard Von Neumann architectures to accelerate commonly used complex operations as in [234] and [212]. The vast majority of designs from 1996 to 2016 implemented fairly simple CGRA architectures, clocked with standard bit-parallel [234-251] or bit-serial [264-273] ALUs or uPs connected via routers to a crossbar or mesh Network on Chip (NoC). Architectural advancements within that baseline tend to focus on the integration of that array with other compute and memory elements, and on run-time reconfiguration to maximize utilization. Furthermore, many of these architectures are highly specialized to particular problems, typically in the signal processing domain, due to limitations in overall capacity: most of the architectures published have fewer than 200 execution nodes. However, there are facets of the workload that have not been sufficiently exploited. In particular, as shown in Fig. 8 and discussed in [160] and [159], it is well-known that most arithmetic operations in a given application do not require the full width of the datapath. Von-Neumann architectures have been taking some steps to exploit this particular feature by implementing multiple datapaths of various sizes, aggressive clock-gating, operator packing [159], and staggered execution [6][8]. However, array architectures have not really been able to do the same. None of the designs in [234-251] or [264-273] make any considerations for bitwidth allocation, implementing datapaths that are rigidly restricted to a specific bitwidth, wasting area and energy for most operations in order to support the worst-case.

Fig. 7: Instruction address locality in Spec Benchmarks.
Fig. 8: Average bitwidth distribution in Spec Benchmarks.

Two older designs, [252] and [253], implement a chip-wide configuration of 16-bit or 32-bit modes, and [256-259] have small ALUs that can be statically combined into arbitrarily wide parallel datapaths.
However, this approach is altogether lacking because it assumes that the bitwidth can be determined at compile-time, putting a lot of responsibility on the programmer, the language they use, and its compiler. Generally, the programmer is unreliable, tending toward data-types that are far too large for the typical data they store. Modern programming languages do not help, lacking native support for granularity smaller than 8 bits and lacking clean syntax to track how the bitwidth requirements of a single variable might shift throughout execution. And compilers are severely limited, unable to dependably propagate computed data ranges through memory boundaries or determine how multithreading might affect those data ranges [34][35].

Run-time configuration of such an architecture is ultimately challenging. First, it places constraints on the network architecture to ensure all inter-digit dependencies are correctly routed. If this is resolved by routing all digits together, this excludes any networks in which the paths of those digits require different hop counts. This constraint could be resolved by using all-to-all circuit-switched networks much like FPGAs, layered networks much like modern machine-learning ASICs, or adding pipeline stages to delay the faster digits. If the digits are allowed to be routed separately, then inter-digit dependencies must be routed as their own packets, requiring single-bit paths for carry propagation. All of this causes very high routing overhead for the architecture as a whole. Second, the algorithm for mapping and routing operations on an array architecture has exponential complexity and is not something easily solved at run-time on bare metal [156]. Any software API for this would require self-modifying code, which is the approach taken by [274]. Using that feature to dynamically adapt bitwidth would ultimately introduce more overhead than it is worth. Third, suppose the operator configuration is static but run-time configuration for the extra resources required by wide operations is allowed. That run-time configuration step might put the array over 100% utilization, risking deadlock as a wide operation waits for access to execution nodes it may never get. Ensuring correct operation in this environment is typically expensive [156].

1.3 Digit-Serial Arithmetic

Alternatively, a digit-serial architecture completely eliminates these problems. Inter-digit dependencies simply stay in place, and consecutive digits are all routed along the same path. Instead of having to dynamically allocate resources for different lengths, a single resource is allocated for a longer period of time. Before 1970, many computer architectures were digit-serial, and quite a few had implemented length-adaptivity [214-217]. However, their approach to length-adaptive arithmetic was baked into the control circuitry of the surrounding Von-Neumann architecture. Their digit-serial ALU was simply a smaller version of its bit-parallel counterparts with minor adjustments for inter-digit dependencies. The register file was often implemented with shift registers that stored multiple digits and streamed them one-at-a-time through the ALU [214-231]. After 1970, digit-serial arithmetic was largely relegated to array architectures targeting Digital Signal Processing (DSP) applications and linear algebra acceleration [150][266-277][280][282]. Focus tends to land on MSB-first redundant arithmetic [147-149], and length-adaptivity is generally restricted to a static chip-wide configuration [262-277].
However, there have been a few non-array architectures that implement the token-based approach necessary for true length-adaptivity in array architectures [223]. Ultimately, adaptive digit-serial arithmetic has largely been forgone, and the reasons are fairly simple. First, a modern Von-Neumann architecture has very deep linear pipelines. It starts as a single trunk, branches out to multiple execution nodes after the issue queue, and merges back into a single trunk some time before write-back. However, a serial digit stream can be quite long, blocking the pipeline at the trunks while consecutive digits are being processed. This makes it nearly impossible to efficiently issue multiple instructions and limits the throughput of the whole architecture significantly. It is no surprise then that, except in the domain of low-power, low-performance devices, digit-serial Von-Neumann architectures could not compete with their bit-parallel counterparts after around 1960.

Array architectures have quite wide networks and therefore do not suffer from the same bottleneck. However, they have a problem with timing. Adaptivity in this case means that two digit-streams can have different lengths. Consider a heavily used node executing a sequence of consecutive operations. If one input is shorter than the other, that input must delay its pipeline while the remaining digits of the longer input are being processed. In a clocked architecture, propagating these pipeline delay signals back through the array along the route followed by consecutive data words quickly becomes complex, particularly if the array supports loops or conditions. This is generally done with control flow using valid and ready bits to implement a handshake [78], at which point you might as well use asynchronous circuitry [278][279][281]. Furthermore, there are quite a few cases in which the length and structure of a digit-stream might need to be manipulated. For example, sign extension can add digits to the back of the stream; shifting can add digits to or remove digits from the front of the stream; rotation can move digits from the front to the back or vice versa; addition can result in shorter and longer streams depending upon the carry; and multiplication requires serial-to-parallel conversion and dynamic allocation and deallocation of execution nodes. Overall, these operations require extremely complex control behaviors. Once again, it is no surprise that length-adaptivity has been relegated to a static chip-wide configuration.

1.4 Self-Timed Circuits

Self-timed circuits can solve these problems in a simple and elegant way. First, nodes are connected via channels, each with a request and acknowledge. Each digit is transmitted from one node to another as a request in the channel protocol, and the next request will not be sent until the previous one has been acknowledged. In this way, they implement flow-control with back-pressure, causing consecutive digits, and therefore words, to wait as needed. Furthermore, because self-timed circuits are event-based systems, they are able to implement complex control behaviors with little overhead. Relative to clocked design, self-timed circuits have suffered poor visibility in the industry. There is some sense that generalized asynchronous circuits with a strong framework for timing assumptions [70][72] could be extremely powerful [173].
However, speaking from experience, complex asynchronous circuits are painfully difficult to design and typically require multiple attempts before stumbling onto the correct approach. This has motivated various attempts to make the process even the slightest bit easier. Methods to formally derive asynchronous circuits from a program specification as in [65] are not quite complete. For example, [67] starts from an intermediate graph specification, and their compiled circuits rely upon fast inverters for correct operation. Methods to directly map the syntax of a program specification onto circuit primitives as in [62] and [66] are entirely robust, but result in circuits with high overhead. Overall, the only approach that seems to produce efficient devices has effectively been no approach at all: start from a circuit template as in [68], and then design the rest of the behaviors by hand. Because of this difficulty, expeditions into self-timed circuit design have expressed a bias toward simple pipeline behaviors and familiar architectures. The vast majority of self-timed projects yield bit-parallel Von-Neumann architectures [161-202], many of which are simply desynchronizations of existing synchronous architectures [71]. Similarly simple reconfigurable array architectures are seen as well [282]. However, some start to explore what complex control can do. [203-205] implement a counterflow architecture in which instructions flow down the pipeline while results flow up, effectively implementing a linear bypassing pipeline. The Rotary Pipeline architecture in [206] suggests a ring of connected ALUs around which results are constantly flowing. The Vortex architecture in [207] and later [208] is very similar, but with a crossbar network instead of the ring; routes from node to node are explicitly dictated by each instruction. The Octasic architecture in [209-211] flips the idea on its head, keeping the data within each ALU while routing arbitration tokens, each representing a different stage of the pipeline (fetch, decode, execute, memory, write-back), around a ring. When a particular ALU has a token, it is granted access to the bus dedicated to that external resource. [212] proposes a decomposition of the standard architecture into two separate processors: one specifically handles branches, while the other executes blocks of dataflow instructions, fixed-length loops, and simple conditions. These alternative architectures are certainly interesting, but only two seem to target the feature at hand. [213] suggests bit-parallel width-adaptive architectures that effectively clock-gate the unused bits. [232] implements a length-adaptive bit-serial pipeline. However, it relies upon the control circuitry in a Von-Neumann architecture, much like the synchronous Von-Neumann approach. Ultimately, a lot of architectural possibilities in the self-timed space have yet to be explored.

1.5 Contributions

In light of these considerations, this thesis explores the application of self-timed circuits toward the implementation of length-adaptive digit-serial arithmetic operators for configurable array architectures, elaborating on the work I previously completed toward this end [75][76] and exploring complex control circuitry through templated synthesis. In summary, this thesis contributes to the domain of Adaptive Self-Timed Arithmetic Circuitry as a first step toward maximizing the density of compute resources in Coarse Grained Reconfigurable Arrays.
Doing so would allow for more significant sections of a program to be mapped to the configurable architecture, thereby making that architecture applicable to a larger set of real world problems. This thesis supplies digit-serial arithmetic operators with built-in flow control that adapts to varying digit-stream lengths. In the state of the art, the approach that most closely resembles this work is found in [232]. Qualitatively, that existing approach is not modular, does not sufficiently explore the design space, and exhibits poor performance. Meanwhile, this thesis comprehensively explores the design space, presenting highly modular operators that provide significant compute density improvements by doubling the throughput per transistor and halving the energy per operation on average compared to their industry standard counterparts. Specifically, each chapter contributes the following to the state of the art:

Chapter 2: Workload Characterization does an in-depth analysis of the workloads from the Spec benchmarks, covering available parallelism, instruction locality, memory usage and bandwidth, and integer arithmetic features.

Chapter 3: Design Methodology describes several high-level approaches for adaptivity and discusses the strengths and weaknesses of available circuit design methodologies. Settling on an integrated QDI/Bundled-data approach, it covers how to communicate data between the QDI control and Bundled-Data datapath and how to approach the synthesis of QDI control circuitry.

Chapter 4: Counters goes in depth on QDI counters, covering increment, decrement, clear, read, and write commands along with constant time zero detection. This chapter offers significant performance improvements beyond my previous work in [75].

Chapter 5: Stream Manipulation covers circuitry necessary for basic digit stream manipulation, including sign extension as a prerequisite for all multi-input operations, sign compression to reduce a digit stream to its minimum length, and serial-to-parallel conversion as a prerequisite for shifting and multiplication operators.

Chapter 6: Bitwise Operators shows how simple bitwise operators can be grafted into the sign extension circuitry, and how the counters and serial-to-parallel circuitry may be used to implement stream shifting.

Chapter 7: Addition and Subtraction elaborates on the work I completed in [76], covering digit-serial addition and subtraction.

Chapter 8: Comparison and Conditionals describes operators that compare a value with zero, and that conditionally pass or sink a value. These circuits may be used to implement conditional moves.

Chapter 9: Multiplication explores multiplication architectures and their trade-offs and difficulties.

Chapter 10: Example Array Architecture shows how these circuits interact in the context of a simple array architecture called the Arithmetic Cube [266][268].

Finally, the outcome of this work is summarized and further research opportunities are discussed.

1.6 Collaboration, Previous Publications, and Funding

Without the guidance and collaboration of my advisor, Rajit Manohar, none of this work could have come to fruition. I have published three papers at the time of writing this thesis. "QDI Constant Time Counters" [75] covers my initial work on counter circuits; I improve upon this work in Chapter 4. "Self-Timed Adaptive Digit-Serial Addition" [76] covers my work on the addition operator. This corresponds to Chapter 7 of this thesis.
My third paper, "A Systematic Approach for Arbitration Expressions" [77], is orthogonal to this work and is therefore not included. Various funding agencies allowed me to explore this work, including the National Science Foundation (CCF-1065307, CCF-1617945), the Office of Naval Research (N00014-13-1-0419), and the Air Force Research Laboratory (FA8750-15-1-0173).

CHAPTER 2
WORKLOAD CHARACTERIZATION

Before designing any circuitry, it is important to make a close examination of the functionality being targeted. Why specifically would a CGRA accelerator provide any benefit beyond modern architectures, and why is it important for its datapath to be digit-serial? This chapter endeavors to produce detailed and specific answers to these questions with a quantitative examination of important features underlying the workload.

There are a few industry standard benchmarking suites and countless domain-specific benchmark applications that could be used to expose underlying features in common workloads. Ultimately, the industry standard suites tend to examine the breadth and depth of their workloads more rigorously. Passmark [31] has become the most popular consumer-facing benchmark. However, Parsec [27], Splash [29], TPC [30], Spec-OMP [32], and Spec-CPU [26] are the most popular for computer architecture research. Until around 2012, SPEC was the industry-accepted benchmark for this purpose. Around that time, the breadth of application domains increased dramatically with the introduction of mobile, big-data, and machine learning systems. Today, there is no default correct choice for a benchmark suite. Parsec, Splash, and Spec-OMP focus on the performance of many-core systems, while TPC focuses on database systems. These benchmarks emphasize the performance of the on-chip network and memory systems. While this is good for understanding the performance of the whole system-on-chip, Spec-CPU focuses heavily on the performance of a single core. Relative to the Spec-CPU2006 benchmarks, Spec-CPU2017 adjusts the covered domain space, keeping many of the same applications while removing some and adding more for machine learning problems [28]. As discussed in Chapter 1 Section B, CGRA accelerators can optimize the performance of a core's sequential instruction execution, particularly in low-power environments. Of the benchmarks, Spec-CPU2006 should exhibit the least parallelism, the most data-interdependency, and the least locality. This represents the most conservative application of a CGRA to real-world problems. While it should also exhibit the lowest total memory bandwidth requirements, memory system design is not in the scope of this work. Therefore, this chapter analyzes the programs found in the Spec-CPU2006 [26] benchmark using Intel PIN. The 29 applications listed below were carefully selected by the Spec Benchmark Committee to be an approximation of realistic workloads (perlbench, bwaves, milc, cactusADM, gobmk, povray, sjeng, h264ref, omnetpp, sphinx3, bzip2, gamess, zeusmp, leslie3d, dealII, calculix, GemsFDTD, tonto, astar, xalancbmk, gcc, mcf, gromacs, namd, soplex, hmmer, libquantum, lbm, wrf).

There are many caveats to this approach. The workload requirements of a given program are heavily influenced by the machines available at the time of development, because people will not develop a program that no machine can execute to completion.
Therefore, designing an architecture strictly optimized toward observed workload requirements is unlikely to expose new opportunities in software. Also, many features of the measured workload are dramatically affected by the compiler and the target machine used to execute the program. Ultimately, complete isolation and characterization of these effects is extremely difficult and beyond the scope of this analysis.

Intel PIN is simultaneously extremely helpful and problematic. It facilitates a deep dive into aspects of the executed program in ways that other tracing tools cannot. However, it also forces the use of Intel's x86 64-bit architecture, which is ultimately a RISC core combined with a powerful microcoding engine to implement a complex Instruction Set Architecture (ISA). Any analysis in this environment will have drastically different results from analyses in standard RISC environments. Loads and stores are represented by addressing modes rather than instructions, several instructions are Turing Complete all on their own [4][5], and there are complex instructions like sqrt, sin, and cos that hide calls to the more basic operators. Therefore, extracting workload data with respect to a generalized Von-Neumann RISC architecture requires some post-processing. Memory loads and stores should be separated from their parent instruction, complex operators should be split into their microcodes using known best implementations, SIMD instructions should be split up, and instruction variants should be grouped. Fortunately, the GCC compiler produces fairly reasonable assembly, avoiding most of the complex microcoded instructions and obfuscated compilation strategies available in the ISA, leaving those effects negligible.

The results are presented as a collection of distributions with one distribution per program. These distributions are color-coded such that red means "or more", green means "exactly", and blue means "or less". Therefore, red and blue represent different kinds of cumulative distribution functions while green represents a probability distribution function. The sketch below illustrates how these three curves relate.
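As a concrete illustration, the three curves for a single program can be computed from a list of per-cycle samples as in the following sketch (my own illustration, not the tooling used for this thesis; the function and variable names are hypothetical):

```python
# Compute the three color-coded curves from one sample per cycle:
# green is P(X = v), blue is P(X <= v), red is P(X >= v).
from collections import Counter

def distributions(samples):
    counts = Counter(samples)                 # value -> number of cycles
    total = len(samples)
    values = sorted(counts)
    exactly = {v: counts[v] / total for v in values}   # green: PDF
    or_less, acc = {}, 0.0
    for v in values:                                   # blue: CDF
        acc += exactly[v]
        or_less[v] = acc
    # red: complementary CDF, P(X >= v) = 1 - P(X <= v) + P(X = v)
    or_more = {v: 1.0 - or_less[v] + exactly[v] for v in values}
    return or_more, exactly, or_less
```

For example, a claim like "50% of cycles execute 24 or more operations in parallel" reads directly off the red curve as or_more[24] >= 0.5.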
2.1 Parallelism

Overall, there are two arguments for CGRA architectures that dominate the literature. One cites increased programmability beyond ASICs, resulting in cheaper fabrication with lower design time and time to market [236][239][244][247][249][252][253][254][258][259]. The other focuses heavily on their capabilities regarding available parallelism beyond CPUs [234][235][237][238][240][243][245][246][248][250][255][256][257][261]. While speeding up sequential execution of instructions via clock speed is off the table due to technology-node constraints, it has been argued that there is still quite a bit to be gained from accelerating embarrassingly parallel programs, particularly in the domain of mobile or embedded DSP applications. However, many programs in this category have already been moved off of the CPU and onto the GPU. So, when making this argument it is necessary to show both that there is parallelism available for further speedup, and that the GPU is not capable of capturing that speedup.

The underlying concept of available parallelism is fairly simple: it represents the speedup gained from being able to execute some number of instructions in parallel. Unfortunately, this definition is fairly vague, and measuring it can be deceptively complex. There have been quite a few papers that analyze the parallelism available in a given program, and they all take about the same approach. A computation consists of a collection of basic operations (add, subtract, multiply, divide, etc.) in which one operation's output is another's input. Such data dependencies form the arcs within a directed acyclic graph called the dependency graph (Fig. 9). The operations in this graph are then organized chronologically into steps such that an operation executes as soon as its input operands are ready. The available parallelism of a program is then computed with the application of Amdahl's Law to the number of instructions in each step. Unfortunately, there are quite a few confounding factors, driving each paper to make different assumptions about the underlying data. By definition, the dependency graph structure assumes infinite hardware, perfect branch prediction, and perfect memory disambiguation. This ignores write-after-read and write-after-write dependencies along with structural hazards and control dependencies. [37] and [38] each analyze a self-selected collection of programs, use this ideal machine to estimate the length of the critical path, and report the ratio of the total instruction count against the critical path length. [38] goes a step further to show the trace of available parallelism across the program. [39] analyzes a much longer segment of the program execution, but is limited by a window size, considering a limited number of instructions in the trace at a time. This allows it to do a complete analysis of the Spec benchmarks from 1989. [41] takes speculation a step further, analyzing available parallelism under data-value prediction models. However, there are also many circumstances that might artificially sequentialize an otherwise parallel program even with an ideal machine, and they are often difficult to identify. For example, loop iterators create a dependency chain from one iteration to the next that can often be unrolled, and long expressions can be accumulated sequentially or computed as a tree. Many of these program transformations are often done by the compiler, but there are limits. For example, a compiler may only partially unroll a loop, ultimately leaving that dependency chain somewhat in place. [40] applies constant value propagation and tree height reduction to deal with loop iterators and long sequential expressions. Finally, [42] brings many of the techniques together to do an analysis on modern benchmarks in Spec.

Fig. 9: Example dependency graph.

Unlike the RISC ISA, x86-64 is particularly complex, and this approach must take that into account. First, every byte of data throughout the execution of the program is recorded in a linked list with pointers to the instructions that read and write it. Instructions are also recorded in a linked list with their assigned step. An array of x86 registers is mapped to the set of bytes they store, accounting for overlapping registers like RAX, EAX, AX, AL, and AH, as in the sketch below. Second, the program maintains a hash-map of 1024-byte pages of memory in which each location is mapped to a byte of data.
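The register overlap can be represented as a static table mapping each architectural register name onto a byte slice of its widest parent, so that reads and writes through any alias touch the same underlying bytes. A minimal sketch (the names and layout here are illustrative, not the actual data structures of the PIN tool):

```python
# Map each x86 register name to (parent, byte offset, size in bytes), so that
# accesses through RAX, EAX, AX, AL, or AH all resolve to bytes of RAX.
REG_SLICE = {
    "rax": ("rax", 0, 8),
    "eax": ("rax", 0, 4),
    "ax":  ("rax", 0, 2),
    "al":  ("rax", 0, 1),
    "ah":  ("rax", 1, 1),   # AH is the second byte of RAX
}

def reg_bytes(name):
    parent, offset, size = REG_SLICE[name]
    return parent, range(offset, offset + size)
```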
As discussed in Section D of this chapter, mov instructions represent a significant overhead in the x86-64 ISA that should be removed for the purpose of an ideal machine. Therefore, any instruction that unconditionally moves or copies data should take immediate effect and should not be counted as an operator. Unfortunately, unconditional data moves or copies within memory require special treatment. While they should not be counted as an operator, the data dependencies from the base and index registers of the memory address must be taken into account for any operator that reads the result. Furthermore, vector and SIMD instructions are divided into multiple operators. Branches and jumps are not counted, nor do they create any dependencies, per the assumption of perfect branch prediction. While PIN does not track instructions through the kernel, it does provide the values of all input operands. Therefore, system calls are handled manually, counting as a single operator and correctly affecting memory, data dependencies, and parallelism as necessary. Any further instructions each count as a single operator. Finally, to prevent large initialization spikes as found in [42], any operation without input dependencies is scheduled to execute the cycle before its output is read. All recorded operations take exactly one cycle to complete. The sketch below summarizes this scheduling model.
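A minimal sketch of that scheduling model follows (hypothetical structures, not the PIN tool itself). Each operation is assigned the earliest step after all of its producers, and the per-step operation counts form the parallelism trace; for simplicity this sketch places dependency-free operations at step 0 rather than the cycle before their first use:

```python
# ASAP scheduling under the ideal-machine assumptions: unit latency,
# infinite hardware, and dependencies only through data.
def schedule(ops):
    """ops: list of (op_id, [producer op_ids]) in program order."""
    step = {}    # op_id -> assigned step
    trace = {}   # step -> number of operations executed that cycle
    for op, deps in ops:
        ready = max((step[d] + 1 for d in deps), default=0)
        step[op] = ready
        trace[ready] = trace.get(ready, 0) + 1
    return trace

# Available parallelism follows from the per-step counts: total operations
# divided by the number of steps (the critical path length).
trace = schedule([("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["c"])])
print(trace)                              # {0: 2, 1: 1, 2: 1}
print(sum(trace.values()) / len(trace))   # 4 operations over 3 steps
```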
Ultimately, rigorous automatic parallelization as in [40] is out of the scope of this work. While it can be safely assumed that modern compilers take care of the majority of known transformations, it is still necessary to break the long dependency chains created by loop iterators using constant value propagation. Any instruction in which all of its operands are constant can be evaluated at compile time; such instructions do not count as operators, and their results are also constant. While this does not dive deep into automatic parallelization of expressions, it does implement loop unrolling. Overall, this strategy still miscounts the available parallelism from instructions like sqrt, cos, sin, etc. However, only sqrt is ever used throughout the Spec benchmark, and its execution counts are negligible.

This PIN tool will quickly run out of memory if certain precautions are not taken. First, each byte will keep count of its references in the register file and in memory. When there are no more references to that byte, it will be counted in the output statistics and deleted. Instructions will keep count of all bytes they write. As soon as all of those bytes have been deleted from the list, the operation will also be counted in the output statistics and deleted. The hash-map of 1024-byte pages will limit its size to 1000 pages, swapping any pages beyond that count into a file. The output statistics will keep track of the number of operations, the data reuse distances and counts, and the data lifetime distances and counts in each step, swapping data out to a file as necessary to maintain a small memory footprint. Finally, the run time of the analysis is limited to 24 hours.

The outcome for a given program is the total number of operations that can be executed in parallel for each cycle. For example, Fig. 10 shows the trace of available parallelism from a 120 million instruction execution of perlbench. Keep in mind the subtle difference between instructions and operations: an operation is the unit of computation left after post-processing an x86 instruction. Overall, this analysis only covers a small sampling of each program. Two distributions show different features of the available parallelism. The first, in Fig. 11, shows the distribution of available parallelism with respect to cycles. For example, 50% of cycles in perlbench.0 execute 24 or more operations in parallel. This showcases the sequential nature of each program, emphasizing the median available parallelism in the trace. Alternatively, Fig. 12 shows the distribution with respect to operations. 50% of operations in perlbench.0 were executed in cycles with at least 745 operations in parallel. This showcases the parallel nature of each program, emphasizing the workload achieved in highly parallel cycles. The programs in these figures are ordered by their speedup, with lower speedup on the left and higher on the right, and both of these distributions are necessary for a thorough understanding of this ordering. For example, the reason for the order of wrf.0 and soplex.1 may be unclear if looking at only one of these distributions. However, it quickly becomes clear that while a large number of operations in wrf.0 are executed in one or two highly parallel cycles, there are also a significant number of extremely sequential cycles that limit its overall speedup. On the other hand, the parallelism in soplex.1 is fairly evenly distributed across all cycles, allowing for higher overall speedup.

Fig. 10: Trace of available parallelism for perlbench.0.
Fig. 11: Distribution of available parallelism with respect to cycles in the Spec benchmark programs.
Fig. 12: Distribution of available parallelism with respect to operations in the Spec benchmark programs.

Fig. 13 shows the maximum achievable speedup of each program given some number of execution units. While there are a few very sequential programs with between 10x and 100x potential speedup, most programs offer more than 100x. [33] found an IPC between 1 and 3 for the execution of the Spec Benchmarks on an Intel Xeon processor in 2018, suggesting that there is still plenty of available parallelism to take advantage of. It is possible that these parallel programs cannot easily be moved to the GPU because the operations involved are interdependent, meaning they cannot easily be split into threads. Unfortunately, there is not a particularly clean way to measure operator interdependence. However, it may be indirectly measured by looking at the reliability of available parallelism. If one cycle has 100 operators and the next 1000, then every operator in the second cycle must be dependent on at least one operator in the first; likely every operator in the first will fan out to around 10 in the second. These kinds of interdependencies are not easily mapped onto a GPU. Meanwhile, if multiple consecutive cycles all have similar available parallelism, then it is possible for those operators to form fairly independent threads. Therefore, it is necessary to identify runs of cycles. Assuming that each thread has a max IPC of 3, a new run is started whenever the next cycle has less than 1/3 of the maximum parallelism or more than 3 times the minimum parallelism of the run. The minimum parallelism of the run then represents a loose upper bound on the possible thread count for that run. A modern CPU has 8 cores supporting 2 threads per core with an IPC of 3. Therefore, if a run has a minimum parallelism less than 48, it can likely be executed on the CPU. Runs with a minimum parallelism greater than 48 are reasonable candidates for the GPU. Now there are two classes of runs for the GPU: long runs with relatively low parallelism, and short runs with extremely high parallelism. Ultimately, since all of these runs can be accelerated with a GPU, it is only necessary to determine the workload achieved by each run, or the total number of operations. The distribution of these workloads can be seen in Fig. 14. From this, it is possible to measure how threadable a program might be by computing the weighted average of the workload across the program. This is the workload of each run times its percentage of the total workload of the program, as in the sketch below.
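The run-splitting rule and the resulting score can be sketched as follows (a simplified illustration under the stated assumptions; restricting the score to GPU-candidate runs is one plausible reading, and the names are mine):

```python
# Split a non-empty parallelism trace into runs: a new run starts when the
# next cycle drops below 1/3 of the run's maximum or rises above 3x the
# run's minimum, reflecting a per-thread max IPC of about 3.
def find_runs(trace):
    runs, cur = [], [trace[0]]
    for p in trace[1:]:
        if p < max(cur) / 3 or p > 3 * min(cur):
            runs.append(cur)
            cur = []
        cur.append(p)
    runs.append(cur)
    return runs

# Weighted-average workload: each run's operation count weighted by its
# share of the program's total operations. Higher scores suggest the
# program is more threadable and a better GPU candidate.
def threadability(trace, cpu_limit=8 * 2 * 3):   # 8 cores x 2 threads x IPC 3
    gpu_runs = [r for r in find_runs(trace) if min(r) > cpu_limit]
    total = sum(trace)
    return sum(sum(r) * (sum(r) / total) for r in gpu_runs)
```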
There are quite a few research papers that set about the task of porting those programs, and unsurprisingly, Fig. 15 shows that programs measured as threadable tend to achieve good speedups on a GPU. Ultimately, this measure accounts for 57% of the variance when compared to real GPU speedup. This is more predictive power than the limitations of the underlying data might suggest: parallelism does not directly encode interdependency, and the real GPU speedups are reported by multiple authors who use different GPU hardware and compare against varying CPU hardware. It is ultimately a little more predictive than the maximum theoretical speedup, which accounts for 44% of the variance.

Fig. 15: Reported application speedup with GPU. bzip2 [46], wrf [47], h264ref and lbm [48], dealII [49], gromacs [50], GemsFDTD [51], zeusmp [52], namd [53], bwaves [54], gamess [55], mcf [56], astar [57], milc [58], libquantum [59], hmmer [60].

Overall, these statistics suggest that significant speedup is available. However, the max IPC of an Intel Xeon in 2017 is around 10, and the fact that the achieved IPC is between 1 and 3 for these applications suggests that parallelism might not be the limiting factor [33]. While some programs with higher available parallelism already are or can be made well suited for a GPU, there is still theoretically quite a bit of speedup to be had from the more parallel applications. The fact that the achieved speedups even for the GPU are nowhere near the predicted max suggests that current architectures might be limited by another characteristic of the workload, and many of the GPU papers suggest that this might be the Von Neumann bottleneck [10].

2.2 Locality

Parallelism is not the only reason one might want a CGRA, especially when most modern processors are energy-bound. [44] ran a full analysis of energy consumption in the context of a processor architecture similar to the Pentium III. This paper was published before frequency stopped scaling in 2003 due to the power wall, so a lot of things have changed, and the tool used by the paper is no longer available. However, its data is still informative. In particular, the energy breakdown remains fairly consistent across the programs from the Spec2000 benchmark. On average, the control logic for scheduling instructions (i.e. the L1 Instruction Cache, Instruction Decode, Instruction Queue, and Reorder Buffer) accounts for around 62% of the energy consumption of a superscalar processor core. This ratio has likely been reduced by micro-op fusion [7], but instruction scheduling logic probably still represents the majority of the energy consumption. On top of this, the Register Renaming Table, which provides a mapping of the underlying memory system to a large set of virtual registers, represents around 13.6% of energy consumption. The activity required by all of these functionalities accounts for 75% of energy. This trend also holds for a much simpler embedded RISC processor in Fig. 17, where 48% of energy is consumed by instruction scheduling while 30% is consumed by data delivery, for a total of 78%.
As mentioned in Chapter 1 Section B, most programs have very high execution locality, with at most a couple thousand static instructions accounting for the majority of the dynamic execution (Fig. 18). Therefore, much of the work done to schedule instructions is redundant. Unfortunately, the argument of locality [241][251] has been largely overlooked. It is well known that most instructions can be scheduled in a group called a basic block. Whenever this basic block is executed, its internal data dependencies and execution ordering remain the same. Therefore, a statically mapped CGRA acting as a configurable CISC ALU might be able to save a significant amount of energy by scheduling blocks of instructions together as an extreme adaptation of micro-op fusion [249].

Fig. 16: Microarchitectural breakdown of energy usage in an out of order superscalar processor [44].
Fig. 17: Microarchitectural breakdown of energy usage in an embedded RISC processor [45].
Fig. 18: Instruction address locality in Spec Benchmarks.

There is one caveat though. If only one copy of an instruction block that is heavily used in close proximity is mapped to the CGRA, then performance might be hampered by structural dependencies. This is likely to happen in strictly iterative algorithms like the Newton-Raphson method. Fig. 19 shows the distribution of the lengths of two consecutive groups of cycles with identical patterns of available parallelism; cycles are not included in this distribution if no repeated pattern could be identified. This is ultimately an indirect measure, and there are many caveats to using this data. A repeated pattern of available parallelism could be made up of two completely different functions. Alternatively, two identical functions could be scheduled differently, showing different patterns of available parallelism. However, it is reasonable to assume an instruction block that is repeated for a long time will settle onto a repetitive schedule. The resulting data suggests that a few programs might be hampered by these structural dependencies. Therefore, it is likely a good idea for the CGRA to be able to duplicate a given configuration in multiple locations to avoid structural hazards.

Finally, basic blocks tend to be defined by the surrounding branch instructions, which require the instruction scheduling control circuitry to make a decision. However, not all branch instructions are made the same. They represent a barrier through which a collection of data routing decisions are made: some decide the routing of a significant amount of data while others decide the routing of just one value. Therefore, it could be possible to expand basic block size by merging simple branches, conditional moves, and for loops into their neighboring basic blocks. This would increase the size of a basic block, free up space in the branch predictor, and further reduce scheduling costs [212]. Further analysis of branch width distributions should be done to determine just how much of an effect this could have on basic block size.

2.3 Memory and Bandwidth

This form of micro-op fusion ultimately routes an operation's intermediate results directly to their next operation within the array. Doing so removes these intermediate values from the memory system entirely, reducing demand on the register rename table and the L0 and L1 data caches. This reduction is not linear: a lot of overhead is introduced by capacity limitations of the physical register file, and a CGRA effectively increases its size.
With a limited physical register file, values must be swapped in and out of main memory more often, introducing a lot of move instructions that otherwise would not be necessary. The instruction usage breakdown in Fig. 20 shows that move instructions are the single most used instruction in the x86-64 ISA. Overall, Fig. 21 shows that routing instructions account for 43% of all instructions executed on average across all program runs. This is a massive overhead that would not exist in an ideal machine, which is why it was removed from the parallelism and bandwidth measurements.

Fig. 19: Distribution of maximum local correlation of parallelism within 100 cycles.
Fig. 20: Distribution of instruction usage as a percent of total execution averaged over all programs.
Fig. 21: Distribution of instruction usage categories as a percent of total execution for all programs in Spec2006.

To my knowledge, only one paper has endeavored to measure memory and bandwidth requirements [36]. However, it is unclear whether they took this overhead into account to determine what an ideal system might see. They measured these requirements in the context of a particular architecture, focusing heavily on how cache size affects the memory bandwidth.

In many ways, the memory and bandwidth requirements of a program mirror its available parallelism: more instructions in parallel also means more data. Therefore, measuring them requires only minor modification to the available parallelism measurement tool. These requirements are examined from six different perspectives:

Read Reuse Distance: The distance in cycles to the previous read or write for each byte of data read in a given cycle.
Read Lifetime Distance: The distance in cycles to the write for each byte of data read in a given cycle.
Write Reuse Distance: The distance in cycles to the first read for each byte of data written in a given cycle.
Write Lifetime Distance: The distance in cycles to the last read for each byte of data written in a given cycle.
Store Reuse Distance: The distance in cycles to the next read for each byte of data currently in storage in a given cycle.
Store Lifetime Distance: The distance in cycles to the last read for each byte of data currently in storage in a given cycle.

The first four summarize read and write bandwidth while the last two summarize total memory storage requirements at a given cycle.
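As a concrete illustration, a minimal sketch of the bookkeeping for the first of these six, the read reuse distance, is shown below; the other five follow the same pattern with different endpoints. The names and the byte-granular trace format are hypothetical simplifications; the actual tool works over 1024-byte pages and swaps its statistics to disk as described earlier.

from collections import defaultdict

def read_reuse_distances(accesses):
    # accesses: list of (cycle, 'r' or 'w', address) for single bytes.
    # For each read, record the distance in cycles to the previous read
    # or write of the same byte (the Read Reuse Distance above).
    last_touch = {}
    histogram = defaultdict(int)
    for cycle, kind, addr in accesses:
        if kind == 'r' and addr in last_touch:
            histogram[cycle - last_touch[addr]] += 1
        last_touch[addr] = cycle
    return dict(histogram)

trace = [(0, 'w', 0x10), (1, 'r', 0x10), (5, 'r', 0x10), (6, 'w', 0x10)]
print(read_reuse_distances(trace))  # {1: 1, 4: 1}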
Fig. 22 shows an example outcome of the store reuse distance. For example, in cycle 50, 20000 bytes currently in storage will be read within the next 4096 cycles.

Fig. 22: Trace of memory requirements for perlbench.0.

Fig. 23 shows the distribution of storage requirements. For most programs, like gcc, h264ref, and soplex, the maximum storage requirements stay fairly consistent, showing tight distributions for 75% of cycles. However, there are a few, like gobmk, lbm, wrf, and mcf, in which the memory requirements vary wildly. This is simply due to a few highly parallel cycles. Ultimately, it is hard to know when a program might be at a particular location in this distribution, and given how tight many of the distributions are, it is unclear how much benefit could be gained by trying to take advantage of this.

Fig. 23: Distribution of total storage requirements for each program.

Fig. 24 shows the distribution of storage reuse distance for an average cycle. For most programs, the vast majority of this data is simply sitting around waiting more than 4096 cycles to be read. This is consistent with the program behaviors assumed by cache systems, with cache size relative to ALU proximity. The programs that more often interact with a much greater percentage of their data are also the programs that require significantly less total storage, like bzip2 and gobmk.

Fig. 24: Distribution of Storage Reuse Distance for an average cycle in each program.

The bytes that will be read in one cycle represent the average read bandwidth. The distribution of read bandwidth across cycles in Fig. 25 is not as tight: in general, the read bandwidth for 50% of cycles stays between 100 and 1000 bytes per cycle, though every program has outlying cycles that have a lot of parallelism and therefore consume a lot of bandwidth. Unfortunately, this suggests that the limiting factor for speedup of wrf and libquantum on a GPU will most definitely be bandwidth.

Fig. 25: Distribution of Read Bandwidth over all of the cycles in each program.

Fig. 26 shows the distribution of the read reuse distance for an average cycle. The vast majority of the read bandwidth is due to operators that were completed in the previous cycle; this accounts for no less than 40% of the read bandwidth requirement across all programs. For all but one program, around 70% of values were written or read at most 4 cycles ago. This is also consistent with the program behaviors assumed by cache systems implementing the least-recently-used swapping strategy.

Fig. 26: Read Reuse Distance for an average cycle in each program.

Fig. 27 shows the distribution of the read lifetime distance for an average cycle. Between 20% and 40% of the read bandwidth comes from bytes that were written exactly one cycle ago. These values can be routed directly through the CGRA, completely avoiding the memory system. Bytes with slightly longer lifetime distances can passively wait on the CGRA's routing network until the other dependencies have resolved.

Fig. 27: Read Lifetime Distance for an average cycle in each program.

The write bandwidth distributions in Fig. 28 are ultimately similar to the read bandwidth, though due to value fanout the write bandwidth is reduced by a small factor with respect to the read bandwidth. What is more interesting, in Fig. 29, is that for most programs around 80% of written values are read in the next cycle. This is consistent with the performance boost offered by bypassing networks. Likely, the difference between this and the read reuse distance comes from value fanout and, ultimately, values stored during program initialization.

Fig. 28: Distribution of Write Bandwidth over all of the cycles in each program.
Fig. 29: Write Reuse Distance for an average cycle in each program.

Finally, as shown by the write lifetime distance in Fig. 30, over 60% of written values for most programs last exactly 1 cycle and around 80% of written values last at most 4 cycles. This bandwidth represents the low-hanging fruit for optimization with a CGRA. These values would be routed directly through the CGRA to their next operation without ever touching the memory system. This can be particularly helpful with programs like libquantum and wrf that are likely to struggle with bandwidth on a GPU.

Fig. 30: Write Lifetime Distance for an average cycle in each program.

2.4 Bitwidth

The measured instruction locality suggests the CGRA's capacity will need to be fairly large for it to be useful. Furthermore, any method of reducing memory bandwidth requirements would help to alleviate the Von Neumann bottleneck in the face of high parallelism. Luckily, the bitwidth distribution can be used to manage both.
Fig. 31 shows 6 categories of arithmetic operations and their usage. The flags category contains bit test, set, scan, and count operations. The shift category contains logical shift, arithmetic shift, and rotation. The remaining categories are self explanatory. Add and subtract operations are responsible for 43% of all integer arithmetic operation executions averaged over all program runs. Including the subtraction required by the compare instructions, add and subtract operations represent 82% of all executed integer arithmetic operations. Furthermore, they are a heavily-used sub-operation for multiply and divide.

Fig. 31: Distribution of ALU operation usage across all programs.

Fig. 32 shows the distribution of the bitwidth of 2 trillion integer add and subtract operations, measured by taking the maximum bitwidth of each operation's inputs, averaged across all programs. There are four distributions centered around 6, 20, 26, and 29 bits. Then, there are spikes at 33 and 48 bits. Given that virtual addresses in an x86-64 architecture are 48 bits and the surrounding bitwidths have negligible occurrence counts, it is safe to assume that all of the 47 and 48 bit operations come from memory address calculations. Because those computations have a predictable bitwidth and require heavy utilization of multiplier circuitry, they should not be executed within any CGRA accelerator. Fig. 33 re-examines the data with this in mind, showing the average bitwidth both ignoring 47 and 48 bit operations in blue, and including them in red. For a few programs like calculix, povray, and omnetpp, memory address calculations seem to account for a significant fraction of total executed instructions. Meanwhile, others like sphinx3, gamess, and mcf have very few, suggesting either that they have low total memory requirements, or that memory locations can be determined at compile time.

Fig. 32: Average bitwidth distribution of integer operations across all programs.
Fig. 33: Average bitwidth of integer operations in each program.

The peak around 29 bits is likely from 32-bit masks. The peak around 6 bits is likely due to iterators and flags on top of the typical arithmetic distribution centered at 8 bits. It is unclear what causes the spikes at 20, 26, and 33 bits. Overall, the average bitwidth ignoring memory address calculations is about 15 bits.

While the bitwidth is highly variable, the alignment is not: Fig. 34 shows that those values tend to be aligned very close to 0. This means that an adaptive LSB-first datapath can be very simple, while an MSB-first datapath will require complex control to get anything out of the bitwidth distribution. Finally, as shown in Fig. 35, run lengths follow an exponential drop in occurrence rate, with vanishing probability past 6 bits.

Fig. 34: Alignment of integer addition operations. This is ultimately the location of the first non-zero bit.
Fig. 35: Distribution of run-length vs start-bit for the run.
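A minimal sketch of this measurement, assuming unsigned magnitudes and ignoring the tool's handling of sign: the bitwidth of an operation is the maximum bitwidth of its inputs, and the alignment is the position of the first non-zero bit. Function names are hypothetical.

def bitwidth(x):
    # Significant bits in the magnitude; a simplification that ignores
    # the sign handling of the actual tool.
    return abs(x).bit_length()

def lowest_set_bit(x):
    # Index of the first non-zero bit of a positive integer.
    return (x & -x).bit_length() - 1

def add_stats(a, b):
    # An add/sub operation is binned by the maximum bitwidth of its
    # inputs; alignment is the lowest set bit over the operands.
    width = max(bitwidth(a), bitwidth(b))
    operands = [abs(v) for v in (a, b) if v != 0]
    align = min(lowest_set_bit(v) for v in operands) if operands else 0
    return width, align

print(add_stats(12, 3))  # (4, 0): 12 = 0b1100, 3 = 0b11
print(add_stats(8, 16))  # (5, 3): both operands aligned at bit 3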
2.5 Lessons Learned

There are quite a few lessons to be pulled from this data. Regarding micro-architecture, a CGRA accelerator acting as a statically mapped configurable CISC ALU could:

1. dramatically reduce the energy required for instruction configuration;
2. reduce demand on the data memory system and register renaming table, effectively expanding the width of the physical register file by directly routing intermediate values through the CGRA;
3. reduce demand on the instruction memory system by fetching blocks of instructions at a time;
4. reduce demand on the branch predictor by encoding simple branches and loops directly on the CGRA fabric; and
5. increase achieved IPC by reducing demand on all of the bottlenecked systems while increasing parallel execution resources.

To limit the scope of this thesis, these benefits will have to be explored in future work. However, in the context of such a CGRA, this data also suggests that quite a few benefits could be realized from a digit-serial datapath that adapts to variable length operations. Specifically, an adaptive digit-serial datapath could:

1. reduce energy consumption and increase throughput per transistor by skipping unnecessary work;
2. natively implement multiple precision arithmetic;
3. require significantly fewer transistors per execution node, increasing the overall capacity of the CGRA;
4. slightly increase sequential instruction execution speed, because per-digit operations are likely to have lower latency; and
5. reduce memory bandwidth, reading or writing only one digit of each value per cycle and reducing the total number of digits read or written per value.

The following chapters explore self-timed adaptive digit-serial arithmetic operators with this context in mind, directly demonstrating 1 and 2 with functional circuitry and indirectly demonstrating 3 with a simple prototype. Items 4 and 5 require implementation of the whole system, which is not in the scope of this work.

CHAPTER 3
DESIGN METHODOLOGY

Next, there must be solid answers to three further questions: how should the digit stream be adapted to best take advantage of underlying features in the data, what are the optimal circuit families for such a task, and how should data be encoded to get the most out of those circuit families?

3.1 Digit-Serial Adaptivity

The end goal of adapting digit streams to characteristics in the underlying data is to reduce the amount of computation required to complete the arithmetic operations typically used throughout the execution of an application. There is a tenuous balance between increasing control complexity to take advantage of some underlying characteristic and minimizing the overhead of that control to reduce the per-digit cost of the remaining computation. Ultimately, without a full design-space exploration, there is no clear answer on what that balance should be. However, there are several features in the underlying data that should be considered.

First, in add and subtract operations, the carry signal propagates from the least significant digit to the most significant. While it is possible to implement serial most-significant-digit-first addition, doing so introduces overhead, and this overhead should not be overlooked. As noted in Fig. 31, addition and subtraction account for the vast majority of integer arithmetic operations; they are part of the comparison operator and a heavily used sub-operator of multiplication and division. Therefore, that overhead will add up very quickly.

Second, another commonly used operator, compare to zero, can often be resolved by simply looking at the most significant digit of each input operand. An MSB first datapath might make it possible to cancel any unnecessary computation on the remaining digits.
Because compare to zero operations are so common, this would affect a significant number of operations. However, it is unclear how efficient it would be to exploit this feature and what the overall benefit may be in relation to the overhead of carry chain propagation.

Taking advantage of the bitwidth in Fig. 32 and the alignment in Fig. 34 is easily accomplished with an LSB first datapath. Meanwhile, an MSB first datapath would have to determine a maximum possible bitwidth and keep track of the offset to the most significant bit of each value. Undoubtedly, this means that an MSB first datapath cannot easily take advantage of this feature without extremely high control overhead. Fig. 35 suggests that run-length encoding is unlikely to help much, particularly if the digits are any wider than 1 bit. For example, a run of 8 bits only occurs in one out of every 100 operations.

Given these features, there are three possible approaches. The first approach is a fixed-point LSB first digit-serial datapath in which bitwidth is compressed by marking the end of the stream. This takes advantage of the natural flow of the carry chain, the bitwidth distribution, and the alignment distribution to make the control circuitry as simple as possible while extracting what is likely to be the majority of the energy and throughput benefit.

The second approach is a floating-point, arbitrary-precision, redundant-encoding MSB first digit-serial datapath. This aggressively targets all of the available features. Providing support for floating point inherently compresses the bitwidth on the MSB side while arbitrary precision compresses it on the LSB side. Likely, floating point support would need to be implemented by encoding the exponent as its own value alongside the significand. Meanwhile, the arbitrary precision support can be implemented by simply marking the end of the stream. Redundant encoding makes it unnecessary to wait for the whole carry chain before resolving a digit when executing anything other than a compare operator. However, since compare operators only decide a boolean value, they can implement this wait without actively tracking the length of the carry chain. Because the most significant digit is the first in the stream, compare operators can finish quickly and cancel the rest of the incoming digits. The primary downside of this approach will likely be the complexity of implementing floating point support. In the end, the redundant encoding does not require an unreasonable amount of overhead, and the overhead of the arbitrary precision support is replicated in the LSB first approach as bitwidth compression.

The third approach is a variable-length, variable-precision, fixed-point combined LSB/MSB first datapath. This removes the need for MSB-first floating point control circuitry and adds arbitrary precision support to the datapath. Therefore, one could migrate some of the floating-point arithmetic over to such a datapath and reduce demand on the floating-point ALU.

Given that the second and third approaches have significantly higher complexity, likely reducing the overall capacity of the CGRA, this work will concentrate heavily on the first approach.

3.2 Circuit Topology

To implement the first approach, each value is encoded into a stream of tokens. Each token communicates the value of a digit alongside a flag marking whether it is the last one. The first token communicates the least significant digit while the last communicates the most significant.
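A behavioral sketch of this token stream may help, assuming radix-4 digits and unsigned values for simplicity (the actual datapath handles signed values through sign extension, covered later in this thesis). It shows how a length-adaptive serial add performs work proportional to the longer operand rather than to a fixed word width; all names are hypothetical.

def to_tokens(x, base=4):
    # Encode an unsigned value as LSB-first digit tokens; each token is
    # (digit, last) where last marks the end of the stream.
    digits = []
    while True:
        digits.append(x % base)
        x //= base
        if x == 0:
            break
    return [(d, i == len(digits) - 1) for i, d in enumerate(digits)]

def serial_add(a_tokens, b_tokens, base=4):
    # Add two token streams LSB first, carrying between digits.  The
    # loop runs for max(len(a), len(b)) digits, so short operands
    # finish early regardless of the machine's nominal word width.
    out, carry = [], 0
    n = max(len(a_tokens), len(b_tokens))
    for i in range(n):
        a = a_tokens[i][0] if i < len(a_tokens) else 0
        b = b_tokens[i][0] if i < len(b_tokens) else 0
        carry, digit = divmod(a + b + carry, base)
        out.append(digit)
    if carry:
        out.append(carry)
    return [(d, i == len(out) - 1) for i, d in enumerate(out)]

print(serial_add(to_tokens(13), to_tokens(6)))  # 19 = 103 in radix 4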
Ultimately, there are a large number of circuit topologies to choose from. Historically, the most successful choice has been synchronous logic. However, as discussed in Chapter 1, synchronous circuits are inefficient at implementing the irregular data and compute patterns required by an adaptive digit-serial CGRA; in the end, the request and acknowledge signals such an implementation requires are nearly identical to those of self-timed circuitry.

There are quite a few self-timed circuit families to choose from. The Quasi Delay-Insensitive (QDI) circuit families are the most reliable Turing complete option [79]. They have demonstrated correct operation through wide temperature and voltage swings, smoothly scaling operating frequency along the way [173]. They are also very flexible, implementing all kinds of complex behaviors, such as those found in [83] and [75]. As long as the acknowledgement requirement is maintained [82], this flexibility allows the designer to take advantage of irregular data and control patterns to skip unnecessary work, increasing throughput and saving energy.

However, these features come at a cost. Delay insensitive data encodings [85] must communicate both the data and its validity. This requirement introduces overhead as shown in Fig. 36.

Fig. 36: Wires required to represent a bit using a specific MofN code (top), relative energy required to communicate each bit (middle), and transistors per bit required to implement a validity gate for a specific MofN code assuming simple transistor sharing trees (bottom). Each curve shows a single selection of M while sweeping N.
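The wire-count curve in the top panel of Fig. 36 can presumably be reproduced with a short calculation: an MofN code raises exactly M of its N wires, giving C(N, M) legal codewords and therefore log2 C(N, M) bits carried across N wires. A sketch under that assumption:

from math import comb, log2

def wires_per_bit(m, n):
    # An MofN delay-insensitive code raises exactly m of n wires, so it
    # has comb(n, m) codewords and carries log2(comb(n, m)) bits.
    return n / log2(comb(n, m))

# Dual-rail (1of2) needs 2 wires per bit; 1of4 carries 2 bits on 4 wires.
print(wires_per_bit(1, 2))  # 2.0
print(wires_per_bit(1, 4))  # 2.0
print(wires_per_bit(2, 4))  # ~1.55: denser, but costlier to validate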
Ultimately, the necessary validity and neutrality checks can also require quite a few devices, consuming a great deal of die area compared to typical synchronous encodings. Because of this, designers generally avoid the more complex codes in favor of the smaller one-hot codes. Unlike clocked signals, data transmission absolutely requires a transition, burning energy even if the same value is sent multiple times. Steps can be taken to mitigate this problem, but all generalized approaches require a not-insignificant amount of circuitry. Finally, the acknowledgement requirement can wreak havoc on performance in certain scenarios; in particular, signals with large fan-out or fan-in require a rather large C-element tree. In the end, using QDI circuits in a parallel datapath with regular data patterns is extremely expensive. These features do, however, make QDI circuits well-suited for control circuitry.

Decades of industry effort have demonstrated how to implement wide datapath logic that is efficiently synchronized to a central control. Bundled-data circuits [63][74] demonstrate how this may be achieved for a QDI control rather than a clock. Historically, bundled-data circuits have maintained extremely simple control, staying as close to synchronous microarchitectures as possible. This effectively turns the asynchronous pipeline control into a reliable clock-distribution network that does not suffer from clock skew or jitter.

This work uses the strengths of both QDI and bundled-data to implement an efficient datapath with a flexible and expressive control. To my knowledge, no other work has ventured in this particular direction.

3.3 QDI Control Circuits

Under the delay insensitive (DI) delay model, a circuit should operate correctly independent of gate and wire delays. Correct operation means that the circuit remains stable, non-interfering, and deadlock-free. An instability, or glitch, can cause data loss or lead to interference; interference, or a short, can cause permanent circuit damage; and deadlock halts the computation prematurely. To achieve this goal, every transition must acknowledge every input to its driving gate. A transition b acknowledges another transition a if there is a causal sequence of transitions from a to b that prevents b from firing until after a has completed [90].

In order for this model to be Turing complete, the quasi-delay insensitive (QDI) delay model makes one exception to acknowledgement called the Isochronic Fork Assumption. If there is a wire fork to multiple gates, and one of those gates does not acknowledge all of the transitions on that wire, then the delay from the driver to the non-acknowledging gate is assumed to be bounded. In this model, it is always safe to place an inverter before the wire fork. However, because gates have unbounded delay, placing an inverter after an isochronic fork and before the non-acknowledging gate can cause an instability. Because the isochronic fork timing assumption is easy to guarantee and maintain, real QDI circuits are robust by construction to temperature variation, process variation, sizing, noise, etc. [161]. For a more detailed discussion of the QDI model and this timing assumption, see [90], [91], and [82].

Implementing control circuitry following the QDI delay model is often a difficult undertaking by itself. QDI circuits are often written in a control-flow language called Communicating Hardware Processes (CHP), described in Appendix A, and then synthesized into a Production Rule Set (PRS), described in Appendix B, using two basic methods. The first, Syntax-Directed Translation [66][62], maps the program syntax onto a predefined library of clockless processes through structural induction, creating a circuit that strictly respects the control flow behavior of the original program. Well formulated examples of this method may be found in [96] and [97]. The second, Formal Synthesis [80][65], iteratively applies a small set of formal program transformations like projection and process decomposition, decomposing the program until the resulting processes each represent a single pipeline stage. Then, these stages are synthesized into production rules using Martin Synthesis. This approach respects data dependencies, but not necessarily the original control-flow behavior of the specification [84].

This work uses a well-known hybrid approach, Templated Synthesis [68]. First, formal transformations are applied to decompose a CHP description into a collection of simple Dynamic Single Assignment (DSA) [86] CHP processes. Then, various template patterns and micro-architectural optimizations are applied to synthesize PRS, which are automatically verified and compiled into circuits. Overall, [68] describes different ways to handle the standard pipeline stage or “buffer”:

*[ L?v; R!v ]

Ultimately, each pipeline stage is implemented over four signals as shown in Fig. 37: the input requests Lr, the input enable Le, the output requests Rr, and the output enable Re. The C-elements used to drive the output requests are called the “forward drivers”. Each channel goes through two phases. In the “set phase”, the input requests go high along with the output enable.
The forward drivers drive the output requests high, and the input enable “acknowledges” the input requests by transitioning low. In the “reset phase”, the input requests are “reset”, meaning they transition low. The output enable acknowledges the output requests, transitioning low as well. As a result, the forward drivers reset the output requests and the input enable transitions high, “enabling” the input channel. These events may be implemented with one of three primary orderings, or “reshufflings”.

Fig. 37: Channel protocols for QDI buffers.

Weak-Condition Half Buffer (WCHB)

*[ [Re ∧ Lr]; Rr↾; Le⇂; [¬Re ∧ ¬Lr]; Rr⇂; Le↾ ]

The WCHB is the smallest and fastest of the three reshufflings. It is ultimately a symmetric handshake: the forward drivers always wait for both the output enable and the input request before transitioning. This makes conditional acknowledgement of the inputs and conditional output requests easier to handle because the reset of the forward drivers is mapped directly to the reset of the input requests and the acknowledgement of the output requests. However, this feature also means that the WCHB reshuffling tends to struggle with long transistor stacks in the reset phase of the forward drivers. If this happens, then there are a few strategies to mitigate the problem, the primary one being to go back to the CHP and decompose the process further.

Synthesis of most processes should start with a WCHB reshuffling. A 1-bit WCHB buffer is 20 transistors and has the smallest cycle time. With the optimization rules presented in the next section, it is generally possible to implement most functionality with a cycle time of 10 to 12 transitions.

Re ∧ Lr0 → Rr0↾
¬Re ∧ ¬Lr0 → Rr0⇂
Re ∧ Lr1 → Rr1↾
¬Re ∧ ¬Lr1 → Rr1⇂
Rr0 ∨ Rr1 → Le⇂
¬Rr0 ∧ ¬Rr1 → Le↾

Pre-Charge Half Buffer (PCHB)

*[ [Re ∧ Lr]; Rr↾; Le⇂; [¬Re]; Rr⇂; [¬Lr]; Le↾ ]

The PCHB reshuffling allows the output request to reset before the input request. That means it should generally be used if the input channel is significantly slower than the output channel in one or more cases. A 1-bit PCHB buffer is 34 transistors, and the cycle time is 12 to 14 transitions.

Re ∧ Le ∧ Lr0 → Rr0↾
¬Re ∧ ¬Le → Rr0⇂
Re ∧ Le ∧ Lr1 → Rr1↾
¬Re ∧ ¬Le → Rr1⇂
Lr0 ∨ Lr1 → Ln⇂
¬Lr0 ∧ ¬Lr1 → Ln↾
Rr0 ∨ Rr1 → Rn⇂
¬Rr0 ∧ ¬Rr1 → Rn↾
¬Ln ∧ ¬Rn → Le⇂
Ln ∧ Rn → Le↾

Pre-Charge Full Buffer (PCFB)

*[ [Re ∧ Lr]; Rr↾; Le⇂; en⇂; ([¬Re]; Rr⇂ ∥ [¬Lr]; Le↾); en↾ ]

The PCFB reshuffling lets the output request and input enable reset in parallel. This allows a token to wait at every stage of the pipeline instead of every other stage. In general, a PCFB reshuffling should not be used. However, if the design is latency or energy sensitive and the output channel is significantly slower than the input channel in one or more cases, then a PCFB allows the input enable to reset before the output enable. If the design is not latency or energy sensitive, then two WCHB buffers should take its place; this has fewer transistors and operates at higher throughput. A 1-bit PCFB buffer is 42 transistors with a cycle time of 14 transitions.

en ∧ Re ∧ Lr0 → Rr0↾
¬en ∧ ¬Re → Rr0⇂
en ∧ Re ∧ Lr1 → Rr1↾
¬en ∧ ¬Re → Rr1⇂
Rr0 ∨ Rr1 → Rn⇂
¬Rr0 ∧ ¬Rr1 → Rn↾
Lr0 ∨ Lr1 → Ln⇂
¬Lr0 ∧ ¬Lr1 → Ln↾
¬_en ∧ ¬Ln ∧ ¬Rn → Le⇂
_en ∧ Ln → Le↾
¬Le → en⇂
Rn ∧ Le → en↾
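These handshakes can be sanity-checked behaviorally. The sketch below is a toy production-rule evaluator stepped over the dual-rail WCHB rules from above, together with a trivial environment: a source that streams 0 digits and a sink that acknowledges every token. The rule firing order stands in for arbitrary gate delays; all names are hypothetical, and this is not the evaluation toolset described later in this chapter.

# Node values start at handshake reset: requests low, enables high.
state = {'Lr0': 0, 'Lr1': 0, 'Le': 1, 'Rr0': 0, 'Rr1': 0, 'Re': 1}

# Each production rule is (guard, node, value): when the guard is true
# and the node does not already hold the value, the rule may fire.
rules = [
    # WCHB dual-rail buffer, transcribed from the rules above.
    (lambda s: s['Re'] and s['Lr0'], 'Rr0', 1),
    (lambda s: not s['Re'] and not s['Lr0'], 'Rr0', 0),
    (lambda s: s['Re'] and s['Lr1'], 'Rr1', 1),
    (lambda s: not s['Re'] and not s['Lr1'], 'Rr1', 0),
    (lambda s: s['Rr0'] or s['Rr1'], 'Le', 0),
    (lambda s: not s['Rr0'] and not s['Rr1'], 'Le', 1),
    # Environment: source sends a 0 digit whenever enabled; sink
    # acknowledges whichever output request arrives.
    (lambda s: s['Le'], 'Lr0', 1),
    (lambda s: not s['Le'], 'Lr0', 0),
    (lambda s: s['Rr0'] or s['Rr1'], 'Re', 0),
    (lambda s: not s['Rr0'] and not s['Rr1'], 'Re', 1),
]

tokens = 0
for step in range(60):
    enabled = [(n, v) for g, n, v in rules if g(state) and state[n] != v]
    if not enabled:
        break  # no enabled rules: the circuit has deadlocked
    node, value = enabled[step % len(enabled)]  # arbitrary delay choice
    state[node] = value
    tokens += node == 'Rr0' and value == 1
print('tokens delivered:', tokens)  # several complete handshakes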
3.4 Synthesis Strategy

This synthesis approach builds upon Andrew Lines' Templated Synthesis method, starting with a flattened DSA CHP specification of a single pipeline stage process and deriving energy-efficient, high-throughput PRS.

1. Characterize the environment and state: Learn as much as possible about the circumstances under which this circuit will operate. Specifically, what are the possible input values? What are the possible output values? What kind of information needs to be stored? How frequently is each input value, output value, or state used? When is the internal state switching and when is it stable? Are there any relations between their usages? For example, might there be a common sequence of these values?

2. Encodings: What are all the possible ways in which the input requests, output requests, and internal state can be used to represent the function being implemented? Pick one of those encodings, concentrating on encodings that seem to play well with the characterizations from step 1.

3. Constraints and orderings: For this encoding, what are all of the different constraints? What are all the possible ways to order events? There is often a constraint that turns out to be not as strict as initially thought. Pick an ordering that seems to play well with the characterizations from step 1.

4. Group functionally equivalent behaviors: Group signals that have similar behaviors throughout the handshake. There are three parts of the handshake to consider in this process: the output requests, the input acknowledgement, and the internal memory. This helps to reduce the total number of forward drivers and simplify the transistor stacks in the reset phase of the forward drivers. In many cases, this can have a dramatic effect on all performance metrics. However, it often requires a lot of trial and error. Ultimately, C-elements are an expensive gate, and all QDI handshake protocols dictate that the output request lines require C-elements for implementation. The goal is to reduce the number of necessary C-elements as much as possible.

5. Make an attempt: Start to place all of these behaviors into the WCHB template, and learn from this process. What features make for a good encoding? What features make for a good ordering?

6. Push complexity out: If a particular part of the interface is making the reshuffling particularly complex, switch the interface to make it easier. This pushes the complexity out of the unit being developed; it will need to be dealt with when developing the interfacing units. Oftentimes this is easier, but it requires the interfacing units to be specialized for that particular situation. If not, then push complexity out of those units as well.

7. Avoid staticizers: Use combinational gates when possible. It is often possible to convert a C-element to a combinational gate by adding a few vacuous transitions to the handshake. This removes the need for a staticizer, benefiting all metrics. Ultimately, in a WCHB handshake, only the gates driving the output request wires need to be implemented with C-elements. Keep in mind that a gate need not be combinational to avoid a staticizer, but it must be driven in every possible state of the handshake.

8. Iterate: Go back to step 2 or 3 as necessary. As more encodings and orderings are attempted, there will be a much better understanding of the encodings and orderings that work, and the ones that do not.

3.5 Microarchitectural Optimizations

[68] provides a good starting point for QDI circuit development. However, these templates also make it easy to introduce a significant amount of overhead in the circuit without realizing it.
Fortunately, there are quite a few micro-architectural optimizations that make it easier to fit more complex computation in a single pipeline stage and greatly simplify stages without such computation. With these optimization rules, it is easier to take advantage of irregularities in the data for performance gains and energy savings. All of the provided examples try to keep the optimizations separated from each other; applying all of the optimizations to all of the examples would yield much better circuits overall.

Validity Trees

When the rules for the input enable become too long, a validity tree can be used to break the computation into multiple gates. If the validity tree uses the outputs of the forward drivers, Rr0 and Rr1, for its inputs, then the critical cycle will be increased by at least two transitions. If the process has an internal memory, then the validity trees may also be used to simplify the set logic for the internal memory; doing so depends heavily upon the compatibility between the logic for the input enable and the internal memory.

Re ∧ Lr0 → Rr0↾
¬Re ∧ ¬Lr0 → Rr0⇂
Re ∧ Lr1 → Rr1↾
¬Re ∧ ¬Lr1 → Rr1⇂
Rr0 ∨ Rr1 → Rv↾
¬Rr0 ∧ ¬Rr1 → Rv⇂
Rv → Le⇂
¬Rv → Le↾

Intermediate Forward Drivers

Similarly, if the logic in the output requests does not align well to the input enable or internal memory, then intermediate forward drivers may be used. Instead of allocating a single forward driver to each output request, multiple forward drivers cover different conditions for a single output request, and a gate tree is used to do the final combination. Once again, if the gate tree uses the outputs of the forward drivers R0 or R1, then the critical cycle will be increased by at least two transitions.

Re ∧ Lr0 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ Lr1 → R1↾
¬Re ∧ ¬Lr1 → R1⇂
R0 ∨ R1 → Rr↾
¬R0 ∧ ¬R1 → Rr⇂
R0 ∨ R1 → Le⇂
¬R0 ∧ ¬R1 → Le↾

nLatch Internal Memory

Internal state is often an integral part of any complex computation. Unfortunately, the strategies covered in [68] are fairly limited. In the most basic implementation, the internal memory is added as an n-latch with its signals transitioning up then down. If the internal memory is not written in the same cycle it is read, then the write rules simply acknowledge the down-going transition of the latch and the read rules use the output of the latch directly. This is demonstrated pretty well with a single-bit register. The input channel L has three requests: the write requests Lw0 and Lw1 set the value of the internal memory, while the read request Lr reads the value of the internal memory to the read channel R.

v:=0;
*[ L?l; [ lr → R!v [] lw → v:=lw ] ]

This could be implemented using the strategies in [68] as follows. The write requests are stored with two C-elements driving R0 and R1. Once stored, the input channel is acknowledged by lowering Le, and the internal memory v0 and v1 is set using the internal nodes of the forward drivers _R0 and _R1. In the reset phase of the handshake, the forward drivers wait for the down-going transition of the latch before resetting. Once the forward drivers have been reset, the input channel is enabled by raising Le. Meanwhile, the read request emits the value of the internal memory directly to the requests on R. This allows the input channel to be acknowledged by lowering Le. When the output channel is acknowledged and the input channel is reset, the forward drivers for Rr0 and Rr1 may be reset and the input channel enabled.
Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v0 → R1⇂
Re ∧ Lr ∧ v0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
¬v1 ∨ ¬_R0 → v0↾
v1 ∧ _R0 → v0⇂
¬v0 ∨ ¬_R1 → v1↾
v0 ∧ _R1 → v1⇂
R0 ∨ R1 ∨ Rr0 ∨ Rr1 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬Rr0 ∧ ¬Rr1 → Le↾

3-Valued Internal Memory (Positive)

The first thing to note is that QDI circuits are not limited to 2-valued latches. Sometimes, a process only has three internal states. Encoding this with two 2-valued latches adds unnecessary transitions on the internal memory and an unnecessary state. In this case, it is more energy efficient and easier to use a 3-valued latch. While this saves energy, it does not reduce the size of the circuit, because a 3-valued latch ultimately requires the same number of transistors as two 2-valued latches. This is demonstrated using a 3-valued register. All the behaviors remain unchanged relative to the 2-valued register, but a new write request Lw2 and a new read value Rr2 are added to the handshake.

Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 ∧ ¬v2 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v0 ∧ ¬v2 → R1⇂
Re ∧ Lw2 → R2↾
¬Lw2 ∧ ¬v0 ∧ ¬v1 → R2⇂
Re ∧ Lr ∧ v0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
Re ∧ Lr ∧ v2 → Rr2↾
¬Re ∧ ¬Lr → Rr2⇂
¬v1 ∧ ¬v2 ∨ ¬_R0 → v0↾
(v1 ∨ v2) ∧ _R0 → v0⇂
¬v0 ∧ ¬v2 ∨ ¬_R1 → v1↾
(v0 ∨ v2) ∧ _R1 → v1⇂
¬v0 ∧ ¬v1 ∨ ¬_R2 → v2↾
(v0 ∨ v1) ∧ _R2 → v2⇂
R0 ∨ R1 ∨ R2 → Wv↾
¬R0 ∧ ¬R1 ∧ ¬R2 → Wv⇂
Rr0 ∨ Rr1 ∨ Rr2 → Rv↾
¬Rr0 ∧ ¬Rr1 ∧ ¬Rr2 → Rv⇂
Rv ∨ Wv → Le⇂
¬Wv ∧ ¬Rv → Le↾

There are two things to notice in this example. First, the rule driving Le↾ would have been 6 transistors long, requiring the validity trees Wv and Rv to break up the transistor stack. Methods to solve this problem will be presented later in this section. Second, take note that the reset rules for the forward drivers of the write, R0, R1, and R2, now have to check the down-going transitions of two states in the internal memory. This can cause trouble when the computation requires longer reset rules in the forward drivers to begin with. Ultimately, there are strategies to mitigate this. For example, suppose that state 0 always transitions to state 1, state 1 always transitions to state 2, and state 2 always transitions to state 0 in a ring. In this case, R0⇂ would only need to acknowledge ¬v2, R1⇂ would only need to acknowledge ¬v0, and R2⇂ would only need to acknowledge ¬v1. In general, if there are constraints on the state transitions, they can be used to reduce these transistor stacks.

3-Valued Internal Memory (Negative)

Alternatively, if the transistor stacks in the reset phase of the forward drivers become too long and there are not any usable constraints, then it is possible to flip the sense of the internal memory from positive to negative. Instead of encoding state 0 as v0↾, v1⇂, v2⇂, it is encoded as v0⇂, v1↾, v2↾. This is reflected across the other two states as well, reducing the transistor stack length of the reset of the forward drivers for the write, but increasing the transistor stack length of the forward drivers for the read.
Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v0 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v1 → R1⇂
Re ∧ Lw2 → R2↾
¬Lw2 ∧ ¬v2 → R2⇂
Re ∧ Lr ∧ v1 ∧ v2 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v0 ∧ v2 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
Re ∧ Lr ∧ v0 ∧ v1 → Rr2↾
¬Re ∧ ¬Lr → Rr2⇂
¬v1 ∨ ¬v2 ∨ ¬_R1 ∨ ¬_R2 → v0↾
v1 ∧ v2 ∧ _R1 ∧ _R2 → v0⇂
¬v0 ∨ ¬v2 ∨ ¬_R0 ∨ ¬_R2 → v1↾
v0 ∧ v2 ∧ _R0 ∧ _R2 → v1⇂
¬v0 ∨ ¬v1 ∨ ¬_R0 ∨ ¬_R1 → v2↾
v0 ∧ v1 ∧ _R0 ∧ _R1 → v2⇂
R0 ∨ R1 ∨ R2 → Wv↾
¬R0 ∧ ¬R1 ∧ ¬R2 → Wv⇂
Rr0 ∨ Rr1 ∨ Rr2 → Rv↾
¬Rr0 ∧ ¬Rr1 ∧ ¬Rr2 → Rv⇂
Rv ∨ Wv → Le⇂
¬Wv ∧ ¬Rv → Le↾

Internal Memory Completion Signal

If none of these strategies works to reduce the length of the transistor stacks in the reset phase of the forward drivers, then a completion detection gate We can be applied to the internal memory. This adds two transitions to the path that writes the internal memory. However, those two transitions do not increase the critical cycle beyond 10 transitions, since they overlap with the handshake on L and R.

Re ∧ We ∧ Lw0 → R0↾
¬We ∧ ¬Lw0 → R0⇂
Re ∧ We ∧ Lw1 → R1↾
¬We ∧ ¬Lw1 → R1⇂
Re ∧ Lr ∧ v0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
¬v1 ∨ ¬_R0 → v0↾
v1 ∧ _R0 → v0⇂
¬v0 ∨ ¬_R1 → v1↾
v0 ∧ _R1 → v1⇂
¬_R0 ∧ ¬v1 ∨ ¬_R1 ∧ ¬v0 → We⇂
(_R0 ∨ v1) ∧ (_R1 ∨ v0) → We↾
R0 ∨ R1 ∨ Rr0 ∨ Rr1 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬Rr0 ∧ ¬Rr1 → Le↾

pLatch Internal Memory

If there are other signals that would benefit from the internal memory completion detection gate, but their up-going sense must be checked instead of their down-going sense, then the internal memory can be flipped from an n-latch to a p-latch to accommodate them. This does not change the critical cycle; instead, it removes the inverter that was previously on the completion detection signal and uses the inverter on the C-element of the forward driver instead.

Re ∧ We ∧ Lw0 → R0↾
¬Lw0 ∧ ¬We → R0⇂
Re ∧ We ∧ Lw1 → R1↾
¬Lw1 ∧ ¬We → R1⇂
Re ∧ Lr ∧ v0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
¬v1 ∧ ¬R1 → v0↾
v1 ∨ R1 → v0⇂
¬v0 ∧ ¬R0 → v1↾
v0 ∨ R0 → v1⇂
R0 ∧ v0 ∨ R1 ∧ v1 → We⇂
(¬R0 ∨ ¬v0) ∧ (¬R1 ∨ ¬v1) → We↾
R0 ∨ R1 ∨ Rr0 ∨ Rr1 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬Rr0 ∧ ¬Rr1 → Le↾

C-element Internal Memory

Of course, all of these approaches can be mixed. This makes the internal memory a C-element instead of a latch. In this example, R0 drives v0↾; v1⇂ as an n-latch would, and R1 drives v0⇂; v1↾ as a p-latch would. Notably, this C-element is not combinational, but it also does not require an explicit staticizer. This is because R0 and R1 are guaranteed to be mutually exclusive. Specifically, if _R0 is low, driving v0↾, then R1 is guaranteed to be low, so the rule driving v0⇂ is off. If R1 is high, driving v0⇂, then _R0 is guaranteed to be high, so the rule driving v0↾ is off. If the handshake is in a neutral state and _R0 is high while R1 is low, then v0 is staticized based upon the value of v1.

Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 → R0⇂
Re ∧ We ∧ Lw1 → R1↾
¬Lw1 ∧ ¬We → R1⇂
Re ∧ Lr ∧ v0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
¬v1 ∧ ¬R1 ∨ ¬_R0 → v0↾
v1 ∧ _R0 ∨ R1 → v0⇂
¬v0 → v1↾
v0 → v1⇂
¬R1 ∨ ¬v1 → We↾
R1 ∧ v1 → We⇂
R0 ∨ R1 ∨ Rr0 ∨ Rr1 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬Rr0 ∧ ¬Rr1 → Le↾

Note that the completion gate We need only check the up-going transition on v1; the down-going transition of v1 is acknowledged by the reset rule for R0.

Protecting Forward Drivers (Mutex)

Suppose that the internal memory is written and read in the same cycle.
For example, the following process records the previous data from the input channel L and sends the XOR of the previous and current data on the output channel R.

*[ L?l; R!(l^v); v:=l ]

In this example, there are four forward drivers covering each case of the XOR: R0 covers v0, Lr0; R1 covers v1, Lr0; R2 covers v0, Lr1; and R3 covers v1, Lr1. For R0 and R3, the result of the XOR is zero, so Rr0 is driven high. For R1 and R2, the result of the XOR is one, so Rr1 is driven high. For R0 and R3, the input request has the same value as the internal memory, so the internal memory is left unchanged. Meanwhile, R1 and R2 swap the value of the internal memory.

Re ∧ v0 ∧ Lr0 ∧ _R1 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ v1 ∧ Lr0 → R1↾
¬Re ∧ ¬v1 ∧ ¬Lr0 → R1⇂
Re ∧ v0 ∧ Lr1 → R2↾
¬Re ∧ ¬v0 ∧ ¬Lr1 → R2⇂
Re ∧ v1 ∧ Lr1 ∧ _R2 → R3↾
¬Re ∧ ¬Lr1 → R3⇂
¬v1 ∨ ¬_R1 → v0↾
v1 ∧ _R1 → v0⇂
¬v0 ∨ ¬_R2 → v1↾
v0 ∧ _R2 → v1⇂
R0 ∨ R3 → Rr0↾
¬R0 ∧ ¬R3 → Rr0⇂
R1 ∨ R2 → Rr1↾
¬R1 ∧ ¬R2 → Rr1⇂
R0 ∨ R1 ∨ R2 ∨ R3 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 ∧ ¬R3 → Le↾

If _R1 did not gate R0↾, or _R2 did not gate R3↾, then the forward driving rules would be unstable. For example, suppose that Lr0 transitions high while v1 is high. This means that R1 would go high as a result. Then, before L is acknowledged, R1 drives v0 high. This enables R0↾ while R1 is already high, breaking the mutual exclusion requirement of the delay insensitive encoding. When L is acknowledged and Lr0 transitions low, R0↾ will be disabled, causing a glitch to propagate out Rr0.

There are three techniques to avoid this instability. The first is to manually guarantee mutual exclusion of the offending forward drivers using the internal nodes of the others, as demonstrated above. This is the fastest method, maintaining a 10 transition cycle time. However, it also makes the transistor stacks of the forward drivers one transistor longer.

Protecting Forward Drivers (Output Enable)

The second technique to protect the forward drivers from this instability is to wait until the output request is acknowledged before changing the internal state. Once the output request has been acknowledged, Re is low, cutting off the up-going rules of the forward drivers. Because the internal memory now has to wait for the handshake on the output channel, the critical cycle is increased to 12 transitions. This technique should only be used if the transistor stacks in the forward drivers are already too long for the mutex approach. It has the added benefit of reducing the transistor stacks in the reset phase as well: since v0↾ acknowledges Re⇂, R1 does not have to. This transistor stack length optimization only works if the internal memory is guaranteed to transition as a result of this forward driver.

Re ∧ v0 ∧ Lr0 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ v1 ∧ Lr0 → R1↾
¬v1 ∧ ¬Lr0 → R1⇂
Re ∧ v0 ∧ Lr1 → R2↾
¬v0 ∧ ¬Lr1 → R2⇂
Re ∧ v1 ∧ Lr1 → R3↾
¬Re ∧ ¬Lr1 → R3⇂
¬v1 ∨ ¬Re ∧ ¬_R1 → v0↾
v1 ∧ (Re ∨ _R1) → v0⇂
¬v0 ∨ ¬Re ∧ ¬_R2 → v1↾
v0 ∧ (Re ∨ _R2) → v1⇂
R0 ∨ R3 → Rr0↾
¬R0 ∧ ¬R3 → Rr0⇂
R1 ∨ R2 → Rr1↾
¬R1 ∧ ¬R2 → Rr1⇂
R0 ∨ R1 ∨ R2 ∨ R3 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 ∧ ¬R3 → Le↾

Protecting Forward Drivers (Input Request)

If a forward driver writes the internal memory but does not make an output request, then the output request is not available to gate the transitions on the internal memory and protect the forward drivers from instability. In this case, the input requests may be used to do so. Unfortunately, if the input requests are not easily mapped to the transitions in the internal memory, then this will become very messy.
Like the output enable approach, this increases the critical cycle to 12 transitions.

Re ∧ v0 ∧ Lr0 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ v1 ∧ Lr0 → R1↾
¬Re ∧ ¬v1 → R1⇂
Re ∧ v0 ∧ Lr1 → R2↾
¬Re ∧ ¬v0 → R2⇂
Re ∧ v1 ∧ Lr1 → R3↾
¬Re ∧ ¬Lr1 → R3⇂
¬v1 ∨ ¬Lr0 ∧ ¬_R1 → v0↾
v1 ∧ (Lr0 ∨ _R1) → v0⇂
¬v0 ∨ ¬Lr1 ∧ ¬_R2 → v1↾
v0 ∧ (Lr1 ∨ _R2) → v1⇂
R0 ∨ R3 → Rr0↾
¬R0 ∧ ¬R3 → Rr0⇂
R1 ∨ R2 → Rr1↾
¬R1 ∧ ¬R2 → Rr1⇂
R0 ∨ R1 ∨ R2 ∨ R3 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 ∧ ¬R3 → Le↾

Exchange Channels (Positive)

Sometimes, the implementation of some functionality requires data to be communicated in both directions. The sending process sends data as part of the request, and the receiving process sends data back as part of the acknowledge or enable. Effectively, the two processes exchange data every cycle. Often, the data sent by the receiving process will be status data informing the sending process whether the computation has completed. There are two ways of achieving this, depending upon the phase of the handshake in which the receiving process communicates its data.

In the first approach, the receiving process communicates its data during the reset phase as part of the enable of the input requests. The input requests are acknowledged when the encoding is in a neutral state and enabled when the encoding has a valid value. This means that the value being returned to the sender remains valid until another token is sent. The example used here is a simple exchange channel buffer. The receiving channel L receives l from the input requests, then returns r. The sending channel R forwards the received value l and is returned a new value for r. Because the new value for r is received from the channel R after the old value has been sent across L, it must be stored in an internal memory for a cycle.

r:=0;
*[ L?l!r; R!l?r ]

The implementation turns out to be fairly straightforward. Only one of the output enables will be high, signifying the data that is being returned. This is combined with the input request to form four intermediate forward drivers. R0 forwards the output request Rr0 and sets the internal memory to 0. R1 forwards the output request Rr0 and sets the internal memory to 1. Similarly, R2 and R3 both forward the output request Rr1 and set the internal memory to 0 and 1 respectively.

Re0 ∧ Lr0 → R0↾
¬Re0 ∧ ¬Lr0 ∧ ¬v1 → R0⇂
Re1 ∧ Lr0 → R1↾
¬Re1 ∧ ¬Lr0 ∧ ¬v0 → R1⇂
Re0 ∧ Lr1 → R2↾
¬Re0 ∧ ¬Lr1 ∧ ¬v1 → R2⇂
Re1 ∧ Lr1 → R3↾
¬Re1 ∧ ¬Lr1 ∧ ¬v0 → R3⇂
¬v1 ∨ ¬Le0 ∧ (¬_R0 ∨ ¬_R2) → v0↾
v1 ∧ (Le0 ∨ _R0 ∧ _R2) → v0⇂
¬v0 ∨ ¬Le1 ∧ (¬_R1 ∨ ¬_R3) → v1↾
v0 ∧ (Le1 ∨ _R1 ∧ _R3) → v1⇂
R0 ∨ R1 → Rr0↾
¬R0 ∧ ¬R1 → Rr0⇂
R2 ∨ R3 → Rr1↾
¬R2 ∧ ¬R3 → Rr1⇂
v1 ∨ Rr0 ∨ Rr1 → Le0⇂
¬v1 ∧ ¬Rr0 ∧ ¬Rr1 → Le0↾
v0 ∨ Rr0 ∨ Rr1 → Le1⇂
¬v0 ∧ ¬Rr0 ∧ ¬Rr1 → Le1↾

The forward drivers lower both of the input enables, setting the input enable to a neutral state. Because the internal memory is used by the input enable, it must wait until the input enable is in a neutral state before transitioning. If the internal memory transitions before this happens, then the value held on the input enable will start to switch, and when the input enable is finally driven neutral by the forward drivers, it would cause a glitch. Once the input enable is lowered, the internal memory is set according to the value stored in the forward drivers. Each forward driver then waits for its respective reset conditions and resets. When the input requests are enabled, the internal memory is used to gate the input enable and ensure that the correct value is returned.
Le0 is blocked if the internal memory is set to 1, and Le1 is blocked if the internal memory is set to 0. Then, the gates driving Le0 and Le1 are made combinational using v0 and v1.

Exchange Channels (Negative)

In the second approach, the receiving process communicates its data in the set phase as part of the acknowledge of the input requests. The input requests are enabled when the input enables are in a neutral state and acknowledged when they hold a valid value. This means that the value being returned by the receiving process is only valid while the token is being sent, and it is up to the sending process to record that value until the next cycle. Going back to the example of the exchange channel buffer, the forward drivers record the four cases derived from the combination of the internal memory and the input request. Both output enables will be high, but the internal memory records the last enable to go low. Therefore, the forward drivers only need to acknowledge the up-going transition of the output enable that last went low. Now, R0 and R1 set the output request Rr0 and acknowledge with Le0 and Le1 respectively. R2 and R3 set the output request Rr1 and acknowledge with Le0 and Le1 respectively.

Re0 ∧ v0 ∧ Lr0 → R0↾
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr0 → R0⇂
Re1 ∧ v1 ∧ Lr0 → R1↾
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr0 → R1⇂
Re0 ∧ v0 ∧ Lr1 → R2↾
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr1 → R2⇂
Re1 ∧ v1 ∧ Lr1 → R3↾
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr1 → R3⇂
¬v1 ∨ ¬Re0 → v0↾
v1 ∧ Re0 → v0⇂
¬v0 ∨ ¬Re1 → v1↾
v0 ∧ Re1 → v1⇂
R0 ∨ R1 → Rr0↾
¬R0 ∧ ¬R1 → Rr0⇂
R2 ∨ R3 → Rr1↾
¬R2 ∧ ¬R3 → Rr1⇂
R0 ∨ R2 → Le0⇂
¬R0 ∧ ¬R2 → Le0↾
R1 ∨ R3 → Le1⇂
¬R1 ∧ ¬R3 → Le1↾

The internal memory is set directly from the output enable, and the reset of the forward drivers no longer maps to the value returned by the output enable or the resulting value of the internal memory. This means that an XOR gate is required to check the completion of the transition on the internal memory. Following the reset of the forward drivers, the input is enabled, with both Le0 and Le1 transitioning high to a neutral state.

Storing a 1of2 Request (Inverted Singlerail Out)

If an internal memory simply records the value of a delay insensitive encoding, then it is often possible to have that happen directly. In this example, a dualrail request r0, r1 is stored by the p-latch v0, v1. An XOR gate is then used to generate the completion signal. The down-going transition of _o signals that the data on r0, r1 is valid and has been successfully stored in v0, v1. The up-going transition signals that r0, r1 has transitioned to a neutral state. If the input request changes the value of the internal memory, then there are three transitions from that input request to the down-going transition on _o; for example, r0↾ causes v1⇂; v0↾; _o⇂. If the input request is the same value as the internal memory, then there is only one transition, _o⇂.

v1 ∨ r1 → v0⇂
¬v1 ∧ ¬r1 → v0↾
v0 ∨ r0 → v1⇂
¬v0 ∧ ¬r0 → v1↾
v0 ∧ r0 ∨ v1 ∧ r1 → _o⇂
(¬v0 ∨ ¬r0) ∧ (¬v1 ∨ ¬r1) → _o↾

Memory Gated Forward Drivers

This optimization is a combination of the intermediate forward drivers and the internal memory unit. Suppose multiple forward drivers share the same behaviors regarding the input enable and the internal memory, but drive different output requests. Further suppose that the internal memory is able to differentiate between the two output request cases in a stable way throughout the handshake.
Memory Gated Forward Drivers

This optimization is a combination of the intermediate forward drivers and the internal memory unit. Suppose multiple forward drivers share the same behaviors regarding the input enable and the internal memory, but drive different output requests. Further suppose that the internal memory is able to differentiate between the two output request cases in a stable way throughout the handshake. Then, the forward drivers for those output requests may be combined into one, and the intermediate forward drivers method may be used with the internal memory to generate the separated output requests. This may also combine forward drivers that have output requests with forward drivers that do not.

In this example, there are three input requests. Lw0 sets the internal memory to 0 , Lw1 sets the internal memory to 1 , and Lr is conditionally forwarded on R depending upon the value of v . If v is 0 , then Lr is simply acknowledged with no further action. If v is 1 , then the request is forwarded across R .

*[L?l; [ lw → v:=lw ▯ lr ∧ v=0 → skip ▯ lr ∧ v=1 → R! ]]

The write commands Lw0 and Lw1 are implemented following the internal memory strategies discussed previously. However, the read command Lr only requires a single C-element driving R2 . Because the internal memory remains stable during the handshake from Lr , it can be used to gate the intermediate forward driver R2 to conditionally raise the request on Rr . The reset of R2 must wait for the output request to be acknowledged if an output request was forwarded. Once again, the internal memory can be used to differentiate the two cases. A behavioral sketch of this process follows the production rules.

Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v0 → R1⇂
Re ∧ Lr → R2↾
(¬Re ∨ ¬v1) ∧ ¬Lr → R2⇂
R2 ∧ v1 → Rr↾
¬R2 ∨ ¬v1 → Rr⇂
¬v1 ∨ ¬_R0 → v0↾
¬v0 ∨ ¬_R1 → v1↾
v1 ∧ _R0 → v0⇂
v0 ∧ _R1 → v1⇂
R0 ∨ R1 ∨ R2 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 → Le↾
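As a cross-check of the specification, the short Python sketch below models the process at the token level; it is an illustrative abstraction, not part of the toolset of Section 3.10.

def memory_gated_unit(commands):
    """Token-level model of *[L?l; [ lw -> v:=lw ▯ lr ∧ v=0 -> skip ▯ lr ∧ v=1 -> R! ]].
    commands is a list of ('w', 0 or 1) writes and ('r',) reads; returns
    how many read requests were forwarded on R."""
    v = 0
    forwarded = 0
    for cmd in commands:
        if cmd[0] == 'w':
            v = cmd[1]      # Lw0 / Lw1 set the internal memory
        elif v == 1:
            forwarded += 1  # Lr with v=1 raises the request on Rr
        # Lr with v=0 is acknowledged with no further action
    return forwarded

assert memory_gated_unit([('w', 1), ('r',), ('w', 0), ('r',), ('r',)]) == 1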
3.6 Half-Cycle Timing Assumption

A very constrained timing assumption beyond the Isochronic Fork may be used to remove extra transitions introduced by some of these approaches. Specifically, the Half-Cycle Timing Assumption (HCTA) introduced in [72] and [73] helps to keep the cycle time within 10 transitions. A C-element has an internal node that is staticized using the output. The HCTA assumes that the internal node will be successfully staticized before the inputs to the C-element cut off the main driver. This allows the internal node to be used separately from the output node. There are three primary cases where this can help.

Validity Trees

Instead of building the validity tree from the output nodes Rr0 and Rr1 and adding two transitions to the handshake, the HCTA would use the internal nodes of the forward drivers _Rr0 and _Rr1 . This keeps the critical cycle time unchanged, and while this is a timing assumption, it is not altogether difficult to guarantee in layout. Specifically, this assumes that Rr0 and Rr1 resolve, staticizing _Rr0 and _Rr1 , before Lr0 or Lr1 are lowered, cutting off the main driver. Ultimately, this means that one transition internal to a process must complete before four across an external channel. If the input requests are lowered before the outputs of the forward drivers resolve, then the staticizers of the forward drivers may drive the internal node back up, causing a glitch and likely deadlock.

Re ∧ Lr0 → Rr0↾
¬Re ∧ ¬Lr0 → Rr0⇂
Re ∧ Lr1 → Rr1↾
¬Re ∧ ¬Lr1 → Rr1⇂
¬_Rr0 ∨ ¬_Rr1 → Rv↾
_Rr0 ∧ _Rr1 → Rv⇂
Rv → Le⇂
¬Rv → Le↾

Intermediate Forward Drivers

Similarly, the Half-Cycle Timing Assumption can be used to generate a combined output from the forward drivers without introducing any extra transitions. The tree is started from the internal nodes of the forward drivers _R0 and _R1 . This keeps the critical cycle unchanged, assuming that R0↾ or R1↾ resolves before Re is lowered. Once again, this means that one transition internal to a process must complete before four across a channel. If the output enable is lowered before the forward drivers resolve, the staticizers can once again cause a glitch or deadlock.

Re ∧ Lr0 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ Lr1 → R1↾
¬Re ∧ ¬Lr1 → R1⇂
¬_R0 ∨ ¬_R1 → Rr↾
_R0 ∧ _R1 → Rr⇂
R0 ∨ R1 → Le⇂
¬R0 ∧ ¬R1 → Le↾

Memory Gated Forward Drivers

For the memory gated forward drivers, the latch implementing the memory guarantees the existence of the negated signal as well. This means that the internal nodes of the forward drivers and the negated signal of the internal memory can be used to generate the output request.

Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v0 → R1⇂
Re ∧ Lr → R2↾
(¬Re ∨ ¬v1) ∧ ¬Lr → R2⇂
¬_R2 ∧ ¬v0 → Rr↾
_R2 ∨ v0 → Rr⇂
¬v1 ∨ ¬_R0 → v0↾
¬v0 ∨ ¬_R1 → v1↾
v1 ∧ _R0 → v0⇂
v0 ∧ _R1 → v1⇂
R0 ∨ R1 ∨ R2 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 → Le↾

3.7 QDI Treatment for Pass Transistor Logic

Pass transistor logic represents an interesting opportunity to dramatically expand the expressibility of a single QDI pipeline stage by reducing circuit size and complexity. There are two ways that pass transistor logic can help. First, the WCHB reshuffling often results in duplicated logic in each of the forward drivers in either the set phase, the reset phase, or both. In such cases, it is often desirable to factor that shared logic into its own gate. However, with CMOS logic this adds extra transitions to the handshake, which can slow the circuit significantly, forcing a choice between large load capacitances and long cycle times. Implementing these gates with pass transistor logic can reduce the load capacitances by factoring out the logic without adding any extra transitions to the handshake. Second, many of the optimizations that depend upon the Half-Cycle Timing Assumption can be replaced with pass transistor logic that makes no such assumption. This makes the circuit more robust without any sacrifice in performance.

Furthermore, pass transistor logic often requires differential signals for correct operation, meaning that both the signal and its inverse must be available. In a WCHB reshuffling, most signals come from either a C-element or a latch, both of which produce differential signals. The only signals that do not come from one of these two gates are the input enables. This means that pass transistor logic often fits easily into the handshake without much extra effort.

Finally, pass transistor logic can get special treatment regarding the QDI delay model. As discussed in the previous section, the QDI delay model has an acknowledgement requirement. Because pass transistor logic does not introduce a gate delay, it can ultimately be viewed as a low quality wire. This means that acknowledging the passed signal can also acknowledge the output of the pass transistor gate. Any approach that does this should not be taken lightly, however. The output load on the pass transistor gate and the sizing ultimately determine the delay of the gate. If the pass gate is sized too small relative to the load, then this feature is no longer safe. Furthermore, the pass transistor gate should be placed in layout in the same cell as the signal it passes. This reduces the delay associated with the logic.

There are a few basic pass transistor gates that can be easily applied to QDI circuits: the XOR gate, the AND gate, and the OR gate. Each of these gates introduces only one transistor to any transistor stack. This keeps the delay introduced by these gates to a minimum.

Pass Transistor XOR

There are ultimately hundreds of possible XOR gate constructions.
Each one has a collection of undesirable states that result in either interference or a weakly driven output. The interfering states can be avoided depending upon the neutral states of the inputs, or mitigated depending upon the length of time spent in the neutral states and the time at which the result is needed. Meanwhile, the weak driving states can be fixed with staticizing transistors using other signals in the handshake. Only two pass-transistor XOR gates are generally found in the literature, and of the two, only one is useful for QDI circuits. It represents a relatively safe, well-rounded XOR gate which is reasonably applicable to most scenarios. For this approach, both inputs must be differential. One of the differential inputs, b and _b , is passed through the XOR gate while the other, a and _a , drives the gates of the transistors. This means that the up-going transition on c acknowledges the up-going transitions on b , _b and the down-going transitions on a , _a .

@b ∧ ¬a ∨ @_b ∧ ¬_a → c↾
¬@b ∧ _a ∨ ¬@_b ∧ a → c⇂

Unfortunately, there are six undesirable states in this circuit. Four of these happen when a , _a are switching. When the C-element or latch driving a , _a switches, there is a transient state in which a and _a are the same value. When this happens, b and _b are connected to each other. If b and _b are in a differential state, then it will cause interference. Specifically, if both a and _a are 0, then the pull-up network of either b or _b will drive the other up through this pass gate, fighting the pull-down network of the other. If both a and _a are 1, then the pull-down network of either b or _b will drive the other down through this pass gate, fighting the pull-up network of the other. Table 1 elaborates the outcomes of each possible state for this XOR, and a drive-state sketch follows the table.

This creates a few constraints. First, a and _a cannot be driven by a delay insensitive encoding, because the neutral state of the delay insensitive encoding is not transient and would burn power through this short. This means that a and _a should be driven by either a latch or a C-element. Second, b and _b are ideally driven by a delay insensitive encoding that is in the neutral state when a and _a switch. This allows b and _b to be briefly connected without causing any interference. If both signals must be driven by latches, then the offending transistors of the transient interfering state should be sized like the weak keeper of a C-element, meaning that part of the gate is no longer allowed special treatment in the QDI delay model since the resulting delay is no longer negligible.

The other two undesirable states only happen when both a , _a and b , _b are driven by latches and both latches are switching. During these transient states, all of the input signals to this gate are the same value, leaving the output node c dynamic. These transient states may be covered by staticizing c using other signals from the handshake. However, this should be done only if the dynamic transient state is reachable in the handshake, the completion of the transition to the static value of c after the transient state is not guaranteed by the QDI delay model by the time it is used in the handshake, and staticizing c would not cause weak interference with another staticizer on b or _b .

_a a _b b   c        Drivers
1  0  1  0  0        b⇂
0  1  1  0  1        _b↾
1  0  0  1  1        b↾
0  1  0  1  0        _b⇂
1  1  1  1  weak 1   weak b↾ , weak _b↾
1  1  1  0  X-       b⇂ , weak _b↾
1  1  0  1  X-       weak b↾ , _b⇂
1  0  1  1  1        b↾
0  1  1  1  1        _b↾
0  0  1  1  1        b↾ , _b↾
1  1  0  0  0        b⇂ , _b⇂
0  0  1  0  X+       weak b⇂ , _b↾
0  0  0  1  X+       b↾ , weak _b⇂
1  0  0  0  0        b⇂
0  1  0  0  0        _b⇂
0  0  0  0  weak 0   weak b⇂ , weak _b⇂

Table 1. The state space of the dual differential pass transistor XOR. Rows are highlighted when a and _a are the same value.
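To make the drive strengths in Table 1 concrete, the following Python sketch abstracts each differential input as a transmission gate whose NMOS device passes a 0 strongly and a 1 weakly, and whose PMOS device does the reverse. It is a drive-state checker written for illustration only, and is no substitute for the analog verification described in Section 3.10.

def tg(inp, n_on, p_on):
    """Drivers contributed by one transmission gate as (value, strong)."""
    out = []
    if n_on: out.append((inp, inp == 0))  # NMOS passes 0 strongly
    if p_on: out.append((inp, inp == 1))  # PMOS passes 1 strongly
    return out

def pass_xor(a, _a, b, _b):
    # b passes to c when a is low (PMOS) or _a is high (NMOS);
    # _b passes to c when _a is low (PMOS) or a is high (NMOS).
    drv = tg(b, _a == 1, a == 0) + tg(_b, a == 1, _a == 0)
    vals = {v for v, _ in drv}
    if not vals:
        return 'floating'                 # c is left dynamic
    if len(vals) == 2:
        return 'X'                        # interference
    strong = any(s for _, s in drv)
    return str(vals.pop()) + ('' if strong else ' (weak)')

# Every properly differential input state drives a strong a XOR b.
for a in (0, 1):
    for b in (0, 1):
        assert pass_xor(a, 1 - a, b, 1 - b) == str(a ^ b)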
Pass Transistor AND

@a ∧ ¬_b → c↾
¬@a ∧ b → c⇂
_b → c⇂

The pass transistor AND gate presents an easy solution for gating signals in a handshake. This can be used to reduce transistor stack length in some cases and remove Half-Cycle Timing Assumptions in others. There are ultimately two undesirable states. First, when all of the input signals are 0, c is no longer driven. In a transient state, this is not much of a problem. However, if noise becomes an issue, then another signal from the handshake may be used to staticize c . Second, when all of the input signals are 1, there is weak interference as the CMOS pull-up network on c fights the pull-down network on a through the NMOS pass transistor. This state should ultimately be avoided. If that is not possible, then the NMOS pass transistor should be sized like a weak staticizer and the offending state must be transient.

a _b b   c        Drivers
0  1  0  0        GND⇂
1  1  0  0        GND⇂
0  0  1  0        a⇂
1  0  1  1        a↾
0  0  0  weak 0   weak a⇂
1  0  0  1        a↾
0  1  1  0        GND⇂ , a⇂
1  1  1  X-       GND⇂ , weak a↾

Table 2. The state space of the pass transistor AND.

Pass Transistor OR

@a ∧ ¬b → c↾
¬@a ∧ _b → c⇂
¬_b → c↾

The pass transistor OR gate is similar to the AND gate, with the undesirable states flipped.

a _b b   c        Drivers
0  1  0  0        a⇂
1  1  0  1        a↾
0  0  1  1        Vdd↾
1  0  1  1        Vdd↾
0  0  0  X+       Vdd↾ , weak a⇂
1  0  0  1        Vdd↾ , a↾
0  1  1  0        a⇂
1  1  1  weak 1   weak a↾

Table 3. The state space of the pass transistor OR.

These pass transistor gates can have a significant effect on performance and can be applied to many of the previously listed micro-architectural optimizations. This work provides three examples.

Storing a 1of2 Request (Non-inverted Singlerail Out)

First, if values on a stored 1of2 request are repeated often, then the latch will switch very rarely. In these cases, it is desirable to cut the corresponding transitions out of the handshake. This can be done with a pass transistor XOR gate, which uses the same number of transistors to provide a non-inverted output as the CMOS approach uses for an inverted output. At the start of the handshake, r0 and r1 are in the neutral state with both signals low. As long as v0 and v1 are in a differential state, o will be driven low. If r0 transitions high, then v1 will transition low followed by v0 high. This means that _v1 will transition high and _v0 low in any order, but preferring _v1 first. This will briefly put o in an undesirable state when both _v0 and _v1 are high. Upon resolution, o will transition high, signalling the completion of the store. Then, the 1of2 input r0 and r1 may transition back to a neutral state. This gives o an entire cycle to transition low again. That means that the NMOS transistors in the pass transistor XOR may be sized like a weak staticizer without affecting the performance of the handshake. Furthermore, both of the inverted variables _v0 and _v1 are fully acknowledged in all cases just by using o . Therefore, this approach does not rely on any special treatment in the QDI delay model.
v1 ∨ r1 → v0⇂
v0 ∨ r0 → v1⇂
¬v1 ∧ ¬r1 → v0↾
¬v0 ∧ ¬r0 → v1↾
v0 → _v0⇂
¬v0 → _v0↾
v1 → _v1⇂
¬v1 → _v1↾
@r1 ∧ ¬_v1 ∨ @r0 ∧ ¬_v0 → o↾
¬@r0 ∧ _v1 ∨ ¬@r1 ∧ _v0 → o⇂

Memory Gated Forward Drivers

When the forward drivers do not map well to the output requests, the general strategy is to use the memory to gate the forward drivers to generate the correct output requests. However, the CMOS approach to this problem either introduces transitions to the handshake or relies upon the Half-Cycle Timing Assumption. With a pass transistor AND gate, the gating rule on the internal node of the forward driver can be replaced with a pass transistor gate on the external node. This removes the extra transitions. Furthermore, because the passed signal is the output of a C-element, it is driven by an inverter. This is the optimal driving gate for the inputs to pass transistor logic because it allows the driving gate to be sized up quite significantly.

Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v0 → R1⇂
Re ∧ Lr → R2↾
(¬Re ∨ ¬v1) ∧ ¬Lr → R2⇂
@R2 ∧ ¬v0 → Rr↾
¬@R2 ∧ v1 → Rr⇂
v0 → Rr⇂
¬v1 ∨ ¬_R0 → v0↾
¬v0 ∨ ¬_R1 → v1↾
v1 ∧ _R0 → v0⇂
v0 ∧ _R1 → v1⇂
R0 ∨ R1 ∨ R2 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 → Le↾

Intermediate Forward Drivers

The same can be done for the intermediate forward drivers. Because the dualrail encoding ensures mutual exclusivity between the two requests, an XOR gate can be used in place of an OR gate. When both drivers R0 and R1 are low, both internal nodes are high. This means that both NMOS transistors are on, passing both drivers. When one of the forward drivers, say R0 , is active, its internal node _R0 transitions low. This disconnects the NMOS transistor to R1 and connects the PMOS transistor to R0 . Then, the transition on R0 is passed directly through the gate. In the reset phase, _R0 transitions high. There is a transient state in which R0 and R1 are connected to each other through the two NMOS transistors of the pass transistor XOR. This will start to weakly pull R1 high. However, because _R0 transitioned high, the inverter driving R0 will switch at the same time. Now, the inverters for both R0 and R1 will actively pull R0 low. Practically, this does not cause a glitch on R1 unless the load capacitance on Rr is extremely high and the drivers on R0 and R1 are extremely weak.

Re ∧ Lr0 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ Lr1 → R1↾
¬Re ∧ ¬Lr1 → R1⇂
@R0 ∧ ¬_R0 ∨ @R1 ∧ ¬_R1 → Rr↾
¬@R0 ∧ _R1 ∨ ¬@R1 ∧ _R0 → Rr⇂
R0 ∨ R1 → Le⇂
¬R0 ∧ ¬R1 → Le↾

These optimizations will be used throughout the circuits proposed in this work.

3.8 Example: Single Bit Register

As an example of some of these optimizations, this section walks through a step-by-step optimization of the single bit register provided in [68], which is the current standard for the design of such a unit.

Ce ∧ Cr ∧ Re ∧ v0 → Rr0↾
¬Ce ∧ ¬Re → Rr0⇂
Ce ∧ Cr ∧ Re ∧ v1 → Rr1↾
¬Ce ∧ ¬Re → Rr1⇂
Rr0 ∨ Rr1 → Rn⇂
¬Rr0 ∧ ¬Rr1 → Rn↾
Ce ∧ Cw ∧ Lr0 → v1⇂
Ce ∧ Cw ∧ Lr1 → v0⇂
¬Lr0 ∧ ¬v0 → v1↾
¬Lr1 ∧ ¬v1 → v0↾
Ce ∧ Cw ∧ (v0 ∧ Lr0 ∨ v1 ∧ Lr1) → Le⇂
¬Ce ∧ ¬Lr0 ∧ ¬Lr1 → Le↾
¬Le ∨ ¬Rn → Ce⇂
Le ∧ Rn → Ce↾

1. WCHB Single Bit Register: First, the circuit is converted to a standard WCHB reshuffling. This simplifies the handshake, making it easier to reason about.

Cw ∧ Ld0 → W0↾
¬Cw ∧ ¬Ld0 ∧ ¬v1 → W0⇂
Cw ∧ Ld1 → W1↾
¬Cw ∧ ¬Ld1 ∧ ¬v0 → W1⇂
Re ∧ Cr ∧ v0 → Rr0↾
¬Re ∧ ¬Cr → Rr0⇂
Re ∧ Cr ∧ v1 → Rr1↾
¬Re ∧ ¬Cr → Rr1⇂
W0 ∨ W1 ∨ Rr0 ∨ Rr1 → Ce⇂
¬W0 ∧ ¬W1 ∧ ¬Rr0 ∧ ¬Rr1 → Ce↾
W0 ∨ W1 → Le⇂
¬W0 ∧ ¬W1 → Le↾
¬v1 ∨ ¬_W0 → v0↾
¬v0 ∨ ¬_W1 → v1↾
v1 ∧ _W0 → v0⇂
v0 ∧ _W1 → v1⇂
2. Gated Forward Drivers: Noticing that the output requests from the read behave similarly, and that there is a stable way of differentiating them through v0 and v1 , pass transistor memory gated forward drivers are applied to remove one of the read C-elements.

@Rr ∧ ¬v0 → Rr1↾
¬@Rr ∧ v1 ∨ v0 → Rr1⇂
@Rr ∧ ¬v1 → Rr0↾
¬@Rr ∧ v0 ∨ v1 → Rr0⇂
Cw ∧ Ld0 → W0↾
¬Cw ∧ ¬Ld0 ∧ ¬v1 → W0⇂
Cw ∧ Ld1 → W1↾
¬Cw ∧ ¬Ld1 ∧ ¬v0 → W1⇂
Re ∧ Cr → Rr↾
¬Re ∧ ¬Cr → Rr⇂
¬v1 ∨ ¬_W0 → v0↾
¬v0 ∨ ¬_W1 → v1↾
v1 ∧ _W0 → v0⇂
v0 ∧ _W1 → v1⇂
W0 ∨ W1 ∨ Rr → Ce⇂
¬W0 ∧ ¬W1 ∧ ¬Rr → Ce↾
W0 ∨ W1 → Le⇂
¬W0 ∧ ¬W1 → Le↾

3. Stored Dualrail Request: Then, the write command is simplified using the stored dualrail request method. This removes another C-element. However, this optimization relies upon the special treatment for pass transistor logic in the QDI model for the pass transistor rules driving Rr0 and Rr1 low.

@Rr ∧ ¬v0 → Rr1↾
¬@Rr ∧ v1 ∨ v0 → Rr1⇂
@Rr ∧ ¬v1 → Rr0↾
¬@Rr ∧ v0 ∨ v1 → Rr0⇂
Cw ∧ (Lw0 ∧ v0 ∨ Lw1 ∧ v1) → Wr↾
¬Cw ∧ ¬Lw0 ∧ ¬Lw1 → Wr⇂
Re ∧ Cr → Rr↾
¬Re ∧ ¬Cr → Rr⇂
v1 ∨ Cw ∧ Lw1 → v0⇂
v0 ∨ Cw ∧ Lw0 → v1⇂
¬v1 ∧ (¬Cw ∨ ¬Lw1) → v0↾
¬v0 ∧ (¬Cw ∨ ¬Lw0) → v1↾
Wr → Le⇂
¬Wr → Le↾
Wr ∨ Rr → Ce⇂
¬Wr ∧ ¬Rr → Ce↾

4. Pass Transistor AND: The read command C-element can be completely removed, removing the pipeline stage as well. This makes the read pass through to the read channel.

@Cr ∧ ¬v0 → Rr1↾
¬@Cr ∧ v1 ∨ v0 → Rr1⇂
@Cr ∧ ¬v1 → Rr0↾
¬@Cr ∧ v0 ∨ v1 → Rr0⇂
v1 ∨ Cw ∧ Lw1 → v0⇂
v0 ∨ Cw ∧ Lw0 → v1⇂
¬v1 ∧ (¬Cw ∨ ¬Lw1) → v0↾
¬v0 ∧ (¬Cw ∨ ¬Lw0) → v1↾
Cw ∧ (Lw0 ∧ v0 ∨ Lw1 ∧ v1) → Wr↾
¬Cw ∧ ¬Lw0 ∧ ¬Lw1 → Wr⇂
@Re ∧ ¬Wr → Ce↾
¬@Re ∧ _Wr ∨ Wr → Ce⇂
Wr → Le⇂
¬Wr → Le↾

5. Combined L, C: From here, further simplification requires modification of the behavioral specification. If it is possible to guarantee mutual exclusion of the read and write in the environment, then the C and L channels can be combined into one, making L a 1of3 request. This entirely removes the need for a write C-element. Again, this relies upon the special treatment for the pass transistors driving Rr0 and Rr1 low.

@Rr ∧ ¬v0 → Rr1↾
¬@Rr ∧ v1 ∨ v0 → Rr1⇂
@Rr ∧ ¬v1 → Rr0↾
¬@Rr ∧ v0 ∨ v1 → Rr0⇂
Re ∧ Lr → Rr↾
¬Re ∧ ¬Lr → Rr⇂
v1 ∨ Lw1 → v0⇂
v0 ∨ Lw0 → v1⇂
¬v1 ∧ ¬Lw1 → v0↾
¬v0 ∧ ¬Lw0 → v1↾
v0 ∧ Lw0 ∨ v1 ∧ Lw1 ∨ Rr → Le⇂
(¬v0 ∨ ¬Lw0) ∧ (¬v1 ∨ ¬Lw1) ∧ ¬Rr → Le↾

6. Combined L, C, R Counterflow (Positive): If the environment can further handle the acknowledgement of the read, then it is possible to remove the read C-element by implementing a counterflow handshake. Now the input enable keeps the value of the internal memory when the enable is high. This allows its value to be sampled whenever it is needed without any request, as long as that sampling remains mutually exclusive from the write commands.

v1 ∨ Lw1 → v0⇂
v0 ∨ Lw0 → v1⇂
¬v1 ∧ ¬Lw1 → v0↾
¬v0 ∧ ¬Lw0 → v1↾
Lw0 ∨ v1 → Lr0⇂
Lw1 ∨ v0 → Lr1⇂
¬Lw0 ∧ ¬v1 → Lr0↾
¬Lw1 ∧ ¬v0 → Lr1↾

3.9 Integrated QDI/BD Circuits

Fig. 38 shows the typical diagram used to describe a 4-phase bundled data pipeline [63]. For each stage, there is a QDI control block and latched datapath logic. The enable signal from the QDI control is amplified and used to clock the datapath. Meanwhile, the input request is delayed to prevent the latches from closing before the input data resolves. The pipeline protocol as executed by the process in Fig. 38 is demonstrated in Fig. 39, assuming a WCHB reshuffling for the control. Signals driven by the environment on L are colored red while signals driven by the environment on R are colored blue.
The input enables for all channels are generally initialized high, signifying that the process is ready to receive data on those channels. This initialization, having been applied to channel L , opens the p-latches. Therefore, the bundled data is the first thing to arrive on L , followed by an upgoing transition on the request wire shortly thereafter. This request must be delayed such that the bundled data arriving on L has passed through the p-latch by the time the enable is lowered. This ensures that the incoming data is correctly latched so that the datapath logic may resolve and forward the result through R . Once the enable of L is lowered, the input data is allowed to change. This means that the delay assumption overlaps with half of the handshake protocol, allowing some of the protocol to be counted toward the delay assumption and reducing the length of the delay line.

Ultimately, this might be improved by overlapping the delay assumption with the whole handshake protocol using full-buffering. However, that would require a PCFB reshuffling in the QDI control and flip-flops instead of latches in the datapath. In the end, both the total throughput and the device count would double, leaving the throughput efficiency the same. The energy required per token would increase, so overall that approach would be less performant.

One should also note that the delay elements only need to delay the upgoing transition of the request wires. The downgoing transition only serves to reset the channel protocol and open the latches for the next computation. Therefore, one should make extensive use of asymmetric delay lines as seen in Fig. 40 to increase the overall throughput of the process.

Ultimately, while the pipeline demonstrated in Fig. 39 uses a dataless WCHB reshuffling, it seems like a reasonable jump that the QDI control could be any half-buffered process with or without data. In particular, there is a whole host of templates from [68] that can be put in that box. If the input request communicates data, then all of the request wires must have a delay element. If the request wires trigger cycles with different cycle times in the QDI control, the delay lines should be tuned to their associated cycle times. Alternatively, it is possible to clock the request lines as in [64]. However, there are a few important facets of this approach that require careful consideration. This includes communicating data between the control and datapath, dealing with conditional acknowledgement and conditional output requests, clocking internal memories in the datapath, and handling exchange channels. A sketch of the asymmetric delay line's behavior follows.

Fig. 38: A basic template for QDI control with bundled data.
Fig. 39: The channel protocol for the input and output channels of a pipeline stage evaluated over two packets of data.
Fig. 40: Circuit diagram for an asymmetric delay line. The upgoing transition is delayed by six inverters while the downgoing is delayed by only two.
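The effect of the asymmetric delay line in Fig. 40 can be illustrated with a few lines of Python. This is a hypothetical event-level abstraction with arbitrary time units, not a model of any particular cell.

def asymmetric_delay(edges, t_inv=1.0, up_stages=6, down_stages=2):
    """edges: list of (time, level) transitions on the request wire.
    An upgoing edge sees six inverter delays, a downgoing edge two."""
    return [(t + t_inv * (up_stages if level else down_stages), level)
            for t, level in edges]

req = [(0.0, 1), (10.0, 0), (12.0, 1)]
print(asymmetric_delay(req))  # [(6.0, 1), (12.0, 0), (18.0, 1)]

The upgoing edges still carry the full matching delay for the bundled data, but the reset half of the handshake completes four units earlier than it would through a symmetric six-inverter line, directly shortening the cycle.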
QDI Input Requests to Datapath

Communicating input request signals from the QDI control to the bundled datapath is simple enough. Unfortunately, those signals transition to neutral before the output enable is lowered, potentially allowing an incorrect result to propagate out through the datapath before the latches in the next stage are closed. So, a separate signal must be generated that is held stable at least until the output enable is lowered. This can be easily achieved with an SR latch as shown in Fig. 41. The delay insensitive data is pulled from the request wires before the delay elements and fed through an SR latch. The result is guaranteed to be stable by the time the request is received by the QDI control due to the delay lines, and it remains stable until the next request. Therefore it can be used in both the datapath logic and the QDI control throughout the whole cycle. Unfortunately, anything beyond a two way latch becomes prohibitively expensive. For example, a 3-way latch requires 18 transistors compared to the 8 required for a 2-way latch. So, to get the best performance from this strategy, any delay insensitive encoding should be converted to a collection of 1of2 codes before feeding it into the SR latches. A behavioral sketch of this hold behavior follows.

Fig. 41: Communicating a QDI request signal to the bundled datapath using an SR latch.
Fig. 42: Communicating a QDI internal memory to the bundled datapath using an n-latch.
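The hold behavior of the SR latch in Fig. 41 can be sketched in a few lines of Python. This is an illustrative abstraction of the rail behavior only; the function name is invented for this example.

def sr_hold(rail_events):
    """rail_events: sequence of (r0, r1) request rail states, tapped
    before the delay line. The latch sets on r1, resets on r0, and
    holds its value through the neutral (0, 0) phase of the handshake."""
    q, trace = 0, []
    for r0, r1 in rail_events:
        assert not (r0 and r1), "1of2 rails are mutually exclusive"
        if r1: q = 1
        elif r0: q = 0
        trace.append(q)
    return trace

# The datapath sees a stable value across both neutral phases.
assert sr_hold([(0, 1), (0, 0), (1, 0), (0, 0)]) == [1, 1, 0, 0]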
QDI Internal Memory to Datapath

When communicating the value of an internal memory of the QDI control process to the datapath, one must ensure that any transition executed on that memory cell waits for the output enable to lower. This prevents the transition from incorrectly propagating through the datapath into the next process, potentially causing a glitch. Luckily, this is often the default implementation of an internal memory unit [68]. However, if this is not possible, then the signal sent to the datapath must be protected with a latch that remains closed while the output enable is high, as in Fig. 42.

Datapath to QDI Forward Drivers

In order to use a datapath signal D in the pull-down networks of the forward drivers, the timing assumption must be tightened. Previously, the output data was assumed to be stable by the time the output enable was lowered. However, that only happens after the output requests have been sent. Any signal used at this stage of the handshake must be stable by the time the request passes through the delay lines. Ultimately, this just means that the delay lines must be lengthened, as they no longer overlap the handshake protocol.

Re ∧ Lr ∧ D0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ D1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
Rr0 ∨ Rr1 → Le⇂
¬Rr0 ∧ ¬Rr1 → Le↾

Datapath to QDI Internal Memory

Using a datapath value D to set the value of an internal memory unit depends upon the implementation of the internal memory. If transitions on the internal memory are gated by the downgoing transition of the output enable Re , then D may be used directly. Otherwise, as below, the delay assumption must be adjusted such that the datapath has stabilized before the internal memory is allowed to transition.

Re ∧ Lr ∧ v0 → R0↾
¬Re ∧ ¬Lr ∧ (¬D1 ∧ ¬v0) → R0⇂
Re ∧ Lr ∧ v1 → R1↾
¬Re ∧ ¬Lr ∧ (¬D0 ∧ ¬v1) → R1⇂
¬v1 ∨ ¬_R1 ∧ ¬D1 → v0↾
¬v0 ∨ ¬_R0 ∧ ¬D0 → v1↾
v1 ∧ (_R1 ∨ D1) → v0⇂
v0 ∧ (_R0 ∨ D0) → v1⇂
R0 ∨ R1 → Le⇂
¬R0 ∧ ¬R1 → Le↾

QDI/Datapath Cycle

If the process communicates data both from the QDI control to the datapath and from the datapath to the QDI control, it is possible to introduce a cycle. Specifically, if a QDI logic block transitions and this transition is passed into the datapath, then this can cause a transition or even a glitch to propagate back out of the datapath and into the QDI control. There are two ways to mitigate this problem. First, one could redesign the datapath to break the cycle. If that cannot be done, then an extra latch can be introduced to effectively break the cycle, as in Fig. 43. If the data sent to the datapath comes from the internal memory and the data received from the datapath is used in the forward drivers, then the p-latch is unnecessary, since the n-latch guarding the datapath ensures the output enable is low before causing new transitions on D . This disables the forward drivers and protects them from any glitches in the datapath that this cycle might cause.

Datapath Memory

An internal memory in the datapath can make things a bit more complex, particularly because it cannot always be clocked in lock-step with another channel. The basic template requires three layers of latches, as in Fig. 44 (left). Assume for a moment that the internal memory were instead implemented using a p-flop, removing the first layer of latches. If the memory were clocked some delay after the input latches due to logic, then there would be a short time during the reset phase of the handshake in which the input latch and the n-latch of the internal memory are both open. This would allow the new value from the input to race ahead and erase the value that should have been stored from the last operation.

Of course, the latches implementing the internal memory can be pushed around the datapath so long as doing so does not affect the output value on Rd . For example, the first p-latch layer could be pushed back to just after the input latches and just after the second p-latch layer of the internal memory, as in Fig. 44 (right). Because the first and second p-latch layers of the internal memory share the same clock, they can be merged together. If there is no clocking logic, meaning the delay between the input latch clock and the memory clock is zero, then the input latch and the relocated first p-latch layer of the memory can also be merged. This strategy of moving the latches around must include all signals that lead into the datapath internal memory, including signals from the QDI logic.

Then, the clocking logic can also be pushed around. Suppose the clocking logic comes from two channels that are conditionally acknowledged, as in Fig. 45 (left). The AND gate can be pushed into the latches, then the optimizations above can be performed to remove the unnecessary latching layer, as in Fig. 45 (right). This requires the special multi-clock latches shown in Fig. 46. Alternatively, the AND gate can be pushed into the QDI control using C-elements to ensure it occurs before the input enables. However, this has fairly undesirable performance.

Fig. 43: Breaking a communication cycle with a p-latch on the output request.
Fig. 44: Clocking a memory internal to the datapath.
Fig. 45: Clocking a memory internal to the datapath.
Fig. 46: The p-and-p latch (left) passes the value when both clocks are high, the n-or-n latch (right) passes the value when either clock is low.

QDI and Datapath Memories

There are two ways the internal memory of the QDI control can be used to set the datapath memory. In the standard approach, the new value of the QDI internal memory will set the value of the datapath memory at the end of the next cycle, as seen in Fig. 47 (left). This matches the behavior of the QDI internal memory on Rd . Notice that the n-latch with Le can be merged into the QDI internal memory and that the first layer of p-latches implementing the datapath memory can be pushed around so that it takes the spot of the n-latch. With these optimizations, the cost of this approach is fairly low.
In the alternative approach, the new value of the QDI internal memory will set the value of the datapath memory at the end of the current cycle, as in Fig. 47 (right). In this case, the first layer of p-latches implementing the datapath memory has been optimized out for clarity. The n-latch guarding the QDI internal memory has been removed as well. Removing these two latches shifts the set time by a full cycle.

Exchange Channels

Exchange channels pose both unique challenges and opportunities. For exchange channels, all of the enable signals also encode data. This means there is no longer a convenient clocking signal for the datapath. Furthermore, the latching circuitry on both Lid and Rid must be closed before either request Lor or Ror is sent out. This necessitates the creation of a separate set of clocking signals, as seen in Fig. 48. Re-examining the negative exchange channel implementation, the signals driven by the forward drivers R0 , R1 , R2 , and R3 are the only ones along the critical path of both output requests. Unfortunately, there are four of them, so this will require the special multi-clock latches shown in Fig. 49. The positive exchange channel implementation is similar. In the rules below, the drivers marked "amplify" are the ones that must be amplified to clock the datapath latches.

Re0 ∧ v0 ∧ Lr0 → R0↾ // amplify
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr0 → R0⇂ // amplify
Re1 ∧ v1 ∧ Lr0 → R1↾ // amplify
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr0 → R1⇂ // amplify
Re0 ∧ v0 ∧ Lr1 → R2↾ // amplify
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr1 → R2⇂ // amplify
Re1 ∧ v1 ∧ Lr1 → R3↾ // amplify
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr1 → R3⇂ // amplify
¬v1 ∨ ¬Re0 → v0↾
¬v0 ∨ ¬Re1 → v1↾
v1 ∧ Re0 → v0⇂
v0 ∧ Re1 → v1⇂
R0 ∨ R1 → Rr0↾
¬R0 ∧ ¬R1 → Rr0⇂
R2 ∨ R3 → Rr1↾
¬R2 ∧ ¬R3 → Rr1⇂
R0 ∨ R2 → Le0⇂
¬R0 ∧ ¬R2 → Le0↾
R1 ∨ R3 → Le1⇂
¬R1 ∧ ¬R3 → Le1↾

Exchange Channel with Internal Memory

Clocking a memory with an exchange channel provides a unique optimization. Much like the earlier datapath memory implementation, the latches can be pushed around the datapath. In this case, the latches can be merged entirely into the input latches with a little extra logic, as in Fig. 50.

3.10 Toolset and Circuit Evaluation

All of the circuits in this work are developed and evaluated using a set of in-house tools found in [89]. The production rule specifications are verified with a switch-level simulation which identifies instability, interference, and deadlock. These specifications are then automatically translated into netlists, and their analog properties are verified using Synopsys's combined simulator: VCS, a Verilog simulator, simulates the testbench while HSIM, a fast SPICE simulator, reports power and performance metrics. The CHP was simulated using C++ to generate inject and expect values, which were tied into both the switch level and analog simulations using Python. This facilitated verification of circuit and behavioral correctness by checking the behavioral, digital, and analog simulations against each other. A 1V 28nm process was simulated to evaluate frequency and energy per operation. Latency is measured from the 0.5 V level of the input to the 0.5 V level of the output.

Fig. 47: QDI internal memory to datapath memory communication.
Fig. 48: Clocking an exchange channel.
Fig. 49: The p-or-p latch (left) passes the value when either clock is high, the n-and-n latch (right) passes the value when both clocks are low.
Fig. 50: The basic template for datapath memory with exchange channels (left) vs after all of the previously discussed memory optimizations (right).
To get more accurate results, each of the digitally driven channels is protected with a FIFO of three WCHBs isolated to a different power source. All circuits are sized minimally with a pn-ratio of 2. The simulations do not include extracted parasitics, but a 1 fF capacitor is added to every gate output. All implementations are explicit about their use of the half-cycle timing assumption (HCTA) [72], and use weak feedback for C-elements. Circuitry necessary for reset is not included in any of the descriptions.

There is no particularly straightforward way to evaluate self-timed circuits. Each circuit has a set of conditions, executing one per cycle. Some conditions might not toggle the input or output channels. Some might simply act as a token source or sink with dramatically higher frequency. So to get a reasonable picture of a circuit's overall performance, one must determine the frequency and energy of each condition, determine how often each condition is likely to execute, and use those two measures to compute average performance metrics per token. In the event that there is not enough data to determine how often a condition might execute, individual numbers will be reported and a uniform random distribution will be used to determine relative overall performance; a small sketch of this averaging follows at the end of this section. Evaluating adaptive digit-serial circuits adds further difficulty. To be able to compare against their bit-parallel counterparts, it is necessary to determine the overall performance of the circuit per digit-stream by using the data from Fig. 32.

Overall, four numbers will be reported: forward latency, operation frequency, energy per operation, and transistor count. The forward latency is informative of the execution speed of sequential operations. The operation frequency and transistor count are informative of the total throughput of parallel operations. And minimizing the energy per operation reduces the power-wall constraint seen by other processors.
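The per-token averaging described above can be written down directly. The Python sketch below is illustrative only; the condition names and numbers are invented, and the uniform fallback mirrors the convention stated above.

def per_token_metrics(conditions, mix=None):
    """conditions: {name: (frequency, energy)} measured per condition.
    mix: {name: probability of executing that condition per token}."""
    if mix is None:  # no workload data: assume a uniform distribution
        mix = {c: 1.0 / len(conditions) for c in conditions}
    avg_period = sum(mix[c] / conditions[c][0] for c in conditions)
    avg_energy = sum(mix[c] * conditions[c][1] for c in conditions)
    return 1.0 / avg_period, avg_energy

# Hypothetical measurements: (MHz, pJ) per condition.
freq, energy = per_token_metrics({'inc': (950.0, 1.2), 'clear': (700.0, 2.1)})
print(round(freq, 1), energy)  # ~806.1 MHz average, 1.65 pJ per token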
CHAPTER 4
COUNTERS

Counters are fundamental for tracking the state of a digit-serial operation. In particular, they are absolutely necessary for the implementation of digit-serial shifting, rotation, and sign compression operations. In general, they are also applicable to a large array of other applications including the control logic for power gating, clock gating, and pipeline management [186][198][232]; timers, performance counters, and frequency dividers [87]; and iterative arithmetic circuits [88]. This versatility leads to a wide variety of functionality. Ultimately, there are five basic input commands: increment, decrement, clear, read, and write; and six output responses: zero, full, less-than, greater-than, equal-to, and no-event. Counters are named using the first letter of each command they support followed by the first letter of each response they support, so idzn is an increment/decrement counter with zero/not-zero detection. While clocked counters have been thoroughly explored, such as the increment/decrement counter in [92], the increment/write in [93], and the decrement/write in [94][95], the same cannot be said of QDI counters. Aside from my previous work in [75], a constant response time decrementing counter with zero detection was implemented in [96], an increment/decrement counter with zero/full detection in [97], and an increment/decrement counter with constant-time zero detection in [98].

4.1 Behavioral Specification

This work iterates on the designs found in [75], deriving significant improvements in area, energy, and throughput through five basic optimizations.

First, the input enable previously communicated the counter status during the reset phase of the handshake using positive exchange channels. This is extremely helpful when using the counter, but it also encourages long transistor stacks in the reset rules of the input enable. This forced the use of a validity tree from the output request drivers to reduce the transistor stack length of the input enable rules, introducing two extra transitions in the handshake that reduced overall frequency. Furthermore, it required the slowest internal memory implementation, gating the internal memory against the reset of the input request wires, again reducing the operating frequency. Optimizing this requires that the counter status instead be communicated when the input enable is lowered, using negative exchange channels. This makes it more difficult for a user to interface with the counter, because the interfacing processes can no longer directly condition their increment and decrement requests on the status of the counter. Instead, they must keep their own internal memory to track the counter status received between commands. However, it also removes the long transistor stacks required by the previous approach. This removes the validity tree and allows for the fastest internal memory approach, dramatically increasing operating frequency.

Second, a pass-transistor stored 1of2 request can combine the output acknowledge and the acknowledgement of transitions in the counter status memory unit into a single signal. This reduces the complexity of the output acknowledge and therefore the forward drivers.

Third, the savings from the first and second optimizations make it possible to combine 2 bits of the counter into a single process. The complexity per bit of the output request lines and internal memory stays the same. However, the complexity per bit of the counter's status circuitry is cut in half. Furthermore, command and status data are communicated half as often.

Fourth, a Gray code can be used to implement the 2-bit counter unit. This reduces the complexity of the internal memory, ensuring that an increment or decrement command switches only one of the two latches at a time. Because one of the latches remains stable, it can be used to control memory gated forward drivers, reducing the total number of forward drivers.

Fifth, pass transistor XOR and XNOR gates can be used between the two internal memory units to reduce the load capacitance from the forward drivers. This extracts most of the complexity out of the underlying handshake and into the pass transistor logic.

To keep the circuit as simple as possible, the specification assumes the counter will not underflow or overflow. It starts at zero; then, for every iteration, a command is received from Lc , the value vc is either increased or decreased by one depending upon the command, and finally the status of the counter is sent across Lz . Notice that this specification differs from the specification in [75], since data is not sent on the acknowledgement until after the command. A behavioral sketch follows the specification.

vc:=0;
∗[Lc?lc;
  [ lc=inc → vc:=vc+1
  ▯ lc=dec → vc:=vc-1
  ];
  Lz!(vc=0)
]
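A direct behavioral rendering of this specification, useful as a reference model when checking the production rules, might look like the following Python sketch (illustrative, not part of the toolset).

def idzn_counter(commands):
    """Token-level model of vc:=0; *[Lc?lc; vc +/- 1; Lz!(vc=0)].
    commands: iterable of 'inc'/'dec'; yields the Lz status per command."""
    vc = 0
    for lc in commands:
        vc += 1 if lc == 'inc' else -1
        yield vc == 0

assert list(idzn_counter(['inc', 'inc', 'dec', 'dec'])) == [False, False, False, True]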
Deriving a process for a 2-bit counter unit is done by separating the least significant digit, v0 , of the counter from the remaining digits, vc . This requires carry circuitry for the increment and decrement from the first digit to the remaining digits. If Lc is increment and v0 is 3 , or Lc is decrement and v0 is 0 , the increment or decrement command must be carried to the remaining digits. Otherwise, the remaining digits are left unchanged. Either way, the value of v0 changes by one.

v0:=0, vc:=0;
∗[Lc?lc;
  [ lc=inc → [ v0=0 → v0:=1 ▯ v0=1 → v0:=2 ▯ v0=2 → v0:=3 ▯ v0=3 → v0:=0, vc:=vc+1 ]
  ▯ lc=dec → [ v0=0 → v0:=3, vc:=vc-1 ▯ v0=1 → v0:=0 ▯ v0=2 → v0:=1 ▯ v0=3 → v0:=2 ]
  ];
  Lz!(v0=0 ∧ vc=0)
]

Then, two new channels are introduced into the specification: Rc communicates the carried command (inc, dec) and Rz responds with the resulting status (zero, not zero). This removes all direct data dependencies between v0 and vc , preparing the specification for projection [80].

v0:=0, vc:=0, vz:=1;
∗[Lc?lc;
  [ lc=inc → [ v0=0 → v0:=1 ▯ v0=1 → v0:=2 ▯ v0=2 → v0:=3 ▯ v0=3 → v0:=0; Rc!inc; Rz?vz ∥ Rc?rc; vc:=vc+1; Rz!vc=0 ]
  ▯ lc=dec → [ v0=0 → v0:=3; Rc!dec; Rz?vz ∥ Rc?rc; vc:=vc-1; Rz!vc=0 ▯ v0=1 → v0:=0 ▯ v0=2 → v0:=1 ▯ v0=3 → v0:=2 ]
  ];
  Lz!(v0=0 ∧ vz=1)
]

In the next step, the least significant digit is projected into a separate process with variables v0, lc, vz , leaving the remaining digits implemented by the variables vc, rc .

v0:=0, vz:=1;
∗[Lc?lc;
  [ lc=inc → [ v0=0 → v0:=1 ▯ v0=1 → v0:=2 ▯ v0=2 → v0:=3 ▯ v0=3 → v0:=0; Rc!inc; Rz?vz ]
  ▯ lc=dec → [ v0=0 → v0:=3; Rc!dec; Rz?vz ▯ v0=1 → v0:=0 ▯ v0=2 → v0:=1 ▯ v0=3 → v0:=2 ]
  ];
  Lz!(v0=0 ∧ vz=1)
] ∥
vc:=0;
∗[Rc?rc;
  [ rc=inc → vc:=vc+1; Rz!vc=0
  ▯ rc=dec → vc:=vc-1; Rz!vc=0
  ]
]

The specification for the remaining bits is left unaffected, and each digit has four channels: Lc and Lz for the command and counter status, and Rc and Rz to carry the command to and receive the status from the remaining digits. This sequence of transformations can be executed recursively on the remaining bits to formulate an N-digit counter. Finally, the specification is flattened into DSA format.

v:=0, vz:=1;
∗[Lc?lc;
  [ lc=inc ∧ v=0 → v:=1
  ▯ lc=inc ∧ v=1 → v:=2
  ▯ lc=inc ∧ v=2 → v:=3
  ▯ lc=inc ∧ v=3 → v:=0; Rc!inc; Rz?vz
  ▯ lc=dec ∧ v=0 → v:=3; Rc!dec; Rz?vz
  ▯ lc=dec ∧ v=1 → v:=0
  ▯ lc=dec ∧ v=2 → v:=1
  ▯ lc=dec ∧ v=3 → v:=2
  ];
  Lz!(v=0 ∧ vz=1)
]

Because Lc and Lz , and Rc and Rz , always communicate together, they can be merged into exchange channels L and R with the command encoded in the request and the zero status encoded in the acknowledge, as shown in Fig. 51. However, the counter must be of finite size, meaning it will need to be capped off. This is done with a circuit attached to the most significant digit that sinks the command on Lc and always returns true on Lz : ∗[ Lc?; Lz!true ] . This adds an overflow condition to the previous counter specification.

Fig. 51: The idzn counter decomposed into processes.

vc:=0;
∗[Lc?lc;
  [ lc=inc → vc:=vc+1
  ▯ lc=dec → vc:=vc-1
  ];
  [ vc ≥ pow(digit, units) → vc:=vc-pow(digit, units)
  ▯ vc < 0 → vc:=vc+pow(digit, units)
  ▯ else → skip
  ];
  Lz!(vc=0)
]

At the moment, if the value of the counter is pow(digit, units)-1 , where digit is the number of values each counter unit implements and units is the total number of counter units in the counter, then an increment command and the resulting status signal would have to propagate across the full length of the counter. This means that the zero detection circuitry takes linear time with respect to the number of bits in the worst case. A constant time zero detection can be implemented by ignoring this overflow case.
Instead of sending on Lz after all of the computation and carries have been performed, the counter status can be sent on Lz before the computation. This requires that the counter status be computed with the current command in mind, so the check must be changed from v=0 ∧ vz=1 to v=1 ∧ lc=dec ∧ vz=1 . This ignores the increment command, and therefore the overflow case, altogether. A behavioral cross-check of this decomposition follows.

v:=0, vz:=1;
∗[Lc?lc; Lz!(v=1 ∧ lc=dec ∧ vz=1);
  [ lc=inc ∧ v=0 → v:=1
  ▯ lc=inc ∧ v=1 → v:=2
  ▯ lc=inc ∧ v=2 → v:=3
  ▯ lc=inc ∧ v=3 → v:=0; Rc!inc; Rz?vz
  ▯ lc=dec ∧ v=0 → v:=3; Rc!dec; Rz?vz
  ▯ lc=dec ∧ v=1 → v:=0
  ▯ lc=dec ∧ v=2 → v:=1
  ▯ lc=dec ∧ v=3 → v:=2
  ]
]

This increases the maximum value the finite-length counter can store before it overflows to pow(digit, units-1)*(digit+1) .

[ vc ≥ pow(digit, units-1)*(digit+1) → vc:=vc-pow(digit, units)
▯ vc < 0 → vc:=vc+pow(digit, units)
▯ else → skip
]
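The decomposition and the early status check can be cross-checked against the flat specification with a short recursive model. The Python sketch below is an illustrative abstraction written for this discussion; it assumes the counter is never driven below zero or past the cap, matching the specification.

def step(v, vz, lc, i=0):
    """Apply one inc/dec at digit i; returns the early status on Lz,
    computed from the stale vz before any carry completes."""
    if i == len(v):                  # cap process: *[ Lc?; Lz!true ]
        return True
    status = (v[i] == 1 and lc == 'dec' and vz[i])
    carry = (lc == 'inc' and v[i] == 3) or (lc == 'dec' and v[i] == 0)
    v[i] = (v[i] + (1 if lc == 'inc' else -1)) % 4
    if carry:                        # Rc!lc; Rz?vz on the carried digit
        vz[i] = step(v, vz, lc, i + 1)
    return status

import random
random.seed(1)
v, vz, flat = [0] * 4, [True] * 4, 0
for _ in range(200):                 # reflect at zero to avoid underflow
    lc = 'inc' if flat == 0 else random.choice(['inc', 'dec'])
    flat += 1 if lc == 'inc' else -1
    assert step(v, vz, lc) == (flat == 0)   # matches the flat counter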
4.2 Increment and Decrement

Due to the first optimization, both Rz and Rn are initialized high on reset, with one to be lowered as acknowledgement. This acknowledgement is recorded in the internal memory implemented by vz and vn . For example, if Rz is lowered to acknowledge the output request, vz will transition high followed by vn low.

¬vn ∨ ¬Rz → vz↾
¬vz ∨ ¬Rn → vn↾
vn ∧ Rz → vz⇂
vz ∧ Rn → vn⇂

Following the second optimization using a stored dualrail request, the acknowledgement of the internal memory can be simplified by using a simple XOR gate. Once the internal memory matches Rz and Rn , Re transitions low. When both Rz and Rn transition high, Re follows.

Rn ∧ vn ∨ Rz ∧ vz → _Re⇂
¬Rn ∧ ¬vz ∨ ¬Rz ∧ ¬vn → _Re↾
¬_Re → Re↾
_Re → Re⇂

However, the majority of acknowledgements do not ultimately change the value of the internal memory, and the above solution introduces four gate delays to every output request handshake. Alternatively, this feature can be implemented using the pass transistor variant of the stored dualrail request, introducing gate delays only when the internal memory switches.

vz → _vz⇂
¬vz → _vz↾
vn → _vn⇂
¬vn → _vn↾
@Rn ∧ ¬_vn ∨ @Rz ∧ ¬_vz → Re↾
¬@Rn ∧ _vz ∨ ¬@Rz ∧ _vn → Re⇂

The transitions on vz and vn in the given example cause _vz to transition low and _vn high in either order. Whichever goes first, there will be a short time in which the drivers for Rz and Rn interfere with each other through the pass transistors. If _vz⇂ goes first, then the pull-up stack for Rn will push against the pull-down stack for Rz . If _vn↾ goes first, then the pull-down stack for Rz will push against the pull-up stack for Rn . This means that the drivers for Rz and Rn must be made strong where possible and the pass transistor XOR must be minimally sized. This allows both Rz and Rn to maintain their values at the expense of power. Ultimately, the gate driving Re is a standard pass-transistor multiplexer, and this transient interference is typical behavior for such gates. Either way, Re will monotonically transition low, acknowledging the internal memory, the output acknowledge, and either _vn↾ or _vz↾ . Once the output request is lowered, the output acknowledge will go high, monotonically driving Re↾ and acknowledging either _vz⇂ or _vn⇂ .

Next, the Gray code counter value must be implemented with four signals: v00 , v01 , v10 , and v11 . With this encoding, incrementing or decrementing the counter only changes one of the two memory values. For example, incrementing from 0 to 1 lowers v00 and raises v01 .

0. v10 , v00
1. v10 , v01
2. v11 , v01
3. v11 , v00

From here, there are many approaches for grouping those state changes in the forward drivers, each with different properties. However, one is ultimately superior in every metric through the use of pass transistor XORs. This approach groups the alternating increments together and the alternating decrements together as follows; a sketch verifying the one-latch-per-step property appears after this list.

• R00 drives v00 → v01 for 0 → 1 or v01 → v00 for 2 → 3
• R01 drives v10 → v11 for 1 → 2 or v11 → v10 for 3 → 0
• R11 drives v00 → v01 for 3 → 2 or v01 → v00 for 1 → 0
• R10 drives v10 → v11 for 0 → 3 or v11 → v10 for 2 → 1
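The claim that each increment or decrement switches exactly one of the two latch pairs can be checked mechanically. The following Python sketch is illustrative; rails are listed as (v10, v11, v00, v01).

GRAY = [(1, 0, 1, 0),   # 0: v10, v00
        (1, 0, 0, 1),   # 1: v10, v01
        (0, 1, 0, 1),   # 2: v11, v01
        (0, 1, 1, 0)]   # 3: v11, v00

for i in range(4):
    a, b = GRAY[i], GRAY[(i + 1) % 4]
    hi_switched = (a[0], a[1]) != (b[0], b[1])   # v10/v11 latch
    lo_switched = (a[2], a[3]) != (b[2], b[3])   # v00/v01 latch
    assert hi_switched != lo_switched  # exactly one latch per step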
R01 's increment from 3 → 0 and R10 's decrement from 0 → 3 carry increments and decrements to the next counter unit. For both, v00 will remain high and v01 low, allowing for the use of pass transistor memory gated forward drivers.

@R01 ∧ ¬v01 → Ri↾
¬@R01 ∧ v00 → Ri⇂
v01 → Ri⇂
@R10 ∧ ¬v01 → Rd↾
¬@R10 ∧ v00 → Rd⇂
v01 → Rd⇂

Examining R00 , state 0 is encoded by v00 ∧ v10 and state 2 by v01 ∧ v11 . So the rule for R00 will look something like (v00 ∧ v10 ∨ v01 ∧ v11) ∧ Li , which is just the XNOR of the internal state. Similarly, R01 will look something like (v01 ∧ v10 ∨ v00 ∧ v11) ∧ Li , which is just the XOR of the internal state. R11 and R10 follow the same pattern, with the transitions on the internal state toggling between XOR and XNOR. Therefore, a pass transistor XNOR and XOR can be used to simplify these rules.

@v01 ∧ ¬v11 ∨ @v00 ∧ ¬v10 → v1↾
¬@v00 ∧ v11 ∨ ¬@v01 ∧ v10 → v1⇂
@v00 ∧ ¬v11 ∨ @v01 ∧ ¬v10 → v0↾
¬@v01 ∧ v11 ∨ ¬@v00 ∧ v10 → v0⇂

This simplifies the forward drivers dramatically, cutting the capacitive load of the input requests by about half. Because R01 and R10 are only forwarded when v00 is high, the check for Re can be skipped when v01 is high. The input acknowledgement gates R11 to check for the zero condition.

v0 ∧ _R01 ∧ Li → R00↾
v1 ∧ _R00 ∧ (v01 ∨ Re) ∧ Li → R01↾
v0 ∧ _R11 ∧ (v01 ∨ Re) ∧ Ld → R10↾
v1 ∧ _R10 ∧ Ld → R11↾
v10 ∧ vz ∧ R11 → Lz⇂
(v11 ∨ vn) ∧ R11 ∨ R01 ∨ R10 ∨ R00 → Ln⇂

However, care must be taken when implementing the internal memory with regard to v0 and v1 . Specifically, the reset phase of the forward drivers must acknowledge the downgoing transitions of those two variables. Unfortunately, the downgoing transitions of v0 and v1 acknowledge the upgoing transitions of v10, v11 and the downgoing transitions of v00, v01 . This means that v00, v01 must form an n-latch while v10, v11 must form a p-latch. Specifically, R00 and R11 must set the state with v00↾; v01⇂ or v01↾; v00⇂ . Meanwhile, R01 and R10 must set v10⇂; v11↾ or v11⇂; v10↾ . Following the previously defined encoding gives these rules for the internal state. Keep in mind that the handshake along v00, v01 is three transitions from the input request to the reset phase, while along v10, v11 it is five. The increased length of the second ultimately matches the transition count for a standard WCHB buffer, so this only affects energy.

¬v01 ∨ ¬v11 ∧ ¬_R11 ∨ ¬v10 ∧ ¬_R00 → v00↾
¬v00 ∨ ¬v11 ∧ ¬_R00 ∨ ¬v10 ∧ ¬_R11 → v01↾
v01 ∧ (v11 ∨ _R11) ∧ (v10 ∨ _R00) → v00⇂
v00 ∧ (v11 ∨ _R00) ∧ (v10 ∨ _R11) → v01⇂
¬v11 ∧ (¬v00 ∨ ¬R10) ∧ (¬v01 ∨ ¬R01) → v10↾
¬v10 ∧ (¬v00 ∨ ¬R01) ∧ (¬v01 ∨ ¬R10) → v11↾
v11 ∨ v00 ∧ R10 ∨ v01 ∧ R01 → v10⇂
v10 ∨ v00 ∧ R01 ∨ v01 ∧ R10 → v11⇂

Noticing that all of those rules contain XORs or XNORs between the internal state and the forward drivers, pass transistor XORs can be used to help reduce gate capacitance once again. This changes the rules for the internal state to the following.

@_R11 ∧ ¬v11 ∨ @_R00 ∧ ¬v10 → x00↾
¬@_R00 ∧ v11 ∨ ¬@_R11 ∧ v10 → x00⇂
@_R00 ∧ ¬v11 ∨ @_R11 ∧ ¬v10 → x01↾
¬@_R11 ∧ v11 ∨ ¬@_R00 ∧ v10 → x01⇂
@R10 ∧ ¬v01 ∨ @R01 ∧ ¬v00 → x10↾
¬@R01 ∧ v01 ∨ ¬@R10 ∧ v00 → x10⇂
@R01 ∧ ¬v01 ∨ @R10 ∧ ¬v00 → x11↾
¬@R10 ∧ v01 ∨ ¬@R01 ∧ v00 → x11⇂
¬v01 ∨ ¬x00 → v00↾
¬v00 ∨ ¬x01 → v01↾
v01 ∧ x00 → v00⇂
v00 ∧ x01 → v01⇂
¬v11 ∧ ¬x10 → v10↾
¬v10 ∧ ¬x11 → v11↾
v11 ∨ x10 → v10⇂
v10 ∨ x11 → v11⇂

There are two subtle features of this approach. First, when the circuit is idle and all of the forward drivers are low, it will drive x00 and x01 high, keeping v00 and v01 stable, and it will drive x10 and x11 low, keeping v10 and v11 stable. Second, if x00 transitions low and x01 high, then x00⇂ will be acknowledged by v00↾ . Immediately following that, x01↾ will be acknowledged by v01⇂ . Therefore, all of the transitions on these new nodes are acknowledged and require no timing assumptions.

Ultimately, v0 and v1 control the order in which the internal state is set. If only a decrement counter is required, then R00 and R01 are eliminated. If only an increment counter is required, then R10 and R11 are eliminated. In both cases, the rules driving the internal state are no longer XORs, and x00, x01 and x10, x11 are no longer worthwhile. In the case of the decrement counter, the pass-transistor XOR driving v0 should be flipped to pass v10, v11 instead of v00, v01 . This flips the order in which v10, v11 must be set. Conversely, the increment counter should flip v1 . This converts the five transition path to three, saving some energy.

Finally, the downgoing transitions of v0 and v1 correctly acknowledge the internal state, so they can be used as expected in the reset rules of the forward drivers. The input acknowledge is simply combinational, following the standard WCHB reshuffling.

¬v0 ∧ ¬Li → R00⇂
¬v1 ∧ ¬Li ∧ (¬v00 ∨ ¬Re) → R01⇂
¬v0 ∧ ¬Ld ∧ (¬v00 ∨ ¬Re) → R10⇂
¬v1 ∧ ¬Ld → R11⇂
¬v10 ∨ ¬vz ∨ ¬R11 → Lz↾
(¬v11 ∧ ¬vn ∨ ¬R11) ∧ ¬R01 ∧ ¬R10 ∧ ¬R00 → Ln↾

All following counter units will use this as a base template, listing only the modified rules.

4.3 Clear

The clear command is fairly straightforward. Rc always acknowledges the input request with Lz and checks Re in preparation for the output request.

Re ∧ Lc → Rc↾
v10 ∧ vz ∧ R11 ∨ Rc → Lz⇂

Like the increment and decrement commands, the clear command must be gated. The forward driver always transitions on a clear command, but the command is only forwarded out Rc if vz is low. However, because the output request is guaranteed to set vz high, the gating rule must also check Rz to guarantee stability. Unfortunately, having two checks like this precludes the use of pass transistor logic.

¬_Rc ∧ (¬vz ∨ ¬Rz) → Rc↾
_Rc ∨ vz ∧ Rz → Rc⇂

The clear command always sets v00 and v10 , setting the value of this counter unit to 0. For v10 , the clear command is gated by x10 to reduce the load capacitance on x10 . This has the added effect of cleaning up instabilities on x10 , though that is ultimately unnecessary for the same reason as before.

¬v01 ∨ ¬x00 ∨ ¬_Rc → v00↾
¬v00 ∨ ¬Rc ∧ ¬x01 → v01↾
v01 ∧ x00 ∧ _Rc → v00⇂
v00 ∧ (Rc ∨ x01) → v01⇂
(¬v11 ∨ ¬_Rc) ∧ ¬x10 → v10↾
v11 ∧ _Rc ∨ x10 → v10⇂

Finally, the reset for Rc checks the transitions on both internal memory units. When the command is forwarded, Re is lowered following the usual sequence. When the command is not forwarded, vn and Rc will already be low and Re will remain high.
¬v01 ∧ ¬v11 ∧ ¬Lc ∧ (¬vn ∧ ¬Rc ∨ ¬Re) → Rc⇂
(¬v10 ∨ ¬vz ∨ ¬R11) ∧ ¬Rc → Lz↾

This approach adds a few assumptions to the pass transistor logic. First, when clearing a counter with both v01 and v11 set high, representing a state of 2, both internal memory units will transition. Before the transition, v0 will be high and v1 low. During the transition, both v0 and v1 will glitch with quite a significant voltage swing. However, those glitches follow the transitions on the internal memory with no delay and are resolved before Rc is lowered. Since v0 and v1 are only used in the forward drivers, and their use is gated by the other two inactive input requests, these glitches are completely masked. Specifically, v1 must complete its transition low before the completion of this cycle, while the upgoing transition on v0 will be checked by the next command.

Second, there is a transient state in which v10 and v11 are both high. During this transient state, x00 and x01 are left undriven. Because they are implemented by pass transistor logic, they will have already transitioned high by this point. If they have not, then there would have to be significant noise at just the right time in order to create a glitch on x00 or x01 . If x00 glitches down, then it will simply help the clear command correctly transition v00↾ and v01⇂ . If x01 glitches down, then it technically might glitch v01 up. Therefore, x00 and x01 must be held high during the clear command. These transistors can be as small as possible, as long as they are strong enough to overcome any possible noise.

¬_Rc → x00↾
¬_Rc → x01↾

4.4 Read

There are two ways to approach the read command. In the first approach, the read command propagates through the counter, but does not acknowledge. This blocks the counter from all other commands and keeps the values stored by the internal memory stable. When the command reaches the end of the counter, it sends a request for the bundled data channel formed by the latches in each counter unit and the output read request/acknowledge of the most significant counter unit. Once the output request has been acknowledged, the read command may be acknowledged throughout the counter, opening it up to the next command. This approach sacrifices throughput in favor of lower area.

In the second approach, the read command loads the values from the internal memory of each counter unit into a separate set of latches specifically for the read. Once the read command has reached the most significant counter unit and then gone through the reset phase, the bundled data request for the read channel may be sent. This allows the read command to operate at the same frequency as other commands as long as two read commands are not requested consecutively. Since it is unlikely that anyone would intend to send two read commands one after another, the read command is effectively always full throughput. This approach sacrifices area in favor of high throughput.

This section describes the second approach in detail, starting from the idzn counter template. Throughput was chosen over area because the throughput of the read and write commands is ultimately a bottleneck for later circuits. Since this circuit is implemented in the context of a digit-serial CGRA, it is prudent to ensure that the read value can be divided into digits to be converted to a digit stream. This gives four possible types of counter unit.
Of those four "first" blocks any consecutive read commands until the active read has completed, "mid" and "last" signal when the active read has completed, and "base" simply propagates the read while loading the latches. Finally, the interface drives the read handshake when everything has completed. The read propagates until it reaches a "mid" unit with its vz flag set. This signals that the remaining counter units store zero, so there is nothing left to read. If there is not a "mid" unit with its vz flag set, then the read completes when it reaches "last". The handshake on each digit is 101 handled independently so that it can be fed through a parallel to serial unit. Base Unit The base unit implementation starts with two read latches. These are only open when the read command is active, saving power otherwise. The Gray code used by the internal memory units needs to be converted to the standard binary code for the bit parallel channel. Luckily, that work is already done. The first bit, O00, O01 , is equal to the XOR of the internal memory units, which is covered by v1 and v0 . The second bit, O10, O11 is equal to v10 and v11 . 0. v10 , v00 = O10 , O00 1. v10 , v01 = O10 , O01 2. v11 , v01 = O11 , O00 3. v11 , v00 = O11 , O01 ¬O00 ∨ ¬_Rr ∧ ¬v0 → O01↾ ¬O01 ∨ ¬_Rr ∧ ¬v1 → O00↾ O00 ∧ (_Rr ∨ v0) → O01⇂ O01 ∧ (_Rr ∨ v1) → O00⇂ ¬O10 ∨ ¬_Rr ∧ ¬v10 → O11↾ ¬O11 ∨ ¬_Rr ∧ ¬v11 → O10↾ O10 ∧ (_Rr ∨ v10) → O11⇂ O11 ∧ (_Rr ∨ v11) → O10⇂ The transitions on the read memories are acknowledged using the same approach that was used for the internal memories with the pass transistor XORs. Much like v0 and v1 for the clear command, these will be allowed to transition for other commands, causing glitches. However, we assume that the upgoing transitions are short enough compared to the cycle time that they will have completed before the read command is active. @O00 ∧ ¬v0 ∨ @O01 ∧ ¬v1 → o0↾ ¬@O01 ∧ v0 ∨ ¬@O00 ∧ v1 → o0⇂ @O10 ∧ ¬v10 ∨ @O11 ∧ ¬v11 → o1↾ ¬@O11 ∧ v10 ∨ ¬@O10 ∧ v11 → o1⇂ The forward driver must acknowledge the upgoing transitions on v0 and v1 since those signals are used to set one of the read latches. The input acknowledge must return the same status that it did in the previous command so that the requesting process does not see a change in the counter status. This is surprisingly expensive, and requires a validity tree Rv to keep the transistor stacks driving Ln at a reasonable length. 102 Fig. 52: Read counter components. (v0 ∨ v1) ∧ Re ∧ Lr → Rr↾ Rr = Rr ¬_R01 ∨ ¬_R10 ∨ ¬_R00 → Rv↾ v10 ∧ vz ∧ (R11 ∨ v00 ∧ Rr) → Lz⇂ (v11 ∨ vn) ∧ R11 ∨ (v01 ∨ v11 ∨ vn) ∧ Rr ∨ Rv → Ln⇂ The reset phase acknowledges the newly introduced pass transistor XORs and everything else resets combinationally. ¬Re ∧ ¬Lr ∧ ¬o0 ∧ ¬o1 → Rr⇂ _R01 ∧ _R10 ∧ _R00 → Rv⇂ ¬v10 ∨ ¬vz ∨ ¬R11 ∧ (¬v00 ∨ ¬Rr) → Lz↾ (¬v11 ∧ ¬vn ∨ ¬R11) ∧ (¬v11 ∧ ¬vn ∧ ¬v01 ∨ ¬Rr) ∧ ¬Rv → Ln↾ First Unit Starting from the "base" unit, the "first" unit will need an extra dataless channel Xr to sync with the interface and block consecutive reads. Normally, this would just be added to the read request as follows. (v0 ∨ v1) ∧ Xre ∧ Re ∧ Lr → Rr↾ ¬Xre ∧ ¬Re ∧ ¬Lr ∧ ¬o0 ∧ ¬o1 → Rr⇂ Rr = Rr Xrr = Rr Unfortunately, this makes the transistor stack for the reset phase of the output request too long. To get around this, a new variable We is introduced to handle the acknowledgement of o0 and o1 . This increases the length of the set phase transistor stack to five, but reduces the reset phase back down to four. 
First Unit

Starting from the "base" unit, the "first" unit will need an extra dataless channel Xr to sync with the interface and block consecutive reads. Normally, this would just be added to the read request as follows.

(v0 ∨ v1) ∧ Xre ∧ Re ∧ Lr → Rr↾
¬Xre ∧ ¬Re ∧ ¬Lr ∧ ¬o0 ∧ ¬o1 → Rr⇂
Rr = Rr
Xrr = Rr

Unfortunately, this makes the transistor stack for the reset phase of the output request too long. To get around this, a new variable We is introduced to handle the acknowledgement of o0 and o1. This increases the length of the set phase transistor stack to five, but reduces the reset phase back down to four.

(v0 ∨ v1) ∧ Xre ∧ We ∧ Re ∧ Lr → Rr↾
Rr = Rr
Xrr = Rr
¬_Rr ∧ ¬o0 ∧ ¬o1 → We⇂
¬Xre ∧ ¬We ∧ ¬Lr ∧ ¬Re → Rr⇂
_Rr ∨ o0 ∨ o1 → We↾

This definition for We relies upon the Half-Cycle Timing Assumption. Alternatively, inverting the order in which the read latches are set and inverting the XOR gates driving o0 and o1 avoids this assumption. This would allow We to acknowledge Rr and the now upgoing transitions for o0 and o1.

Mid Unit

Starting from the rules in the "first" unit, the "mid" unit only forwards the read request if the vz flag is not set. Furthermore, the "mid" unit forwards the value of the vz and vn flags out Xr to signal to the interface unit whether this is the last value in the counter. Pass transistor logic is sufficient to gate these signals.

Rr = Xrd[0]
@Rr ∧ ¬vn → Xrd[1]↾
¬@Rr ∧ vz → Xrd[1]⇂
vn → Xrd[1]⇂
@Rr ∧ ¬vz → Xrd[0]↾
¬@Rr ∧ vn → Xrd[0]⇂
vz → Xrd[0]⇂

The vz case cannot be allowed to deadlock the system, so ¬vn is used to skip the output acknowledge in the reset phase.

¬Xre ∧ ¬We ∧ ¬Lr ∧ (¬vn ∨ ¬Re) → Rr⇂

Last Unit

The "last" unit is the same as "mid" except that all paths for forwarding the command have been removed. Normally, when doing this, vz and vn would be removed. However, in the case of the read command they must remain, recording the last value sent across Lz and Ln. This ensures that the status returned by the read command remains consistent. Specifically, an increment might overflow the counter, setting the value of the counter to 0. However, this case always returns Ln. If the overflow was not recorded by vz and vn, then the next read would return Lz. This would switch the vz and vn latch in the previous counter unit, potentially causing an instability on the status wires down the whole counter.

Read Interface

The read interface must block the read channel R until all of the read latches have resolved. This is signaled by the reset phase of Xo from the "mid" or "last" units. Then, the read interface must block the next read command until the read channel R has been acknowledged. This is done by holding Xie low. Ultimately, to keep the command throughput high, the requests on Xi and Xo are acknowledged as quickly as possible, with Xi being a higher priority. Therefore, the read interface waits for the requests on Xi and Xo, storing the cap value in v before acknowledging them. Once their output requests have reset, the read latches have resolved, so v is copied to the output request on the read channel. Once the read channel has acknowledged the request, the enables on Xi and Xo are raised, unblocking the next read command. Finally, the output request on R is reset.

v := 1, Xoe+, Xie+, Rr := null;
[¬Re ∧ ¬Xir ∧ ¬Xor];
*[(v := Xor; [Re]; Xoe- ∨∨ [Xir ∧ Re]; Xie-);
  [¬Xir ∧ ¬Xor];
  Rr := v;
  [¬Re];
  Xoe+, Xie+;
  Rr := null]

The copy of Xo's request to v is handled quite simply by an SR latch.

v0 ∨ Xo0 → v1⇂
v1 ∨ Xo1 → v0⇂
¬v0 ∧ ¬Xo0 → v1↾
¬v1 ∧ ¬Xo1 → v0↾

The completion of this circuitry is detected with an XOR driving the C-element for Xoa. Meanwhile, the C-element driving Xia acknowledges the request on Xi as soon as it arrives.

Re ∧ (v0 ∧ Xo0 ∨ v1 ∧ Xo1) → Xoa↾
Re ∧ Xi0 → Xia↾
Xoa → Xoe⇂
Xia → Xie⇂

Once the request on Xo has been lowered, the value stored in v is copied to the output request of the read channel R.

¬Xo0 ∧ ¬v1 ∧ ¬Xoe → Rc0↾
¬Xo1 ∧ ¬v0 ∧ ¬Xoe → Rc1↾
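For reference, the two state-holding primitives the interface relies on, the SR latch copying Xo's request into v and the C-elements driving the acknowledges, reduce to the following behavioral sketch (illustrative code only, not a circuit description):

    def c_element(a: bool, b: bool, prev: bool) -> bool:
        """Output follows the inputs when they agree; otherwise it holds."""
        return a if a == b else prev

    class SRLatch:
        """Cross-coupled pair; set and reset are assumed mutually exclusive."""
        def __init__(self):
            self.q = False
        def step(self, s: bool, r: bool) -> bool:
            assert not (s and r)
            if s: self.q = True
            if r: self.q = False
            return self.q

    assert c_element(True, False, prev=True) is True   # disagreement holds state
    latch = SRLatch()
    assert latch.step(True, False) and not latch.step(False, True)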
The acknowledgement of the read channel unblocks Xi and Xo.

¬Re ∧ ¬Xia → Xoa⇂
¬Re ∧ ¬Xi0 → Xia⇂
¬Xoa → Xoe↾
¬Xia → Xie↾

Finally, the output requests to the read channel are lowered, resetting the handshake.

Xo0 ∨ v1 ∨ Xoe → Rc0⇂
Xo1 ∨ v0 ∨ Xoe → Rc1⇂

4.5 Write

Unlike the read counter, there is only one reasonable way to approach the bundled-data write. Once again, the write command propagates through the counter, and at each unit loads in the written value. Once the value has been successfully loaded into the internal memory, other commands may follow almost immediately. The only command that must wait is another consecutive write. This is similar to the high throughput solution for the read. The equivalent of the area-saving solution does not save area for the write counter because there are no extra latches to remove.

Similar to the read counter, this is implemented in the context of a digit-serial CGRA. Therefore, the write data should be divided into digits from a serial to parallel unit. This again gives four different components. The "first" unit blocks any consecutive write command until the active write command has completed, "mid" and "last" signal when the active write has completed, and "base" propagates the write while loading the written value. Once again, the interface drives the write handshake once everything has completed. The write propagates until it reaches the cap token of the digit stream, signalling that there is no more data to write. If the digit stream is longer than the total number of counter units, then there will be an overflow. In this case, the write completes when it reaches "last". The handshake on each digit is handled independently so the write data can be received from the serial to parallel unit.

Base Unit

Once again, the base unit circuit implementation starts with the latches. The write data received from the serial to parallel unit uses the standard binary encoding. That means it has to be converted to the Gray code used by the counter. This is done by an XOR gate in the datapath, and the transitions are covered by a delay line on the request for the write data. This will be discussed in the write interface section.

0. W10 , W00 = v10 , v00
1. W10 , W01 = v10 , v01
2. W11 , W00 = v11 , v01
3. W11 , W01 = v11 , v00

Fig. 53: Write counter components.

¬W00 ∧ ¬W10 ∨ ¬W01 ∧ ¬W11 → Wx0↾
(W00 ∨ W10) ∧ (W01 ∨ W11) → Wx0⇂
Wx0 → Wx1⇂
¬Wx0 → Wx1↾

Then, the write command arrives on Lw and is forwarded to Rw. Ultimately, the write interface must compute where to place the zero flag. This information is passed to the counter unit through a datapath signal, Z. Now, the write command is acknowledged with the new zero flag for the previous counter unit. This ultimately makes the pull-up network for the gate driving Ln too long. So, the Half-Cycle Timing Assumption is used with a validity tree, Rv, built from the internal nodes of the C-elements driving R01, R10 and R00.

Re ∧ Lw → Rw↾
Rw = Rw
¬_R01 ∨ ¬_R10 ∨ ¬_R00 → Rv↾
v10 ∧ vz ∧ R11 ∨ Rw ∧ Z1 → Lz⇂
(v11 ∨ vn) ∧ R11 ∨ Rv ∨ Rw ∧ Z0 → Ln⇂

The internal state is set using the typical method. Careful consideration is taken to keep the rules for increment and decrement as short as possible so as to not interfere with their performance. Ultimately, the write command will create instabilities if assumptions are not made. Specifically, during a write command, both latches can switch their value. This will cause transitions in v0 and v1 similar to the transitions created by the clear command.
It is assumed that the downgoing transitions on v0 or v1 have sufficiently resolved before a new input command is received.

¬v01 ∨ ¬x00 ∨ ¬Wx1 ∧ ¬_Rw → v00↾
¬v00 ∨ ¬x01 ∨ ¬Wx0 ∧ ¬_Rw → v01↾
v01 ∧ x00 ∧ (Wx1 ∨ _Rw) → v00⇂
v00 ∧ x01 ∧ (Wx0 ∨ _Rw) → v01⇂
(¬v11 ∨ ¬W11 ∧ ¬_Rw) ∧ ¬x10 → v10↾
(¬v10 ∨ ¬W10 ∧ ¬_Rw) ∧ ¬x11 → v11↾
v11 ∧ (W11 ∨ _Rw) ∨ x10 → v10⇂
v10 ∧ (W10 ∨ _Rw) ∨ x11 → v11⇂

Both internal memory units driving v00, v01 and v10, v11 are nlatches with respect to the write. This means that setting a value, v01 for example, would cause v01↾; v00⇂ in that order. For v00 and v01, the downgoing transitions of v00 or v01 are passed directly through the pass transistors driving v0 and v1 with no gate delay. These are assumed to complete within five transitions of when they are enabled. This includes the reset of the forward driver Rw, the input enable Lz, Ln, and then a new input command Li, Ld, Lw. For v10 and v11, the downgoing transitions of v0 or v1 are enabled as soon as v10 or v11 transitions high, one transition earlier than the v00 and v01 case. Then, there is a single gate delay to drive v0 and v1 low as a result. Therefore, the assumption requires these transitions to complete within six transitions of when they are enabled. This includes the downgoing transition of the internal memory v10, v11, the reset of the forward driver Rw, the input enable Lz, Ln, and then a new input command Li, Ld, Lw. Effectively, this is equivalent to the bounds assumed by a Half-Cycle Timing Assumption.

Once the internal memory has transitioned, two pass-transistor XORs, v0w and v1w, sit between each internal memory and the write value to signal its completion. These two signals will be unstable as the counter increments or decrements and the internal state changes, but those transitions are passed directly through the pass transistors with no delay. Therefore, the upgoing transitions on v0w and v1w will have resolved long before the write command is active.

@v00 ∧ ¬Wx0 ∨ @v01 ∧ ¬Wx1 → v0w↾
¬@v00 ∧ Wx1 ∨ ¬@v01 ∧ Wx0 → v0w⇂
@v10 ∧ ¬W10 ∨ @v11 ∧ ¬W11 → v1w↾
¬@v10 ∧ W11 ∨ ¬@v11 ∧ W10 → v1w⇂

Finally, v0w and v1w transition low, signalling the completion of the write, and the forward driver Rw is reset. The input enable follows combinationally.

¬Re ∧ ¬Lw ∧ ¬v0w ∧ ¬v1w → Rw⇂
_R01 ∧ _R10 ∧ _R00 → Rv⇂
(¬v10 ∨ ¬vz ∨ ¬R11) ∧ (¬Rw ∨ ¬Z1) → Lz↾
(¬v11 ∧ ¬vn ∨ ¬R11) ∧ ¬Rv ∧ (¬Rw ∨ ¬Z0) → Ln↾

Similar to the clear counter, there is a transient state in which both v10 and v11 are high, leaving x00 and x01 dynamic. Once again, this can be resolved with two small transistors.

¬_Rw → x00↾
¬_Rw → x01↾

First Unit

Starting from the "base" unit, the "first" unit will need an extra dataless channel Xr to sync with the interface and block consecutive writes. Normally, this would just be added to the write request.

Xre ∧ Re ∧ Lw → Rw↾
¬Xre ∧ ¬Re ∧ ¬Lw ∧ ¬v0w ∧ ¬v1w → Rw⇂
Rw = Rw
Xrr = Rw

However, this makes the transistor stack for the reset phase of the output request too long. Once again, a new variable We is introduced to handle the acknowledgement of v0w and v1w.

Xre ∧ We ∧ Re ∧ Lw → Rw↾
Rw = Rw
Xrr = Rw
¬_Rw ∧ ¬v0w ∧ ¬v1w → We⇂
¬Xre ∧ ¬We ∧ ¬Lw ∧ ¬Re → Rw⇂
_Rw ∨ v0w ∨ v1w → We↾

This definition for We would rely upon the Half-Cycle Timing Assumption. Again, this can be avoided by inverting the order in which the write latches are set and inverting the XOR gates driving v0w and v1w. This would let We acknowledge Rw and the now upgoing transitions of v0w and v1w.
This inversion would remove the transient state in which x00 and x01 are dynamic, so the previously added keepers would no longer be necessary. However, it also adds a transient state in which x10 and x11 are dynamic. Therefore, two new keepers would be needed.

Rw → x10⇂
Rw → x11⇂

Mid Unit

Starting from the rules in the "first" unit, the "mid" unit only forwards the write request if the data token is not a cap. The interface communicates this information to the "mid" unit through a datapath signal called X. This is now used to gate Rw.

@Rw ∧ ¬X1 → Rw↾
¬@Rw ∧ X0 → Rw⇂
X1 → Rw⇂

Unfortunately, this signal is stored in a platch. This means that when the platch switches, the pass transistor logic driving Rw will go dynamic. Therefore, a keeper is needed.

_Rw ∧ ¬Rw → Rw⇂

Then, the vz flag must be set. Careful consideration is taken to ensure that this does not interfere with the performance of the other commands.

¬vn ∨ ¬Rz → vz↾
¬vz ∧ (¬X1 ∨ ¬Rw) ∨ ¬Rn → vn↾
vn ∧ Rz → vz⇂
(vz ∨ Xd1 ∧ Rw) ∧ Rn → vn⇂

Finally, the acknowledgement of Re is skipped when the output command is not sent. Furthermore, the new transition on vz, vn must also be acknowledged at this point. A pass transistor OR gate can make both of these things happen. X is used to gate the transition on _vz into a new signal called Ze. Then, Rw waits for this signal to transition low. Because it is a pass transistor gate, transitions on Ze will follow all transitions on _vz with no delay when it is open. This will be unstable during increment and decrement commands, but will be stable for the write. This acknowledges the transition on _vz, vz, vn with no timing assumptions for the QDI control.

¬@_vz ∧ Xd1 → Ze⇂
@_vz ∧ ¬Xd0 → Ze↾
¬X1 → Ze↾
¬Xre ∧ ¬We ∧ ¬Lw ∧ (¬Ze ∨ ¬Re) → Rw⇂

Last Unit

The "last" unit is the same as "mid" except that all paths for forwarding the command have been removed. Unlike the read counter, there is no need to store the Lz, Ln flags in this unit.

Write Interface

The write interface is responsible for blocking consecutive write commands. Unfortunately, the write data channel is significantly slower than the command channel. For the next command to be propagated through the counter, the request on Xi must be acknowledged as soon as possible. For the next write to propagate through the counter, the request on Xi must be enabled. For now, it is assumed that the input data remains stable while the write enable We is low. This means that the write enable can be lowered as soon as the write command has been initiated, as signaled by the request on Xi. However, the write enable cannot transition high until after the write command has completed and the written data has all been acknowledged in the counter units, as signalled by Xor⇂. Unfortunately, these constraints do not lend themselves to any of the standard reshufflings (WCHB, PCHB, PCFB).

Fig. 54: Bundled-Data write counter interface.

Upon initialization, Xir is low, blocking any write command from proceeding, and We is high, ready to receive write data. Once the write data has arrived on W, as signalled by Wr↾, the write command is unblocked by Xir↾. Once the write command is unblocked, the write data is acknowledged by lowering We. This is done as soon as possible to overlap the slow write data handshake on W with the write command propagation through the counter. Xir is also lowered, resetting the handshake in the counter unit and allowing further commands to proceed as long as they are not write commands.
Once the write command has completely propagated through the counter, as signalled by Xor↾, it is acknowledged by lowering Xoe. Finally, when Xor is lowered, the write command has entered the reset phase in the last counter unit and all of the latches in the counter have stabilized. At this point, We is raised to signal that the process is ready for new write data.

Xir⇂, We↾, Xoe↾;
[Xie ∧ ¬Wr ∧ ¬Xor];
∗[([Xie ∧ Wr]; Xir↾ ∥ Xoe↾);
  We⇂;
  ([¬Xie]; Xir⇂ ∥ [Xor]; Xoe⇂);
  [¬Xor ∧ ¬Wr];
  We↾ ]

This whole handshake is anchored by We. We is lowered once Xir and Xoe have gone high and the write command is propagating through the counter. Then, it is raised once Xor and Wr are lowered, signalling the completion of the write. To disambiguate some of the other states in the handshake, the downgoing transitions on Xir and Xoe must also be acknowledged.

Xir ∧ Xoe → _We↾
¬Xir ∧ ¬Xoe ∧ ¬Xor ∧ ¬Wr → _We⇂
_We → We⇂
¬_We → We↾

For Xir, there is the usual dependency on Xie, but then Xir must also wait for the write data to arrive before letting the command propagate. This is covered by Wr in the Xir↾ rule. We then disambiguates some of the other states in the handshake.

¬Xie ∧ ¬We → Xir⇂
Xie ∧ We ∧ Wr → Xir↾

For Xoe, We↾ already acknowledges the downgoing transition of Xor. So, Xoe must only acknowledge the upgoing transition.

_We ∧ Xor → Xoe⇂
¬_We → Xoe↾

Overall, this interface is as parallel as possible, allowing the write command propagation to overlap the write data handshake.

Chunked Write Interface

The chunked write interface has the same underlying handshake as the normal write interface, but adds channel actions to propagate the zero flag from digit to digit. This allows each digit to be loaded independently so the write interface may accept input from the serial to parallel unit. There are two input channels: W carries the write data for this digit, and Uz carries the zero flag from the next most significant digit. Then, there are the two channels that block consecutive writes, Xi and Xo. Finally, there is one output channel, Dz, which carries the computed zero flag for this digit to the next digit of lesser significance.

The handshake with Xi, Xo and W remains largely unchanged. Meanwhile, logic for handling non-cap tokens and the handshake for the Dz and Uz channels have been merged in. The handshake starts with We high, signifying that it is ready to receive write data. Xir and Xoe are both low, blocking any write command from propagating through the counter. Uz is only enabled when the token received from W is not a cap. This guarantees that the requests on Uz will be low unless they are required by the handshake. At the start of the handshake, the requests on Dz and Xi along with the enable on Xo are raised in parallel. This allows each of these signals to transition as soon as possible in their respective handshakes. Once this is done, We and Uze transition low, acknowledging the slow handshakes for the write data as soon as possible. Then, the requests on Dz and Xi along with the enable on Xo are reset in parallel. This allows the next counter command to propagate through the counter. When the command arrives at the end of the counter, Xoe is lowered. Then, once the request on Xo has lowered, all of the latches in the counter are stable, and the process can therefore request new write data from W and Uz.

Fig. 55: Chunked Bundled-Data write counter interface.
We↾, Dz0⇂, Dz1⇂, Xir⇂, Xoe⇂, Uze⇂;
[¬W0 ∧ ¬W1 ∧ ¬Uz0 ∧ ¬Uz1 ∧ Dze ∧ Xie];
∗[[ W0 → Uze↾
  ▯ W1 → skip
  ];
  ( [ Dze ∧ (W0 ∧ (Uz1 ∨ Zd0) ∨ W1 ∧ Zd0) → Dz1↾
    ▯ Dze ∧ (W0 ∧ Uz0 ∨ W1) ∧ Zd1 → Dz0↾
    ]
  ∥ [Xie ∧ (W0 ∧ (Uz0 ∨ Uz1) ∨ W1)]; Xir↾
  ∥ Xoe↾
  );
  We⇂; Uze⇂;
  ( [¬Dze]; Dz0⇂, Dz1⇂
  ∥ [¬Xie]; Xir⇂
  ∥ [Xor]; Xoe⇂
  );
  [¬Xor ∧ ¬W0 ∧ ¬W1 ∧ ¬Uz0 ∧ ¬Uz1];
  We↾ ]

First, the initial zero flag for this digit is loaded into the bundled datapath. When W1 is high, the digit is a cap token, meaning it is the last token in the stream. In this case, the initial zero flag is set to true. Otherwise, when W0 is high, the zero flag is loaded from the next most significant digit as communicated over Uz. Finally, the cap token flag is also loaded from W into the datapath. This signal will be used to stop the write command from propagating through the rest of the counter.

Zu0 ∨ Uz1 ∧ W0 → Zu1⇂
Zu1 ∨ Uz0 ∧ W0 ∨ W1 → Zu0⇂
¬Zu0 ∧ (¬Uz1 ∨ ¬W0) → Zu1↾
¬Zu1 ∧ (¬Uz0 ∨ ¬W0) ∧ ¬W1 → Zu0↾
X1 ∨ W1 → X0⇂
X0 ∨ W0 → X1⇂
¬X1 ∧ ¬W1 → X0↾
¬X0 ∧ ¬W0 → X1↾

Then, these latches need time to resolve. To reduce the number of delay lines required and reduce energy, the validity of each input is computed and delayed.

W0 ∨ W1 → dWv↾ // delayed
¬W0 ∧ ¬W1 → dWv⇂
Uz1 ∨ Uz0 → dUzv↾ // delayed
¬Uz1 ∧ ¬Uz0 → dUzv⇂

For the handshake, the requests on Dz must wait for Zd to resolve in the datapath, but Zd is only dependent upon W. When Zd is false, meaning not all of the bits in this digit are one, false is sent on Dz before receiving the zero flag from Uz. This early-out mechanism effectively breaks the carry chain, letting it resolve in O(log N) time on average instead of O(N). If Zd is true, then the result sent on Dz is determined by Uz or W. If this digit is a cap token, as signified by W1, then the value of Zd is simply forwarded. If this digit is not a cap token, then the value of Uz is forwarded. However, it is not necessary to wait until after the delay line on Uz because Zd has already been computed. Because Uze is kept low in the case of a cap token, the requests on Uz will be low, preventing any instabilities on Dz.

Then, the request on Xi must be driven. The value received from Uz is loaded into Zu and used in the datapath to compute the zero flag for each bit in the digit. It is therefore necessary to wait until that process has resolved after the delay lines on Uz, so Xir waits for both dWv and dUzv. Once the datapath has resolved, a request is sent on Xir, which unblocks the write command from propagating through the counter units. In parallel, Xo is also enabled, letting the command propagate out the last unit in the counter as well. Finally, Rv waits for all three of these actions to resolve in parallel.

Dze ∧ We ∧ dWv ∧ (Uz1 ∨ Zd0) → Dz1↾
Dze ∧ We ∧ dWv ∧ (Uz0 ∨ W1) ∧ Zd1 → Dz0↾
Xie ∧ We ∧ dWv ∧ (dUzv ∨ W1) → Xir↾
¬_We → _Xoe⇂
¬_Xoe → Xoe↾
Xoe ∧ (Dz0 ∨ Dz1) ∧ Xir → Rv↾

Once Rv signals the completion of the forward driving rules, the write data on W is acknowledged. It is assumed that the bundled datapath remains unchanged until after W is enabled again. This is not a standard implementation of the bundled-data protocol, since typically there would be a layer of latches to ensure that. However, the serial to parallel units are able to guarantee that feature without the extra layer of latches.

Rv → _We↾
_We → We⇂
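The zero-flag chain just described is easy to restate in software. The sketch below is behavioral only and uses illustrative names; it assumes two-bit digits and, as stated above, that a digit's flag is false unless all of the bits in the digit are one, in which case the cap digit forwards its own Zd and every other digit forwards the flag of its more significant neighbor.

    DIGIT_BITS = 2
    ALL_ONES = (1 << DIGIT_BITS) - 1

    def dz_flags(digits, cap_index):
        """The flag each digit sends down Dz, walking from the cap token down."""
        flags = []
        uz = True                        # the cap digit forwards its own Zd
        for i in range(cap_index, -1, -1):
            zd = digits[i] == ALL_ONES   # resolves from W alone: the early out
            dz = zd and uz
            flags.append(dz)
            uz = dz
        return list(reversed(flags))

    # Digits ...11 01 11 with the cap on top: the 01 digit breaks the chain.
    assert dz_flags([0b11, 0b01, 0b11], cap_index=2) == [False, False, True]

In hardware, any digit whose Zd is false resolves its Dz without waiting on Uz at all, which is, loosely, where the O(log N) average depth comes from.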
Now that W has been acknowledged, the write command must propagate through the counter units and the request on W must be lowered. These both represent the longest parts of their respective handshakes, and this reshuffling handles them in parallel. Upon completion of each process, the requests are lowered immediately, following a reshuffling akin to the PCHB. This unblocks the counter, allowing subsequent commands to propagate. Then, Rv is lowered once each signal has been reset.

¬Dze ∧ ¬We → Dz1⇂
¬Dze ∧ ¬We → Dz0⇂
¬Xie ∧ ¬We → Xir⇂
_We ∧ Xor → _Xoe↾
_Xoe → Xoe⇂
¬Xoe ∧ ¬Dz0 ∧ ¬Dz1 ∧ ¬Xir → Rv⇂

Finally, before raising We, the request on Xo must lower. This guarantees that all of the latches in the counter units have stabilized to their written value. Then, raising We requests new data on the datapath.

¬Rv ∧ ¬Xor ∧ ¬dWv ∧ ¬dUzv → _We⇂
¬_We → We↾

Uze is almost entirely determined by We. However, Uze can only go high when this digit is not a cap token. Therefore, a pass transistor AND gate ensures that no extra delay is introduced when implementing this feature.

@W0 ∧ ¬_We → Uze↾
¬@W0 ∧ We → Uze⇂
_We → Uze⇂

4.6 Evaluation

In [75], I designed a large class of QDI counter circuits, showing significant gains compared to other asynchronous counters. While they were faster and used significantly less energy, they required up to twice as many transistors to implement. These were compared to the two other known QDI counters, from [96] and [98]. To my knowledge, no one had done quite as thorough an exposition on QDI counters and their capabilities.

Type           Trans      Frequency  Energy/Op  Latency
d_z [75]       50N        2.73 GHz   24.01 fJ   N/A
dzn [75]       102N+10    2.15 GHz   48.17 fJ   399 ps
idzn [75]      146N+12    2.03 GHz   56.05 fJ   421 ps
idczn [75]     174N+14    2.00 GHz   40.62 fJ   442 ps
idrzn [75]     246N+14    1.88 GHz   89.51 fJ   441 ps
idrzn_bd [75]  188N+32    1.77 GHz   75.20 fJ   441 ps
dwzn [75]      192N+12    1.86 GHz   43.81 fJ   487 ps
is_zn [75]     146N+61    2.08 GHz   45.52 fJ   139 ps

Type           Trans      Frequency  Energy/Op  Latency
d_z_n [96]     117N+32    1.42 GHz   73.34 fJ   468 ps
id_zn [98]     398N+26    0.60 GHz   152.76 fJ  1150 ps

The designs presented in this chapter were also compared to the typical counter synthesized by Synopsys Design Compiler in the same technology. While it is both fast and small, it is also comparatively power hungry. The optimizations introduced here improve upon the previous designs, cutting the transistor count by more than half, increasing the frequency by 43%, and decreasing the energy usage by 20%. Furthermore, the circuit template is simplified, making it easier to fit more functionality into a single counter bit before running into the limitations of the WCHB template. These metrics are averaged using the carry chain length statistics from the SPEC2006 benchmark in Fig. 56, which show that the vast majority of increments and decrements only carry about five bits past the least significant bit. As shown in Fig. 57, the counters from this chapter are nearing the clock speed typically achieved in extremely optimized synchronous architectures while consuming less than half as much energy during operation, using only 10.5 more transistors per bit. This is true to varying degrees across all of the possible counter commands.
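The carry-chain statistic behind these averages is simple to restate in software: an increment ripples through the trailing ones of a value and a decrement through the trailing zeros. The sketch below states the measurement itself, not the benchmark methodology.

    def inc_chain(x: int) -> int:
        n = 0
        while x & 1:                 # carry ripples through trailing ones
            n += 1
            x >>= 1
        return n + 1                 # plus the bit that finally flips

    def dec_chain(x: int) -> int:
        n = 0
        while x and x & 1 == 0:      # borrow ripples through trailing zeros
            n += 1
            x >>= 1
        return n + 1

    assert inc_chain(0b0111) == 4    # 7 -> 8 flips four bits
    assert dec_chain(0b1000) == 4    # 8 -> 7 flips four bits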
Type      Trans  Frequency  Energy/Op
id_c_zn   74N    1.00 GHz   169.18 fJ
id_c_zn   74N    2.00 GHz   116.75 fJ
id_c_zn   74N    3.00 GHz   98.24 fJ
id_c_zn   74N    4.00 GHz   86.12 fJ

Type        Trans         Frequency  Energy/Op
idzn        71N           3.57 GHz   42.38 fJ
idczn       84.5N         3.11 GHz   41.41 fJ
idrzn_bd    102.5N+32     3.28 GHz   53.93 fJ
idwzn_bd    104.5N+42     3.25 GHz   65.31 fJ
idrzn_bdN   102.5N+53M    3.24 GHz   110.06 fJ
idwzn_bdN   104.5N+164M   3.22 GHz   87.05 fJ
dwzn_bd     76N+42        3.48 GHz   58.89 fJ
dwzn_bdN    76N+164M      3.43 GHz   81.22 fJ

Fig. 56: The distribution of carry chain lengths for increment and decrement commands in the SPEC2006 benchmark.

Fig. 57: Measured Performance and Energy for an array of counters.

CHAPTER 5
STREAM MANIPULATION

Three operations are fundamental for managing adaptive digit-serial streams. First, given that two adaptive digit streams are likely to be different lengths, an operator with multiple inputs will have to sign-extend all inputs to the length of the longest stream. Second, once an operator is complete, there are likely to be redundant tokens in the result that should be removed with sign compression. Third, some operators require parallel inputs. Since data is transmitted serially, it will have to be actively converted to parallel. This chapter will explore multiple implementations of each in the context of different encodings and topologies.

5.1 Sign Extension

Sign-extension is necessary for any digit-serial operation with multiple inputs. At best, the control circuitry necessary for sign-extension can be integrated into the circuitry for the operation itself. At worst, it can be shared across all of the operations in a single CGRA execution node. This section covers both a QDI approach and an integrated QDI/BD approach.

5.1.1 Behavioral Specification

The function implemented by these approaches remains fairly consistent. Each input stream, communicated over a channel X, consists of a sequence of tokens in which Xd, or "data", communicates the value of the bit or digit and Xc, or "cap", communicates the end-of-stream marking. Xc is true for the last token in the stream and false otherwise. For the sign-extension process, there are two input channels, A and B, and one output channel, S. If neither input token is a cap, then both tokens can be acknowledged. If only one input token is a cap, then that token is not acknowledged. This keeps the token waiting on the channel interface, effectively duplicating it. Meanwhile, the other token is acknowledged until a cap token has been received on both channels. Once that happens, both tokens can be acknowledged to complete the operation.

∗[[ !Ac ∧ !Bc → S!({Ad, Bd}, 0); A?, B?
  ▯ Ac ∧ !Bc → S!({Ad, Bd}, 0); B?
  ▯ !Ac ∧ Bc → S!({Ad, Bd}, 0); A?
  ▯ Ac ∧ Bc → S!({Ad, Bd}, 1); A?, B?
]]

Meanwhile, the output channel combines the two input bits Ad and Bd into an output with a single end-of-stream marking. In the context of another operation, part of that operation can be implemented here as an optimization to reduce the size of the output channel, reduce the complexity of the next pipeline stage, and increase the overall frequency. An example of this would be adding Ad and Bd into a 3-valued output in preparation for a bit-serial addition operation.
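A software model of this specification makes a useful cross-check. The sketch below is behavioral only: it assumes each token is a (data, cap) pair, least significant first, and simply pairs the two data values rather than folding in a downstream operation.

    def sign_extend(a, b):
        out, i, j = [], 0, 0
        while True:
            (ad, ac), (bd, bc) = a[i], b[j]
            out.append(((ad, bd), int(ac and bc)))
            if ac and bc:
                return out           # both caps: emit the final token and stop
            if not ac: i += 1        # a cap token is held, not consumed,
            if not bc: j += 1        # effectively duplicating it

    #  3 = bits 1, 1, then cap 0      -2 = bits 0, then cap 1
    a = [(1, 0), (1, 0), (0, 1)]
    b = [(0, 0), (1, 1)]
    assert sign_extend(a, b) == [((1, 0), 0), ((1, 1), 0), ((0, 1), 1)]

The shorter stream's cap token is replayed until the longer stream also caps, which is exactly the duplication described above.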
For multi-bit digits, a problem arises in the treatment of the cap token. For the above specification, the most significant digit is simply repeated. The last digit in the stream could be constrained to either 0 or -1. If the digits are represented with a binary encoding, then this would be encoded as all zeros or all ones, which faithfully implements two's complement encoding. Alternatively, extra circuitry could be introduced in the datapath that duplicates the most significant bit in the cap token to the remaining bits in a new cycle, sign-extending the digit. This would remove all constraints from the cap token and reduce the number of tokens necessary to represent a given value.

5.1.2 QDI Only (PCHB)

If reliability in extreme temperature environments is the most important factor for the design, then a QDI-only design is likely optimal. This reliability comes at a direct cost in throughput, energy, and area. Because of the built-in control-data split, the best QDI template to use for these operations is likely PCHB. It is the simplest template in which the reset of the output data is not dependent upon the acknowledgement of the input data. This minimizes circuit complexity in the face of wide data operations. However, parts of the WCHB template can simplify the input acknowledgement for the control. If the data grows too wide, then it should be desynchronized from the control with its own pipeline circuitry. Assuming digit sizes of 4 bits or less, this is not a concern.

Per the CHP specification, there are four conditions that determine the behavior of the process. Unfortunately, while these conditions do share various behaviors, they are not shared in a way that can be merged per the first optimization rule in Chapter 3 due to acknowledgement constraints. Therefore, four C-elements are required to record which condition is covered by this cycle. These four signals are then used to generate the output request.

en ∧ Ac0 ∧ Bc0 → Xd0↾
en ∧ Ac1 ∧ Bc0 → Xd1↾
en ∧ Ac0 ∧ Bc1 → Xd2↾
en ∧ Ac1 ∧ Bc1 → Xd3↾
Xd0 ∨ Xd1 ∨ Xd2 → Sc0↾
Sc1 = Xd3

The PCHB template inherently requires two validity trees: one for the input requests and one for the output requests. However, the acknowledgement of the input requests for the control circuitry will be handled with a WCHB reshuffling. Therefore, the validity signals necessary for this design only cover the input data requests with An and Bn, the output data requests with Sdan and Sdbn, and the output control requests with Scn. Because the behavior condition was previously recorded with four C-elements, their internal nodes can be used to gate the input enable. The global enable en also uses the internal nodes to handle cases that do not acknowledge an input.

Sc0 ∨ Sc1 → Scn⇂
¬Sdan ∧ ¬Sdbn ∧ ¬Scn → Rn⇂
¬An ∧ ¬Rn ∧ (¬_Xd0 ∨ ¬_Xd2 ∨ ¬_Xd3) → Ae⇂
¬Bn ∧ ¬Rn ∧ (¬_Xd0 ∨ ¬_Xd1 ∨ ¬_Xd3) → Be⇂
(¬Ae ∨ ¬_Xd1) ∧ (¬Be ∨ ¬_Xd2) ∧ ¬Se → en⇂

Entering the reset phase, the input control requests are acknowledged using WCHB techniques directly in the drivers for the output requests. This dramatically simplifies the acknowledgement circuitry in the control.

¬en ∧ ¬Ac0 ∧ ¬Bc0 → Xd0⇂
¬en ∧ ¬Bc0 → Xd1⇂
¬en ∧ ¬Ac0 → Xd2⇂
¬en ∧ ¬Ac1 ∧ ¬Bc1 → Xd3⇂
¬Xd0 ∧ ¬Xd1 ∧ ¬Xd2 → Sc0⇂

Finally, the validity, input enable, and global enable signals are all set following the typical PCHB template.

¬Sc0 ∧ ¬Sc1 → Scn↾
Sdan ∧ Sdbn ∧ Scn → Rn↾
An ∧ Rn → Ae↾
Bn ∧ Rn → Be↾
Ae ∧ Be ∧ Se → en↾

In the datapath, the output request signals are simply conditioned on the global enable. This serves the same purpose as the clock in a clocked datapath. The reset phase for the data requires no information about the acknowledgement of the input channel.
en ∧ Ad0 → Sda0↾
¬en → Sda0⇂
en ∧ Ad1 → Sda1↾
¬en → Sda1⇂
en ∧ Bd0 → Sdb0↾
¬en → Sdb0⇂
en ∧ Bd1 → Sdb1↾
¬en → Sdb1⇂
Ad0 ∨ Ad1 → An⇂
¬Ad0 ∧ ¬Ad1 → An↾
Bd0 ∨ Bd1 → Bn⇂
¬Bd0 ∧ ¬Bd1 → Bn↾
Sda0 ∨ Sda1 → Sdan⇂
¬Sda0 ∧ ¬Sda1 → Sdan↾
Sdb0 ∨ Sdb1 → Sdbn⇂
¬Sdb0 ∧ ¬Sdb1 → Sdbn↾

5.1.3 Integrated QDI/BD (Extensible)

The bundled-data timing assumption simplifies much of the circuitry in this design. As suggested in Chapter 3, the control should be handled by a QDI process while the datapath should be clocked by the QDI control. Therefore, the cap signal of each token is communicated by the QDI input requests while the data bits of each token are communicated with a clocked bus. Instead of using four C-elements to keep track of the current condition like the QDI approach does, the integrated QDI/BD approach takes advantage of the bundled-data timing assumption and copies the QDI input requests into the latched datapath. This keeps that information stable throughout the whole cycle and removes the responsibility from the output request drivers.

Axcd0 ∨ Ac0 → Axcd1⇂
Axcd1 ∨ Ac1 → Axcd0⇂
¬Axcd0 ∧ ¬Ac0 → Axcd1↾
¬Axcd1 ∧ ¬Ac1 → Axcd0↾
Bxcd0 ∨ Bc0 → Bxcd1⇂
Bxcd1 ∨ Bc1 → Bxcd0⇂
¬Bxcd0 ∧ ¬Bc0 → Bxcd1↾
¬Bxcd1 ∧ ¬Bc1 → Bxcd0↾

Then, instead of the four delay lines typically required by a bundled-data design, the input control requests can be merged together using two C-elements and delayed. This reduces the number of delay lines from four to two and simplifies the remaining circuitry.

(Ac0 ∧ (Bc0 ∨ Bc1) ∨ Ac1 ∧ Bc0) → ABd0↾ // delay
Ac1 ∧ Bc1 → ABd1↾ // delay
¬Ac0 ∧ ¬Bc0 → ABd0⇂
¬Ac1 ∧ ¬Bc1 → ABd1⇂

Then, the latched control signals can be used to condition the input acknowledgement, and the remainder of the design is a simple WCHB buffer.

Se ∧ ABd0 → Sc0↾
Se ∧ ABd1 → Sc1↾
Sc0 ∧ Axcd0 ∨ Sc1 → Ae⇂ // amplify
Sc0 ∧ Bxcd0 ∨ Sc1 → Be⇂ // amplify
¬Se ∧ ¬ABd0 → Sc0⇂
¬Se ∧ ¬ABd1 → Sc1⇂
(¬Sc0 ∨ ¬Axcd0) ∧ ¬Sc1 → Ae↾ // amplify
(¬Sc0 ∨ ¬Bxcd0) ∧ ¬Sc1 → Be↾ // amplify

The input enable signals Ae and Be are amplified and used to latch their respective data. This removes the validity trees that were in the datapath of the QDI approach.

Sda0 ∨ Ad0 ∧ Ae → Sda1⇂
Sda1 ∨ Ad1 ∧ Ae → Sda0⇂
¬Sda0 ∧ (¬Ad0 ∨ ¬Ae) → Sda1↾
¬Sda1 ∧ (¬Ad1 ∨ ¬Ae) → Sda0↾
Sdb0 ∨ Bd0 ∧ Be → Sdb1⇂
Sdb1 ∨ Bd1 ∧ Be → Sdb0⇂
¬Sdb0 ∧ (¬Bd0 ∨ ¬Be) → Sdb1↾
¬Sdb1 ∧ (¬Bd1 ∨ ¬Be) → Sdb0↾

5.2 Sign Compress by N

Many operators may produce redundant outputs. For example, the sum of 127, encoded as …01111111, and -128, encoded as …10000000, is -1. A digit-serial adder will produce the result redundantly encoded as …11111111 instead of the minimal encoding of …1. If data becomes too redundantly encoded, all of the previously described benefits of an adaptive digit-serial datapath are lost. So it may be prudent to occasionally compress the digit stream.

5.2.1 Behavioral Specification

Each digit stream can be divided into a collection of runs. Each run is a sequence of ones or a sequence of zeros. The last run in the encoding will always contain the cap token. If the last run is longer than a single token, then the encoding is redundant. Sign compression simply cuts the last run back down to a single token. Since all of the tokens in the run will be the same value, either all ones or all zeros, a run is stored and its output delayed by recording its sign and counting up each token as it arrives. If the run ends and it does not contain the cap token, then the whole run is emitted on the output using the stored sign and length.
Effectively, the digit stream is temporarily run-length encoded. Once the cap token has been received, there will be a run of some length stored in the counter. This run is simply dropped and the cap token is forwarded, sign compressing the value. For multi-bit datapaths, a token may not be all ones or all zeros. These tokens cannot be part of a run and can therefore bypass the run-length logic entirely.

The implementation requires two internal variables: v represents the sign, and n counts the number of tokens. For clarity, the ext() function generates a full token from a sign value, the sign() function determines the sign of a token by examining its most significant bit, and the chain() function returns true if a token is all ones or all zeros.

v := 0, n := 0;
∗[∗[ sign(Ld)≠v ∧ n>0 → R!(ext(v),0); n := n-1 ];
  v := sign(Ld);
  [ !chain(Ld) → R!(Ld,0)
  ▯ chain(Ld) ∧ Lc=0 → n := n+1
  ▯ Lc=1 → n := 0, R!(ext(v),1)
  ];
  L? ]

If the received token is not part of the stored run, then the run does not contain the cap token. So, the process loops over n, draining the stored run to the output. Then, the next run is recorded, setting v to the last bit in the input token. If the input token is neither all ones nor all zeros, it bypasses the run logic and is forwarded to the output. Otherwise, if the input token is not a cap, it is stored by incrementing n. If the input token is a cap, then the last run is dropped: n is reset and a cap token is forwarded.

Flattening this CHP exposes a large number of possible designs. This process must strike a balance between attempting to execute independent actions in parallel and the resulting circuit complexity. For example, after decrementing the currently stored run, the next token will immediately increment it again. Alternatively, if this is the last token in the stream, it is possible to end up clearing an already-cleared counter. These actions ultimately cancel each other out, suggesting opportunity for optimization. While these look like low-hanging fruit, they ultimately result in long transistor stacks in the forward drivers. Therefore, there are four basic conditions. Condition 1 implements the loop, decrementing when the input is not part of the run. Condition 2 implements the bypass case when the input is not part of any run. Condition 3 implements the run accumulation case that consumes inputs and increments n. Finally, condition 4 implements the cap condition in which n is reset and the cap token is forwarded.

v := 0, n := 0;
∗[[ sign(Ld)≠v ∧ n≠0 → n := n-1, R!(ext(v),0)
  ▯ !chain(Ld) ∧ n=0 → v := sign(Ld); R!(Ld,0); L?
  ▯ Lc=0 ∧ (sign(Ld)=v ∨ n=0) → n := n+1, v := sign(Ld); L?
  ▯ Lc=1 ∧ sign(Ld)=v → n := 0, R!(ext(v),1); L?
]]

This design exposes complex communication patterns between the control and datapath. While the acknowledgement requirements introduced by these patterns make a QDI multi-bit implementation inefficient, an integrated QDI/BD design avoids most of these requirements. Ultimately, making intelligent choices about what is implemented in the datapath will make all the difference in creating a simple and efficient design. The n variable and its associated increment, decrement, and clear actions are implemented efficiently by the idczn counter described in Chapter 4. This provides an exchange channel interface that maps well to the control's behavior. For a single-bit datapath, the control is simpler. The bypass case in condition 2 is no longer possible, and the use of the ext(), sign(), and chain() functions is no longer required.
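Before moving to circuits, the flattened behavior is easy to validate with a direct software model. The sketch below makes two assumptions worth flagging: tokens are (digit, cap) pairs of w bits, least significant first, and the cap token is itself all zeros or all ones, per the constraint discussed in Section 5.1.1. The ext(), sign(), and chain() helpers are spelled out explicitly.

    W = 2                                      # digit width, illustrative
    def sign(d):  return (d >> (W - 1)) & 1
    def ext(v):   return ((1 << W) - 1) if v else 0
    def chain(d): return d == ext(sign(d))

    def compress(stream):
        out, v, n = [], 0, 0
        for d, cap in stream:
            while sign(d) != v and n > 0:      # drain a broken run
                out.append((ext(v), 0))
                n -= 1
            v = sign(d)
            if cap:                            # drop the stored run, forward the cap
                n = 0
                out.append((ext(v), 1))
            elif not chain(d):                 # bypass: token cannot join a run
                out.append((d, 0))
            else:                              # accumulate the run
                n += 1
        return out

    # ...11 11 11 11 (a redundant -1) compresses to the single cap token ...11
    assert compress([(0b11, 0)] * 3 + [(0b11, 1)]) == [(0b11, 1)]

The count-then-drain structure that halves the throughput of the QDI circuit below is visible here as well: a broken run is replayed token by token before the next input is examined.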
The simplified single-bit specification follows.

v := 0, n := 0;
∗[[ Ld≠v ∧ n≠0 → n := n-1, R!(v,0)
  ▯ Lc=0 ∧ (Ld=v ∨ n=0) → n := n+1, v := Ld; L?
  ▯ Lc=1 ∧ Ld=v → n := 0, R!(v,1); L?
]]

5.2.2 QDI Only

In the end, the QDI approach requires a non-standard implementation which more closely resembles three separate interacting processes than just one. First, the circuitry necessary for the decrement loop and the resulting data send is implemented. D flags whether the input data is equal to the stored value V and is calculated every cycle. While this adds transitions to the overall cycle time, it also reduces the length of the transistor stacks and reduces the capacitive load on both the internal memory and the input data.

Rc0 = Cd
Dd1 ∧ Cn ∧ Re → Cd↾
Vd0 ∧ (Rc0 ∨ Rc1) → Rd0↾
Vd1 ∧ (Rc0 ∨ Rc1) → Rd1↾
¬Cn ∧ ¬Re → Cd⇂
¬Vd0 ∨ ¬Rc0 ∧ ¬Rc1 → Rd0⇂
¬Vd1 ∨ ¬Rc0 ∧ ¬Rc1 → Rd1⇂

Then, once the counter has been decremented down to zero, the value of the input is copied to the internal memory.

Dd1 ∧ Ld0 ∧ Cz ∨ Vd0 → Vd1⇂
(Dd1 ∧ Ld1 ∧ Cz ∨ Vd1) → Vd0⇂
(¬Vd0 ∨ ¬Ld0) ∧ (¬Vd1 ∨ ¬Ld1) → Dd0⇂
(¬Vd0 ∨ ¬Ld1) ∧ (¬Vd1 ∨ ¬Ld0) → Dd1⇂
(¬Dd1 ∧ ¬Ld0 ∨ ¬Cz) ∧ ¬Vd0 → Vd1↾
(¬Dd1 ∧ ¬Ld1 ∨ ¬Cz) ∧ ¬Vd1 → Vd0↾
(Vd0 ∧ Ld0 ∨ Vd1 ∧ Ld1) → Dd0↾
(Vd0 ∧ Ld1 ∨ Vd1 ∧ Ld0) → Dd1↾

This triggers the subsequent increments or clear, consuming the next carry chain.

(Cz ∨ Cn) ∧ Dd0 ∧ Lc0 → Ci↾
(Cz ∨ Cn) ∧ Dd0 ∧ Lc1 → Cc↾
Cc ∧ Re → Rc1↾
Ci ∨ Rc1 → Le⇂
¬Cz ∧ ¬Cn ∧ ¬Dd0 ∧ ¬Lc0 → Ci⇂
¬Cz ∧ ¬Cn ∧ ¬Dd0 ∧ ¬Lc1 → Cc⇂
¬Cc ∧ ¬Re → Rc1⇂
¬Ci ∧ ¬Rc1 → Le↾

These particular design choices significantly reduce the number of required transistors, but also force the average throughput of this approach to be half of the input throughput. This is because every input carry chain must be counted and drained to the output in separate steps. It is possible to preserve the throughput of this device by doing those two steps in parallel. Such a design would require two separate counters: one devoted only to increment, and one only to decrement and clear. In between, the value of the counter would have to be copied bit-parallel from one counter to the next.

5.2.3 Integrated QDI/BD

The timing assumptions introduced by the integrated QDI/BD design flow pose new challenges in the design of the compression unit. In particular, the condition that does not acknowledge L skips the delay lines and forces other methods of implementing the timing assumption. Once again, the control starts with the four forward driving cases. The logic for these drivers uses three input data. Cz and Cn implement a one-of-two encoding that signals whether the counter is zero or not. Lc0 and Lc1 implement a one-of-two encoding with the input request specifying whether this token is a cap token, and D is a one-of-three encoding: Dd0 signals that Ld=ext(v), Dd1 signals Ld=ext(¬v), and Dd2 signals !chain(Ld). D comes from the datapath and is assumed to be stable by the time a request is received on Lc0 or Lc1. The output signals are named by their interaction with the counter. Cd decrements the counter, Cs skips the counter action, Ci increments the counter, and Cc clears it. Rc0 is calculated from the internal nodes of the C-elements driving Cd and Cs, and Rc1 lines up with Cc. Finally, all of the output requests except Cd acknowledge the input.

Re ∧ Cn ∧ (Dd1 ∨ Dd2) ∧ (Lc0 ∨ Lc1) → Cd↾
Re ∧ Cz ∧ Dd2 ∧ Lc0 → Cs↾
(Cz ∧ (Dd0 ∨ Dd1) ∨ Cn ∧ Dd0) ∧ Lc0 → Ci↾
Re ∧ (Cz ∧ (Dd0 ∨ Dd1) ∨ Cn ∧ Dd0) ∧ Lc1 → Cc↾
¬_Cd ∨ ¬_Cs → Rc0↾
Rc1 = Cc
Ci ∨ Cc ∨ Cs → Le⇂

This yields a fairly clean WCHB implementation, and the reset behavior is as expected.
¬Re ∧ ¬Cn → Cd⇂
¬Re ∧ ¬Ld0 → Cs⇂
¬Cz ∧ ¬Cn ∧ ¬Ld0 → Ci⇂
¬Re ∧ ¬Cz ∧ ¬Cn ∧ ¬Ld1 → Cc⇂
_Cd ∧ _Cs → Rc0⇂
¬Ci ∧ ¬Cc ∧ ¬Cs → Le↾

Meanwhile, the datapath shown in Fig. 58 is somewhat more complex. Pass transistor logic reduces its energy requirements and increases its maximum switching frequency. A variant of the Manchester carry chain checks the conditions that are finally fed into the control block as D. This yields an implementation with high frequency at low energy and area requirements.

The MSB of #Ld is copied into v in every condition in which L is acknowledged except for the last. However, in the last condition, when the input is a cap token, the next value of v does not matter because the associated clear guarantees that n is 0. So v can be set every time L is acknowledged. This allows v to be implemented as a flip-flop in the datapath, making it accessible to the comparison operation between #Ld and ext(v), and its assignment to the MSB of #Ld.

To start, each bit needs to pick between sending Ld or ext(v) on R in order to implement the bypassing case with Cs. This is done with pass transistor multiplexers. When Zd1 is high, it passes the value from Ld to Rd. When Zd0 is high, it passes the value from V to Rd. Z is one when the counter is zero, and comes from latching Cn and Cz using the input acknowledge.

¬@Ld0 ∧ Zd1 ∨ ¬@Vd0 ∧ Zd0 → Rd0⇂
@Ld0 ∧ ¬Zd0 ∨ @Vd0 ∧ ¬Zd1 → Rd0↾
¬@Ld1 ∧ Zd1 ∨ ¬@Vd1 ∧ Zd0 → Rd1⇂
@Ld1 ∧ ¬Zd0 ∨ @Vd1 ∧ ¬Zd1 → Rd1↾

Next, the equality checks are implemented, with Dd0 representing Ld=ext(v), Dd1 representing Ld=ext(¬v), and Dd2 representing !chain(Ld). To do this, each bit will have two Ci signals and two Co signals. Cod0 should be low if this and all previous bits are 0, and Cod1 if they are 1. This effectively forms two parallel carry chains.

¬@Cid0 ∧ Ld0 → Cod0⇂
@Cid0 ∧ ¬Ld1 ∨ ¬Ld0 → Cod0↾
¬@Cid1 ∧ Ld1 → Cod1⇂
@Cid1 ∧ ¬Ld0 ∨ ¬Ld1 → Cod1↾

Finally, in the MSB, these two carry signals are compared against the carry chain token value stored in V to implement the equality checks.

(Cod0 ∨ Vd1) ∧ (Cod1 ∨ Vd0) → Dd0⇂
(Cod0 ∨ Vd0) ∧ (Cod1 ∨ Vd1) → Dd1⇂
¬Cod0 ∧ ¬Vd1 ∨ ¬Cod1 ∧ ¬Vd0 → Dd0↾
¬Cod0 ∧ ¬Vd0 ∨ ¬Cod1 ∧ ¬Vd1 → Dd1↾
¬Dd0 ∧ ¬Dd1 → Dd2↾
Dd0 ∨ Dd1 → Dd2⇂

Fig. 58: The architecture of the integrated QDI/BD stream full compression unit.

5.3 Sign Compress by One

The full compression unit has a few issues when applied to certain problem spaces. First, the total number of bits in the counter dictates the maximum length of any run before incorrect results are given. This means that it does not ultimately support arbitrary precision arithmetic. It also means that longer runs require logarithmically more counter units, making the design area hungry. Second, it is possible that a run is close to the full length of the value but the cap token is not part of that run. For example, -128 is minimally encoded as …10000000 with a run of 7 bits and a length of 8. This implementation would be forced to cut the throughput of such numbers in half by waiting for the whole run to be consumed before emitting it again and moving on. This means that the device can vary dramatically between half and full throughput.

Instead of storing and collapsing the whole run, a limit could be imposed on its length, simply passing the remaining bits once that limit is reached. Implementing an arbitrary limit would require an idczfn counter and would likely be fairly expensive. However, implementing a limit of one token simply requires a single bit register in the compression unit.
A limit of one also lends itself to an implementation with a guaranteed constant throughput because the stored value is only important when the current input token is the cap. Therefore, it may be possible to spread these compression units throughout a computational fabric and execute the compression over the course of multiple operations without causing gridlock in the network. 128 5.3.1 Behavioral Specification Unfortunately, the transition between two input streams complicates the necessary control because there is not a stored token. So two values must be stored. v is a multi-bit storage that records the previous token's data, and n signals whether v is valid. Luckily, n is directly represented by the previous token's control signifying cap/not cap. v is not yet valid for any token proceeding a cap token and is valid otherwise. v := 0, n := 0 ∗[[ Lc=0 ∧ n=0 → n := 1, v := Ld; L? ▯ Lc=0 ∧ n=1 → R!(v,0); v := Ld; L? ▯ Lc=1 ∧ n=0 ∧ v≠Ld → n := 1, v := Ld ▯ Lc=1 ∧ n=1 ∧ v≠Ld → R!(v,0); v := Ld ▯ Lc=1 ∧ v=Ld → R!(v,1); n := 0, v := Ld; L? ]] The first condition handles the first token in the stream. v is not valid yet, so the input data needs to be loaded into v and n needs to be set. The second condition handles the majority of the stream. Lc=0 means that the end of the stream has not been reached and n=1 means that the first token has already been received. In this case the previous token stored in v should be forwarded and the new input should be loaded into v . Then, there are three stream completion cases. In the first case, the cap token is also the first token of the stream. This case only happens in the context of a stream representing 0 or -1 . To simplify the circuitry for the output request on R , the input is loaded into v and left unacknowledged. This will transition directly into the last case. In the second case, the stored token is a different value from the cap token. This means that the digit stream cannot be compressed. So v is forwarded and the new input is stored. Again, the input is not acknowledged, transitioning directly into the last case. In the last case, the cap token on the input and the stored value in v are the same, meaning that the digit-stream can be compressed. So, the value in v is forwarded as a cap token and the input is acknowledged. This completes the stream and resets n to 0. Once again, the specification for a single bit datapath is simpler. v and n can be merged into a single internal memory encoding three states: valid 0, valid 1, and invalid. Furthermore because datapath is QDI, the third condition can be merged with the last. Instead of first loading the input data into the internal storage, it is forwarded directly to the output. Otherwise, the specification is about the same. 129 v := inv; ∗[[ Lc=0 ∧ v=inv → v := Ld; L? ▯ Lc=0 ∧ v≠inv → R!(v,0); v := Ld; L? ▯ Lc=1 ∧ (v≠inv ∧ v≠Ld) → R!(v,0); v := Ld ▯ Lc=1 ∧ (v=inv ∨ v=Ld) → R!(Ld,1); v := inv; L? ]] 5.3.2 QDI Only Unfortunately, while the CHP looks fairly clean, the logic for the forward drivers, the internal memory, and the input acknowledgement are almost entirely independent. Furthermore, the equality checks hide a significant amount of complexity introduced by the XOR operations necessary for their implementation. So there end up being 10 forward driving signals and no real way to optimize them. The first two handle condition 1, saving the input data value in an internal driver for the express purpose of setting the internal memory. 
The next four handle condition 2, and must remain separate because the stored value of v that is subsequently forwarded on R could be different from the input value on L that is then stored in v. The next two handle condition 3, forwarding the stored value without consuming the input when the cap token has been received. Finally, the last two actually implement the compression in condition 4.

// Condition 1
v2 ∧ Lc0 ∧ Ld0 → Rxd0↾
v2 ∧ Lc0 ∧ Ld1 → Rxd1↾
// Condition 2
Re ∧ v0 ∧ Lc0 ∧ Ld0 → Rxd2↾
Re ∧ v0 ∧ Lc0 ∧ Ld1 → Rxd3↾
Re ∧ v1 ∧ Lc0 ∧ Ld0 → Rxd4↾
Re ∧ v1 ∧ Lc0 ∧ Ld1 → Rxd5↾
// Condition 3
Re ∧ v1 ∧ Lc1 ∧ Ld0 → Rxd6↾
Re ∧ v0 ∧ Lc1 ∧ Ld1 → Rxd7↾
// Condition 4
Re ∧ Lc1 ∧ (v2 ∨ v0) ∧ Ld0 → Rxd8↾
Re ∧ Lc1 ∧ (v2 ∨ v1) ∧ Ld1 → Rxd9↾

Because there are so many forward drivers, validity trees become absolutely necessary to derive the signals for the output channel's request. In a quest to reduce transistors, _Rd0 and _Rd1 are shared across the output channel's data, the control, and the input acknowledgement: Rxd7 and Rxd8 are omitted from _Rd0, Rxd6 and Rxd9 are omitted from _Rd1, and all four are added back later in the computation. This makes for some fairly complex forward logic.

Rxd0 ∨ Rxd1 → _Rs⇂
Rxd2 ∨ Rxd3 → _Rd0⇂
Rxd4 ∨ Rxd5 → _Rd1⇂
¬_Rd0 ∨ ¬_Rxd7 ∨ ¬_Rxd8 → Rd0↾
¬_Rd1 ∨ ¬_Rxd6 ∨ ¬_Rxd9 → Rd1↾
¬_Rd0 ∨ ¬_Rd1 ∨ ¬_Rxd6 ∨ ¬_Rxd7 → Rc0↾
¬_Rxd8 ∨ ¬_Rxd9 → Rc1↾
¬_Rs ∨ ¬_Rd0 ∨ ¬_Rd1 → _Le↾
_Le ∨ Rc1 → Le⇂

Then, the Rx signals are used to set the value of the internal memory. In the case of Rxd0 and Rxd1, a token is not being sent on R, so v must wait for the input to reset, which is the slowest possible implementation of an internal memory.

(¬v1 ∧ ¬v2 ∨ ¬Ld0 ∧ ¬_Rxd0 ∨ ¬Re ∧ (¬_Rxd2 ∨ ¬_Rxd4 ∨ ¬_Rxd6)) → v0↾
(¬v0 ∧ ¬v2 ∨ ¬Ld1 ∧ ¬_Rxd1 ∨ ¬Re ∧ (¬_Rxd3 ∨ ¬_Rxd5 ∨ ¬_Rxd7)) → v1↾
¬v0 ∧ ¬v1 ∨ ¬Re ∧ (¬_Rxd8 ∨ ¬_Rxd9) → v2↾
(v1 ∨ v2) ∧ (Ld0 ∨ _Rxd0) ∧ (Re ∨ _Rxd2 ∧ _Rxd4 ∧ _Rxd6) → v0⇂
(v0 ∨ v2) ∧ (Ld1 ∨ _Rxd1) ∧ (Re ∨ _Rxd3 ∧ _Rxd5 ∧ _Rxd7) → v1⇂
(v0 ∨ v1) ∧ (Re ∨ _Rxd8 ∧ _Rxd9) → v2⇂

Finally, the forward logic is reset following the typical WCHB template with conditional acknowledgement. The first condition does not produce any output on R, so Re will not be lowered. Rxd2 and Rxd5, implementing the second condition, do not change the value of the internal memory, so those acknowledgements are removed. The third condition does not acknowledge the input, so input request neutrality does not need to be checked. Finally, the last condition must check all of these features.

// Condition 1
¬v2 ∧ ¬Lc0 ∧ ¬Ld0 → Rxd0⇂
¬v2 ∧ ¬Lc0 ∧ ¬Ld1 → Rxd1⇂
// Condition 2
¬Re ∧ ¬Lc0 ∧ ¬Ld0 → Rxd2⇂
¬Re ∧ ¬v0 ∧ ¬Lc0 ∧ ¬Ld1 → Rxd3⇂
¬Re ∧ ¬v1 ∧ ¬Lc0 ∧ ¬Ld0 → Rxd4⇂
¬Re ∧ ¬Lc0 ∧ ¬Ld1 → Rxd5⇂
// Condition 3
¬Re ∧ ¬v1 → Rxd6⇂
¬Re ∧ ¬v0 → Rxd7⇂
// Condition 4
¬Re ∧ ¬v0 ∧ ¬Lc1 ∧ ¬Ld0 → Rxd8⇂
¬Re ∧ ¬v1 ∧ ¬Lc1 ∧ ¬Ld1 → Rxd9⇂

Resetting the validity trees and the input acknowledge is simply a matter of implementing the other half of the combinational gates.

¬Rxd0 ∧ ¬Rxd1 → _Rs↾
¬Rxd2 ∧ ¬Rxd3 → _Rd0↾
¬Rxd4 ∧ ¬Rxd5 → _Rd1↾
_Rd0 ∧ _Rxd7 ∧ _Rxd8 → Rd0⇂
_Rd1 ∧ _Rxd6 ∧ _Rxd9 → Rd1⇂
_Rd0 ∧ _Rd1 ∧ _Rxd6 ∧ _Rxd7 → Rc0⇂
_Rxd8 ∧ _Rxd9 → Rc1⇂
_Rs ∧ _Rd0 ∧ _Rd1 → _Le⇂
¬_Le ∧ ¬Rc1 → Le↾

This is a stark example of everything that can go wrong when attempting QDI design. When nothing lines up, complexity can grow very quickly for seemingly simple specifications. Because of this, QDI design often takes a very long time.
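For contrast, the function being implemented is tiny. The following is a software model of the single-bit specification above, with INV standing in for the invalid state; as in the CHP, the third condition leaves the input token unconsumed and falls into the last case on the next iteration.

    INV = None

    def compress_one(stream):
        out, v, i = [], INV, 0
        while i < len(stream):
            d, cap = stream[i]
            if not cap:
                if v is not INV:
                    out.append((v, 0))   # forward the stored bit
                v = d; i += 1
            elif v is not INV and v != d:
                out.append((v, 0))       # cannot compress: flush v and
                v = d                    # re-examine the same cap token
            else:
                out.append((d, 1))       # drop one redundant token
                v = INV; i += 1
        return out

    assert compress_one([(1, 0), (1, 0), (1, 1)]) == [(1, 0), (1, 1)]   # ...111 -> ...11
    assert compress_one([(0, 0), (1, 1)]) == [(0, 0), (1, 1)]           # ...10 is minimal

The gap between this model and the ten forward drivers above is the point: the complexity lies in the acknowledgement structure, not in the function itself.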
5.3.3 Integrated QDI/BD

Separating the control and datapath untangles these independent concerns and allows for simpler XORs with standard clocked logic. This ultimately simplifies the control. However, any feature that might add complexity to the datapath must be reined in. Therefore, a few invariants are maintained in an attempt to optimize the circuitry. First, the data for the output request on R always comes from v. This removes any muxing from the datapath and redirects that complexity into the control. Second, the value stored in v is always set using the data on the input channel L. These two factors allow the datapath to be implemented as a set of flops that shift the data backwards in the stream by a single pipeline stage. Third, the control circuitry is only dependent upon an equality test from the datapath, and the datapath is only dependent upon clocking signals from the control. This allows for a fairly strict separation between the two, simplifying the control.

There are two primary challenges presented by the specification. The first is that the clocking signal for the input latches on the input request data #Ld and the clocking signal for the extra set of flops implementing v are different. v needs to be clocked on every iteration of the control, while the input request should only be clocked on conditions 1, 2, and 5. The second challenge is that conditions 3 and 4 bypass the delay line on the input request #Lc, but still change the value of v and therefore of the equality test between #Ld and v. This forces the creation of a signal specifically for those two conditions with its own delay line.

The implementation starts with the five conditions that drive the output request on R. Except for Ls, the signals used to compute these conditions come directly from the specification. Ls is the extra delay line signal for conditions 3 and 4, and D signals whether Ld is different from v. Since conditions 3 and 4 can only transition to 4 or 5, only conditions 4 and 5 must check the reset of Ls.

Re ∧ n0 ∧ Ld0 → Rxd0↾
Re ∧ n1 ∧ Ld0 → Rxd1↾
Re ∧ n0 ∧ Dd1 ∧ Ld1 → Rxd2↾
Re ∧ Ls ∧ n1 ∧ Dd1 ∧ Ld1 → Rxd3↾
Re ∧ Ls ∧ Dd0 ∧ Ld1 → Rxd4↾

Then, these conditions are used to drive the output request, the input enable, and the extra delay signal. Because conditions 1 and 3 only serve to load the input into the internal memory, they do not forward any request on the output. Furthermore, conditions 3 and 4 redirect the control to condition 5 and therefore do not acknowledge the input request.

¬_Rxd1 ∨ ¬_Rxd3 → Rd0↾
Rd1 = Rxd4
Rxd0 ∨ Rxd1 ∨ Rxd4 → Le⇂
Rxd2 ∨ Rxd3 → Ls⇂

Because the value of the input control token is reflected in the output request rails, they can be used to set the internal memory unit for n. Conditions 1 and 3 set n to 1, while condition 5 sets it to 0. The value of n for conditions 2 and 4 is already set correctly. So, the reset phase of the output request is then used to acknowledge the transitions on the internal memory and reset the input acknowledges. The delay line for Ls is placed between the driver and all subsequent usages.

(¬n0 ∨ ¬Ld0 ∧ ¬_Rxd0 ∨ ¬Ls ∧ ¬_Rxd2) → n1↾
n0 ∧ (Ld0 ∨ _Rxd0) ∧ (Ls ∨ _Rxd2) → n1⇂
¬n1 ∨ ¬Re ∧ ¬_Rxd4 → n0↾
n1 ∧ (Re ∨ _Rxd4) → n0⇂
¬Ld0 ∧ ¬n0 → Rxd0⇂
¬Re ∧ ¬Ld0 → Rxd1⇂
¬Ls ∧ ¬n0 → Rxd2⇂
¬Re ∧ ¬Ls → Rxd3⇂
¬Re ∧ ¬Ld1 ∧ ¬n1 → Rxd4⇂
_Rxd1 ∧ _Rxd3 → Rd0⇂
¬Rxd0 ∧ ¬Rxd1 ∧ ¬Rxd4 → Le↾
¬Rxd2 ∧ ¬Rxd3 → Ls↾

Fig. 59: The architecture of the integrated QDI/BD Stream Compress One unit.

To implement the datapath in Fig. 59, the clocking signal for v must be generated using Le and Ls. Meanwhile, the clock signal for the input data is just Le.
Ls ∧ Le → vclk↾
¬Ls ∨ ¬Le → vclk⇂

Then, the equality check between Ld and v is implemented for each bit.

Ld0 ∧ Rd1 ∨ Ld1 ∧ Rd0 → Cd0⇂
¬Ld0 ∧ ¬Rd0 ∨ ¬Ld1 ∧ ¬Rd1 → Cd0↾
Cd0 → Cd1⇂
¬Cd0 → Cd1↾

Inspiration is taken from a Manchester carry chain to propagate this equality check across the bits using pass transistors.

¬@Di ∧ Cd0 → Do⇂
@Di ∧ ¬Cd1 ∨ ¬Cd0 → Do↾

And finally, the one-hot encoding D is generated for the equality check used in the control.

Dd1 = Do
Do → Dd0⇂
¬Do → Dd0↾

5.4 Serial to Parallel

Serial to parallel conversion is ultimately required for multiplication and shifting. Digit-serial multiplication requires an array of adders to sum up the partial products, so it actually uses a hybrid architecture with one serial operand and one parallel. Similarly, digit-serial shifting requires a counter to keep track of the number of tokens that have either been pushed to the front of the stream or popped from it. This counter is loaded with a bit-parallel operand. For both of these, extra work must be done to convert the second operand from serial to parallel.

5.4.1 Behavioral Specification

There are, however, a few complicating factors. The array architecture limits multi-node operations to a distributed control. Furthermore, it is assumed that each network node will have a pipeline stage to maintain switching frequency. This rules out the three standard approaches to this problem: a tree of alternating splits, a counter with a demultiplexer [232], or a broadcast channel. Ultimately, this leaves two possible approaches.

For the "upflow" approach, the stream flows up from the least significant parallel output channel to the most significant. The first token in the stream is emitted out the first parallel output channel, and every consecutive token in the stream is forwarded to the next least significant node. This is repeated, popping the first token off the stream at each node until it has been fully converted. Unfortunately, this strategy introduces significant skew to the output tokens, likely affecting performance.

The "downflow" approach reverses the direction of flow. The digit stream flows down across all of the output channels until the first token reaches the bottom of the converter. At this point, all of the tokens in the stream will be aligned with the correct output channel. This allows them to be emitted in parallel with very little skew.

Each of these approaches implies strong constraints on the construction of the CGRA and its network architecture. The upflow approach allows for dynamic allocation of neighboring idle nodes on the array as in Fig. 60. Meanwhile, the downflow approach allocates a multi-node operation through nodes on which the input digit stream is already being routed as in Fig. 61. This effectively folds the overhead of the serial-to-parallel converter into the routing network's own pipeline structure. However, this approach would also require knowledge of exactly how many nodes are required for the operation ahead of time. As discussed before, this would prevent the system from truly implementing arbitrary-length arithmetic. To get the most capacity from the array, it may be prudent to use a hybrid approach, starting with downflow and switching to upflow as needed. Therefore, implementations for both of these approaches will be demonstrated in this chapter.
Unfortunately, the naive implementations of these circuits will deadlock. First, both the multiplier and the counter leave the parallel channels unacknowledged until the completion of the operator. Second, once the operator has completed, the counter will acknowledge its parallel inputs from bottom to top while the multiplier will acknowledge them from top to bottom. Third, the basic circuit template implements a half-buffered pipeline. This means that each token requires two spaces in the pipeline, and that the second token will be blocked until the first parallel output is acknowledged. Careful design is needed to work within these constraints.

While it would be easy to add extra pipeline stages wherever needed, doing so incurs significant overhead. Overall, a half-buffered pipeline has enough latching layers to hold all of the parallel outputs. The failure, instead, lies with the typical QDI control templates. Both the upflow and downflow approaches start with a standard half-buffered pipeline, adding parallel output channels at each pipeline stage.

Fig. 60: Structure of multi-node operations for the first (upflow) approach to serial to parallel conversion.

Fig. 61: Structure of multi-node operations for the second (downflow) approach to serial to parallel conversion.

For the upflow approach, each stage must be able to store the parallel digit while passing the remaining serial digits, requiring a minimum of two latching layers. Furthermore, the QDI control for the parallel output channel must not deadlock the QDI control for the serial pipeline. This means that the parallel output must be driven by a full buffer. Overall, the behavioral description of the upflow approach is unable to capture these constraints. The specification has three channels for data: Si receives the incoming serial stream from below, Pi forwards the first token on the parallel output channel, and Si+1 forwards the remaining tokens of the serial stream up until a cap token is received.

∗[ Sn-1?v; Pn-1!v; ∗[ v=0 → Sn-1?v ] ] ∥
…
∗[ Si?v; Pi!v; ∗[ v=0 → Si?v; Si+1!v ] ] ∥
…
∗[ S0?v; P0!v; ∗[ v=0 → S0?v; S1!v ] ]

For the downflow approach, each pipeline stage can deadlock once it has forwarded its parallel output. This means that only one latching layer is required at each pipeline stage and the parallel output can be driven by a half buffer. The specification therefore has three channels for data: Si+1 receives the incoming serial stream from above, Si forwards the remaining tokens of the serial stream down, and Pi is the parallel output for a single token. Given a half-buffered pipeline, it is guaranteed that the parallel output in the next stage will stall the pipeline, keeping the enable of Si low until after the completion of the whole operator. Unfortunately, there is no clean way to differentiate that event from a simple pipeline stall, necessitating the use of extra control channels Ci and Ci+1. After this stall, a token will arrive on Ci signalling that this stage should forward its token waiting on Si+1 through Pi. Once that is done, a signal can be sent through Ci+1 to continue the process.

∗[[ Sn-1 → Sn?v; Sn-1!v ▯ Cn-1 → Sn?v, Cn-1?; Pn-1!v ]] ∥
…
∗[[ Si → Si+1?v; Si!v ▯ Ci → Si+1?v, Ci?; Pi!v, [v=0 → Ci+1! ▯ v=1 → skip] ]] ∥
…
∗[ S1?v; P0!v, [v=0 → C1! ▯ v=1 → skip] ]
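Stripped of the handshakes, the downflow ordering reduces to a few lines of Python. This is a toy model under stated assumptions (a list of tokens, one latching layer per stage, and no modeling of the Ci/Cu control timing), not the circuit:

    def downflow(tokens):
        # Toy model: tokens ripple down a single latching layer per stage;
        # once the first token reaches the bottom, stage i holds token i
        # and every Pi can fire together with very little skew.
        n = len(tokens)
        latches = [None] * n
        for step in range(n):
            latches = latches[1:] + [tokens[step]]  # shift down, inject on top
        return latches  # the parallel word, least significant digit at P0

After n shifts every latch is full, which is exactly the aligned, low-skew emission described above; the Ci/Cu control pipeline exists only to detect that moment.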
5.4.2 Upflow Approach

For simplicity, the serial channels of a single pipeline stage, Si and Si+1, have been renamed D for down and U for up respectively. As previously discussed, each stage is divided into two interacting circuits: a PCFB drives the parallel output channel P, and a WCHB drives the serial up-going channel U.

The PCFB starts the handshake, forwarding the first token from D to the parallel output channel P. The input enable De is then lowered by way of the state variable x, and the standard PCFB state variable en signals the start of the reset phase. The input channel D resets and De is raised in compliance with the standard PCFB handshake.

en ∧ D0 → P0↾
en ∧ D1 → P1↾
en ∧ (P0 ∨ P1) → x↾
P0 ∨ P1 → Pn⇂
¬De ∧ ¬Pn → en⇂
¬en ∧ ¬D0 ∧ ¬D1 → x⇂

During the reset phase of the PCFB, the WCHB forwards remaining tokens through the up-going channel U. The WCHB is forced to wait until the reset phase of the PCFB through De, _en, and P0 in the forward drivers for U. This continues until the cap token is received. Before the reset of U1 can complete, it waits until P0 is reset.

De ∧ Ue ∧ _en ∧ P0 ∧ D0 → U0↾
De ∧ Ue ∧ _en ∧ P0 ∧ D1 → U1↾
x ∨ U0 ∨ U1 → De⇂
¬Ue ∧ ¬D0 → U0⇂
¬Ue ∧ ¬D1 ∧ ¬P0 → U1⇂
¬x ∧ ¬U0 ∧ ¬U1 → De↾

Once the cap token is forwarded up, the parallel channel P is allowed to reset. This is enforced by the check on _U1 in the reset rule for P0. Then the PCFB handshake waits for the reset of the WCHB to complete by checking _U1 in the reset rule for en.

¬en ∧ ¬Pe ∧ ¬_U1 → P0⇂
¬en ∧ ¬Pe → P1⇂
¬P0 ∧ ¬P1 → Pn↾
De ∧ Pe ∧ Pn ∧ _U1 → en↾

Once everything has reset, the process starts again with a new digit stream. Therefore, P is allowed to deadlock for as long as necessary without affecting the other tokens flowing through the system, and is allowed to reset in any order with respect to the other parallel channels.

For the datapath, the input requests on D are delayed to implement the bundled-data timing assumption. en is amplified and used to clock the layer of latches for P, which are open when en is high and closed otherwise. Finally, the input enable on D is amplified and used to clock the layer of latches for U. Once again, those latches are open when De is high and closed otherwise. While this QDI control is fairly non-standard, it allows the datapath to stay as small and simple as possible while faithfully implementing those constraints. Overall, this has a dramatic effect on the transistor count.

5.4.3 Downflow Approach

The downflow approach is similarly non-standard, but ultimately simpler. Tokens flow down the converter until they reach their destination. Each token lines up one after another, so only one layer of latches is necessary. Once again, the two serial channels Si and Si+1 have been renamed to D for down and U for up respectively. The control channels have been similarly renamed to Cu and Cd.

*[[ #D → U?v; D!v
 [] #Cd ∧ #U=0 → U?v, Cd?; P!v, Cu!
 [] #Cd ∧ #U=1 → U?v, Cd?; P!v
 ]]

Because no assumptions are made about the order in which the parallel outputs are reset, it is possible for the reset of a parallel channel to break the guarantee for the stall on D. Specifically, if the parallel channel of the node attached to D is reset before P, then De will be raised. In parallel, U has been acknowledged and the handshake is waiting on the input requests on U to be lowered. This could allow the token that was forwarded through P to be duplicated onto D, producing an incorrect result later in the pipeline and possibly causing an instability. This means that the rules generating the requests on D must be gated by the output requests on P.
Pe ∧ De ∧ Ud0 ∧ _Pd0 → Dd0↾
Pe ∧ De ∧ Ud1 ∧ _Pd1 → Dd1↾
Dd0 ∨ Dd1 ∨ _Cde → Ue⇂
¬De ∧ ¬Ud0 → Dd0⇂
¬De ∧ ¬Ud1 → Dd1⇂
¬Dd0 ∧ ¬Dd1 ∧ ¬_Cde → Ue↾

After D lowers the enable, a token will be received on Cd. If the token on Cd arrives before the input requests on U are lowered, then it is possible for the token on D to be duplicated to P. This will produce an incorrect result on P, which may be unstable because the input requests on U can be lowered before that transition completes. This means that the rules generating the requests on P must be gated by the output requests on D.

Finally, there can be no constraints on the order in which the parallel channels are reset. They could be reset from bottom to top, top to bottom, or in parallel. The control channels Cu and Cd form a pipeline with tokens running from the bottom of the converter to the top. If a typical WCHB pipeline is used, then that will force the parallel channels to reset from bottom to top, causing deadlock. This means that the Cd?; Cu! pipeline must be at least a PCHB reshuffling.

Pe ∧ Cue ∧ Cdd0 ∧ Ud0 ∧ _Dd0 → Pd0↾
Pe ∧ Cdd0 ∧ Ud1 ∧ _Dd1 → Pd1↾
Cud0 = Pd0
Pd0 ∨ Pd1 → Cde⇂
¬Cde → _Cde↾
¬Pe ∧ ¬Cue ∧ ¬Ud0 → Pd0⇂
¬Pe ∧ ¬Ud1 → Pd1⇂
¬Cdd0 ∧ ¬Pd0 ∧ ¬Pd1 → Cde↾
Cde → _Cde⇂

Now a single layer of latches can be clocked to serve both U and P.

Ue ∧ Pe → clk⇂
¬Ue ∨ ¬Pe → clk↾

While tokens are being forwarded through D, before a control token is received on Cd, the latches serve to pass data along D. Then, a control token is received and a token is passed onto P. At this point the latches lock with the value passed through P and are not used again until P completes its reset.

5.5 Evaluation

5.5.1 Sign Extension

Overall, five different approaches were explored: the QDI PCHB and Integrated approaches presented in this chapter, an Integrated approach with control logic similar to the QDI PCHB, a Bundled-Data approach, and a QDI WCHB approach. The QDI approaches were measured with both 4-bit and 1-bit datapaths while the others were measured with only 4-bit datapaths. Fig. 62 shows the average performance for each unit assuming a maximum input bitwidth of 64 bits.

Overall, the Integrated QDI/BD design presented in this chapter is the most desirable solution for a few reasons: it has the lowest energy requirement, a reasonably high throughput per transistor, and it is fairly simple to graft more complex control onto it. This will be important for the development of an efficient adder and could help with efficient bitwise operators.

Table 4 shows the raw per-token measurements for each condition of the sign extension units. In the "ab" condition, the control tokens on the inputs are both internal tokens, and thus both inputs are acknowledged and forwarded. In the "a" and "b" conditions, only the token on A or B respectively is an internal token; therefore, only that input is acknowledged, sign-extending the other. Finally, in the "cap" condition, both inputs are cap tokens, so both are acknowledged, completing the operation. The frequency and latency are fairly consistent between the four conditions, but the energy drops significantly during the extend conditions for the Integrated and Bundled-Data solutions because only one of the two inputs has to be latched.

Type                         Transistors   Condition   Token Frequency   Energy/Token   Latency
Integrated Serial Adaptive   181           ab          2.76 GHz          36.31 fJ       92 ps
(Extensible 4-bit)                         a           2.79 GHz          23.98 fJ       90 ps
                                           b           2.81 GHz          24.72 fJ       89 ps
                                           cap         2.82 GHz          27.95 fJ       86 ps
Integrated Serial Adaptive   178           ab          3.04 GHz          50.15 fJ       123 ps
(Standard 4-bit)                           a           3.19 GHz          28.92 fJ       116 ps
                                           b           3.17 GHz          30.24 fJ       120 ps
                                           cap         3.94 GHz          28.70 fJ       87 ps
BD Serial Adaptive           188           ab          3.01 GHz          58.52 fJ       117 ps
(4-bit)                                    a           2.84 GHz          35.52 fJ       123 ps
                                           b           3.03 GHz          34.66 fJ       113 ps
                                           cap         3.09 GHz          47.93 fJ       111 ps
QDI Serial Adaptive          445           ab          1.38 GHz          125.06 fJ      286 ps
(PCHB 4-bit)                               a           1.36 GHz          120.98 fJ      284 ps
                                           b           1.36 GHz          121.00 fJ      285 ps
                                           cap         1.38 GHz          122.61 fJ      285 ps
QDI Serial Adaptive          504           ab          1.86 GHz          130.88 fJ      69 ps
(WCHB 4-bit)                               a           1.86 GHz          137.16 fJ      61 ps
                                           b           1.87 GHz          135.92 fJ      63 ps
                                           cap         1.85 GHz          135.40 fJ      56 ps
QDI Serial Adaptive          169           ab          1.98 GHz          46.07 fJ       187 ps
(PCHB 1-bit)                               a           1.97 GHz          41.05 fJ       184 ps
                                           b           1.97 GHz          40.87 fJ       183 ps
                                           cap         1.99 GHz          43.21 fJ       184 ps
QDI Serial Adaptive          228           ab          2.44 GHz          53.48 fJ       69 ps
(WCHB 1-bit)                               a           2.44 GHz          54.93 fJ       61 ps
                                           b           2.46 GHz          54.46 fJ       63 ps
                                           cap         2.42 GHz          55.11 fJ       57 ps

Table 4. Raw performance measurements for the sign extension units.

Fig. 62: Overview of the sign-extension unit performance.
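The four conditions can also be summarized with a small behavioral sketch in Python. The (cap, digit) token pairs and lockstep pairing are illustrative assumptions; this models the alignment only, not the circuit:

    def sign_extend(a_tokens, b_tokens):
        # Consume both streams in lockstep; when one stream reaches its cap
        # token, replay its cap digit (the "a"/"b" conditions) until the
        # other stream also caps (the "cap" condition).
        out_a, out_b, i, j = [], [], 0, 0
        while True:
            (ca, da), (cb, db) = a_tokens[i], b_tokens[j]
            if ca and cb:                       # "cap": both streams complete
                out_a.append((1, da)); out_b.append((1, db))
                return out_a, out_b
            out_a.append((0, da)); out_b.append((0, db))
            if not ca: i += 1                   # "ab" or "a": consume A
            if not cb: j += 1                   # "ab" or "b": consume B

Extending [(0,1), (1,0)] against a longer stream simply replays the zero cap digit, which is why the extend conditions need to latch only one of the two inputs.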
The sign extension unit is used by the addition, subtraction, AND, OR, and XOR operators. Therefore, the average utilization of the four behavioral conditions is determined by the joint bitwidth distribution of the two inputs to those operators, as shown in Fig. 63. This is moderately different from the output bitwidth of the addition operator in Fig. 32. The center plot shows the joint probability distribution, while each histogram shows the associated individual probability and cumulative distributions for that axis.

Fig. 63: Combined probability distribution of the input bitwidths to the addition, subtraction, and bitwise operations.

However, this plot includes some operations that should not be handled by a digit-serial architecture. Specifically, there are significant spikes around 47 and 48 bits, as discussed in Chapter 2 Section 4. These spikes represent memory address computations with a 48-bit wide memory bus. These operations have predictable bitwidth and should be handled by their own bit-parallel datapath.

max_bitwidth(A) and max_bitwidth(B) represent the maximum bitwidths of the input operands A and B. The probability at each coordinate in Fig. 63 is sampled with P(bitwidth(A) == a and bitwidth(B) == b), ignoring the cases in which A or B are 47 or 48 bits wide (the pseudocode below remaps them).

The number of internal tokens in each digit stream is computed from the bitwidth. If the bitwidth is 1, then that single bit is placed in the cap token: int((1+4-2)/4) = int(3/4) = 0 internal tokens. With a bitwidth of 2, the digit stream has one internal token, int((2+4-2)/4) = int(4/4) = 1, plus one cap token.
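This token-count arithmetic recurs throughout the evaluation, so it is worth capturing in a one-line helper. A convenience sketch only; internal_tokens is not a name from the thesis:

    def internal_tokens(bitwidth, packet=4):
        # non-cap digits in the stream; the final sign digit always rides
        # in the cap token, so a 1-bit value needs no internal digits
        return (bitwidth + packet - 2) // packet

    assert internal_tokens(1) == 0  # the single sign bit lives in the cap token
    assert internal_tokens(2) == 1  # one internal digit plus the cap token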
The average number of cycles per stream is computed for each condition. "ab" is equal to the number of non-cap tokens before sign-extension is required. Then, if one stream is longer than the other, the difference is counted in the "a" or "b" condition. Finally, every digit stream has exactly one cap token.

    u = {'ab': 0, 'a': 0, 'b': 0, 'cap': 0}
    for a in range(1, max_bitwidth(A)+1):
        for b in range(1, max_bitwidth(B)+1):
            # remap memory address operations
            if a in [47, 48] or b in [47, 48]:
                tmpa = 44 if a in [47, 48] else a
                tmpb = 44 if b in [47, 48] else b
                p = P(bitwidth(A) == tmpa and bitwidth(B) == tmpb)
            else:
                p = P(bitwidth(A) == a and bitwidth(B) == b)

            # number of internal tokens
            atok = int((a+packet-2)/packet)
            btok = int((b+packet-2)/packet)

            u['ab'] += min(atok, btok)*p
            u['a'] += max(0, atok-btok)*p
            u['b'] += max(0, btok-atok)*p
            u['cap'] += p

The computed utilizations for both 1-bit and 4-bit datapaths in Table 5 show that the average digit stream for a 4-bit datapath has around 4.5 tokens, with roughly equal time spent in the "ab" condition and the extend conditions "a" and "b". A tends to have longer digit streams due to compiler and human preference.

Condition      Average Cycles/Stream
               1-bit     4-bit
ab              6.602    1.998
a               4.511    1.123
b               1.359    0.358
cap             1.000    1.000
Total Cycles   13.472    4.479

Table 5. Utilization of each condition for the addition, subtraction, AND, OR, and XOR operators.

Finally, the average performance of each sign extension unit was computed from these utilizations in Fig. 64 for a given maximum input bitwidth. Overall, the QDI-only solutions are not competitive. However, the integrated solutions show the best energy efficiency and competitive throughput.

Fig. 64: Overview of the sign-extension unit performance.

5.5.2 Compression

Six different approaches were explored: the Integrated and QDI compressN and compress1 approaches presented in this chapter, and Bundled-Data compressN and compress1 approaches. Fig. 65 shows the average performance of these approaches for a maximum bitwidth of 64 bits. Once again, the Integrated approaches perform the best, operating with around 29% higher throughput per transistor and using around 14% less energy than the bundled-data approaches.

Fig. 65: Overview of the compression unit performance.

The compressN units have four conditions. During the increment condition, the unit counts input tokens that are part of a run, emitting no tokens on the output. During the decrement condition, it emits the stored run, consuming no tokens from the input. During the pass condition, there is no stored run and the input token is not part of any new run, so the input is passed directly to the output. Finally, the clear condition clears any stored run and forwards the input token directly to the output. The raw measurements for these conditions for each design are presented in Table 6.

Meanwhile, the compress1 units have three conditions. During the init condition, the first token is consumed and stored for later comparison, emitting no tokens on the output. During the pass condition, tokens on the input are stored and the stored token is passed to the output. Then, the end condition simply emits the stored token to the output. The raw measurements for these conditions for each design are presented in Table 7.
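As with compress1 earlier, the compressN conditions admit a small behavioral sketch. The run detection over all-0/all-1 digits and the (cap, digit) encoding are simplifying assumptions, and the real unit bounds its counter:

    def compressN(tokens, packet=4):
        # Behavioral sketch: buffer runs of sign digits in a counter (inc);
        # re-emit them if live data follows (dec) and drop them when the
        # cap token can absorb them (clear).
        mask = (1 << packet) - 1
        out, run, count = [], None, 0
        def flush():
            nonlocal run, count
            out.extend([(0, run)] * count)   # dec: re-emit the buffered run
            run, count = None, 0
        for cap, d in tokens:
            if cap:
                if count and run != d:
                    flush()                  # run still carries information
                out.append((1, d))           # clear: cap absorbs a matching run
                run, count = None, 0
            elif d in (0, mask):
                if count and d != run:
                    flush()                  # a run of the other sign begins
                run, count = d, count + 1    # inc
            else:
                if count:
                    flush()
                out.append((0, d))           # pass
        return out

An input such as [(0,1), (0,0), (0,0), (1,0)] collapses to [(0,1), (1,0)], which is exactly the redundant-token case the inc and clear conditions target.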
The compression units are ultimately difficult to compare. The compressN units have the capability to compress digit streams that are extremely redundant, but incur a significant throughput cost to do so. Meanwhile, the compress1 units can only compress a single token at a time, but also affect the throughput significantly less. Three measurements must be taken to resolve this trade-off.

First, for the 4-bit compressN units, computing the average per-stream performance metrics requires digit run-length statistics from the output of each operator. This is because every token in each run must execute both the inc and dec conditions, which has a significant effect on both energy and throughput. Fig. 66 shows the average occurrence count per operation for a run of a given length. For the 1-bit compressN unit, every bit is part of a run and therefore executes both an inc and a dec condition. The compress1 units are not sensitive to digit runs.

Second, the bitwidth distribution at the output of the operations is different from the bitwidth distribution at their inputs. Fig. 67 shows this bitwidth after compression.

Third, on top of the bitwidth in Fig. 67, the addition, subtraction, multiplication, and bitwise operators introduce redundant bits to the end of the encoding. For example, given the operation 1024-1023=1, the result would be encoded using 10 bits, 8 of which are unnecessary. Fig. 68 shows the probability distribution for an operation to introduce some number of redundant bits.

Type                         Transistors   Condition   Token Frequency   Energy/Token   Latency
Integrated Serial Adaptive   304+710       inc         2.24 GHz          75.51 fJ       170 ps
CompressN (4-bit)                          dec         2.36 GHz          77.27 fJ       170 ps
                                           pass        2.30 GHz          56.34 fJ       178 ps
                                           clear       2.38 GHz          71.77 fJ       169 ps
BD Serial Adaptive           338+710       inc         1.98 GHz          91.46 fJ       215 ps
CompressN (4-bit)                          dec         2.21 GHz          84.57 fJ       211 ps
                                           pass        1.94 GHz          61.66 fJ       226 ps
                                           clear       1.99 GHz          93.46 fJ       187 ps
QDI Serial Adaptive          127+710       inc         2.37 GHz          49.66 fJ       168 ps
CompressN (1-bit)                          dec         2.90 GHz          47.10 fJ       40 ps
                                           clear       2.32 GHz          57.10 fJ       73 ps

Table 6. Raw performance measurements for the compressN units.

Type                         Transistors   Condition   Token Frequency   Energy/Token   Latency
Integrated Serial Adaptive   396           init        1.81 GHz          63.24 fJ       167 ps
Compress1 (4-bit)                          pass        2.46 GHz          64.34 fJ       174 ps
                                           end         2.25 GHz          51.77 fJ       176 ps
BD Serial Adaptive           356           init        1.99 GHz          79.86 fJ       221 ps
Compress1 (4-bit)                          pass        1.71 GHz          81.39 fJ       209 ps
                                           end         1.78 GHz          59.33 fJ       209 ps
QDI Serial Adaptive          238           init        2.48 GHz          20.72 fJ       45 ps
Compress1 (1-bit)                          pass        2.24 GHz          32.15 fJ       98 ps
                                           end         2.65 GHz          29.78 fJ       88 ps

Table 7. Raw performance measurements for the compress1 units.

Fig. 66: Combined probability distribution of the output run lengths for a given start bit for the addition, subtraction, multiplication, and bitwise operations.

Fig. 67: Average bitwidth distribution of the output for the addition, subtraction, multiplication, and bitwise operations.

The utilization of the behavioral conditions for each circuit must be determined from the data in Fig. 66, Fig. 67, and Fig. 68. Realistically, doing so correctly would require the joint distribution of all of these measures, which is unfortunately unavailable. Therefore, these distributions must be assumed to be independent. There are, though, a few things that can constrain this assumption. First, the sum of the bitwidth and redundant bits must be within the maximum bitwidth. Second, any bit run must also fit within the maximum bitwidth. Third, runs cannot overlap within a single input.

For the compressN units, the behavior can be divided into two sections. The first section covers the non-redundant internal tokens, while the second covers the redundant internal tokens.

For the first section, the run-length distribution gives information about the "inc", "dec", and "pass" conditions. Specifically, only the last run in the digit matters. This means that the run's start bit is inside the digit and its length at least covers the remaining bits in the digit, length >= excess. This prevents that digit from being double counted in the pass condition.
First, this digit is passed: u['pass'] += p. Then, any further digit that is completely contained within the run, int((length-excess)/packet), is incremented, u['inc'] += ltok*p, and then decremented, u['dec'] += ltok*p. For the second section, the output and redundant-bit distributions give information about the remaining "inc" condition cycles. The redundant bits that do not already slot into the last digit in the stream, int((red-excess-1)/packet), are incremented in the counter, u['inc'] += rtok*p, and then cleared, u['clear'] = 1.0.

Fig. 68: Average number of redundant bits introduced into the encoding by the addition, subtraction, multiplication, and bitwise operations.

    u = {'inc': 0, 'dec': 0, 'pass': 0, 'clear': 0}
    for start in range(0, max_bitwidth(L)):
        for length in range(1, max_bitwidth(L)-start+1):
            p = P((start, length) in runs(L))
            offset = start%packet
            excess = (packet - offset)%packet
            if length >= excess:
                ltok = int((length-excess)/packet)
                u['inc'] += ltok*p
                u['dec'] += ltok*p
                if offset > 0:
                    u['pass'] += p
    for width in range(1, max_bitwidth(L)+1):
        for red in range(0, max_bitwidth(L)-width):
            p = P(bitwidth(L) == width and redundant(L) == red)
            offset = (width-1)%packet
            excess = (packet - offset)%packet
            if red >= excess:
                rtok = int((red-excess-1)/packet)
                u['inc'] += rtok*p
    u['clear'] = 1.0  # every stream executes exactly one clear

This computes the utilizations in Table 8. On average, the compressN unit will compress 4.733 bits or 0.835 digits. These are slightly off from each other because of digit boundaries. Keep in mind that this assumes a compression unit is placed after every operation.

Condition      Average Cycles/Stream
               1-bit     4-bit
inc            23.021    1.714
dec            18.289    0.879
pass            0.000    3.253
clear           1.000    1.000
Total Cycles   42.310    6.846

Table 8. Utilization of each compressN condition for the addition, subtraction, multiplication, and bitwise operators.

Computing the compress1 utilizations is somewhat easier. All of the internal tokens in the digit stream are forwarded in the "pass" condition except the last redundant digit, u['pass'] += (rtok-skip)*p. The "init" condition is only executed if there is an internal token to load, rtok > 0. Then, every operation executes a single "end" condition, u['end'] += p, emitting the cap token.

    u = {'init': 0, 'pass': 0, 'end': 0}
    for width in range(1, max_bitwidth(L)+1):
        for red in range(0, max_bitwidth(L)-width):
            p = P(bitwidth(L) == width and redundant(L) == red)
            rtok = int((width+red+packet-2)/packet)
            skip = 0
            offset = width%packet
            excess = (packet-offset)%packet
            if red > excess+packet:
                skip = 1
            if rtok > 0:
                u['init'] += p
                u['pass'] += (rtok-skip)*p
            u['end'] += p

This computes the utilizations in Table 9. While the compress1 approach significantly reduces the total cycle count for a 1-bit datapath, it is actually worse for the 4-bit datapaths. This is because the "pass" condition in the compressN units covers the vast majority of tokens, leaving the "inc" condition mostly to the redundant tokens.

Condition      Average Cycles/Stream
               1-bit     4-bit
init            0.986    0.986
pass           20.419    5.361
end             1.000    1.000
Total Cycles   22.405    7.347

Table 9. Utilization of each compress1 condition for the addition, subtraction, multiplication, and bitwise operators.

Finally, the average performance for each approach is computed using these utilizations in Fig. 69 (a sketch of the arithmetic follows below). The Integrated QDI/BD designs successfully outperform the QDI and Bundled-Data designs by a significant margin. Ultimately, the compressN approach outperforms the compress1 approach for wider datapaths since the "pass" condition becomes exponentially more likely to cover the majority of the internal tokens.

Fig. 69: Overview of the compress unit performance.
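The per-stream averages behind Fig. 64 and Fig. 69 amount to weighting each condition's raw measurements by its cycles per stream. The following is a minimal sketch assuming a simple cycle-weighted average; the exact weighting used for the figures is not spelled out here, so treat this as illustrative:

    def per_stream_average(util, raw):
        # util[c]: cycles/stream (Tables 5, 8, 9); raw[c]: per-token
        # measurements (Tables 4, 6, 7) as {'GHz': ..., 'fJ': ...}
        time_ns = sum(util[c] / raw[c]['GHz'] for c in util)  # cycles/GHz = ns
        energy_fj = sum(util[c] * raw[c]['fJ'] for c in util)
        avg_ghz = sum(util.values()) / time_ns                # effective rate
        return energy_fj, avg_ghz

    # e.g. the Integrated compress1 (4-bit) rows of Table 7 with Table 9:
    util = {'init': 0.986, 'pass': 5.361, 'end': 1.000}
    raw = {'init': {'GHz': 1.81, 'fJ': 63.24},
           'pass': {'GHz': 2.46, 'fJ': 64.34},
           'end':  {'GHz': 2.25, 'fJ': 51.77}}
    print(per_stream_average(util, raw))   # roughly 459 fJ/stream at ~2.3 GHz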
5.5.3 Serial to Parallel

The serial-to-parallel units ultimately do not incur as high a cost. They are not used particularly often, and they do not need to be particularly fast, though low latency would certainly be a beneficial property.

Ultimately, the two approaches offer similar performance, routing tokens at around 1.8 GHz. At 153 transistors per stage, the downflow approach requires 30% fewer transistors because it only needs a single layer of latches; the upflow approach comes to 219 transistors per stage. However, the upflow approach uses 20% less energy on average because shorter streams do not need to be routed across the whole pipeline. Therefore, as the length of the pipeline increases, the energy required by the upflow approach levels off with the average stream length while the downflow approach continues with linearly increasing energy requirements. Fig. 70 shows the frequency leveling off to its steady state as the pipeline length increases, while the energy increases with pipeline length because the tokens have to be routed further along the pipeline.

Fig. 70: Throughput and energy per serial token of the upflow and downflow serial to parallel units as pipeline length increases.

CHAPTER 6
BITWISE OPERATIONS

Bitwise operators generally fall into two categories. In the first category, operators have no cross-bit dependencies: each bit in the result is dependent only upon the bits in the inputs at the same bit location. This category covers AND, OR, invert, XOR, and so on. The second category involves moving bits around without changing their value, covering shift and rotate. Unfortunately, these operators are quite a bit more complex in the context of digit-serial data.

This work will only examine shifting. Rotation involves saving multiple tokens from the beginning of a digit-stream to move them to the end. Direct implementation requires an unbounded amount of memory, making it a bad fit for a digit-serial ALU. Furthermore, in the context of adaptive digit-serial arithmetic, rotation is no longer meaningful because it requires a rigid bitwidth to rotate around.

Non-adaptive digit-serial shifting is well studied, leading to a particularly simple implementation found in [144]. Unfortunately, this simplicity is ultimately derived from the non-adaptive nature of the architecture, shifting the bitwidth boundary location and overwriting digits of earlier or later digit-streams instead of shifting the value itself. An alternative technique in [141] simply uses a multi-port shift register. However, the complexity of this circuit grows quickly with the maximum amount of shifting possible. The implementation that most closely matches our requirements is the adaptive bit-serial shifter found in BitSNAP [232][233]. It shifts the value by adding zeros to or deleting bits from the front of the stream, tracking the shift amount with a counter. This work builds on that premise to support digit-serial shifting.

6.1 AND, OR, XOR

A lack of cross-bit dependencies means that the bitwise operators can be grafted directly onto the control circuitry of other operators with no extra overhead. The most basic example of this would be grafting them into the datapath of the sign extension unit in Chapter 5 Section 1.
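Behaviorally, once the sign extension logic has aligned the two streams, the operation itself is a digit-by-digit map. A minimal sketch in the same token convention as before (the streams are assumed to have already been equalized in length):

    def serial_xor(a_tokens, b_tokens):
        # Works digit-by-digit because XOR has no cross-bit dependencies;
        # the output cap fires only when both input caps arrive together.
        return [(ca & cb, da ^ db)
                for (ca, da), (cb, db) in zip(a_tokens, b_tokens)]

Swapping da ^ db for da & db or da | db yields the other two operators, which is why all three can share one control skeleton.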
6.2 Shift Left

There are three primary modes for the shift left operator: the digit-shift, the bit-shift, and the cap-token. The first mode, as implemented by the first and second conditions, handles the digit-shift by adding zeros to the front of the stream. The digit-shift amount is recorded by a counter which is decremented for each new zero added.

When the counter flags zero, the shifter switches to the second mode, which handles the bit-shift within each digit as implemented by the third and fourth conditions. Assuming a four-bit digit, two bits of the shift amount are loaded directly into the shifter control as an offset, off, when the counter is written. off is used to control a barrel shifter driving M. The three most significant bits from the previous digit, stored in X, are shifted in from the right, displacing the bits from the current digit #Ad. Each cycle, the result of the shift is sent through the output channel S, the three most significant bits of the current digit are stored in X for the next cycle, and then the input digit is acknowledged.

When the cap token is received on A and the remaining results from the shift are successfully covered by the repeated value of the cap token, the shifter enters the third mode, implemented by the last condition. Specifically, this does not require the counter to be zero if the cap token value is also zero, since shifting zero left will always yield zero. In order to complete the output stream by sending a cap token, the shifter must wait until the bit-shift has resolved. If the bit-shift shifts 1-valued bits into a zero cap or 0-valued bits into a one cap, then the protocol will be broken. Therefore, the bit shift is checked and resolved by sending a non-cap token if necessary. Once this is done, the cap token is sent, the previous digit X is reset, and a write command is sent to the counter to load the next shift amount.

∗[[ cnt=0 → M:=({Ad,X} << off)[3:7]
  ▯ cnt≠0 → M:=0 ];
 [ cnt≠0 and Ac=0 → S!(0, M); cnt:=cnt-1
 ▯ cnt≠0 and Ac=1 and Ad≠M → S!(0, M); cnt:=cnt-1
 ▯ cnt=0 and Ac=0 → S!(0, M); X:=Ad; A?
 ▯ cnt=0 and Ac=1 and Ad≠M → S!(0, M); X:=Ad
 ▯ Ac=1 and Ad=M → S!(1, M); X:=0; A?; B?(off,cnt) ] ]
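Read as a functional model, the specification above translates almost line for line into Python. This is a behavioral sketch only, carrying over the (cap, digit) token lists and four-bit digits assumed in earlier sketches, and it omits one optimization of the real unit: the short-circuit of the digit-shift when the cap digit is zero.

    def shift_left(tokens, shamt, packet=4):
        mask = (1 << packet) - 1
        cnt, off = divmod(shamt, packet)
        out = [(0, 0)] * cnt               # digit-shift mode: source zeros
        x, i = 0, 0                        # x: top bits of the previous digit
        while i < len(tokens):
            cap, d = tokens[i]
            # M := ({Ad, X} << off)[3:7], the barrel shift across digits
            m = ((((d << (packet - 1)) | x) << off) >> (packet - 1)) & mask
            if cap and d == m:
                out.append((1, m))         # cap mode: stream complete
                i += 1
            elif cap:
                out.append((0, m))         # resolve leftover bits and
                x = d >> 1                 # re-read the cap token
            else:
                out.append((0, m))         # bit-shift mode
                x = d >> 1
                i += 1
        return out

    # 5 << 3 = 40: [(0,5),(1,0)] becomes [(0,8),(0,2),(1,0)]
    assert shift_left([(0, 5), (1, 0)], 3) == [(0, 8), (0, 2), (1, 0)]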
Unfortunately, implementing this logic requires a relatively complex datapath. Fig. 71 shows a high-level block diagram of the shift left operator. At the top, a dwzn counter manages cnt along with loading the new cnt from B. The two least significant bits loaded from B are forwarded out of the counter and used to control the barrel shifter as it shifts bits from X into M. The mux from M to Sd sets the resulting data to all zeros when handling the first condition in the behavioral specification, and passes the value on M otherwise. On the other side of the barrel shifter, X is stored by a special flip-flop with two clocking signals, as seen in Fig. 72. Sc0 loads the next value from Ax when vzx is high. Sc1 resets X to 0, implementing the last condition in the behavioral specification. Finally, the data from Ad is latched using the input enable Ae.

Fig. 71: Block diagram of the shift left operator.

Fig. 72: Block diagram of the flop driving X.

Most of the control signals (Sc0, Sc1, and Cw) are ultimately driven by the QDI handshake. Before implementing this handshake, the bundled datapath requires a stable view of the counter status throughout the cycle. During the handshake, decrements and writes store the counter status into a latch that drives vz and vn, signifying whether or not the counter is zero. However, this latch changes halfway through the handshake. Therefore, another latch must be used to store that value for the other half of the cycle. This latch is allowed to switch when the output requests are lowered, signalling the completion of the handshake.

¬vzx ∨ ¬vz ∧ ¬Sc1 → vnx↾
¬vnx ∨ ¬vn ∧ ¬Sc0 → vzx↾
vzx ∧ (vz ∨ Sc1) → vnx⇂
vnx ∧ (vn ∨ Sc0) → vzx⇂

The output mux and the barrel shifter are combined, and the control Ox and vzx are combined into a one-hot encoding for the five possible conditions. ctrl0 maps Sd[0:4] to Axd[0:4], implementing a shift amount of 0. ctrl1 maps Sd[0:4] to {Axd[0:3],X2}, implementing a shift amount of 1. ctrl2 maps Sd[0:4] to {Axd[0:2],X[1:3]}, implementing a shift amount of 2. ctrl3 maps Sd[0:4] to {Axd[0],X[0:3]}, implementing a shift amount of 3. Finally, vnx maps Sd[0:4] to 0, implementing the output mux. The logic driving ctrl0 is swapped relative to the others in order to pull the signals from Ox with the lowest possible gate delay.

Ox0 ∨ Ox1 ∨ vnx → ctrl01⇂
¬Ox0 ∧ ¬Ox1 ∧ ¬vnx → ctrl01↾
ctrl01 → ctrl00⇂
¬ctrl01 → ctrl00↾
Ox0 ∧ ¬Ox1 ∧ vzx → ctrl10⇂
¬Ox0 ∨ Ox1 ∨ ¬vzx → ctrl10↾
ctrl10 → ctrl11⇂
¬ctrl10 → ctrl11↾
¬Ox0 ∧ Ox1 ∧ vzx → ctrl20⇂
Ox0 ∨ ¬Ox1 ∨ ¬vzx → ctrl20↾
ctrl20 → ctrl21⇂
¬ctrl20 → ctrl21↾
Ox0 ∧ Ox1 ∧ vzx → ctrl30⇂
¬Ox0 ∨ ¬Ox1 ∨ ¬vzx → ctrl30↾
ctrl30 → ctrl31⇂
¬ctrl30 → ctrl31↾

This shifting logic is implemented with pass transistor logic using only a single layer of pass transistors. Four of the five conditions are controlled by the ctrl signals. The final condition is the output mux controlled by vnx. With the inputs taken from the inverted sense of Ax, and the output of the shifting logic protected by inverters, the total gate delay of the datapath from Ad to Sd is two.

These five conditions happen to map well to the cases in which bits from the previous digit are stored in X. This also allows for clock-gating of the unused flops in X depending upon the amount of the bit-shift and the value of the counter.

@Rclk ∧ ¬ctrl30 → Xclk0↾
¬@Rclk ∧ ctrl31 ∨ ctrl30 → Xclk0⇂
@Rclk ∧ (¬ctrl30 ∨ ¬ctrl20) → Xclk1↾
¬@Rclk ∧ (ctrl31 ∨ ctrl21) ∨ ctrl30 ∧ ctrl20 → Xclk1⇂
@Rclk ∧ (¬ctrl30 ∨ ¬ctrl20 ∨ ¬ctrl10) → Xclk2↾
¬@Rclk ∧ (ctrl31 ∨ ctrl21 ∨ ctrl11) ∨ ctrl30 ∧ ctrl20 ∧ ctrl10 → Xclk2⇂

Finally, because the delay of the datapath is so low, the comparison of Ax to Sd can be implemented directly on the output. All of the bits in Ax will be the same when A is a cap token, so the comparison must only check one. Furthermore, only the first three bits of the output could differ from the cap token value due to the shift. Therefore, the comparison can be implemented using a single gate.

¬Axd3 ∧ ¬Sd2 ∧ ¬Sd1 ∧ ¬Sd0 ∨ Axd3 ∧ Sd2 ∧ Sd1 ∧ Sd0 → D1⇂
Axd3 ∧ (¬Sd2 ∨ ¬Sd1 ∨ ¬Sd0) ∨ ¬Axd3 ∧ (Sd2 ∨ Sd1 ∨ Sd0) → D1↾
D1 → D0⇂
¬D1 → D0↾

Because this is an Integrated QDI/BD circuit, it would be possible to pull the input request data Ac into the bundled datapath using a latch before the delay lines. This would facilitate grouping the first four conditions into a single forward driver for Sc0. Unfortunately, the fourth condition, handling sign extension, does not acknowledge the input request and so skips the delay lines. Therefore, a delay line must be placed on the reset of Sc0 for only the fourth condition, to allow the datapath and the comparison D to resolve. If all four conditions are combined into a single forward driver, separating the fourth condition out for delay would become quite expensive.
Furthermore, this approach requires an extra latch anyway, which is equivalent in cost to an extra C-element for a third forward driver. So instead, the first and third conditions from the CHP can be combined into a single forward driver R0, and the second and fourth conditions may be combined into R1. Finally, the last condition is directly implemented by R2. This removes the need for the latch on Ac and allows for a delay line on the reset phase of R1 to cover the fourth condition.

Cz ∧ Cn ∧ Se ∧ Ac0 → R0↾
Cz ∧ Cn ∧ Se ∧ Ac1 ∧ D1 → R1↾
We ∧ Se ∧ Ac1 ∧ D0 → R2↾
R2 ∨ vzx ∧ R0 ∨ Cz ∧ Cn ∧ Cw → Ae⇂

The pass transistor intermediate forward drivers micro-optimization can be used to drive Sc0 using R0 and R1. After that, the pass transistor gated forward driver can be used to generate the counter decrement command. More specifically, the counter may be decremented by R0 or R1, but only when it is not empty.

@R0 ∧ ¬_R0 ∨ @R1 ∧ ¬_R1 → Sc0↾
¬@R0 ∧ _R1 ∨ ¬@R1 ∧ _R0 → Sc0⇂
Sc1 = R2
@Sc0 ∧ ¬vzx → Cd↾
¬@Sc0 ∧ vnx → Cd⇂
vzx → Cd⇂

When the counter decrement command is acknowledged by the counter, that acknowledgement is stored into a latch for use next cycle.

¬vz ∨ ¬Cn → vn↾
¬vn ∨ ¬Cz → vz↾
vz ∧ Cn → vn⇂
vn ∧ Cz → vz⇂

During the reset phase of the forward drivers, all three have sent an output request and therefore must wait for the output acknowledge on Se. R0 and R1 decrement the counter when it is not zero, which means that either vz and vn have not changed when the counter acknowledged with Cn, or vz is set by an acknowledgement on Cz. Finally, R0 acknowledges the input and must wait for the input request on Ac0 to reset. Thankfully, this condition is mutually exclusive with the counter decrement. Meanwhile, R1 simply skips in this case. Since vzx and vnx provide a stable view of the counter, they can be used directly to check for this skip condition.

¬Se ∧ (¬vn ∧ ¬Cz ∨ ¬Cn ∨ ¬Ac0) → R0⇂
¬Se ∧ (¬vn ∧ ¬Cz ∨ ¬Cn ∨ ¬vnx) → R1⇂
¬Se ∧ ¬We ∧ ¬Ac1 → R2⇂
¬R2 ∧ (¬vzx ∨ ¬R0) ∧ (¬Cw ∨ ¬Cz ∨ ¬Cn) → Ae↾

The behavior implemented by R2 is rather special. Of note is the interaction between the counter write and the output of the cap token with Sc1. Specifically, Sc1 must be allowed to reset before Cz or Cn acknowledge the write request. If not, two undesirable behaviors are introduced. First, the unit would emit a token on the output during reset before any input had been received. Second, each shift operation would have to wait for the next one to be ready before completing, which means that the final shift operation would never complete. Therefore, an extra half-buffer must be inserted between the shifter and the counter during the write command, with Sc1 as the input request, Cw as the output request, and We as the input enable. The output acknowledge is stored in vz and vn with the standard completion detection.

Cz ∧ Cn ∧ Sc1 → Cw↾
Cw → We⇂
¬Sc1 ∧ (¬Cz ∧ ¬vn ∨ ¬Cn ∧ ¬vz) → Cw⇂
¬Cw → We↾

Take note that R0 and R1 do not acknowledge We because the input enable Ae is held low by Cw until the write command has been acknowledged. This is necessary because the pull-down stack for Sc0 is already five transistors long without We.

This design ultimately requires four delay lines. Two delay lines are placed on the input requests, handling the majority of the cases. A third delay line must be placed on the reset of Cz; this allows time for the datapath to resolve during the transition from the digit-shift mode to the bit-shift mode. Finally, as previously discussed, a delay line is placed on the reset of R1, allowing the datapath to resolve after updates on X. This means that a separate clocking signal Rclk must be generated before the delay line on R1, and Sc0 must be generated after the delay line on R1.

@R0 ∧ ¬_R0 ∨ @R1 ∧ ¬_R1 → Rclk↾
¬@R0 ∧ _R1 ∨ ¬@R1 ∧ _R0 → Rclk⇂
6.3 Shift Right

The right shift is very similar to the left shift in many aspects of its behavior. The second, third, and fourth conditions of the right shift operator are nearly identical to the third, fourth, and fifth conditions of the left shift operator. However, the first condition consumes input tokens instead of sourcing zeros on the output.

Once again, there are three primary modes for this shifter: the digit-shift, the bit-shift, and the cap-token. The first mode is implemented by the first condition and handles the digit-shift by deleting tokens from the front of the input stream. The second mode is implemented by the second and third conditions and handles the bit-shift. The final mode is handled by the fourth condition, sending the cap token and loading a new value into the counter.

Unfortunately, the right shifter has some extra considerations. First, the values on A are digit-serial with the least significant digit first. Suppose this value is being shifted right by two bits. Assuming that each digit is four bits, the first digit sent on S will require two bits from the first digit of A and two bits from the second. Therefore, if the shift is not aligned to the size of the digit, then the right shift will always need to read one extra digit from A. This is why the counter is incremented during the load.

Second, when developing a full ALU, it would be beneficial for these operators to share devices when possible. Requiring a bi-directional barrel-shifter to implement the combined left and right shift datapath is undesirable. Instead, the right shift in the datapath, M:=({#Ad,X} >> off)[0:4], can be replaced by a left shift, M:=({#Ad,X} << off)[3:7], by negating the offset off.

∗[[ cnt=0 → M:=({Ad,X} << off)[3:7]
  ▯ cnt≠0 → M:=Ad ];
 [ cnt≠0 and Ac=0 → X:=Ad; A?; cnt:=cnt-1
 ▯ cnt=0 and Ac=0 → S!(0, M); X:=Ad; A?
 ▯ Ac=1 and Ad≠M → S!(0, M); X:=Ad
 ▯ Ac=1 and Ad=M → S!(1, M); X:=0; A?; B?(off,cnt);
   [ off≠0 → off:=4-off, cnt:=cnt+1 ▯ off=0 → skip ] ] ]

This allows the datapath of the right shift operator to be nearly identical to the left shift. The primary difference between the two comes from the handling of the cap token. If the counter is empty, then the cap token is handled similarly to the left shift. However, if the counter is not empty, then there is no need to extend the digit stream in the way that the left shift operator does. Instead, the cap token is simply forwarded and the counter is ignored. This means that when the counter is not zero, the shift amount needs to be zeroed so that the cap token data can be correctly forwarded to S.

Instead of directly incrementing the counter when the value is loaded, the QDI control will have an extra state in its internal memory. Therefore, vz is replaced by v0, vn is replaced by v2, and a new state v1 is introduced to handle the increment case. This is significantly cheaper than just incrementing the counter on the write, for two reasons. First, it reduces the number of decrement operations required from the counter. Second, implementing a write-increment command would also require a functional write command for when the carry on the increment is zero. This doubles the overhead of the write command, which is already expensive to implement.
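The negated-offset trick in the CHP above can also be rendered as a token-level sketch in Python. As before, the (cap, digit) encoding and digit width are assumptions for illustration, and the handshake detail is deliberately absent:

    def shift_right(tokens, shamt, packet=4):
        mask = (1 << packet) - 1
        cnt, r = divmod(shamt, packet)
        off = 0
        if r:                               # unaligned: negate the offset and
            off, cnt = packet - r, cnt + 1  # read one extra digit from A
        out, x, i = [], 0, 0
        while i < len(tokens):
            cap, d = tokens[i]
            if cnt:
                m = d                       # counter not empty: M := Ad
            else:
                m = ((((d << (packet - 1)) | x) << off) >> (packet - 1)) & mask
            if cap and d == m:
                out.append((1, m))          # cap mode
                break
            if cnt:
                x, cnt, i = d >> 1, cnt - 1, i + 1  # digit-shift: consume A
            elif cap:
                out.append((0, m))          # resolve leftover bits and
                x = d >> 1                  # re-read the cap token
            else:
                out.append((0, m))          # bit-shift mode
                x, i = d >> 1, i + 1
        return out

    # -9 >> 2 = -3 (arithmetic): [(0,7),(1,15)] becomes [(0,13),(1,15)]
    assert shift_right([(0, 7), (1, 15)], 2) == [(0, 13), (1, 15)]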
Much like the left shift operator, the datapath for the right shift needs a stable view of this internal memory. Therefore, a 3-valued latch drives vx, which is allowed to change once the forward drivers have been reset. However, the internal memory in the QDI control will be handled with a positive latch. This means that its valid state has one signal high and two low. Chaining a positive latch on top of that for vx would introduce some complexity into the set logic. Instead, a negative latch is used to generate the inverted value _vx, in which a valid state has one signal low and two high, simplifying the set logic. Finally, the datapath logic also needs the non-inverted signals vx2 and vx0.

¬_vx0 ∨ ¬_vx1 ∨ ¬v2 ∧ ¬R0 ∧ ¬R1 ∧ ¬R2 → _vx2↾
¬_vx0 ∨ ¬_vx2 ∨ ¬v1 ∧ ¬R0 ∧ ¬R1 ∧ ¬R2 → _vx1↾
¬_vx1 ∨ ¬_vx2 ∨ ¬v0 ∧ ¬R2 → _vx0↾
_vx0 ∧ _vx1 ∧ (v2 ∨ R0 ∨ R1 ∨ R2) → _vx2⇂
_vx0 ∧ _vx2 ∧ (v1 ∨ R0 ∨ R1 ∨ R2) → _vx1⇂
_vx1 ∧ _vx2 ∧ (v0 ∨ R2) → _vx0⇂
¬_vx0 → vx0↾
_vx0 → vx0⇂
¬_vx2 → vx2↾
_vx2 → vx2⇂

Fig. 73: Block diagram of the shift right operator.

The logic surrounding the offset O[0:2] is a little complex. The control signal inc determines when the counter is incremented during the write. Unfortunately, the QDI control needs access to this signal before Cw is lowered in order to correctly set the value of its internal memory v. Furthermore, the QDI control also needs inc to be stable throughout the rest of the operation for subsequent control of v while decrementing. Therefore, inc must be sampled between the p and n latches that make up the flop for Ox. Per the CHP, when the offset is not zero, the counter should be incremented.

Op00 ∧ Op10 → inc1⇂
¬Op00 ∨ ¬Op10 → inc1↾
inc1 → inc0⇂
¬inc1 → inc0↾

Building off this idea, putting the negation unit between the two latches allows the delay of the negation unit to be hidden in the half-cycle between when Cw is raised and when it is lowered. Because only a two-bit value is being negated and the carry is not necessary, the negation unit is simply an XOR gate, which can be implemented with pass-transistor logic.

nOp0 = Op0
@Op11 ∧ ¬Op00 ∨ @Op10 ∧ ¬Op01 → nOp10↾
¬@Op11 ∧ Op01 ∨ ¬@Op10 ∧ Op00 → nOp10⇂
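As a quick sanity check of why the carry can be dropped (an illustrative aside in plain Python, not from the thesis): negating a two-bit value modulo 4 passes the low bit through and XORs it into the high bit.

    # -off mod 4 = {off1 ^ off0, off0}: low bit passes through, high bit
    # is XORed with it, so no carry chain is needed
    for off in range(4):
        b0, b1 = off & 1, (off >> 1) & 1
        assert ((b1 ^ b0) << 1) | b0 == (-off) % 4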
The control signals for the right shift are generated much like the left shift. Once again, the multiplexer can be merged into the control for the shifter. When the counter is not zero, ctrl0 is set to keep the shift value equal to zero.

vx0 ∧ (Ox0 ∨ Ox1) → ctrl01⇂
¬vx0 ∨ ¬Ox0 ∧ ¬Ox1 → ctrl01↾
ctrl01 → ctrl00⇂
¬ctrl01 → ctrl00↾
Ox0 ∧ ¬Ox1 ∧ vx0 → ctrl10⇂
¬Ox0 ∨ Ox1 ∨ ¬vx0 → ctrl10↾
ctrl10 → ctrl11⇂
¬ctrl10 → ctrl11↾
¬Ox0 ∧ Ox1 ∧ vx0 → ctrl20⇂
Ox0 ∨ ¬Ox1 ∨ ¬vx0 → ctrl20↾
ctrl20 → ctrl21⇂
¬ctrl20 → ctrl21↾
Ox0 ∧ Ox1 ∧ vx0 → ctrl30⇂
¬Ox0 ∨ ¬Ox1 ∨ ¬vx0 → ctrl30↾
ctrl30 → ctrl31⇂
¬ctrl30 → ctrl31↾

The shifting logic for the datapath of the right-shift operator is implemented similarly to the left shift. There is only a single layer of pass transistors, with all conditions controlled by the ctrl signals. Once again, the inputs are taken from the inverted sense of Ax, and the output of the shifting logic is protected by a layer of inverters. Therefore, the total gate delay of the datapath from Ad to Sd is two.

The datapath for the left-shift unit was able to clock-gate the flops driving X when the counter was not zero, which saved a lot of energy along the way. When the counter reached zero, the previous digit was guaranteed to be zero. Because X was reset to zero during the write, no extra work was necessary to store the digit just prior to the counter being empty.

The right-shift unit is a bit more difficult. While X is no longer reset during the write, the last input digit consumed right before the counter reaches zero must be recorded. This means that X can only be clock-gated while the value of the counter is greater than one. Most cases increment the counter during the write, and that increment uses the internal memory in the QDI control, v1. Therefore, it is possible to clock-gate X for those cases. The one case that does not increment the counter is when the shift amount is aligned to the digit size. In this case, the offset is zero, so the flop driving X is left unused and can therefore be turned off. This means that it is still possible to fully clock-gate X to save energy.

Unfortunately, this new condition, with the counter being greater than one, means that the clock-gating for X no longer maps as nicely to the ctrl signals. Therefore, gate signals must be generated separately from the ctrl signals to handle this clock-gating.

(Ox0 ∨ Ox1) ∧ _vx2 → gate20⇂
¬Ox0 ∧ ¬Ox1 ∨ ¬_vx2 → gate20↾
gate20 → gate21⇂
¬gate20 → gate21↾
Ox1 ∧ _vx2 → gate10⇂
¬Ox1 ∨ ¬_vx2 → gate10↾
gate10 → gate11⇂
¬gate10 → gate11↾
Ox0 ∧ Ox1 ∧ _vx2 → gate00⇂
¬Ox0 ∨ ¬Ox1 ∨ ¬_vx2 → gate00↾
gate00 → gate01⇂
¬gate00 → gate01↾
@S0 ∧ ¬gate00 → Xclk0↾
¬@S0 ∧ gate01 ∨ gate00 → Xclk0⇂
@S0 ∧ ¬gate10 → Xclk1↾
¬@S0 ∧ gate11 ∨ gate10 → Xclk1⇂
@S0 ∧ ¬gate20 → Xclk2↾
¬@S0 ∧ gate21 ∨ gate20 → Xclk2⇂

Once again, because the delay of the datapath is so low, the comparison of Ax to Sd can be implemented directly on the output. This comparison is identical to the one in the left shift operator.

¬Axd3 ∧ ¬Sd2 ∧ ¬Sd1 ∧ ¬Sd0 ∨ Axd3 ∧ Sd2 ∧ Sd1 ∧ Sd0 → D1⇂
Axd3 ∧ (¬Sd2 ∨ ¬Sd1 ∨ ¬Sd0) ∨ ¬Axd3 ∧ (Sd2 ∨ Sd1 ∨ Sd0) → D1↾
D1 → D0⇂
¬D1 → D0↾

To simplify the control a bit and reduce the length of transistor stacks, the inc control signal will be used to gate Cz into two separate signals, Cz1 and Cz0. Doing this requires special treatment for pass transistor logic in the QDI delay model. Therefore, acknowledging Cz should also acknowledge both Cz1 and Cz0.

@Cz ∧ ¬inc0 ∨ ¬inc1 → Cz1↾
¬@Cz ∧ inc1 → Cz1⇂
@Cz ∧ ¬inc1 ∨ ¬inc0 → Cz0↾
¬@Cz ∧ inc0 → Cz0⇂

With the datapath complete, it is now possible to implement a clean QDI control. Once again, it would be possible to pull the input request data Ac into the bundled datapath using a latch before the delay lines. However, like the left shift operator, this does not actually reduce the complexity of the circuit overall. This gives three forward drivers, R0, R1, and R2, that are identical to the three forward drivers of the left shift. Furthermore, the gate driving Ae has only one minor difference: while the left shift only acknowledges the input for R0 when the counter is empty, the right shift always acknowledges the input for R0.

Cz ∧ Cn ∧ Se ∧ Ac0 → R0↾
Cz ∧ Cn ∧ Se ∧ Ac1 ∧ D1 → R1↾
We ∧ Se ∧ Ac1 ∧ D0 → R2↾
R2 ∨ R0 ∨ Cn ∧ Cz ∧ Cw → Ae⇂

Once again, the pass transistor intermediate forward drivers micro-optimization is used to drive Sc0 using R0 and R1. However, for the right shift it must also be gated by vx0. This ensures that the digits deleted from the front of the input stream on A are not then forwarded on S. While the left shift unit gated Sc0 to generate Cd, the right shift unit does not decrement the counter when R1 is raised. This means that Cd can be gated directly from R0 instead.
@R0 ∧ ¬_R0 ∨ @R1 ∧ ¬_R1 → S0↾
¬@R0 ∧ _R1 ∨ ¬@R1 ∧ _R0 → S0⇂
@S0 ∧ ¬_vx0 → Sc0↾
¬@S0 ∧ vx0 ∨ _vx0 → Sc0⇂
Sc1 = R2
@R0 ∧ ¬_vx2 → Cd↾
¬@R0 ∧ vx2 ∨ _vx2 → Cd⇂

The extra increment is handled by the internal memory. Now, when receiving the counter status, Cn sets v2, and Cz conditionally sets either v0 or v1 depending upon the increment. Finally, the last decrement is handled directly by R0, switching the internal memory from v1 to v0.

¬v1 ∧ ¬v2 ∨ ¬Cz0 ∨ ¬_vx1 ∧ ¬_R0 ∧ ¬Ae → v0↾
¬v2 ∧ ¬v0 ∨ ¬Cz1 → v1↾
¬v1 ∧ ¬v0 ∨ ¬Cn → v2↾
(v1 ∨ v2) ∧ Cz0 ∧ (_vx1 ∨ _R0 ∨ Ae) → v0⇂
(v2 ∨ v0) ∧ Cz1 → v1⇂
(v1 ∨ v0) ∧ Cn → v2⇂

The reset phase of the forward drivers is now fairly different from the left shift. For R0, instead of always sending a digit on S, it now always consumes the input digit from A. Furthermore, R0 also handles the last decrement from v1 to v0, which must be acknowledged in the reset phase. For R1, the counter never decrements. More specifically, the counter is always guaranteed to be zero when R1 occurs; if the counter is not zero, then D is guaranteed to be false. Therefore, R1 must only acknowledge the output enable Se. R2 and the extra half-buffer on Cw remain identical between the left and right shift.

¬Ac0 ∧ (¬v2 ∧ ¬Cz ∨ ¬Cn ∨ ¬Se ∨ ¬_vx1 ∧ ¬v1) → R0⇂
¬Se → R1⇂
¬Se ∧ ¬We ∧ ¬Ac1 → R2⇂
¬R2 ∧ ¬R0 ∧ (¬Cw ∨ ¬Cz ∨ ¬Cn) → Ae↾

However, in the reset phase of the rules driving Cw, the extra increment condition must be correctly acknowledged.

Cz ∧ Cn ∧ R2 → Cw↾
Cw → We⇂
¬R2 ∧ (¬Cn ∧ ¬v0 ∧ ¬v1 ∨ ¬v2 ∧ (¬Cz0 ∧ ¬v1 ∨ ¬Cz1 ∧ ¬v0)) → Cw⇂
¬Cw → We↾

Once again, R0 and R1 do not acknowledge We because the input enable Ae is held low by Cw until the write command has been acknowledged.

The right shift unit requires three delay lines. Like the left shift unit, two delay lines are placed on the input requests, handling the majority of the cases, and one delay line is placed on the reset of R1, allowing updates on X to resolve. However, unlike the left shift unit, the delay line on the input requests successfully covers the transition from the digit-shift to the bit-shift modes. Therefore, the right-shift unit does not require the fourth delay line on Cz. Like the left shift unit, this means that a separate clocking signal Rclk must be used for X.

@R0 ∧ ¬_R0 ∨ @R1 ∧ ¬_R1 → Rclk↾
¬@R0 ∧ _R1 ∨ ¬@R1 ∧ _R0 → Rclk⇂

Finally, Ac0 does not always need to be delayed. While the input digits are being consumed by the shift, the data of those tokens does not matter because it is not forwarded through the datapath and into S. This means that while vx0 is false, meaning the counter is not empty, Ac0 need not be delayed. Therefore, pass transistors can be used to conditionally enable the delay line on Ac0.

6.4 Evaluation

6.4.1 Bitwise Operators

Fig. 74 shows the bitwidth distribution for the input operands of the bitwise operators. There is a strong concentration along the diagonal, signifying that the operands in most bitwise operations are the same width. Furthermore, a significant number of operations appear to be only one bit wide; these are likely used to resolve compound conditions for branches. Finally, the vast majority of operations are 32 bits wide.

Because the bitwise operators have been grafted onto the sign extension unit, the behavioral conditions and utilization computation remain unchanged from Chapter 5. This computes the utilization data found in Table 10.

Ultimately, there are not many ways to implement bit-parallel bitwise operators.
Furthermore, all of the bitwise operators (AND, OR, XOR) have extremely similar performance metrics. Therefore, only the Integrated Adaptive Digit-Serial XOR operator and the clocked and QDI 64-bit bit-parallel versions, as seen in Table 11, will be compared. While the clocked operator can run at 10 GHz in a vacuum, it would need to run slower in the context of a larger architecture; in most architectures, the maximum frequency is around 4 GHz. This analysis will be conservative and compare the digit-serial operator against the bit-parallel operator at 10 GHz.

In Table 12, there are four conditions. During the "ab" condition, neither input operand is a cap token, so both are consumed to produce the output. During the "a" and "b" conditions, only one of the two input operands has a cap token, so only one input is consumed. Finally, during the "cap" condition, both input operands are cap tokens; both are consumed to complete the operation. As stated earlier, the performance of the three bitwise operators is nearly identical, operating around 2.6 GHz at around 30 fJ per token.

Fig. 74: Probability distribution for the bitwidth of the left operand A and right operand B for AND, OR, and XOR.

Condition      Average Cycles/Stream (4-bit)
ab             2.439
a              0.868
b              0.627
cap            1.000
Total Cycles   4.933

Table 10. Utilization of each condition for the AND, OR, and XOR operators.

Type                             Transistors   Frequency   Energy/Op
Clocked Parallel XOR (64-bit)    3712          10.00 GHz   0.508 pJ
QDI Parallel XOR (PCHB 64-bit)   4096          3.93 GHz    1.780 pJ

Table 11. Performance measurements for the bit-parallel bitwise operators.

In Fig. 75, the distribution in Fig. 74 is applied to the raw measurements in Table 12 and compared against Table 11. Overall, as the maximum width of the operator grows, the adaptive serial operator does less and less work. At 64 bits, the adaptive serial operator is competitive in throughput per transistor with the clocked bit-parallel architectures, but uses less than half the energy for the same computation.

Fig. 75: Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth.

Type                         Transistors   Condition   Token Frequency   Energy/Token   Latency
Integrated Serial Adaptive   221           ab          2.62 GHz          41.98 fJ       94 ps
XOR                                        a           2.64 GHz          28.92 fJ       92 ps
                                           b           2.66 GHz          29.50 fJ       91 ps
                                           cap         2.68 GHz          30.42 fJ       88 ps

Table 12. Raw performance measurements for the digit-serial bitwise operators.

Unfortunately, this approach introduces some redundant tokens into the encoding of the result. If two numbers are ANDed and one is shorter than the other, then the cap token is zero and the number will be sign extended. The bitwidth of the output then matches the bitwidth of the larger of the two inputs, but should really only match the shorter. Fig. 76 shows the distribution of the number of redundant bits introduced into the output encoding. Overall, about 50% of operations do not introduce any redundant bits. However, a non-negligible number of operations introduce a significant number of redundant bits. Tackling this will be important for future work.

Fig. 76: Probability distribution for the number of redundant bits introduced per operation by this implementation of the bitwise operators.

6.4.2 Shift Operators

Overall, Fig. 77 shows that the adaptive digit-serial shift operators developed here use half the energy for the same operations while remaining competitive with the circuits synthesized by Synopsys. The integrated QDI/BD shift operators are compared against three implementations of synchronous bit-parallel shifters: the standard circuit produced by Synopsys Design Compiler with a base-2 shift amount, a full custom pass transistor shifter with a base-2 shift amount, and a full custom pass transistor shifter with a base-4 shift amount.

Fig. 77: Performance and energy averaged over the distributions in Fig. 78 and Fig. 80 vs transistor count.
6.4.2 Shift Operators

Overall, Fig. 77 shows that the adaptive digit-serial shift operators developed here use half the energy for the same operations while remaining competitive with the circuits synthesized by Synopsys. The integrated QDI/BD shift operators are compared against three implementations of synchronous bit-parallel shifters: the standard circuit produced by Synopsys Design Compiler with a base-2 shift amount, a full-custom pass transistor shifter with a base-2 shift amount, and a full-custom pass transistor shifter with a base-4 shift amount. These circuits were each evaluated for a range of bitwidths. The performance of the full 64-bit shifters of these implementations is shown in Table 13.

Fig. 77: Performance and energy averaged over the distribution in Fig. 78 and Fig. 80 vs Transistor Count.

Table 14 shows the raw per-token measurements for each condition of the length-adaptive digit-serial shifters. In the "digit-shift" condition, the counter is not empty and the shifter is either producing zeros on the output in the case of the left shift or consuming the input digits in the case of the right shift. In the "bit-shift" condition, the counter is empty and the shifter is forwarding shifted input digits to the output. In the "extend" condition, the shifter has bits remaining in its token storage that are shifted into the data from the cap token. This requires the stream to be sign-extended. Finally, in the "cap" condition, the digit-stream is completed with a cap token and the next value is loaded into the counter in preparation for the next operation.

Type                                      Transistors  Frequency  Energy/Op
Clocked Parallel Synopsys Left (64-bit)   6918         1.89 GHz   1.603 pJ
Clocked Parallel 2x Left (64-bit)         3791         2.36 GHz   1.103 pJ
Clocked Parallel 4x Left (64-bit)         3192         3.34 GHz   0.911 pJ
Clocked Parallel Synopsys Right (64-bit)  5988         0.89 GHz   1.070 pJ
Clocked Parallel 2x Right (64-bit)        3808         2.36 GHz   1.128 pJ
Clocked Parallel 4x Right (64-bit)        3185         3.34 GHz   0.991 pJ

Table 13. Performance measurements for the bit-parallel shift operators.

Type                        Transistors  Condition    Token Frequency  Energy/Token  Latency
Integrated Serial Adaptive  450+441      digit-shift  2.67 GHz         52.28 fJ      59 ps
Left                                     bit-shift    2.34 GHz         61.00 fJ      152 ps
                                         extend       2.33 GHz         61.29 fJ      163 ps
                                         cap          2.0-2.20 GHz     127.37 fJ     150 ps
Integrated Serial Adaptive  479+441      digit-shift  2.40 GHz         82.70 fJ      -
Right                                    bit-shift    1.72 GHz         73.70 fJ      206 ps
                                         extend       2.35 GHz         56.18 fJ      149.5 ps
                                         cap          2.0-2.15 GHz     151.45 fJ     149 ps

Table 14. Raw performance measurements for the shift operators.

The frequency of the cap condition is determined primarily by the counter. The maximum frequency of the write command for the counter is ultimately 2.0 GHz. However, the counter can handle other commands in parallel while the write command is working. This means that as long as write commands are not issued consecutively, it effectively operates at around 3.0 GHz. At that point, the shifter logic becomes the limiting factor, setting a frequency around 2.20 GHz. The transistor counts for the digit-serial shifters list the count for the shifter unit plus the count for a 4-bit dwzn counter.

6.4.3 Shift Left

For the left shift, Fig. 78 shows the probability distribution for the bitwidth of the shifted input A and the shift amount B as measured from the SPEC2006 benchmark. The center plot shows the joint probability distribution while each histogram shows the associated individual probability and cumulative distributions for that axis. The bitwidth of the input operand A averages around 8.97 bits while the shift amount B averages around 8.65. The utilization is computed directly from this distribution as follows. max_bitwidth(A) represents the maximum bitwidth of the shifted input A and max(B) represents the maximum value on B, which is loaded into the counters.
packet represents the size of each digit in the stream, which is assumed to be 4. P(bitwidth(A) == w and B == s) samples the distribution in Fig. 78 at (w, s). u stores the computed utilization of each condition.

Fig. 78: Probability distribution for the bitwidth of the shifted value A and the shift amount B for the left shift operator.

u = {'digit-shift': 0, 'bit-shift': 0, 'extend': 0, 'cap': 0}
for w in range(1, max_bitwidth(A)+1):
    for s in range(0, max(B)):
        p = P(bitwidth(A) == w and B == s)
        if w == 1:
            u['digit-shift'] += 0.5 * int(s/packet) * p
        else:
            u['digit-shift'] += int(s/packet) * p
        u['bit-shift'] += int((w+packet-2)/packet) * p
        if s%packet > (max_bitwidth-w+1)%packet:
            u['extend'] += p
        u['cap'] += p

Ultimately, most of the shift amount is used to append digits to the front of the digit stream. This executes the digit-shift condition int(s/packet) times in most cases. However, if the shifted input value A is 0, then it would be redundant to append these digits to the front of the stream. In this case, the digit-shift condition is skipped. If the input bitwidth is 1, then the shifted value is 0 or -1; it is assumed that 0 happens about 50% of the time, hence the 0.5 factor in the w == 1 case above.

Once the digit-shift condition has run its course, the bit-shift condition takes over. This forwards each token received on A, shifted by the remaining shift amount. Ultimately, the number of non-cap digits in A is equal to int((w+packet-2)/packet), or equivalently ceil((w-1)/packet), since the sign bit itself is carried by the cap token. Then, if bits from a non-cap token would be shifted into the cap token, the extend condition executes. This happens when the remaining shift amount s%packet is greater than the bit-spaces left over in the last non-cap token, (max_bitwidth-w+1)%packet. Finally, every digit-stream executes the cap condition exactly once.

This computes the utilization of each condition as shown in Table 15. On average, each shift runs about 5.433 cycles in total, generating that many output digits as well.

Condition     Average Cycles/Stream
digit-shift   1.839
bit-shift     2.401
extend        0.193
cap           1.000
Total Cycles  5.433

Table 15. Utilization of each condition for left shift.

To compute these numbers for a given max bitwidth, the distribution is assumed to be effectively truncated to that bitwidth. Then, the raw performance numbers in Table 14 are multiplied with the utilization in Table 15 to produce the average performance of the left shift operator as shown in Fig. 79.

Fig. 79 shows the average performance of the length-adaptive digit-serial left shift operation against the other bit-parallel shifter implementations in Table 13 for a given maximum input bitwidth. Ultimately, the digit-serial shifter operates 1.83 times faster per transistor than the Synopsys shifter but 52% slower per transistor than the best bit-parallel full-custom design. However, the digit-serial shifter uses 77% and 58% less energy respectively to execute the same operations.

Fig. 79: Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth.
6.4.4 Shift Right

For the right shift, Fig. 80 shows the probability distribution for the input bitwidth and shift amount. The bitwidth averages around 16.2 bits while the shift amount averages around 14.3 bits. Once again, the utilization is computed from this distribution. wp represents the number of digits in the input digit stream. sp represents the maximum number of digits that could be consumed by the shift. p0 represents the actual number of digits consumed.

Fig. 80: Probability distribution for the bitwidth of the shifted value A and the shift amount B for the right shift operator.

u = {'digit-shift': 0, 'bit-shift': 0, 'extend': 0, 'cap': 0}
for w in range(1, max_bitwidth(A)+1):
    for s in range(0, max(B)):
        p = P(bitwidth(A) == w and B == s)
        wp = int((w+packet-2)/packet)
        sp = int((s+packet-1)/packet)
        p0 = min(wp, sp)
        u['digit-shift'] += p0 * p
        u['bit-shift'] += (wp - p0) * p
        if s < w and (s+packet-1)%packet < (w+packet-2)%packet:
            u['extend'] += p
        u['cap'] += p

First, the right shift operation executes the digit-shift condition. This consumes one more token than necessary so that the data from the next token is available to be shifted in. Either the whole shift can be executed or the shift amount is greater than the bitwidth of the input, hence min(wp, sp). Then, the remaining digits, wp-p0, are forwarded during the bit-shift condition. If the extra token consumed during the digit-shift condition was necessary, then the extend condition generates that token. Finally, the cap token finishes the digit-stream.

This computes the utilization of each condition as shown in Table 16. On average, each shift runs about 5.411 cycles in total, generating 2.897 digits on the output.

Condition     Average Cycles/Stream
digit-shift   2.514
bit-shift     1.632
extend        0.265
cap           1.000
Total Cycles  5.411

Table 16. Utilization of each condition for right shift.

Once again, the distribution is assumed to be effectively truncated to the max bitwidth. Then, the raw performance numbers in Table 14 are multiplied with the utilization in Table 16 to produce the average performance of the right shift operator as shown in Fig. 81.

Fig. 81 shows the average performance of the length-adaptive digit-serial right shift operation against the other bit-parallel shifter implementations in Table 13 for a given maximum bitwidth. The digit-serial shifter operates about 2.84 times faster per transistor than the Synopsys implementation but 60% slower than the best bit-parallel full-custom design. However, it once again uses 54% and 50% less energy respectively for the same operations.

Fig. 81: Throughput and energy metrics scaled by maximum bitwidth.
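For concreteness, the right shift utilization loop above can be made runnable by stubbing out P, max_bitwidth, and max(B). The two-point distribution below is a made-up placeholder purely for illustration, not SPEC2006 data.

    # Runnable toy version of the right shift utilization loop. 'dist' is a
    # made-up stand-in for the measured distribution P.
    packet = 4
    dist = {(9, 4): 0.6, (11, 5): 0.4}   # {(bitwidth(A), B): probability}

    u = {'digit-shift': 0.0, 'bit-shift': 0.0, 'extend': 0.0, 'cap': 0.0}
    for (w, s), p in dist.items():
        wp = (w + packet - 2) // packet   # non-cap digits in the input stream
        sp = (s + packet - 1) // packet   # digits the shift amount could consume
        p0 = min(wp, sp)                  # digits actually consumed
        u['digit-shift'] += p0 * p
        u['bit-shift'] += (wp - p0) * p
        if s < w and (s + packet - 1) % packet < (w + packet - 2) % packet:
            u['extend'] += p
        u['cap'] += p

    print(u)  # digit-shift ~1.4, bit-shift ~1.0, extend ~0.4, cap ~1.0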
CHAPTER 7
ADDITION AND SUBTRACTION

Addition and subtraction are ultimately the core operations executed in general-compute applications, accounting for 43% of all integer arithmetic operations as shown in Fig. 31. Therefore, these operations are also some of the most explored operations in a CPU. There has long been significant research toward a varied array of bit-parallel arithmetic circuitry [137]. The Ripple-Carry Adder is simple and energy efficient but ultimately slow, producing a result in worst-case linear time. The Manchester Carry Chain improves upon this structure using pass transistor logic along the carry chain [104]. Sacrificing area and energy for latency and throughput [139][140], there is a large class of carry-lookahead adders that produce a result in worst-case logarithmic time [99][100][101][102][103]. Finally, there are hybrid adders that mix multiple strategies, tying four-bit Manchester Carry Chains together using carry-lookahead techniques [105][106].

7.1 Addition

The fundamental algorithm for LSD-first serial addition is fairly simple. Both input streams are assumed to be aligned such that the first token in each stream represents the same digit-place. Digits arrive on the input channels A and B in the same order that the carry chain is propagated. So, they are added with the carry from the previous iteration, ci, to produce the sum on the output channel S and a new carry for the next iteration, co. The CHP below describes the algorithm.

ci:=0;
∗[s := (Ad + Bd + ci) % pow(2, N);
  co := (Ad + Bd + ci) / pow(2, N);
  S!s; A?,B?; ci:=co;
 ]

However, a real implementation must support finite-length streams. So, the not-cap/cap control bit is added to each token. To operate on two streams of differing lengths, the shorter stream is sign extended by skipping the acknowledgement of its cap token, repeating it until the cap token of the longer stream. Then, both are acknowledged, continuing to the next operation. Because streams can extend to an arbitrary length, they can represent arbitrarily large numbers with a fixed precision. This builds upon the sign-extend logic presented in Chapter 5.

For addition, finite-length streams also introduce overflow conditions. When both inputs are cap tokens, then two's complement dictates that their values repeat. So the output values must also repeat. However, if co ≠ ci, then the next sum token will be different from the current one. Extending the input streams by one more token on an overflow condition guarantees that co = ci on the next iteration and that consecutive sum bits will all be the same. Then, ci is reset and the output stream is completed by forwarding a cap token.

∗[s := (Ad + Bd + ci) % pow(2, N);
  co := (Ad + Bd + ci) / pow(2, N);
  [ !Ac ∨ !Bc → S!(s,0); ci:=co;
      [ !Ac → A? ▯ else → skip ],
      [ !Bc → B? ▯ else → skip ]
  ▯ Ac ∧ Bc ∧ co≠ci → S!(s,0); ci:=co
  ▯ Ac ∧ Bc ∧ co=ci → S!(s,1); ci:=0; A?,B?
  ]
 ]
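Before moving to the circuit, the stream-level behavior specified above can be mirrored with a small Python model, shown below. It is a behavioral sketch for illustration only, representing a stream as a list of (data, cap) tokens; it is not the circuit implementation.

    # Behavioral sketch of length-adaptive serial addition. A stream is a list
    # of (data, cap) tokens, least significant digit first; the cap token
    # repeats the sign digit (all zeros or all ones). N is the digit size.
    def add_streams(A, B, N=4):
        base = 1 << N
        out, ci, ia, ib = [], 0, 0, 0
        while True:
            (ad, ac), (bd, bc) = A[ia], B[ib]
            co, sd = divmod(ad + bd + ci, base)
            if not ac or not bc:
                out.append((sd, 0)); ci = co
                if not ac: ia += 1   # consume A only if it is not a cap token
                if not bc: ib += 1   # a held cap token sign-extends the shorter stream
            elif co != ci:
                out.append((sd, 0)); ci = co   # overflow: extend by one more token
            else:
                out.append((sd, 1))            # emit the cap token, completing the stream
                return out

    # 8 + 8 overflows the first 4-bit digit and extends the result by one token:
    # add_streams([(8,0),(0,1)], [(8,0),(0,1)]) == [(0,0),(1,0),(0,1)]   # 16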
This circuit implementation builds from the integrated QDI/BD sign extension unit described in Chapter 5. Once again, there are four cases defined by the intersection of Ac and Bc that need to be implemented. In the first case, neither input is a cap token. In the second, only A has a cap token, and in the third, only B. Finally, both inputs have cap tokens in the fourth case. Unfortunately, none of these cases line up in the forward driver or acknowledgement logic. In the forward drivers, conditions 0, 1, and 2 drive Sc0, and condition 3 drives Sc1. In the acknowledgement logic, A is only acknowledged for conditions 0, 2, and 3, and B for conditions 0, 1, and 3. To avoid this, SR latches store a static version of the input requests. These are placed before the delay lines on the input requests to give them time to stabilize before the QDI circuitry starts to operate, as in the sign extension unit.

Ax0 ∨ Ac0 → Ax1⇂
Ax1 ∨ Ac1 → Ax0⇂
¬Ax0 ∧ ¬Ac0 → Ax1↾
¬Ax1 ∧ ¬Ac1 → Ax0↾
Bx0 ∨ Bc0 → Bx1⇂
Bx1 ∨ Bc1 → Bx0⇂
¬Bx0 ∧ ¬Bc0 → Bx1↾
¬Bx1 ∧ ¬Bc1 → Bx0↾

Then, the input requests are combined before the delay lines. This reduces the number of delay lines by two and has zero overhead with respect to the rest of the control. After delaying AB, the rest of the control can be implemented as necessary using AB as its input.

Fig. 82: The architecture of the Adaptive Adder.

(Ac0 ∧ (Bc0 ∨ Bc1) ∨ Ac1 ∧ Bc0) → AB0↾
Ac1 ∧ Bc1 → AB1↾
¬Ac0 ∧ ¬Bc0 → AB0⇂
¬Ac1 ∧ ¬Bc1 → AB1⇂

The comparison logic for Ci and Co comes next. To reduce the overall gate area, a pass transistor XOR is used to determine whether Ci and Co are different. Because this XOR will be used in the QDI handshake, its output must remain high as Ci is transitioning between values through its neutral state, (1,1). This means that the usual pass transistor XOR is not sufficient. However, Co remains stable through the QDI handshake, and both Ci and Co are one-hot encodings.

@Cid1 ∧ ¬Cod1 ∨ @Cid0 ∧ ¬Cod0 → Dd1↾
¬@Cid0 ∧ Cod1 ∨ ¬@Cid1 ∧ Cod0 → Dd1⇂
Dd1 → Dd0⇂
¬Dd1 → Dd0↾

With the above setup, the main cycle can now be implemented, starting with the forward drivers. Luckily, it can be drastically simplified by a few key observations.

First, regarding the acknowledgement signals Ae and Be: if AB is not a cap, then a non-cap token is output on S, A is acknowledged if it is not a cap, and B is acknowledged if it is not a cap. However, if AB is a cap token, then there are two conditions. The overflow condition, when Co ≠ Ci, also outputs a non-cap token on the output. It acknowledges neither A nor B, and luckily both A and B are cap tokens. So the acknowledgement is automatically implemented by the same logic that handles the case in which AB is not a cap. The final case, in which Co = Ci, outputs a cap token on S and so must be handled as a separate set of logic anyway.

Second, on an overflow condition, both A and B are cap tokens but Co ≠ Ci. This means that the inputs are not acknowledged, the next operation does not pass through the delay lines, and the bundled-data timing assumption breaks. However, because cap tokens must be all ones or all zeros, if Co is not equal to Ci, then the data on A and B must be equal. If they were not, then the resulting addition would be all ones, and the value on Ci would be faithfully propagated to Co, making them equal. If A and B are all zeros, then Co is guaranteed to be 0, meaning Ci must be 1. In this case, only the least significant bit of the datapath changes. If A and B are all ones, then Co is guaranteed to be 1 and Ci must be 0. In this case, no bits are changed in the datapath. This means that the max delay required by the datapath in this case is constant at one bit in the carry chain, which is far less than the natural cycle time of the control process. This makes the forward driver and acknowledgement logic extremely simple.

Se ∧ (ABd0 ∨ ABd1 ∧ Dd1) → Sd0↾
Se ∧ ABd1 ∧ Dd0 → Sd1↾
Sd0 ∧ Ax0 ∨ Sd1 → Ae⇂
Sd0 ∧ Bx0 ∨ Sd1 → Be⇂

Third, if Co ≠ Ci, then setting Ci = Co will not cause any transition on Co. The only time Co is dependent upon the value of Ci is when all of the bits in the adder propagate the carry. However, in that case, Co is guaranteed to be equal to Ci. This means that the value of Ci can be both an input to the datapath and set by an output from the datapath without any extra control circuitry.

¬Cid0 ∨ ¬Se ∧ ¬_Sd0 ∧ ¬Cod0 → Cid1↾
¬Cid1 ∨ ¬Se ∧ (¬_Sd1 ∨ ¬_Sd0 ∧ ¬Cod1) → Cid0↾
Cid0 ∧ (Se ∨ _Sd0 ∨ Cod0) → Cid1⇂
Cid1 ∧ (Se ∨ _Sd1 ∧ (_Sd0 ∨ Cod1)) → Cid0⇂

Fourth, in the reset phase, the next Ci must have the correct value before resetting the forward drivers and the acknowledgement. Luckily, the overflow case does not acknowledge ABd1, so resetting Sd0 only has to make sure ABd0 is acknowledged and Ci = Co as evaluated by D.

¬Se ∧ ¬ABd0 ∧ ¬Dd1 → Sd0⇂
¬Se ∧ ¬ABd1 ∧ ¬Cid1 → Sd1⇂
(¬Sd0 ∨ ¬Ax0) ∧ ¬Sd1 → Ae↾
(¬Sd0 ∨ ¬Bx0) ∧ ¬Sd1 → Be↾

Fig. 83: Transistor diagram of LSB adder control circuitry.

For the datapath shown in Fig. 82, the input data for A and B are latched using Ae and Be respectively. The latched data and the Ci are fed into a Manchester Carry Chain which drives the output data, Sd, and the carry-out, Co.
7.2 Subtraction

Adaptive digit-serial subtraction can be implemented by simply inverting the second of the two inputs just after the latches, as in Fig. 84, and resetting the carry-in to one instead of zero. In a CGRA, this should be configured synchronously during initialization.

(¬Cid0 ∨ ¬Se ∧ (¬_Sd1 ∧ ¬_cfg ∨ ¬Cod0 ∧ ¬_Sd0)) → Cid1↾
(¬Cid1 ∨ ¬Se ∧ (¬_Sd1 ∧ ¬cfg ∨ ¬Cod1 ∧ ¬_Sd0)) → Cid0↾
Cid0 ∧ (Se ∨ (_Sd1 ∨ _cfg) ∧ (_Sd0 ∨ Cod0)) → Cid1⇂
Cid1 ∧ (Se ∨ (_Sd1 ∨ cfg) ∧ (_Sd0 ∨ Cod1)) → Cid0⇂
¬Se ∧ ¬ABd1 ∧ (¬Cid1 ∧ ¬cfg ∨ ¬Cid0 ∧ ¬_cfg) → Sd1⇂

Fig. 84: The architecture of the Adaptive Adder/Subtractor.

7.3 Evaluation

Aside from the Integrated Adaptive adder developed in this thesis, four other serial adders were developed for comparison: a clocked non-adaptive digit-serial adder, a clocked adaptive digit-serial adder synthesized by Synopsys Design Compiler, a BD adaptive serial adder, and a QDI adaptive serial adder. Furthermore, a set of parallel adders were built for comparison, including clocked Kogge-Stone [101], Han-Carlson [103], and Brent-Kung [102] carry-lookahead adders, a clocked Manchester-Carry adder [104], a clocked Ripple-Carry adder, and a QDI ripple-carry adder [68]. Table 17 shows the measured performance for the parallel adders.

Table 18 shows the raw per-token measurements for each condition of the digit-serial adders. In the "ab" condition, both inputs are non-cap tokens and are therefore both acknowledged. In the "a" condition, A is a non-cap token and B is a cap token, so only A is acknowledged. Similarly, in the "b" condition, only A is a cap token, so only B is acknowledged. Finally, in the "cap" condition, both inputs are cap tokens and are therefore both acknowledged.

The utilization of these behavioral conditions is directly determined by the joint bitwidth distribution of the two inputs as shown in Fig. 85, measured from the SPEC2006 benchmark. The center plot shows the joint probability distribution while each histogram shows the associated individual probability and cumulative distributions for that axis. However, this plot includes some operations that should not be handled by a digit-serial architecture. Specifically, there are significant spikes around 47 and 48 bits as discussed in Chapter 2 Section 4. These spikes represent memory address computations with a 48-bit wide memory bus. These operations have predictable bitwidth and should be handled by their own bit-parallel datapath. Therefore, ignoring those operations, the bitwidth of the input operand A averages around 11.01 bits while B averages around 7.63 bits.

Type                                   Transistors  Frequency  Energy/Op
Clocked Parallel Kogge-Stone (64-bit)  7846         4.72 GHz   0.955 pJ
Clocked Parallel Han-Carlson (64-bit)  6552         3.88 GHz   0.846 pJ
Clocked Parallel Brent-Kung (64-bit)   5832         1.87 GHz   0.799 pJ
Clocked Parallel Ripple (64-bit)       4830         0.39 GHz   0.736 pJ
Clocked Parallel Manchester (64-bit)   4958         0.94 GHz   0.865 pJ
QDI Parallel Ripple (64-bit)           8196         2.88 GHz   2.572 pJ

Table 17. Performance measurements for the bit-parallel addition operators.
Type                        Transistors  Condition  Token Frequency  Energy/Token  Latency
Integrated Serial Adaptive  302          ab         2.16 GHz         71.70 fJ      172 ps
(4-bit)                                  a          2.21 GHz         47.99 fJ      171 ps
                                         b          2.22 GHz         47.32 fJ      170 ps
                                         cap        2.20 GHz         52.53 fJ      167 ps
QDI Serial Adaptive         423          ab         2.09 GHz         117.46 fJ     140 ps
(1-bit)                                  a          2.11 GHz         106.49 fJ     129 ps
                                         b          2.11 GHz         106.42 fJ     127 ps
                                         cap        2.10 GHz         109.23 fJ     120 ps
BD Serial Adaptive          344          ab         2.07 GHz         110.87 fJ     222 ps
(4-bit)                                  a          1.99 GHz         65.82 fJ      227 ps
                                         b          2.08 GHz         65.28 fJ      218 ps
                                         cap        2.13 GHz         87.01 fJ      241 ps
BD Serial Adaptive          616          ab         1 GHz            247.56 fJ     305 ps
Synopsys (4-bit)                         a          1 GHz            221.35 fJ     305 ps
                                         b          1 GHz            216.87 fJ     305 ps
                                         cap        1 GHz            242.60 fJ     305 ps
Clocked Serial (4-bit)      288          -          3.65 GHz         66.26 fJ      -

Table 18. Raw performance measurements for the digit-serial addition operators.

Fig. 85: Joint probability distribution for the two input bitwidths.

The utilization is computed directly from this distribution as follows. max_bitwidth(A) and max_bitwidth(B) represent the maximum bitwidths of the input operands A and B. The probability at each coordinate in Fig. 85 is sampled with P(bitwidth(A) == a and bitwidth(B) == b), ignoring cases in which A or B are 47 or 48 bits wide. Then, the average number of cycles per stream is computed for each condition. "ab" is equal to the number of non-cap tokens before sign-extension is required. Then, if one stream is longer than the other, the difference is counted in the "a" and "b" conditions. Finally, every digit stream has exactly one cap token.

u = {'ab': 0, 'a': 0, 'b': 0, 'cap': 0}
for a in range(1, max_bitwidth(A)+1):
    for b in range(1, max_bitwidth(B)+1):
        # ignore memory address operations
        if a in [47,48] or b in [47,48]:
            tmpa = 44 if a in [47,48] else a
            tmpb = 44 if b in [47,48] else b
            p = P(bitwidth(A) == tmpa and bitwidth(B) == tmpb)
        else:
            p = P(bitwidth(A) == a and bitwidth(B) == b)
        atok = int((a+packet-2)/packet)
        btok = int((b+packet-2)/packet)
        u['ab'] += min(atok,btok)*p
        u['a'] += max(0, atok-btok)*p
        u['b'] += max(0, btok-atok)*p
        u['cap'] += p

Overall, the computed results in Table 19 show that the average digit stream is relatively short, at around 4.4 tokens, with the majority of the time spent in the "ab" condition. Furthermore, A tends to have longer digit streams, which could reasonably be explained by compiler behavior and human preference: the main result being operated on tends to be scheduled to the first input while the modifier tends to be scheduled to the second. These utilization values were computed for both 1- and 4-bit digit sizes.

Now, the average performance of the circuits can be computed using the raw performance data and the utilization data, as sketched below. Fig. 86 shows the average addition throughput per transistor versus the energy per add of each adder. For a 4-bit datapath, the integrated adaptive serial adder requires 302 transistors. This is competitive with the 288 transistors necessary for the clocked non-adaptive serial adder because the integrated adaptive adder is only half-buffered, using latches instead of flip-flops. While the integrated adaptive design operates at 60% of the frequency using 8% more energy per token, this overhead allows the adaptive adder to skip the majority of the tokens, whereas the non-adaptive design cannot. This translates to a 2.2x increase in throughput, from 228 MHz to an average of 494 MHz, and a 75% decrease in energy, from 1060 fJ to an average of 263 fJ.
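As a back-of-the-envelope check, the sketch below reproduces those averages from the 4-bit rows of Table 18 and Table 19. The 16-token count assumed for the clocked non-adaptive serial adder (64 bits at 4 bits per digit) is an inference from the operand width, not a number stated in the table.

    # Back-of-the-envelope check of the averages quoted above, using the 4-bit
    # rows of Table 18 and Table 19.
    util   = {'ab': 1.951, 'a': 1.152, 'b': 0.329, 'cap': 1.000}   # Table 19
    freq   = {'ab': 2.16,  'a': 2.21,  'b': 2.22,  'cap': 2.20}    # GHz, Table 18
    energy = {'ab': 71.70, 'a': 47.99, 'b': 47.32, 'cap': 52.53}   # fJ, Table 18

    time_ns = sum(util[c] / freq[c] for c in util)
    print(1000.0 / time_ns)                        # ~494 MHz average throughput
    print(sum(util[c] * energy[c] for c in util))  # ~263 fJ per operation

    # the non-adaptive serial adder always processes 64/4 = 16 tokens
    print(1000.0 * 3.65 / 16)                      # ~228 MHz
    print(16 * 66.26)                              # ~1060 fJ per operation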
The most competitive 64-bit parallel adder, the Han-Carlson, has 6552 transistors. Its operation throughput is 7.85 times the integrated adaptive adder's, at 3.876 GHz. However, given the same transistor count, multiple instances of the integrated adaptive adder would have an average aggregate throughput of 11 GHz, using 69% less energy per operation.

The synchronous adaptive adder synthesized by Synopsys uses twice as many transistors, at 616, and has 54% lower operation throughput, at 226 MHz. Furthermore, it uses 4 times the energy per operation. This difference is likely because the design is synthesized using a standard cell library while the rest are full custom. Adaptivity requires stateful control-flow, either in the form of a val-rdy interface or some asynchronous channel protocol. The devices best geared to implement stateful control flow are asymmetric C-elements. These do not really exist in any standard-cell library because of the sheer number of possible cells. For this reason, good self-timed circuits often use custom layout for these cells in each design. Synthesizers do not have this option, though. Instead, they cobble together stateful control from latches, flip-flops, and combinational logic, which is ultimately a poor fit. In the datapath, the integrated adaptive adder uses latches on the data while the synthesizer used flip-flops, dramatically increasing the transistor count. Furthermore, the integrated adaptive adder uses a 4-bit Manchester Carry Chain while the synthesized implementation uses full-adder cells from the standard-cell library, ultimately implementing a normal Ripple-Carry Adder. This means that while the integrated adaptive adder can operate at 2.16 GHz, the synthesized adder is limited to 1 GHz.

Condition     Average Cycles/Stream
              1-bit     4-bit
ab            6.377     1.951
a             4.631     1.152
b             1.250     0.329
cap           1.000     1.000
Total Cycles  13.258    4.432

Table 19. Utilization of each condition for the addition operator.

Fig. 86: Performance and energy averaged over the distribution in Fig. 85 vs Transistor Count.

All of these differences are fairly typical of synthesized versus full-custom designs, and the synthesis could be tuned to produce a better result. In recognition of this, a full-custom latched synchronous adaptive adder is also compared. In the end, it required a val-rdy interface because the input streams are variable-length. The circuitry required to implement a val-rdy interface is ultimately near-identical to the circuitry required to implement a bundled-data interface; the only difference is that for the val-rdy interface, the control signals are clocked instead of delayed with a delay line. So, this design ended up being the BD Adaptive adder. Ultimately, the architecture is very similar to the integrated design. At 344 transistors, it burns only 1.5 times the energy per operation with only 6% lower operation throughput.

The only adaptive self-timed adder in the literature is from Bitsnap [232]. This adder was not directly compared because the implementation of its adaptivity was not self-contained. The design of the adder is ultimately a single bit from the QDI bit-parallel ripple-carry adder with its carry-out fed back into the carry-in through a FIFO. The implementation of the control relied heavily upon the Bitsnap microprocessor architecture as a whole and was entirely inseparable from it. The QDI adaptive serial adder compared here is the self-contained version of this. It is ultimately more expensive than the other approaches due to acknowledgement requirements between the control and the datapath, implementing a 1-bit datapath with 423 transistors, 68% lower operation throughput, and 5.6 times as much energy per operation.
Fig. 87: Each point corresponds to the simulated energy per add averaged over multiple adds over the distribution in Fig. 85 for a given maximum bitwidth.

Fig. 88: Probability distribution for the number of redundant bits introduced per operation by the adder.

Fig. 87 shows the performance of these adders on average for a given maximum bitwidth using the bitwidth distribution from SPEC2006. At 32 bits, the Integrated Adaptive adder has the same average throughput efficiency as the Kogge-Stone adder. Above 32 bits, the average throughput efficiency of the Integrated Adaptive adder is significantly better than any other architecture's. For widths of more than 12 bits, the Integrated Adaptive adder uses significantly less energy on average.

Overall, the digit-serial addition operator can introduce some redundant tokens into the encoding of the result. Specifically, if the result of adding a positive and a negative number is smaller in magnitude than the widest input, then some number of bits near the cap token of the result will be redundant. Fig. 88 shows the probability distribution of these redundant bits over all addition and subtraction operations.

CHAPTER 8
COMPARISON AND CONDITIONALS

Control operations account for 22% of all instructions executed in the SPEC2006 benchmark. Efficient execution of these operations is one of the largest determining factors of a platform's performance. Modern systems employ complex branch predictors and large caches to mitigate the effect of control operations on performance. The goal of this chapter is not to design something that takes their place, but to reduce the load on these systems to help increase their accuracy for the control instructions that matter.

Overall, 68% of the control logic comes from branches. A branch operates directly on the program counter. However, in a dataflow platform like a CGRA, there is no program counter. Instead, a single branch instruction must be broken up into a conditional move for every piece of data used or modified in the body of the conditional. Unfortunately, time constraints have impeded further measurements regarding the distribution of the number of conditional moves required to implement each branch. Overall, it is safe to say that at least some branches would require only one or two conditional moves, and supporting this kind of operation in the CGRA accelerator would allow for larger basic blocks, reducing the load on the branch predictor circuitry. Fig. 31 shows that comparison operators account for 39% of all arithmetic operations, though many of these may not be implemented on the CGRA depending upon the branch/conditional-move distribution. Therefore, the implementation of the compare and conditional move operations is not guided by input data distributions. Luckily, these operations are relatively simple and efficient.

8.1 Compare to Zero

The comparison algorithm is fairly straightforward. Ultimately, this circuit must be able to resolve any one of six possible comparison operations: <, >, =, <=, >=, or ≠. Unfortunately, it is not possible to determine whether the input is greater than zero or less than zero until the sign has been received from the cap token at the very end of the stream. Furthermore, it is not possible to determine that the stream is equal to zero until all of the tokens have been received.
Therefore, the naive approach for encoding those six cases into output requests would be to assign an output request to each of <, =, and >. The other three cases can then be computed with an OR; for example, ≠ is < OR >. The comparison unit would keep track of whether all of the received tokens were zero, and then resolve the relation upon receiving the cap token.

v := 0;
∗[ L?l;
   [ lc = 0 → [ ld ≠ 0 → v := 1 ▯ else → skip ]
   ▯ lc = 1 → [ ld ≠ 0 → R!"<"
              ▯ v = 0 ∧ ld = 0 → R!"="
              ▯ v = 1 ∧ ld = 0 → R!">"
              ]
   ]]

However, it is ultimately possible to determine that the input is not equal to zero before receiving the cap token. In fact, the ≠ condition can generally be resolved by the very first token of the digit-stream. This early out would affect 73% of all arithmetic comparison operation executions.

Unfortunately, tackling this early-out opportunity yields quite an unusual architecture. The naive assignment for the output requests is insufficient, since it is not possible to resolve which of < or > the ≠ condition belongs to until the very end of the digit-stream. Therefore, the output requests must be inverted. This yields output requests for >=, ≠, and <= respectively. Unfortunately, while the naive approach ensures mutual exclusivity for the output requests, this approach does not. For example, < would require both <= and ≠ to be asserted simultaneously. This would force the output channel to use a 2of3 encoding on the output requests. Therefore, taking advantage of the early out for ≠ at the receiving side will be difficult, because the receiver would have to wait for a second signal before acknowledging the condition and then wait for both signals to be reset before enabling the channel again. Fortunately, it is possible to implement a non-standard channel protocol to allow for early acknowledgement and early enable. However, this means that the behavior of this circuit is no longer reasonably expressible using CHP.

(Re ∨ R1 ∨ R2) ∧ _Rv ∧ Lc1 ∧ ¬Ld0 → R0↾
(Re ∨ R0 ∨ R2) ∧ _Rv ∧ (Lc0 ∨ Lc1) ∧ z0 → R1↾
(Re ∨ R0 ∨ R1) ∧ _Rv ∧ Lc1 ∧ (_R1 ∨ Ld0) → R2↾

Starting with the forward drivers: R0 implements >=, R1 implements ≠, and R2 implements <=. R0 and R2 can only be determined using the cap token, as signalled by Lc1. Because all of the bits in the cap token have the same value, it is enough to simply check one. If the cap token is zero, signalled by ¬Ld0, then the sign is positive and L is greater than or equal to zero. This is enough to signal R0. If the cap token is not zero, signalled by Ld0, then the sign is negative and L must be less than zero. Unfortunately, this is not enough to signal R2, because it does not cover whether L is equal to zero. For that, the inverted sense of R1 can be used, since R1 will remain low if L is zero.

¬Ld0 ∧ ¬Ld1 ∧ ¬Ld2 ∧ ¬Ld3 → z1↾
Ld0 ∨ Ld1 ∨ Ld2 ∨ Ld3 → z1⇂
z1 → z0⇂
¬z1 → z0↾

The bits in the datapath are combined to generate a zero/not-zero signal z for each token. Because this process does not latch the datapath, this signal z is only stable after Lc0 or Lc1 goes high and before Le is lowered, acknowledging the channel. Then, if one of the bits is not zero, as signalled by z0, R1 is raised to signal the ≠ event.

R0 ∧ R1 ∨ R0 ∧ R2 ∨ R1 ∧ R2 → Rv↾

The forward drivers check Re. However, because Re can be lowered after any one of the output requests is raised, that check must be stabilized with the output requests themselves. Finally, an extra signal Rv ensures that the handshake remains stable.
Rv ∨ (z1 ∨ R1) ∧ Lc0 → Le⇂

Rv is only raised once the cap token has been received on L. This signals the completion of the comparison operation. Therefore, a different expression must handle the input acknowledgement on Le for the rest of the tokens in the digit stream. If z0 is high, then R1 must be raised, so Le waits for either R1 or z1. Finally, Le acknowledges the non-cap input request on Lc0 directly.

(¬Re ∨ ¬R1 ∧ ¬R2) ∧ ¬_Rv ∧ ¬Lc1 → R0⇂
(¬Re ∨ ¬R0 ∧ ¬R2) ∧ ¬_Rv ∧ ¬Lc1 → R1⇂
(¬Re ∨ ¬R0 ∧ ¬R1) ∧ ¬_Rv ∧ ¬Lc1 → R2⇂
¬R0 ∧ ¬R1 ∧ ¬R2 → Rv⇂
¬Rv ∧ (¬z1 ∧ ¬R1 ∨ ¬Lc0) → Le↾

In the reset phase, Re can be lowered at any time after the first output request is sent. However, because of the check on ¬_Rv in the reset phase of the forward drivers, none of the forward drivers will reset until two of the three output requests have been raised, completing the delay-insensitive encoding. Once that happens, the output requests are reset. However, as soon as any of them resets, the output enable Re can be raised. This means that the handshake has to be stabilized once again by checking the value of the output requests. Once all of the output requests have been reset, Rv is lowered and the input channel is enabled on Le, following the standard WCHB handshake.

Now, implementing the conversion to the six comparison conditions is similar to the naive approach. However, instead of combining them with OR, they are combined with AND. For example, the following implements < on C. C0 is connected directly to R0, because R0 goes high when L is greater than or equal to zero. Meanwhile, C1 is generated by combining R2, which is raised when L is less than or equal to zero, and R1, which is raised when L is not equal to zero.

C0 = R0
R2 ∧ R1 → C1↾
¬R2 ∨ ¬R1 → C1⇂

The non-standard handshake implemented by the comparison unit allows this conversion to be implemented directly with combinational logic instead of a collection of C-elements.

8.2 Conditional Sink

The conditional sink is not much more complex than a WCHB buffer. The input condition C determines whether or not a given digit-stream from L is forwarded through R. If C is 1, then the digit stream is forwarded, sending each token received from L on R. Otherwise, those tokens are discarded. When the cap token is finally received, the input condition C is also acknowledged, signalling the completion of the operation.

∗[L?l;
  [ C? = 1 → R!l ▯ else → skip ];
  [ lc = 1 → C? ▯ else → skip ]
 ]

This is implemented directly with the standard WCHB reshuffling, similar to a token split. C1 is used to condition the forward drivers controlling R. Meanwhile, one of the forward drivers for the skip can be bypassed directly to the input acknowledgement on Le. When the cap token is received, the condition is acknowledged on Ce, and everything resets.

Re ∧ C1 ∧ Lc0 → Rc0↾
Re ∧ C1 ∧ Lc1 → Rc1↾
C0 ∧ Lc1 → S1↾
Rc0 ∨ Rc1 ∨ C0 ∧ Lc0 ∨ S1 → Le⇂
Rc1 ∨ S1 → Ce⇂
¬Re ∧ ¬Lc0 → Rc0⇂
¬Re ∧ ¬C1 ∧ ¬Lc1 → Rc1⇂
¬C0 ∧ ¬Lc1 → S1⇂
¬Rc0 ∧ ¬Rc1 ∧ (¬C0 ∨ ¬Lc0) ∧ ¬S1 → Le↾
¬Rc1 ∧ ¬S1 → Ce↾
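For reference, both specifications above can be mirrored at the behavioral level with small Python sketches, assuming the same list-of-(data, cap)-token streams as before. These illustrate the token-level behavior only, not the handshakes.

    # Behavioral sketches of the naive compare-to-zero and the conditional sink.
    def compare_to_zero(L):
        v = 0
        for ld, lc in L:
            if lc == 0:
                if ld != 0:
                    v = 1                       # a nonzero token has been seen
            else:
                if ld != 0:
                    return '<'                  # nonzero cap token: negative sign
                return '=' if v == 0 else '>'   # zero cap token

    def conditional_sink(c, L):
        R = []
        for token in L:
            if c == 1:
                R.append(token)   # forward the token only when the condition is 1
        return R                  # C is acknowledged once the cap token arrives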
8.3 Evaluation

Unfortunately, it is difficult to evaluate these operations relative to typical synchronous architectures, because it is not really possible to separate their implementations from the architectures themselves. Comparison operations are built into other operations in the execute stage and often do not have their own pipeline stage. Meanwhile, there is nothing comparable to the conditional sink in a synchronous architecture. Overall, these circuits would really need to be evaluated architecture-wide, which is simply not possible within the scope of this thesis.

8.3.1 Comparison

In synchronous systems, the comparison operation is generally grafted onto the addition operation by examining the overflow flag. Effectively, the comparison operation generally occupies less than a full pipeline stage. This means that it should be difficult for any self-timed digit-serial approach to compete with the performance of the bit-parallel approach. Furthermore, because the comparison is less than a pipeline stage, it is difficult to determine how much of the power from that pipeline stage and from that part of the clock tree should be attributed to this particular operation. Therefore, four bit-parallel compare-to-zero architectures are compared both with and without the pipeline overhead. The real performance is ultimately somewhere in between.

Table 20 shows the performance for a set of possible synchronous bit-parallel comparison architectures. The Precharge architecture uses a distributed OR across all 64 bits to determine whether the input is equal to zero. This OR avoids a long transistor stack by precharging the node high in the first half of the clock cycle and resolving the OR in the second half. The Manchester architecture simulates what the performance might be if the comparison operation were fused with a 64-bit Manchester Carry adder. The Tree architecture simply uses a 3-stage gate tree. Finally, the Synopsys architecture was automatically synthesized by Synopsys Design Compiler from a Verilog specification. Ultimately, many of these architectures can operate well beyond the clock speed found in most synchronous systems, which is why they are generally allocated less than one pipeline stage. Therefore, it is not particularly easy to compare these results with the approach presented in this chapter.

There are four basic behaviors. During the "zero" condition, the input channel is sourcing zeros and the ≠ flag has not been resolved yet. During the "early-out" condition, all previous tokens have been zero and the current token is not. This is the condition during which the ≠ flag is sent early. Then, during the "internal" condition, the input channel is sourcing non-cap tokens and the ≠ flag has already been sent. Finally, in the "cap" condition, the cap token has arrived on the input channel and the final flags are sent on the output. The "zero" and "internal" conditions have identical performance metrics.

Type                                  Transistors  Token Frequency  Energy/Token
Gate Parallel Precharge (64-bit)      79           5.21 GHz         52.26 fJ
Gate Parallel Manchester (64-bit)     220          0.90 GHz         6.05 fJ
Gate Parallel Tree (64-bit)           174          5.68 GHz         6.07 fJ
Gate Parallel Synopsys (64-bit)       470          4.20 GHz         61.65 fJ
Clocked Parallel Precharge (64-bit)   1871         5.10 GHz         260.15 fJ
Clocked Parallel Manchester (64-bit)  2012         0.87 GHz         269.15 fJ
Clocked Parallel Tree (64-bit)        1966         5.56 GHz         253.80 fJ
Clocked Parallel Synopsys (64-bit)    2770         5.19 GHz         576.76 fJ

Table 20. Raw performance measurements for the bit-parallel comparison operators.

Type                        Transistors  Condition      Token Frequency  Energy/Token  Latency
Integrated Serial Adaptive  107          zero/internal  3.86 GHz         9.78 fJ       -
                                         early-out      3.08 GHz         15.01 fJ      149 ps
                                         cap            1.80 GHz         34.21 fJ      140 ps

Table 21. Raw performance measurements for the digit-serial comparison operator.

Interestingly, the overhead introduced by the integrated adaptive digit-serial approach is fairly limited.
Because there are no latches, the energy cost of each token is significantly less than that of most other operators. Meanwhile, the frequency of the comparison operator sits well above all of the other operators, meaning it will not be the bottleneck in any scenario.

To analyze the performance of this circuit, two things are required. The first is the bitwidth of the input digit stream, determining how many internal conditions execute before the cap token. The second is the alignment, or the number of zero bits before the least significant non-zero bit in the input. This determines the latency of the ≠ flag and the average number of early-out conditions executed. The joint distribution of these two features is shown in Fig. 89. One should note that this distribution includes any redundant tokens introduced by the add that generally precedes the comparison operation during the execution of a branch.

Fig. 89: Probability distribution for the input bitwidth.

The utilization is computed from this distribution as follows. max_bitwidth(A) and max_alignment(A) represent the maximum bitwidth and alignment respectively of the input operand A. The probability at each coordinate in Fig. 89 is sampled with P(bitwidth(A) == b and alignment(A) == a), and then the average number of cycles per stream is computed for each condition. With a token or packet size of 4, the zero bits that represent the alignment are tokenized to determine the number of tokens in the "zero" condition. If A is zero, then the alignment a will be equal to the bitwidth b, and all of the non-cap tokens will execute the "zero" condition. Then, if A is not zero, the comparison unit will execute an "early-out" condition, followed by some number of "internal" conditions determined by the number of tokens left in the digit stream. Finally, the "cap" condition is executed, sending out the remaining comparison results.

u = {'zero': 0, 'early-out': 0, 'internal': 0, 'cap': 0}
for b in range(1, max_bitwidth(A)+1):
    for a in range(0, max_alignment(A)):
        p = P(bitwidth(A) == b and alignment(A) == a)
        btok = int((b+packet-2)/packet)
        atok = int(a/packet)
        if atok < btok:
            u['zero'] += atok*p
            u['early-out'] += p
            u['internal'] += (btok-atok-1)*p
        else:
            u['zero'] += btok*p
        u['cap'] += p

For a max bitwidth of 64 bits, the average length of the digit stream on A is around 3.4 tokens. Furthermore, there are only around 1.1 tokens before the early-out condition is executed. This means that the ≠ flag has an average latency of 444 ps from the arrival of the first token on the input channel A, while the other flags have an average latency of 791 ps.

On average, when the overhead of the bit-parallel pipeline is ignored, the adaptive digit-serial comparison operator is competitive with the bit-parallel comparator generated by Synopsys Design Compiler, with only 10% lower throughput per transistor but 4% less energy per operation. Furthermore, while the throughput of the digit-serial operator is only 12% of the max throughput of the fastest bit-parallel comparator, the real throughput of the bit-parallel comparator will be determined by the system as a whole. Meanwhile, while the digit-serial operator uses 9.8 times as much energy as the most energy-efficient comparator, the absolute difference is 53 fJ per operation, which is thoroughly accounted for by the other arithmetic operators. When the overhead of the bit-parallel pipeline is accounted for, there is no longer any competition.
The digit-serial comparator has 2.84 times the throughput per transistor of the fastest bit-parallel comparator and uses 77% less energy than the most energy-efficient bit-parallel comparator. Overall, the digit-serial comparator is more energy efficient than all of the bit-parallel operators at any bitwidth. However, its throughput per transistor is lower until around 32 bits.

Condition     Average Cycles/Stream
zero          0.148
early-out     0.765
internal      1.242
cap           1.000
Total Cycles  3.155

Table 22. Utilization of each condition for the comparison operator.

Fig. 90: Performance and energy averaged over the distribution in Fig. 89 vs Transistor Count.

Fig. 91: Each point corresponds to the simulated energy per compare averaged over multiple compare operations over the distribution in Fig. 89 for a given maximum bitwidth.

8.3.2 Conditional Sink

Unfortunately, there is no comparable bit-parallel synchronous counterpart to the conditional sink. In synchronous control-flow architectures, conditional actions are implemented by branches, which are inseparably tied to the execution pipeline, the program counter, the branch predictors, and the branch target buffer. Furthermore, it would take multiple conditional sinks to implement all of the conditional routing implied by a single branch instruction. Meanwhile, synchronous data-flow architectures cannot implement a conditional sink because data must always flow through a clocked pipeline. Instead, they implement conditional merges and clock-gate the not-taken branch of the dataflow split. This means that the performance of that conditional action is tied to the system-wide implementation of the clock-gating circuitry. This means that it is impossible to evaluate the conditional sink against any particular baseline.

Fortunately, Table 23 shows that the conditional sink is relatively inexpensive. There are four behavioral conditions. The "internal" condition deals with non-cap tokens when the branch is taken. The "cap" condition deals with the cap token when the branch is taken. Similarly, the "skip-internal" and "skip-cap" conditions deal with the non-cap and cap tokens when the branch is not taken. All of these conditions can operate faster than 3 GHz and use around 20 fJ per token. Because the other operators execute in the 2 GHz range, this will never be the bottleneck in the system.

Type                        Transistors  Condition      Token Frequency  Energy/Token  Latency
Integrated Serial Adaptive  105          internal       3.23 GHz         19.63 fJ      48 ps
                                         cap            3.11 GHz         22.12 fJ      51 ps
                                         skip-internal  3.82 GHz         13.30 fJ      -
                                         skip-cap       3.55 GHz         21.09 fJ      -

Table 23. Raw performance measurements for the digit-serial conditional sink operator.

CHAPTER 9
MULTIPLICATION

Multiplication has been studied extensively for decades, with hundreds of people weighing in on the subject. In 1956, Kolmogorov conjectured multiplication to be an operation of complexity O(n^2) [108]. This conjecture was quickly disproved by Anatolii Karatsuba in 1960 when he introduced a trick that brought the complexity to O(n^log2(3)) [107]. Since then, there has been iterative progress, with a lower bound complexity of n*log(n) conjectured in [112] and finally proven in [114] and [109]. Unfortunately, while these multiplication algorithms are asymptotically superior, they also have successively larger constants, making them largely inapplicable to computer arithmetic circuitry.
Instead, research in multiplier circuitry for general compute has focused heavily on structures with O(n^2) complexity, while the asymptotically more efficient algorithms have been relegated to cryptography applications where larger multiplications are required [115].

Year  Algorithm                 Complexity
1960  Karatsuba [107]           O(n^log2(3))
1963  Toom [110]                O(n^log3(5))
1966  Toom-Cook [111]           O(n^logk(2k+1))
1971  Schönhage-Strassen [112]  O(n*log(n)*log(log(n)))
2007  Fürer [113]               O(n*log(n)*16^log*(n))
2019  Harvey-Hoeven [114]       O(n*log(n))

Table 24. Short history of multiplication algorithms and their complexity.

Today, bit-parallel multipliers fall into one of just four underlying architectures. Array multipliers directly implement the long multiplication algorithm, using n-bit adders to combine each successive row of partial products with the sum from the previous row [117]. Carry-save array multipliers use carry-save adders to sequentially sum the rows [124]. Tree multipliers build on this idea by removing the sequentiality and summing each column in parallel with a tree of adders [118][119][120][123]. While these multipliers use O(n^2) one-bit full adders to multiply two numbers in O(1) cycles, iterative multipliers time-multiplex the operation, typically allocating O(n) one-bit full adders for O(n) cycles and accumulating the sum of the rows in memory.

Digit-serial or "online" multipliers were first introduced in [125]. Like iterative multipliers, they allocate O(n) one-bit full adders for O(n) cycles. However, instead of completing a whole row per cycle, digit-serial multipliers complete a whole column per cycle. This allows them to emit digits on the output as digits are received on the input. This approach can be trivially made bit-parallel as in [126][127][129]. However, doing so requires the added overhead of shift registers. All of the digit-serial multiplier implementations in the literature are ultimately hybrid parallel-serial designs [131][134][130]. While they are all structured similarly, different strategies for input and carry digit routing in the multiplier array yield slight architectural variations [132][133][135]. While these architectures tend to focus on unsigned multiplication, the partial products may always be recoded to support multiplication of signed numbers with a two's complement encoding [116][121][122]. These strategies are demonstrated concretely for digit-serial multipliers in [128].

This chapter explores neither the development of the multiplication algorithm nor the datapath circuitry. In fact, [136] already endeavored to make self-timed control for a digit-serial multiplier. Instead, the goal of this chapter is to develop QDI control to make an existing digit-serial multiplier architecture length-adaptive. Specifically, in the context of a statically configured CGRA, there are two constraints that select a particular variation of the digit-serial multiplier. First, because the CGRA is statically configured, the route from one operation to the next must also be static. Second, because the arithmetic is designed to implement arbitrary-length operations, the multiplier must also support inputs of arbitrary length. These two constraints select the underlying architecture demonstrated in Fig. 92. Specifically, the node that receives the inputs also emits the output. This keeps the route through this operator static regardless of the input digit length. It also allows successively more multiplier units to be allocated higher on the stack as more input digits arrive.
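Before deriving the serial specification, it may help to recall what an arbitrary-length operand looks like. The sketch below is an illustrative Python reconstruction of the LSD-first two's complement digit-stream encoding with a sign-repeating cap token, together with the stream-level multiply behavior that the rest of this chapter implements serially; it is not the multiplier's process network, and the helper names are assumptions.

    # Illustrative encode/decode for LSD-first digit streams whose cap token
    # repeats the sign digit, plus a stream-level multiply reference.
    def to_stream(v, M=4):
        out = []
        while v != 0 and v != -1:
            out.append((v & ((1 << M) - 1), 0))   # emit the next non-cap digit
            v >>= M                               # arithmetic shift preserves the sign
        out.append((0 if v == 0 else (1 << M) - 1, 1))   # cap token carries the sign
        return out

    def to_int(stream, M=4):
        val = 0
        for k, (d, cap) in enumerate(stream):
            if cap and d:
                val -= 1 << (k * M)   # all-ones cap: weight of the infinite sign extension
            else:
                val += d << (k * M)
        return val

    def mul_streams(A, B, M=4):
        return to_stream(to_int(A, M) * to_int(B, M), M)   # Y!(a*b) at stream level

    assert to_int(mul_streams(to_stream(-3), to_stream(7))) == -21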
The architecture rendered in Fig. 92 shows the bit-serial variant as found in [131], while this chapter will develop the digit-serial variant. Simply put, in the digit-serial variant, the AND gates become digit multiplies and the one-bit adders become a multi-bit summation tree.

9.1 Behavioral Specification

Each node in the synchronous version of the architecture presented in Fig. 93 implements the following behavioral specification, assuming a digit size of M bits and a word size of N digits. Take note that B is loaded as determined by the count stored in c. The upper bits of the result are stored in x for the next digit and the lower bits are forwarded through Yo. Finally, data from Yi is delayed by one cycle through y, which is initialized to 0.

Fig. 92: The underlying multiplier architecture used in this chapter.

x := 0; y := 0; c := 0;
∗[[ c > 0 → s := A*B + y + x; A?, Yi?y; Yo!s[0:M], x := s[M:2M]; c := c-1
  ▯ c = 0 → B?; y := 0, x := 0; c := N
 ]]

Converting this to a length-adaptive specification requires the introduction of control data alongside the datapath storing whether this digit is a cap token. Doing so removes the need for c and N. Unfortunately, it also greatly increases the complexity of the specification. So, this specification will be derived from the original multiply algorithm to ensure correctness.

This derivation begins quite tautologically. The multiplier will receive values from its input channels A and B, multiply those values, and send the result on the output channel Y.

∗[ A?a, B?b; Y!(a*b) ]

The first step in the derivation is to convert A and Y to length-adaptive digit-serial channels Ai and Yo. Each digit that arrives on Ai is multiplied with the whole of B, and the result is accumulated in s. Every digit received on Ai allows the least significant digit of s to be shifted to the output channel Yo. Once the last digit on Ai has been received, that digit must be held in place until the end of the multiply. This sign-extends Ai for the rest of the operation, during which the remaining data in s is shifted out to Yo. Once that has completed, the last digit on Ai is acknowledged with B, and s is reset to zero. Keep in mind that this algorithm only implements unsigned multiplication. A Booth encoding will be applied to the datapath after synthesis to implement signed multiplication.

s:=0;
∗[ ∗[ Aic=0 → s:=Aid*B + s[M:N*M]; Ai?; Yo!(0,s[0:M]) ];
   i:=N;
   ∗[ i>0 → i:=i-1; s:=Aid*B[0:i*M] + s[M:(i+1)*M]; Yo!(i=0,s[0:M]) ];
   Ai?; B?; s:=0;
 ]

Next, the algorithm needs to be broken into simpler segments to facilitate circuit implementation. To do this, B should be split, yielding two new channels: Bi, carrying the least significant digit of B, and Bn, carrying the remaining digits. Similarly, s is split into s0, handling the computation for the least significant digit of B, and sn, handling the computation for the remaining digits.

s0:=0, sn:=0;
∗[ ∗[ Aic=0 → s0:=Aid*Bi + sn[0:M] + s0[M:2M]; sn:=Aid*Bn + sn[M:(N-1)*M]; Ai?, Yo!(0,s0[0:M]) ];
   i:=N-1;
   ∗[ i>0 → i:=i-1; s0:=Aid*Bi + sn[0:M] + s0[M:2M]; sn:=Aid*Bn[0:i*M] + sn[M:(i+1)*M]; Yo!(0,s0[0:M]) ];
   s0:=Aid*Bi + sn[0:M] + s0[M:2M]; Yo!(1,s0[0:M]);
   Ai?; Bn?; Bi?; s0:=0, sn:=0
 ]

Unfortunately, this does not implement length-adaptivity for B. So, the usual control bit Bic must be added to Bi to condition the rest of the computation on Bn. This moves the data on Bi to Bid.
s0:=0, sn:=0;
∗[ ∗[ Aic=0 → s0:=Aid*Bid + sn[0:M] + s0[M:2M];
      [ Bic=0 → sn:=Aid*Bn + sn[M:(N-1)*M] ▯ else → skip ];
      Ai?; Yo!(0,s0[0:M]) ];
   [ Bic=0 → i:=N-1;
      ∗[ i>0 → i:=i-1; s0:=Aid*Bid + sn[0:M] + s0[M:2M]; sn:=Aid*Bn[0:i*M] + sn[M:(i+1)*M]; Yo!(0,s0[0:M]) ]
   ▯ else → skip ];
   s0:=Aid*Bid + sn[0:M] + s0[M:2M]; Yo!(1,s0[0:M]);
   Ai?; [ Bic=0 → Bn? ▯ else → skip ]; Bi?;
   s0:=0, sn:=0;
 ]

In the next few steps, a process handling the least significant digit of B will be split from this one using projection. During that transformation, Ai, Yo, Bi, and s0 will be placed in one process while Bn and sn will be placed in the other. Unfortunately, the value assignment of s0 directly uses sn in the current specification. Therefore, in order to split this process using projection, a few extra channels, Ao and Yi, must be added to split up any interactions between these two internal variables. Ao forwards the input digit on Ai from the process handling the least significant digit of B to the process handling the remaining digits. Meanwhile, Yi returns the result of the digit multiply in the opposite direction.

s0:=0, sn:=0, yd:=0;
∗[ ∗[ Aic=0 → s0:=Aid*Bid + yd + s0[M:2M];
      [ Bic=0 → Ao!Ai; sn:=Aod*Bn + sn[M:(N-1)*M]; Ao?; Yi!(0,sn[0:M]), Yi?y ▯ else → skip ];
      Ai?; Yo!(0,s0[0:M]) ];
   [ Bic=0 → i:=N-1;
      ∗[ i>0 → i:=i-1; s0:=Aid*Bid + yd + s0[M:2M]; Ao!Ai; sn:=Aod*Bn[0:i*M] + sn[M:(i+1)*M]; Yi!(i=0,sn[0:M]), Yi?y; Yo!(0,s0[0:M]) ];
      Ao?
   ▯ else → skip ];
   s0:=Aid*Bid + yd + s0[M:2M]; Yo!(1,s0[0:M]);
   Ai?; [ Bic=0 → Bn? ▯ else → skip ]; Bi?;
   s0:=0, sn:=0, yd:=0;
 ]

Take note that Ao has been added such that it forwards the cap token from Ai multiple times. This will yield different process specifications for the least significant digit and every other digit. While only the least significant digit will be derived here, both specifications will be used in the multiplier. To avoid this, one could move Ao!Ai from inside the second loop to just before the second loop. However, this would yield a less efficient circuit.

With those channels added, the projection transformation can proceed. This yields two processes: the first handling the least significant digit of a length-adaptive digit-serial multiply, and the second which, aside from the extra receives on Ao in the second loop, is identical to the original digit-serial specification. This means that these transformations can be repeated, recursively pulling the next least significant digit from the specification to build the full digit-serial multiply.

s0:=0, yd:=0;
∗[ ∗[ Aic=0 → s0:=Aid*Bid + yd + s0[M:2M];
      [ Bic=0 → Ao!Ai; Yi?y ▯ else → skip ];
      Ai?; Yo!(0,s0[0:M]) ];
   [ Bic=0 → ∗[ yc=0 → s0:=Aid*Bid + yd + s0[M:2M]; Ao!Ai; Yi?y; Yo!(0,s0[0:M]) ]
   ▯ else → skip ];
   s0:=Aid*Bid + yd + s0[M:2M]; Yo!(1,s0[0:M]);
   Ai?; Bi?;
   s0:=0, yd:=0;
 ]∥
sn:=0;
∗[ ∗[ Aoc=0 → sn:=Aod*Bn + sn[M:(N-1)*M]; Ao?; Yi!(0,sn[0:M]) ];
   i:=N-1;
   ∗[ i>0 → i:=i-1; sn:=Aod*Bn[0:i*M] + sn[M:(i+1)*M]; Ao?; Yi!(i=0,sn[0:M]) ];
   Ao?; Bn?; sn:=0
 ]

The next goal is to flatten this specification for a multiply digit into the Dynamic Single-Assignment (DSA) format to facilitate circuit implementation. During this transformation, all of the internal loops are merged into the outer loop, and all of the conditions are combined into a single condition inside the outer loop. This will yield a set of conditions that can be implemented using standard WCHB techniques. For this process, there are two internal loops representing the two phases of the digit-serial multiply.
The first phase consumes the input digits from Ai, producing the first half of the results on the output channel Yo. The second phase drains the second half of the results from the multiplier without consuming any more digits from Ai. Aside from these two loops, there is also the final cap token and reset condition. For most of these conditions, there is enough internal state to create a unique predicate in the DSA. Unfortunately, this does not hold for the reset condition. Therefore, a state variable vo must be introduced to create unique predicates for the reset condition. The naive approach to this would require a three-valued internal memory, with 0 covering the first loop, 1 covering the second loop, and 2 covering the reset condition. However, because the process handling the next most significant digit is inactive during the reset condition of this process, its internal memory will remain stable. This means that vo can be reduced to a two-valued memory by taking advantage of the shared variable vi from the next most significant digit to disambiguate what would be the vo=1 case in the naive approach.

As a side note, not all of the bits in s0 need to be stored from cycle to cycle. Ultimately, only the upper half of s0 actually needs to be stored. Therefore, a new variable x will be introduced to act as this storage.

vo:=0, x:=0, yd:=0;
∗[ ∗[ Aic=0 → s0:=Aid*Bid + yd + x;
      [ Bic=0 → Ao!Ai; Yi?y ▯ else → skip ];
      Ai?; vo:=0; x:=s0M:2M, Yo!(0,s00:M) ];
   [ Bic=0 → ∗[ yc=0 → s0:=Aid*Bid + yd + x; Ao!Ai; Yi?y; vo:=0; x:=s0M:2M, Yo!(0,s00:M) ] ▯ else → skip ];
   s0:=Aid*Bid + yd + x; vo:=1; Yo!(1,s00:M);
   Ai?; Bi?;
   x:=0, yd:=0;
]

The first loop introduces the first two conditions in the DSA specification. Both are predicated by Aic=0, while the value of Bic selects between the two. The first condition, Aic=0 ∧ Bic=0, traces through just the first loop. First, it executes the datapath computation s0:=Aid*Bid + yd + x. Then, because Bic=0, it executes Ao!Ai; Yi?y from the selection statement. Finally, it finishes the loop by executing Ai?; vo:=0; x:=s0M:2M, Yo!(0,s00:M). All of the conditions are generated in this way, tracing out a particular section of the above program.

The second loop introduces the next condition. Because the first loop consumes all of the non-cap tokens on Ai, the second loop is predicated on Aic=1. This is combined with the explicit predicate of Bic=0 and the values of the internal memory vo and shared variable vi. vi=1 signifies two possible cases. If vo=0, then the cap token has propagated all the way up and down the multiplier and the next most significant digit has just executed its reset condition. However, if vo=1, then this is simply the first non-cap token being emitted by the multiplier after the reset condition of the last multiply operation. Then, if vi=0, this is one of many non-cap tokens emitted by the multiplier. The condition introduced by the second loop therefore selects for all of the non-cap tokens in the second phase with vi=0 ∨ vo=1.

Finally, as with the second loop, the reset condition is predicated by Aic=1. Then, if this is the last digit in the multiplier as signified by Bic=1, or the cap token has propagated through the whole multiplier and the next most significant digit has executed its reset condition as signified by vi=1 ∧ vo=0, the reset condition of this digit may proceed.
vo:=1, x:=0, yd:=0;
∗[[ Aic=0 ∧ Bic=0 → s0:=Aid*Bid + yd + x; Ao!Ai; Yi?y; Ai?; vo:=0; x:=s0M:2M, Yo!(0,s00:M)
 ▯ Aic=0 ∧ Bic=1 → s0:=Aid*Bid + yd + x; Ai?; vo:=0; x:=s0M:2M, Yo!(0,s00:M)
 ▯ Aic=1 ∧ Bic=0 ∧ (vi=0 ∨ vo=1) → s0:=Aid*Bid + yd + x; Ao!Ai; Yi?y; vo:=0; x:=s0M:2M, Yo!(0,s00:M)
 ▯ Aic=1 ∧ (Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0) → s0:=Aid*Bid + yd + x; vo:=1; Yo!(1,s00:M); Ai?; Bi?; x:=0, yd:=0;
]]

A bit of re-organization produces the final specification below. s0 is assigned the same expression in every condition and can therefore be pulled out of the conditional. Once this is done, the assignment order of the internal memories no longer matters, and those assignments can therefore be made parallel. In the third condition, vi=0 implies Bic=0, simplifying the predicate. Finally, in the fourth condition, vi=1 implies Aic=1, again simplifying the predicate.

vo:=1, x:=0, yd:=0;
∗[ s0:=Aid*Bid + yd + x;
   [ Aic=0 ∧ Bic=0 → Ao!Ai; Yi?y; Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=0 ∧ Bic=1 → Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ (vi=0 ∨ Bic=0 ∧ vo=1) → Ao!Ai; Yi?y; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0 → Ai?, Bi?; vo:=1, x:=0, yd:=0, Yo!(1,s00:M)
]]

To recap: in the first condition, the parallel data from Bi is not a cap token. This means that this multiplier unit is internal to the multiplier and must forward the input request from Ai to Ao and receive the result from Yi. Meanwhile, Ai is also not a cap token, so this condition behaves the most like the specification of the synchronous digit-serial multiplier.

In the second condition, the parallel data from Bi is a cap token. This means that this multiplier unit is the last one in the stack and should neither forward the request on Ao nor receive any data on Yi. This keeps yd equal to zero and leaves any other multiplier units beyond this one inactive.

In the third condition, the serial input on Ai is a cap token, but the parallel data on Bi is not. This means that the multiplier just needs to stream the high order digits of the multiply result to the output Yo. While these digits are being received on Yi and the digit is not a cap token, this condition will continue to execute. Importantly, this must sign-extend the input on Ai by not acknowledging it until a cap token arrives on Yi.

In the fourth condition, the serial input on Ai is a cap token, and either the parallel input on Bi is a cap token or the cap token has propagated through the multiplier and the next most significant digit has executed its reset condition. Therefore, no token is forwarded on Ao, the final digit is emitted on Yo, the parallel input on Bi is acknowledged, and all of the internal memories are reset.

If these transformations were to be applied again to pull out the next least significant digit from the overall multiplier specification, the result would differ only slightly. Specifically, the third condition would also receive on Ai, as shown below. This ensures that every send on Yo is paired with a receive on Ai and every receive on Yi with a send on Ao, allowing for the use of exchange channels in the implementation. This is a small enough difference that the two specifications can be combined and selected using a multiplexer fairly efficiently.
vo:=1, x:=0, yd:=0;
∗[ s0:=Aid*Bid + yd + x;
   [ Aic=0 ∧ Bic=0 → Ao!Ai; Yi?y; Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=0 ∧ Bic=1 → Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ (vi=0 ∨ Bic=0 ∧ vo=1) → Ao!Ai; Yi?y; Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0 → Ai?, Bi?; vo:=1, x:=0, yd:=0, Yo!(1,s00:M)
]]

9.2 Datapath

While the derived multiplier specification implements an unsigned multiplier, the goal of this work is to implement signed multiplication. Therefore, this specification must be modified to add a booth encoding. In a booth encoding, the most significant bit of each digit Bi is multiplied by -1. This accounts for the sign designated by the most significant bit of B. However, for internal digits, this must be reverted. Therefore, the most significant bit of each digit is also multiplied by 2 and added back in, which effectively shifts that bit into the least significant bit of the next digit. The value of each digit Bi in B is then as follows, assuming a four-bit digit.

-8*Bi,3 + 4*Bi,2 + 2*Bi,1 + Bi,0 + Bi-1,3

This means that Bi overflows at 8 and starts multiplying by negative values, but with a magnitude never greater than 8. Therefore, when designing the booth encoder, the magnitude is Bi,0:3+Bi-1,3 when Bi,3=0 and ¬Bi,0:3+¬Bi-1,3 when Bi,3=1. This value M will be used to generate the partial products for Aid*Bid as shown in Table 25.

M   Aid*Bid
0   0
1   1*Aid
2   2*Aid
3   4*Aid - 1*Aid
4   4*Aid
5   4*Aid + 1*Aid
6   4*Aid + 2*Aid
7   8*Aid - 1*Aid
8   8*Aid
Table 25. Booth encoding for a four-bit digit multiply.

If Bi,3=1, then these partial products must be negated. However, doing so directly would require a carry chain. Instead of negating the partial products, they can simply be inverted. Then, all that is left is to add 1, as initial data, to whichever of the partial products should be negative.
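The recoding can be sanity-checked exhaustively. The sketch below uses a hypothetical helper, booth_digit, that is not part of the design; it returns the sign and magnitude for one four-bit digit given the most significant bit of the previous digit, and asserts that sign*magnitude matches the digit value defined above.

def booth_digit(digit, prev_msb):
    # Recoded (sign, magnitude) of one 4-bit digit of B, assuming the
    # digit value -8*Bi,3 + 4*Bi,2 + 2*Bi,1 + Bi,0 + Bi-1,3 from the text.
    if digit & 0x8:                                # Bi,3 = 1: negative multiple
        return -1, ((~digit) & 0x7) + (1 - prev_msb)   # ¬Bi,0:3 + ¬Bi-1,3
    return 1, (digit & 0x7) + prev_msb                 # Bi,0:3 + Bi-1,3

# Exhaustive check against the digit value definition.
for d in range(16):
    for p in (0, 1):
        sign, mag = booth_digit(d, p)
        assert sign * mag == -8 * ((d >> 3) & 1) + (d & 0x7) + p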
Now this booth encoding may be added into the datapath specification as follows.

vo:=1, yd:=0;
∗[[ vo=1 → x:=booth_init(Bid) ▯ vo=0 → skip ];
   s0:=booth(Bid, Aid) + yd + x;
   [ Aic=0 ∧ Bic=0 → Ao!Ai; Yi?y; Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=0 ∧ Bic=1 → Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ (vi=0 ∨ Bic=0 ∧ vo=1) → Ao!Ai; Yi?y; Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0 → Ai?, Bi?; vo:=1, yd:=0, Yo!(1,s00:M)
]]

Unfortunately, the datapath's clocking requirements introduce new difficulties. With Ai, Yi, and X, a naive approach to clocking these signals throughout the handshake might require up to six layers of latching, which would ultimately introduce a significant amount of latching overhead beyond what is required for the bit-parallel approach. Furthermore, because this circuit is length-adaptive and booth encoded, the reset conditions are quite a bit more complex than in the baseline synchronous bit-serial approach. The design of the datapath in Fig. 93 is able to cut the latching requirements down to three layers, but requires careful attention to detail regarding the clocking order and places constraints on the control circuitry.

To remove the three latching layers that would be required for the internal memory X, the latching of X and Yid can be folded together by adding them together before latching Yid. Normally, Yid would be latched using Yie. Unfortunately, this would still require one more n-latch layer along X beyond the p and n layers latching Yid. Removing this final layer requires the latched data from Aid to remain stable until after Yid+X has been latched in. This forces an ordering Yie⇂; Aie↾ or Yie⇂ • Aie↾ in the control.

Fig. 93: Datapath architecture for each digit unit of the multiplier.

This transforms the datapath specification as follows:

vo:=1;
∗[[ vo=1 → yd:=booth_init(Bid) ▯ vo=0 → skip ];
   s0:=booth(Bid, Aid) + yd;
   [ Aic=0 ∧ Bic=0 → Ao!Ai; yd:=Yid+s0M:2M; Yi?; Ai?; vo:=0, Yo!(0,s00:M)
   ▯ Aic=0 ∧ Bic=1 → yd:=0+s0M:2M; Ai?; vo:=0, Yo!(0,s00:M)
   ▯ Aic=1 ∧ (vi=0 ∨ Bic=0 ∧ vo=1) → Ao!Ai; yd:=Yid+s0M:2M; Yi?; Ai?; vo:=0, Yo!(0,s00:M)
   ▯ Aic=1 ∧ Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0 → Ai?, Bi?; vo:=1, Yo!(1,s00:M)
]]

Furthermore, Yie would only cycle when there is a token arriving on Yi. Meanwhile, X must be clocked regardless of an incoming token on Yi. This means that a new strategy must be taken to clock the combined Yid+X on every cycle, as triggered by Ae0, Ae1, and Ae2, while maintaining the ordering constraint. This strategy uses compound latches with the three clocking signals and pushes the synchronization problem to the control.

Next, while I[0:4] remains stable throughout the whole multiply operation, the multiplexer selection signal vo does not. During the first token of a multiply operation, the booth encoder's reset data must remain stable until Yid+X has been latched in with Ae⇂. Unfortunately, vo may only transition in the set phase of the WCHB handshake while its next value is available on the forward drivers, and Ae⇂ may only occur in the reset phase after the reset of the forward drivers acknowledges the incoming request on Yi. This means that there is a short time before Ae⇂ in which the signal multiplexing I[0:4] transitions. This is fixed by moving the multiplexer before the p-latch for Yid+X. This ensures that the reset data for the booth multiplier is held stable once Ae⇂ has occurred. This also means that the internal memory vo must wait for Ae⇂ before transitioning.

Finally, at the most significant digit of the multiplier, signified by a cap token on Bi, the input data from Yid must be zeroed. This is implemented by a multiplexer on Yid controlled by the input request on Bi. Since that request remains valid throughout the operation, this multiplexer will be stable.

9.3 Control

Now, the techniques in Chapter 3 are applied to this specification to generate efficient control circuitry. Given that only the control is being synthesized and that the control operates independently from any values on the datapath, much of the specification can be ignored. This leaves the control specification shown below.

vo:=1;
∗[[ Aic=0 ∧ Bic=0 → Aoc!Aic; Yic?; Aic?; vo:=0, Yoc!0
 ▯ Aic=0 ∧ Bic=1 → Aic?; vo:=0, Yoc!0
 ▯ Aic=1 ∧ (vi=0 ∨ Bic=0 ∧ vo=1) → Aoc!Aic; Yi?; Aic?; vo:=0, Yoc!0
 ▯ Aic=1 ∧ Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0 → Aic?, Bic?; vo:=1, Yoc!1
]]

Of the four conditions, the first and second can be combined into a single condition handling the first loop. The only difference between the two is whether to forward the request on Ao and receive the result on Yi, and that difference is quite easily conditioned using Bic since it remains stable throughout the operation. Therefore, the first two conditions may be combined into a single forward driver R0. From there, Aic0 is enough to disambiguate the first and second conditions from the third and fourth. However, because the first condition sends on Ao, the forward driver must also acknowledge the channel's enable Yic, as covered by Re. Because Re is low when Bic0 is not set, Bic1 must be added to unblock the second condition.
Because the first two conditions only deal with Aic=0, only the forward request Aoc0 needs to be predicated. Furthermore, the output acknowledge Yic can also be predicated on Bic to create an output enable Re.

Bic0 → _B0⇂   // delay
¬Bic0 → _B0↾
Bic1 → B1↾   // delay
¬Bic1 → B1⇂
¬_B0 ∧ ¬Ae0 → Aoc0↾
_B0 ∨ Ae0 → Aoc0⇂
¬Ae1 → Aoc1↾
Ae1 → Aoc1⇂
_B0 ∨ Yic0 ∨ Yic1 → Re⇂
¬_B0 ∧ ¬Yic0 ∧ ¬Yic1 → Re↾

The third condition always sends on Ao; thus R1 must acknowledge Re. Furthermore, it must check vi and vo to ensure that the multiply operation has not yet completed. R2 mostly implements the predicate of the fourth condition. However, because Aoc1 is acknowledged by Yic through Re during the final execution of R1, R2 must wait for the reset of that acknowledgement through Re, allowing Re to further acknowledge the reset of Bic0. Then, as an unfortunate side effect of the shared variable vi, R2 must be predicated by Re. Because vi is an internal variable of the next most significant digit, it will transition in the middle of the handshake, allowing R2 to transition before the data from that operation has been received from Yid. This is fixed with the Re predicate.

Each forward driver is then amplified and inverted to serve as clocking signals for the datapath. In order to guarantee the bundled-data assumption, the output requests on Ao and Yo are generated from the clocking signals in Ae. This ensures that the input data from the exchange channels remains stable until the input latches have been closed by Ae.

en1 ∧ Aic0 ∧ (Re ∨ B1) → R0↾
en1 ∧ Aic1 ∧ Re ∧ (vi0 ∨ vo1) → R1↾
en1 ∧ Aic1 ∧ (Re ∧ vi1 ∧ vo0 ∨ B1) → R2↾
R0 → Ae0⇂   // amplify
R1 → Ae1⇂   // amplify
R2 → Ae2⇂   // amplify
Bie = Ae2

There are two things to notice. First, the value from Yi? is never used in the specification. Instead, Yi communicates a valid value during the reset phase of this handshake, and its value is effectively recorded in the shared variable vi every cycle, so there is no way nor need to use the value received directly from the channel. However, the least significant digit of the multiplier still needs to produce the cap/not-cap value on Yo. Second, for only the first digit of the multiply, both Ai and Yo have enable signals Aie and Yoe. For Ai, the enable signal can be driven by pass-transistor logic. For digits of greater significance, this signal can simply be ignored without introducing any instability, due to the special QDI treatment for pass-transistor logic. For the first digit, it is an efficient way to generate the input enable from the clocking signals.

@Ae0 ∧ ¬_Ae2 ∨ @Ae2 ∧ ¬_Ae0 → Aie↾
¬@Ae0 ∧ _Ae0 ∨ ¬@Ae2 ∧ _Ae2 → Aie⇂

For Yo, en0 and en1 take the place of Yoe. For the first digit, both en signals are connected to Yoe as a synchronous configuration.

en1 = Yoe
en0 = Yoe

For the further digits, the en signals are connected to Vdd and GND respectively. This allows the handshake to bypass the acknowledgement check and simply operate using the acknowledgement from the exchange channels.

en1 = gVdd
en0 = gGND

Then, the request on Yo is assigned the cap/not-cap value using another pass-transistor OR on the amplified clocking signals Ae. The upgoing transitions are delayed to ensure the bundled-data constraint in the datapath. Yoc0 handles the first, second, and third conditions while Yoc1 handles the fourth.
@Ae0 ∧ ¬_Ae1 ∨ @Ae1 ∧ ¬_Ae0 → _Yo0↾
¬@Ae0 ∧ _Ae0 ∨ ¬@Ae1 ∧ _Ae1 → _Yo0⇂
_Yo0 → Yoc0⇂
¬_Yo0 → Yoc0↾   // delay
Ae2 → Yoc1⇂
¬Ae2 → Yoc1↾   // delay

In the standard WCHB handshake, vo0 would be set with _R0 or R1 and vo1 with _R2. However, vo plays a couple of further roles. First, it is the shared variable vi during the set phase of the next least significant digit. Since the set phase of the next least significant digit triggers the set phase in this digit, the transition on vo must be delayed until the input request Aic1 has been lowered. The other input request Aic0 does not need to be considered because only R0 drives Aic0, and R0 does not depend upon a stable value on vi. Unfortunately, the reset of R2 must still acknowledge Aic1⇂ to cover for when B1 is high.

Second, vo is a reset signal for the datapath, feeding the initialization of the booth multiplier into the datapath instead of the carry digit from Yid+X. This means that the initialization data must remain stable until after Yoe⇂ and Ae0↾, Ae1↾, Ae2↾. Unfortunately, it would be impossible to wait for the appropriate transitions on Ae because they happen after the reset phase of the handshake, and the transition on vo must happen before. Therefore, the datapath will have to latch the initialization data appropriately using Ae, and vo can replace _R with Ae to ensure that latch remains stable.

Third, vo is used in the forward drivers of this digit. To protect those forward drivers from instability, the transitions on vo must be guarded by the reset of Re. This also helps acknowledge the reset of Re during the reset phase of R2.

vo1 ∧ (Ae0 ∧ Ae1 ∨ Re) → vo0⇂
vo0 ∧ (Ae2 ∨ Re ∨ Aic1) → vo1⇂
¬vo1 ∨ (¬Ae0 ∨ ¬Ae1) ∧ ¬Re → vo0↾
¬vo0 ∨ ¬Ae2 ∧ ¬Re ∧ ¬Aic1 → vo1↾

R0 sends on Yo, receives on Ai, and sets vo0. However, it does not always send on Ao. This is covered by Re. If R0 sends on Ao, then Bic0 is high and Re is allowed to reflect Aoe. However, if R0 does not send on Ao, then Bic0 is low and Re is held low. So it is enough to just acknowledge Re. Remember, for the first digit, en0 is connected to Yoe. For the subsequent digits, en0 is simply tied to GND. R1 sends on Yo and Ao and sets vo0. Either it is the first digit and Ai is not acknowledged, but Yoe acknowledges the request on Yo through en1, or it is not the first digit and Ai is acknowledged, resetting Aic1 while en1 is tied to Vdd. R2 sends on Yo, receives on Ai and Bi, and sets vo1. Setting vo1 covers the acknowledgement of Bic0 through Re. Therefore, the reset phase of the forward driver just has to cover Yo, Bic1, and Aic1.

¬en0 ∧ ¬vo1 ∧ ¬Re ∧ ¬Aic0 → R0⇂
(¬en1 ∨ ¬Aic1) ∧ ¬vo1 ∧ ¬Re → R1⇂
¬en0 ∧ ¬vo0 ∧ ¬B1 ∧ ¬Aic1 → R2⇂
¬R0 → Ae0↾   // amplify
¬R1 → Ae1↾   // amplify
¬R2 → Ae2↾   // amplify

Finally, the input enables are reset following the typical WCHB reshuffling.

9.4 Evaluation

Overall, it seems that the vast majority of multiply operations executed on modern hardware are limited to 32 bits. This means that the performance for the length-adaptive digit-serial 64-bit operations would be inflated compared to the standard bit-parallel approaches. Therefore, Fig. 94 shows the overall performance comparison for both 32 and 64 bit multiplier datapaths.

Note that there are three performance numbers for the length-adaptive digit-serial multiplier. This multiplier consists of a collection of processes. The processes implementing the higher-order digits are only active when executing a multiply in which B is long enough.
Therefore, there are three contexts in which this multiplier may be used. If it stands alone in the overall architecture and the higher order digits cannot be allocated to another multiply when not in use, then the multiplier digits are “statically” allocated. In this case, the transistor count for a 64-bit multiply when computing throughput/transistor is simply 8 times the transistor count for a single digit, assuming a 4-bit datapath. However, in the context of a CGRA, the multiply digits are “dynamically” allocated as needed. This means that the transistor count for the throughput/transistor metric is now dependent upon the number of digits in B for a given multiply. Therefore, the transistor count for the dynamically allocated multiplier is averaged over the distribution of the bitwidth of B for the throughput/transistor metric. Third, the number of cycles executed by the allocated processes in the second phase of the multiply is dependent upon the length of B. Therefore, if it is possible to always assign the shorter of the two inputs to B, then some cycles can be avoided, thereby saving energy. This is the “sorted” multiplier.

Fig. 94 shows that the statically allocated length-adaptive digit-serial multiplier is 1.4 times faster per transistor than the fastest bit-parallel multiplier while using 31% less energy for a 32-bit multiply. Meanwhile, the dynamically allocated multiplier is 4.81 times faster per transistor assuming perfect scheduling. Of course, perfect scheduling is not possible. Therefore, the real performance of the dynamically allocated multiplier will be somewhere in between. Finally, the sorted dynamic multiplier is 6.1 times faster per transistor using 43% less energy.
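The averaging used for the dynamically allocated transistor count can be sketched as follows. This is a hedged sketch: P_B is an assumed helper giving the marginal probability of each bitwidth of B from Fig. 95, the 1152 default takes the transistor count from Table 27 and assumes it is per digit unit, and the digit accounting (btok internal digits plus one cap digit) matches the token calculations later in this section.

def dynamic_transistors(P_B, digit_transistors=1152, packet=4, max_bits=64):
    # Expected transistor cost of one multiply under dynamic allocation:
    # digits are only occupied while B is long enough to need them, so
    # the cost is averaged over the bitwidth distribution of B.
    avg = 0.0
    for b in range(1, max_bits + 1):
        btok = int((b + packet - 2) / packet)    # internal tokens in B
        avg += (btok + 1) * digit_transistors * P_B(b)
    return avg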
Two bit-parallel multipliers were synthesized for comparison. The first is a custom Dadda Tree followed by a Manchester Carry Chain Adder. The second is automatically synthesized from Verilog by Synopsys Design Compiler. Both implementations only latch the two inputs, leaving the rest of the multiply unpipelined.

Fig. 94: Performance and energy averaged over the distribution in Fig. 95 vs Transistor Count.

Type                                     Transistors   Frequency   Energy/Op   Latency
Clocked Parallel Dadda Ripple (64-bit)   118122        374 MHz     16.497 pJ
Clocked Parallel Synopsys (64-bit)       130402        272 MHz     80.314 pJ
Table 26. Performance measurements for the bit-parallel multipliers.

The digit-serial multiplier has three conditions. The “internal” condition covers the first two conditions in the behavioral specification, in which the input request on Ai is a non-cap token. The “extend” condition covers the third condition in the behavioral specification, in which the input request on Ai is a cap token but the second half of the output digits are still propagating out of the multiplier. Finally, the “cap” condition covers the final token to be emitted out of a given multiply digit, covering the final condition in the behavioral specification.

Ultimately, the multiplier is quite a bit slower than the other operations in this thesis, generally operating around 1.4 GHz. This will likely be a limiting factor for the throughput of a CGRA. Quite a bit more time should be spent optimizing the datapath to allow for shorter delay lines. Furthermore, this operator tends to require 200-300 fJ per token, which is quite a bit more than any other operation in this thesis. It is likely that this could be optimized through intelligent usage of pass-transistor logic. These per-token measurements are averaged over the input distribution found in Fig. 95.

Keep in mind that the bitwidth of the result of a multiply is the sum of the bitwidths of its inputs. This distribution has two interesting features. First, there are four cutoff points. The first seems to be around 16 bits, which would produce 32-bit results. This is likely from the algorithmic workload required by the program. The second seems to be 26 bits, with a pre-cutoff peak at around 24 bits, which would produce results around 48 bits wide. This matches the width of the memory bus on a 64-bit Intel processor and is therefore attributable to memory address calculations. The third is at 32 bits, which would produce 64-bit results, matching the width of the datapath on a 64-bit Intel processor. The final cutoff point is a bit odd, with one input around 32 bits and the other around 64 bits wide, which would produce 96-bit results. It is unclear what this might be from. The other interesting feature is the strong diagonal representing the multiplication of two numbers that have the same bitwidth. This is likely due to computing the square of a number, which is one of the main operations in the iterative algorithm implementing the pow() function.

Type                         Transistors   Condition   Token Frequency   Energy/Token   Latency
Integrated Serial Adaptive   1152          internal    1.46 GHz          243.48 fJ      561 ps
                                           extend      1.40 GHz          171.37 fJ      607 ps
                                           cap         1.29 GHz          284.91 fJ      550 ps
Table 27. Raw performance measurements for the digit-serial multiply operator.

Two things must be computed from this distribution. The average energy requires the total number of times a given condition was executed throughout the multiplier stack per operation. Meanwhile, the average throughput per operation requires the total number of times a given condition was executed by just the first digit in the multiplier stack.

For the average energy, the distribution in Fig. 95 is iterated over with a and b representing given bitwidths for their respective inputs A and B. p is the probability of that combination of bitwidths. Then, for a given packet size (4 in this case), the number of internal tokens is computed for each input as atok and btok. If the input has one bit (0 or -1), then the cap token will be the only token in the stream. Further bits are then allocated to internal tokens. The “internal” condition of the multiplier is executed once per digit allocated by a token in B per internal token in A, or atok*(btok+1) times. The “extend” condition is executed in one of the digits once per each digit of greater significance in the multiplier. This means that if there are two digits, then the top digit will execute it zero times and the bottom digit one time. Pick's Theorem is used to compute the number of discrete points in the resulting right triangle with side length btok. This comes out to int((btok*btok+btok)/2). Finally, the “cap” condition is executed once per allocated digit, or btok+1 times.

# The bitwidth helpers and the probability P(...) are taken from the
# measured distribution in Fig. 95; packet is the digit size in bits.
u = {'internal': 0, 'extend': 0, 'cap': 0}
for a in range(1, max_bitwidth(A)+1):
    for b in range(1, max_bitwidth(B)+1):
        p = P(bitwidth(A) == a and bitwidth(B) == b)
        atok = int((a+packet-2)/packet)
        btok = int((b+packet-2)/packet)
        u['internal'] += atok*(btok+1)*p
        u['extend'] += int((btok*btok+btok)/2)*p
        u['cap'] += (btok+1)*p

For the distribution in Fig. 95, this yields the token counts in Table 28, with each condition being executed 3 to 4 times for a total of 11.273 cycles.

Fig. 95: Probability distribution for the bitwidth of the left operand A and right operand B for multiplication.
Condition      Average Cycles/Stream
internal       4.191
extend         4.035
cap            3.047
Total Cycles   11.273
Table 28. Utilization of each condition for a multiply.

For throughput per transistor, the distribution is iterated over like before. The first digit in the multiplier executes the “internal” condition once per internal token in A, or atok times. The “extend” condition is executed once per digit of greater significance, or btok times, and the “cap” condition is executed once per multiply.

u = {'internal': 0, 'extend': 0, 'cap': 0}
for a in range(1, max_bitwidth(A)+1):
    for b in range(1, max_bitwidth(B)+1):
        p = P(bitwidth(A) == a and bitwidth(B) == b)
        atok = int((a+packet-2)/packet)
        btok = int((b+packet-2)/packet)
        u['internal'] += atok*p
        u['extend'] += btok*p
        u['cap'] += p

For the distribution in Fig. 95, this yields the token counts in Table 29, with each condition being executed 1 to 2 times for a total of 4.174 cycles.

Condition      Average Cycles/Stream
internal       1.127
extend         2.047
cap            1.000
Total Cycles   4.174
Table 29. Utilization of each condition for the least significant digit of the multiply circuit.

These token calculations are expanded to various maximum bitwidths, including 4, 8, 16, 32, and 64 bits, by truncating the distribution as needed. This allows for a comparison of the length-adaptive digit-serial multiplier against bit-parallel multipliers at varying bitwidths in Fig. 96. This shows that the digit-serial multiplier surpasses the best bit-parallel multiplier in both metrics shortly before 32 bits.

Fig. 96: Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth.

Like the addition operator, the multiplication operator can introduce some redundant tokens into the encoding of the result, as shown in Fig. 97. These tokens come from the addition operations that are internal to the multiplier.

Fig. 97: Probability distribution for the number of redundant bits introduced per operation by the multiplier.

CHAPTER 10
EXAMPLE ARRAY ARCHITECTURE

The operators designed in this thesis show dramatic improvements against industry standard architectures and synthesis methods. While the implementation of each operator may be complicated, the final circuitry is elegant and robust with simple plug-and-play interfaces. They intelligently avoid unnecessary work, saving significant amounts of time and energy in each operation. Overall, this thesis has completed the most difficult tasks necessary for length-adaptive digit-serial computation. These operators form an underlying framework on which many highly performant architectures may be constructed.

As an example of such an architecture, this final chapter implements the Arithmetic Cube [266][268]. This architecture was designed in 1987 as a systolic digit-serial accelerator for the Discrete Fourier Transform and other related operations. While this is not a fully configurable CGRA, it is an array architecture that has some amount of configurability and showcases the behaviors of the presented circuitry and its benefits.

Overall, the architecture implements the operation found in Fig. 98. This operation multiplies the signal matrix X by a weight matrix B with values of only -1, 0, or 1. Then, it does an element-wise multiply with the scaling matrix H, as signified by the ⊛ operator. For more information about how this is derived and used to implement the Discrete Fourier Transform, see [266].
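The operation in Fig. 98 can be pinned down with a small functional model before walking through the hardware; arithmetic_cube is an illustrative name, and plain lists stand in for the matrices.

def arithmetic_cube(X, B, H):
    # Z = H ⊛ (B·X): multiply the n1 x n0 signal matrix X by the
    # n2 x n1 weight matrix B (entries -1, 0, or 1), then element-wise
    # multiply by the n2 x n0 scaling matrix H.
    n1, n0, n2 = len(X), len(X[0]), len(B)
    return [[H[k][i] * sum(B[k][j] * X[j][i] for j in range(n1))
             for i in range(n0)]
            for k in range(n2)]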
The Arithmetic Cube architecture, as seen in Fig. 99, was chosen because it allows for a drop-in replacement of the arithmetic operators with the operators developed in this thesis. It consists of a two dimensional array of adders/subtracters followed by a row of multipliers. Each adder/subtracter is configured using a two-bit element from B. The bottom row of adders is sent the first column of B and the left column of adders is sent the first row of B, etc. Given j in 0 ≤ j < n1 and k in 0 ≤ k < n2, if Bk,j is 1, then the node is configured as an adder. If Bk,j is -1, then the node is configured as a subtracter. If Bk,j is 0, then the node is bypassed in both directions. Overall, each node receives a row of elements Xj,* on the left and a series of partials P from above. For i in 0 ≤ i < n0, it forwards Xj,i to the right and Bk,j*Xj,i+Pi down. Each multiplier receives a row of bit-parallel elements Hk,* and multiplies each element Hk,i with the partial received from the adder/subtracter array, Pi. Therefore, Zk,i = Hk,i*sum(Bk,j*Xj,i for 0 ≤ j < n1).

Fig. 98: Operation implemented by the arithmetic cube.
Fig. 99: Architecture of the Arithmetic Cube.

10.1 Sum Process

The vast majority of the functionality required by the “sum” process in Fig. 99 is already implemented by the adder/subtracter unit described in Chapter 7, with Xj,i routed to B, Pi to A, and S to the next row's Pi. Ultimately, the most significant bit of Bk,j would be connected directly to cfg. When Bk,j is -1, then the most significant bit, and therefore cfg, is 1, configuring the unit to subtract Xj,i from Pi. When Bk,j is 1, then the most significant bit, and therefore cfg, is 0, configuring the unit to add Xj,i to Pi.

There are two functionalities that remain to be implemented. The first requires Xj,i, connected to B, to be forwarded to the right. This can be done by adding a simple buffer and re-using the p-latches that are already latching the data from B. This modifies the adder/subtracter unit very slightly, as shown in Fig. 100.

Boe ∧ Bc0 → Boc0↾
Boe ∧ Bc1 → Boc1↾
(Boc0 ∨ Boc1) ∧ (Sd0 ∧ Bx0 ∨ Sd1) → Be⇂
¬Boe ∧ ¬Bc0 → Boc0⇂
¬Boe ∧ ¬Bc1 → Boc1⇂
¬Boc0 ∧ ¬Boc1 ∧ (¬Sd0 ∨ ¬Bx0) ∧ ¬Sd1 → Be↾

The second requires the unit to be bypassed when the least significant bit of Bk,j is 0. This is simply done with a few multiplexers. The final circuit is shown in Fig. 101.

Fig. 100: Forwarding B.
Fig. 101: Sum process with bypassing multiplexers.

10.2 Mul Process

The “mul” process is entirely implemented by the multiplier discussed in Chapter 9, with Hk,i connected to B, Pi to A, and Zk,i to S. However, for the Arithmetic Cube architecture the scaling matrix is optional. Therefore, the “mul” process will also need multiplexers to bypass the unit. This can be seen in Fig. 102.

Fig. 102: Mul process with bypassing multiplexers.

10.3 Evaluation

This evaluation will demonstrate how data flows through this architecture. Fig. 103 shows the first few steps of a simulation of the first column in the Arithmetic Cube. On the left, each channel is labelled and the configuration information is given. During the simulation, the value of the internal memory is shown in each box. For the sum units, the internal memory represents the one-bit carry-in for the next cycle. For the multiplier units, it represents the carry digit for the next cycle and the initial carry for the booth encoder. In this simulation, B0,1 and B0,2 are both 0, bypassing those sum units entirely. Meanwhile, B0,0 and B0,3 are both 1, adding their respective X to the running total.

Fig. 103: Walk through of a simulation of channels along the first column of the Arithmetic Cube.

X[3] receives the hexadecimal value 07C3DBD3D, or 2,084,420,925; X[0] receives the hexadecimal value 02367, or 9,063; and H[0] receives the hexadecimal value F6B2, or -2,382. Overall, the value computed is (2,084,420,925+9,063)*-2,382 = -4,965,112,231,416. This comes out to the hexadecimal value FFB7BF83FCA08.
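The end-to-end arithmetic of this walkthrough is easy to verify; the check below reads H[0] as a 16-bit two's complement value and the 13-hex-digit result as a 52-bit two's complement value.

x3 = 0x7C3DBD3D            # X[3]: 2,084,420,925
x0 = 0x2367                # X[0]: 9,063
h0 = 0xF6B2 - 0x10000      # H[0] as 16-bit two's complement: -2,382
z = (x3 + x0) * h0
assert z == -4_965_112_231_416
assert z & ((1 << 52) - 1) == 0xFFB7BF83FCA08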
The carry digits in the multiplier are initialized per the booth encoding.

1. The parallel input for the first digit is 2, and the last bit from the previous digit is 0 by definition. This is lower than the threshold of 8 on the booth encoder. Following Table 25, the multiplier is 2*A and the initial carry is 0.
2. For the second digit, the parallel input is B and the last bit from the previous digit is 0. Because the parallel input is greater than the threshold, the multiplier is (¬B+¬0)*¬A=(4+1)*¬A=5*¬A. It follows that the initial carry is 5 to correctly implement the negation.
3. The parallel input for the third digit is 6 and the last bit from the previous digit is 1. Following the table, the multiplier is (6+1)*A=7*A and is implemented as 8*A + ¬A. To properly implement the negation, the initial carry is set to 1.
4. The final digit is F and the last bit of the previous digit is 0. F is greater than the threshold, so the multiplier is (¬F+¬0)*¬A=¬A, making the initial carry 1.
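These four initializations line up with the booth recoding from Chapter 9. Running the hypothetical booth_digit sketch from Section 9.2 over the digits of H[0] = F6B2 (least significant first) reproduces the signed multiples used here.

digits = [0x2, 0xB, 0x6, 0xF]          # H[0] = 0xF6B2, LSD first
prev = 0
for d in digits:
    print(booth_digit(d, prev))        # (1, 2), (-1, 5), (1, 7), (-1, 1)
    prev = (d >> 3) & 1                # carry the digit's MSB forward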
The simulation continues from this initial state as follows.

1. In the first step, the least significant digit of each input arrives on the associated channel. This is D for X[3], 7 for X[0], and all the values of H[0] in parallel. In the next step, the top sum unit has received all of its inputs and may therefore execute 0+D+0=D. This result bypasses the middle two sum units through the multiplexers to arrive at the input of the bottom sum unit. Meanwhile, the top sum unit enters its reset phase, acknowledging the left input and sign-extending the top input.
2. In the second step, the bottom sum unit has received all of its inputs and may execute D+7+0=14. This forwards 4 on the output and stores the carry in memory for the next cycle. Meanwhile, the top sum unit has exited the reset phase and is now receiving new inputs.
3. In the third step, the top sum unit has received a new token on its left input, allowing it to execute 0+3+0=3. Once again, this result bypasses the middle two sum units and arrives as an input to the bottom sum unit. Meanwhile, the multiplier has just received its first serial input token. The least significant digit in the multiplier now computes the booth encoded result 2*4+0+0=8, emitting 8 on the output and storing 0 in the internal memory. This puts that first multiplier unit in its reset phase.
4. In the fourth step, the bottom sum unit has received its second pair of input tokens, allowing it to execute 3+6+1=A. This is forwarded to the multiplier, putting the sum unit in its reset phase. Meanwhile, the first token has made its way to the second least significant digit in the multiplier, allowing it to compute the booth encoded 5*¬4+5+0=5*B+5+0=3C. C is returned to the least significant digit and 3 is stored in the digit carry.
5. In the fifth step, the top sum unit receives new inputs and computes 0+D+0=D. The bottom sum unit is in its reset phase. The least significant digit of the multiplier receives inputs and computes 2*A+0+C=20. The second multiplier digit is in its reset phase, and the third multiplier digit computes 8*4+¬4+1+0=8*4+B+1+0=2C. C is returned to the second multiplier digit and 2 is stored in the carry.
6. In the sixth step, the top sum unit is in its reset phase, and the bottom sum unit receives inputs and computes D+3+0=10. This forwards 0 to the multiplier and stores 1 in the carry. The first multiplier digit is in its reset phase, the second multiplier digit receives inputs and computes 5*¬A+3+C=5*5+3+C=28, the third multiplier digit is in its reset phase, and the final multiplier digit receives inputs and computes ¬4+1=B+1=C.

This is continued in the spice simulation of the full 4x4 Arithmetic Cube. The relevant part of this simulation is shown in Fig. 104. As previously discussed, each channel has three parts: the requests c0 (blue) and c1 (red), which signify not-cap or cap respectively; the enable e; and the bundled data d[0:4] (blue, red, yellow, green).

Fig. 104: Waveform (left) of channels along the first column (right) of the Arithmetic Cube.

There are a few things to notice here. First and foremost, the power consumption of the whole 4x4 array is fairly smooth, never going above 12 mW. While the multiplier is in its first phase, the adder network is allowed to compute. Meanwhile, when the multiplier is in its second phase, the adder network is stalled. This is reflected in the power by a slow increase and decrease in power usage.

Second, the operating frequency of a linear self-timed pipeline is limited by the slowest process. For this array, the slowest process is the multiplier, operating around 1.4 GHz. Normally, the differences in pipeline length caused by the bypassed nodes in the array would slow down the pipeline. Often, slack matching is employed to counter this problem. However, in this case slack matching is not necessary because the sum units in the network operate at around 2 GHz, which is 0.6 GHz faster than the multiplier. This gives enough slack that the array's operating frequency is not slowed beyond that of the multipliers.

Third, like most self-timed circuits, there is no single number that can be used to report performance or power consumption. Both of these metrics depend heavily on the operation being performed, and no single signal keeps a constant frequency. The closest signal to sample for operating frequency tends to be the enable. However, for conditional communication the enable signals can stall for significant amounts of time. Furthermore, if the multipliers were disabled, the array would then run at the operating frequency of the adders, which is around 2 GHz, and would burn significantly less power. This means that for larger circuits, the only way to really compare them is to check their average throughput and energy across a large testbench.

CHAPTER 11
CONCLUSION

This thesis proposed, implemented, and evaluated length-adaptive digit-serial arithmetic to optimize capacity, throughput, and energy consumption for coarse-grained reconfigurable arrays in the context of general compute. General compute workloads were analyzed to determine the viability of such architectures and identify avenues for optimization.
Novel micro-architectural optimizations for QDI process design were discussed in great detail, along with novel methods for integrated QDI/BD design. These methods were then used throughout this thesis to implement efficient arithmetic and supporting circuitry.

As summarized in Table 30, the length-adaptive digit-serial arithmetic circuitry developed in this thesis successfully demonstrates many of the hypothesized benefits outlined at the end of Chapter 2. Every operator reduces energy consumption by a factor of two or more compared to their clocked counterparts. While clocked circuits save some energy when the input values don't change, the majority of energy overhead in a clocked system ultimately comes from driving the clock to toggle the flops in each pipeline stage, and the circuits developed in this thesis specifically seek to optimize that energy expenditure. Furthermore, in larger system contexts like a CGRA, the routing network can introduce a significant overhead. These circuits will send less data over the network, saving even more energy beyond the metrics listed in Table 30.

The adder/subtracter and multiplier naturally increase throughput per transistor metrics compared to their clocked counterparts by significant factors. For the multiplier, this largely depends upon the scheduling efficiency of the dynamic allocation of multiplication resources in a CGRA. However, even without that dynamic allocation, compute density is increased by a factor of 1.4x compared to 32-bit multiplication operations. This means that while dynamic allocation in hardware is an exciting opportunity, it is not necessary to get compute density improvements.

At first glance, the other operators reduce compute density compared to their clocked counterparts. However, when making this comparison, one must keep in mind that the operating frequency of those operators in a clocked system is ultimately limited by the worst case. For modern processors, this is somewhere between 1 and 4 GHz. Therefore, many of the throughput numbers for the clocked comparison points are grossly inflated beyond what they would be in a larger system context, including 5 GHz for the counters, 10 GHz for the bitwise operators, 5.56 GHz for the comparison operator, and 3.34 GHz for the shift operators. The same is not true for the circuits presented in this thesis because self-timed circuits, by definition, aren't synchronized to any one signal.

Operator                          Throughput per Transistor    Energy per Operation
                                  32-bit       64-bit          32-bit        64-bit
Counters                                                       0.89          0.49
Add/Subtract (vs Han & Carlson)   1.4          2.8             0.61          0.31
Multiply (vs Dadda Ripple)        1.4 - 4.8    5.4 - 33.8      0.69          0.16
Bitwise                           0.56         0.90            0.57          0.35
Shift (vs Custom)                 0.34         0.44            0.84          0.46
Compare (vs Gate/Clk Tree)        0.12 - 1.3   0.25 - 2.8      16.4 - 0.45   9.75 - 0.23
Table 30. Comparison of average performance and energy for all operators against their closest clocked counterparts. The multiplier throughput numbers depend on the scheduling efficiency of the dynamic approach, and the compare numbers depend upon the clock overhead associated with the clocked operator comparison point.

Because these operators are digit-serial, the implementation of a full arithmetic-logic unit would require between 1000 and 2000 transistors with logic sharing. On a standard modern processor chip with a three billion transistor budget, this could mean up to a million execution nodes.
This is orders of magnitude beyond the capacity of industry standard bit-parallel CGRAs and well beyond the couple hundred thousand static instructions necessary to execute any program in the SPEC2006 benchmark. Industry standard digit-serial CGRAs have similar capacities, but sacrifice configurability and flexibility due to the lack of control flow and stream length management. Instead, they are limited to accelerating specific problems with simple systolic architectures in which data flow patterns are carefully baked into the architecture at design time. The circuits presented in this thesis allow for the configurability and flexibility often found in bit-parallel CGRAs while supporting capacities often found in digit-serial CGRAs.

For clocked systems, the next instruction must always wait for the next clock cycle to begin execution as long as the two instructions are in separate pipeline stages. This is not the case for self-timed systems. Instead, self-timed systems have a forward latency: the amount of time necessary to do a full digit computation for the first digit. Note that this value is independent of operating frequency. Because the next operation can begin computation as soon as it receives the first digit, there is a difference between the operation throughput of the circuit and a CGRA's effective throughput of sequential operations, as shown in Table 31. Furthermore, when comparing the effective sequential execution throughput, take note once again that the numbers for the clocked comparison points are inflated compared to the throughput of those operators in larger system contexts.

Operator                          Forward Latency   Effective Sequential Execution Throughput   Improvement
Add/Subtract (vs Han & Carlson)   172 ps            5.81 GHz                                    1.50
Multiply (vs Dadda Ripple)        561 ps            1.78 GHz                                    4.76
Bitwise                           94 ps             10.64 GHz                                   1.06
Shift (vs Custom)                 152 ps            6.58 GHz                                    1.97
Compare (vs Gate/Clk Tree)        538 ps            1.86 GHz                                    0.33
Table 31. Comparison of effective sequential execution throughput of all operators against their closest clocked counterparts. The comparison operator latency is an average of 444 ps from not-equal comparisons 73% of the time and 791 ps from the other comparisons the other 27% of the time.
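The throughput column of Table 31 is simply the reciprocal of the forward latency, and the compare latency is a workload-weighted average; a quick check, with the percentages and latencies taken from the table and its caption:

# Effective sequential throughput is the reciprocal of forward latency.
for ps, ghz in [(172, 5.81), (561, 1.78), (94, 10.64), (152, 6.58), (538, 1.86)]:
    assert abs(1000.0 / ps - ghz) < 0.01

# The compare latency averages the not-equal and other comparison cases.
assert round(0.73 * 444 + 0.27 * 791) == 538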
Finally, a memory designed for length-adaptive digit-serial systems will require 8 to 16 more read and write ports. However, those ports will have a much smaller bitwidth to match the digit-serial datapath and will need to be fed less often. This will ultimately introduce some overhead into the memory system. However, due to length-adaptivity, those buses would be able to source a higher throughput compared to a single bit-parallel bus by communicating less data overall. And given that modern memories have a large array of banks, an intelligent memory scheduling algorithm should be able to mitigate the possible contention issues. Memories are already addressed per-byte, and it is unlikely that would need to change, since the overhead associated with reading an extra token from the bank is ultimately not too large.

Even with all of these benefits, there is quite a bit of work to do.

1. Further workload characterization can be done to identify common instruction groupings and allow for tighter micro-op fusion within each execution node. This would allow multiple operations to be programmed to a single execution node.
2. The serial to parallel units can be redesigned to remove unnecessary pipelining. This would save significant amounts of energy for the shift and multiply operations when loading the second operand.
3. A constant-time buffer could allow for automatic slack matching. This would solve any and all deadlock problems in the CGRA network and enable the implementation of a length-adaptive digit-serial division circuit.
4. A skip condition could be added to the AND and OR operators. This would eliminate any redundant bits generated by the current implementation and nearly eliminate the need for stream compression.
5. MSB-first floating point digit-serial logic could be explored. This represents a significant opportunity, particularly for dealing with comparison operators and division.
6. Bundled-data circuitry should be explored using the other reshufflings, particularly PCHB. This might allow the delay lines from two stages to overlap each other, providing for a potential speedup.
7. Alternative methods for signal delay should be explored, particularly because configuring an array of delay lines will end up being fairly expensive.
8. Micropipelining can be used to split the datapath computation in two and distribute those two halves over the two phases of the QDI handshake. This would allow for the use of symmetric delay lines, removing the overhead of the asymmetric delay line implementation. This would also allow the delay lines to be shorter, increasing the overall throughput.

These kinds of optimizations could be made easier with sufficient tooling support. In particular, a systematic application of the outlined synthesis procedure would allow for faster iteration of designs and therefore quicker optimization. Verification tools targeting the integrated QDI/bundled-data design style could make it easier to resolve various clocking problems. And a more intuitive approach for visualizing the interacting event cycles in a QDI process could make it easier to identify potential optimization steps along the way.

Eventually, these operators should be built into a comprehensive CGRA architecture for general compute to explore the remaining hypotheses from Chapter 2. In particular, the process of communicating data into and out of the array is currently inefficient. How should memory be structured for such an architecture? How should memory be addressed for digit-serial processing? These questions will require a significant amount of time to answer.

While the design process is complex, the resulting circuitry is robust, fast, and extremely power efficient. These circuits present a new opportunity for highly efficient array architectures, and this thesis opens the door to many opportunities for further research.

APPENDIX A
CHP NOTATION

Communicating Hardware Processes (CHP) is a hardware description language used to describe clockless circuits, derived from C.A.R. Hoare's Communicating Sequential Processes (CSP) [61]. A full description of CHP and its semantics can be found in [65]. Below is an informal description of that notation, listed top to bottom in descending precedence.

Dataless vs Datafull: Dataless expressions operate on node voltages while datafull expressions operate on delay-insensitive encodings. Mixed expressions implicitly cast the datafull expression to dataless using the encoding's validity. Specifically, for a datafull expression e, its positive sense e is cast to a validity check while its negative sense ¬e is cast to a neutrality check. null is defined to be a neutral state of an encoding.

A channel X consists of a request Xr and either an acknowledge Xa or enable Xe. The acknowledge and enable serve the same purpose, but have inverted sense.
With these signals, a channel implements a network protocol to transmit data from one QDI process to another.

• Skip: skip does nothing and continues to the next command.
• Dataless Assignment: n↾ sets the voltage of the node n to Vdd and n⇂ sets it to GND.
• Assignment: v := E waits until the datafull expression E is valid, then assigns that value to the variable v.
• Send: X!E waits until the datafull expression E has a valid value, then sends that value across the channel X. Ultimately, a send is expanded into a handshake on its underlying signals. The standard four-phase send on channel X is Xr := E; [Xa]; Xr := null; [¬Xa] for an acknowledge channel or Xr := E; [¬Xe]; Xr := null; [Xe] for an enable channel.
• Receive: X?v waits until there is a valid value on the channel X, then assigns that value to the variable v. Ultimately, a receive is expanded into a handshake on its underlying signals. The standard four-phase receive on channel X is v := Xr; Xa↾; [¬Xr]; Xa⇂ for an acknowledge channel or v := Xr; Xe⇂; [¬Xr]; Xe↾ for an enable channel.
• Dataless Channel Action: If X is a dataless channel, then a send with an acknowledge channel is indistinguishable from a receive with an enable channel, and a send with an enable channel is indistinguishable from a receive with an acknowledge channel. Therefore, we can simplify the syntax for the dataless send X! or receive X? to X.
• Partial Send: X := E executes only the first statement in the protocol of a channel send, Xr := E, on channel X without executing the remaining protocol. The remaining protocol may then be executed by calling the send without providing data, X!.
• Probe: X? is used to determine if the channel is ready for a receive action, returning the value waiting on the request Xr without executing the receive. X! is used to determine if the channel is ready for a send action, expanding into either ¬Xa given an acknowledge or Xe given an enable. For dataless channels, the syntax is simplified to X.
• Simultaneous Composition: S • T executes the programs S and T at the same time.
• Internal Parallel Composition: S, T executes the programs S and T in any order.
• Sequential Composition: S; T executes the program S followed by T.
• Parallel Composition: S ∥ T executes the programs S and T in any order.
• Deterministic Selection: [G1 → S1 ▯ … ▯ Gn → Sn], where Gi, called a guard, is a dataless expression and Si is a program. The selection waits until one of the guards, Gi, evaluates to Vdd, then executes the corresponding program, Si. The guards must be stable and mutually exclusive. The notation [G] is shorthand for [G → skip].
• Non-Deterministic Selection: [G1 → S1 | … | Gn → Sn] is the same as deterministic selection except that the guards don't have to be stable or mutually exclusive. If two or more evaluate to Vdd simultaneously, then one is picked arbitrarily (not necessarily randomly).
• Repetition: ∗[G1 → S1 ▯ … ▯ Gn → Sn] or ∗[G1 → S1 | … | Gn → Sn] is similar to the selection statements. However, the action is repeated until no guard evaluates to Vdd. ∗[S] is shorthand for ∗[Vdd → S].

A.1 Examples

This is a Buffer. It implements a single pipeline stage. For every cycle, it reads the value v from the channel L and sends it on channel R.

∗[ L?v; R!v ]

This is a Conditional Split. It receives a control token from the channel C and a data value from channel L every cycle. The value from C determines on which of channels R0 or R1 the received data value is sent. If c = 0, then v is sent on channel R0.
If c = 1, then v is sent on channel R1.

∗[ C?c; L?v; [ c=0 → R0!v ▯ c=1 → R1!v ] ]

This is a Conditional Merge. Like the conditional split, this receives a control token from the channel C every cycle. This control token determines from which of channels L0 or L1 to receive the data value v. v is then forwarded on the channel R.

∗[ C?c; [ c=0 → L0?v ▯ c=1 → L1?v ]; R!v ]

This is a Non-deterministic Merge. If a token arrives on channel L0 before L1, then the data value v is read from L0. If it arrives on channel L1 first, then v is read from L1. If tokens arrive on both channels simultaneously, then one is selected arbitrarily. v is then sent on channel R.

∗[[ L0 → L0?v | L1 → L1?v ]; R!v ]

This is a Full Adder. It reads data from channels A, B, and Ci and sends the resulting sum on channel S and carry out on channel Co.

∗[ A?a, B?b, Ci?c; s := a+b+c; S!s0, Co!s1 ]

APPENDIX B
PRS NOTATION

In a Production Rule Set (PRS), a production rule is a compact way to specify a single pull-up or pull-down network in a circuit. An alias a = b aliases two names to one circuit node. A rule G → A represents a guarded action where G is a guard and A is a dataless assignment, as described in Appendix A. A gate is made up of multiple rules that describe the up and down assignments. The guard of each rule in a gate represents a part of the pull-up or pull-down network of that gate, depending upon the corresponding assignment. If the rules of a gate do not cover all conditions, then the gate is state-holding with a staticizer. For such a gate driving a node X, the internal node before the staticizer is referenced as _X. Finally, given a source S, a pass gate is specified with @S ∧ G → A or ¬@S ∧ G → A, depending upon the assignment.

B.1 Examples

Fig. 105 shows two examples of gates expressed by production rules.

Fig. 105: Resulting gate for asymmetric C-element (left) and pass-transistor XOR (right).

Asymmetric C-Element
(¬A ∨ ¬B) ∧ ¬C → S⇂
B ∧ (A ∨ D) → S↾

Pass-Transistor XOR
@B1 ∧ ¬A1 ∨ @B0 ∧ ¬A0 → S↾
¬@B1 ∧ A0 ∨ ¬@B0 ∧ A1 → S⇂

REFERENCES

R.1 History

[1] Raúl Rojas and Ulf Hashagen. “Reconstruction of the Atanasoff-Berry Computer.” MIT Press, 2002.
[2] David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. “Parallel Computer Architecture: A Hardware/Software Approach.” Gulf Professional Publishing, 1999, Pages 15-16.
[3] Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. “Quantitative analysis of culture using millions of digitized books.” Science, 2011.
[4] Stephen Dolan. “mov is Turing-complete.” Personal Publication, 2013.
[5] Chris Domas. “M/o/Vfuscator2.” Github, June 2015.
[6] Patrice Roussel, et al. “Method and apparatus for staggering execution of a single packed data instruction using the same circuit.” U.S. Patent No. 6,230,257. 8 May 2001.
[7] Ronny Ronen, Alexander Peleg, and Nathaniel Hoffman. “System and method for fusing instructions.” U.S. Patent No. 6,675,376. 6 Jan. 2004.
[8] Patrice Roussel, et al. “Method and apparatus for staggering execution of an instruction.” U.S. Patent No. 6,425,073. 23 Jul. 2002.
[9] Intel. “Energy-Efficient, High Performing and Stylish Intel-Based Computers to Come with Intel® Core™ Microarchitecture.” Intel Developer Forum, San Francisco CA, March 2006. (mirror)