SELF-TIMED LENGTH-ADAPTIVE ARITHMETIC

A Thesis Presented to the Faculty of the Graduate School of Cornell University In Partial Fulfillment of the Requirements for the Degree of Doctor of Engineering

by Edward Arthur Bingham
December 2020

© 2020 Edward Arthur Bingham
ALL RIGHTS RESERVED

SELF-TIMED LENGTH-ADAPTIVE ARITHMETIC
Edward Arthur Bingham, Ph.D.
Cornell University, 2020

Diminishing returns in technology scaling have motivated a resurgence of exploration into new computer architectures. While Coarse Grained Reconfigurable Arrays show promise in accelerating commonly used complex operations, their overall capacity remains fairly limited. And while there is pressure on general purpose systems to support wide operations, the typical workload mostly exercises the lower 10 to 15 bits, leaving most of the array powered on but unused during normal operation. This thesis presents adaptive digit-serial arithmetic as a plug-and-play method to support a variety of bitwidth requirements, showing decreased energy and area alongside increased throughput.

BIOGRAPHICAL SKETCH

Ned Bingham is a Computer Engineering Ph.D. specializing in self-timed circuits. He received his B.S. (2013), M.S. (2017), and Ph.D. (2020) from Cornell with significant time spent at Yale. During his Master's, he designed a set of tools for working with self-timed systems using a control-flow specification called Handshaking Expansions. During his Ph.D., he researched self-timed systems as a method of leveraging average workload characteristics in reconfigurable architectures for general compute. Between his studies, he has worked at Intel on Pre-Silicon Validation (Hudson MA, 2011, 2012), at Qualcomm on arithmetic architectures (San Diego CA, 2014), at Google on self-timed systems (Mountain View CA, 2016), and at Google, again, on Hangouts Chat (Sunnyvale CA, 2017). In his free time, he maintains a variety of interests in the field, working on Compilers, Computer Graphics, and Natural Language Processing. (www.nedbingham.com)

This thesis represents a relatively small selection from 7 long years of work, none of which would have been possible without the steadfast anchor of love, support, and companionship. Through the excitement, abjection, stress, and uncertainty, you have been there to keep me going. I dedicate this thesis to you, Analeigha Olivia Ortega.

ACKNOWLEDGEMENTS

I'd like to acknowledge the diligent work and support from my advisor, Rajit Manohar, toward my completion of this program, along with the invaluable feedback from my committee as a whole: Rajit Manohar, Christopher Batten, and Zhiru Zhang. I'd further like to acknowledge my wife, Analeigha Olivia Ortega, my parents, Lisa Blomgren Amsler and Geoffrey Bingham, and my brother, Daniel Bingham, for their love and support. No one is self-made, and I wouldn't be the person I am today without my mentors, my family, and my friends.

TABLE OF CONTENTS

Preface 19
1. Technology, Architecture, and the Clock 20
   1. Process Technology 20
   2. Accelerator Architectures 22
   3. Digit-Serial Arithmetic 25
   4. Self-Timed Circuits 26
   5. Contributions 27
   6. Collaboration, Previous Publications, and Funding 29
2. Workload Characterization 30
   1. Parallelism 31
   2. Locality 37
   3. Memory and Bandwidth 40
   4. Bitwidth 45
   5. Lessons Learned 48
3. Design Methodology 50
   1. Digit-Serial Adaptivity 50
   2. Circuit Topology 51
   3. QDI Control Circuits 52
   4. Synthesis Strategy 56
   5. Microarchitectural Optimizations 57
   6. Half-Cycle Timing Assumption 68
   7. QDI Treatment for Pass Transistor Logic 69
   8. Example: Single Bit Register 75
   9. Integrated QDI/BD Circuits 79
   10. Toolset and Circuit Evaluation 86
4. Counters 90
   1. Behavioral Specification 90
   2. Increment and Decrement 95
   3. Clear 99
   4. Read 101
   5. Write 106
   6. Evaluation 115
5. Stream Manipulation 118
   1. Sign Extension 118
   2. Sign Compress by N 122
   3. Sign Compress by One 127
   4. Serial to Parallel 135
   5. Evaluation 141
6. Bitwise Operations 154
   1. AND, OR, XOR 154
   2. Shift Left 154
   3. Shift Right 160
   4. Evaluation 167
7. Addition and Subtraction 177
   1. Addition 177
   2. Subtraction 181
   3. Evaluation 181
8. Comparison and Conditionals 188
   1. Compare to Zero 188
   2. Conditional Sink 191
   3. Evaluation 191
9. Multiplication 197
   1. Behavioral Specification 198
   2. Datapath 207
   3. Control 210
   4. Evaluation 214
10. Example Array Architecture 220
   1. Sum Process 221
   2. Mul Process 222
   3. Evaluation 222
11. Conclusion 227
A. CHP Notation 231
   1. Examples 232
B. PRS Notation 234
   1. Examples 234
References 235
   1. History 235
   2. Process Technology 235
   3. Processor Performance 236
   4. Program Workload Analysis 236
   5. Asynchronous Design 238
   6. Counters 240
   7. Adders 241
   8. Multipliers 241
   9. Computer Arithmetic 243
   10. Architecture Surveys 244
   11. Micro Processors 244
   12. Reconfigurable Arrays 250

LIST OF FIGURES

1. History of the clock frequency of Intel's processors. 21
2. History of the power density in Intel's processors. Frequency, Thermal Design Point (TDP), and Die Area were scraped for all Intel processors. Frequency and TDP/Die Area were then averaged over all processors in each technology. Switching Energy was roughly estimated from [19] and [14] and combined with Frequency and Die Area to compute Power Density. 21
3. Wire and Gate Delay across process technology nodes. These were roughly estimated from [14] and [21]. 21
4. History of SpecINT base mean, with benchmarks scaled appropriately [26]. 22
5. History of Intel process technology defect density. Intel's defect density trends were very roughly estimated from [12][13][14][15][16] and [17]. 22
6. History of transistor count in Intel chips. Transistor density was averaged over all Intel processors developed in each technology. 22
7. Instruction address locality in Spec Benchmarks. 24
8. Average bitwidth distribution in Spec Benchmarks. 24
9. Example dependency graph. 33
10. Trace of available parallelism for perlbench.0. 35
11. Distribution of available parallelism with respect to cycles in the Spec benchmark programs. 35
12. Distribution of available parallelism with respect to operations in the Spec benchmark programs. 35
13. Attainable speedup in the Spec benchmark programs. 37
14. Distribution of workloads for runs of cycles. 37
15. Reported application speedup with GPU. bzip2[46], wrf[47], h264ref and lbm[48], dealII[49], gromacs[50], GemsFDTD[51], zeusmp[52], namd[53], bwaves[54], gamess[55], mcf[56], astar[57], milc[58], libquantum[59], hmmer[60]. 38
16. Microarchitectural breakdown of energy usage in an out of order superscalar processor [44]. 39
17. Microarchitectural breakdown of energy usage in an embedded RISC processor [45]. 39
18. Instruction address locality in Spec Benchmarks. 39
19. Distribution of maximum local correlation of parallelism within 100 cycles. 41
20. Distribution of instruction usage as a percent of total execution averaged over all programs. 41
21. Distribution of instruction usage categories as a percent of total execution for all programs in Spec2006. 41
22. Trace of memory requirements for perlbench.0. 43
23. Distribution of total storage requirements for each program. 43
24. Distribution of Storage Reuse Distance for an average cycle in each program. 43
25. Distribution of Read Bandwidth over all of the cycles in each program. 44
26. Read Reuse Distance for an average cycle in each program. 44
27. Read Lifetime Distance for an average cycle in each program. 45
28. Distribution of Write Bandwidth over all of the cycles in each program. 45
29. Write Reuse Distance for an average cycle in each program. 45
30. Write Lifetime Distance for an average cycle in each program. 46
31. Distribution of ALU operation usage across all programs. 47
32. Average bitwidth distribution of integer operations across all programs. 47
33. Average bitwidth of integer operations in each program. 47
34. Alignment of integer addition operations. This is ultimately the location of the first non-zero bit. 48
35. Distribution of run-length vs start-bit for the run. 48
36. Wires required to represent a bit using a specific MofN code (top), relative energy required to communicate each bit (middle), and transistors per bit required to implement a validity gate for a specific MofN code assuming simple transistor sharing trees (bottom). Each curve shows a single selection of M while sweeping N. 53
37. Channel protocols for QDI buffers. 55
38. A basic template for QDI control with bundled data. 80
39. The channel protocol for the input and output channels of a pipeline stage evaluated over two packets of data. 80
40. Circuit diagram for an asymmetric delay line. The upgoing transition is delayed by six inverters while the downgoing by only two. 80
41. Communicating a QDI request signal to the bundled datapath using an SR latch. 82
42. Communicating a QDI internal memory to the bundled datapath using an n-latch. 82
43. Breaking a communication cycle with a p-latch on the output request. 84
44. Clocking a memory internal to the datapath. 84
45. Clocking a memory internal to the datapath. 85
46. The p-and-p latch (left) passes the value when both clocks are high, the n-or-n latch (right) passes the value when either clock is low. 85
47. QDI internal memory to datapath memory communication. 87
48. Clocking an exchange channel. 87
49. The p-or-p latch (left) passes the value when either clock is high, the n-and-n latch (right) passes the value when both clocks are low. 87
50. The basic template for datapath memory with exchange channels (left) vs after all of the previously discussed memory optimizations (right). 88
51. The idzn counter decomposed into processes. 94
52. Read counter components. 103
53. Write counter components. 107
54. Bundled-Data write counter interface. 111
55. Chunked Bundled-Data write counter interface. 113
56. The distribution of carry chain lengths for increment and decrement commands in the SPEC2006 benchmark. 117
57. Measured Performance and Energy for an array of counters. 117
58. The architecture of the integrated QDI/BD stream full compression unit. 128
59. The architecture of the integrated QDI/BD Stream Compress One unit. 134
60. Structure of multi-node operations for first approach for serial to parallel conversion. 137
61. Structure of multi-node operations for second approach for serial to parallel conversion. 137
62. Overview of the sign-extension unit performance. 142
63. Combined probability distribution of the input bitwidths to the addition, subtraction, and bitwise operations. 144
64. Overview of the sign-extension unit performance. 146
65. Overview of the compression unit performance. 146
66. Combined probability distribution of the output run lengths for a given start bit for the addition, subtraction, multiplication, and bitwise operations. 148
67. Average bitwidth distribution of the output for the addition, subtraction, multiplication, and bitwise operations. 148
68. Average number of redundant bits introduced into the encoding by the addition, subtraction, multiplication, and bitwise operations. 149
69. Overview of the compress unit performance. 152
70. Throughput and energy per serial token of the upflow and downflow serial to parallel units as pipeline length increases. 153
71. Block diagram of the shift left operator. 156
72. Block diagram of the flop driving X. 156
73. Block diagram of the shift right operator. 162
74. Probability distribution for the bitwidth of the left operand A and right operand B for AND, OR, and XOR. 168
75. Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth. 170
76. Probability distribution for the number of redundant bits introduced per operation by this implementation of the bitwise operators. 170
77. Performance and energy averaged over the distribution in Fig. 78 and Fig. 80 vs Transistor Count. 170
78. Probability distribution for the bitwidth of the shifted value A and the shift amount B for the left shift operator. 172
79. Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth. 174
80. Probability distribution for the bitwidth of the shifted value A and the shift amount B for the left shift operator. 174
81. Throughput and energy metrics scaled by maximum bitwidth. 176
82. The architecture of the Adaptive Adder. 179
83. Transistor diagram of LSB adder control circuitry. 181
84. The architecture of the Adaptive Adder/Subtractor. 182
85. Joint probability distribution for the two input bitwidths. 184
86. Performance and energy averaged over the distribution in Fig. 85 vs Transistor Count. 186
87. Each point corresponds to the simulated energy per add averaged for multiple adds over the distribution in Fig. 85 for a given maximum bitwidth. 187
88. Probability distribution for the number of redundant bits introduced per operation by the adder. 187
89. Probability distribution for the input bitwidth. 194
90. Performance and energy averaged over the distribution in Fig. 89 vs Transistor Count. 196
91. Each point corresponds to the simulated energy per compare averaged for multiple compare operations over the distribution in Fig. 89 for a given maximum bitwidth. 196
92. The underlying multiplier architecture used in this chapter. 199
93. Datapath architecture for each digit unit of the multiplier. 209
94. Performance and energy averaged over the distribution in Fig. 95 vs Transistor Count. 215
95. Probability distribution for the bitwidth of the left operand A and right operand B for multiplication. 217
96. Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth. 218
97. Probability distribution for the number of redundant bits introduced per operation by the multiplier. 219
98. Operation implemented by the arithmetic cube. 221
99. Architecture of the Arithmetic Cube. 221
100. Forwarding B. 223
101. Sum process with bypassing multiplexers. 223
102. Mul process with bypassing multiplexers. 223
103. Walk through of simulation of channels along the first column of the Arithmetic Cube. 224
104. Waveform (left) of channels along the first column (right) of the Arithmetic Cube. 226
105. Resulting gate for asymmetric C-element (left) and pass-transistor XOR (right). 234

LIST OF TABLES

1. The state space of the dual differential pass transistor XOR. Rows are highlighted when a and _a are the same value. 72
2. The state space of the pass transistor AND. 73
3. The state space of the pass transistor OR. 74
4. Raw performance measurements for the sign extension units. 143
5. Utilization of each condition for the addition, subtraction, AND, OR, and XOR operators. 146
6. Raw performance measurements for the compressN units. 147
7. Raw performance measurements for the compress1 units. 147
8. Utilization of each compressN condition for the addition, subtraction, multiplication, and bitwise operators. 151
9. Utilization of each compress1 condition for the addition, subtraction, multiplication, and bitwise operators. 152
10. Utilization of each condition for the AND, OR, and XOR operators. 168
11. Performance measurements for the bit-parallel bitwise operators. 168
12. Raw performance measurements for the digit-serial bitwise operators. 169
13. Performance measurements for the bit-parallel shift operators. 170
14. Raw performance measurements for the shift operations. 171
15. Utilization of each condition for left shift. 174
16. Utilization of each condition for right shift. 176
17. Performance measurements for the bit-parallel addition operators. 183
18. Raw performance measurements for the digit-serial addition operators. 183
19. Utilization of each condition for the addition operator. 186
20. Raw performance measurements for the bit-parallel comparison operators. 193
21. Raw performance measurements for the digit-serial comparison operator. 193
22. Utilization of each condition for the comparison operator. 196
23. Raw performance measurements for the digit-serial conditional sink operator. 196
24. Short history of multiplication algorithms and their complexity. 198
25. Booth encoding for a four-bit digit multiply. 208
26. Performance measurements for the bit-parallel bitwise operators. 215
27. Raw performance measurements for the digit-serial multiply operator. 216
28. Utilization of each condition for a multiply. 217
29. Utilization of each condition for the least significant digit of the multiply circuit. 218
30. Comparison of average performance and energy for all operators against their closest clocked counterparts. The multiplier throughput numbers depend on the scheduling efficiency of the dynamic approach, and the compare numbers depend upon the clock overhead associated with the clocked operator comparison point. 228
31. Comparison of effective sequential execution throughput of all operators against their closest clocked counterparts. The comparison operator is an average of 444 ps from not-equal comparisons 73% of the time and 791 ps from the other comparisons the other 27% of the time. 229
LIST OF ABBREVIATIONS

• ALU Arithmetic Logic Unit
• API Application Programming Interface
• ASIC Application Specific Integrated Circuit
• BD Bundled-Data
• CGRA Coarse Grained Reconfigurable Array
• CHP Communicating Hardware Processes
• CISC Complex Instruction Set Computer
• CMOS Complementary Metal Oxide Semiconductor
• CPU Central Processing Unit
• CSP Communicating Sequential Processes
• DI Delay-Insensitive
• DSA Dynamic Single Assignment
• DSP Digital Signal Processing
• FIFO First-In First-Out queue
• FPGA Field Programmable Gate Array
• GCC GNU C Compiler
• GPU Graphics Processing Unit
• HCTA Half-Cycle Timing Assumption
• ILP Instruction Level Parallelism
• IPC Instructions Per Cycle
• ISA Instruction Set Architecture
• LSB Least Significant Bit
• LSD Least Significant Digit
• MSB Most Significant Bit
• NMOS N-type Metal Oxide Semiconductor
• NoC Network on Chip
• PCFB Pre-Charge Full Buffer
• PCHB Pre-Charge Half Buffer
• PMOS P-type Metal Oxide Semiconductor
• PRS Production Rule Set
• QDI Quasi Delay-Insensitive
• RISC Reduced Instruction Set Computer
• SIMD Single Instruction Multiple Data
• SR Set-Reset
• TDP Thermal Design Point
• WCHB Weak-Condition Half Buffer
• uP Micro-Processor

PREFACE

While technology scaling has driven progress in computational power since 1970, that trend has slowed to a halt over the last 15 years. This is motivating research into alternative architectures that explore parallelism, specialization, and reconfigurability. In particular, Coarse Grained Reconfigurable Arrays (CGRA) seem to show great promise as general purpose accelerators. However, their efficacy, and therefore their adoption, have been hindered by capacity limitations, and current approaches to rectify this still fall short. In this thesis, I propose to leverage self-timed circuits to implement length-adaptive digit-serial arithmetic operators for use in configurable array architectures. These circuits significantly reduce the area requirements while maintaining support for arbitrary bitwidths, greatly increasing the capacity of the CGRA at no extra cost. They are implemented in completely self-contained modules that require no extra considerations in larger architectural contexts while elegantly avoiding unnecessary computation. I elaborate on the work I have already done in this space [75][76] and contribute the remaining arithmetic operations, showing significant improvements on all metrics. Finally, I show how these operators may be used in a simple CGRA.

CHAPTER 1
TECHNOLOGY, ARCHITECTURE, AND THE CLOCK

The concepts introduced by Von Neumann in 1945 [214] remain the centerpiece of computer architectures to this day. His programmable model for general purpose compute, combined with a relentless march toward increasingly efficient devices, cultivated significant long-term advancement in the performance and power-efficiency of general-purpose computers. For a long time, chip area was the limiting factor and raw instruction throughput was the goal, leaving energy largely ignored. However, technology scaling has demonstrated diminishing returns, and the technology landscape has shifted quite a bit over the last 15 years.

1.1 Process Technology

Around 2007, three things happened. First, Apple released the iPhone, opening a new industry for mobile devices with limited access to power. Second, chips produced with technology nodes following Intel's 90nm process ceased scaling frequency (Fig. 1) as the power density collided with the limitations of air-cooling (Fig. 2).
For the first time in the industry, a chip could not possibly run all transistors at full throughput without exceeding the thermal limits imposed by standard cooling technology. By 2011, up to 80% of transistors had to remain off at any given time [22]. Third, the growth in wire delay relative to frequency introduced new difficulties in clock distribution. Specifically, around the introduction of the 90nm process, global wire delay was just long enough relative to the clock period to prevent reliable distribution across the whole chip (Fig. 3). As a result of these factors, the throughput of sequential programs stopped scaling after 2005 (Fig. 4). The industry adapted, turning its focus toward parallelism. In 2006, Intel's Spec Benchmark scores jumped by 135% with the transition from NetBurst to the Core microarchitecture, which dropped the base clock speed to optimize energy and doubled the width of the issue queue from two to four, targeting Instruction Level Parallelism (ILP) instead of raw execution speed of sequential operations [9]. Afterward, performance grew steadily as architectures continued to optimize for ILP. While Spec2000 focused on sequential tasks, Spec2006 introduced more parallel tasks [43]. By 2012, Intel had pushed most other competitors out of the Desktop CPU market, and chips following Intel's 32nm process ceased scaling total transistor counts. While smaller feature sizes supported higher transistor density, they also brought higher defect density (Fig. 5), causing yield losses that make larger chips significantly more expensive (Fig. 6).

Fig. 1: History of the clock frequency of Intel's processors.
Fig. 2: History of the power density in Intel's processors. Frequency, Thermal Design Point (TDP), and Die Area were scraped for all Intel processors. Frequency and TDP/Die Area were then averaged over all processors in each technology. Switching Energy was roughly estimated from [19] and [14] and combined with Frequency and Die Area to compute Power Density.
Fig. 3: Wire and Gate Delay across process technology nodes. These were roughly estimated from [14] and [21].

Today, energy has superseded area as the limiting factor, and architects must balance throughput against energy per operation. Furthermore, improvements in parallel programs have slowed due to a combination of factors (Fig. 4). First, all available parallelism has already been exploited for many applications. Second, limitations in power density and device counts have put an upper bound on the amount of computation that can be performed at any given time. And third, memory bandwidth has lagged behind compute throughput, introducing a bottleneck that limits the amount of data that can be communicated at any given time [24].

Fig. 4: History of SpecINT base mean, with benchmarks scaled appropriately [26].
Fig. 5: History of Intel process technology defect density. Intel's defect density trends were very roughly estimated from [12][13][14][15][16] and [17].
Fig. 6: History of transistor count in Intel chips. Transistor density was averaged over all Intel processors developed in each technology.

1.2 Accelerator Architectures

These new constraints have rekindled exploration into alternative architectures that reduce energy requirements in order to increase total throughput. In light of diminishing returns exploiting parallelism, the industry has been exploring specialization and configurability for potential improvements in energy.
While these approaches have increased performance by orders of magnitude for specific applications [25], they have remained largely separated from general compute. However, one approach shows particular promise in bridging this gap. The Von-Neumann architecture carries a significant energy overhead that specialization and configurability seem well suited to address. Reconfiguring the chip to execute each new dynamic instruction requires quite a bit of energy. However, the vast majority of dynamic instructions often come from a relatively small selection of the program specification (Fig. 7). For information processing applications like search, simulation, and compression, generally fewer than 100 static instructions are required to account for 50% or more of the program execution, though others, like compilation, require upwards of 3000 static instructions. It is well known that Field Programmable Gate Arrays (FPGA) do not introduce the same type of overhead because the configuration remains static. However, this particular feature also limits FPGAs to smaller programs with a different programming model. Coarse Grained Reconfigurable Arrays (CGRA) [154] specialize the circuitry at each node, replacing the lookup tables in FPGAs with minimal Arithmetic Logic Units (ALU) or Micro-Processors (uP) implementing boolean operations, addition, subtraction, and sometimes even multiplication. This increases overall capacity and allows them to be used in conjunction with standard Von Neumann architectures to accelerate commonly used complex operations as in [234] and [212]. The vast majority of designs from 1996 to 2016 implemented fairly simple CGRA architectures, clocked with standard bit-parallel [234-251] or bit-serial [264-273] ALUs or uPs connected via routers to a crossbar or mesh Network on Chip (NoC). Architectural advancements within that baseline tend to focus on the integration of that array with other compute and memory elements, and on run-time reconfiguration to maximize utilization. Furthermore, many of these architectures are highly specialized to particular problems, typically in the signal processing domain, due to limitations in overall capacity: most of the architectures published have fewer than 200 execution nodes. However, there are facets of the workload that have not been sufficiently exploited. In particular, as shown in Fig. 8 and discussed in [160] and [159], it is well-known that most arithmetic operations in a given application do not require the full width of the datapath. Von-Neumann architectures have been taking some steps to exploit this particular feature by implementing multiple datapaths of various sizes, aggressive clock-gating, operator packing [159], and staggered execution [6][8]. However, array architectures have not really been able to do the same. None of the designs in [234-251] or [264-273] make any considerations for bitwidth allocation, implementing datapaths that are rigidly restricted to a specific bitwidth, wasting area and energy for most operations in order to support the worst-case.

Fig. 7: Instruction address locality in Spec Benchmarks.
Fig. 8: Average bitwidth distribution in Spec Benchmarks.

Two older designs, [252] and [253], implement a chip-wide configuration of 16-bit or 32-bit modes, and [256-259] have small ALUs that can be statically combined into arbitrarily wide parallel datapaths.
However, this approach is altogether lacking because it assumes that the bitwidth can be determined at compile-time, putting a lot of responsibility on the programmer, the language they use, and its compiler. Generally, the programmer is unreliable, tending toward data-types that are far too large for the typical data they store. Modern programming languages do not help, lacking native support for granularity smaller than 8 bits and lacking clean syntax to track how the bitwidth requirements of a single variable might shift throughout execution. And compilers are severely limited, unable to dependably propagate computed data ranges through memory boundaries or determine how multithreading might affect those data ranges [34][35].

Run-time configuration of such an architecture is ultimately challenging. First, it places constraints on the network architecture to ensure all inter-digit dependencies are correctly routed. If this is resolved by routing all digits together, this excludes any networks in which the paths of those digits require different hop counts. This constraint could be resolved by using all-to-all circuit-switched networks much like FPGAs, layered networks much like modern machine-learning ASICs, or adding pipeline stages to delay the faster digits. If the digits are allowed to be routed separately, then inter-digit dependencies must be routed as their own packets, requiring single-bit paths for carry propagation. All of this causes very high routing overhead for the architecture as a whole. Second, the algorithm for mapping and routing operations on an array architecture has exponential complexity and is not something easily solved at run-time on bare metal [156]. Any software API for this would require self-modifying code, which is the approach taken by [274]. Using that feature to dynamically adapt bitwidth would ultimately introduce more overhead than it is worth. Third, suppose the operator configuration is static but run-time configuration for the extra resources required by wide operations is allowed. That run-time configuration step might put the array over 100% utilization, risking deadlock as a wide operation waits for access to execution nodes it may never get. Ensuring correct operation in this environment is typically expensive [156].

1.3 Digit-Serial Arithmetic

Alternatively, a digit-serial architecture completely eliminates these problems. Inter-digit dependencies simply stay in place, and consecutive digits are all routed along the same path. Instead of having to dynamically allocate resources for different lengths, a single resource is allocated for a longer period of time. Before 1970, many computer architectures were digit-serial, and quite a few had implemented length-adaptivity [214-217]. However, their approach to length-adaptive arithmetic was baked into the control circuitry of the surrounding Von-Neumann architecture. Their digit-serial ALU was simply a smaller version of its bit-parallel counterparts with minor adjustments for inter-digit dependencies. The register file was often implemented with shift registers that stored multiple digits and streamed them one-at-a-time through the ALU [214-231]. After 1970, digit-serial arithmetic was largely relegated to array architectures targeting Digital Signal Processing (DSP) applications and linear algebra acceleration [150][266-277][280][282]. Focus tends to land on MSB-first redundant arithmetic [147-149], and length-adaptivity is generally restricted to a static chip-wide configuration [262-277].
However, there have been a few non-array architectures that implement the token-based approach necessary for true length-adaptivity in array architectures [223]. Ultimately, adaptive digit-serial arithmetic has largely been forgone, and the reasons are fairly simple. First, a modern Von-Neumann architecture has very deep linear pipelines. It starts as a single trunk, branches out to multiple execution nodes after the issue queue, and merges back into a single trunk some time before write-back. However, a serial digit stream can be quite long, blocking the pipeline at the trunks while consecutive digits are being processed. This makes it nearly impossible to efficiently issue multiple instructions and limits the throughput of the whole architecture significantly. It is no surprise then that, except in the domain of low-power, low-performance devices, digit-serial Von-Neumann architectures could not compete with their bit-parallel counterparts after around 1960.

Array architectures have quite wide networks and therefore do not suffer from the same bottleneck. However, they have a problem with timing. Adaptivity in this case means that two digit-streams can have different lengths. Consider a heavily used node executing a sequence of consecutive operations. If one input is shorter than the other, that input must delay its pipeline while the remaining digits of the longer input are being processed. In a clocked architecture, propagating these pipeline delay signals back through the array along the route followed by consecutive data words quickly becomes complex, particularly if the array supports loops or conditions. This is generally done with control flow using valid and ready bits to implement a handshake [78], at which point you might as well use asynchronous circuitry [278][279][281]. Furthermore, there are quite a few cases in which the length and structure of a digit-stream might need to be manipulated. For example, sign extension can add digits to the back of the stream; shifting can add digits to or remove digits from the front of the stream; rotation can move digits from the front to the back or vice versa; addition can result in shorter and longer streams depending upon the carry; and multiplication requires serial-to-parallel conversion and dynamic allocation and deallocation of execution nodes. Overall, these operations require extremely complex control behaviors. Once again, it is no surprise that length-adaptivity has been relegated to a static chip-wide configuration.

1.4 Self-Timed Circuits

Self-timed circuits can solve these problems in a simple and elegant way. First, nodes are connected via channels, each with a request and acknowledge. Each digit is transmitted from one node to another as a request in the channel protocol, and the next request will not be sent until the previous one has been acknowledged. In this way, they implement flow-control with back-pressure, causing consecutive digits, and therefore words, to wait as needed. Furthermore, because self-timed circuits are event-based systems, they are able to implement complex control behaviors with little overhead. Relative to clocked design, self-timed circuits have suffered poor visibility in the industry. There is some sense that generalized asynchronous circuits with a strong framework for timing assumptions [70][72] could be extremely powerful [173].
However, speaking from experience, complex asynchronous circuits are painfully difficult to design and typically require multiple attempts before stumbling onto the correct approach. This has motivated various attempts to make the process even the slightest bit easier. Methods to formally derive asynchronous circuits from a program specification as in [65] are not quite complete. For example, [67] starts from an intermediate graph specification, and their compiled circuits rely upon fast inverters for correct operation. Methods to directly map the syntax of a program specification onto circuit primitives as in [62] and [66] are entirely robust, but result in circuits with high overhead. Overall, the only approach that seems to produce efficient devices has effectively been no approach at all: start from a circuit template as in [68], and then design the rest of the behaviors by hand. Because of this difficulty, expeditions into self-timed circuit design have expressed a bias toward simple pipeline behaviors and familiar architectures. The vast majority of self-timed projects yield bit-parallel Von-Neumann architectures [161-202], many of which are simply desynchronizations of existing synchronous architectures [71]. Similarly simple reconfigurable array architectures are seen as well [282]. However, some start to explore what complex control can do. [203-205] implement a counterflow architecture in which instructions flow down the pipeline while results flow up, effectively implementing a linear bypassing pipeline. The Rotary Pipeline architecture in [206] suggests a ring of connected ALUs around which results are constantly flowing. The Vortex architecture in [207] and later [208] is very similar, but with a crossbar network instead of the ring; routes from node to node are explicitly dictated by each instruction. The Octasic architecture in [209-211] flips the idea on its head, keeping the data within each ALU while routing arbitration tokens, each representing a different stage of the pipeline (fetch, decode, execute, memory, write-back), around a ring. When a particular ALU has a token, it is granted access to the bus dedicated to that external resource. [212] proposes a decomposition of the standard architecture into two separate processors: one specifically handles branches, while the other executes blocks of dataflow instructions, fixed-length loops, and simple conditions. These alternative architectures are certainly interesting, but only two seem to target the feature at hand. [213] suggests bit-parallel width-adaptive architectures that effectively clock-gate the unused bits. [232] implements a length-adaptive bit-serial pipeline. However, it relies upon the control circuitry in a Von-Neumann architecture, much like the synchronous Von-Neumann approach. Ultimately, a lot of architectural possibilities in the self-timed space have yet to be explored.

1.5 Contributions

In light of these considerations, this thesis explores the application of self-timed circuits toward the implementation of length-adaptive digit-serial arithmetic operators for configurable array architectures, elaborating on the work I previously completed toward this end [75][76] and exploring complex control circuitry through templated synthesis. In summary, this thesis contributes to the domain of Adaptive Self-Timed Arithmetic Circuitry as a first step toward maximizing the density of compute resources in Coarse Grained Reconfigurable Arrays.
Doing so would allow for more significant sections of a program to be mapped to the configurable architecture, thereby making that architecture applicable to a larger set of real world problems. This thesis supplies digit-serial arithmetic operators with built-in flow control that adapts to varying digit-stream lengths. In the state of the art, the approach that most closely resembles this work is found in [232]. Qualitatively, that existing approach is not modular, does not sufficiently explore the design space, and exhibits poor performance. Meanwhile, this thesis comprehensively explores the design space, presenting highly modular operators that provide significant compute density improvements by doubling the throughput per transistor and halving the energy per operation on average compared to their industry standard counterparts. Specifically, each chapter contributes the following to the state of the art:

Chapter 2: Workload Characterization does an in-depth analysis of the workloads from the Spec benchmarks, covering available parallelism, instruction locality, memory usage and bandwidth, and integer arithmetic features.

Chapter 3: Design Methodology describes several high-level approaches for adaptivity and discusses the strengths and weaknesses of available circuit design methodologies. Settling on an integrated QDI/Bundled-data approach, it covers how to communicate data between the QDI control and Bundled-Data datapath and how to approach the synthesis of QDI control circuitry.

Chapter 4: Counters goes in depth on QDI counters, covering increment, decrement, clear, read, and write commands along with constant time zero detection. This chapter offers significant performance improvements beyond my previous work in [75].

Chapter 5: Stream Manipulation covers circuitry necessary for basic digit stream manipulation, including sign extension as a prerequisite for all multi-input operations, sign compression to reduce a digit stream to its minimum length, and serial-to-parallel conversion as a prerequisite for shifting and multiplication operators.

Chapter 6: Bitwise Operators shows how simple bitwise operators can be grafted into the sign extension circuitry, and how the counters and serial-to-parallel circuitry may be used to implement stream shifting.

Chapter 7: Addition and Subtraction elaborates on the work I completed in [76], covering digit-serial addition and subtraction.

Chapter 8: Comparison and Conditionals describes operators that compare a value with zero, and that conditionally pass or sink a value. These circuits may be used to implement conditional moves.

Chapter 9: Multiplication explores multiplication architectures and their trade-offs and difficulties.

Chapter 10: Example Array Architecture shows how these circuits interact in the context of a simple array architecture called the Arithmetic Cube [266][268].

Finally, the outcome of this work is summarized and further research opportunities are discussed.

1.6 Collaboration, Previous Publications, and Funding

Without the guidance and collaboration of my advisor, Rajit Manohar, none of this work could have come to fruition. I have published three papers at the time of writing this thesis. "QDI Constant Time Counters" [75] covers my initial work on counter circuits; I improve upon this work in Chapter 4. "Self-Timed Adaptive Digit-Serial Addition" [76] covers my work on the addition operator. This corresponds to Chapter 7 of this thesis.
My third paper, "A Systematic Approach for Arbitration Expressions" [77], is orthogonal to this work and is therefore not included. Various funding agencies allowed me to explore this work, including the National Science Foundation (CCF-1065307, CCF-1617945), the Office of Naval Research (N00014-13-1-0419), and the Air Force Research Laboratory (FA8750-15-1-0173).

CHAPTER 2
WORKLOAD CHARACTERIZATION

Before designing any circuitry, it is important to make a close examination of the functionality being targeted. Why specifically would a CGRA accelerator provide any benefit beyond modern architectures, and why is it important for its datapath to be digit-serial? This chapter endeavors to produce detailed and specific answers to these questions with a quantitative examination of important features underlying the workload.

There are a few industry standard benchmarking suites and countless domain-specific benchmark applications that could be used to expose underlying features in common workloads. Ultimately, the industry standard suites tend to examine the breadth and depth of their workloads more rigorously. Passmark [31] has become the most popular consumer-facing benchmark. However, Parsec [27], Splash [29], TPC [30], Spec-OMP [32], and Spec-CPU [26] are the most popular for computer architecture research. Until around 2012, SPEC was the industry-accepted benchmark for this purpose. Around that time, the breadth of application domains increased dramatically with the introduction of mobile, big-data, and machine learning systems. Today, there is no default correct choice for a benchmark suite. Parsec, Splash, and Spec-OMP focus on the performance of many-core systems, while TPC focuses on database systems. These benchmarks emphasize the performance of the on-chip network and memory systems. While this is good for understanding the performance of the whole system-on-chip, Spec-CPU focuses heavily on the performance of a single core. Relative to the Spec-CPU2006 benchmarks, Spec-CPU2017 adjusts the covered domain space, keeping many of the same applications while removing some and adding more for machine learning problems [28]. As discussed in Chapter 1 Section B, CGRA accelerators can optimize the performance of a core's sequential instruction execution, particularly in low-power environments. Of the benchmarks, Spec-CPU2006 should exhibit the least parallelism, the most data-interdependency, and the least locality. This represents the most conservative application of a CGRA to real-world problems. While it should also exhibit the lowest total memory bandwidth requirements, memory system design is not in the scope of this work. Therefore, this chapter analyzes the programs found in the Spec-CPU2006 [26] benchmark using Intel PIN. The 29 applications listed below were carefully selected by the Spec Benchmark Committee to be an approximation of realistic workloads (perlbench, bwaves, milc, cactusADM, gobmk, povray, sjeng, h264ref, omnetpp, sphinx3, bzip2, gamess, zeusmp, leslie3d, dealII, calculix, GemsFDTD, tonto, astar, xalancbmk, gcc, mcf, gromacs, namd, soplex, hmmer, libquantum, lbm, wrf).

There are many caveats to this approach. The workload requirements of a given program are heavily influenced by the machines available at the time of development, because people will not develop a program that no machine can execute to completion.
Therefore, designing an architecture strictly optimized toward observed workload requirements is unlikely to expose new opportunities in software. Also, many features of the measured workload are dramatically affected by the compiler and the target machine used to execute the program. Ultimately, complete isolation and characterization of these effects is extremely difficult and beyond the scope of this analysis.

Intel PIN is simultaneously extremely helpful and problematic. It facilitates a deep dive into aspects of the executed program in ways that other tracing tools cannot. However, it also forces the use of Intel's x86 64-bit architecture, which is ultimately a RISC core combined with a powerful microcoding engine to implement a complex Instruction Set Architecture (ISA). Any analysis in this environment will have drastically different results from analyses in standard RISC environments. Loads and stores are represented by addressing modes rather than instructions, several instructions are Turing Complete all on their own [4][5], and there are complex instructions like sqrt, sin, and cos that hide calls to the more basic operators. Therefore, extracting workload data with respect to a generalized Von-Neumann RISC architecture requires some post-processing. Memory loads and stores should be separated from their parent instruction, complex operators should be split into their microcodes using known best implementations, SIMD instructions should be split up, and instruction variants should be grouped. Fortunately, the GCC compiler produces fairly reasonable assembly, avoiding most of the complex microcoded instructions and obfuscated compilation strategies available in the ISA, leaving those effects negligible.

The results are presented as a collection of distributions with one distribution per program. These distributions are color-coded such that red means "or more", green means "exactly", and blue means "or less". Therefore, red and blue represent different kinds of cumulative distribution functions while green represents a probability distribution function. The sketch below illustrates how these three curves relate.
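As a concrete illustration, the three curves for a single program can be computed from a list of per-cycle samples as in the following sketch (my own illustration, not the tooling used for this thesis; the function and variable names are hypothetical):

```python
# Compute the three color-coded curves from one sample per cycle:
# green is P(X = v), blue is P(X <= v), red is P(X >= v).
from collections import Counter

def distributions(samples):
    counts = Counter(samples)                 # value -> number of cycles
    total = len(samples)
    values = sorted(counts)
    exactly = {v: counts[v] / total for v in values}   # green: PDF
    or_less, acc = {}, 0.0
    for v in values:                                   # blue: CDF
        acc += exactly[v]
        or_less[v] = acc
    # red: complementary CDF, P(X >= v) = 1 - P(X <= v) + P(X = v)
    or_more = {v: 1.0 - or_less[v] + exactly[v] for v in values}
    return or_more, exactly, or_less
```

For example, a claim like "50% of cycles execute 24 or more operations in parallel" reads directly off the red curve as or_more[24] >= 0.5.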
2.1 Parallelism

Overall, there are two arguments for CGRA architectures that dominate the literature. One cites increased programmability beyond ASICs, resulting in cheaper fabrication with lower design time and time to market [236][239][244][247][249][252][253][254][258][259]. The other focuses heavily on their capabilities regarding available parallelism beyond CPUs [234][235][237][238][240][243][245][246][248][250][255][256][257][261]. While speeding up sequential execution of instructions via clock speed is off the table due to technology-node constraints, it has been argued that there is still quite a bit to be gained from accelerating embarrassingly parallel programs, particularly in the domain of mobile or embedded DSP applications. However, many programs in this category have already been moved off of the CPU and onto the GPU. So, when making this argument it is necessary to show both that there is parallelism available for further speedup, and that the GPU is not capable of capturing that speedup.

The underlying concept of available parallelism is fairly simple: it represents the speedup gained from being able to execute some number of instructions in parallel. Unfortunately, this definition is fairly vague, and measuring it can be deceptively complex. There have been quite a few papers that analyze the parallelism available in a given program, and they all take about the same approach. A computation consists of a collection of basic operations (add, subtract, multiply, divide, etc.) in which one operation's output is another's input. Such data dependencies form the arcs within a directed acyclic graph called the dependency graph (Fig. 9). The operations in this graph are then organized chronologically into steps such that an operation executes as soon as its input operands are ready. The available parallelism of a program is then computed with the application of Amdahl's Law to the number of instructions in each step. Unfortunately, there are quite a few confounding factors, driving each paper to make different assumptions about the underlying data. By definition, the dependency graph structure assumes infinite hardware, perfect branch prediction, and perfect memory disambiguation. This ignores write-after-read and write-after-write dependencies along with structural hazards and control dependencies. [37] and [38] each analyze a self-selected collection of programs, use this ideal machine to estimate the length of the critical path, and report the ratio of the total instruction count against the critical path length. [38] goes a step further to show the trace of available parallelism across the program. [39] analyzes a much longer segment of the program execution, but is limited by a window size, considering a limited number of instructions in the trace at a time. This allows it to do a complete analysis of the Spec benchmarks from 1989. [41] takes speculation a step further, analyzing available parallelism under data-value prediction models. However, there are also many circumstances that might artificially sequentialize an otherwise parallel program even with an ideal machine, and they are often difficult to identify. For example, loop iterators create a dependency chain from one iteration to the next that can often be unrolled, and long expressions can be accumulated sequentially or computed as a tree. Many of these program transformations are often done by the compiler, but there are limits. For example, a compiler may only partially unroll a loop, ultimately leaving that dependency chain somewhat in place. [40] applies constant value propagation and tree height reduction to deal with loop iterators and long sequential expressions. Finally, [42] brings many of the techniques together to do an analysis on modern benchmarks in Spec.

Fig. 9: Example dependency graph.

Unlike the RISC ISA, x86-64 is particularly complex, and this approach must take that into account. First, every byte of data throughout the execution of the program is recorded in a linked list with pointers to the instructions that read and write it. Instructions are also recorded in a linked list with their assigned step. An array of x86 registers is mapped to the set of bytes they store, accounting for overlapping registers like RAX, EAX, AX, AL, and AH, as in the sketch below. Second, the program maintains a hash-map of 1024-byte pages of memory in which each location is mapped to a byte of data.
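The register overlap can be represented as a static table mapping each architectural register name onto a byte slice of its widest parent, so that reads and writes through any alias touch the same underlying bytes. A minimal sketch (the names and layout here are illustrative, not the actual data structures of the PIN tool):

```python
# Map each x86 register name to (parent, byte offset, size in bytes), so that
# accesses through RAX, EAX, AX, AL, or AH all resolve to bytes of RAX.
REG_SLICE = {
    "rax": ("rax", 0, 8),
    "eax": ("rax", 0, 4),
    "ax":  ("rax", 0, 2),
    "al":  ("rax", 0, 1),
    "ah":  ("rax", 1, 1),   # AH is the second byte of RAX
}

def reg_bytes(name):
    parent, offset, size = REG_SLICE[name]
    return parent, range(offset, offset + size)
```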
As discussed in Section D of this chapter, mov instructions represent a significant overhead in the x86-64 ISA that should be removed for the purpose of an ideal machine. Therefore, any instruction that unconditionally moves or copies data should take immediate effect and should not be counted as an operator. Unfortunately, unconditional data moves or copies within memory require special treatment. While they should not be counted as an operator, the data dependencies from the base and index registers of the memory address must be taken into account for any operator that reads the result. Furthermore, vector and SIMD instructions are divided into multiple operators. Branches and jumps are not counted, nor do they create any dependencies, per the assumption of perfect branch prediction. While PIN does not track instructions through the kernel, it does provide the values of all input operands. Therefore, system calls are handled manually, counting as a single operator and correctly affecting memory, data dependencies, and parallelism as necessary. Any further instructions each count as a single operator. Finally, to prevent large initialization spikes as found in [42], any operation without input dependencies is scheduled to execute the cycle before its output is read. All recorded operations take exactly one cycle to complete. The sketch below summarizes this scheduling model.
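A minimal sketch of that scheduling model follows (hypothetical structures, not the PIN tool itself). Each operation is assigned the earliest step after all of its producers, and the per-step operation counts form the parallelism trace; for simplicity this sketch places dependency-free operations at step 0 rather than the cycle before their first use:

```python
# ASAP scheduling under the ideal-machine assumptions: unit latency,
# infinite hardware, and dependencies only through data.
def schedule(ops):
    """ops: list of (op_id, [producer op_ids]) in program order."""
    step = {}    # op_id -> assigned step
    trace = {}   # step -> number of operations executed that cycle
    for op, deps in ops:
        ready = max((step[d] + 1 for d in deps), default=0)
        step[op] = ready
        trace[ready] = trace.get(ready, 0) + 1
    return trace

# Available parallelism follows from the per-step counts: total operations
# divided by the number of steps (the critical path length).
trace = schedule([("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["c"])])
print(trace)                              # {0: 2, 1: 1, 2: 1}
print(sum(trace.values()) / len(trace))   # 4 operations over 3 steps
```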
Ultimately, rigorous automatic parallelization as in [40] is out of the scope of this work. While it can be safely assumed that modern compilers take care of the majority of known transformations, it is still necessary to break the long dependency chains created by loop iterators using constant value propagation. Any instruction in which all of its operands are constant can be evaluated at compile time; such instructions do not count as operators, and their results are also constant. While this does not dive deep into automatic parallelization of expressions, it does implement loop unrolling. Overall, this strategy still miscounts the available parallelism from instructions like sqrt, cos, sin, etc. However, only sqrt is ever used throughout the Spec benchmark, and its execution counts are negligible.

This PIN tool will quickly run out of memory if certain precautions are not taken. First, each byte will keep count of its references in the register file and in memory. When there are no more references to that byte, it will be counted in the output statistics and deleted. Instructions will keep count of all bytes they write. As soon as all of those bytes have been deleted from the list, the operation will also be counted in the output statistics and deleted. The hash-map of 1024-byte pages will limit its size to 1000 pages, swapping any pages beyond that count into a file. The output statistics will keep track of the number of operations, the data reuse distances and counts, and the data lifetime distances and counts in each step, swapping data out to a file as necessary to maintain a small memory footprint. Finally, the run time of the analysis is limited to 24 hours.

The outcome for a given program is the total number of operations that can be executed in parallel for each cycle. For example, Fig. 10 shows the trace of available parallelism from a 120 million instruction execution of perlbench. Keep in mind the subtle difference between instructions and operations: an operation is the unit of computation left after post-processing an x86 instruction. Overall, this analysis only covers a small sampling of each program. Two distributions show different features of the available parallelism. The first, in Fig. 11, shows the distribution of available parallelism with respect to cycles. For example, 50% of cycles in perlbench.0 execute 24 or more operations in parallel. This showcases the sequential nature of each program, emphasizing the median available parallelism in the trace. Alternatively, Fig. 12 shows the distribution with respect to operations. 50% of operations in perlbench.0 were executed in cycles with at least 745 operations in parallel. This showcases the parallel nature of each program, emphasizing the workload achieved in highly parallel cycles. The programs in these figures are ordered by their speedup, with lower speedup on the left and higher on the right, and both of these distributions are necessary for a thorough understanding of this ordering. For example, the reason for the order of wrf.0 and soplex.1 may be unclear if looking at only one of these distributions. However, it quickly becomes clear that while a large number of operations in wrf.0 are executed in one or two highly parallel cycles, there are also a significant number of extremely sequential cycles that limit its overall speedup. On the other hand, the parallelism in soplex.1 is fairly evenly distributed across all cycles, allowing for higher overall speedup.

Fig. 10: Trace of available parallelism for perlbench.0.
Fig. 11: Distribution of available parallelism with respect to cycles in the Spec benchmark programs.
Fig. 12: Distribution of available parallelism with respect to operations in the Spec benchmark programs.

Fig. 13 shows the maximum achievable speedup of each program given some number of execution units. While there are a few very sequential programs with between 10x and 100x potential speedup, most programs offer more than 100x. [33] found an IPC between 1 and 3 for the execution of the Spec Benchmarks on an Intel Xeon processor in 2018, suggesting that there is still plenty of available parallelism to take advantage of. It is possible that these parallel programs cannot easily be moved to the GPU because the operations involved are interdependent, meaning they cannot easily be split into threads. Unfortunately, there is not a particularly clean way to measure operator interdependence. However, it may be indirectly measured by looking at the reliability of available parallelism. If one cycle has 100 operators and the next 1000, then every operator in the second cycle must be dependent on at least one operator in the first; likely every operator in the first will fan out to around 10 in the second. These kinds of interdependencies are not easily mapped onto a GPU. Meanwhile, if multiple consecutive cycles all have similar available parallelism, then it is possible for those operators to form fairly independent threads. Therefore, it is necessary to identify runs of cycles. Assuming that each thread has a max IPC of 3, a new run is started whenever the next cycle has less than 1/3 of the maximum parallelism or more than 3 times the minimum parallelism of the run. The minimum parallelism of the run then represents a loose upper bound on the possible thread count for that run. A modern CPU has 8 cores supporting 2 threads per core with an IPC of 3. Therefore, if a run has a minimum parallelism less than 48, it can likely be executed on the CPU. Runs with a minimum parallelism greater than 48 are reasonable candidates for the GPU. Now there are two classes of runs for the GPU: long runs with relatively low parallelism, and short runs with extremely high parallelism. Ultimately, since all of these runs can be accelerated with a GPU, it is only necessary to determine the workload achieved by each run, or the total number of operations. The distribution of these workloads can be seen in Fig. 14. From this, it is possible to measure how threadable a program might be by computing the weighted average of the workload across the program. This is the workload of each run times its percentage of the total workload of the program, as in the sketch below.
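The run-splitting rule and the resulting score can be sketched as follows (a simplified illustration under the stated assumptions; restricting the score to GPU-candidate runs is one plausible reading, and the names are mine):

```python
# Split a non-empty parallelism trace into runs: a new run starts when the
# next cycle drops below 1/3 of the run's maximum or rises above 3x the
# run's minimum, reflecting a per-thread max IPC of about 3.
def find_runs(trace):
    runs, cur = [], [trace[0]]
    for p in trace[1:]:
        if p < max(cur) / 3 or p > 3 * min(cur):
            runs.append(cur)
            cur = []
        cur.append(p)
    runs.append(cur)
    return runs

# Weighted-average workload: each run's operation count weighted by its
# share of the program's total operations. Higher scores suggest the
# program is more threadable and a better GPU candidate.
def threadability(trace, cpu_limit=8 * 2 * 3):   # 8 cores x 2 threads x IPC 3
    gpu_runs = [r for r in find_runs(trace) if min(r) > cpu_limit]
    total = sum(trace)
    return sum(sum(r) * (sum(r) / total) for r in gpu_runs)
```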
There are quite a few research papers that set about the task of porting those programs, and unsurprisingly, Fig. 15 shows that programs measured as threadable tend to achieve good speedups on a GPU. Ultimately, this measure accounts for 57% of the variance when compared to real GPU speedup. This is more predictive power than the limitations of the underlying data might suggest: parallelism does not directly encode interdependency, and the real GPU speedups are reported by multiple authors who use different GPU hardware and compare against varying CPU hardware. It is ultimately a little more predictive than the maximum theoretical speedup, which accounts for 44% of the variance.

Fig. 15: Reported application speedup with GPU. bzip2 [46], wrf [47], h264ref and lbm [48], dealII [49], gromacs [50], GemsFDTD [51], zeusmp [52], namd [53], bwaves [54], gamess [55], mcf [56], astar [57], milc [58], libquantum [59], hmmer [60].

Overall, these statistics suggest that significant speedup is available. However, the max IPC of an Intel Xeon in 2017 is around 10, and the fact that the achieved IPC is between 1 and 3 for these applications suggests that parallelism might not be the limiting factor [33]. While some programs with higher available parallelism already are or can be made well suited for a GPU, there is still theoretically quite a bit of speedup to be had from the more parallel applications. The fact that the achieved speedups even for the GPU are nowhere near the predicted max suggests that current architectures might be limited by another characteristic of the workload, and many of the GPU papers suggest that this might be the Von Neumann bottleneck [10].

2.2 Locality

Parallelism is not the only reason one might want a CGRA, especially when most modern processors are energy-bound. [44] ran a full analysis of energy consumption in the context of a processor architecture similar to the Pentium III. This paper was published before frequency stopped scaling in 2003 due to the power wall, so a lot of things have changed, and the tool used by the paper is no longer available. However, its data is still informative. In particular, the energy breakdown remains fairly consistent across the programs from the Spec2000 benchmark. On average, the control logic for scheduling instructions (i.e. the L1 Instruction Cache, Instruction Decode, Instruction Queue, and Reorder Buffer) accounts for around 62% of the energy consumption of a superscalar processor core. This ratio has likely been reduced by micro-op fusion [7], but instruction scheduling logic probably still represents the majority of the energy consumption. On top of this, the Register Renaming Table, which provides a mapping of the underlying memory system to a large set of virtual registers, represents around 13.6% of energy consumption. The activity required by all of these functionalities accounts for 75% of energy. This trend also holds for a much simpler embedded RISC processor in Fig. 17, where 48% of energy is consumed by instruction scheduling while 30% is consumed by data delivery, for a total of 78%.
As mentioned in Chapter 1 Section B, most programs have very high execution locality, with at most a couple thousand static instructions accounting for the majority of the dynamic execution (Fig. 18). Therefore, much of the work done to schedule instructions is redundant. Unfortunately, the argument of locality [241][251] has been largely overlooked. It is well known that most instructions can be scheduled in a group called a basic block. Whenever this basic block is executed, its internal data dependencies and execution ordering remain the same. Therefore, a statically mapped CGRA acting as a configurable CISC ALU might be able to save a significant amount of energy by scheduling blocks of instructions together as an extreme adaptation of micro-op fusion [249].

Fig. 16: Microarchitectural breakdown of energy usage in an out of order superscalar processor [44].
Fig. 17: Microarchitectural breakdown of energy usage in an embedded RISC processor [45].
Fig. 18: Instruction address locality in Spec Benchmarks.

There is one caveat though. If only one copy of an instruction block that is heavily used in close proximity is mapped to the CGRA, then performance might be hampered by structural dependencies. This is likely to happen in strictly iterative algorithms like the Newton-Raphson method. Fig. 19 shows the distribution of the lengths of two consecutive groups of cycles with identical patterns of available parallelism; cycles are not included in this distribution if no repeated pattern could be identified. This is ultimately an indirect measure, and there are many caveats to using this data. A repeated pattern of available parallelism could be made up of two completely different functions. Alternatively, two identical functions could be scheduled differently, showing different patterns of available parallelism. However, it is reasonable to assume an instruction block that is repeated for a long time will settle onto a repetitive schedule. The resulting data suggests that a few programs might be hampered by these structural dependencies. Therefore, it is likely a good idea for the CGRA to be able to duplicate a given configuration in multiple locations to avoid structural hazards.

Finally, basic blocks tend to be defined by the surrounding branch instructions, which require the instruction scheduling control circuitry to make a decision. However, not all branch instructions are made the same. They represent a barrier through which a collection of data routing decisions are made: some decide the routing of a significant amount of data while others decide the routing of just one value. Therefore, it could be possible to expand basic block size by merging simple branches, conditional moves, and for loops into their neighboring basic blocks. This would increase the size of a basic block, free up space in the branch predictor, and further reduce scheduling costs [212]. Further analysis of branch width distributions should be done to determine just how much of an effect this could have on basic block size.

2.3 Memory and Bandwidth

This form of micro-op fusion ultimately routes an operation's intermediate results directly to their next operation within the array. Doing so removes these intermediate values from the memory system entirely, reducing demand on the register rename table and the L0 and L1 data caches. This reduction is not linear: a lot of overhead is introduced by capacity limitations of the physical register file, and a CGRA effectively increases its size.
With a limited physical register file, values must be swapped in and out of main memory more often, introducing a lot of move instructions that otherwise would not be necessary. The instruction usage breakdown in Fig. 20 shows that move instructions are the single most used instruction in the x86-64 ISA. Overall, Fig. 21 shows that routing instructions account for 43% of all instructions executed on average across all program runs. This is a massive overhead that would not exist in an ideal machine, which is why it was removed from the parallelism and bandwidth measurements.

Fig. 19: Distribution of maximum local correlation of parallelism within 100 cycles.
Fig. 20: Distribution of instruction usage as a percent of total execution averaged over all programs.
Fig. 21: Distribution of instruction usage categories as a percent of total execution for all programs in Spec2006.

To my knowledge, only one paper has endeavored to measure memory and bandwidth requirements [36]. However, it is unclear whether they took this overhead into account to determine what an ideal system might see. They measured these requirements in the context of a particular architecture, focusing heavily on how cache size affects the memory bandwidth.

In many ways, the memory and bandwidth requirements of a program mirror its available parallelism: more instructions in parallel also means more data. Therefore, measuring them requires only minor modification to the available parallelism measurement tool. These requirements are examined from six different perspectives:

Read Reuse Distance: The distance in cycles to the previous read or write for each byte of data read in a given cycle.
Read Lifetime Distance: The distance in cycles to the write for each byte of data read in a given cycle.
Write Reuse Distance: The distance in cycles to the first read for each byte of data written in a given cycle.
Write Lifetime Distance: The distance in cycles to the last read for each byte of data written in a given cycle.
Store Reuse Distance: The distance in cycles to the next read for each byte of data currently in storage in a given cycle.
Store Lifetime Distance: The distance in cycles to the last read for each byte of data currently in storage in a given cycle.

The first four summarize read and write bandwidth while the last two summarize total memory storage requirements at a given cycle.
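As a concrete illustration, a minimal sketch of the bookkeeping for the first of these six, the read reuse distance, is shown below; the other five follow the same pattern with different endpoints. The names and the byte-granular trace format are hypothetical simplifications; the actual tool works over 1024-byte pages and swaps its statistics to disk as described earlier.

from collections import defaultdict

def read_reuse_distances(accesses):
    # accesses: list of (cycle, 'r' or 'w', address) for single bytes.
    # For each read, record the distance in cycles to the previous read
    # or write of the same byte (the Read Reuse Distance above).
    last_touch = {}
    histogram = defaultdict(int)
    for cycle, kind, addr in accesses:
        if kind == 'r' and addr in last_touch:
            histogram[cycle - last_touch[addr]] += 1
        last_touch[addr] = cycle
    return dict(histogram)

trace = [(0, 'w', 0x10), (1, 'r', 0x10), (5, 'r', 0x10), (6, 'w', 0x10)]
print(read_reuse_distances(trace))  # {1: 1, 4: 1}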
Fig. 22 shows an example outcome of the store reuse distance. For example, in cycle 50, 20000 bytes currently in storage will be read within the next 4096 cycles.

Fig. 22: Trace of memory requirements for perlbench.0.

Fig. 23 shows the distribution of storage requirements. For most programs, like gcc, h264ref, and soplex, the maximum storage requirements stay fairly consistent, showing tight distributions for 75% of cycles. However, there are a few, like gobmk, lbm, wrf, and mcf, in which the memory requirements vary wildly. This is simply due to a few highly parallel cycles. Ultimately, it is hard to know when a program might be at a particular location in this distribution, and given how tight many of the distributions are, it is unclear how much benefit could be gained by trying to take advantage of this.

Fig. 23: Distribution of total storage requirements for each program.

Fig. 24 shows the distribution of storage reuse distance for an average cycle. For most programs, the vast majority of this data is simply sitting around waiting more than 4096 cycles to be read. This is consistent with the program behaviors assumed by cache systems, with cache size relative to ALU proximity. The programs that more often interact with a much greater percentage of their data are also the programs that require significantly less total storage, like bzip2 and gobmk.

Fig. 24: Distribution of Storage Reuse Distance for an average cycle in each program.

The bytes that will be read in one cycle represent the average read bandwidth. The distribution of read bandwidth across cycles in Fig. 25 is not as tight: in general, the read bandwidth for 50% of cycles stays between 100 and 1000 bytes per cycle, though every program has outlying cycles that have a lot of parallelism and therefore consume a lot of bandwidth. Unfortunately, this suggests that the limiting factor for speedup of wrf and libquantum on a GPU will most definitely be bandwidth.

Fig. 25: Distribution of Read Bandwidth over all of the cycles in each program.

Fig. 26 shows the distribution of the read reuse distance for an average cycle. The vast majority of the read bandwidth is due to operators that were completed in the previous cycle; this accounts for no less than 40% of the read bandwidth requirement across all programs. For all but one program, around 70% of values were written or read at most 4 cycles ago. This is also consistent with the program behaviors assumed by cache systems implementing the least-recently-used swapping strategy.

Fig. 26: Read Reuse Distance for an average cycle in each program.

Fig. 27 shows the distribution of the read lifetime distance for an average cycle. Between 20% and 40% of the read bandwidth comes from bytes that were written exactly one cycle ago. These values can be routed directly through the CGRA, completely avoiding the memory system. Bytes with slightly longer lifetime distances can passively wait on the CGRA's routing network until the other dependencies have resolved.

Fig. 27: Read Lifetime Distance for an average cycle in each program.

The write bandwidth distributions in Fig. 28 are ultimately similar to the read bandwidth, though due to value fanout the write bandwidth is reduced by a small factor with respect to the read bandwidth. What is more interesting, in Fig. 29, is that for most programs around 80% of written values are read in the next cycle. This is consistent with the performance boost offered by bypassing networks. Likely, the difference between this and the read reuse distance comes from value fanout and, ultimately, values stored during program initialization.

Fig. 28: Distribution of Write Bandwidth over all of the cycles in each program.
Fig. 29: Write Reuse Distance for an average cycle in each program.

Finally, as shown by the write lifetime distance in Fig. 30, over 60% of written values for most programs last exactly 1 cycle and around 80% of written values last at most 4 cycles. This bandwidth represents the low-hanging fruit for optimization with a CGRA. These values would be routed directly through the CGRA to their next operation without ever touching the memory system. This can be particularly helpful with programs like libquantum and wrf that are likely to struggle with bandwidth on a GPU.

Fig. 30: Write Lifetime Distance for an average cycle in each program.

2.4 Bitwidth

The measured instruction locality suggests the CGRA's capacity will need to be fairly large for it to be useful. Furthermore, any method of reducing memory bandwidth requirements would help to alleviate the Von Neumann bottleneck in the face of high parallelism. Luckily, the bitwidth distribution can be used to manage both.
Fig. 31 shows 6 categories of arithmetic operations and their usage. The flags category contains bit test, set, scan, and count operations. The shift category contains logical shift, arithmetic shift, and rotation. The remaining categories are self explanatory. Add and subtract operations are responsible for 43% of all integer arithmetic operation executions averaged over all program runs. Including the subtraction required by the compare instructions, add and subtract operations represent 82% of all executed integer arithmetic operations. Furthermore, they are a heavily-used sub-operation for multiply and divide.

Fig. 31: Distribution of ALU operation usage across all programs.

Fig. 32 shows the distribution of the bitwidth of 2 trillion integer add and subtract operations, measured by taking the maximum bitwidth of each operation's inputs, averaged across all programs. There are four distributions centered around 6, 20, 26, and 29 bits. Then, there are spikes at 33 and 48 bits. Given that virtual addresses in an x86-64 architecture are 48 bits and the surrounding bitwidths have negligible occurrence counts, it is safe to assume that all of the 47 and 48 bit operations come from memory address calculations. Because those computations have a predictable bitwidth and require heavy utilization of multiplier circuitry, they should not be executed within any CGRA accelerator. Fig. 33 re-examines the data with this in mind, showing the average bitwidth both ignoring 47 and 48 bit operations in blue, and including them in red. For a few programs like calculix, povray, and omnetpp, memory address calculations seem to account for a significant fraction of total executed instructions. Meanwhile, others like sphinx3, gamess, and mcf have very few, suggesting either that they have low total memory requirements, or that memory locations can be determined at compile time.

Fig. 32: Average bitwidth distribution of integer operations across all programs.
Fig. 33: Average bitwidth of integer operations in each program.

The peak around 29 bits is likely from 32-bit masks. The peak around 6 bits is likely due to iterators and flags on top of the typical arithmetic distribution centered at 8 bits. It is unclear what causes the spikes at 20, 26, and 33 bits. Overall, the average bitwidth ignoring memory address calculations is about 15 bits.

While the bitwidth is highly variable, the alignment is not: Fig. 34 shows that those values tend to be aligned very close to 0. This means that an adaptive LSB-first datapath can be very simple, while an MSB-first datapath will require complex control to get anything out of the bitwidth distribution. Finally, as shown in Fig. 35, run lengths follow an exponential drop in occurrence rate, with vanishing probability past 6 bits.

Fig. 34: Alignment of integer addition operations. This is ultimately the location of the first non-zero bit.
Fig. 35: Distribution of run-length vs start-bit for the run.
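A minimal sketch of this measurement, assuming unsigned magnitudes and ignoring the tool's handling of sign: the bitwidth of an operation is the maximum bitwidth of its inputs, and the alignment is the position of the first non-zero bit. Function names are hypothetical.

def bitwidth(x):
    # Significant bits in the magnitude; a simplification that ignores
    # the sign handling of the actual tool.
    return abs(x).bit_length()

def lowest_set_bit(x):
    # Index of the first non-zero bit of a positive integer.
    return (x & -x).bit_length() - 1

def add_stats(a, b):
    # An add/sub operation is binned by the maximum bitwidth of its
    # inputs; alignment is the lowest set bit over the operands.
    width = max(bitwidth(a), bitwidth(b))
    operands = [abs(v) for v in (a, b) if v != 0]
    align = min(lowest_set_bit(v) for v in operands) if operands else 0
    return width, align

print(add_stats(12, 3))  # (4, 0): 12 = 0b1100, 3 = 0b11
print(add_stats(8, 16))  # (5, 3): both operands aligned at bit 3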
2.5 Lessons Learned

There are quite a few lessons to be pulled from this data. Regarding micro-architecture, a CGRA accelerator acting as a statically mapped configurable CISC ALU could:

1. dramatically reduce the energy required for instruction configuration;
2. reduce demand on the data memory system and register renaming table, effectively expanding the width of the physical register file by directly routing intermediate values through the CGRA;
3. reduce demand on the instruction memory system by fetching blocks of instructions at a time;
4. reduce demand on the branch predictor by encoding simple branches and loops directly on the CGRA fabric; and
5. increase achieved IPC by reducing demand on all of the bottlenecked systems while increasing parallel execution resources.

To limit the scope of this thesis, these benefits will have to be explored in future work. However, in the context of such a CGRA, this data also suggests that quite a few benefits could be realized from a digit-serial datapath that adapts to variable length operations. Specifically, an adaptive digit-serial datapath could:

1. reduce energy consumption and increase throughput per transistor by skipping unnecessary work;
2. natively implement multiple precision arithmetic;
3. require significantly fewer transistors per execution node, increasing the overall capacity of the CGRA;
4. slightly increase sequential instruction execution speed, because per-digit operations are likely to have lower latency; and
5. reduce memory bandwidth, reading or writing only one digit of each value per cycle and reducing the total number of digits read or written per value.

The following chapters explore self-timed adaptive digit-serial arithmetic operators with this context in mind, directly demonstrating 1 and 2 with functional circuitry and indirectly demonstrating 3 with a simple prototype. Items 4 and 5 require implementation of the whole system, which is not in the scope of this work.

CHAPTER 3
DESIGN METHODOLOGY

Next, there must be solid answers to three further questions: how should the digit stream be adapted to best take advantage of underlying features in the data, what are the optimal circuit families for such a task, and how should data be encoded to get the most out of those circuit families?

3.1 Digit-Serial Adaptivity

The end goal of adapting digit streams to characteristics in the underlying data is to reduce the amount of computation required to complete the arithmetic operations typically used throughout the execution of an application. There is a tenuous balance between increasing control complexity to take advantage of some underlying characteristic and minimizing the overhead of that control to reduce the per-digit cost of the remaining computation. Ultimately, without a full design-space exploration, there is no clear answer on what that balance should be. However, there are several features in the underlying data that should be considered.

First, in add and subtract operations, the carry signal propagates from the least significant digit to the most significant. While it is possible to implement serial most-significant-digit-first addition, doing so introduces overhead, and this overhead should not be overlooked. As noted in Fig. 31, addition and subtraction account for the vast majority of integer arithmetic operations; they are part of the comparison operator and a heavily used sub-operator of multiplication and division. Therefore, that overhead will add up very quickly.

Second, another commonly used operator, compare to zero, can often be resolved by simply looking at the most significant digit of each input operand. An MSB first datapath might make it possible to cancel any unnecessary computation on the remaining digits.
Because compare to zero operations are so common, this would affect a significant number of operations. However, it is unclear how efficient it would be to exploit this feature and what the overall benefit may be in relation to the overhead of carry chain propagation.

Taking advantage of the bitwidth in Fig. 32 and the alignment in Fig. 34 is easily accomplished with an LSB first datapath. Meanwhile, an MSB first datapath would have to determine a maximum possible bitwidth and keep track of the offset to the most significant bit of each value. Undoubtedly, this means that an MSB first datapath cannot easily take advantage of this feature without extremely high control overhead. Fig. 35 suggests that run-length encoding is unlikely to help much, particularly if the digits are any wider than 1 bit. For example, a run of 8 bits only occurs in one out of every 100 operations.

Given these features, there are three possible approaches. The first approach is a fixed-point LSB first digit-serial datapath in which bitwidth is compressed by marking the end of the stream. This takes advantage of the natural flow of the carry chain, the bitwidth distribution, and the alignment distribution to make the control circuitry as simple as possible while extracting what is likely to be the majority of the energy and throughput benefit.

The second approach is a floating-point, arbitrary-precision, redundant-encoding MSB first digit-serial datapath. This aggressively targets all of the available features. Providing support for floating point inherently compresses the bitwidth on the MSB side while arbitrary precision compresses it on the LSB side. Likely, floating point support would need to be implemented by encoding the exponent as its own value alongside the significand. Meanwhile, the arbitrary precision support can be implemented by simply marking the end of the stream. Redundant encoding makes it unnecessary to wait for the whole carry chain before resolving a digit when executing anything other than a compare operator. However, since compare operators only decide a boolean value, they can implement this wait without actively tracking the length of the carry chain. Because the most significant digit is the first in the stream, compare operators can finish quickly and cancel the rest of the incoming digits. The primary downside of this approach will likely be the complexity of implementing floating point support. In the end, the redundant encoding does not require an unreasonable amount of overhead, and the overhead of the arbitrary precision support is replicated in the LSB first approach as bitwidth compression.

The third approach is a variable-length, variable-precision, fixed-point combined LSB/MSB first datapath. This removes the need for MSB-first floating point control circuitry and adds arbitrary precision support to the datapath. Therefore, one could migrate some of the floating-point arithmetic over to such a datapath and reduce demand on the floating-point ALU.

Given that the second and third approaches have significantly higher complexity, likely reducing the overall capacity of the CGRA, this work will concentrate heavily on the first approach.

3.2 Circuit Topology

To implement the first approach, each value is encoded into a stream of tokens. Each token communicates the value of a digit alongside a flag marking whether it is the last one. The first token communicates the least significant digit while the last communicates the most significant.
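A behavioral sketch of this token stream may help, assuming radix-4 digits and unsigned values for simplicity (the actual datapath handles signed values through sign extension, covered later in this thesis). It shows how a length-adaptive serial add performs work proportional to the longer operand rather than to a fixed word width; all names are hypothetical.

def to_tokens(x, base=4):
    # Encode an unsigned value as LSB-first digit tokens; each token is
    # (digit, last) where last marks the end of the stream.
    digits = []
    while True:
        digits.append(x % base)
        x //= base
        if x == 0:
            break
    return [(d, i == len(digits) - 1) for i, d in enumerate(digits)]

def serial_add(a_tokens, b_tokens, base=4):
    # Add two token streams LSB first, carrying between digits.  The
    # loop runs for max(len(a), len(b)) digits, so short operands
    # finish early regardless of the machine's nominal word width.
    out, carry = [], 0
    n = max(len(a_tokens), len(b_tokens))
    for i in range(n):
        a = a_tokens[i][0] if i < len(a_tokens) else 0
        b = b_tokens[i][0] if i < len(b_tokens) else 0
        carry, digit = divmod(a + b + carry, base)
        out.append(digit)
    if carry:
        out.append(carry)
    return [(d, i == len(out) - 1) for i, d in enumerate(out)]

print(serial_add(to_tokens(13), to_tokens(6)))  # 19 = 103 in radix 4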
Ultimately, there are a large number of circuit topologies to choose from. Historically, the most successful choice has been synchronous logic. However, as discussed in Chapter 1, synchronous circuits are inefficient at implementing the irregular data and compute patterns required by an adaptive digit-serial CGRA; in the end, the request and acknowledge signals such an implementation requires are nearly identical to those of self-timed circuitry.

There are quite a few self-timed circuit families to choose from. The Quasi Delay-Insensitive (QDI) circuit families are the most reliable Turing complete option [79]. They have demonstrated correct operation through wide temperature and voltage swings, smoothly scaling operating frequency along the way [173]. They are also very flexible, implementing all kinds of complex behaviors, such as those found in [83] and [75]. As long as the acknowledgement requirement is maintained [82], this flexibility allows the designer to take advantage of irregular data and control patterns to skip unnecessary work, increasing throughput and saving energy.

However, these features come at a cost. Delay insensitive data encodings [85] must communicate both the data and its validity. This requirement introduces overhead as shown in Fig. 36.

Fig. 36: Wires required to represent a bit using a specific MofN code (top), relative energy required to communicate each bit (middle), and transistors per bit required to implement a validity gate for a specific MofN code assuming simple transistor sharing trees (bottom). Each curve shows a single selection of M while sweeping N.
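The wire-count curve in the top panel of Fig. 36 can presumably be reproduced with a short calculation: an MofN code raises exactly M of its N wires, giving C(N, M) legal codewords and therefore log2 C(N, M) bits carried across N wires. A sketch under that assumption:

from math import comb, log2

def wires_per_bit(m, n):
    # An MofN delay-insensitive code raises exactly m of n wires, so it
    # has comb(n, m) codewords and carries log2(comb(n, m)) bits.
    return n / log2(comb(n, m))

# Dual-rail (1of2) needs 2 wires per bit; 1of4 carries 2 bits on 4 wires.
print(wires_per_bit(1, 2))  # 2.0
print(wires_per_bit(1, 4))  # 2.0
print(wires_per_bit(2, 4))  # ~1.55: denser, but costlier to validate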
Ultimately, the necessary validity and neutrality checks can also require quite a few devices, consuming a great deal of die area compared to typical synchronous encodings. Because of this, designers generally avoid the more complex codes in favor of the smaller one-hot codes. Unlike clocked signals, data transmission absolutely requires a transition, burning energy even if the same value is sent multiple times. Steps can be taken to mitigate this problem, but all generalized approaches require a not-insignificant amount of circuitry. Finally, the acknowledgement requirement can wreak havoc on performance in certain scenarios; in particular, signals with large fan-out or fan-in require a rather large C-element tree. In the end, using QDI circuits in a parallel datapath with regular data patterns is extremely expensive. These features do, however, make QDI circuits well-suited for control circuitry.

Decades of industry effort have demonstrated how to implement wide datapath logic that is efficiently synchronized to a central control. Bundled-data circuits [63][74] demonstrate how this may be achieved for a QDI control rather than a clock. Historically, bundled-data circuits have maintained extremely simple control, staying as close to synchronous microarchitectures as possible. This effectively turns the asynchronous pipeline control into a reliable clock-distribution network that does not suffer from clock skew or jitter.

This work uses the strengths of both QDI and bundled-data to implement an efficient datapath with a flexible and expressive control. To my knowledge, no other work has ventured in this particular direction.

3.3 QDI Control Circuits

Under the delay insensitive (DI) delay model, a circuit should operate correctly independent of gate and wire delays. Correct operation means that the circuit remains stable, non-interfering, and deadlock-free. An instability, or glitch, can cause data loss or lead to interference; interference, or a short, can cause permanent circuit damage; and deadlock halts the computation prematurely. To achieve this goal, every transition must acknowledge every input to its driving gate. A transition b acknowledges another transition a if there is a causal sequence of transitions from a to b that prevents b from firing until after a has completed [90].

In order for this model to be Turing complete, the quasi-delay insensitive (QDI) delay model makes one exception to acknowledgement called the Isochronic Fork Assumption. If there is a wire fork to multiple gates, and one of those gates does not acknowledge all of the transitions on that wire, then the delay from the driver to the non-acknowledging gate is assumed to be bounded. In this model, it is always safe to place an inverter before the wire fork. However, because gates have unbounded delay, placing an inverter after an isochronic fork and before the non-acknowledging gate can cause an instability. Because the isochronic fork timing assumption is easy to guarantee and maintain, real QDI circuits are robust by construction to temperature variation, process variation, sizing, noise, etc. [161]. For a more detailed discussion of the QDI model and this timing assumption, see [90], [91], and [82].

Implementing control circuitry following the QDI delay model is often a difficult undertaking by itself. QDI circuits are often written in a control-flow language called Communicating Hardware Processes (CHP), described in Appendix A, and then synthesized into a Production Rule Set (PRS), described in Appendix B, using two basic methods. The first, Syntax-Directed Translation [66][62], maps the program syntax onto a predefined library of clockless processes through structural induction, creating a circuit that strictly respects the control flow behavior of the original program. Well formulated examples of this method may be found in [96] and [97]. The second, Formal Synthesis [80][65], iteratively applies a small set of formal program transformations like projection and process decomposition, decomposing the program until the resulting processes each represent a single pipeline stage. Then, these stages are synthesized into production rules using Martin Synthesis. This approach respects data dependencies, but not necessarily the original control-flow behavior of the specification [84].

This work uses a well-known hybrid approach, Templated Synthesis [68]. First, formal transformations are applied to decompose a CHP description into a collection of simple Dynamic Single Assignment (DSA) [86] CHP processes. Then, various template patterns and micro-architectural optimizations are applied to synthesize PRS, which are automatically verified and compiled into circuits. Overall, [68] describes different ways to handle the standard pipeline stage or “buffer”:

*[ L?v; R!v ]

Ultimately, each pipeline stage is implemented over four signals as shown in Fig. 37: the input requests Lr, the input enable Le, the output requests Rr, and the output enable Re. The C-elements used to drive the output requests are called the “forward drivers”. Each channel goes through two phases. In the “set phase”, the input requests go high along with the output enable.
The forward drivers drive the output requests high, and the input enable “acknowledges” the input requests by transitioning low. In the “reset phase”, the input requests are “reset”, meaning they transition low. The output enable acknowledges the output requests, transitioning low as well. As a result, the forward drivers reset the output requests and the input enable transitions high, “enabling” the input channel. These events may be implemented with one of three primary orderings, or “reshufflings”.

Fig. 37: Channel protocols for QDI buffers.

Weak-Condition Half Buffer (WCHB)

*[ [Re ∧ Lr]; Rr↾; Le⇂; [¬Re ∧ ¬Lr]; Rr⇂; Le↾ ]

The WCHB is the smallest and fastest of the three reshufflings. It is ultimately a symmetric handshake: the forward drivers always wait for both the output enable and the input request before transitioning. This makes conditional acknowledgement of the inputs and conditional output requests easier to handle because the reset of the forward drivers is mapped directly to the reset of the input requests and the acknowledgement of the output requests. However, this feature also means that the WCHB reshuffling tends to struggle with long transistor stacks in the reset phase of the forward drivers. If this happens, then there are a few strategies to mitigate the problem, the primary one being to go back to the CHP and decompose the process further.

Synthesis of most processes should start with a WCHB reshuffling. A 1-bit WCHB buffer is 20 transistors and has the smallest cycle time. With the optimization rules presented in the next section, it is generally possible to implement most functionality with a cycle time of 10 to 12 transitions.

Re ∧ Lr0 → Rr0↾
¬Re ∧ ¬Lr0 → Rr0⇂
Re ∧ Lr1 → Rr1↾
¬Re ∧ ¬Lr1 → Rr1⇂
Rr0 ∨ Rr1 → Le⇂
¬Rr0 ∧ ¬Rr1 → Le↾

Pre-Charge Half Buffer (PCHB)

*[ [Re ∧ Lr]; Rr↾; Le⇂; [¬Re]; Rr⇂; [¬Lr]; Le↾ ]

The PCHB reshuffling allows the output request to reset before the input request. That means it should generally be used if the input channel is significantly slower than the output channel in one or more cases. A 1-bit PCHB buffer is 34 transistors, and the cycle time is 12 to 14 transitions.

Re ∧ Le ∧ Lr0 → Rr0↾
¬Re ∧ ¬Le → Rr0⇂
Re ∧ Le ∧ Lr1 → Rr1↾
¬Re ∧ ¬Le → Rr1⇂
Lr0 ∨ Lr1 → Ln⇂
¬Lr0 ∧ ¬Lr1 → Ln↾
Rr0 ∨ Rr1 → Rn⇂
¬Rr0 ∧ ¬Rr1 → Rn↾
¬Ln ∧ ¬Rn → Le⇂
Ln ∧ Rn → Le↾

Pre-Charge Full Buffer (PCFB)

*[ [Re ∧ Lr]; Rr↾; Le⇂; en⇂; ([¬Re]; Rr⇂ ∥ [¬Lr]; Le↾); en↾ ]

The PCFB reshuffling lets the output request and input enable reset in parallel. This allows a token to wait at every stage of the pipeline instead of every other stage. In general, a PCFB reshuffling should not be used. However, if the design is latency or energy sensitive and the output channel is significantly slower than the input channel in one or more cases, then a PCFB allows the input enable to reset before the output enable. If the design is not latency or energy sensitive, then two WCHB buffers should take its place; this has fewer transistors and operates at higher throughput. A 1-bit PCFB buffer is 42 transistors with a cycle time of 14 transitions.

en ∧ Re ∧ Lr0 → Rr0↾
¬en ∧ ¬Re → Rr0⇂
en ∧ Re ∧ Lr1 → Rr1↾
¬en ∧ ¬Re → Rr1⇂
Rr0 ∨ Rr1 → Rn⇂
¬Rr0 ∧ ¬Rr1 → Rn↾
Lr0 ∨ Lr1 → Ln⇂
¬Lr0 ∧ ¬Lr1 → Ln↾
¬_en ∧ ¬Ln ∧ ¬Rn → Le⇂
_en ∧ Ln → Le↾
¬Le → en⇂
Rn ∧ Le → en↾
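These handshakes can be sanity-checked behaviorally. The sketch below is a toy production-rule evaluator stepped over the dual-rail WCHB rules from above, together with a trivial environment: a source that streams 0 digits and a sink that acknowledges every token. The rule firing order stands in for arbitrary gate delays; all names are hypothetical, and this is not the evaluation toolset described later in this chapter.

# Node values start at handshake reset: requests low, enables high.
state = {'Lr0': 0, 'Lr1': 0, 'Le': 1, 'Rr0': 0, 'Rr1': 0, 'Re': 1}

# Each production rule is (guard, node, value): when the guard is true
# and the node does not already hold the value, the rule may fire.
rules = [
    # WCHB dual-rail buffer, transcribed from the rules above.
    (lambda s: s['Re'] and s['Lr0'], 'Rr0', 1),
    (lambda s: not s['Re'] and not s['Lr0'], 'Rr0', 0),
    (lambda s: s['Re'] and s['Lr1'], 'Rr1', 1),
    (lambda s: not s['Re'] and not s['Lr1'], 'Rr1', 0),
    (lambda s: s['Rr0'] or s['Rr1'], 'Le', 0),
    (lambda s: not s['Rr0'] and not s['Rr1'], 'Le', 1),
    # Environment: source sends a 0 digit whenever enabled; sink
    # acknowledges whichever output request arrives.
    (lambda s: s['Le'], 'Lr0', 1),
    (lambda s: not s['Le'], 'Lr0', 0),
    (lambda s: s['Rr0'] or s['Rr1'], 'Re', 0),
    (lambda s: not s['Rr0'] and not s['Rr1'], 'Re', 1),
]

tokens = 0
for step in range(60):
    enabled = [(n, v) for g, n, v in rules if g(state) and state[n] != v]
    if not enabled:
        break  # no enabled rules: the circuit has deadlocked
    node, value = enabled[step % len(enabled)]  # arbitrary delay choice
    state[node] = value
    tokens += node == 'Rr0' and value == 1
print('tokens delivered:', tokens)  # several complete handshakes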
3.4 Synthesis Strategy

This synthesis approach builds upon Andrew Lines' Templated Synthesis method, starting with a flattened DSA CHP specification of a single pipeline stage process and deriving energy-efficient, high-throughput PRS.

1. Characterize the environment and state: Learn as much as possible about the circumstances under which this circuit will operate. Specifically, what are the possible input values? What are the possible output values? What kind of information needs to be stored? How frequently is each input value, output value, or state used? When is the internal state switching and when is it stable? Are there any relations between their usages? For example, might there be a common sequence of these values?

2. Encodings: What are all the possible ways in which the input requests, output requests, and internal state can be used to represent the function being implemented? Pick one of those encodings, concentrating on encodings that seem to play well with the characterizations from step 1.

3. Constraints and orderings: For this encoding, what are all of the different constraints? What are all the possible ways to order events? There is often a constraint that turns out to be not as strict as initially thought. Pick an ordering that seems to play well with the characterizations from step 1.

4. Group functionally equivalent behaviors: Group signals that have similar behaviors throughout the handshake. There are three parts of the handshake to consider in this process: the output requests, the input acknowledgement, and the internal memory. This helps to reduce the total number of forward drivers and simplify the transistor stacks in the reset phase of the forward drivers. In many cases, this can have a dramatic effect on all performance metrics. However, it often requires a lot of trial and error. Ultimately, C-elements are an expensive gate, and all QDI handshake protocols dictate that the output request lines require C-elements for implementation. The goal is to reduce the number of necessary C-elements as much as possible.

5. Make an attempt: Start to place all of these behaviors into the WCHB template, and learn from this process. What features make for a good encoding? What features make for a good ordering?

6. Push complexity out: If a particular part of the interface is making the reshuffling particularly complex, switch the interface to make it easier. This pushes the complexity out of the unit being developed; it will need to be dealt with when developing the interfacing units. Oftentimes this is easier, but it requires the interfacing units to be specialized for that particular situation. If not, then push complexity out of those units as well.

7. Avoid staticizers: Use combinational gates when possible. It is often possible to convert a C-element to a combinational gate by adding a few vacuous transitions to the handshake. This removes the need for a staticizer, benefiting all metrics. Ultimately, in a WCHB handshake, only the gates driving the output request wires need to be implemented with C-elements. Keep in mind that a gate need not be combinational to avoid a staticizer, but it must be driven in every possible state of the handshake.

8. Iterate: Go back to step 2 or 3 as necessary. As more encodings and orderings are attempted, there will be a much better understanding of the encodings and orderings that work, and the ones that do not.

3.5 Microarchitectural Optimizations

[68] provides a good starting point for QDI circuit development. However, these templates also make it easy to introduce a significant amount of overhead in the circuit without realizing it.
Fortunately, there are quite a few micro-architectural optimizations that make it easier to fit more complex computation in a single pipeline stage and greatly simplify stages without such computation. With these optimization rules, it is easier to take advantage of irregularities in the data for performance gains and energy savings. All of the provided examples try to keep the optimizations separated from each other; applying all of the optimizations to all of the examples would yield much better circuits overall.

Validity Trees

When the rules for the input enable become too long, a validity tree can be used to break the computation into multiple gates. If the validity tree uses the outputs of the forward drivers, Rr0 and Rr1, for its inputs, then the critical cycle will be increased by at least two transitions. If the process has an internal memory, then the validity trees may also be used to simplify the set logic for the internal memory; doing so depends heavily upon the compatibility between the logic for the input enable and the internal memory.

Re ∧ Lr0 → Rr0↾
¬Re ∧ ¬Lr0 → Rr0⇂
Re ∧ Lr1 → Rr1↾
¬Re ∧ ¬Lr1 → Rr1⇂
Rr0 ∨ Rr1 → Rv↾
¬Rr0 ∧ ¬Rr1 → Rv⇂
Rv → Le⇂
¬Rv → Le↾

Intermediate Forward Drivers

Similarly, if the logic in the output requests does not align well to the input enable or internal memory, then intermediate forward drivers may be used. Instead of allocating a single forward driver to each output request, multiple forward drivers cover different conditions for a single output request, and a gate tree is used to do the final combination. Once again, if the gate tree uses the outputs of the forward drivers R0 or R1, then the critical cycle will be increased by at least two transitions.

Re ∧ Lr0 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ Lr1 → R1↾
¬Re ∧ ¬Lr1 → R1⇂
R0 ∨ R1 → Rr↾
¬R0 ∧ ¬R1 → Rr⇂
R0 ∨ R1 → Le⇂
¬R0 ∧ ¬R1 → Le↾

nLatch Internal Memory

Internal state is often an integral part of any complex computation. Unfortunately, the strategies covered in [68] are fairly limited. In the most basic implementation, the internal memory is added as an n-latch with its signals transitioning up then down. If the internal memory is not written in the same cycle it is read, then the write rules simply acknowledge the down-going transition of the latch and the read rules use the output of the latch directly. This is demonstrated pretty well with a single-bit register. The input channel L has three requests: the write requests Lw0 and Lw1 set the value of the internal memory, while the read request Lr reads the value of the internal memory to the read channel R.

v:=0;
*[ L?l; [ lr → R!v [] lw → v:=lw ] ]

This could be implemented using the strategies in [68] as follows. The write requests are stored with two C-elements driving R0 and R1. Once stored, the input channel is acknowledged by lowering Le, and the internal memory v0 and v1 is set using the internal nodes of the forward drivers _R0 and _R1. In the reset phase of the handshake, the forward drivers wait for the down-going transition of the latch before resetting. Once the forward drivers have been reset, the input channel is enabled by raising Le. Meanwhile, the read request emits the value of the internal memory directly to the requests on R. This allows the input channel to be acknowledged by lowering Le. When the output channel is acknowledged and the input channel is reset, the forward drivers for Rr0 and Rr1 may be reset and the input channel enabled.
Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v0 → R1⇂
Re ∧ Lr ∧ v0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
¬v1 ∨ ¬_R0 → v0↾
v1 ∧ _R0 → v0⇂
¬v0 ∨ ¬_R1 → v1↾
v0 ∧ _R1 → v1⇂
R0 ∨ R1 ∨ Rr0 ∨ Rr1 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬Rr0 ∧ ¬Rr1 → Le↾

3-Valued Internal Memory (Positive)

The first thing to note is that QDI circuits are not limited to 2-valued latches. Sometimes, a process only has three internal states. Encoding this with two 2-valued latches adds unnecessary transitions on the internal memory and an unnecessary state. In this case, it is more energy efficient and easier to use a 3-valued latch. While this saves energy, it does not reduce the size of the circuit, because a 3-valued latch ultimately requires the same number of transistors as two 2-valued latches. This is demonstrated using a 3-valued register. All the behaviors remain unchanged relative to the 2-valued register, but a new write request Lw2 and a new read value Rr2 are added to the handshake.

Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 ∧ ¬v2 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v0 ∧ ¬v2 → R1⇂
Re ∧ Lw2 → R2↾
¬Lw2 ∧ ¬v0 ∧ ¬v1 → R2⇂
Re ∧ Lr ∧ v0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
Re ∧ Lr ∧ v2 → Rr2↾
¬Re ∧ ¬Lr → Rr2⇂
¬v1 ∧ ¬v2 ∨ ¬_R0 → v0↾
(v1 ∨ v2) ∧ _R0 → v0⇂
¬v0 ∧ ¬v2 ∨ ¬_R1 → v1↾
(v0 ∨ v2) ∧ _R1 → v1⇂
¬v0 ∧ ¬v1 ∨ ¬_R2 → v2↾
(v0 ∨ v1) ∧ _R2 → v2⇂
R0 ∨ R1 ∨ R2 → Wv↾
¬R0 ∧ ¬R1 ∧ ¬R2 → Wv⇂
Rr0 ∨ Rr1 ∨ Rr2 → Rv↾
¬Rr0 ∧ ¬Rr1 ∧ ¬Rr2 → Rv⇂
Rv ∨ Wv → Le⇂
¬Wv ∧ ¬Rv → Le↾

There are two things to notice in this example. First, the rule driving Le↾ would have been 6 transistors long, requiring the validity trees Wv and Rv to break up the transistor stack. Methods to solve this problem will be presented later in this section. Second, take note that the reset rules for the forward drivers of the write, R0, R1, and R2, now have to check the down-going transitions of two states in the internal memory. This can cause trouble when the computation requires longer reset rules in the forward drivers to begin with. Ultimately, there are strategies to mitigate this. For example, suppose that state 0 always transitions to state 1, state 1 always transitions to state 2, and state 2 always transitions to state 0 in a ring. In this case, R0⇂ would only need to acknowledge ¬v2, R1⇂ would only need to acknowledge ¬v0, and R2⇂ would only need to acknowledge ¬v1. In general, if there are constraints on the state transitions, they can be used to reduce these transistor stacks.

3-Valued Internal Memory (Negative)

Alternatively, if the transistor stacks in the reset phase of the forward drivers become too long and there are not any usable constraints, then it is possible to flip the sense of the internal memory from positive to negative. Instead of encoding state 0 as v0↾, v1⇂, v2⇂, it is encoded as v0⇂, v1↾, v2↾. This is reflected across the other two states as well, reducing the transistor stack length of the reset of the forward drivers for the write, but increasing the transistor stack length of the forward drivers for the read.
Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v0 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v1 → R1⇂
Re ∧ Lw2 → R2↾
¬Lw2 ∧ ¬v2 → R2⇂
Re ∧ Lr ∧ v1 ∧ v2 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v0 ∧ v2 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
Re ∧ Lr ∧ v0 ∧ v1 → Rr2↾
¬Re ∧ ¬Lr → Rr2⇂
¬v1 ∨ ¬v2 ∨ ¬_R1 ∨ ¬_R2 → v0↾
v1 ∧ v2 ∧ _R1 ∧ _R2 → v0⇂
¬v0 ∨ ¬v2 ∨ ¬_R0 ∨ ¬_R2 → v1↾
v0 ∧ v2 ∧ _R0 ∧ _R2 → v1⇂
¬v0 ∨ ¬v1 ∨ ¬_R0 ∨ ¬_R1 → v2↾
v0 ∧ v1 ∧ _R0 ∧ _R1 → v2⇂
R0 ∨ R1 ∨ R2 → Wv↾
¬R0 ∧ ¬R1 ∧ ¬R2 → Wv⇂
Rr0 ∨ Rr1 ∨ Rr2 → Rv↾
¬Rr0 ∧ ¬Rr1 ∧ ¬Rr2 → Rv⇂
Rv ∨ Wv → Le⇂
¬Wv ∧ ¬Rv → Le↾

Internal Memory Completion Signal

If none of these strategies works to reduce the length of the transistor stacks in the reset phase of the forward drivers, then a completion detection gate We can be applied to the internal memory. This adds two transitions to the path that writes the internal memory. However, those two transitions do not increase the critical cycle beyond 10 transitions, since they overlap with the handshake on L and R.

Re ∧ We ∧ Lw0 → R0↾
¬We ∧ ¬Lw0 → R0⇂
Re ∧ We ∧ Lw1 → R1↾
¬We ∧ ¬Lw1 → R1⇂
Re ∧ Lr ∧ v0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
¬v1 ∨ ¬_R0 → v0↾
v1 ∧ _R0 → v0⇂
¬v0 ∨ ¬_R1 → v1↾
v0 ∧ _R1 → v1⇂
¬_R0 ∧ ¬v1 ∨ ¬_R1 ∧ ¬v0 → We⇂
(_R0 ∨ v1) ∧ (_R1 ∨ v0) → We↾
R0 ∨ R1 ∨ Rr0 ∨ Rr1 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬Rr0 ∧ ¬Rr1 → Le↾

pLatch Internal Memory

If there are other signals that would benefit from the internal memory completion detection gate, but their up-going sense must be checked instead of their down-going sense, then the internal memory can be flipped from an n-latch to a p-latch to accommodate them. This does not change the critical cycle; instead, it removes the inverter that was previously on the completion detection signal and uses the inverter on the C-element of the forward driver instead.

Re ∧ We ∧ Lw0 → R0↾
¬Lw0 ∧ ¬We → R0⇂
Re ∧ We ∧ Lw1 → R1↾
¬Lw1 ∧ ¬We → R1⇂
Re ∧ Lr ∧ v0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
¬v1 ∧ ¬R1 → v0↾
v1 ∨ R1 → v0⇂
¬v0 ∧ ¬R0 → v1↾
v0 ∨ R0 → v1⇂
R0 ∧ v0 ∨ R1 ∧ v1 → We⇂
(¬R0 ∨ ¬v0) ∧ (¬R1 ∨ ¬v1) → We↾
R0 ∨ R1 ∨ Rr0 ∨ Rr1 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬Rr0 ∧ ¬Rr1 → Le↾

C-element Internal Memory

Of course, all of these approaches can be mixed. This makes the internal memory a C-element instead of a latch. In this example, R0 drives v0↾; v1⇂ as an n-latch would, and R1 drives v0⇂; v1↾ as a p-latch would. Notably, this C-element is not combinational, but it also does not require an explicit staticizer. This is because R0 and R1 are guaranteed to be mutually exclusive. Specifically, if _R0 is low, driving v0↾, then R1 is guaranteed to be low, so the rule driving v0⇂ is off. If R1 is high, driving v0⇂, then _R0 is guaranteed to be high, so the rule driving v0↾ is off. If the handshake is in a neutral state and _R0 is high while R1 is low, then v0 is staticized based upon the value of v1.

Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 → R0⇂
Re ∧ We ∧ Lw1 → R1↾
¬Lw1 ∧ ¬We → R1⇂
Re ∧ Lr ∧ v0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ v1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
¬v1 ∧ ¬R1 ∨ ¬_R0 → v0↾
v1 ∧ _R0 ∨ R1 → v0⇂
¬v0 → v1↾
v0 → v1⇂
¬R1 ∨ ¬v1 → We↾
R1 ∧ v1 → We⇂
R0 ∨ R1 ∨ Rr0 ∨ Rr1 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬Rr0 ∧ ¬Rr1 → Le↾

Note that the completion gate We need only check the up-going transition on v1; the down-going transition of v1 is acknowledged by the reset rule for R0.

Protecting Forward Drivers (Mutex)

Suppose that the internal memory is written and read in the same cycle.
For example, the following process records the previous data from the input channel L and sends the XOR of the previous and current data on the output channel R.

*[ L?l; R!(l^v); v:=l ]

In this example, there are four forward drivers covering each case of the XOR: R0 covers v0, Lr0; R1 covers v1, Lr0; R2 covers v0, Lr1; and R3 covers v1, Lr1. For R0 and R3, the result of the XOR is zero, so Rr0 is driven high. For R1 and R2, the result of the XOR is one, so Rr1 is driven high. For R0 and R3, the input request has the same value as the internal memory, so the internal memory is left unchanged. Meanwhile, R1 and R2 swap the value of the internal memory.

Re ∧ v0 ∧ Lr0 ∧ _R1 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ v1 ∧ Lr0 → R1↾
¬Re ∧ ¬v1 ∧ ¬Lr0 → R1⇂
Re ∧ v0 ∧ Lr1 → R2↾
¬Re ∧ ¬v0 ∧ ¬Lr1 → R2⇂
Re ∧ v1 ∧ Lr1 ∧ _R2 → R3↾
¬Re ∧ ¬Lr1 → R3⇂
¬v1 ∨ ¬_R1 → v0↾
v1 ∧ _R1 → v0⇂
¬v0 ∨ ¬_R2 → v1↾
v0 ∧ _R2 → v1⇂
R0 ∨ R3 → Rr0↾
¬R0 ∧ ¬R3 → Rr0⇂
R1 ∨ R2 → Rr1↾
¬R1 ∧ ¬R2 → Rr1⇂
R0 ∨ R1 ∨ R2 ∨ R3 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 ∧ ¬R3 → Le↾

If _R1 did not gate R0↾, or _R2 did not gate R3↾, then the forward driving rules would be unstable. For example, suppose that Lr0 transitions high while v1 is high. This means that R1 would go high as a result. Then, before L is acknowledged, R1 drives v0 high. This enables R0↾ while R1 is already high, breaking the mutual exclusion requirement of the delay insensitive encoding. When L is acknowledged and Lr0 transitions low, R0↾ will be disabled, causing a glitch to propagate out Rr0.

There are three techniques to avoid this instability. The first is to manually guarantee mutual exclusion of the offending forward drivers using the internal nodes of the others, as demonstrated above. This is the fastest method, maintaining a 10 transition cycle time. However, it also makes the transistor stacks of the forward drivers one transistor longer.

Protecting Forward Drivers (Output Enable)

The second technique to protect the forward drivers from this instability is to wait until the output request is acknowledged before changing the internal state. Once the output request has been acknowledged, Re is low, cutting off the up-going rules of the forward drivers. Because the internal memory now has to wait for the handshake on the output channel, the critical cycle is increased to 12 transitions. This technique should only be used if the transistor stacks in the forward drivers are already too long for the mutex approach. It has the added benefit of reducing the transistor stacks in the reset phase as well: since v0↾ acknowledges Re⇂, R1 does not have to. This transistor stack length optimization only works if the internal memory is guaranteed to transition as a result of this forward driver.

Re ∧ v0 ∧ Lr0 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ v1 ∧ Lr0 → R1↾
¬v1 ∧ ¬Lr0 → R1⇂
Re ∧ v0 ∧ Lr1 → R2↾
¬v0 ∧ ¬Lr1 → R2⇂
Re ∧ v1 ∧ Lr1 → R3↾
¬Re ∧ ¬Lr1 → R3⇂
¬v1 ∨ ¬Re ∧ ¬_R1 → v0↾
v1 ∧ (Re ∨ _R1) → v0⇂
¬v0 ∨ ¬Re ∧ ¬_R2 → v1↾
v0 ∧ (Re ∨ _R2) → v1⇂
R0 ∨ R3 → Rr0↾
¬R0 ∧ ¬R3 → Rr0⇂
R1 ∨ R2 → Rr1↾
¬R1 ∧ ¬R2 → Rr1⇂
R0 ∨ R1 ∨ R2 ∨ R3 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 ∧ ¬R3 → Le↾

Protecting Forward Drivers (Input Request)

If a forward driver writes the internal memory but does not make an output request, then the output request is not available to gate the transitions on the internal memory and protect the forward drivers from instability. In this case, the input requests may be used to do so. Unfortunately, if the input requests are not easily mapped to the transitions in the internal memory, then this will become very messy.
Like the output enable approach, this increases the critical cycle to 12 transitions.

Re ∧ v0 ∧ Lr0 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ v1 ∧ Lr0 → R1↾
¬Re ∧ ¬v1 → R1⇂
Re ∧ v0 ∧ Lr1 → R2↾
¬Re ∧ ¬v0 → R2⇂
Re ∧ v1 ∧ Lr1 → R3↾
¬Re ∧ ¬Lr1 → R3⇂
¬v1 ∨ ¬Lr0 ∧ ¬_R1 → v0↾
v1 ∧ (Lr0 ∨ _R1) → v0⇂
¬v0 ∨ ¬Lr1 ∧ ¬_R2 → v1↾
v0 ∧ (Lr1 ∨ _R2) → v1⇂
R0 ∨ R3 → Rr0↾
¬R0 ∧ ¬R3 → Rr0⇂
R1 ∨ R2 → Rr1↾
¬R1 ∧ ¬R2 → Rr1⇂
R0 ∨ R1 ∨ R2 ∨ R3 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 ∧ ¬R3 → Le↾

Exchange Channels (Positive)

Sometimes, the implementation of some functionality requires data to be communicated in both directions. The sending process sends data as part of the request, and the receiving process sends data back as part of the acknowledge or enable. Effectively, the two processes exchange data every cycle. Often, the data sent by the receiving process will be status data informing the sending process whether the computation has completed. There are two ways of achieving this, depending upon the phase of the handshake in which the receiving process communicates its data.

In the first approach, the receiving process communicates its data during the reset phase as part of the enable of the input requests. The input requests are acknowledged when the encoding is in a neutral state and enabled when the encoding has a valid value. This means that the value being returned to the sender remains valid until another token is sent. The example used here is a simple exchange channel buffer. The receiving channel L receives l from the input requests, then returns r. The sending channel R forwards the received value l and is returned a new value for r. Because the new value for r is received from the channel R after the old value has been sent across L, it must be stored in an internal memory for a cycle.

r:=0;
*[ L?l!r; R!l?r ]

The implementation turns out to be fairly straightforward. Only one of the output enables will be high, signifying the data that is being returned. This is combined with the input request to form four intermediate forward drivers. R0 forwards the output request Rr0 and sets the internal memory to 0. R1 forwards the output request Rr0 and sets the internal memory to 1. Similarly, R2 and R3 both forward the output request Rr1 and set the internal memory to 0 and 1 respectively.

Re0 ∧ Lr0 → R0↾
¬Re0 ∧ ¬Lr0 ∧ ¬v1 → R0⇂
Re1 ∧ Lr0 → R1↾
¬Re1 ∧ ¬Lr0 ∧ ¬v0 → R1⇂
Re0 ∧ Lr1 → R2↾
¬Re0 ∧ ¬Lr1 ∧ ¬v1 → R2⇂
Re1 ∧ Lr1 → R3↾
¬Re1 ∧ ¬Lr1 ∧ ¬v0 → R3⇂
¬v1 ∨ ¬Le0 ∧ (¬_R0 ∨ ¬_R2) → v0↾
v1 ∧ (Le0 ∨ _R0 ∧ _R2) → v0⇂
¬v0 ∨ ¬Le1 ∧ (¬_R1 ∨ ¬_R3) → v1↾
v0 ∧ (Le1 ∨ _R1 ∧ _R3) → v1⇂
R0 ∨ R1 → Rr0↾
¬R0 ∧ ¬R1 → Rr0⇂
R2 ∨ R3 → Rr1↾
¬R2 ∧ ¬R3 → Rr1⇂
v1 ∨ Rr0 ∨ Rr1 → Le0⇂
¬v1 ∧ ¬Rr0 ∧ ¬Rr1 → Le0↾
v0 ∨ Rr0 ∨ Rr1 → Le1⇂
¬v0 ∧ ¬Rr0 ∧ ¬Rr1 → Le1↾

The forward drivers lower both of the input enables, setting the input enable to a neutral state. Because the internal memory is used by the input enable, it must wait until the input enable is in a neutral state before transitioning. If the internal memory transitions before this happens, then the value held on the input enable will start to switch, and when the input enable is finally driven neutral by the forward drivers, it would cause a glitch. Once the input enable is lowered, the internal memory is set according to the value stored in the forward drivers. Each forward driver then waits for its respective reset conditions and resets. When the input requests are enabled, the internal memory is used to gate the input enable and ensure that the correct value is returned.
Le0 is blocked if the internal memory is set to 1, and Le1 is blocked if the internal memory is set to 0. Then, the gates driving Le0 and Le1 are made combinational using v0 and v1.

Exchange Channels (Negative)

In the second approach, the receiving process communicates its data in the set phase as part of the acknowledge of the input requests. The input requests are enabled when the input enables are in a neutral state and acknowledged when they hold a valid value. This means that the value being returned by the receiving process is only valid while the token is being sent, and it is up to the sending process to record that value until the next cycle. Going back to the example of the exchange channel buffer, the forward drivers record the four cases derived from the combination of the internal memory and the input request. Both output enables will be high, but the internal memory records the last enable to go low. Therefore, the forward drivers only need to acknowledge the up-going transition of the output enable that last went low. Now, R0 and R1 set the output request Rr0 and acknowledge with Le0 and Le1 respectively. R2 and R3 set the output request Rr1 and acknowledge with Le0 and Le1 respectively.

Re0 ∧ v0 ∧ Lr0 → R0↾
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr0 → R0⇂
Re1 ∧ v1 ∧ Lr0 → R1↾
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr0 → R1⇂
Re0 ∧ v0 ∧ Lr1 → R2↾
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr1 → R2⇂
Re1 ∧ v1 ∧ Lr1 → R3↾
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr1 → R3⇂
¬v1 ∨ ¬Re0 → v0↾
v1 ∧ Re0 → v0⇂
¬v0 ∨ ¬Re1 → v1↾
v0 ∧ Re1 → v1⇂
R0 ∨ R1 → Rr0↾
¬R0 ∧ ¬R1 → Rr0⇂
R2 ∨ R3 → Rr1↾
¬R2 ∧ ¬R3 → Rr1⇂
R0 ∨ R2 → Le0⇂
¬R0 ∧ ¬R2 → Le0↾
R1 ∨ R3 → Le1⇂
¬R1 ∧ ¬R3 → Le1↾

The internal memory is set directly from the output enable, and the reset of the forward drivers no longer maps to the value returned by the output enable or the resulting value of the internal memory. This means that an XOR gate is required to check the completion of the transition on the internal memory. Following the reset of the forward drivers, the input is enabled, with both Le0 and Le1 transitioning high to a neutral state.

Storing a 1of2 Request (Inverted Singlerail Out)

If an internal memory simply records the value of a delay insensitive encoding, then it is often possible to have that happen directly. In this example, a dualrail request r0, r1 is stored by the p-latch v0, v1. An XOR gate is then used to generate the completion signal. The down-going transition of _o signals that the data on r0, r1 is valid and has been successfully stored in v0, v1. The up-going transition signals that r0, r1 has transitioned to a neutral state. If the input request changes the value of the internal memory, then there are three transitions from that input request to the down-going transition on _o; for example, r0↾ causes v1⇂; v0↾; _o⇂. If the input request is the same value as the internal memory, then there is only one transition, _o⇂.

v1 ∨ r1 → v0⇂
¬v1 ∧ ¬r1 → v0↾
v0 ∨ r0 → v1⇂
¬v0 ∧ ¬r0 → v1↾
v0 ∧ r0 ∨ v1 ∧ r1 → _o⇂
(¬v0 ∨ ¬r0) ∧ (¬v1 ∨ ¬r1) → _o↾

Memory Gated Forward Drivers

This optimization is a combination of the intermediate forward drivers and the internal memory unit. Suppose multiple forward drivers share the same behaviors regarding the input enable and the internal memory, but drive different output requests. Further suppose that the internal memory is able to differentiate between the two output request cases in a stable way throughout the handshake.
Memory Gated Forward Drivers

This optimization is a combination of the intermediate forward drivers and the internal memory unit. Suppose multiple forward drivers share the same behaviors regarding the input enable and the internal memory, but drive different output requests. Further suppose that the internal memory is able to differentiate between the two output request cases in a stable way throughout the handshake. Then, the forward drivers for those output requests may be combined into one, and the intermediate forward drivers method may be used with the internal memory to generate the separated output requests. This may also combine forward drivers that have output requests with forward drivers that do not.

In this example, there are three input requests. Lw0 sets the internal memory to 0 , Lw1 sets the internal memory to 1 , and Lr is conditionally forwarded on R depending upon the value of v . If v is 0 , then Lr is simply acknowledged with no further action. If v is 1 , then the request is forwarded across R .

*[L?l; [ lw → v:=lw ▯ lr ∧ v=0 → skip ▯ lr ∧ v=1 → R! ]]

The write commands Lw0 and Lw1 are implemented following the internal memory strategies discussed previously. However, the read command Lr only requires a single C-element driving R2 . Because the internal memory remains stable during the handshake from Lr , it can be used to gate the intermediate forward driver R2 to conditionally raise the request on Rr . The reset of R2 must wait for the output request to be acknowledged if an output request was forwarded. Once again, the internal memory can be used to differentiate the two cases. A behavioral sketch of this process follows the production rules.

Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v0 → R1⇂
Re ∧ Lr → R2↾
(¬Re ∨ ¬v1) ∧ ¬Lr → R2⇂
R2 ∧ v1 → Rr↾
¬R2 ∨ ¬v1 → Rr⇂
¬v1 ∨ ¬_R0 → v0↾
¬v0 ∨ ¬_R1 → v1↾
v1 ∧ _R0 → v0⇂
v0 ∧ _R1 → v1⇂
R0 ∨ R1 ∨ R2 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 → Le↾
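As a cross-check of the specification, the short Python sketch below models the process at the token level; it is an illustrative abstraction, not part of the toolset of Section 3.10.

def memory_gated_unit(commands):
    """Token-level model of *[L?l; [ lw -> v:=lw ▯ lr ∧ v=0 -> skip ▯ lr ∧ v=1 -> R! ]].
    commands is a list of ('w', 0 or 1) writes and ('r',) reads; returns
    how many read requests were forwarded on R."""
    v = 0
    forwarded = 0
    for cmd in commands:
        if cmd[0] == 'w':
            v = cmd[1]      # Lw0 / Lw1 set the internal memory
        elif v == 1:
            forwarded += 1  # Lr with v=1 raises the request on Rr
        # Lr with v=0 is acknowledged with no further action
    return forwarded

assert memory_gated_unit([('w', 1), ('r',), ('w', 0), ('r',), ('r',)]) == 1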
3.6 Half-Cycle Timing Assumption

A very constrained timing assumption beyond the Isochronic Fork may be used to remove extra transitions introduced by some of these approaches. Specifically, the Half-Cycle Timing Assumption (HCTA) introduced in [72] and [73] helps to keep the cycle time within 10 transitions. A C-element has an internal node that is staticized using the output. The HCTA assumes that the internal node will be successfully staticized before the inputs to the C-element cut off the main driver. This allows the internal node to be used separately from the output node. There are three primary cases where this can help.

Validity Trees

Instead of building the validity tree from the output nodes Rr0 and Rr1 and adding two transitions to the handshake, the HCTA would use the internal nodes of the forward drivers _Rr0 and _Rr1 . This keeps the critical cycle time unchanged, and while this is a timing assumption, it is not altogether difficult to guarantee in layout. Specifically, this assumes that Rr0 and Rr1 resolve, staticizing _Rr0 and _Rr1 , before Lr0 or Lr1 are lowered, cutting off the main driver. Ultimately, this means that one transition internal to a process must complete before four across an external channel. If the input requests are lowered before the outputs of the forward drivers resolve, then the staticizers of the forward drivers may drive the internal node back up, causing a glitch and likely deadlock.

Re ∧ Lr0 → Rr0↾
¬Re ∧ ¬Lr0 → Rr0⇂
Re ∧ Lr1 → Rr1↾
¬Re ∧ ¬Lr1 → Rr1⇂
¬_Rr0 ∨ ¬_Rr1 → Rv↾
_Rr0 ∧ _Rr1 → Rv⇂
Rv → Le⇂
¬Rv → Le↾

Intermediate Forward Drivers

Similarly, the Half-Cycle Timing Assumption can be used to generate a combined output from the forward drivers without introducing any extra transitions. The tree is started from the internal nodes of the forward drivers _R0 and _R1 . This keeps the critical cycle unchanged, assuming that R0↾ or R1↾ resolves before Re is lowered. Once again, this means that one transition internal to a process must complete before four across a channel. If the output enable is lowered before the forward drivers resolve, the staticizers can once again cause a glitch or deadlock.

Re ∧ Lr0 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ Lr1 → R1↾
¬Re ∧ ¬Lr1 → R1⇂
¬_R0 ∨ ¬_R1 → Rr↾
_R0 ∧ _R1 → Rr⇂
R0 ∨ R1 → Le⇂
¬R0 ∧ ¬R1 → Le↾

Memory Gated Forward Drivers

For the memory gated forward drivers, the latch implementing the memory guarantees the existence of the negated signal as well. This means that the internal nodes of the forward drivers and the negated signal of the internal memory can be used to generate the output request.

Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v0 → R1⇂
Re ∧ Lr → R2↾
(¬Re ∨ ¬v1) ∧ ¬Lr → R2⇂
¬_R2 ∧ ¬v0 → Rr↾
_R2 ∨ v0 → Rr⇂
¬v1 ∨ ¬_R0 → v0↾
¬v0 ∨ ¬_R1 → v1↾
v1 ∧ _R0 → v0⇂
v0 ∧ _R1 → v1⇂
R0 ∨ R1 ∨ R2 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 → Le↾

3.7 QDI Treatment for Pass Transistor Logic

Pass transistor logic represents an interesting opportunity to dramatically expand the expressibility of a single QDI pipeline stage by reducing circuit size and complexity. There are two ways that pass transistor logic can help. First, the WCHB reshuffling often results in duplicated logic in each of the forward drivers in either the set phase, the reset phase, or both. In such cases, it is often desirable to factor that shared logic into its own gate. However, with CMOS logic this adds extra transitions to the handshake, which can slow the circuit significantly, forcing a choice between large load capacitances and long cycle times. Implementing these gates with pass transistor logic can reduce the load capacitances by factoring out the logic without adding any extra transitions to the handshake. Second, many of the optimizations that depend upon the Half-Cycle Timing Assumption can be replaced with pass transistor logic that makes no such assumption. This makes the circuit more robust without any sacrifice in performance.

Furthermore, pass transistor logic often requires differential signals for correct operation, meaning that both the signal and its inverse must be available. In a WCHB reshuffling, most signals come from either a C-element or a latch, both of which produce differential signals. The only signals that do not come from one of these two gates are the input enables. This means that pass transistor logic often fits easily into the handshake without much extra effort.

Finally, pass transistor logic can get special treatment regarding the QDI delay model. As discussed in the previous section, the QDI delay model has an acknowledgement requirement. Because pass transistor logic does not introduce a gate delay, it can ultimately be viewed as a low quality wire. This means that acknowledging the passed signal can also acknowledge the output of the pass transistor gate. Any approach that does this should not be taken lightly, however. The output load on the pass transistor gate and the sizing ultimately determine the delay of the gate. If the pass gate is sized too small relative to the load, then this feature is no longer safe. Furthermore, the pass transistor gate should be placed in layout in the same cell as the signal it passes. This reduces the delay associated with the logic.

There are a few basic pass transistor gates that can be easily applied to QDI circuits: the XOR gate, the AND gate, and the OR gate. Each of these gates introduces only one transistor to any transistor stack. This keeps the delay introduced by these gates to a minimum.

Pass Transistor XOR

There are ultimately hundreds of possible XOR gate constructions.
Each one has a collection of undesirable states that result in either interference or a weakly driven output. The interfering states can be avoided depending upon the neutral states of the inputs, or mitigated depending upon the length of time spent in the neutral states and the time at which the result is needed. Meanwhile, the weak driving states can be fixed with staticizing transistors using other signals in the handshake. Only two pass-transistor XOR gates are generally found in the literature, and of the two, only one is useful for QDI circuits. It represents a relatively safe, well-rounded XOR gate which is reasonably applicable to most scenarios. For this approach, both inputs must be differential. One of the differential inputs, b and _b , is passed through the XOR gate while the other, a and _a , drives the gates of the transistors. This means that the up-going transition on c acknowledges the up-going transitions on b , _b and the down-going transitions on a , _a .

@b ∧ ¬a ∨ @_b ∧ ¬_a → c↾
¬@b ∧ _a ∨ ¬@_b ∧ a → c⇂

Unfortunately, there are six undesirable states in this circuit. Four of these happen when a , _a are switching. When the C-element or latch driving a , _a switches, there is a transient state in which a and _a are the same value. When this happens, b and _b are connected to each other. If b and _b are in a differential state, then it will cause interference. Specifically, if both a and _a are 0, then the pull-up network of either b or _b will drive the other up through this pass gate, fighting the pull-down network of the other. If both a and _a are 1, then the pull-down network of either b or _b will drive the other down through this pass gate, fighting the pull-up network of the other. Table 1 elaborates the outcomes of each possible state for this XOR, and a drive-state sketch follows the table.

This creates a few constraints. First, a and _a cannot be driven by a delay insensitive encoding, because the neutral state of the delay insensitive encoding is not transient and would burn power through this short. This means that a and _a should be driven by either a latch or a C-element. Second, b and _b are ideally driven by a delay insensitive encoding that is in the neutral state when a and _a switch. This allows b and _b to be briefly connected without causing any interference. If both signals must be driven by latches, then the offending transistors of the transient interfering state should be sized like the weak keeper of a C-element, meaning that part of the gate is no longer allowed special treatment in the QDI delay model since the resulting delay is no longer negligible.

The other two undesirable states only happen when both a , _a and b , _b are driven by latches and both latches are switching. During these transient states, all of the input signals to this gate are the same value, leaving the output node c dynamic. These transient states may be covered by staticizing c using other signals from the handshake. However, this should be done only if the dynamic transient state is reachable in the handshake, the completion of the transition to the static value of c after the transient state is not guaranteed by the QDI delay model by the time it is used in the handshake, and staticizing c would not cause weak interference with another staticizer on b or _b .

_a a _b b   c        Drivers
1  0  1  0  0        b⇂
0  1  1  0  1        _b↾
1  0  0  1  1        b↾
0  1  0  1  0        _b⇂
1  1  1  1  weak 1   weak b↾ , weak _b↾
1  1  1  0  X-       b⇂ , weak _b↾
1  1  0  1  X-       weak b↾ , _b⇂
1  0  1  1  1        b↾
0  1  1  1  1        _b↾
0  0  1  1  1        b↾ , _b↾
1  1  0  0  0        b⇂ , _b⇂
0  0  1  0  X+       weak b⇂ , _b↾
0  0  0  1  X+       b↾ , weak _b⇂
1  0  0  0  0        b⇂
0  1  0  0  0        _b⇂
0  0  0  0  weak 0   weak b⇂ , weak _b⇂

Table 1. The state space of the dual differential pass transistor XOR. Rows are highlighted when a and _a are the same value.
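To make the drive strengths in Table 1 concrete, the following Python sketch abstracts each differential input as a transmission gate whose NMOS device passes a 0 strongly and a 1 weakly, and whose PMOS device does the reverse. It is a drive-state checker written for illustration only, and is no substitute for the analog verification described in Section 3.10.

def tg(inp, n_on, p_on):
    """Drivers contributed by one transmission gate as (value, strong)."""
    out = []
    if n_on: out.append((inp, inp == 0))  # NMOS passes 0 strongly
    if p_on: out.append((inp, inp == 1))  # PMOS passes 1 strongly
    return out

def pass_xor(a, _a, b, _b):
    # b passes to c when a is low (PMOS) or _a is high (NMOS);
    # _b passes to c when _a is low (PMOS) or a is high (NMOS).
    drv = tg(b, _a == 1, a == 0) + tg(_b, a == 1, _a == 0)
    vals = {v for v, _ in drv}
    if not vals:
        return 'floating'                 # c is left dynamic
    if len(vals) == 2:
        return 'X'                        # interference
    strong = any(s for _, s in drv)
    return str(vals.pop()) + ('' if strong else ' (weak)')

# Every properly differential input state drives a strong a XOR b.
for a in (0, 1):
    for b in (0, 1):
        assert pass_xor(a, 1 - a, b, 1 - b) == str(a ^ b)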
Pass Transistor AND

@a ∧ ¬_b → c↾
¬@a ∧ b → c⇂
_b → c⇂

The pass transistor AND gate presents an easy solution for gating signals in a handshake. This can be used to reduce transistor stack length in some cases and remove Half-Cycle Timing Assumptions in others. There are ultimately two undesirable states. First, when all of the input signals are 0, c is no longer driven. In a transient state, this is not much of a problem. However, if noise becomes an issue, then another signal from the handshake may be used to staticize c . Second, when all of the input signals are 1, there is weak interference as the CMOS pull-up network on c fights the pull-down network on a through the NMOS pass transistor. This state should ultimately be avoided. If that is not possible, then the NMOS pass transistor should be sized like a weak staticizer and the offending state must be transient.

a _b b   c        Drivers
0  1  0  0        GND⇂
1  1  0  0        GND⇂
0  0  1  0        a⇂
1  0  1  1        a↾
0  0  0  weak 0   weak a⇂
1  0  0  1        a↾
0  1  1  0        GND⇂ , a⇂
1  1  1  X-       GND⇂ , weak a↾

Table 2. The state space of the pass transistor AND.

Pass Transistor OR

@a ∧ ¬b → c↾
¬@a ∧ _b → c⇂
¬_b → c↾

The pass transistor OR gate is similar to the AND gate, with the undesirable states flipped.

a _b b   c        Drivers
0  1  0  0        a⇂
1  1  0  1        a↾
0  0  1  1        Vdd↾
1  0  1  1        Vdd↾
0  0  0  X+       Vdd↾ , weak a⇂
1  0  0  1        Vdd↾ , a↾
0  1  1  0        a⇂
1  1  1  weak 1   weak a↾

Table 3. The state space of the pass transistor OR.

These pass transistor gates can have a significant effect on performance and can be applied to many of the previously listed micro-architectural optimizations. This work provides three examples.

Storing a 1of2 Request (Non-inverted Singlerail Out)

First, if values on a stored 1of2 request are repeated often, then the latch will switch very rarely. In these cases, it is desirable to cut the corresponding transitions out of the handshake. This can be done with a pass transistor XOR gate, which uses the same number of transistors to provide a non-inverted output as the CMOS approach uses for an inverted output. At the start of the handshake, r0 and r1 are in the neutral state with both signals low. As long as v0 and v1 are in a differential state, o will be driven low. If r0 transitions high, then v1 will transition low followed by v0 high. This means that _v1 will transition high and _v0 low in any order, but preferring _v1 first. This will briefly put o in an undesirable state when both _v0 and _v1 are high. Upon resolution, o will transition high, signalling the completion of the store. Then, the 1of2 input r0 and r1 may transition back to a neutral state. This gives o an entire cycle to transition low again. That means that the NMOS transistors in the pass transistor XOR may be sized like a weak staticizer without affecting the performance of the handshake. Furthermore, both of the inverted variables _v0 and _v1 are fully acknowledged in all cases just by using o . Therefore, this approach does not rely on any special treatment in the QDI delay model.
v1 ∨ r1 → v0⇂
v0 ∨ r0 → v1⇂
¬v1 ∧ ¬r1 → v0↾
¬v0 ∧ ¬r0 → v1↾
v0 → _v0⇂
¬v0 → _v0↾
v1 → _v1⇂
¬v1 → _v1↾
@r1 ∧ ¬_v1 ∨ @r0 ∧ ¬_v0 → o↾
¬@r0 ∧ _v1 ∨ ¬@r1 ∧ _v0 → o⇂

Memory Gated Forward Drivers

When the forward drivers do not map well to the output requests, the general strategy is to use the memory to gate the forward drivers to generate the correct output requests. However, the CMOS approach to this problem either introduces transitions to the handshake or relies upon the Half-Cycle Timing Assumption. With a pass transistor AND gate, the gating rule on the internal node of the forward driver can be replaced with a pass transistor gate on the external node. This removes the extra transitions. Furthermore, because the passed signal is the output of a C-element, it is driven by an inverter. This is the optimal driving gate for the inputs to pass transistor logic because it allows the driving gate to be sized up quite significantly.

Re ∧ Lw0 → R0↾
¬Lw0 ∧ ¬v1 → R0⇂
Re ∧ Lw1 → R1↾
¬Lw1 ∧ ¬v0 → R1⇂
Re ∧ Lr → R2↾
(¬Re ∨ ¬v1) ∧ ¬Lr → R2⇂
@R2 ∧ ¬v0 → Rr↾
¬@R2 ∧ v1 → Rr⇂
v0 → Rr⇂
¬v1 ∨ ¬_R0 → v0↾
¬v0 ∨ ¬_R1 → v1↾
v1 ∧ _R0 → v0⇂
v0 ∧ _R1 → v1⇂
R0 ∨ R1 ∨ R2 → Le⇂
¬R0 ∧ ¬R1 ∧ ¬R2 → Le↾

Intermediate Forward Drivers

The same can be done for the intermediate forward drivers. Because the dualrail encoding ensures mutual exclusivity between the two requests, an XOR gate can be used in place of an OR gate. When both drivers R0 and R1 are low, both internal nodes are high. This means that both NMOS transistors are on, passing both drivers. When one of the forward drivers, say R0 , is active, its internal node _R0 transitions low. This disconnects the NMOS transistor to R1 and connects the PMOS transistor to R0 . Then, the transition on R0 is passed directly through the gate. In the reset phase, _R0 transitions high. There is a transient state in which R0 and R1 are connected to each other through the two NMOS transistors of the pass transistor XOR. This will start to weakly pull R1 high. However, because _R0 transitioned high, the inverter driving R0 will switch at the same time. Now, the inverters for both R0 and R1 will actively pull R0 low. Practically, this does not cause a glitch on R1 unless the load capacitance on Rr is extremely high and the drivers on R0 and R1 are extremely weak.

Re ∧ Lr0 → R0↾
¬Re ∧ ¬Lr0 → R0⇂
Re ∧ Lr1 → R1↾
¬Re ∧ ¬Lr1 → R1⇂
@R0 ∧ ¬_R0 ∨ @R1 ∧ ¬_R1 → Rr↾
¬@R0 ∧ _R1 ∨ ¬@R1 ∧ _R0 → Rr⇂
R0 ∨ R1 → Le⇂
¬R0 ∧ ¬R1 → Le↾

These optimizations will be used throughout the circuits proposed in this work.

3.8 Example: Single Bit Register

As an example of some of these optimizations, this section walks through a step-by-step optimization of the single bit register provided in [68], which is the current standard for the design of such a unit.

Ce ∧ Cr ∧ Re ∧ v0 → Rr0↾
¬Ce ∧ ¬Re → Rr0⇂
Ce ∧ Cr ∧ Re ∧ v1 → Rr1↾
¬Ce ∧ ¬Re → Rr1⇂
Rr0 ∨ Rr1 → Rn⇂
¬Rr0 ∧ ¬Rr1 → Rn↾
Ce ∧ Cw ∧ Lr0 → v1⇂
Ce ∧ Cw ∧ Lr1 → v0⇂
¬Lr0 ∧ ¬v0 → v1↾
¬Lr1 ∧ ¬v1 → v0↾
Ce ∧ Cw ∧ (v0 ∧ Lr0 ∨ v1 ∧ Lr1) → Le⇂
¬Ce ∧ ¬Lr0 ∧ ¬Lr1 → Le↾
¬Le ∨ ¬Rn → Ce⇂
Le ∧ Rn → Ce↾

1. WCHB Single Bit Register: First, the circuit is converted to a standard WCHB reshuffling. This simplifies the handshake, making it easier to reason about.

Cw ∧ Ld0 → W0↾
¬Cw ∧ ¬Ld0 ∧ ¬v1 → W0⇂
Cw ∧ Ld1 → W1↾
¬Cw ∧ ¬Ld1 ∧ ¬v0 → W1⇂
Re ∧ Cr ∧ v0 → Rr0↾
¬Re ∧ ¬Cr → Rr0⇂
Re ∧ Cr ∧ v1 → Rr1↾
¬Re ∧ ¬Cr → Rr1⇂
W0 ∨ W1 ∨ Rr0 ∨ Rr1 → Ce⇂
¬W0 ∧ ¬W1 ∧ ¬Rr0 ∧ ¬Rr1 → Ce↾
W0 ∨ W1 → Le⇂
¬W0 ∧ ¬W1 → Le↾
¬v1 ∨ ¬_W0 → v0↾
¬v0 ∨ ¬_W1 → v1↾
v1 ∧ _W0 → v0⇂
v0 ∧ _W1 → v1⇂
2. Gated Forward Drivers: Noticing that the output requests from the read behave similarly, and that there is a stable way of differentiating them through v0 and v1 , pass transistor memory gated forward drivers are applied to remove one of the read C-elements.

@Rr ∧ ¬v0 → Rr1↾
¬@Rr ∧ v1 ∨ v0 → Rr1⇂
@Rr ∧ ¬v1 → Rr0↾
¬@Rr ∧ v0 ∨ v1 → Rr0⇂
Cw ∧ Ld0 → W0↾
¬Cw ∧ ¬Ld0 ∧ ¬v1 → W0⇂
Cw ∧ Ld1 → W1↾
¬Cw ∧ ¬Ld1 ∧ ¬v0 → W1⇂
Re ∧ Cr → Rr↾
¬Re ∧ ¬Cr → Rr⇂
¬v1 ∨ ¬_W0 → v0↾
¬v0 ∨ ¬_W1 → v1↾
v1 ∧ _W0 → v0⇂
v0 ∧ _W1 → v1⇂
W0 ∨ W1 ∨ Rr → Ce⇂
¬W0 ∧ ¬W1 ∧ ¬Rr → Ce↾
W0 ∨ W1 → Le⇂
¬W0 ∧ ¬W1 → Le↾

3. Stored Dualrail Request: Then, the write command is simplified using the stored dualrail request method. This removes another C-element. However, this optimization relies upon the special treatment for pass transistor logic in the QDI model for the pass transistor rules driving Rr0 and Rr1 low.

@Rr ∧ ¬v0 → Rr1↾
¬@Rr ∧ v1 ∨ v0 → Rr1⇂
@Rr ∧ ¬v1 → Rr0↾
¬@Rr ∧ v0 ∨ v1 → Rr0⇂
Cw ∧ (Lw0 ∧ v0 ∨ Lw1 ∧ v1) → Wr↾
¬Cw ∧ ¬Lw0 ∧ ¬Lw1 → Wr⇂
Re ∧ Cr → Rr↾
¬Re ∧ ¬Cr → Rr⇂
v1 ∨ Cw ∧ Lw1 → v0⇂
v0 ∨ Cw ∧ Lw0 → v1⇂
¬v1 ∧ (¬Cw ∨ ¬Lw1) → v0↾
¬v0 ∧ (¬Cw ∨ ¬Lw0) → v1↾
Wr → Le⇂
¬Wr → Le↾
Wr ∨ Rr → Ce⇂
¬Wr ∧ ¬Rr → Ce↾

4. Pass Transistor AND: The read command C-element can be completely removed, removing the pipeline stage as well. This makes the read pass through to the read channel.

@Cr ∧ ¬v0 → Rr1↾
¬@Cr ∧ v1 ∨ v0 → Rr1⇂
@Cr ∧ ¬v1 → Rr0↾
¬@Cr ∧ v0 ∨ v1 → Rr0⇂
v1 ∨ Cw ∧ Lw1 → v0⇂
v0 ∨ Cw ∧ Lw0 → v1⇂
¬v1 ∧ (¬Cw ∨ ¬Lw1) → v0↾
¬v0 ∧ (¬Cw ∨ ¬Lw0) → v1↾
Cw ∧ (Lw0 ∧ v0 ∨ Lw1 ∧ v1) → Wr↾
¬Cw ∧ ¬Lw0 ∧ ¬Lw1 → Wr⇂
@Re ∧ ¬Wr → Ce↾
¬@Re ∧ _Wr ∨ Wr → Ce⇂
Wr → Le⇂
¬Wr → Le↾

5. Combined L, C: From here, further simplification requires modification of the behavioral specification. If it is possible to guarantee mutual exclusion of the read and write in the environment, then the C and L channels can be combined into one, making L a 1of3 request. This entirely removes the need for a write C-element. Again, this relies upon the special treatment for the pass transistors driving Rr0 and Rr1 low.

@Rr ∧ ¬v0 → Rr1↾
¬@Rr ∧ v1 ∨ v0 → Rr1⇂
@Rr ∧ ¬v1 → Rr0↾
¬@Rr ∧ v0 ∨ v1 → Rr0⇂
Re ∧ Lr → Rr↾
¬Re ∧ ¬Lr → Rr⇂
v1 ∨ Lw1 → v0⇂
v0 ∨ Lw0 → v1⇂
¬v1 ∧ ¬Lw1 → v0↾
¬v0 ∧ ¬Lw0 → v1↾
v0 ∧ Lw0 ∨ v1 ∧ Lw1 ∨ Rr → Le⇂
(¬v0 ∨ ¬Lw0) ∧ (¬v1 ∨ ¬Lw1) ∧ ¬Rr → Le↾

6. Combined L, C, R Counterflow (Positive): If the environment can further handle the acknowledgement of the read, then it is possible to remove the read C-element by implementing a counterflow handshake. Now the input enable keeps the value of the internal memory when the enable is high. This allows its value to be sampled whenever it is needed without any request, as long as that sampling remains mutually exclusive from the write commands.

v1 ∨ Lw1 → v0⇂
v0 ∨ Lw0 → v1⇂
¬v1 ∧ ¬Lw1 → v0↾
¬v0 ∧ ¬Lw0 → v1↾
Lw0 ∨ v1 → Lr0⇂
Lw1 ∨ v0 → Lr1⇂
¬Lw0 ∧ ¬v1 → Lr0↾
¬Lw1 ∧ ¬v0 → Lr1↾

3.9 Integrated QDI/BD Circuits

Fig. 38 shows the typical diagram used to describe a 4-phase bundled data pipeline [63]. For each stage, there is a QDI control block and latched datapath logic. The enable signal from the QDI control is amplified and used to clock the datapath. Meanwhile, the input request is delayed to prevent the latches from closing before the input data resolves. The pipeline protocol as executed by the process in Fig. 38 is demonstrated in Fig. 39, assuming a WCHB reshuffling for the control. Signals driven by the environment on L are colored red while signals driven by the environment on R are colored blue.
The input enables for all channels are generally initialized high, signifying that the process is ready to receive data on those channels. This initialization, having been applied to channel L , opens the p-latches. Therefore, the bundled data is the first thing to arrive on L , followed by an upgoing transition on the request wire shortly thereafter. This request must be delayed such that the bundled data arriving on L has passed through the p-latch by the time the enable is lowered. This ensures that the incoming data is correctly latched so that the datapath logic may resolve and forward the result through R . Once the enable of L is lowered, the input data is allowed to change. This means that the delay assumption overlaps with half of the handshake protocol, allowing some of the protocol to be counted toward the delay assumption and reducing the length of the delay line.

Ultimately, this might be improved by overlapping the delay assumption with the whole handshake protocol using full-buffering. However, that would require a PCFB reshuffling in the QDI control and flip-flops instead of latches in the datapath. In the end, both the total throughput and the device count would double, leaving the throughput efficiency the same. The energy required per token would increase, so overall that approach would be less performant.

One should also note that the delay elements only need to delay the upgoing transition of the request wires. The downgoing transition only serves to reset the channel protocol and open the latches for the next computation. Therefore, one should make extensive use of asymmetric delay lines as seen in Fig. 40 to increase the overall throughput of the process.

Ultimately, while the pipeline demonstrated in Fig. 39 uses a dataless WCHB reshuffling, it seems like a reasonable jump that the QDI control could be any half-buffered process with or without data. In particular, there is a whole host of templates from [68] that can be put in that box. If the input request communicates data, then all of the request wires must have a delay element. If the request wires trigger cycles with different cycle times in the QDI control, the delay lines should be tuned to their associated cycle times. Alternatively, it is possible to clock the request lines as in [64]. However, there are a few important facets of this approach that require careful consideration. This includes communicating data between the control and datapath, dealing with conditional acknowledgement and conditional output requests, clocking internal memories in the datapath, and handling exchange channels. A sketch of the asymmetric delay line's behavior follows.

Fig. 38: A basic template for QDI control with bundled data.
Fig. 39: The channel protocol for the input and output channels of a pipeline stage evaluated over two packets of data.
Fig. 40: Circuit diagram for an asymmetric delay line. The upgoing transition is delayed by six inverters while the downgoing is delayed by only two.
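The effect of the asymmetric delay line in Fig. 40 can be illustrated with a few lines of Python. This is a hypothetical event-level abstraction with arbitrary time units, not a model of any particular cell.

def asymmetric_delay(edges, t_inv=1.0, up_stages=6, down_stages=2):
    """edges: list of (time, level) transitions on the request wire.
    An upgoing edge sees six inverter delays, a downgoing edge two."""
    return [(t + t_inv * (up_stages if level else down_stages), level)
            for t, level in edges]

req = [(0.0, 1), (10.0, 0), (12.0, 1)]
print(asymmetric_delay(req))  # [(6.0, 1), (12.0, 0), (18.0, 1)]

The upgoing edges still carry the full matching delay for the bundled data, but the reset half of the handshake completes four units earlier than it would through a symmetric six-inverter line, directly shortening the cycle.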
QDI Input Requests to Datapath

Communicating input request signals from the QDI control to the bundled datapath is simple enough. Unfortunately, those signals transition to neutral before the output enable is lowered, potentially allowing an incorrect result to propagate out through the datapath before the latches in the next stage are closed. So, a separate signal must be generated that is held stable at least until the output enable is lowered. This can be easily achieved with an SR latch as shown in Fig. 41. The delay insensitive data is pulled from the request wires before the delay elements and fed through an SR latch. The result is guaranteed to be stable by the time the request is received by the QDI control due to the delay lines, and it remains stable until the next request. Therefore it can be used in both the datapath logic and the QDI control throughout the whole cycle. Unfortunately, anything beyond a two way latch becomes prohibitively expensive. For example, a 3-way latch requires 18 transistors compared to the 8 required for a 2-way latch. So, to get the best performance from this strategy, any delay insensitive encoding should be converted to a collection of 1of2 codes before feeding it into the SR latches. A behavioral sketch of this hold behavior follows.

Fig. 41: Communicating a QDI request signal to the bundled datapath using an SR latch.
Fig. 42: Communicating a QDI internal memory to the bundled datapath using an n-latch.
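The hold behavior of the SR latch in Fig. 41 can be sketched in a few lines of Python. This is an illustrative abstraction of the rail behavior only; the function name is invented for this example.

def sr_hold(rail_events):
    """rail_events: sequence of (r0, r1) request rail states, tapped
    before the delay line. The latch sets on r1, resets on r0, and
    holds its value through the neutral (0, 0) phase of the handshake."""
    q, trace = 0, []
    for r0, r1 in rail_events:
        assert not (r0 and r1), "1of2 rails are mutually exclusive"
        if r1: q = 1
        elif r0: q = 0
        trace.append(q)
    return trace

# The datapath sees a stable value across both neutral phases.
assert sr_hold([(0, 1), (0, 0), (1, 0), (0, 0)]) == [1, 1, 0, 0]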
QDI Internal Memory to Datapath

When communicating the value of an internal memory of the QDI control process to the datapath, one must ensure that any transition executed on that memory cell waits for the output enable to lower. This prevents the transition from incorrectly propagating through the datapath into the next process, potentially causing a glitch. Luckily, this is often the default implementation of an internal memory unit [68]. However, if this is not possible, then the signal sent to the datapath must be protected with a latch that remains closed while the output enable is high, as in Fig. 42.

Datapath to QDI Forward Drivers

In order to use a datapath signal D in the pull-down networks of the forward drivers, the timing assumption must be tightened. Previously, the output data was assumed to be stable by the time the output enable was lowered. However, that only happens after the output requests have been sent. Any signal used at this stage of the handshake must be stable by the time the request passes through the delay lines. Ultimately, this just means that the delay lines must be lengthened, as they no longer overlap the handshake protocol.

Re ∧ Lr ∧ D0 → Rr0↾
¬Re ∧ ¬Lr → Rr0⇂
Re ∧ Lr ∧ D1 → Rr1↾
¬Re ∧ ¬Lr → Rr1⇂
Rr0 ∨ Rr1 → Le⇂
¬Rr0 ∧ ¬Rr1 → Le↾

Datapath to QDI Internal Memory

Using a datapath value D to set the value of an internal memory unit depends upon the implementation of the internal memory. If transitions on the internal memory are gated by the downgoing transition of the output enable Re , then D may be used directly. Otherwise, as below, the delay assumption must be adjusted such that the datapath has stabilized before the internal memory is allowed to transition.

Re ∧ Lr ∧ v0 → R0↾
¬Re ∧ ¬Lr ∧ (¬D1 ∧ ¬v0) → R0⇂
Re ∧ Lr ∧ v1 → R1↾
¬Re ∧ ¬Lr ∧ (¬D0 ∧ ¬v1) → R1⇂
¬v1 ∨ ¬_R1 ∧ ¬D1 → v0↾
¬v0 ∨ ¬_R0 ∧ ¬D0 → v1↾
v1 ∧ (_R1 ∨ D1) → v0⇂
v0 ∧ (_R0 ∨ D0) → v1⇂
R0 ∨ R1 → Le⇂
¬R0 ∧ ¬R1 → Le↾

QDI/Datapath Cycle

If the process communicates data both from the QDI control to the datapath and from the datapath to the QDI control, it is possible to introduce a cycle. Specifically, if a QDI logic block transitions and this transition is passed into the datapath, then this can cause a transition or even a glitch to propagate back out of the datapath and into the QDI control. There are two ways to mitigate this problem. First, one could redesign the datapath to break the cycle. If that cannot be done, then an extra latch can be introduced to effectively break the cycle, as in Fig. 43. If the data sent to the datapath comes from the internal memory and the data received from the datapath is used in the forward drivers, then the p-latch is unnecessary, since the n-latch guarding the datapath ensures the output enable is low before causing new transitions on D . This disables the forward drivers and protects them from any glitches in the datapath that this cycle might cause.

Datapath Memory

An internal memory in the datapath can make things a bit more complex, particularly because it cannot always be clocked in lock-step with another channel. The basic template requires three layers of latches, as in Fig. 44 (left). Assume for a moment that the internal memory were instead implemented using a p-flop, removing the first layer of latches. If the memory were clocked some delay after the input latches due to logic, then there would be a short time during the reset phase of the handshake in which the input latch and the n-latch of the internal memory are both open. This would allow the new value from the input to race ahead and erase the value that should have been stored from the last operation.

Of course, the latches implementing the internal memory can be pushed around the datapath so long as doing so does not affect the output value on Rd . For example, the first p-latch layer could be pushed back to just after the input latches and just after the second p-latch layer of the internal memory, as in Fig. 44 (right). Because the first and second p-latch layers of the internal memory share the same clock, they can be merged together. If there is no clocking logic, meaning the delay between the input latch clock and the memory clock is zero, then the input latch and the relocated first p-latch layer of the memory can also be merged. This strategy of moving the latches around must include all signals that lead into the datapath internal memory, including signals from the QDI logic.

Then, the clocking logic can also be pushed around. Suppose the clocking logic comes from two channels that are conditionally acknowledged, as in Fig. 45 (left). The AND gate can be pushed into the latches, then the optimizations above can be performed to remove the unnecessary latching layer, as in Fig. 45 (right). This requires the special multi-clock latches shown in Fig. 46. Alternatively, the AND gate can be pushed into the QDI control using C-elements to ensure it occurs before the input enables. However, this has fairly undesirable performance.

Fig. 43: Breaking a communication cycle with a p-latch on the output request.
Fig. 44: Clocking a memory internal to the datapath.
Fig. 45: Clocking a memory internal to the datapath.
Fig. 46: The p-and-p latch (left) passes the value when both clocks are high, the n-or-n latch (right) passes the value when either clock is low.

QDI and Datapath Memories

There are two ways the internal memory of the QDI control can be used to set the datapath memory. In the standard approach, the new value of the QDI internal memory will set the value of the datapath memory at the end of the next cycle, as seen in Fig. 47 (left). This matches the behavior of the QDI internal memory on Rd . Notice that the n-latch with Le can be merged into the QDI internal memory and that the first layer of p-latches implementing the datapath memory can be pushed around so that it takes the spot of the n-latch. With these optimizations, the cost of this approach is fairly low.
In the alternative approach, the new value of the QDI internal memory will set the value of the datapath memory at the end of the current cycle, as in Fig. 47 (right). In this case, the first layer of p-latches implementing the datapath memory has been optimized out for clarity. The n-latch guarding the QDI internal memory has been removed as well. Removing these two latches shifts the set time by a full cycle.

Exchange Channels

Exchange channels pose both unique challenges and opportunities. For exchange channels, all of the enable signals also encode data. This means there is no longer a convenient clocking signal for the datapath. Furthermore, the latching circuitry on both Lid and Rid must be closed before either request Lor or Ror is sent out. This necessitates the creation of a separate set of clocking signals, as seen in Fig. 48. Re-examining the negative exchange channel implementation, the signals driven by the forward drivers R0 , R1 , R2 , and R3 are the only ones along the critical path of both output requests. Unfortunately, there are four of them, so this will require the special multi-clock latches shown in Fig. 49. The positive exchange channel implementation is similar. In the rules below, the drivers marked "amplify" are the ones that must be amplified to clock the datapath latches.

Re0 ∧ v0 ∧ Lr0 → R0↾ // amplify
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr0 → R0⇂ // amplify
Re1 ∧ v1 ∧ Lr0 → R1↾ // amplify
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr0 → R1⇂ // amplify
Re0 ∧ v0 ∧ Lr1 → R2↾ // amplify
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr1 → R2⇂ // amplify
Re1 ∧ v1 ∧ Lr1 → R3↾ // amplify
(¬Re0 ∧ ¬v1 ∨ ¬Re1 ∧ ¬v0) ∧ ¬Lr1 → R3⇂ // amplify
¬v1 ∨ ¬Re0 → v0↾
¬v0 ∨ ¬Re1 → v1↾
v1 ∧ Re0 → v0⇂
v0 ∧ Re1 → v1⇂
R0 ∨ R1 → Rr0↾
¬R0 ∧ ¬R1 → Rr0⇂
R2 ∨ R3 → Rr1↾
¬R2 ∧ ¬R3 → Rr1⇂
R0 ∨ R2 → Le0⇂
¬R0 ∧ ¬R2 → Le0↾
R1 ∨ R3 → Le1⇂
¬R1 ∧ ¬R3 → Le1↾

Exchange Channel with Internal Memory

Clocking a memory with an exchange channel provides a unique optimization. Much like the earlier datapath memory implementation, the latches can be pushed around the datapath. In this case, the latches can be merged entirely into the input latches with a little extra logic, as in Fig. 50.

3.10 Toolset and Circuit Evaluation

All of the circuits in this work are developed and evaluated using a set of in-house tools found in [89]. The production rule specifications are verified with a switch-level simulation which identifies instability, interference, and deadlock. These specifications are then automatically translated into netlists, and their analog properties are verified using Synopsys's combined simulator: VCS, a Verilog simulator, simulates the testbench while HSIM, a fast SPICE simulator, reports power and performance metrics. The CHP was simulated using C++ to generate inject and expect values, which were tied into both the switch level and analog simulations using Python. This facilitated verification of circuit and behavioral correctness by checking the behavioral, digital, and analog simulations against each other. A 1V 28nm process was simulated to evaluate frequency and energy per operation. Latency is measured from the 0.5 V level of the input to the 0.5 V level of the output.

Fig. 47: QDI internal memory to datapath memory communication.
Fig. 48: Clocking an exchange channel.
Fig. 49: The p-or-p latch (left) passes the value when either clock is high, the n-and-n latch (right) passes the value when both clocks are low.
Fig. 50: The basic template for datapath memory with exchange channels (left) vs after all of the previously discussed memory optimizations (right).
To get more accurate results, each of the digitally driven channels is protected with a FIFO of three WCHBs isolated to a different power source. All circuits are sized minimally with a pn-ratio of 2. The simulations do not include extracted parasitics, but a 1 fF capacitor is added to every gate output. All implementations are explicit about their use of the half-cycle timing assumption (HCTA) [72], and use weak feedback for C-elements. Circuitry necessary for reset is not included in any of the descriptions.

There is no particularly straightforward way to evaluate self-timed circuits. Each circuit has a set of conditions, executing one per cycle. Some conditions might not toggle the input or output channels. Some might simply act as a token source or sink with dramatically higher frequency. So to get a reasonable picture of a circuit's overall performance, one must determine the frequency and energy of each condition, determine how often each condition is likely to execute, and use those two measures to compute average performance metrics per token. In the event that there is not enough data to determine how often a condition might execute, individual numbers will be reported and a uniform random distribution will be used to determine relative overall performance; a small sketch of this averaging follows at the end of this section. Evaluating adaptive digit-serial circuits adds further difficulty. To be able to compare against their bit-parallel counterparts, it is necessary to determine the overall performance of the circuit per digit-stream by using the data from Fig. 32.

Overall, four numbers will be reported: forward latency, operation frequency, energy per operation, and transistor count. The forward latency is informative of the execution speed of sequential operations. The operation frequency and transistor count are informative of the total throughput of parallel operations. And minimizing the energy per operation reduces the power-wall constraint seen by other processors.
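The per-token averaging described above can be written down directly. The Python sketch below is illustrative only; the condition names and numbers are invented, and the uniform fallback mirrors the convention stated above.

def per_token_metrics(conditions, mix=None):
    """conditions: {name: (frequency, energy)} measured per condition.
    mix: {name: probability of executing that condition per token}."""
    if mix is None:  # no workload data: assume a uniform distribution
        mix = {c: 1.0 / len(conditions) for c in conditions}
    avg_period = sum(mix[c] / conditions[c][0] for c in conditions)
    avg_energy = sum(mix[c] * conditions[c][1] for c in conditions)
    return 1.0 / avg_period, avg_energy

# Hypothetical measurements: (MHz, pJ) per condition.
freq, energy = per_token_metrics({'inc': (950.0, 1.2), 'clear': (700.0, 2.1)})
print(round(freq, 1), energy)  # ~806.1 MHz average, 1.65 pJ per token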
CHAPTER 4
COUNTERS

Counters are fundamental for tracking the state of a digit-serial operation. In particular, they are absolutely necessary for the implementation of digit-serial shifting, rotation, and sign compression operations. In general, they are also applicable to a large array of other applications including the control logic for power gating, clock gating, and pipeline management [186][198][232]; timers, performance counters, and frequency dividers [87]; and iterative arithmetic circuits [88]. This versatility leads to a wide variety of functionality. Ultimately, there are five basic input commands: increment, decrement, clear, read, and write; and six output responses: zero, full, less-than, greater-than, equal-to, and no-event. Counters are named using the first letter of each command they support followed by the first letter of each response they support, so idzn is an increment/decrement counter with zero/not-zero detection. While clocked counters have been thoroughly explored, such as the increment/decrement counter in [92], the increment/write in [93], and the decrement/write in [94][95], the same cannot be said of QDI counters. Aside from my previous work in [75], a constant response time decrementing counter with zero detection was implemented in [96], an increment/decrement counter with zero/full detection in [97], and an increment/decrement counter with constant-time zero detection in [98].

4.1 Behavioral Specification

This work iterates on the designs found in [75], deriving significant improvements in area, energy, and throughput through five basic optimizations.

First, the input enable previously communicated the counter status during the reset phase of the handshake using positive exchange channels. This is extremely helpful when using the counter, but it also encourages long transistor stacks in the reset rules of the input enable. This forced the use of a validity tree from the output request drivers to reduce the transistor stack length of the input enable rules, introducing two extra transitions in the handshake that reduced overall frequency. Furthermore, it required the slowest internal memory implementation, gating the internal memory against the reset of the input request wires, again reducing the operating frequency. Optimizing this requires that the counter status instead be communicated when the input enable is lowered, using negative exchange channels. This makes it more difficult for a user to interface with the counter, because the interfacing processes can no longer directly condition their increment and decrement requests on the status of the counter. Instead, they must keep their own internal memory to track the counter status received between commands. However, it also removes the long transistor stacks required by the previous approach. This removes the validity tree and allows for the fastest internal memory approach, dramatically increasing operating frequency.

Second, a pass-transistor stored 1of2 request can combine the output acknowledge and the acknowledgement of transitions in the counter status memory unit into a single signal. This reduces the complexity of the output acknowledge and therefore the forward drivers.

Third, the savings from the first and second optimizations make it possible to combine 2 bits of the counter into a single process. The complexity per bit of the output request lines and internal memory stays the same. However, the complexity per bit of the counter's status circuitry is cut in half. Furthermore, command and status data are communicated half as often.

Fourth, a Gray code can be used to implement the 2-bit counter unit. This reduces the complexity of the internal memory, ensuring that an increment or decrement command switches only one of the two latches at a time. Because one of the latches remains stable, it can be used to control memory gated forward drivers, reducing the total number of forward drivers.

Fifth, pass transistor XOR and XNOR gates can be used between the two internal memory units to reduce the load capacitance from the forward drivers. This extracts most of the complexity out of the underlying handshake and into the pass transistor logic.

To keep the circuit as simple as possible, the specification assumes the counter will not underflow or overflow. It starts at zero; then, for every iteration, a command is received from Lc , the value vc is either increased or decreased by one depending upon the command, and finally the status of the counter is sent across Lz . Notice that this specification differs from the specification in [75], since data is not sent on the acknowledgement until after the command. A behavioral sketch follows the specification.

vc:=0;
∗[Lc?lc;
  [ lc=inc → vc:=vc+1
  ▯ lc=dec → vc:=vc-1
  ];
  Lz!(vc=0)
]
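A direct behavioral rendering of this specification, useful as a reference model when checking the production rules, might look like the following Python sketch (illustrative, not part of the toolset).

def idzn_counter(commands):
    """Token-level model of vc:=0; *[Lc?lc; vc +/- 1; Lz!(vc=0)].
    commands: iterable of 'inc'/'dec'; yields the Lz status per command."""
    vc = 0
    for lc in commands:
        vc += 1 if lc == 'inc' else -1
        yield vc == 0

assert list(idzn_counter(['inc', 'inc', 'dec', 'dec'])) == [False, False, False, True]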
Deriving a process for a 2-bit counter unit is done by separating the least significant digit, v0 , of the counter from the remaining digits, vc . This requires carry circuitry for the increment and decrement from the first digit to the remaining digits. If Lc is increment and v0 is 3 , or Lc is decrement and v0 is 0 , the increment or decrement command must be carried to the remaining digits. Otherwise, the remaining digits are left unchanged. Either way, the value of v0 changes by one.

v0:=0, vc:=0;
∗[Lc?lc;
  [ lc=inc → [ v0=0 → v0:=1 ▯ v0=1 → v0:=2 ▯ v0=2 → v0:=3 ▯ v0=3 → v0:=0, vc:=vc+1 ]
  ▯ lc=dec → [ v0=0 → v0:=3, vc:=vc-1 ▯ v0=1 → v0:=0 ▯ v0=2 → v0:=1 ▯ v0=3 → v0:=2 ]
  ];
  Lz!(v0=0 ∧ vc=0)
]

Then, two new channels are introduced into the specification: Rc communicates the carried command (inc, dec) and Rz responds with the resulting status (zero, not zero). This removes all direct data dependencies between v0 and vc , preparing the specification for projection [80].

v0:=0, vc:=0, vz:=1;
∗[Lc?lc;
  [ lc=inc → [ v0=0 → v0:=1 ▯ v0=1 → v0:=2 ▯ v0=2 → v0:=3 ▯ v0=3 → v0:=0; Rc!inc; Rz?vz ∥ Rc?rc; vc:=vc+1; Rz!vc=0 ]
  ▯ lc=dec → [ v0=0 → v0:=3; Rc!dec; Rz?vz ∥ Rc?rc; vc:=vc-1; Rz!vc=0 ▯ v0=1 → v0:=0 ▯ v0=2 → v0:=1 ▯ v0=3 → v0:=2 ]
  ];
  Lz!(v0=0 ∧ vz=1)
]

In the next step, the least significant digit is projected into a separate process with variables v0, lc, vz , leaving the remaining digits implemented by the variables vc, rc .

v0:=0, vz:=1;
∗[Lc?lc;
  [ lc=inc → [ v0=0 → v0:=1 ▯ v0=1 → v0:=2 ▯ v0=2 → v0:=3 ▯ v0=3 → v0:=0; Rc!inc; Rz?vz ]
  ▯ lc=dec → [ v0=0 → v0:=3; Rc!dec; Rz?vz ▯ v0=1 → v0:=0 ▯ v0=2 → v0:=1 ▯ v0=3 → v0:=2 ]
  ];
  Lz!(v0=0 ∧ vz=1)
] ∥
vc:=0;
∗[Rc?rc;
  [ rc=inc → vc:=vc+1; Rz!vc=0
  ▯ rc=dec → vc:=vc-1; Rz!vc=0
  ]
]

The specification for the remaining bits is left unaffected, and each digit has four channels: Lc and Lz for the command and counter status, and Rc and Rz to carry the command to and receive the status from the remaining digits. This sequence of transformations can be executed recursively on the remaining bits to formulate an N-digit counter. Finally, the specification is flattened into DSA format.

v:=0, vz:=1;
∗[Lc?lc;
  [ lc=inc ∧ v=0 → v:=1
  ▯ lc=inc ∧ v=1 → v:=2
  ▯ lc=inc ∧ v=2 → v:=3
  ▯ lc=inc ∧ v=3 → v:=0; Rc!inc; Rz?vz
  ▯ lc=dec ∧ v=0 → v:=3; Rc!dec; Rz?vz
  ▯ lc=dec ∧ v=1 → v:=0
  ▯ lc=dec ∧ v=2 → v:=1
  ▯ lc=dec ∧ v=3 → v:=2
  ];
  Lz!(v=0 ∧ vz=1)
]

Because Lc and Lz , and Rc and Rz , always communicate together, they can be merged into exchange channels L and R with the command encoded in the request and the zero status encoded in the acknowledge, as shown in Fig. 51. However, the counter must be of finite size, meaning it will need to be capped off. This is done with a circuit attached to the most significant digit that sinks the command on Lc and always returns true on Lz : ∗[ Lc?; Lz!true ] . This adds an overflow condition to the previous counter specification.

Fig. 51: The idzn counter decomposed into processes.

vc:=0;
∗[Lc?lc;
  [ lc=inc → vc:=vc+1
  ▯ lc=dec → vc:=vc-1
  ];
  [ vc ≥ pow(digit, units) → vc:=vc-pow(digit, units)
  ▯ vc < 0 → vc:=vc+pow(digit, units)
  ▯ else → skip
  ];
  Lz!(vc=0)
]

At the moment, if the value of the counter is pow(digit, units)-1 , where digit is the number of values each counter unit implements and units is the total number of counter units in the counter, then an increment command and the resulting status signal would have to propagate across the full length of the counter. This means that the zero detection circuitry takes linear time with respect to the number of bits in the worst case. A constant time zero detection can be implemented by ignoring this overflow case.
Instead of sending on Lz after all of the computation and carries have been performed, the counter status can be sent on Lz before the computation. This requires that the counter status be computed with the current command in mind, so the check must be changed from v=0 ∧ vz=1 to v=1 ∧ lc=dec ∧ vz=1 . This ignores the increment command, and therefore the overflow case, altogether. A behavioral cross-check of this decomposition follows.

v:=0, vz:=1;
∗[Lc?lc; Lz!(v=1 ∧ lc=dec ∧ vz=1);
  [ lc=inc ∧ v=0 → v:=1
  ▯ lc=inc ∧ v=1 → v:=2
  ▯ lc=inc ∧ v=2 → v:=3
  ▯ lc=inc ∧ v=3 → v:=0; Rc!inc; Rz?vz
  ▯ lc=dec ∧ v=0 → v:=3; Rc!dec; Rz?vz
  ▯ lc=dec ∧ v=1 → v:=0
  ▯ lc=dec ∧ v=2 → v:=1
  ▯ lc=dec ∧ v=3 → v:=2
  ]
]

This increases the maximum value the finite-length counter can store before it overflows to pow(digit, units-1)*(digit+1) .

[ vc ≥ pow(digit, units-1)*(digit+1) → vc:=vc-pow(digit, units)
▯ vc < 0 → vc:=vc+pow(digit, units)
▯ else → skip
]
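The decomposition and the early status check can be cross-checked against the flat specification with a short recursive model. The Python sketch below is an illustrative abstraction written for this discussion; it assumes the counter is never driven below zero or past the cap, matching the specification.

def step(v, vz, lc, i=0):
    """Apply one inc/dec at digit i; returns the early status on Lz,
    computed from the stale vz before any carry completes."""
    if i == len(v):                  # cap process: *[ Lc?; Lz!true ]
        return True
    status = (v[i] == 1 and lc == 'dec' and vz[i])
    carry = (lc == 'inc' and v[i] == 3) or (lc == 'dec' and v[i] == 0)
    v[i] = (v[i] + (1 if lc == 'inc' else -1)) % 4
    if carry:                        # Rc!lc; Rz?vz on the carried digit
        vz[i] = step(v, vz, lc, i + 1)
    return status

import random
random.seed(1)
v, vz, flat = [0] * 4, [True] * 4, 0
for _ in range(200):                 # reflect at zero to avoid underflow
    lc = 'inc' if flat == 0 else random.choice(['inc', 'dec'])
    flat += 1 if lc == 'inc' else -1
    assert step(v, vz, lc) == (flat == 0)   # matches the flat counter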
4.2 Increment and Decrement

Due to the first optimization, both Rz and Rn are initialized high on reset, with one to be lowered as acknowledgement. This acknowledgement is recorded in the internal memory implemented by vz and vn . For example, if Rz is lowered to acknowledge the output request, vz will transition high followed by vn low.

¬vn ∨ ¬Rz → vz↾
¬vz ∨ ¬Rn → vn↾
vn ∧ Rz → vz⇂
vz ∧ Rn → vn⇂

Following the second optimization using a stored dualrail request, the acknowledgement of the internal memory can be simplified by using a simple XOR gate. Once the internal memory matches Rz and Rn , Re transitions low. When both Rz and Rn transition high, Re follows.

Rn ∧ vn ∨ Rz ∧ vz → _Re⇂
¬Rn ∧ ¬vz ∨ ¬Rz ∧ ¬vn → _Re↾
¬_Re → Re↾
_Re → Re⇂

However, the majority of acknowledgements do not ultimately change the value of the internal memory, and the above solution introduces four gate delays to every output request handshake. Alternatively, this feature can be implemented using the pass transistor variant of the stored dualrail request, introducing gate delays only when the internal memory switches.

vz → _vz⇂
¬vz → _vz↾
vn → _vn⇂
¬vn → _vn↾
@Rn ∧ ¬_vn ∨ @Rz ∧ ¬_vz → Re↾
¬@Rn ∧ _vz ∨ ¬@Rz ∧ _vn → Re⇂

The transitions on vz and vn in the given example cause _vz to transition low and _vn high in either order. Whichever goes first, there will be a short time in which the drivers for Rz and Rn interfere with each other through the pass transistors. If _vz⇂ goes first, then the pull-up stack for Rn will push against the pull-down stack for Rz . If _vn↾ goes first, then the pull-down stack for Rz will push against the pull-up stack for Rn . This means that the drivers for Rz and Rn must be made strong where possible and the pass transistor XOR must be minimally sized. This allows both Rz and Rn to maintain their values at the expense of power. Ultimately, the gate driving Re is a standard pass-transistor multiplexer, and this transient interference is typical behavior for such gates. Either way, Re will monotonically transition low, acknowledging the internal memory, the output acknowledge, and either _vn↾ or _vz↾ . Once the output request is lowered, the output acknowledge will go high, monotonically driving Re↾ and acknowledging either _vz⇂ or _vn⇂ .

Next, the Gray code counter value must be implemented with four signals: v00 , v01 , v10 , and v11 . With this encoding, incrementing or decrementing the counter only changes one of the two memory values. For example, incrementing from 0 to 1 lowers v00 and raises v01 .

0. v10 , v00
1. v10 , v01
2. v11 , v01
3. v11 , v00

From here, there are many approaches for grouping those state changes in the forward drivers, each with different properties. However, one is ultimately superior in every metric through the use of pass transistor XORs. This approach groups the alternating increments together and the alternating decrements together as follows; a sketch verifying the one-latch-per-step property appears after this list.

• R00 drives v00 → v01 for 0 → 1 or v01 → v00 for 2 → 3
• R01 drives v10 → v11 for 1 → 2 or v11 → v10 for 3 → 0
• R11 drives v00 → v01 for 3 → 2 or v01 → v00 for 1 → 0
• R10 drives v10 → v11 for 0 → 3 or v11 → v10 for 2 → 1
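The claim that each increment or decrement switches exactly one of the two latch pairs can be checked mechanically. The following Python sketch is illustrative; rails are listed as (v10, v11, v00, v01).

GRAY = [(1, 0, 1, 0),   # 0: v10, v00
        (1, 0, 0, 1),   # 1: v10, v01
        (0, 1, 0, 1),   # 2: v11, v01
        (0, 1, 1, 0)]   # 3: v11, v00

for i in range(4):
    a, b = GRAY[i], GRAY[(i + 1) % 4]
    hi_switched = (a[0], a[1]) != (b[0], b[1])   # v10/v11 latch
    lo_switched = (a[2], a[3]) != (b[2], b[3])   # v00/v01 latch
    assert hi_switched != lo_switched  # exactly one latch per step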
R01 's increment from 3 → 0 and R10 's decrement from 0 → 3 carry increments and decrements to the next counter unit. For both, v00 will remain high and v01 low, allowing for the use of pass transistor memory gated forward drivers.

@R01 ∧ ¬v01 → Ri↾
¬@R01 ∧ v00 → Ri⇂
v01 → Ri⇂
@R10 ∧ ¬v01 → Rd↾
¬@R10 ∧ v00 → Rd⇂
v01 → Rd⇂

Examining R00 , state 0 is encoded by v00 ∧ v10 and state 2 by v01 ∧ v11 . So the rule for R00 will look something like (v00 ∧ v10 ∨ v01 ∧ v11) ∧ Li , which is just the XNOR of the internal state. Similarly, R01 will look something like (v01 ∧ v10 ∨ v00 ∧ v11) ∧ Li , which is just the XOR of the internal state. R11 and R10 follow the same pattern, with the transitions on the internal state toggling between XOR and XNOR. Therefore, a pass transistor XNOR and XOR can be used to simplify these rules.

@v01 ∧ ¬v11 ∨ @v00 ∧ ¬v10 → v1↾
¬@v00 ∧ v11 ∨ ¬@v01 ∧ v10 → v1⇂
@v00 ∧ ¬v11 ∨ @v01 ∧ ¬v10 → v0↾
¬@v01 ∧ v11 ∨ ¬@v00 ∧ v10 → v0⇂

This simplifies the forward drivers dramatically, cutting the capacitive load of the input requests by about half. Because R01 and R10 are only forwarded when v00 is high, the check for Re can be skipped when v01 is high. The input acknowledgement gates R11 to check for the zero condition.

v0 ∧ _R01 ∧ Li → R00↾
v1 ∧ _R00 ∧ (v01 ∨ Re) ∧ Li → R01↾
v0 ∧ _R11 ∧ (v01 ∨ Re) ∧ Ld → R10↾
v1 ∧ _R10 ∧ Ld → R11↾
v10 ∧ vz ∧ R11 → Lz⇂
(v11 ∨ vn) ∧ R11 ∨ R01 ∨ R10 ∨ R00 → Ln⇂

However, care must be taken when implementing the internal memory with regard to v0 and v1 . Specifically, the reset phase of the forward drivers must acknowledge the downgoing transitions of those two variables. Unfortunately, the downgoing transitions of v0 and v1 acknowledge the upgoing transitions of v10, v11 and the downgoing transitions of v00, v01 . This means that v00, v01 must form an n-latch while v10, v11 must form a p-latch. Specifically, R00 and R11 must set the state with v00↾; v01⇂ or v01↾; v00⇂ . Meanwhile, R01 and R10 must set v10⇂; v11↾ or v11⇂; v10↾ . Following the previously defined encoding gives these rules for the internal state. Keep in mind that the handshake along v00, v01 is three transitions from the input request to the reset phase, while along v10, v11 it is five. The increased length of the second ultimately matches the transition count for a standard WCHB buffer, so this only affects energy.

¬v01 ∨ ¬v11 ∧ ¬_R11 ∨ ¬v10 ∧ ¬_R00 → v00↾
¬v00 ∨ ¬v11 ∧ ¬_R00 ∨ ¬v10 ∧ ¬_R11 → v01↾
v01 ∧ (v11 ∨ _R11) ∧ (v10 ∨ _R00) → v00⇂
v00 ∧ (v11 ∨ _R00) ∧ (v10 ∨ _R11) → v01⇂
¬v11 ∧ (¬v00 ∨ ¬R10) ∧ (¬v01 ∨ ¬R01) → v10↾
¬v10 ∧ (¬v00 ∨ ¬R01) ∧ (¬v01 ∨ ¬R10) → v11↾
v11 ∨ v00 ∧ R10 ∨ v01 ∧ R01 → v10⇂
v10 ∨ v00 ∧ R01 ∨ v01 ∧ R10 → v11⇂

Noticing that all of those rules contain XORs or XNORs between the internal state and the forward drivers, pass transistor XORs can be used to help reduce gate capacitance once again. This changes the rules for the internal state to the following.

@_R11 ∧ ¬v11 ∨ @_R00 ∧ ¬v10 → x00↾
¬@_R00 ∧ v11 ∨ ¬@_R11 ∧ v10 → x00⇂
@_R00 ∧ ¬v11 ∨ @_R11 ∧ ¬v10 → x01↾
¬@_R11 ∧ v11 ∨ ¬@_R00 ∧ v10 → x01⇂
@R10 ∧ ¬v01 ∨ @R01 ∧ ¬v00 → x10↾
¬@R01 ∧ v01 ∨ ¬@R10 ∧ v00 → x10⇂
@R01 ∧ ¬v01 ∨ @R10 ∧ ¬v00 → x11↾
¬@R10 ∧ v01 ∨ ¬@R01 ∧ v00 → x11⇂
¬v01 ∨ ¬x00 → v00↾
¬v00 ∨ ¬x01 → v01↾
v01 ∧ x00 → v00⇂
v00 ∧ x01 → v01⇂
¬v11 ∧ ¬x10 → v10↾
¬v10 ∧ ¬x11 → v11↾
v11 ∨ x10 → v10⇂
v10 ∨ x11 → v11⇂

There are two subtle features of this approach. First, when the circuit is idle and all of the forward drivers are low, it will drive x00 and x01 high, keeping v00 and v01 stable, and it will drive x10 and x11 low, keeping v10 and v11 stable. Second, if x00 transitions low and x01 high, then x00⇂ will be acknowledged by v00↾ . Immediately following that, x01↾ will be acknowledged by v01⇂ . Therefore, all of the transitions on these new nodes are acknowledged and require no timing assumptions.

Ultimately, v0 and v1 control the order in which the internal state is set. If only a decrement counter is required, then R00 and R01 are eliminated. If only an increment counter is required, then R10 and R11 are eliminated. In both cases, the rules driving the internal state are no longer XORs, and x00, x01 and x10, x11 are no longer worthwhile. In the case of the decrement counter, the pass-transistor XOR driving v0 should be flipped to pass v10, v11 instead of v00, v01 . This flips the order in which v10, v11 must be set. Conversely, the increment counter should flip v1 . This converts the five transition path to three, saving some energy.

Finally, the downgoing transitions of v0 and v1 correctly acknowledge the internal state, so they can be used as expected in the reset rules of the forward drivers. The input acknowledge is simply combinational, following the standard WCHB reshuffling.

¬v0 ∧ ¬Li → R00⇂
¬v1 ∧ ¬Li ∧ (¬v00 ∨ ¬Re) → R01⇂
¬v0 ∧ ¬Ld ∧ (¬v00 ∨ ¬Re) → R10⇂
¬v1 ∧ ¬Ld → R11⇂
¬v10 ∨ ¬vz ∨ ¬R11 → Lz↾
(¬v11 ∧ ¬vn ∨ ¬R11) ∧ ¬R01 ∧ ¬R10 ∧ ¬R00 → Ln↾

All following counter units will use this as a base template, listing only the modified rules.

4.3 Clear

The clear command is fairly straightforward. Rc always acknowledges the input request with Lz and checks Re in preparation for the output request.

Re ∧ Lc → Rc↾
v10 ∧ vz ∧ R11 ∨ Rc → Lz⇂

Like the increment and decrement commands, the clear command must be gated. The forward driver always transitions on a clear command, but the command is only forwarded out Rc if vz is low. However, because the output request is guaranteed to set vz high, the gating rule must also check Rz to guarantee stability. Unfortunately, having two checks like this precludes the use of pass transistor logic.

¬_Rc ∧ (¬vz ∨ ¬Rz) → Rc↾
_Rc ∨ vz ∧ Rz → Rc⇂

The clear command always sets v00 and v10 , setting the value of this counter unit to 0. For v10 , the clear command is gated by x10 to reduce the load capacitance on x10 . This has the added effect of cleaning up instabilities on x10 , though that is ultimately unnecessary for the same reason as before.

¬v01 ∨ ¬x00 ∨ ¬_Rc → v00↾
¬v00 ∨ ¬Rc ∧ ¬x01 → v01↾
v01 ∧ x00 ∧ _Rc → v00⇂
v00 ∧ (Rc ∨ x01) → v01⇂
(¬v11 ∨ ¬_Rc) ∧ ¬x10 → v10↾
v11 ∧ _Rc ∨ x10 → v10⇂

Finally, the reset for Rc checks the transitions on both internal memory units. When the command is forwarded, Re is lowered following the usual sequence. When the command is not forwarded, vn and Rc will already be low and Re will remain high.
¬v01 ∧ ¬v11 ∧ ¬Lc ∧ (¬vn ∧ ¬Rc ∨ ¬Re) → Rc⇂
(¬v10 ∨ ¬vz ∨ ¬R11) ∧ ¬Rc → Lz↾

This approach adds a few assumptions to the pass transistor logic. First, when clearing a counter with both v01 and v11 set high, representing a state of 2, both internal memory units will transition. Before the transition, v0 will be high and v1 low. During the transition, both v0 and v1 will glitch with quite a significant voltage swing. However, those glitches follow the transitions on the internal memory with no delay and are resolved before Rc is lowered. Since v0 and v1 are only used in the forward drivers, and their use is gated by the other two inactive input requests, these glitches are completely masked. Specifically, v1 must complete its transition low before the completion of this cycle, while the upgoing transition on v0 will be checked by the next command.

Second, there is a transient state in which v10 and v11 are both high. During this transient state, x00 and x01 are left undriven. Because they are implemented by pass transistor logic, they will have already transitioned high by this point. If they have not, then there would have to be significant noise at just the right time in order to create a glitch on x00 or x01 . If x00 glitches down, then it will simply help the clear command correctly transition v00↾ and v01⇂ . If x01 glitches down, then it technically might glitch v01 up. Therefore, x00 and x01 must be held high during the clear command. These transistors can be as small as possible, as long as they are strong enough to overcome any possible noise.

¬_Rc → x00↾
¬_Rc → x01↾

4.4 Read

There are two ways to approach the read command. In the first approach, the read command propagates through the counter, but does not acknowledge. This blocks the counter from all other commands and keeps the values stored by the internal memory stable. When the command reaches the end of the counter, it sends a request for the bundled data channel formed by the latches in each counter unit and the output read request/acknowledge of the most significant counter unit. Once the output request has been acknowledged, the read command may be acknowledged throughout the counter, opening it up to the next command. This approach sacrifices throughput in favor of lower area.

In the second approach, the read command loads the values from the internal memory of each counter unit into a separate set of latches specifically for the read. Once the read command has reached the most significant counter unit and then gone through the reset phase, the bundled data request for the read channel may be sent. This allows the read command to operate at the same frequency as other commands as long as two read commands are not requested consecutively. Since it is unlikely that anyone would intend to send two read commands one after another, the read command is effectively always full throughput. This approach sacrifices area in favor of high throughput.

This section describes the second approach in detail, starting from the idzn counter template. Throughput was chosen over area because the throughput of the read and write commands is ultimately a bottleneck for later circuits. Since this circuit is implemented in the context of a digit-serial CGRA, it is prudent to ensure that the read value can be divided into digits to be converted to a digit stream. This gives four possible types of counter unit.
Of those four "first" blocks any consecutive read commands until the active read has completed, "mid" and "last" signal when the active read has completed, and "base" simply propagates the read while loading the latches. Finally, the interface drives the read handshake when everything has completed. The read propagates until it reaches a "mid" unit with its vz flag set. This signals that the remaining counter units store zero, so there is nothing left to read. If there is not a "mid" unit with its vz flag set, then the read completes when it reaches "last". The handshake on each digit is 101 handled independently so that it can be fed through a parallel to serial unit. Base Unit The base unit implementation starts with two read latches. These are only open when the read command is active, saving power otherwise. The Gray code used by the internal memory units needs to be converted to the standard binary code for the bit parallel channel. Luckily, that work is already done. The first bit, O00, O01 , is equal to the XOR of the internal memory units, which is covered by v1 and v0 . The second bit, O10, O11 is equal to v10 and v11 . 0. v10 , v00 = O10 , O00 1. v10 , v01 = O10 , O01 2. v11 , v01 = O11 , O00 3. v11 , v00 = O11 , O01 ¬O00 ∨ ¬_Rr ∧ ¬v0 → O01↾ ¬O01 ∨ ¬_Rr ∧ ¬v1 → O00↾ O00 ∧ (_Rr ∨ v0) → O01⇂ O01 ∧ (_Rr ∨ v1) → O00⇂ ¬O10 ∨ ¬_Rr ∧ ¬v10 → O11↾ ¬O11 ∨ ¬_Rr ∧ ¬v11 → O10↾ O10 ∧ (_Rr ∨ v10) → O11⇂ O11 ∧ (_Rr ∨ v11) → O10⇂ The transitions on the read memories are acknowledged using the same approach that was used for the internal memories with the pass transistor XORs. Much like v0 and v1 for the clear command, these will be allowed to transition for other commands, causing glitches. However, we assume that the upgoing transitions are short enough compared to the cycle time that they will have completed before the read command is active. @O00 ∧ ¬v0 ∨ @O01 ∧ ¬v1 → o0↾ ¬@O01 ∧ v0 ∨ ¬@O00 ∧ v1 → o0⇂ @O10 ∧ ¬v10 ∨ @O11 ∧ ¬v11 → o1↾ ¬@O11 ∧ v10 ∨ ¬@O10 ∧ v11 → o1⇂ The forward driver must acknowledge the upgoing transitions on v0 and v1 since those signals are used to set one of the read latches. The input acknowledge must return the same status that it did in the previous command so that the requesting process does not see a change in the counter status. This is surprisingly expensive, and requires a validity tree Rv to keep the transistor stacks driving Ln at a reasonable length. 102 Fig. 52: Read counter components. (v0 ∨ v1) ∧ Re ∧ Lr → Rr↾ Rr = Rr ¬_R01 ∨ ¬_R10 ∨ ¬_R00 → Rv↾ v10 ∧ vz ∧ (R11 ∨ v00 ∧ Rr) → Lz⇂ (v11 ∨ vn) ∧ R11 ∨ (v01 ∨ v11 ∨ vn) ∧ Rr ∨ Rv → Ln⇂ The reset phase acknowledges the newly introduced pass transistor XORs and everything else resets combinationally. ¬Re ∧ ¬Lr ∧ ¬o0 ∧ ¬o1 → Rr⇂ _R01 ∧ _R10 ∧ _R00 → Rv⇂ ¬v10 ∨ ¬vz ∨ ¬R11 ∧ (¬v00 ∨ ¬Rr) → Lz↾ (¬v11 ∧ ¬vn ∨ ¬R11) ∧ (¬v11 ∧ ¬vn ∧ ¬v01 ∨ ¬Rr) ∧ ¬Rv → Ln↾ First Unit Starting from the "base" unit, the "first" unit will need an extra dataless channel Xr to sync with the interface and block consecutive reads. Normally, this would just be added to the read request as follows. (v0 ∨ v1) ∧ Xre ∧ Re ∧ Lr → Rr↾ ¬Xre ∧ ¬Re ∧ ¬Lr ∧ ¬o0 ∧ ¬o1 → Rr⇂ Rr = Rr Xrr = Rr Unfortunately, this makes the transistor stack for the reset phase of the output request too long. To get around this, a new variable We is introduced to handle the acknowledgement of o0 and o1 . This increases the length of the set phase transistor stack to five, but reduces the reset phase back down to four. 
First Unit

Starting from the "base" unit, the "first" unit will need an extra dataless channel Xr to sync with the interface and block consecutive reads. Normally, this would just be added to the read request as follows.

(v0 ∨ v1) ∧ Xre ∧ Re ∧ Lr → Rr↾
¬Xre ∧ ¬Re ∧ ¬Lr ∧ ¬o0 ∧ ¬o1 → Rr⇂
Rr = Rr
Xrr = Rr

Unfortunately, this makes the transistor stack for the reset phase of the output request too long. To get around this, a new variable We is introduced to handle the acknowledgement of o0 and o1. This increases the length of the set phase transistor stack to five, but reduces the reset phase back down to four.

(v0 ∨ v1) ∧ Xre ∧ We ∧ Re ∧ Lr → Rr↾
Rr = Rr
Xrr = Rr
¬_Rr ∧ ¬o0 ∧ ¬o1 → We⇂
¬Xre ∧ ¬We ∧ ¬Lr ∧ ¬Re → Rr⇂
_Rr ∨ o0 ∨ o1 → We↾

This definition for We relies upon the Half-Cycle Timing Assumption. Alternatively, inverting the order in which the read latches are set and inverting the XOR gates driving o0 and o1 avoids this assumption. This would allow We to acknowledge Rr and the now upgoing transitions for o0 and o1.

Mid Unit

Starting from the rules in the "first" unit, the "mid" unit only forwards the read request if the vz flag is not set. Furthermore, the "mid" unit forwards the value of the vz and vn flags out Xr to signal to the interface unit whether this is the last value in the counter. Pass transistor logic is sufficient to gate these signals.

Rr = Xrd[0]
@Rr ∧ ¬vn → Xrd[1]↾
¬@Rr ∧ vz → Xrd[1]⇂
vn → Xrd[1]⇂
@Rr ∧ ¬vz → Xrd[0]↾
¬@Rr ∧ vn → Xrd[0]⇂
vz → Xrd[0]⇂

The vz case cannot be allowed to deadlock the system, so ¬vn is used to skip the output acknowledge in the reset phase.

¬Xre ∧ ¬We ∧ ¬Lr ∧ (¬vn ∨ ¬Re) → Rr⇂

Last Unit

The "last" unit is the same as "mid" except that all paths for forwarding the command have been removed. Normally, when doing this, vz and vn would be removed. However, in the case of the read command they must remain, recording the last value sent across Lz and Ln. This ensures that the status returned by the read command remains consistent. Specifically, an increment might overflow the counter, setting the value of the counter to 0. However, this case always returns Ln. If the overflow was not recorded by vz and vn, then the next read would return Lz. This would switch the vz and vn latch in the previous counter unit, potentially causing an instability on the status wires down the whole counter.

Read Interface

The read interface must block the read channel R until all of the read latches have resolved. This is signaled by the reset phase of Xo from the "mid" or "last" units. Then, the read interface must block the next read command until the read channel R has been acknowledged. This is done by holding Xie low. Ultimately, to keep the command throughput high, the requests on Xi and Xo are acknowledged as quickly as possible, with Xi being a higher priority. Therefore, the read interface waits for the requests on Xi and Xo, storing the cap value in v before acknowledging them. Once their output requests have reset, the read latches have resolved, so v is copied to the output request on the read channel. Once the read channel has acknowledged the request, the enables on Xi and Xo are raised, unblocking the next read command. Finally, the output request on R is reset.

v := 1, Xoe+, Xie+, Rr := null;
[¬Re ∧ ¬Xir ∧ ¬Xor];
*[(v := Xor; [Re]; Xoe- ∨∨ [Xir ∧ Re]; Xie-);
  [¬Xir ∧ ¬Xor];
  Rr := v;
  [¬Re];
  Xoe+, Xie+;
  Rr := null]

The copy of Xo's request to v is handled quite simply by an SR latch.

v0 ∨ Xo0 → v1⇂
v1 ∨ Xo1 → v0⇂
¬v0 ∧ ¬Xo0 → v1↾
¬v1 ∧ ¬Xo1 → v0↾

The completion of this circuitry is detected with an XOR driving the C-element for Xoa. Meanwhile, the C-element driving Xia acknowledges the request on Xi as soon as it arrives.

Re ∧ (v0 ∧ Xo0 ∨ v1 ∧ Xo1) → Xoa↾
Re ∧ Xi0 → Xia↾
Xoa → Xoe⇂
Xia → Xie⇂

Once the request on Xo has been lowered, the value stored in v is copied to the output request of the read channel R.

¬Xo0 ∧ ¬v1 ∧ ¬Xoe → Rc0↾
¬Xo1 ∧ ¬v0 ∧ ¬Xoe → Rc1↾
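For reference, the two state-holding primitives the interface relies on, the SR latch copying Xo's request into v and the C-elements driving the acknowledges, reduce to the following behavioral sketch (illustrative code only, not a circuit description):

    def c_element(a: bool, b: bool, prev: bool) -> bool:
        """Output follows the inputs when they agree; otherwise it holds."""
        return a if a == b else prev

    class SRLatch:
        """Cross-coupled pair; set and reset are assumed mutually exclusive."""
        def __init__(self):
            self.q = False
        def step(self, s: bool, r: bool) -> bool:
            assert not (s and r)
            if s: self.q = True
            if r: self.q = False
            return self.q

    assert c_element(True, False, prev=True) is True   # disagreement holds state
    latch = SRLatch()
    assert latch.step(True, False) and not latch.step(False, True)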
The acknowledgement of the read channel unblocks Xi and Xo.

¬Re ∧ ¬Xia → Xoa⇂
¬Re ∧ ¬Xi0 → Xia⇂
¬Xoa → Xoe↾
¬Xia → Xie↾

Finally, the output requests to the read channel are lowered, resetting the handshake.

Xo0 ∨ v1 ∨ Xoe → Rc0⇂
Xo1 ∨ v0 ∨ Xoe → Rc1⇂

4.5 Write

Unlike the read counter, there is only one reasonable way to approach the bundled-data write. Once again, the write command propagates through the counter, and at each unit loads in the written value. Once the value has been successfully loaded into the internal memory, other commands may follow almost immediately. The only command that must wait is another consecutive write. This is similar to the high throughput solution for the read. The equivalent of the area-saving solution does not save area for the write counter because there are no extra latches to remove.

Similar to the read counter, this is implemented in the context of a digit-serial CGRA. Therefore, the write data should be divided into digits from a serial to parallel unit. This again gives four different components. The "first" unit blocks any consecutive write command until the active write command has completed, "mid" and "last" signal when the active write has completed, and "base" propagates the write while loading the written value. Once again, the interface drives the write handshake once everything has completed. The write propagates until it reaches the cap token of the digit stream, signalling that there is no more data to write. If the digit stream is longer than the total number of counter units, then there will be an overflow. In this case, the write completes when it reaches "last". The handshake on each digit is handled independently so the write data can be received from the serial to parallel unit.

Base Unit

Once again, the base unit circuit implementation starts with the latches. The write data received from the serial to parallel unit uses the standard binary encoding. That means it has to be converted to the Gray code used by the counter. This is done by an XOR gate in the datapath, and the transitions are covered by a delay line on the request for the write data. This will be discussed in the write interface section.

0. W10 , W00 = v10 , v00
1. W10 , W01 = v10 , v01
2. W11 , W00 = v11 , v01
3. W11 , W01 = v11 , v00

Fig. 53: Write counter components.

¬W00 ∧ ¬W10 ∨ ¬W01 ∧ ¬W11 → Wx0↾
(W00 ∨ W10) ∧ (W01 ∨ W11) → Wx0⇂
Wx0 → Wx1⇂
¬Wx0 → Wx1↾

Then, the write command arrives on Lw and is forwarded to Rw. Ultimately, the write interface must compute where to place the zero flag. This information is passed to the counter unit through a datapath signal, Z. Now, the write command is acknowledged with the new zero flag for the previous counter unit. This ultimately makes the pull-up network for the gate driving Ln too long. So, the Half-Cycle Timing Assumption is used with a validity tree, Rv, built from the internal nodes of the C-elements driving R01, R10 and R00.

Re ∧ Lw → Rw↾
Rw = Rw
¬_R01 ∨ ¬_R10 ∨ ¬_R00 → Rv↾
v10 ∧ vz ∧ R11 ∨ Rw ∧ Z1 → Lz⇂
(v11 ∨ vn) ∧ R11 ∨ Rv ∨ Rw ∧ Z0 → Ln⇂

The internal state is set using the typical method. Careful consideration is taken to keep the rules for increment and decrement as short as possible so as to not interfere with their performance. Ultimately, the write command will create instabilities if assumptions are not made. Specifically, during a write command, both latches can switch their value. This will cause transitions in v0 and v1 similar to the transitions created by the clear command.
It is assumed that the downgoing transitions on v0 or v1 have sufficiently resolved before a new input command is received.

¬v01 ∨ ¬x00 ∨ ¬Wx1 ∧ ¬_Rw → v00↾
¬v00 ∨ ¬x01 ∨ ¬Wx0 ∧ ¬_Rw → v01↾
v01 ∧ x00 ∧ (Wx1 ∨ _Rw) → v00⇂
v00 ∧ x01 ∧ (Wx0 ∨ _Rw) → v01⇂
(¬v11 ∨ ¬W11 ∧ ¬_Rw) ∧ ¬x10 → v10↾
(¬v10 ∨ ¬W10 ∧ ¬_Rw) ∧ ¬x11 → v11↾
v11 ∧ (W11 ∨ _Rw) ∨ x10 → v10⇂
v10 ∧ (W10 ∨ _Rw) ∨ x11 → v11⇂

Both internal memory units driving v00, v01 and v10, v11 are nlatches with respect to the write. This means that setting a value, v01 for example, would cause v01↾; v00⇂ in that order. For v00 and v01, the downgoing transitions of v00 or v01 are passed directly through the pass transistors driving v0 and v1 with no gate delay. These are assumed to complete within five transitions of when they are enabled. This includes the reset of the forward driver Rw, the input enable Lz, Ln, and then a new input command Li, Ld, Lw. For v10 and v11, the downgoing transitions of v0 or v1 are enabled as soon as v10 or v11 transitions high, one transition earlier than the v00 and v01 case. Then, there is a single gate delay to drive v0 and v1 low as a result. Therefore, the assumption requires these transitions to complete within six transitions of when they are enabled. This includes the downgoing transition of the internal memory v10, v11, the reset of the forward driver Rw, the input enable Lz, Ln, and then a new input command Li, Ld, Lw. Effectively, this is equivalent to the bounds assumed by a Half-Cycle Timing Assumption.

Once the internal memory has transitioned, two pass-transistor XORs, v0w and v1w, sit between each internal memory and the write value to signal its completion. These two signals will be unstable as the counter increments or decrements and the internal state changes, but those transitions are passed directly through the pass transistors with no delay. Therefore, the upgoing transitions on v0w and v1w will have resolved long before the write command is active.

@v00 ∧ ¬Wx0 ∨ @v01 ∧ ¬Wx1 → v0w↾
¬@v00 ∧ Wx1 ∨ ¬@v01 ∧ Wx0 → v0w⇂
@v10 ∧ ¬W10 ∨ @v11 ∧ ¬W11 → v1w↾
¬@v10 ∧ W11 ∨ ¬@v11 ∧ W10 → v1w⇂

Finally, v0w and v1w transition low, signalling the completion of the write, and the forward driver Rw is reset. The input enable follows combinationally.

¬Re ∧ ¬Lw ∧ ¬v0w ∧ ¬v1w → Rw⇂
_R01 ∧ _R10 ∧ _R00 → Rv⇂
(¬v10 ∨ ¬vz ∨ ¬R11) ∧ (¬Rw ∨ ¬Z1) → Lz↾
(¬v11 ∧ ¬vn ∨ ¬R11) ∧ ¬Rv ∧ (¬Rw ∨ ¬Z0) → Ln↾

Similar to the clear counter, there is a transient state in which both v10 and v11 are high, leaving x00 and x01 dynamic. Once again, this can be resolved with two small transistors.

¬_Rw → x00↾
¬_Rw → x01↾

First Unit

Starting from the "base" unit, the "first" unit will need an extra dataless channel Xr to sync with the interface and block consecutive writes. Normally, this would just be added to the write request.

Xre ∧ Re ∧ Lw → Rw↾
¬Xre ∧ ¬Re ∧ ¬Lw ∧ ¬v0w ∧ ¬v1w → Rw⇂
Rw = Rw
Xrr = Rw

However, this makes the transistor stack for the reset phase of the output request too long. Once again, a new variable We is introduced to handle the acknowledgement of v0w and v1w.

Xre ∧ We ∧ Re ∧ Lw → Rw↾
Rw = Rw
Xrr = Rw
¬_Rw ∧ ¬v0w ∧ ¬v1w → We⇂
¬Xre ∧ ¬We ∧ ¬Lw ∧ ¬Re → Rw⇂
_Rw ∨ v0w ∨ v1w → We↾

This definition for We would rely upon the Half-Cycle Timing Assumption. Again, this can be avoided by inverting the order in which the write latches are set and inverting the XOR gates driving v0w and v1w. This would let We acknowledge Rw and the now upgoing transitions of v0w and v1w.
This inversion would remove the transient state in which x00 and x01 are dynamic, so the previously added keepers would no longer be necessary. However, it also adds a transient state in which x10 and x11 are dynamic. Therefore, two new keepers would be needed.

Rw → x10⇂
Rw → x11⇂

Mid Unit

Starting from the rules in the "first" unit, the "mid" unit only forwards the write request if the data token is not a cap. The interface communicates this information to the "mid" unit through a datapath signal called X. This is now used to gate Rw.

@Rw ∧ ¬X1 → Rw↾
¬@Rw ∧ X0 → Rw⇂
X1 → Rw⇂

Unfortunately, this signal is stored in a platch. This means that when the platch switches, the pass transistor logic driving Rw will go dynamic. Therefore, a keeper is needed.

_Rw ∧ ¬Rw → Rw⇂

Then, the vz flag must be set. Careful consideration is taken to ensure that this does not interfere with the performance of the other commands.

¬vn ∨ ¬Rz → vz↾
¬vz ∧ (¬X1 ∨ ¬Rw) ∨ ¬Rn → vn↾
vn ∧ Rz → vz⇂
(vz ∨ Xd1 ∧ Rw) ∧ Rn → vn⇂

Finally, the acknowledgement of Re is skipped when the output command is not sent. Furthermore, the new transition on vz, vn must also be acknowledged at this point. A pass transistor OR gate can make both of these things happen. X is used to gate the transition on _vz into a new signal called Ze. Then, Rw waits for this signal to transition low. Because it is a pass transistor gate, transitions on Ze will follow all transitions on _vz with no delay when it is open. This will be unstable during increment and decrement commands, but will be stable for the write. This acknowledges the transition on _vz, vz, vn with no timing assumptions for the QDI control.

¬@_vz ∧ Xd1 → Ze⇂
@_vz ∧ ¬Xd0 → Ze↾
¬X1 → Ze↾
¬Xre ∧ ¬We ∧ ¬Lw ∧ (¬Ze ∨ ¬Re) → Rw⇂

Last Unit

The "last" unit is the same as "mid" except that all paths for forwarding the command have been removed. Unlike the read counter, there is no need to store the Lz, Ln flags in this unit.

Write Interface

The write interface is responsible for blocking consecutive write commands. Unfortunately, the write data channel is significantly slower than the command channel. For the next command to be propagated through the counter, the request on Xi must be acknowledged as soon as possible. For the next write to propagate through the counter, the request on Xi must be enabled. For now, it is assumed that the input data remains stable while the write enable We is low. This means that the write enable can be lowered as soon as the write command has been initiated, as signaled by the request on Xi. However, the write enable cannot transition high until after the write command has completed and the written data has all been acknowledged in the counter units, as signalled by Xor⇂. Unfortunately, these constraints do not lend themselves to any of the standard reshufflings (WCHB, PCHB, PCFB).

Fig. 54: Bundled-Data write counter interface.

Upon initialization, Xir is low, blocking any write command from proceeding, and We is high, ready to receive write data. Once the write data has arrived on W, as signalled by Wr↾, the write command is unblocked by Xir↾. Once the write command is unblocked, the write data is acknowledged by lowering We. This is done as soon as possible to overlap the slow write data handshake on W with the write command propagation through the counter. Xir is also lowered, resetting the handshake in the counter unit and allowing further commands to proceed as long as they are not write commands.
Once the write command has completely propagated through the counter, as signalled by Xor↾, it is acknowledged by lowering Xoe. Finally, when Xor is lowered, the write command has entered the reset phase in the last counter unit and all of the latches in the counter have stabilized. At this point, We is raised to signal that the process is ready for new write data.

Xir⇂, We↾, Xoe↾;
[Xie ∧ ¬Wr ∧ ¬Xor];
∗[([Xie ∧ Wr]; Xir↾ ∥ Xoe↾);
  We⇂;
  ([¬Xie]; Xir⇂ ∥ [Xor]; Xoe⇂);
  [¬Xor ∧ ¬Wr];
  We↾ ]

This whole handshake is anchored by We. We is lowered once Xir and Xoe have gone high and the write command is propagating through the counter. Then, it is raised once Xor and Wr are lowered, signalling the completion of the write. To disambiguate some of the other states in the handshake, the downgoing transitions on Xir and Xoe must also be acknowledged.

Xir ∧ Xoe → _We↾
¬Xir ∧ ¬Xoe ∧ ¬Xor ∧ ¬Wr → _We⇂
_We → We⇂
¬_We → We↾

For Xir, there is the usual dependency on Xie, but then Xir must also wait for the write data to arrive before letting the command propagate. This is covered by Wr in the Xir↾ rule. We then disambiguates some of the other states in the handshake.

¬Xie ∧ ¬We → Xir⇂
Xie ∧ We ∧ Wr → Xir↾

For Xoe, We↾ already acknowledges the downgoing transition of Xor. So, Xoe must only acknowledge the upgoing transition.

_We ∧ Xor → Xoe⇂
¬_We → Xoe↾

Overall, this interface is as parallel as possible, allowing the write command propagation to overlap the write data handshake.

Chunked Write Interface

The chunked write interface has the same underlying handshake as the normal write interface, but adds channel actions to propagate the zero flag from digit to digit. This allows each digit to be loaded independently so the write interface may accept input from the serial to parallel unit. There are two input channels: W carries the write data for this digit, and Uz carries the zero flag from the next most significant digit. Then, there are the two channels that block consecutive writes, Xi and Xo. Finally, there is one output channel, Dz, which carries the computed zero flag for this digit to the next digit of lesser significance.

The handshake with Xi, Xo and W remains largely unchanged. Meanwhile, logic for handling non-cap tokens and the handshake for the Dz and Uz channels have been merged in. The handshake starts with We high, signifying that it is ready to receive write data. Xir and Xoe are both low, blocking any write command from propagating through the counter. Uz is only enabled when the token received from W is not a cap. This guarantees that the requests on Uz will be low unless they are required by the handshake. At the start of the handshake, the requests on Dz and Xi along with the enable on Xo are raised in parallel. This allows each of these signals to transition as soon as possible in their respective handshakes. Once this is done, We and Uze transition low, acknowledging the slow handshakes for the write data as soon as possible. Then, the requests on Dz and Xi along with the enable on Xo are reset in parallel. This allows the next counter command to propagate through the counter. When the command arrives at the end of the counter, Xoe is lowered. Then, once the request on Xo has lowered, all of the latches in the counter are stable, and the process can therefore request new write data from W and Uz.

Fig. 55: Chunked Bundled-Data write counter interface.
We↾, Dz0⇂, Dz1⇂, Xir⇂, Xoe⇂, Uze⇂;
[¬W0 ∧ ¬W1 ∧ ¬Uz0 ∧ ¬Uz1 ∧ Dze ∧ Xie];
∗[[ W0 → Uze↾
  ▯ W1 → skip
  ];
  ( [ Dze ∧ (W0 ∧ (Uz1 ∨ Zd0) ∨ W1 ∧ Zd0) → Dz1↾
    ▯ Dze ∧ (W0 ∧ Uz0 ∨ W1) ∧ Zd1 → Dz0↾
    ]
  ∥ [Xie ∧ (W0 ∧ (Uz0 ∨ Uz1) ∨ W1)]; Xir↾
  ∥ Xoe↾
  );
  We⇂; Uze⇂;
  ( [¬Dze]; Dz0⇂, Dz1⇂
  ∥ [¬Xie]; Xir⇂
  ∥ [Xor]; Xoe⇂
  );
  [¬Xor ∧ ¬W0 ∧ ¬W1 ∧ ¬Uz0 ∧ ¬Uz1];
  We↾ ]

First, the initial zero flag for this digit is loaded into the bundled datapath. When W1 is high, the digit is a cap token, meaning it is the last token in the stream. In this case, the initial zero flag is set to true. Otherwise, when W0 is high, the zero flag is loaded from the next most significant digit as communicated over Uz. Finally, the cap token flag is also loaded from W into the datapath. This signal will be used to stop the write command from propagating through the rest of the counter.

Zu0 ∨ Uz1 ∧ W0 → Zu1⇂
Zu1 ∨ Uz0 ∧ W0 ∨ W1 → Zu0⇂
¬Zu0 ∧ (¬Uz1 ∨ ¬W0) → Zu1↾
¬Zu1 ∧ (¬Uz0 ∨ ¬W0) ∧ ¬W1 → Zu0↾
X1 ∨ W1 → X0⇂
X0 ∨ W0 → X1⇂
¬X1 ∧ ¬W1 → X0↾
¬X0 ∧ ¬W0 → X1↾

Then, these latches need time to resolve. To reduce the number of delay lines required and reduce energy, the validity of each input is computed and delayed.

W0 ∨ W1 → dWv↾ // delayed
¬W0 ∧ ¬W1 → dWv⇂
Uz1 ∨ Uz0 → dUzv↾ // delayed
¬Uz1 ∧ ¬Uz0 → dUzv⇂

For the handshake, the requests on Dz must wait for Zd to resolve in the datapath, but Zd is only dependent upon W. When Zd is false, meaning not all of the bits in this digit are one, false is sent on Dz before receiving the zero flag from Uz. This early-out mechanism effectively breaks the carry chain, letting it resolve in O(log N) time on average instead of O(N). If Zd is true, then the result sent on Dz is determined by Uz or W. If this digit is a cap token, as signified by W1, then the value of Zd is simply forwarded. If this digit is not a cap token, then the value of Uz is forwarded. However, it is not necessary to wait until after the delay line on Uz because Zd has already been computed. Because Uze is kept low in the case of a cap token, the requests on Uz will be low, preventing any instabilities on Dz.

Then, the request on Xi must be driven. The value received from Uz is loaded into Zu and used in the datapath to compute the zero flag for each bit in the digit. It is therefore necessary to wait until that process has resolved after the delay lines on Uz, so Xir waits for both dWv and dUzv. Once the datapath has resolved, a request is sent on Xir, which unblocks the write command from propagating through the counter units. In parallel, Xo is also enabled, letting the command propagate out the last unit in the counter as well. Finally, Rv waits for all three of these actions to resolve in parallel.

Dze ∧ We ∧ dWv ∧ (Uz1 ∨ Zd0) → Dz1↾
Dze ∧ We ∧ dWv ∧ (Uz0 ∨ W1) ∧ Zd1 → Dz0↾
Xie ∧ We ∧ dWv ∧ (dUzv ∨ W1) → Xir↾
¬_We → _Xoe⇂
¬_Xoe → Xoe↾
Xoe ∧ (Dz0 ∨ Dz1) ∧ Xir → Rv↾

Once Rv signals the completion of the forward driving rules, the write data on W is acknowledged. It is assumed that the bundled datapath remains unchanged until after W is enabled again. This is not a standard implementation of the bundled-data protocol, since typically there would be a layer of latches to ensure that. However, the serial to parallel units are able to guarantee that feature without the extra layer of latches.

Rv → _We↾
_We → We⇂
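The zero-flag chain just described is easy to restate in software. The sketch below is behavioral only and uses illustrative names; it assumes two-bit digits and, as stated above, that a digit's flag is false unless all of the bits in the digit are one, in which case the cap digit forwards its own Zd and every other digit forwards the flag of its more significant neighbor.

    DIGIT_BITS = 2
    ALL_ONES = (1 << DIGIT_BITS) - 1

    def dz_flags(digits, cap_index):
        """The flag each digit sends down Dz, walking from the cap token down."""
        flags = []
        uz = True                        # the cap digit forwards its own Zd
        for i in range(cap_index, -1, -1):
            zd = digits[i] == ALL_ONES   # resolves from W alone: the early out
            dz = zd and uz
            flags.append(dz)
            uz = dz
        return list(reversed(flags))

    # Digits ...11 01 11 with the cap on top: the 01 digit breaks the chain.
    assert dz_flags([0b11, 0b01, 0b11], cap_index=2) == [False, False, True]

In hardware, any digit whose Zd is false resolves its Dz without waiting on Uz at all, which is, loosely, where the O(log N) average depth comes from.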
Now that W has been acknowledged, the write command must propagate through the counter units and the request on W must be lowered. These both represent the longest parts of their respective handshakes, and this reshuffling handles them in parallel. Upon completion of each process, the requests are lowered immediately, following a reshuffling akin to the PCHB. This unblocks the counter, allowing subsequent commands to propagate. Then, Rv is lowered once each signal has been reset.

¬Dze ∧ ¬We → Dz1⇂
¬Dze ∧ ¬We → Dz0⇂
¬Xie ∧ ¬We → Xir⇂
_We ∧ Xor → _Xoe↾
_Xoe → Xoe⇂
¬Xoe ∧ ¬Dz0 ∧ ¬Dz1 ∧ ¬Xir → Rv⇂

Finally, before raising We, the request on Xo must lower. This guarantees that all of the latches in the counter units have stabilized to their written value. Then, raising We requests new data on the datapath.

¬Rv ∧ ¬Xor ∧ ¬dWv ∧ ¬dUzv → _We⇂
¬_We → We↾

Uze is almost entirely determined by We. However, Uze can only go high when this digit is not a cap token. Therefore, a pass transistor AND gate ensures that no extra delay is introduced when implementing this feature.

@W0 ∧ ¬_We → Uze↾
¬@W0 ∧ We → Uze⇂
_We → Uze⇂

4.6 Evaluation

In [75], I designed a large class of QDI counter circuits, showing significant gains compared to other asynchronous counters. While they were faster and used significantly less energy, they required up to twice as many transistors to implement. These were compared to the two other known QDI counters, from [96] and [98]. To my knowledge, no one had done quite as thorough an exposition on QDI counters and their capabilities.

Type           Trans      Frequency  Energy/Op  Latency
d_z [75]       50N        2.73 GHz   24.01 fJ   N/A
dzn [75]       102N+10    2.15 GHz   48.17 fJ   399 ps
idzn [75]      146N+12    2.03 GHz   56.05 fJ   421 ps
idczn [75]     174N+14    2.00 GHz   40.62 fJ   442 ps
idrzn [75]     246N+14    1.88 GHz   89.51 fJ   441 ps
idrzn_bd [75]  188N+32    1.77 GHz   75.20 fJ   441 ps
dwzn [75]      192N+12    1.86 GHz   43.81 fJ   487 ps
is_zn [75]     146N+61    2.08 GHz   45.52 fJ   139 ps

Type           Trans      Frequency  Energy/Op  Latency
d_z_n [96]     117N+32    1.42 GHz   73.34 fJ   468 ps
id_zn [98]     398N+26    0.60 GHz   152.76 fJ  1150 ps

The designs presented in this chapter were also compared to the typical counter synthesized by Synopsys Design Compiler in the same technology. While it is both fast and small, it is also comparatively power hungry. The optimizations introduced here improve upon the previous designs, cutting the transistor count by more than half, increasing the frequency by 43%, and decreasing the energy usage by 20%. Furthermore, the circuit template is simplified, making it easier to fit more functionality into a single counter bit before running into the limitations of the WCHB template. These metrics are averaged using the carry chain length statistics from the SPEC2006 benchmark in Fig. 56, which show that the vast majority of increments and decrements only carry about five bits past the least significant bit. As shown in Fig. 57, the counters from this chapter are nearing the clock speed typically achieved in extremely optimized synchronous architectures while consuming less than half as much energy during operation, using only 10.5 more transistors per bit. This is true to varying degrees across all of the possible counter commands.
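The carry-chain statistic behind these averages is simple to restate in software: an increment ripples through the trailing ones of a value and a decrement through the trailing zeros. The sketch below states the measurement itself, not the benchmark methodology.

    def inc_chain(x: int) -> int:
        n = 0
        while x & 1:                 # carry ripples through trailing ones
            n += 1
            x >>= 1
        return n + 1                 # plus the bit that finally flips

    def dec_chain(x: int) -> int:
        n = 0
        while x and x & 1 == 0:      # borrow ripples through trailing zeros
            n += 1
            x >>= 1
        return n + 1

    assert inc_chain(0b0111) == 4    # 7 -> 8 flips four bits
    assert dec_chain(0b1000) == 4    # 8 -> 7 flips four bits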
Type      Trans  Frequency  Energy/Op
id_c_zn   74N    1.00 GHz   169.18 fJ
id_c_zn   74N    2.00 GHz   116.75 fJ
id_c_zn   74N    3.00 GHz   98.24 fJ
id_c_zn   74N    4.00 GHz   86.12 fJ

Type        Trans         Frequency  Energy/Op
idzn        71N           3.57 GHz   42.38 fJ
idczn       84.5N         3.11 GHz   41.41 fJ
idrzn_bd    102.5N+32     3.28 GHz   53.93 fJ
idwzn_bd    104.5N+42     3.25 GHz   65.31 fJ
idrzn_bdN   102.5N+53M    3.24 GHz   110.06 fJ
idwzn_bdN   104.5N+164M   3.22 GHz   87.05 fJ
dwzn_bd     76N+42        3.48 GHz   58.89 fJ
dwzn_bdN    76N+164M      3.43 GHz   81.22 fJ

Fig. 56: The distribution of carry chain lengths for increment and decrement commands in the SPEC2006 benchmark.

Fig. 57: Measured Performance and Energy for an array of counters.

CHAPTER 5
STREAM MANIPULATION

Three operations are fundamental for managing adaptive digit-serial streams. First, given that two adaptive digit streams are likely to be different lengths, an operator with multiple inputs will have to sign-extend all inputs to the length of the longest stream. Second, once an operator is complete, there are likely to be redundant tokens in the result that should be removed with sign compression. Third, some operators require parallel inputs. Since data is transmitted serially, it will have to be actively converted to parallel. This chapter will explore multiple implementations of each in the context of different encodings and topologies.

5.1 Sign Extension

Sign-extension is necessary for any digit-serial operation with multiple inputs. At best, the control circuitry necessary for sign-extension can be integrated into the circuitry for the operation itself. At worst, it can be shared across all of the operations in a single CGRA execution node. This section covers both a QDI approach and an integrated QDI/BD approach.

5.1.1 Behavioral Specification

The function implemented by these approaches remains fairly consistent. Each input stream, communicated over a channel X, consists of a sequence of tokens in which Xd, or "data", communicates the value of the bit or digit and Xc, or "cap", communicates the end-of-stream marking. Xc is true for the last token in the stream and false otherwise. For the sign-extension process, there are two input channels, A and B, and one output channel, S. If neither input token is a cap, then both tokens can be acknowledged. If only one input token is a cap, then that token is not acknowledged. This keeps the token waiting on the channel interface, effectively duplicating it. Meanwhile, the other token is acknowledged until a cap token has been received on both channels. Once that happens, both tokens can be acknowledged to complete the operation.

∗[[ !Ac ∧ !Bc → S!({Ad, Bd}, 0); A?, B?
  ▯ Ac ∧ !Bc → S!({Ad, Bd}, 0); B?
  ▯ !Ac ∧ Bc → S!({Ad, Bd}, 0); A?
  ▯ Ac ∧ Bc → S!({Ad, Bd}, 1); A?, B?
]]

Meanwhile, the output channel combines the two input bits Ad and Bd into an output with a single end-of-stream marking. In the context of another operation, part of that operation can be implemented here as an optimization to reduce the size of the output channel, reduce the complexity of the next pipeline stage, and increase the overall frequency. An example of this would be adding Ad and Bd into a 3-valued output in preparation for a bit-serial addition operation.
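A software model of this specification makes a useful cross-check. The sketch below is behavioral only: it assumes each token is a (data, cap) pair, least significant first, and simply pairs the two data values rather than folding in a downstream operation.

    def sign_extend(a, b):
        out, i, j = [], 0, 0
        while True:
            (ad, ac), (bd, bc) = a[i], b[j]
            out.append(((ad, bd), int(ac and bc)))
            if ac and bc:
                return out           # both caps: emit the final token and stop
            if not ac: i += 1        # a cap token is held, not consumed,
            if not bc: j += 1        # effectively duplicating it

    #  3 = bits 1, 1, then cap 0      -2 = bits 0, then cap 1
    a = [(1, 0), (1, 0), (0, 1)]
    b = [(0, 0), (1, 1)]
    assert sign_extend(a, b) == [((1, 0), 0), ((1, 1), 0), ((0, 1), 1)]

The shorter stream's cap token is replayed until the longer stream also caps, which is exactly the duplication described above.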
For multi-bit digits, a problem arises in the treatment of the cap token. For the above specification, the most significant digit is simply repeated. The last digit in the stream could be constrained to either 0 or -1. If the digits are represented with a binary encoding, then this would be encoded as all zeros or all ones, which faithfully implements two's complement encoding. Alternatively, extra circuitry could be introduced in the datapath that duplicates the most significant bit in the cap token to the remaining bits in a new cycle, sign-extending the digit. This would remove all constraints from the cap token and reduce the number of tokens necessary to represent a given value.

5.1.2 QDI Only (PCHB)

If reliability in extreme temperature environments is the most important factor for the design, then a QDI-only design is likely optimal. This reliability comes at a direct cost in throughput, energy, and area. Because of the built-in control-data split, the best QDI template to use for these operations is likely PCHB. It is the simplest template in which the reset of the output data is not dependent upon the acknowledgement of the input data. This minimizes circuit complexity in the face of wide data operations. However, parts of the WCHB template can simplify the input acknowledgement for the control. If the data grows too wide, then it should be desynchronized from the control with its own pipeline circuitry. Assuming digit sizes of 4 bits or less, this is not a concern.

Per the CHP specification, there are four conditions that determine the behavior of the process. Unfortunately, while these conditions do share various behaviors, they are not shared in a way that can be merged per the first optimization rule in Chapter 3 due to acknowledgement constraints. Therefore, four C-elements are required to record which condition is covered by this cycle. These four signals are then used to generate the output request.

en ∧ Ac0 ∧ Bc0 → Xd0↾
en ∧ Ac1 ∧ Bc0 → Xd1↾
en ∧ Ac0 ∧ Bc1 → Xd2↾
en ∧ Ac1 ∧ Bc1 → Xd3↾
Xd0 ∨ Xd1 ∨ Xd2 → Sc0↾
Sc1 = Xd3

The PCHB template inherently requires two validity trees: one for the input requests and one for the output requests. However, the acknowledgement of the input requests for the control circuitry will be handled with a WCHB reshuffling. Therefore, the validity signals necessary for this design only cover the input data requests with An and Bn, the output data requests with Sdan and Sdbn, and the output control requests with Scn. Because the behavior condition was previously recorded with four C-elements, their internal nodes can be used to gate the input enable. The global enable en also uses the internal nodes to handle cases that do not acknowledge an input.

Sc0 ∨ Sc1 → Scn⇂
¬Sdan ∧ ¬Sdbn ∧ ¬Scn → Rn⇂
¬An ∧ ¬Rn ∧ (¬_Xd0 ∨ ¬_Xd2 ∨ ¬_Xd3) → Ae⇂
¬Bn ∧ ¬Rn ∧ (¬_Xd0 ∨ ¬_Xd1 ∨ ¬_Xd3) → Be⇂
(¬Ae ∨ ¬_Xd1) ∧ (¬Be ∨ ¬_Xd2) ∧ ¬Se → en⇂

Entering the reset phase, the input control requests are acknowledged using WCHB techniques directly in the drivers for the output requests. This dramatically simplifies the acknowledgement circuitry in the control.

¬en ∧ ¬Ac0 ∧ ¬Bc0 → Xd0⇂
¬en ∧ ¬Bc0 → Xd1⇂
¬en ∧ ¬Ac0 → Xd2⇂
¬en ∧ ¬Ac1 ∧ ¬Bc1 → Xd3⇂
¬Xd0 ∧ ¬Xd1 ∧ ¬Xd2 → Sc0⇂

Finally, the validity, input enable, and global enable signals are all set following the typical PCHB template.

¬Sc0 ∧ ¬Sc1 → Scn↾
Sdan ∧ Sdbn ∧ Scn → Rn↾
An ∧ Rn → Ae↾
Bn ∧ Rn → Be↾
Ae ∧ Be ∧ Se → en↾

In the datapath, the output request signals are simply conditioned on the global enable. This serves the same purpose as the clock in a clocked datapath. The reset phase for the data requires no information about the acknowledgement of the input channel.
en ∧ Ad0 → Sda0↾
¬en → Sda0⇂
en ∧ Ad1 → Sda1↾
¬en → Sda1⇂
en ∧ Bd0 → Sdb0↾
¬en → Sdb0⇂
en ∧ Bd1 → Sdb1↾
¬en → Sdb1⇂
Ad0 ∨ Ad1 → An⇂
¬Ad0 ∧ ¬Ad1 → An↾
Bd0 ∨ Bd1 → Bn⇂
¬Bd0 ∧ ¬Bd1 → Bn↾
Sda0 ∨ Sda1 → Sdan⇂
¬Sda0 ∧ ¬Sda1 → Sdan↾
Sdb0 ∨ Sdb1 → Sdbn⇂
¬Sdb0 ∧ ¬Sdb1 → Sdbn↾

5.1.3 Integrated QDI/BD (Extensible)

The bundled-data timing assumption simplifies much of the circuitry in this design. As suggested in Chapter 3, the control should be handled by a QDI process while the datapath should be clocked by the QDI control. Therefore, the cap signal of each token is communicated by the QDI input requests while the data bits of each token are communicated with a clocked bus. Instead of using four C-elements to keep track of the current condition like the QDI approach does, the integrated QDI/BD approach takes advantage of the bundled-data timing assumption and copies the QDI input requests into the latched datapath. This keeps that information stable throughout the whole cycle and removes the responsibility from the output request drivers.

Axcd0 ∨ Ac0 → Axcd1⇂
Axcd1 ∨ Ac1 → Axcd0⇂
¬Axcd0 ∧ ¬Ac0 → Axcd1↾
¬Axcd1 ∧ ¬Ac1 → Axcd0↾
Bxcd0 ∨ Bc0 → Bxcd1⇂
Bxcd1 ∨ Bc1 → Bxcd0⇂
¬Bxcd0 ∧ ¬Bc0 → Bxcd1↾
¬Bxcd1 ∧ ¬Bc1 → Bxcd0↾

Then, instead of the four delay lines typically required by a bundled-data design, the input control requests can be merged together using two C-elements and delayed. This reduces the number of delay lines from four to two and simplifies the remaining circuitry.

(Ac0 ∧ (Bc0 ∨ Bc1) ∨ Ac1 ∧ Bc0) → ABd0↾ // delay
Ac1 ∧ Bc1 → ABd1↾ // delay
¬Ac0 ∧ ¬Bc0 → ABd0⇂
¬Ac1 ∧ ¬Bc1 → ABd1⇂

Then, the latched control signals can be used to condition the input acknowledgement, and the remainder of the design is a simple WCHB buffer.

Se ∧ ABd0 → Sc0↾
Se ∧ ABd1 → Sc1↾
Sc0 ∧ Axcd0 ∨ Sc1 → Ae⇂ // amplify
Sc0 ∧ Bxcd0 ∨ Sc1 → Be⇂ // amplify
¬Se ∧ ¬ABd0 → Sc0⇂
¬Se ∧ ¬ABd1 → Sc1⇂
(¬Sc0 ∨ ¬Axcd0) ∧ ¬Sc1 → Ae↾ // amplify
(¬Sc0 ∨ ¬Bxcd0) ∧ ¬Sc1 → Be↾ // amplify

The input enable signals Ae and Be are amplified and used to latch their respective data. This removes the validity trees that were in the datapath of the QDI approach.

Sda0 ∨ Ad0 ∧ Ae → Sda1⇂
Sda1 ∨ Ad1 ∧ Ae → Sda0⇂
¬Sda0 ∧ (¬Ad0 ∨ ¬Ae) → Sda1↾
¬Sda1 ∧ (¬Ad1 ∨ ¬Ae) → Sda0↾
Sdb0 ∨ Bd0 ∧ Be → Sdb1⇂
Sdb1 ∨ Bd1 ∧ Be → Sdb0⇂
¬Sdb0 ∧ (¬Bd0 ∨ ¬Be) → Sdb1↾
¬Sdb1 ∧ (¬Bd1 ∨ ¬Be) → Sdb0↾

5.2 Sign Compress by N

Many operators may produce redundant outputs. For example, the sum of 127, encoded as …01111111, and -128, encoded as …10000000, is -1. A digit-serial adder will produce the result redundantly encoded as …11111111 instead of the minimal encoding of …1. If data becomes too redundantly encoded, all of the previously described benefits of an adaptive digit-serial datapath are lost. So it may be prudent to occasionally compress the digit stream.

5.2.1 Behavioral Specification

Each digit stream can be divided into a collection of runs. Each run is a sequence of ones or a sequence of zeros. The last run in the encoding will always contain the cap token. If the last run is longer than a single token, then the encoding is redundant. Sign compression simply cuts the last run back down to a single token. Since all of the tokens in the run will be the same value, either all ones or all zeros, a run is stored and its output delayed by recording its sign and counting up each token as it arrives. If the run ends and it does not contain the cap token, then the whole run is emitted on the output using the stored sign and length.
Effectively, the digit stream is temporarily run-length encoded. Once the cap token has been received, there will be a run of some length stored in the counter. This run is simply dropped and the cap token is forwarded, sign compressing the value. For multi-bit datapaths, a token may not be all ones or all zeros. These tokens cannot be part of a run and can therefore bypass the run-length logic entirely.

The implementation requires two internal variables: v represents the sign, and n counts the number of tokens. For clarity, the ext() function generates a full token from a sign value, the sign() function determines the sign of a token by examining its most significant bit, and the chain() function returns true if a token is all ones or all zeros.

v := 0, n := 0;
∗[∗[ sign(Ld)≠v ∧ n>0 → R!(ext(v),0); n := n-1 ];
  v := sign(Ld);
  [ !chain(Ld) → R!(Ld,0)
  ▯ chain(Ld) ∧ Lc=0 → n := n+1
  ▯ Lc=1 → n := 0, R!(ext(v),1)
  ];
  L? ]

If the received token is not part of the stored run, then the run does not contain the cap token. So, the process loops over n, draining the stored run to the output. Then, the next run is recorded, setting v to the last bit in the input token. If the input token is neither all ones nor all zeros, it bypasses the run logic and is forwarded to the output. Otherwise, if the input token is not a cap, it is stored by incrementing n. If the input token is a cap, then the last run is dropped: n is reset and a cap token is forwarded.

Flattening this CHP exposes a large number of possible designs. This process must strike a balance between attempting to execute independent actions in parallel and the resulting circuit complexity. For example, after decrementing the currently stored run, the next token will immediately increment it again. Alternatively, if this is the last token in the stream, it is possible to end up clearing an already-cleared counter. These actions ultimately cancel each other out, suggesting opportunity for optimization. While these look like low-hanging fruit, they ultimately result in long transistor stacks in the forward drivers. Therefore, there are four basic conditions. Condition 1 implements the loop, decrementing when the input is not part of the run. Condition 2 implements the bypass case when the input is not part of any run. Condition 3 implements the run accumulation case that consumes inputs and increments n. Finally, condition 4 implements the cap condition in which n is reset and the cap token is forwarded.

v := 0, n := 0;
∗[[ sign(Ld)≠v ∧ n≠0 → n := n-1, R!(ext(v),0)
  ▯ !chain(Ld) ∧ n=0 → v := sign(Ld); R!(Ld,0); L?
  ▯ Lc=0 ∧ (sign(Ld)=v ∨ n=0) → n := n+1, v := sign(Ld); L?
  ▯ Lc=1 ∧ sign(Ld)=v → n := 0, R!(ext(v),1); L?
]]

This design exposes complex communication patterns between the control and datapath. While the acknowledgement requirements introduced by these patterns make a QDI multi-bit implementation inefficient, an integrated QDI/BD design avoids most of these requirements. Ultimately, making intelligent choices about what is implemented in the datapath will make all the difference in creating a simple and efficient design. The n variable and its associated increment, decrement, and clear actions are implemented efficiently by the idczn counter described in Chapter 4. This provides an exchange channel interface that maps well to the control's behavior. For a single-bit datapath, the control is simpler. The bypass case in condition 2 is no longer possible, and the use of the ext(), sign(), and chain() functions is no longer required.
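Before moving to circuits, the flattened behavior is easy to validate with a direct software model. The sketch below makes two assumptions worth flagging: tokens are (digit, cap) pairs of w bits, least significant first, and the cap token is itself all zeros or all ones, per the constraint discussed in Section 5.1.1. The ext(), sign(), and chain() helpers are spelled out explicitly.

    W = 2                                      # digit width, illustrative
    def sign(d):  return (d >> (W - 1)) & 1
    def ext(v):   return ((1 << W) - 1) if v else 0
    def chain(d): return d == ext(sign(d))

    def compress(stream):
        out, v, n = [], 0, 0
        for d, cap in stream:
            while sign(d) != v and n > 0:      # drain a broken run
                out.append((ext(v), 0))
                n -= 1
            v = sign(d)
            if cap:                            # drop the stored run, forward the cap
                n = 0
                out.append((ext(v), 1))
            elif not chain(d):                 # bypass: token cannot join a run
                out.append((d, 0))
            else:                              # accumulate the run
                n += 1
        return out

    # ...11 11 11 11 (a redundant -1) compresses to the single cap token ...11
    assert compress([(0b11, 0)] * 3 + [(0b11, 1)]) == [(0b11, 1)]

The count-then-drain structure that halves the throughput of the QDI circuit below is visible here as well: a broken run is replayed token by token before the next input is examined.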
The simplified single-bit specification follows.

v := 0, n := 0;
∗[[ Ld≠v ∧ n≠0 → n := n-1, R!(v,0)
  ▯ Lc=0 ∧ (Ld=v ∨ n=0) → n := n+1, v := Ld; L?
  ▯ Lc=1 ∧ Ld=v → n := 0, R!(v,1); L?
]]

5.2.2 QDI Only

In the end, the QDI approach requires a non-standard implementation which more closely resembles three separate interacting processes than just one. First, the circuitry necessary for the decrement loop and the resulting data send is implemented. D flags whether the input data is equal to the stored value V and is calculated every cycle. While this adds transitions to the overall cycle time, it also reduces the length of the transistor stacks and reduces the capacitive load on both the internal memory and the input data.

Rc0 = Cd
Dd1 ∧ Cn ∧ Re → Cd↾
Vd0 ∧ (Rc0 ∨ Rc1) → Rd0↾
Vd1 ∧ (Rc0 ∨ Rc1) → Rd1↾
¬Cn ∧ ¬Re → Cd⇂
¬Vd0 ∨ ¬Rc0 ∧ ¬Rc1 → Rd0⇂
¬Vd1 ∨ ¬Rc0 ∧ ¬Rc1 → Rd1⇂

Then, once the counter has been decremented down to zero, the value of the input is copied to the internal memory.

Dd1 ∧ Ld0 ∧ Cz ∨ Vd0 → Vd1⇂
(Dd1 ∧ Ld1 ∧ Cz ∨ Vd1) → Vd0⇂
(¬Vd0 ∨ ¬Ld0) ∧ (¬Vd1 ∨ ¬Ld1) → Dd0⇂
(¬Vd0 ∨ ¬Ld1) ∧ (¬Vd1 ∨ ¬Ld0) → Dd1⇂
(¬Dd1 ∧ ¬Ld0 ∨ ¬Cz) ∧ ¬Vd0 → Vd1↾
(¬Dd1 ∧ ¬Ld1 ∨ ¬Cz) ∧ ¬Vd1 → Vd0↾
(Vd0 ∧ Ld0 ∨ Vd1 ∧ Ld1) → Dd0↾
(Vd0 ∧ Ld1 ∨ Vd1 ∧ Ld0) → Dd1↾

This triggers the subsequent increments or clear, consuming the next carry chain.

(Cz ∨ Cn) ∧ Dd0 ∧ Lc0 → Ci↾
(Cz ∨ Cn) ∧ Dd0 ∧ Lc1 → Cc↾
Cc ∧ Re → Rc1↾
Ci ∨ Rc1 → Le⇂
¬Cz ∧ ¬Cn ∧ ¬Dd0 ∧ ¬Lc0 → Ci⇂
¬Cz ∧ ¬Cn ∧ ¬Dd0 ∧ ¬Lc1 → Cc⇂
¬Cc ∧ ¬Re → Rc1⇂
¬Ci ∧ ¬Rc1 → Le↾

These particular design choices significantly reduce the number of required transistors, but also force the average throughput of this approach to be half of the input throughput. This is because every input carry chain must be counted and drained to the output in separate steps. It is possible to preserve the throughput of this device by doing those two steps in parallel. Such a design would require two separate counters: one devoted only to increment, and one only to decrement and clear. In between, the value of the counter would have to be copied bit-parallel from one counter to the next.

5.2.3 Integrated QDI/BD

The timing assumptions introduced by the integrated QDI/BD design flow pose new challenges in the design of the compression unit. In particular, the condition that does not acknowledge L skips the delay lines and forces other methods of implementing the timing assumption. Once again, the control starts with the four forward driving cases. The logic for these drivers uses three input data. Cz and Cn implement a one-of-two encoding that signals whether the counter is zero or not. Lc0 and Lc1 implement a one-of-two encoding with the input request specifying whether this token is a cap token, and D is a one-of-three encoding: Dd0 signals that Ld=ext(v), Dd1 signals Ld=ext(¬v), and Dd2 signals !chain(Ld). D comes from the datapath and is assumed to be stable by the time a request is received on Lc0 or Lc1. The output signals are named by their interaction with the counter. Cd decrements the counter, Cs skips the counter action, Ci increments the counter, and Cc clears it. Rc0 is calculated from the internal nodes of the C-elements driving Cd and Cs, and Rc1 lines up with Cc. Finally, all of the output requests except Cd acknowledge the input.

Re ∧ Cn ∧ (Dd1 ∨ Dd2) ∧ (Lc0 ∨ Lc1) → Cd↾
Re ∧ Cz ∧ Dd2 ∧ Lc0 → Cs↾
(Cz ∧ (Dd0 ∨ Dd1) ∨ Cn ∧ Dd0) ∧ Lc0 → Ci↾
Re ∧ (Cz ∧ (Dd0 ∨ Dd1) ∨ Cn ∧ Dd0) ∧ Lc1 → Cc↾
¬_Cd ∨ ¬_Cs → Rc0↾
Rc1 = Cc
Ci ∨ Cc ∨ Cs → Le⇂

This yields a fairly clean WCHB implementation, and the reset behavior is as expected.
¬Re ∧ ¬Cn → Cd⇂
¬Re ∧ ¬Ld0 → Cs⇂
¬Cz ∧ ¬Cn ∧ ¬Ld0 → Ci⇂
¬Re ∧ ¬Cz ∧ ¬Cn ∧ ¬Ld1 → Cc⇂
_Cd ∧ _Cs → Rc0⇂
¬Ci ∧ ¬Cc ∧ ¬Cs → Le↾

Meanwhile, the datapath shown in Fig. 58 is somewhat more complex. Pass transistor logic reduces its energy requirements and increases its maximum switching frequency. A variant of the Manchester carry chain checks the conditions that are finally fed into the control block as D. This yields an implementation with high frequency at low energy and area requirements.

The MSB of #Ld is copied into v in every condition in which L is acknowledged except for the last. However, in the last condition, when the input is a cap token, the next value of v does not matter because the associated clear guarantees that n is 0. So v can be set every time L is acknowledged. This allows v to be implemented as a flip-flop in the datapath, making it accessible to the comparison operation between #Ld and ext(v), and its assignment to the MSB of #Ld.

To start, each bit needs to pick between sending Ld or ext(v) on R in order to implement the bypassing case with Cs. This is done with pass transistor multiplexers. When Zd1 is high, it passes the value from Ld to Rd. When Zd0 is high, it passes the value from V to Rd. Z is one when the counter is zero, and comes from latching Cn and Cz using the input acknowledge.

¬@Ld0 ∧ Zd1 ∨ ¬@Vd0 ∧ Zd0 → Rd0⇂
@Ld0 ∧ ¬Zd0 ∨ @Vd0 ∧ ¬Zd1 → Rd0↾
¬@Ld1 ∧ Zd1 ∨ ¬@Vd1 ∧ Zd0 → Rd1⇂
@Ld1 ∧ ¬Zd0 ∨ @Vd1 ∧ ¬Zd1 → Rd1↾

Next, the equality checks are implemented, with Dd0 representing Ld=ext(v), Dd1 representing Ld=ext(¬v), and Dd2 representing !chain(Ld). To do this, each bit will have two Ci signals and two Co signals. Cod0 should be low if this and all previous bits are 0, and Cod1 if they are 1. This effectively forms two parallel carry chains.

¬@Cid0 ∧ Ld0 → Cod0⇂
@Cid0 ∧ ¬Ld1 ∨ ¬Ld0 → Cod0↾
¬@Cid1 ∧ Ld1 → Cod1⇂
@Cid1 ∧ ¬Ld0 ∨ ¬Ld1 → Cod1↾

Finally, in the MSB, these two carry signals are compared against the carry chain token value stored in V to implement the equality checks.

(Cod0 ∨ Vd1) ∧ (Cod1 ∨ Vd0) → Dd0⇂
(Cod0 ∨ Vd0) ∧ (Cod1 ∨ Vd1) → Dd1⇂
¬Cod0 ∧ ¬Vd1 ∨ ¬Cod1 ∧ ¬Vd0 → Dd0↾
¬Cod0 ∧ ¬Vd0 ∨ ¬Cod1 ∧ ¬Vd1 → Dd1↾
¬Dd0 ∧ ¬Dd1 → Dd2↾
Dd0 ∨ Dd1 → Dd2⇂

Fig. 58: The architecture of the integrated QDI/BD stream full compression unit.

5.3 Sign Compress by One

The full compression unit has a few issues when applied to certain problem spaces. First, the total number of bits in the counter dictates the maximum length of any run before incorrect results are given. This means that it does not ultimately support arbitrary precision arithmetic. It also means that longer runs require logarithmically more counter units, making the design area hungry. Second, it is possible that a run is close to the full length of the value but the cap token is not part of that run. For example, -128 is minimally encoded as …10000000 with a run of 7 bits and a length of 8. This implementation would be forced to cut the throughput of such numbers in half by waiting for the whole run to be consumed before emitting it again and moving on. This means that the device can vary dramatically between half and full throughput.

Instead of storing and collapsing the whole run, a limit could be imposed on its length, simply passing the remaining bits once that limit is reached. Implementing an arbitrary limit would require an idczfn counter and would likely be fairly expensive. However, implementing a limit of one token simply requires a single bit register in the compression unit.
A limit of one also lends itself to an implementation with a guaranteed constant throughput because the stored value is only important when the current input token is the cap. Therefore, it may be possible to spread these compression units throughout a computational fabric and execute the compression over the course of multiple operations without causing gridlock in the network. 128 5.3.1 Behavioral Specification Unfortunately, the transition between two input streams complicates the necessary control because there is not a stored token. So two values must be stored. v is a multi-bit storage that records the previous token's data, and n signals whether v is valid. Luckily, n is directly represented by the previous token's control signifying cap/not cap. v is not yet valid for any token proceeding a cap token and is valid otherwise. v := 0, n := 0 ∗[[ Lc=0 ∧ n=0 → n := 1, v := Ld; L? ▯ Lc=0 ∧ n=1 → R!(v,0); v := Ld; L? ▯ Lc=1 ∧ n=0 ∧ v≠Ld → n := 1, v := Ld ▯ Lc=1 ∧ n=1 ∧ v≠Ld → R!(v,0); v := Ld ▯ Lc=1 ∧ v=Ld → R!(v,1); n := 0, v := Ld; L? ]] The first condition handles the first token in the stream. v is not valid yet, so the input data needs to be loaded into v and n needs to be set. The second condition handles the majority of the stream. Lc=0 means that the end of the stream has not been reached and n=1 means that the first token has already been received. In this case the previous token stored in v should be forwarded and the new input should be loaded into v . Then, there are three stream completion cases. In the first case, the cap token is also the first token of the stream. This case only happens in the context of a stream representing 0 or -1 . To simplify the circuitry for the output request on R , the input is loaded into v and left unacknowledged. This will transition directly into the last case. In the second case, the stored token is a different value from the cap token. This means that the digit stream cannot be compressed. So v is forwarded and the new input is stored. Again, the input is not acknowledged, transitioning directly into the last case. In the last case, the cap token on the input and the stored value in v are the same, meaning that the digit-stream can be compressed. So, the value in v is forwarded as a cap token and the input is acknowledged. This completes the stream and resets n to 0. Once again, the specification for a single bit datapath is simpler. v and n can be merged into a single internal memory encoding three states: valid 0, valid 1, and invalid. Furthermore because datapath is QDI, the third condition can be merged with the last. Instead of first loading the input data into the internal storage, it is forwarded directly to the output. Otherwise, the specification is about the same. 129 v := inv; ∗[[ Lc=0 ∧ v=inv → v := Ld; L? ▯ Lc=0 ∧ v≠inv → R!(v,0); v := Ld; L? ▯ Lc=1 ∧ (v≠inv ∧ v≠Ld) → R!(v,0); v := Ld ▯ Lc=1 ∧ (v=inv ∨ v=Ld) → R!(Ld,1); v := inv; L? ]] 5.3.2 QDI Only Unfortunately, while the CHP looks fairly clean, the logic for the forward drivers, the internal memory, and the input acknowledgement are almost entirely independent. Furthermore, the equality checks hide a significant amount of complexity introduced by the XOR operations necessary for their implementation. So there end up being 10 forward driving signals and no real way to optimize them. The first two handle condition 1, saving the input data value in an internal driver for the express purpose of setting the internal memory. 
The next four handle condition 2, and must remain separate because the stored value of v that is subsequently forwarded on R could be different from the input value on L that is then stored in v. The next two handle condition 3, forwarding the stored value without consuming the input when the cap token has been received. Finally, the last two actually implement the compression in condition 4.

// Condition 1
v2 ∧ Lc0 ∧ Ld0 → Rxd0↾
v2 ∧ Lc0 ∧ Ld1 → Rxd1↾
// Condition 2
Re ∧ v0 ∧ Lc0 ∧ Ld0 → Rxd2↾
Re ∧ v0 ∧ Lc0 ∧ Ld1 → Rxd3↾
Re ∧ v1 ∧ Lc0 ∧ Ld0 → Rxd4↾
Re ∧ v1 ∧ Lc0 ∧ Ld1 → Rxd5↾
// Condition 3
Re ∧ v1 ∧ Lc1 ∧ Ld0 → Rxd6↾
Re ∧ v0 ∧ Lc1 ∧ Ld1 → Rxd7↾
// Condition 4
Re ∧ Lc1 ∧ (v2 ∨ v0) ∧ Ld0 → Rxd8↾
Re ∧ Lc1 ∧ (v2 ∨ v1) ∧ Ld1 → Rxd9↾

Because there are so many forward drivers, validity trees become absolutely necessary to derive the signals for the output channel's request. In a quest to reduce transistors, _Rd0 and _Rd1 are shared across the output channel's data, the control, and the input acknowledgement: Rxd7 and Rxd8 are omitted from _Rd0, Rxd6 and Rxd9 are omitted from _Rd1, and all four are added back later in the computation. This makes for some fairly complex forward logic.

Rxd0 ∨ Rxd1 → _Rs⇂
Rxd2 ∨ Rxd3 → _Rd0⇂
Rxd4 ∨ Rxd5 → _Rd1⇂
¬_Rd0 ∨ ¬_Rxd7 ∨ ¬_Rxd8 → Rd0↾
¬_Rd1 ∨ ¬_Rxd6 ∨ ¬_Rxd9 → Rd1↾
¬_Rd0 ∨ ¬_Rd1 ∨ ¬_Rxd6 ∨ ¬_Rxd7 → Rc0↾
¬_Rxd8 ∨ ¬_Rxd9 → Rc1↾
¬_Rs ∨ ¬_Rd0 ∨ ¬_Rd1 → _Le↾
_Le ∨ Rc1 → Le⇂

Then, the Rx signals are used to set the value of the internal memory. In the case of Rxd0 and Rxd1, a token is not being sent on R, so v must wait for the input to reset, which is the slowest possible implementation of an internal memory.

(¬v1 ∧ ¬v2 ∨ ¬Ld0 ∧ ¬_Rxd0 ∨ ¬Re ∧ (¬_Rxd2 ∨ ¬_Rxd4 ∨ ¬_Rxd6)) → v0↾
(¬v0 ∧ ¬v2 ∨ ¬Ld1 ∧ ¬_Rxd1 ∨ ¬Re ∧ (¬_Rxd3 ∨ ¬_Rxd5 ∨ ¬_Rxd7)) → v1↾
¬v0 ∧ ¬v1 ∨ ¬Re ∧ (¬_Rxd8 ∨ ¬_Rxd9) → v2↾
(v1 ∨ v2) ∧ (Ld0 ∨ _Rxd0) ∧ (Re ∨ _Rxd2 ∧ _Rxd4 ∧ _Rxd6) → v0⇂
(v0 ∨ v2) ∧ (Ld1 ∨ _Rxd1) ∧ (Re ∨ _Rxd3 ∧ _Rxd5 ∧ _Rxd7) → v1⇂
(v0 ∨ v1) ∧ (Re ∨ _Rxd8 ∧ _Rxd9) → v2⇂

Finally, the forward logic is reset following the typical WCHB template with conditional acknowledgement. The first condition does not produce any output on R, so Re will not be lowered. Rxd2 and Rxd5, implementing the second condition, do not change the value of the internal memory, so those acknowledgements are removed. The third condition does not acknowledge the input, so input request neutrality does not need to be checked. Finally, the last condition must check all of these features.

// Condition 1
¬v2 ∧ ¬Lc0 ∧ ¬Ld0 → Rxd0⇂
¬v2 ∧ ¬Lc0 ∧ ¬Ld1 → Rxd1⇂
// Condition 2
¬Re ∧ ¬Lc0 ∧ ¬Ld0 → Rxd2⇂
¬Re ∧ ¬v0 ∧ ¬Lc0 ∧ ¬Ld1 → Rxd3⇂
¬Re ∧ ¬v1 ∧ ¬Lc0 ∧ ¬Ld0 → Rxd4⇂
¬Re ∧ ¬Lc0 ∧ ¬Ld1 → Rxd5⇂
// Condition 3
¬Re ∧ ¬v1 → Rxd6⇂
¬Re ∧ ¬v0 → Rxd7⇂
// Condition 4
¬Re ∧ ¬v0 ∧ ¬Lc1 ∧ ¬Ld0 → Rxd8⇂
¬Re ∧ ¬v1 ∧ ¬Lc1 ∧ ¬Ld1 → Rxd9⇂

Resetting the validity trees and the input acknowledge is simply a matter of implementing the other half of the combinational gates.

¬Rxd0 ∧ ¬Rxd1 → _Rs↾
¬Rxd2 ∧ ¬Rxd3 → _Rd0↾
¬Rxd4 ∧ ¬Rxd5 → _Rd1↾
_Rd0 ∧ _Rxd7 ∧ _Rxd8 → Rd0⇂
_Rd1 ∧ _Rxd6 ∧ _Rxd9 → Rd1⇂
_Rd0 ∧ _Rd1 ∧ _Rxd6 ∧ _Rxd7 → Rc0⇂
_Rxd8 ∧ _Rxd9 → Rc1⇂
_Rs ∧ _Rd0 ∧ _Rd1 → _Le⇂
¬_Le ∧ ¬Rc1 → Le↾

This is a stark example of everything that can go wrong when attempting QDI design. When nothing lines up, complexity can grow very quickly for seemingly simple specifications. Because of this, QDI design often takes a very long time.
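For contrast, the function being implemented is tiny. The following is a software model of the single-bit specification above, with INV standing in for the invalid state; as in the CHP, the third condition leaves the input token unconsumed and falls into the last case on the next iteration.

    INV = None

    def compress_one(stream):
        out, v, i = [], INV, 0
        while i < len(stream):
            d, cap = stream[i]
            if not cap:
                if v is not INV:
                    out.append((v, 0))   # forward the stored bit
                v = d; i += 1
            elif v is not INV and v != d:
                out.append((v, 0))       # cannot compress: flush v and
                v = d                    # re-examine the same cap token
            else:
                out.append((d, 1))       # drop one redundant token
                v = INV; i += 1
        return out

    assert compress_one([(1, 0), (1, 0), (1, 1)]) == [(1, 0), (1, 1)]   # ...111 -> ...11
    assert compress_one([(0, 0), (1, 1)]) == [(0, 0), (1, 1)]           # ...10 is minimal

The gap between this model and the ten forward drivers above is the point: the complexity lies in the acknowledgement structure, not in the function itself.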
5.3.3 Integrated QDI/BD

Separating the control and datapath untangles these independent concerns and allows for simpler XORs with standard clocked logic. This ultimately simplifies the control. However, any feature that might add complexity to the datapath must be reined in. Therefore, a few invariants are maintained in an attempt to optimize the circuitry. First, the data for the output request on R always comes from v. This removes any muxing from the datapath and redirects that complexity into the control. Second, the value stored in v is always set using the data on the input channel L. These two factors allow the datapath to be implemented as a set of flops that shift the data backwards in the stream by a single pipeline stage. Third, the control circuitry is only dependent upon an equality test from the datapath, and the datapath is only dependent upon clocking signals from the control. This allows for a fairly strict separation between the two, simplifying the control.

There are two primary challenges presented by the specification. The first is that the clocking signal for the input latches on the input request data #Ld and the clocking signal for the extra set of flops implementing v are different. v needs to be clocked on every iteration of the control, while the input request should only be clocked on conditions 1, 2, and 5. The second challenge is that conditions 3 and 4 bypass the delay line on the input request #Lc, but still change the value of v and therefore of the equality test between #Ld and v. This forces the creation of a signal specifically for those two conditions with its own delay line.

The implementation starts with the five conditions that drive the output request on R. Except for Ls, the signals used to compute these conditions come directly from the specification. Ls is the extra delay line signal for conditions 3 and 4, and D signals whether Ld is different from v. Since conditions 3 and 4 can only transition to 4 or 5, only conditions 4 and 5 must check the reset of Ls.

Re ∧ n0 ∧ Ld0 → Rxd0↾
Re ∧ n1 ∧ Ld0 → Rxd1↾
Re ∧ n0 ∧ Dd1 ∧ Ld1 → Rxd2↾
Re ∧ Ls ∧ n1 ∧ Dd1 ∧ Ld1 → Rxd3↾
Re ∧ Ls ∧ Dd0 ∧ Ld1 → Rxd4↾

Then, these conditions are used to drive the output request, the input enable, and the extra delay signal. Because conditions 1 and 3 only serve to load the input into the internal memory, they do not forward any request on the output. Furthermore, conditions 3 and 4 redirect the control to condition 5 and therefore do not acknowledge the input request.

¬_Rxd1 ∨ ¬_Rxd3 → Rd0↾
Rd1 = Rxd4
Rxd0 ∨ Rxd1 ∨ Rxd4 → Le⇂
Rxd2 ∨ Rxd3 → Ls⇂

Because the value of the input control token is reflected in the output request rails, they can be used to set the internal memory unit for n. Conditions 1 and 3 set n to 1, while condition 5 sets it to 0. The value of n for conditions 2 and 4 is already set correctly. So, the reset phase of the output request is then used to acknowledge the transitions on the internal memory and reset the input acknowledges. The delay line for Ls is placed between the driver and all subsequent usages.

(¬n0 ∨ ¬Ld0 ∧ ¬_Rxd0 ∨ ¬Ls ∧ ¬_Rxd2) → n1↾
n0 ∧ (Ld0 ∨ _Rxd0) ∧ (Ls ∨ _Rxd2) → n1⇂
¬n1 ∨ ¬Re ∧ ¬_Rxd4 → n0↾
n1 ∧ (Re ∨ _Rxd4) → n0⇂
¬Ld0 ∧ ¬n0 → Rxd0⇂
¬Re ∧ ¬Ld0 → Rxd1⇂
¬Ls ∧ ¬n0 → Rxd2⇂
¬Re ∧ ¬Ls → Rxd3⇂
¬Re ∧ ¬Ld1 ∧ ¬n1 → Rxd4⇂
_Rxd1 ∧ _Rxd3 → Rd0⇂
¬Rxd0 ∧ ¬Rxd1 ∧ ¬Rxd4 → Le↾
¬Rxd2 ∧ ¬Rxd3 → Ls↾

Fig. 59: The architecture of the integrated QDI/BD Stream Compress One unit.

To implement the datapath in Fig. 59, the clocking signal for v must be generated using Le and Ls. Meanwhile, the clock signal for the input data is just Le.
Ls ∧ Le → vclk↾
¬Ls ∨ ¬Le → vclk⇂

Then, the equality check between Ld and v is implemented for each bit.

Ld0 ∧ Rd1 ∨ Ld1 ∧ Rd0 → Cd0⇂
¬Ld0 ∧ ¬Rd0 ∨ ¬Ld1 ∧ ¬Rd1 → Cd0↾
Cd0 → Cd1⇂
¬Cd0 → Cd1↾

Inspiration is taken from a Manchester carry chain to propagate this equality check across the bits using pass transistors.

¬@Di ∧ Cd0 → Do⇂
@Di ∧ ¬Cd1 ∨ ¬Cd0 → Do↾

And finally, the one-hot encoding D is generated for the equality check used in the control.

Dd1 = Do
Do → Dd0⇂
¬Do → Dd0↾

5.4 Serial to Parallel

Serial to parallel conversion is ultimately required for multiplication and shifting. Digit-serial multiplication requires an array of adders to sum up the partial products, so it actually uses a hybrid architecture with one serial operand and one parallel. Similarly, digit-serial shifting requires a counter to keep track of the number of tokens that have either been pushed to the front of the stream or popped from it. This counter is loaded with a bit-parallel operand. For both of these, extra work must be done to convert the second operand from serial to parallel.

5.4.1 Behavioral Specification

There are, however, a few complicating factors. The array architecture limits multi-node operations to a distributed control. Furthermore, it is assumed that each network node will have a pipeline stage to maintain switching frequency. This rules out the three standard approaches to this problem: a tree of alternating splits, a counter with a demultiplexer [232], or a broadcast channel. Ultimately, this leaves two possible approaches.

For the "upflow" approach, the stream flows up from the least significant parallel output channel to the most significant. The first token in the stream is emitted out the first parallel output channel, and every consecutive token in the stream is forwarded to the next least significant node. This is repeated, popping the first token off the stream at each node until it has been fully converted. Unfortunately, this strategy introduces significant skew to the output tokens, likely affecting performance.

The "downflow" approach reverses the direction of flow. The digit stream flows down across all of the output channels until the first token reaches the bottom of the converter. At this point, all of the tokens in the stream will be aligned with the correct output channel. This allows them to be emitted in parallel with very little skew.

Each of these approaches implies strong constraints on the construction of the CGRA and its network architecture. The upflow approach allows for dynamic allocation of neighboring idle nodes on the array as in Fig. 60. Meanwhile, the downflow approach allocates a multi-node operation through nodes on which the input digit stream is already being routed as in Fig. 61. This effectively folds the overhead of the serial-to-parallel converter into the routing network's own pipeline structure. However, this approach would also require knowledge of exactly how many nodes are required for the operation ahead of time. As discussed before, this would prevent the system from truly implementing arbitrary-length arithmetic. To get the most capacity from the array, it may be prudent to use a hybrid approach, starting with downflow and switching to upflow as needed. Therefore, implementations for both of these approaches will be demonstrated in this chapter.
Unfortunately, the naive implementations of these circuits will deadlock. First, both the multiplier and the counter leave the parallel channels unacknowledged until the completion of the operator. Second, once the operator has completed, the counter will acknowledge its parallel inputs from bottom to top while the multiplier will acknowledge them from top to bottom. Third, the basic circuit template implements a half-buffered pipeline. This means that each token requires two spaces in the pipeline, and that the second token will be blocked until the first parallel output is acknowledged. Careful design is needed to work within these constraints.

While it would be easy to add extra pipeline stages wherever needed, doing so incurs significant overhead. Overall, a half-buffered pipeline has enough latching layers to hold all of the parallel outputs. The failure, instead, lies with the typical QDI control templates. Both the upflow and downflow approaches start with a standard half-buffered pipeline, adding parallel output channels at each pipeline stage.

Fig. 60: Structure of multi-node operations for the first (upflow) approach to serial to parallel conversion.

Fig. 61: Structure of multi-node operations for the second (downflow) approach to serial to parallel conversion.

For the upflow approach, each stage must be able to store the parallel digit while passing the remaining serial digits, requiring a minimum of two latching layers. Furthermore, the QDI control for the parallel output channel must not deadlock the QDI control for the serial pipeline. This means that the parallel output must be driven by a full buffer. Overall, the behavioral description of the upflow approach is unable to capture these constraints. The specification has three channels for data: Si receives the incoming serial stream from below, Pi forwards the first token on the parallel output channel, and Si+1 forwards the remaining tokens of the serial stream up until a cap token is received.

∗[ Sn-1?v; Pn-1!v; ∗[ v=0 → Sn-1?v ] ] ∥
…
∗[ Si?v; Pi!v; ∗[ v=0 → Si?v; Si+1!v ] ] ∥
…
∗[ S0?v; P0!v; ∗[ v=0 → S0?v; S1!v ] ]

For the downflow approach, each pipeline stage can deadlock once it has forwarded its parallel output. This means that only one latching layer is required at each pipeline stage and the parallel output can be driven by a half buffer. The specification therefore has three channels for data: Si+1 receives the incoming serial stream from above, Si forwards the remaining tokens of the serial stream down, and Pi is the parallel output for a single token. Given a half-buffered pipeline, it is guaranteed that the parallel output in the next stage will stall the pipeline, keeping the enable of Si low until after the completion of the whole operator. Unfortunately, there is no clean way to differentiate that event from a simple pipeline stall, necessitating the use of extra control channels Ci and Ci+1. After this stall, a token will arrive on Ci signalling that this stage should forward its token waiting on Si+1 through Pi. Once that is done, a signal can be sent through Ci+1 to continue the process.

∗[[ Sn-1 → Sn?v; Sn-1!v ▯ Cn-1 → Sn?v, Cn-1?; Pn-1!v ]] ∥
…
∗[[ Si → Si+1?v; Si!v ▯ Ci → Si+1?v, Ci?; Pi!v, [v=0 → Ci+1! ▯ v=1 → skip] ]] ∥
…
∗[ S1?v; P0!v, [v=0 → C1! ▯ v=1 → skip] ]
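Stripped of the handshakes, the downflow ordering reduces to a few lines of Python. This is a toy model under stated assumptions (a list of tokens, one latching layer per stage, and no modeling of the Ci/Cu control timing), not the circuit:

    def downflow(tokens):
        # Toy model: tokens ripple down a single latching layer per stage;
        # once the first token reaches the bottom, stage i holds token i
        # and every Pi can fire together with very little skew.
        n = len(tokens)
        latches = [None] * n
        for step in range(n):
            latches = latches[1:] + [tokens[step]]  # shift down, inject on top
        return latches  # the parallel word, least significant digit at P0

After n shifts every latch is full, which is exactly the aligned, low-skew emission described above; the Ci/Cu control pipeline exists only to detect that moment.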
5.4.2 Upflow Approach

For simplicity, the serial channels of a single pipeline stage, Si and Si+1, have been renamed D for down and U for up respectively. As previously discussed, each stage is divided into two interacting circuits: a PCFB drives the parallel output channel P, and a WCHB drives the serial up-going channel U.

The PCFB starts the handshake, forwarding the first token from D to the parallel output channel P. The input enable De is then lowered by way of the state variable x, and the standard PCFB state variable en signals the start of the reset phase. The input channel D resets and De is raised in compliance with the standard PCFB handshake.

en ∧ D0 → P0↾
en ∧ D1 → P1↾
en ∧ (P0 ∨ P1) → x↾
P0 ∨ P1 → Pn⇂
¬De ∧ ¬Pn → en⇂
¬en ∧ ¬D0 ∧ ¬D1 → x⇂

During the reset phase of the PCFB, the WCHB forwards remaining tokens through the up-going channel U. The WCHB is forced to wait until the reset phase of the PCFB through De, _en, and P0 in the forward drivers for U. This continues until the cap token is received. Before the reset of U1 can complete, it waits until P0 is reset.

De ∧ Ue ∧ _en ∧ P0 ∧ D0 → U0↾
De ∧ Ue ∧ _en ∧ P0 ∧ D1 → U1↾
x ∨ U0 ∨ U1 → De⇂
¬Ue ∧ ¬D0 → U0⇂
¬Ue ∧ ¬D1 ∧ ¬P0 → U1⇂
¬x ∧ ¬U0 ∧ ¬U1 → De↾

Once the cap token is forwarded up, the parallel channel P is allowed to reset. This is enforced by the check on _U1 in the reset rule for P0. Then the PCFB handshake waits for the reset of the WCHB to complete by checking _U1 in the reset rule for en.

¬en ∧ ¬Pe ∧ ¬_U1 → P0⇂
¬en ∧ ¬Pe → P1⇂
¬P0 ∧ ¬P1 → Pn↾
De ∧ Pe ∧ Pn ∧ _U1 → en↾

Once everything has reset, the process starts again with a new digit stream. Therefore, P is allowed to deadlock for as long as necessary without affecting the other tokens flowing through the system, and is allowed to reset in any order with respect to the other parallel channels.

For the datapath, the input requests on D are delayed to implement the bundled-data timing assumption. en is amplified and used to clock the layer of latches for P, which are open when en is high and closed otherwise. Finally, the input enable on D is amplified and used to clock the layer of latches for U. Once again, those latches are open when De is high and closed otherwise. While this QDI control is fairly non-standard, it allows the datapath to stay as small and simple as possible while faithfully implementing those constraints. Overall, this has a dramatic effect on the transistor count.

5.4.3 Downflow Approach

The downflow approach is similarly non-standard, but ultimately simpler. Tokens flow down the converter until they reach their destination. Each token lines up one after another, so only one layer of latches is necessary. Once again, the two serial channels Si and Si+1 have been renamed to D for down and U for up respectively. The control channels have been similarly renamed to Cu and Cd.

*[[ #D → U?v; D!v
 [] #Cd ∧ #U=0 → U?v, Cd?; P!v, Cu!
 [] #Cd ∧ #U=1 → U?v, Cd?; P!v
 ]]

Because no assumptions are made about the order in which the parallel outputs are reset, it is possible for the reset of a parallel channel to break the guarantee for the stall on D. Specifically, if the parallel channel of the node attached to D is reset before P, then De will be raised. In parallel, U has been acknowledged and the handshake is waiting on the input requests on U to be lowered. This could allow the token that was forwarded through P to be duplicated onto D, producing an incorrect result later in the pipeline and possibly causing an instability. This means that the rules generating the requests on D must be gated by the output requests on P.
Pe ∧ De ∧ Ud0 ∧ _Pd0 → Dd0↾
Pe ∧ De ∧ Ud1 ∧ _Pd1 → Dd1↾
Dd0 ∨ Dd1 ∨ _Cde → Ue⇂
¬De ∧ ¬Ud0 → Dd0⇂
¬De ∧ ¬Ud1 → Dd1⇂
¬Dd0 ∧ ¬Dd1 ∧ ¬_Cde → Ue↾

After D lowers the enable, a token will be received on Cd. If the token on Cd arrives before the input requests on U are lowered, then it is possible for the token on D to be duplicated to P. This will produce an incorrect result on P, which may be unstable because the input requests on U can be lowered before that transition completes. This means that the rules generating the requests on P must be gated by the output requests on D.

Finally, there can be no constraints on the order in which the parallel channels are reset. They could be reset from bottom to top, top to bottom, or in parallel. The control channels Cu and Cd form a pipeline with tokens running from the bottom of the converter to the top. If a typical WCHB pipeline is used, then that will force the parallel channels to reset from bottom to top, causing deadlock. This means that the Cd?; Cu! pipeline must be at least a PCHB reshuffling.

Pe ∧ Cue ∧ Cdd0 ∧ Ud0 ∧ _Dd0 → Pd0↾
Pe ∧ Cdd0 ∧ Ud1 ∧ _Dd1 → Pd1↾
Cud0 = Pd0
Pd0 ∨ Pd1 → Cde⇂
¬Cde → _Cde↾
¬Pe ∧ ¬Cue ∧ ¬Ud0 → Pd0⇂
¬Pe ∧ ¬Ud1 → Pd1⇂
¬Cdd0 ∧ ¬Pd0 ∧ ¬Pd1 → Cde↾
Cde → _Cde⇂

Now a single layer of latches can be clocked to serve both U and P.

Ue ∧ Pe → clk⇂
¬Ue ∨ ¬Pe → clk↾

While tokens are being forwarded through D, before a control token is received on Cd, the latches serve to pass data along D. Then, a control token is received and a token is passed onto P. At this point the latches lock with the value passed through P and are not used again until P completes its reset.

5.5 Evaluation

5.5.1 Sign Extension

Overall, five different approaches were explored: the QDI PCHB and Integrated approaches presented in this chapter, an Integrated approach with control logic similar to the QDI PCHB, a Bundled-Data approach, and a QDI WCHB approach. The QDI approaches were measured with both 4-bit and 1-bit datapaths while the others were measured with only 4-bit datapaths. Fig. 62 shows the average performance for each unit assuming a maximum input bitwidth of 64 bits.

Overall, the Integrated QDI/BD design presented in this chapter is the most desirable solution for a few reasons: it has the lowest energy requirement, a reasonably high throughput per transistor, and it is fairly simple to graft more complex control onto it. This will be important for the development of an efficient adder and could help with efficient bitwise operators.

Table 4 shows the raw per-token measurements for each condition of the sign extension units. In the "ab" condition, the control tokens on the inputs are both internal tokens, and thus both inputs are acknowledged and forwarded. In the "a" and "b" conditions, only the token on A or B respectively is an internal token; therefore, only that input is acknowledged, sign-extending the other. Finally, in the "cap" condition, both inputs are cap tokens, so both are acknowledged, completing the operation. The frequency and latency are fairly consistent between the four conditions, but the energy drops significantly during the extend conditions for the Integrated and Bundled-Data solutions because only one of the two inputs has to be latched.

Type                         Transistors   Condition   Token Frequency   Energy/Token   Latency
Integrated Serial Adaptive   181           ab          2.76 GHz          36.31 fJ       92 ps
(Extensible 4-bit)                         a           2.79 GHz          23.98 fJ       90 ps
                                           b           2.81 GHz          24.72 fJ       89 ps
                                           cap         2.82 GHz          27.95 fJ       86 ps
Integrated Serial Adaptive   178           ab          3.04 GHz          50.15 fJ       123 ps
(Standard 4-bit)                           a           3.19 GHz          28.92 fJ       116 ps
                                           b           3.17 GHz          30.24 fJ       120 ps
                                           cap         3.94 GHz          28.70 fJ       87 ps
BD Serial Adaptive           188           ab          3.01 GHz          58.52 fJ       117 ps
(4-bit)                                    a           2.84 GHz          35.52 fJ       123 ps
                                           b           3.03 GHz          34.66 fJ       113 ps
                                           cap         3.09 GHz          47.93 fJ       111 ps
QDI Serial Adaptive          445           ab          1.38 GHz          125.06 fJ      286 ps
(PCHB 4-bit)                               a           1.36 GHz          120.98 fJ      284 ps
                                           b           1.36 GHz          121.00 fJ      285 ps
                                           cap         1.38 GHz          122.61 fJ      285 ps
QDI Serial Adaptive          504           ab          1.86 GHz          130.88 fJ      69 ps
(WCHB 4-bit)                               a           1.86 GHz          137.16 fJ      61 ps
                                           b           1.87 GHz          135.92 fJ      63 ps
                                           cap         1.85 GHz          135.40 fJ      56 ps
QDI Serial Adaptive          169           ab          1.98 GHz          46.07 fJ       187 ps
(PCHB 1-bit)                               a           1.97 GHz          41.05 fJ       184 ps
                                           b           1.97 GHz          40.87 fJ       183 ps
                                           cap         1.99 GHz          43.21 fJ       184 ps
QDI Serial Adaptive          228           ab          2.44 GHz          53.48 fJ       69 ps
(WCHB 1-bit)                               a           2.44 GHz          54.93 fJ       61 ps
                                           b           2.46 GHz          54.46 fJ       63 ps
                                           cap         2.42 GHz          55.11 fJ       57 ps

Table 4. Raw performance measurements for the sign extension units.

Fig. 62: Overview of the sign-extension unit performance.
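The four conditions can also be summarized with a small behavioral sketch in Python. The (cap, digit) token pairs and lockstep pairing are illustrative assumptions; this models the alignment only, not the circuit:

    def sign_extend(a_tokens, b_tokens):
        # Consume both streams in lockstep; when one stream reaches its cap
        # token, replay its cap digit (the "a"/"b" conditions) until the
        # other stream also caps (the "cap" condition).
        out_a, out_b, i, j = [], [], 0, 0
        while True:
            (ca, da), (cb, db) = a_tokens[i], b_tokens[j]
            if ca and cb:                       # "cap": both streams complete
                out_a.append((1, da)); out_b.append((1, db))
                return out_a, out_b
            out_a.append((0, da)); out_b.append((0, db))
            if not ca: i += 1                   # "ab" or "a": consume A
            if not cb: j += 1                   # "ab" or "b": consume B

Extending [(0,1), (1,0)] against a longer stream simply replays the zero cap digit, which is why the extend conditions need to latch only one of the two inputs.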
The sign extension unit is used by the addition, subtraction, AND, OR, and XOR operators. Therefore, the average utilization of the four behavioral conditions is determined by the joint bitwidth distribution of the two inputs to those operators, as shown in Fig. 63. This is moderately different from the output bitwidth of the addition operator in Fig. 32. The center plot shows the joint probability distribution, while each histogram shows the associated individual probability and cumulative distributions for that axis.

Fig. 63: Combined probability distribution of the input bitwidths to the addition, subtraction, and bitwise operations.

However, this plot includes some operations that should not be handled by a digit-serial architecture. Specifically, there are significant spikes around 47 and 48 bits, as discussed in Chapter 2 Section 4. These spikes represent memory address computations with a 48-bit wide memory bus. These operations have predictable bitwidth and should be handled by their own bit-parallel datapath.

max_bitwidth(A) and max_bitwidth(B) represent the maximum bitwidths of the input operands A and B. The probability at each coordinate in Fig. 63 is sampled with P(bitwidth(A) == a and bitwidth(B) == b), ignoring the cases in which A or B are 47 or 48 bits wide (the pseudocode below remaps them).

The number of internal tokens in each digit stream is computed from the bitwidth. If the bitwidth is 1, then that single bit is placed in the cap token: int((1+4-2)/4) = int(3/4) = 0 internal tokens. With a bitwidth of 2, the digit stream has one internal token, int((2+4-2)/4) = int(4/4) = 1, plus one cap token.
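This token-count arithmetic recurs throughout the evaluation, so it is worth capturing in a one-line helper. A convenience sketch only; internal_tokens is not a name from the thesis:

    def internal_tokens(bitwidth, packet=4):
        # non-cap digits in the stream; the final sign digit always rides
        # in the cap token, so a 1-bit value needs no internal digits
        return (bitwidth + packet - 2) // packet

    assert internal_tokens(1) == 0  # the single sign bit lives in the cap token
    assert internal_tokens(2) == 1  # one internal digit plus the cap token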
The average number of cycles per stream is computed for each condition. "ab" is equal to the number of non-cap tokens before sign-extension is required. Then, if one stream is longer than the other, the difference is counted in the "a" or "b" condition. Finally, every digit stream has exactly one cap token.

    u = {'ab': 0, 'a': 0, 'b': 0, 'cap': 0}
    for a in range(1, max_bitwidth(A)+1):
        for b in range(1, max_bitwidth(B)+1):
            # remap memory address operations
            if a in [47, 48] or b in [47, 48]:
                tmpa = 44 if a in [47, 48] else a
                tmpb = 44 if b in [47, 48] else b
                p = P(bitwidth(A) == tmpa and bitwidth(B) == tmpb)
            else:
                p = P(bitwidth(A) == a and bitwidth(B) == b)

            # number of internal tokens
            atok = int((a+packet-2)/packet)
            btok = int((b+packet-2)/packet)

            u['ab'] += min(atok, btok)*p
            u['a'] += max(0, atok-btok)*p
            u['b'] += max(0, btok-atok)*p
            u['cap'] += p

The computed utilizations for both 1-bit and 4-bit datapaths in Table 5 show that the average digit stream for a 4-bit datapath has around 4.5 tokens, with roughly equal time spent in the "ab" condition and the extend conditions "a" and "b". A tends to have longer digit streams due to compiler and human preference.

Condition      Average Cycles/Stream
               1-bit     4-bit
ab              6.602    1.998
a               4.511    1.123
b               1.359    0.358
cap             1.000    1.000
Total Cycles   13.472    4.479

Table 5. Utilization of each condition for the addition, subtraction, AND, OR, and XOR operators.

Finally, the average performance of each sign extension unit was computed from these utilizations in Fig. 64 for a given maximum input bitwidth. Overall, the QDI-only solutions are not competitive. However, the integrated solutions show the best energy efficiency and competitive throughput.

Fig. 64: Overview of the sign-extension unit performance.

5.5.2 Compression

Six different approaches were explored: the Integrated and QDI compressN and compress1 approaches presented in this chapter, and Bundled-Data compressN and compress1 approaches. Fig. 65 shows the average performance of these approaches for a maximum bitwidth of 64 bits. Once again, the Integrated approaches perform the best, operating with around 29% higher throughput per transistor and using around 14% less energy than the bundled-data approaches.

Fig. 65: Overview of the compression unit performance.

The compressN units have four conditions. During the increment condition, the unit counts input tokens that are part of a run, emitting no tokens on the output. During the decrement condition, it emits the stored run, consuming no tokens from the input. During the pass condition, there is no stored run and the input token is not part of any new run, so the input is passed directly to the output. Finally, the clear condition clears any stored run and forwards the input token directly to the output. The raw measurements for these conditions for each design are presented in Table 6.

Meanwhile, the compress1 units have three conditions. During the init condition, the first token is consumed and stored for later comparison, emitting no tokens on the output. During the pass condition, tokens on the input are stored and the stored token is passed to the output. Then, the end condition simply emits the stored token to the output. The raw measurements for these conditions for each design are presented in Table 7.
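As with compress1 earlier, the compressN conditions admit a small behavioral sketch. The run detection over all-0/all-1 digits and the (cap, digit) encoding are simplifying assumptions, and the real unit bounds its counter:

    def compressN(tokens, packet=4):
        # Behavioral sketch: buffer runs of sign digits in a counter (inc);
        # re-emit them if live data follows (dec) and drop them when the
        # cap token can absorb them (clear).
        mask = (1 << packet) - 1
        out, run, count = [], None, 0
        def flush():
            nonlocal run, count
            out.extend([(0, run)] * count)   # dec: re-emit the buffered run
            run, count = None, 0
        for cap, d in tokens:
            if cap:
                if count and run != d:
                    flush()                  # run still carries information
                out.append((1, d))           # clear: cap absorbs a matching run
                run, count = None, 0
            elif d in (0, mask):
                if count and d != run:
                    flush()                  # a run of the other sign begins
                run, count = d, count + 1    # inc
            else:
                if count:
                    flush()
                out.append((0, d))           # pass
        return out

An input such as [(0,1), (0,0), (0,0), (1,0)] collapses to [(0,1), (1,0)], which is exactly the redundant-token case the inc and clear conditions target.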
The compression units are ultimately difficult to compare. The compressN units have the capability to compress digit streams that are extremely redundant, but incur a significant throughput cost to do so. Meanwhile, the compress1 units can only compress a single token at a time, but also affect the throughput significantly less. Three measurements must be taken to resolve this trade-off.

First, for the 4-bit compressN units, computing the average per-stream performance metrics requires digit run-length statistics from the output of each operator. This is because every token in each run must execute both the inc and dec conditions, which has a significant effect on both energy and throughput. Fig. 66 shows the average occurrence count per operation for a run of a given length. For the 1-bit compressN unit, every bit is part of a run and therefore executes both an inc and a dec condition. The compress1 units are not sensitive to digit runs.

Second, the bitwidth distribution at the output of the operations is different from the bitwidth distribution at their inputs. Fig. 67 shows this bitwidth after compression.

Third, on top of the bitwidth in Fig. 67, the addition, subtraction, multiplication, and bitwise operators introduce redundant bits to the end of the encoding. For example, given the operation 1024-1023=1, the result would be encoded using 10 bits, 8 of which are unnecessary. Fig. 68 shows the probability distribution for an operation to introduce some number of redundant bits.

Type                         Transistors   Condition   Token Frequency   Energy/Token   Latency
Integrated Serial Adaptive   304+710       inc         2.24 GHz          75.51 fJ       170 ps
CompressN (4-bit)                          dec         2.36 GHz          77.27 fJ       170 ps
                                           pass        2.30 GHz          56.34 fJ       178 ps
                                           clear       2.38 GHz          71.77 fJ       169 ps
BD Serial Adaptive           338+710       inc         1.98 GHz          91.46 fJ       215 ps
CompressN (4-bit)                          dec         2.21 GHz          84.57 fJ       211 ps
                                           pass        1.94 GHz          61.66 fJ       226 ps
                                           clear       1.99 GHz          93.46 fJ       187 ps
QDI Serial Adaptive          127+710       inc         2.37 GHz          49.66 fJ       168 ps
CompressN (1-bit)                          dec         2.90 GHz          47.10 fJ       40 ps
                                           clear       2.32 GHz          57.10 fJ       73 ps

Table 6. Raw performance measurements for the compressN units.

Type                         Transistors   Condition   Token Frequency   Energy/Token   Latency
Integrated Serial Adaptive   396           init        1.81 GHz          63.24 fJ       167 ps
Compress1 (4-bit)                          pass        2.46 GHz          64.34 fJ       174 ps
                                           end         2.25 GHz          51.77 fJ       176 ps
BD Serial Adaptive           356           init        1.99 GHz          79.86 fJ       221 ps
Compress1 (4-bit)                          pass        1.71 GHz          81.39 fJ       209 ps
                                           end         1.78 GHz          59.33 fJ       209 ps
QDI Serial Adaptive          238           init        2.48 GHz          20.72 fJ       45 ps
Compress1 (1-bit)                          pass        2.24 GHz          32.15 fJ       98 ps
                                           end         2.65 GHz          29.78 fJ       88 ps

Table 7. Raw performance measurements for the compress1 units.

Fig. 66: Combined probability distribution of the output run lengths for a given start bit for the addition, subtraction, multiplication, and bitwise operations.

Fig. 67: Average bitwidth distribution of the output for the addition, subtraction, multiplication, and bitwise operations.

The utilization of the behavioral conditions for each circuit must be determined from the data in Fig. 66, Fig. 67, and Fig. 68. Realistically, doing so correctly would require the joint distribution of all of these measures, which is unfortunately unavailable. Therefore, these distributions must be assumed to be independent. There are, though, a few things that can constrain this assumption. First, the sum of the bitwidth and redundant bits must be within the maximum bitwidth. Second, any bit run must also fit within the maximum bitwidth. Third, runs cannot overlap within a single input.

For the compressN units, the behavior can be divided into two sections. The first section covers the non-redundant internal tokens, while the second covers the redundant internal tokens.

For the first section, the run-length distribution gives information about the "inc", "dec", and "pass" conditions. Specifically, only the last run in the digit matters. This means that the run's start bit is inside the digit and its length at least covers the remaining bits in the digit, length >= excess. This prevents that digit from being double counted in the pass condition.
First, this digit is passed: u['pass'] += p. Then, any further digit that is completely contained within the run, int((length-excess)/packet), is incremented, u['inc'] += ltok*p, and then decremented, u['dec'] += ltok*p. For the second section, the output and redundant-bit distributions give information about the remaining "inc" condition cycles. The redundant bits that do not already slot into the last digit in the stream, int((red-excess-1)/packet), are incremented in the counter, u['inc'] += rtok*p, and then cleared, u['clear'] = 1.0.

Fig. 68: Average number of redundant bits introduced into the encoding by the addition, subtraction, multiplication, and bitwise operations.

    u = {'inc': 0, 'dec': 0, 'pass': 0, 'clear': 0}
    for start in range(0, max_bitwidth(L)):
        for length in range(1, max_bitwidth(L)-start+1):
            p = P((start, length) in runs(L))
            offset = start%packet
            excess = (packet - offset)%packet
            if length >= excess:
                ltok = int((length-excess)/packet)
                u['inc'] += ltok*p
                u['dec'] += ltok*p
                if offset > 0:
                    u['pass'] += p
    for width in range(1, max_bitwidth(L)+1):
        for red in range(0, max_bitwidth(L)-width):
            p = P(bitwidth(L) == width and redundant(L) == red)
            offset = (width-1)%packet
            excess = (packet - offset)%packet
            if red >= excess:
                rtok = int((red-excess-1)/packet)
                u['inc'] += rtok*p
    u['clear'] = 1.0  # every stream executes exactly one clear

This computes the utilizations in Table 8. On average, the compressN unit will compress 4.733 bits or 0.835 digits. These are slightly off from each other because of digit boundaries. Keep in mind that this assumes a compression unit is placed after every operation.

Condition      Average Cycles/Stream
               1-bit     4-bit
inc            23.021    1.714
dec            18.289    0.879
pass            0.000    3.253
clear           1.000    1.000
Total Cycles   42.310    6.846

Table 8. Utilization of each compressN condition for the addition, subtraction, multiplication, and bitwise operators.

Computing the compress1 utilizations is somewhat easier. All of the internal tokens in the digit stream are forwarded in the "pass" condition except the last redundant digit, u['pass'] += (rtok-skip)*p. The "init" condition is only executed if there is an internal token to load, rtok > 0. Then, every operation executes a single "end" condition, u['end'] += p, emitting the cap token.

    u = {'init': 0, 'pass': 0, 'end': 0}
    for width in range(1, max_bitwidth(L)+1):
        for red in range(0, max_bitwidth(L)-width):
            p = P(bitwidth(L) == width and redundant(L) == red)
            rtok = int((width+red+packet-2)/packet)
            skip = 0
            offset = width%packet
            excess = (packet-offset)%packet
            if red > excess+packet:
                skip = 1
            if rtok > 0:
                u['init'] += p
                u['pass'] += (rtok-skip)*p
            u['end'] += p

This computes the utilizations in Table 9. While the compress1 approach significantly reduces the total cycle count for a 1-bit datapath, it is actually worse for the 4-bit datapaths. This is because the "pass" condition in the compressN units covers the vast majority of tokens, leaving the "inc" condition mostly to the redundant tokens.

Condition      Average Cycles/Stream
               1-bit     4-bit
init            0.986    0.986
pass           20.419    5.361
end             1.000    1.000
Total Cycles   22.405    7.347

Table 9. Utilization of each compress1 condition for the addition, subtraction, multiplication, and bitwise operators.

Finally, the average performance for each approach is computed using these utilizations in Fig. 69 (a sketch of the arithmetic follows below). The Integrated QDI/BD designs successfully outperform the QDI and Bundled-Data designs by a significant margin. Ultimately, the compressN approach outperforms the compress1 approach for wider datapaths since the "pass" condition becomes exponentially more likely to cover the majority of the internal tokens.

Fig. 69: Overview of the compress unit performance.
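The per-stream averages behind Fig. 64 and Fig. 69 amount to weighting each condition's raw measurements by its cycles per stream. The following is a minimal sketch assuming a simple cycle-weighted average; the exact weighting used for the figures is not spelled out here, so treat this as illustrative:

    def per_stream_average(util, raw):
        # util[c]: cycles/stream (Tables 5, 8, 9); raw[c]: per-token
        # measurements (Tables 4, 6, 7) as {'GHz': ..., 'fJ': ...}
        time_ns = sum(util[c] / raw[c]['GHz'] for c in util)  # cycles/GHz = ns
        energy_fj = sum(util[c] * raw[c]['fJ'] for c in util)
        avg_ghz = sum(util.values()) / time_ns                # effective rate
        return energy_fj, avg_ghz

    # e.g. the Integrated compress1 (4-bit) rows of Table 7 with Table 9:
    util = {'init': 0.986, 'pass': 5.361, 'end': 1.000}
    raw = {'init': {'GHz': 1.81, 'fJ': 63.24},
           'pass': {'GHz': 2.46, 'fJ': 64.34},
           'end':  {'GHz': 2.25, 'fJ': 51.77}}
    print(per_stream_average(util, raw))   # roughly 459 fJ/stream at ~2.3 GHz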
5.5.3 Serial to Parallel

The serial-to-parallel units ultimately do not incur as high a cost. They are not used particularly often, and they do not need to be particularly fast, though low latency would certainly be a beneficial property.

Ultimately, the two approaches offer similar performance, routing tokens at around 1.8 GHz. At 153 transistors per stage, the downflow approach requires 30% fewer transistors because it only needs a single layer of latches; the upflow approach comes to 219 transistors per stage. However, the upflow approach uses 20% less energy on average because shorter streams do not need to be routed across the whole pipeline. Therefore, as the length of the pipeline increases, the energy required by the upflow approach levels off with the average stream length while the downflow approach continues with linearly increasing energy requirements. Fig. 70 shows the frequency leveling off to its steady state as the pipeline length increases, while the energy increases with pipeline length because the tokens have to be routed further along the pipeline.

Fig. 70: Throughput and energy per serial token of the upflow and downflow serial to parallel units as pipeline length increases.

CHAPTER 6
BITWISE OPERATIONS

Bitwise operators generally fall into two categories. In the first category, operators have no cross-bit dependencies: each bit in the result is dependent only upon the bits in the inputs at the same bit location. This category covers AND, OR, invert, XOR, and so on. The second category involves moving bits around without changing their value, covering shift and rotate. Unfortunately, these operators are quite a bit more complex in the context of digit-serial data.

This work will only examine shifting. Rotation involves saving multiple tokens from the beginning of a digit-stream to move them to the end. Direct implementation requires an unbounded amount of memory, making it a bad fit for a digit-serial ALU. Furthermore, in the context of adaptive digit-serial arithmetic, rotation is no longer meaningful because it requires a rigid bitwidth to rotate around.

Non-adaptive digit-serial shifting is well studied, leading to a particularly simple implementation found in [144]. Unfortunately, this simplicity is ultimately derived from the non-adaptive nature of the architecture, shifting the bitwidth boundary location and overwriting digits of earlier or later digit-streams instead of shifting the value itself. An alternative technique in [141] simply uses a multi-port shift register. However, the complexity of this circuit grows quickly with the maximum amount of shifting possible. The implementation that most closely matches our requirements is the adaptive bit-serial shifter found in BitSNAP [232][233]. It shifts the value by adding zeros to or deleting bits from the front of the stream, tracking the shift amount with a counter. This work builds on that premise to support digit-serial shifting.

6.1 AND, OR, XOR

A lack of cross-bit dependencies means that the bitwise operators can be grafted directly onto the control circuitry of other operators with no extra overhead. The most basic example of this would be grafting them into the datapath of the sign extension unit in Chapter 5 Section 1.
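Behaviorally, once the sign extension logic has aligned the two streams, the operation itself is a digit-by-digit map. A minimal sketch in the same token convention as before (the streams are assumed to have already been equalized in length):

    def serial_xor(a_tokens, b_tokens):
        # Works digit-by-digit because XOR has no cross-bit dependencies;
        # the output cap fires only when both input caps arrive together.
        return [(ca & cb, da ^ db)
                for (ca, da), (cb, db) in zip(a_tokens, b_tokens)]

Swapping da ^ db for da & db or da | db yields the other two operators, which is why all three can share one control skeleton.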
6.2 Shift Left

There are three primary modes for the shift left operator: the digit-shift, the bit-shift, and the cap-token. The first mode, as implemented by the first and second conditions, handles the digit-shift by adding zeros to the front of the stream. The digit-shift amount is recorded by a counter which is decremented for each new zero added.

When the counter flags zero, the shifter switches to the second mode, which handles the bit-shift within each digit as implemented by the third and fourth conditions. Assuming a four-bit digit, two bits of the shift amount are loaded directly into the shifter control as an offset, off, when the counter is written. off is used to control a barrel shifter driving M. The three most significant bits from the previous digit, stored in X, are shifted in from the right, displacing the bits from the current digit #Ad. Each cycle, the result of the shift is sent through the output channel S, the three most significant bits of the current digit are stored in X for the next cycle, and then the input digit is acknowledged.

When the cap token is received on A and the remaining results from the shift are successfully covered by the repeated value of the cap token, the shifter enters the third mode, implemented by the last condition. Specifically, this does not require the counter to be zero if the cap token value is also zero, since shifting zero left will always yield zero. In order to complete the output stream by sending a cap token, the shifter must wait until the bit-shift has resolved. If the bit-shift shifts 1-valued bits into a zero cap or 0-valued bits into a one cap, then the protocol will be broken. Therefore, the bit shift is checked and resolved by sending a non-cap token if necessary. Once this is done, the cap token is sent, the previous digit X is reset, and a write command is sent to the counter to load the next shift amount.

∗[[ cnt=0 → M:=({Ad,X} << off)[3:7]
  ▯ cnt≠0 → M:=0 ];
 [ cnt≠0 and Ac=0 → S!(0, M); cnt:=cnt-1
 ▯ cnt≠0 and Ac=1 and Ad≠M → S!(0, M); cnt:=cnt-1
 ▯ cnt=0 and Ac=0 → S!(0, M); X:=Ad; A?
 ▯ cnt=0 and Ac=1 and Ad≠M → S!(0, M); X:=Ad
 ▯ Ac=1 and Ad=M → S!(1, M); X:=0; A?; B?(off,cnt) ] ]
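Read as a functional model, the specification above translates almost line for line into Python. This is a behavioral sketch only, carrying over the (cap, digit) token lists and four-bit digits assumed in earlier sketches, and it omits one optimization of the real unit: the short-circuit of the digit-shift when the cap digit is zero.

    def shift_left(tokens, shamt, packet=4):
        mask = (1 << packet) - 1
        cnt, off = divmod(shamt, packet)
        out = [(0, 0)] * cnt               # digit-shift mode: source zeros
        x, i = 0, 0                        # x: top bits of the previous digit
        while i < len(tokens):
            cap, d = tokens[i]
            # M := ({Ad, X} << off)[3:7], the barrel shift across digits
            m = ((((d << (packet - 1)) | x) << off) >> (packet - 1)) & mask
            if cap and d == m:
                out.append((1, m))         # cap mode: stream complete
                i += 1
            elif cap:
                out.append((0, m))         # resolve leftover bits and
                x = d >> 1                 # re-read the cap token
            else:
                out.append((0, m))         # bit-shift mode
                x = d >> 1
                i += 1
        return out

    # 5 << 3 = 40: [(0,5),(1,0)] becomes [(0,8),(0,2),(1,0)]
    assert shift_left([(0, 5), (1, 0)], 3) == [(0, 8), (0, 2), (1, 0)]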
Unfortunately, implementing this logic requires a relatively complex datapath. Fig. 71 shows a high-level block diagram of the shift left operator. At the top, a dwzn counter manages cnt along with loading the new cnt from B. The two least significant bits loaded from B are forwarded out of the counter and used to control the barrel shifter as it shifts bits from X into M. The mux from M to Sd sets the resulting data to all zeros when handling the first condition in the behavioral specification, and passes the value on M otherwise. On the other side of the barrel shifter, X is stored by a special flip-flop with two clocking signals, as seen in Fig. 72. Sc0 loads the next value from Ax when vzx is high. Sc1 resets X to 0, implementing the last condition in the behavioral specification. Finally, the data from Ad is latched using the input enable Ae.

Fig. 71: Block diagram of the shift left operator.

Fig. 72: Block diagram of the flop driving X.

Most of the control signals (Sc0, Sc1, and Cw) are ultimately driven by the QDI handshake. Before implementing this handshake, the bundled datapath requires a stable view of the counter status throughout the cycle. During the handshake, decrements and writes store the counter status into a latch that drives vz and vn, signifying whether or not the counter is zero. However, this latch changes halfway through the handshake. Therefore, another latch must be used to store that value for the other half of the cycle. This latch is allowed to switch when the output requests are lowered, signalling the completion of the handshake.

¬vzx ∨ ¬vz ∧ ¬Sc1 → vnx↾
¬vnx ∨ ¬vn ∧ ¬Sc0 → vzx↾
vzx ∧ (vz ∨ Sc1) → vnx⇂
vnx ∧ (vn ∨ Sc0) → vzx⇂

The output mux and the barrel shifter are combined, and the control Ox and vzx are combined into a one-hot encoding for the five possible conditions. ctrl0 maps Sd[0:4] to Axd[0:4], implementing a shift amount of 0. ctrl1 maps Sd[0:4] to {Axd[0:3],X2}, implementing a shift amount of 1. ctrl2 maps Sd[0:4] to {Axd[0:2],X[1:3]}, implementing a shift amount of 2. ctrl3 maps Sd[0:4] to {Axd[0],X[0:3]}, implementing a shift amount of 3. Finally, vnx maps Sd[0:4] to 0, implementing the output mux. The logic driving ctrl0 is swapped relative to the others in order to pull the signals from Ox with the lowest possible gate delay.

Ox0 ∨ Ox1 ∨ vnx → ctrl01⇂
¬Ox0 ∧ ¬Ox1 ∧ ¬vnx → ctrl01↾
ctrl01 → ctrl00⇂
¬ctrl01 → ctrl00↾
Ox0 ∧ ¬Ox1 ∧ vzx → ctrl10⇂
¬Ox0 ∨ Ox1 ∨ ¬vzx → ctrl10↾
ctrl10 → ctrl11⇂
¬ctrl10 → ctrl11↾
¬Ox0 ∧ Ox1 ∧ vzx → ctrl20⇂
Ox0 ∨ ¬Ox1 ∨ ¬vzx → ctrl20↾
ctrl20 → ctrl21⇂
¬ctrl20 → ctrl21↾
Ox0 ∧ Ox1 ∧ vzx → ctrl30⇂
¬Ox0 ∨ ¬Ox1 ∨ ¬vzx → ctrl30↾
ctrl30 → ctrl31⇂
¬ctrl30 → ctrl31↾

This shifting logic is implemented with pass transistor logic using only a single layer of pass transistors. Four of the five conditions are controlled by the ctrl signals. The final condition is the output mux controlled by vnx. With the inputs taken from the inverted sense of Ax, and the output of the shifting logic protected by inverters, the total gate delay of the datapath from Ad to Sd is two.

These five conditions happen to map well to the cases in which bits from the previous digit are stored in X. This also allows for clock-gating of the unused flops in X depending upon the amount of the bit-shift and the value of the counter.

@Rclk ∧ ¬ctrl30 → Xclk0↾
¬@Rclk ∧ ctrl31 ∨ ctrl30 → Xclk0⇂
@Rclk ∧ (¬ctrl30 ∨ ¬ctrl20) → Xclk1↾
¬@Rclk ∧ (ctrl31 ∨ ctrl21) ∨ ctrl30 ∧ ctrl20 → Xclk1⇂
@Rclk ∧ (¬ctrl30 ∨ ¬ctrl20 ∨ ¬ctrl10) → Xclk2↾
¬@Rclk ∧ (ctrl31 ∨ ctrl21 ∨ ctrl11) ∨ ctrl30 ∧ ctrl20 ∧ ctrl10 → Xclk2⇂

Finally, because the delay of the datapath is so low, the comparison of Ax to Sd can be implemented directly on the output. All of the bits in Ax will be the same when A is a cap token, so the comparison must only check one. Furthermore, only the first three bits of the output could differ from the cap token value due to the shift. Therefore, the comparison can be implemented using a single gate.

¬Axd3 ∧ ¬Sd2 ∧ ¬Sd1 ∧ ¬Sd0 ∨ Axd3 ∧ Sd2 ∧ Sd1 ∧ Sd0 → D1⇂
Axd3 ∧ (¬Sd2 ∨ ¬Sd1 ∨ ¬Sd0) ∨ ¬Axd3 ∧ (Sd2 ∨ Sd1 ∨ Sd0) → D1↾
D1 → D0⇂
¬D1 → D0↾

Because this is an Integrated QDI/BD circuit, it would be possible to pull the input request data Ac into the bundled datapath using a latch before the delay lines. This would facilitate grouping the first four conditions into a single forward driver for Sc0. Unfortunately, the fourth condition, handling sign extension, does not acknowledge the input request and so skips the delay lines. Therefore, a delay line must be placed on the reset of Sc0 for only the fourth condition, to allow the datapath and the comparison D to resolve. If all four conditions are combined into a single forward driver, separating the fourth condition out for delay would become quite expensive.
Furthermore, this approach requires an extra latch anyway, which is equivalent in cost to an extra C-element for a third forward driver. So instead, the first and third conditions from the CHP can be combined into a single forward driver R0, and the second and fourth conditions may be combined into R1. Finally, the last condition is directly implemented by R2. This removes the need for the latch on Ac and allows for a delay line on the reset phase of R1 to cover the fourth condition.

Cz ∧ Cn ∧ Se ∧ Ac0 → R0↾
Cz ∧ Cn ∧ Se ∧ Ac1 ∧ D1 → R1↾
We ∧ Se ∧ Ac1 ∧ D0 → R2↾
R2 ∨ vzx ∧ R0 ∨ Cz ∧ Cn ∧ Cw → Ae⇂

The pass transistor intermediate forward drivers micro-optimization can be used to drive Sc0 using R0 and R1. After that, the pass transistor gated forward driver can be used to generate the counter decrement command. More specifically, the counter may be decremented by R0 or R1, but only when it is not empty.

@R0 ∧ ¬_R0 ∨ @R1 ∧ ¬_R1 → Sc0↾
¬@R0 ∧ _R1 ∨ ¬@R1 ∧ _R0 → Sc0⇂
Sc1 = R2
@Sc0 ∧ ¬vzx → Cd↾
¬@Sc0 ∧ vnx → Cd⇂
vzx → Cd⇂

When the counter decrement command is acknowledged by the counter, that acknowledgement is stored into a latch for use next cycle.

¬vz ∨ ¬Cn → vn↾
¬vn ∨ ¬Cz → vz↾
vz ∧ Cn → vn⇂
vn ∧ Cz → vz⇂

During the reset phase of the forward drivers, all three have sent an output request and therefore must wait for the output acknowledge on Se. R0 and R1 decrement the counter when it is not zero, which means that either vz and vn have not changed when the counter acknowledged with Cn, or vz is set by an acknowledgement on Cz. Finally, R0 acknowledges the input and must wait for the input request on Ac0 to reset. Thankfully, this condition is mutually exclusive with the counter decrement. Meanwhile, R1 simply skips in this case. Since vzx and vnx provide a stable view of the counter, they can be used directly to check for this skip condition.

¬Se ∧ (¬vn ∧ ¬Cz ∨ ¬Cn ∨ ¬Ac0) → R0⇂
¬Se ∧ (¬vn ∧ ¬Cz ∨ ¬Cn ∨ ¬vnx) → R1⇂
¬Se ∧ ¬We ∧ ¬Ac1 → R2⇂
¬R2 ∧ (¬vzx ∨ ¬R0) ∧ (¬Cw ∨ ¬Cz ∨ ¬Cn) → Ae↾

The behavior implemented by R2 is rather special. Of note is the interaction between the counter write and the output of the cap token with Sc1. Specifically, Sc1 must be allowed to reset before Cz or Cn acknowledge the write request. If not, two undesirable behaviors are introduced. First, the unit would emit a token on the output during reset before any input had been received. Second, each shift operation would have to wait for the next one to be ready before completing, which means that the final shift operation would never complete. Therefore, an extra half-buffer must be inserted between the shifter and the counter during the write command, with Sc1 as the input request, Cw as the output request, and We as the input enable. The output acknowledge is stored in vz and vn with the standard completion detection.

Cz ∧ Cn ∧ Sc1 → Cw↾
Cw → We⇂
¬Sc1 ∧ (¬Cz ∧ ¬vn ∨ ¬Cn ∧ ¬vz) → Cw⇂
¬Cw → We↾

Take note that R0 and R1 do not acknowledge We because the input enable Ae is held low by Cw until the write command has been acknowledged. This is necessary because the pull-down stack for Sc0 is already five transistors long without We.

This design ultimately requires four delay lines. Two delay lines are placed on the input requests, handling the majority of the cases. A third delay line must be placed on the reset of Cz; this allows time for the datapath to resolve during the transition from the digit-shift mode to the bit-shift mode. Finally, as previously discussed, a delay line is placed on the reset of R1, allowing the datapath to resolve after updates on X. This means that a separate clocking signal Rclk must be generated before the delay line on R1, and Sc0 must be generated after the delay line on R1.

@R0 ∧ ¬_R0 ∨ @R1 ∧ ¬_R1 → Rclk↾
¬@R0 ∧ _R1 ∨ ¬@R1 ∧ _R0 → Rclk⇂
6.3 Shift Right

The right shift is very similar to the left shift in many aspects of its behavior. The second, third, and fourth conditions of the right shift operator are nearly identical to the third, fourth, and fifth conditions of the left shift operator. However, the first condition consumes input tokens instead of sourcing zeros on the output.

Once again, there are three primary modes for this shifter: the digit-shift, the bit-shift, and the cap-token. The first mode is implemented by the first condition and handles the digit-shift by deleting tokens from the front of the input stream. The second mode is implemented by the second and third conditions and handles the bit-shift. The final mode is handled by the fourth condition, sending the cap token and loading a new value into the counter.

Unfortunately, the right shifter has some extra considerations. First, the values on A are digit-serial with the least significant digit first. Suppose this value is being shifted right by two bits. Assuming that each digit is four bits, the first digit sent on S will require two bits from the first digit of A and two bits from the second. Therefore, if the shift is not aligned to the size of the digit, then the right shift will always need to read one extra digit from A. This is why the counter is incremented during the load.

Second, when developing a full ALU, it would be beneficial for these operators to share devices when possible. Requiring a bi-directional barrel-shifter to implement the combined left and right shift datapath is undesirable. Instead, the right shift in the datapath, M:=({#Ad,X} >> off)[0:4], can be replaced by a left shift, M:=({#Ad,X} << off)[3:7], by negating the offset off.

∗[[ cnt=0 → M:=({Ad,X} << off)[3:7]
  ▯ cnt≠0 → M:=Ad ];
 [ cnt≠0 and Ac=0 → X:=Ad; A?; cnt:=cnt-1
 ▯ cnt=0 and Ac=0 → S!(0, M); X:=Ad; A?
 ▯ Ac=1 and Ad≠M → S!(0, M); X:=Ad
 ▯ Ac=1 and Ad=M → S!(1, M); X:=0; A?; B?(off,cnt);
   [ off≠0 → off:=4-off, cnt:=cnt+1 ▯ off=0 → skip ] ] ]

This allows the datapath of the right shift operator to be nearly identical to the left shift. The primary difference between the two comes from the handling of the cap token. If the counter is empty, then the cap token is handled similarly to the left shift. However, if the counter is not empty, then there is no need to extend the digit stream in the way that the left shift operator does. Instead, the cap token is simply forwarded and the counter is ignored. This means that when the counter is not zero, the shift amount needs to be zeroed so that the cap token data can be correctly forwarded to S.

Instead of directly incrementing the counter when the value is loaded, the QDI control will have an extra state in its internal memory. Therefore, vz is replaced by v0, vn is replaced by v2, and a new state v1 is introduced to handle the increment case. This is significantly cheaper than just incrementing the counter on the write, for two reasons. First, it reduces the number of decrement operations required from the counter. Second, implementing a write-increment command would also require a functional write command for when the carry on the increment is zero. This doubles the overhead of the write command, which is already expensive to implement.
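The negated-offset trick in the CHP above can also be rendered as a token-level sketch in Python. As before, the (cap, digit) encoding and digit width are assumptions for illustration, and the handshake detail is deliberately absent:

    def shift_right(tokens, shamt, packet=4):
        mask = (1 << packet) - 1
        cnt, r = divmod(shamt, packet)
        off = 0
        if r:                               # unaligned: negate the offset and
            off, cnt = packet - r, cnt + 1  # read one extra digit from A
        out, x, i = [], 0, 0
        while i < len(tokens):
            cap, d = tokens[i]
            if cnt:
                m = d                       # counter not empty: M := Ad
            else:
                m = ((((d << (packet - 1)) | x) << off) >> (packet - 1)) & mask
            if cap and d == m:
                out.append((1, m))          # cap mode
                break
            if cnt:
                x, cnt, i = d >> 1, cnt - 1, i + 1  # digit-shift: consume A
            elif cap:
                out.append((0, m))          # resolve leftover bits and
                x = d >> 1                  # re-read the cap token
            else:
                out.append((0, m))          # bit-shift mode
                x, i = d >> 1, i + 1
        return out

    # -9 >> 2 = -3 (arithmetic): [(0,7),(1,15)] becomes [(0,13),(1,15)]
    assert shift_right([(0, 7), (1, 15)], 2) == [(0, 13), (1, 15)]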
Much like the left shift operator, the datapath for the right shift needs a stable view of this internal memory. Therefore, a 3-valued latch drives vx, which is allowed to change once the forward drivers have been reset. However, the internal memory in the QDI control will be handled with a positive latch. This means that its valid state has one signal high and two low. Chaining a positive latch on top of that for vx would introduce some complexity into the set logic. Instead, a negative latch is used to generate the inverted value _vx, in which a valid state has one signal low and two high, simplifying the set logic. Finally, the datapath logic also needs the non-inverted signals vx2 and vx0.

¬_vx0 ∨ ¬_vx1 ∨ ¬v2 ∧ ¬R0 ∧ ¬R1 ∧ ¬R2 → _vx2↾
¬_vx0 ∨ ¬_vx2 ∨ ¬v1 ∧ ¬R0 ∧ ¬R1 ∧ ¬R2 → _vx1↾
¬_vx1 ∨ ¬_vx2 ∨ ¬v0 ∧ ¬R2 → _vx0↾
_vx0 ∧ _vx1 ∧ (v2 ∨ R0 ∨ R1 ∨ R2) → _vx2⇂
_vx0 ∧ _vx2 ∧ (v1 ∨ R0 ∨ R1 ∨ R2) → _vx1⇂
_vx1 ∧ _vx2 ∧ (v0 ∨ R2) → _vx0⇂
¬_vx0 → vx0↾
_vx0 → vx0⇂
¬_vx2 → vx2↾
_vx2 → vx2⇂

Fig. 73: Block diagram of the shift right operator.

The logic surrounding the offset O[0:2] is a little complex. The control signal inc determines when the counter is incremented during the write. Unfortunately, the QDI control needs access to this signal before Cw is lowered in order to correctly set the value of its internal memory v. Furthermore, the QDI control also needs inc to be stable throughout the rest of the operation for subsequent control of v while decrementing. Therefore, inc must be sampled between the p and n latches that make up the flop for Ox. Per the CHP, when the offset is not zero, the counter should be incremented.

Op00 ∧ Op10 → inc1⇂
¬Op00 ∨ ¬Op10 → inc1↾
inc1 → inc0⇂
¬inc1 → inc0↾

Building off this idea, putting the negation unit between the two latches allows the delay of the negation unit to be hidden in the half-cycle between when Cw is raised and when it is lowered. Because only a two-bit value is being negated and the carry is not necessary, the negation unit is simply an XOR gate, which can be implemented with pass-transistor logic.

nOp0 = Op0
@Op11 ∧ ¬Op00 ∨ @Op10 ∧ ¬Op01 → nOp10↾
¬@Op11 ∧ Op01 ∨ ¬@Op10 ∧ Op00 → nOp10⇂
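As a quick sanity check of why the carry can be dropped (an illustrative aside in plain Python, not from the thesis): negating a two-bit value modulo 4 passes the low bit through and XORs it into the high bit.

    # -off mod 4 = {off1 ^ off0, off0}: low bit passes through, high bit
    # is XORed with it, so no carry chain is needed
    for off in range(4):
        b0, b1 = off & 1, (off >> 1) & 1
        assert ((b1 ^ b0) << 1) | b0 == (-off) % 4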
The control signals for the right shift are generated much like the left shift. Once again, the multiplexer can be merged into the control for the shifter. When the counter is not zero, ctrl0 is set to keep the shift value equal to zero.

vx0 ∧ (Ox0 ∨ Ox1) → ctrl01⇂
¬vx0 ∨ ¬Ox0 ∧ ¬Ox1 → ctrl01↾
ctrl01 → ctrl00⇂
¬ctrl01 → ctrl00↾
Ox0 ∧ ¬Ox1 ∧ vx0 → ctrl10⇂
¬Ox0 ∨ Ox1 ∨ ¬vx0 → ctrl10↾
ctrl10 → ctrl11⇂
¬ctrl10 → ctrl11↾
¬Ox0 ∧ Ox1 ∧ vx0 → ctrl20⇂
Ox0 ∨ ¬Ox1 ∨ ¬vx0 → ctrl20↾
ctrl20 → ctrl21⇂
¬ctrl20 → ctrl21↾
Ox0 ∧ Ox1 ∧ vx0 → ctrl30⇂
¬Ox0 ∨ ¬Ox1 ∨ ¬vx0 → ctrl30↾
ctrl30 → ctrl31⇂
¬ctrl30 → ctrl31↾

The shifting logic for the datapath of the right-shift operator is implemented similarly to the left shift. There is only a single layer of pass transistors, with all conditions controlled by the ctrl signals. Once again, the inputs are taken from the inverted sense of Ax, and the output of the shifting logic is protected by a layer of inverters. Therefore, the total gate delay of the datapath from Ad to Sd is two.

The datapath for the left-shift unit was able to clock-gate the flops driving X when the counter was not zero, which saved a lot of energy along the way. When the counter reached zero, the previous digit was guaranteed to be zero. Because X was reset to zero during the write, no extra work was necessary to store the digit just prior to the counter being empty.

The right-shift unit is a bit more difficult. While X is no longer reset during the write, the last input digit consumed right before the counter reaches zero must be recorded. This means that X can only be clock-gated while the value of the counter is greater than one. Most cases increment the counter during the write, and that increment uses the internal memory in the QDI control, v1. Therefore, it is possible to clock-gate X for those cases. The one case that does not increment the counter is when the shift amount is aligned to the digit size. In this case, the offset is zero, so the flop driving X is left unused and can therefore be turned off. This means that it is still possible to fully clock-gate X to save energy.

Unfortunately, this new condition, with the counter being greater than one, means that the clock-gating for X no longer maps as nicely to the ctrl signals. Therefore, gate signals must be generated separately from the ctrl signals to handle this clock-gating.

(Ox0 ∨ Ox1) ∧ _vx2 → gate20⇂
¬Ox0 ∧ ¬Ox1 ∨ ¬_vx2 → gate20↾
gate20 → gate21⇂
¬gate20 → gate21↾
Ox1 ∧ _vx2 → gate10⇂
¬Ox1 ∨ ¬_vx2 → gate10↾
gate10 → gate11⇂
¬gate10 → gate11↾
Ox0 ∧ Ox1 ∧ _vx2 → gate00⇂
¬Ox0 ∨ ¬Ox1 ∨ ¬_vx2 → gate00↾
gate00 → gate01⇂
¬gate00 → gate01↾
@S0 ∧ ¬gate00 → Xclk0↾
¬@S0 ∧ gate01 ∨ gate00 → Xclk0⇂
@S0 ∧ ¬gate10 → Xclk1↾
¬@S0 ∧ gate11 ∨ gate10 → Xclk1⇂
@S0 ∧ ¬gate20 → Xclk2↾
¬@S0 ∧ gate21 ∨ gate20 → Xclk2⇂

Once again, because the delay of the datapath is so low, the comparison of Ax to Sd can be implemented directly on the output. This comparison is identical to the one in the left shift operator.

¬Axd3 ∧ ¬Sd2 ∧ ¬Sd1 ∧ ¬Sd0 ∨ Axd3 ∧ Sd2 ∧ Sd1 ∧ Sd0 → D1⇂
Axd3 ∧ (¬Sd2 ∨ ¬Sd1 ∨ ¬Sd0) ∨ ¬Axd3 ∧ (Sd2 ∨ Sd1 ∨ Sd0) → D1↾
D1 → D0⇂
¬D1 → D0↾

To simplify the control a bit and reduce the length of transistor stacks, the inc control signal will be used to gate Cz into two separate signals, Cz1 and Cz0. Doing this requires special treatment for pass transistor logic in the QDI delay model. Therefore, acknowledging Cz should also acknowledge both Cz1 and Cz0.

@Cz ∧ ¬inc0 ∨ ¬inc1 → Cz1↾
¬@Cz ∧ inc1 → Cz1⇂
@Cz ∧ ¬inc1 ∨ ¬inc0 → Cz0↾
¬@Cz ∧ inc0 → Cz0⇂

With the datapath complete, it is now possible to implement a clean QDI control. Once again, it would be possible to pull the input request data Ac into the bundled datapath using a latch before the delay lines. However, like the left shift operator, this does not actually reduce the complexity of the circuit overall. This gives three forward drivers, R0, R1, and R2, that are identical to the three forward drivers of the left shift. Furthermore, the gate driving Ae has only one minor difference: while the left shift only acknowledges the input for R0 when the counter is empty, the right shift always acknowledges the input for R0.

Cz ∧ Cn ∧ Se ∧ Ac0 → R0↾
Cz ∧ Cn ∧ Se ∧ Ac1 ∧ D1 → R1↾
We ∧ Se ∧ Ac1 ∧ D0 → R2↾
R2 ∨ R0 ∨ Cn ∧ Cz ∧ Cw → Ae⇂

Once again, the pass transistor intermediate forward drivers micro-optimization is used to drive Sc0 using R0 and R1. However, for the right shift it must also be gated by vx0. This ensures that the digits deleted from the front of the input stream on A are not then forwarded on S. While the left shift unit gated Sc0 to generate Cd, the right shift unit does not decrement the counter when R1 is raised. This means that Cd can be gated directly from R0 instead.
@R0 ∧ ¬_R0 ∨ @R1 ∧ ¬_R1 → S0↾
¬@R0 ∧ _R1 ∨ ¬@R1 ∧ _R0 → S0⇂
@S0 ∧ ¬_vx0 → Sc0↾
¬@S0 ∧ vx0 ∨ _vx0 → Sc0⇂
Sc1 = R2
@R0 ∧ ¬_vx2 → Cd↾
¬@R0 ∧ vx2 ∨ _vx2 → Cd⇂

The extra increment is handled by the internal memory. Now, when receiving the counter status, Cn sets v2, and Cz conditionally sets either v0 or v1 depending upon the increment. Finally, the last decrement is handled directly by R0, switching the internal memory from v1 to v0.

¬v1 ∧ ¬v2 ∨ ¬Cz0 ∨ ¬_vx1 ∧ ¬_R0 ∧ ¬Ae → v0↾
¬v2 ∧ ¬v0 ∨ ¬Cz1 → v1↾
¬v1 ∧ ¬v0 ∨ ¬Cn → v2↾
(v1 ∨ v2) ∧ Cz0 ∧ (_vx1 ∨ _R0 ∨ Ae) → v0⇂
(v2 ∨ v0) ∧ Cz1 → v1⇂
(v1 ∨ v0) ∧ Cn → v2⇂

The reset phase of the forward drivers is now fairly different from the left shift. For R0, instead of always sending a digit on S, it now always consumes the input digit from A. Furthermore, R0 also handles the last decrement from v1 to v0, which must be acknowledged in the reset phase. For R1, the counter never decrements. More specifically, the counter is always guaranteed to be zero when R1 occurs; if the counter is not zero, then D is guaranteed to be false. Therefore, R1 must only acknowledge the output enable Se. R2 and the extra half-buffer on Cw remain identical between the left and right shift.

¬Ac0 ∧ (¬v2 ∧ ¬Cz ∨ ¬Cn ∨ ¬Se ∨ ¬_vx1 ∧ ¬v1) → R0⇂
¬Se → R1⇂
¬Se ∧ ¬We ∧ ¬Ac1 → R2⇂
¬R2 ∧ ¬R0 ∧ (¬Cw ∨ ¬Cz ∨ ¬Cn) → Ae↾

However, in the reset phase of the rules driving Cw, the extra increment condition must be correctly acknowledged.

Cz ∧ Cn ∧ R2 → Cw↾
Cw → We⇂
¬R2 ∧ (¬Cn ∧ ¬v0 ∧ ¬v1 ∨ ¬v2 ∧ (¬Cz0 ∧ ¬v1 ∨ ¬Cz1 ∧ ¬v0)) → Cw⇂
¬Cw → We↾

Once again, R0 and R1 do not acknowledge We because the input enable Ae is held low by Cw until the write command has been acknowledged.

The right shift unit requires three delay lines. Like the left shift unit, two delay lines are placed on the input requests, handling the majority of the cases, and one delay line is placed on the reset of R1, allowing updates on X to resolve. However, unlike the left shift unit, the delay line on the input requests successfully covers the transition from the digit-shift to the bit-shift modes. Therefore, the right-shift unit does not require the fourth delay line on Cz. Like the left shift unit, this means that a separate clocking signal Rclk must be used for X.

@R0 ∧ ¬_R0 ∨ @R1 ∧ ¬_R1 → Rclk↾
¬@R0 ∧ _R1 ∨ ¬@R1 ∧ _R0 → Rclk⇂

Finally, Ac0 does not always need to be delayed. While the input digits are being consumed by the shift, the data of those tokens does not matter because it is not forwarded through the datapath and into S. This means that while vx0 is false, meaning the counter is not empty, Ac0 need not be delayed. Therefore, pass transistors can be used to conditionally enable the delay line on Ac0.

6.4 Evaluation

6.4.1 Bitwise Operators

Fig. 74 shows the bitwidth distribution for the input operands of the bitwise operators. There is a strong concentration along the diagonal, signifying that the operands in most bitwise operations are the same width. Furthermore, a significant number of operations appear to be only one bit wide; these are likely used to resolve compound conditions for branches. Finally, the vast majority of operations are 32 bits wide.

Because the bitwise operators have been grafted onto the sign extension unit, the behavioral conditions and utilization computation remain unchanged from Chapter 5. This computes the utilization data found in Table 10.

Ultimately, there are not many ways to implement bit-parallel bitwise operators.
Furthermore, all of the bitwise operators (AND, OR, XOR) have extremely similar performance metrics. Therefore, only the Integrated Adaptive Digit-Serial XOR operator and the clocked and QDI 64-bit bit-parallel versions, as seen in Table 11, will be compared. While the clocked operator can run at 10 GHz in a vacuum, it would need to run slower in the context of a larger architecture; in most architectures, the maximum frequency is around 4 GHz. This analysis will be conservative and compare the digit-serial operator against the bit-parallel operator at 10 GHz.

In Table 12, there are four conditions. During the "ab" condition, neither input operand is a cap token, so both are consumed to produce the output. During the "a" and "b" conditions, only one of the two input operands has a cap token, so only one input is consumed. Finally, during the "cap" condition, both input operands are cap tokens; both are consumed to complete the operation. As stated earlier, the performance of the three bitwise operators is nearly identical, operating around 2.6 GHz at around 30 fJ per token.

Fig. 74: Probability distribution for the bitwidth of the left operand A and right operand B for AND, OR, and XOR.

Condition      Average Cycles/Stream (4-bit)
ab             2.439
a              0.868
b              0.627
cap            1.000
Total Cycles   4.933

Table 10. Utilization of each condition for the AND, OR, and XOR operators.

Type                             Transistors   Frequency   Energy/Op
Clocked Parallel XOR (64-bit)    3712          10.00 GHz   0.508 pJ
QDI Parallel XOR (PCHB 64-bit)   4096          3.93 GHz    1.780 pJ

Table 11. Performance measurements for the bit-parallel bitwise operators.

In Fig. 75, the distribution in Fig. 74 is applied to the raw measurements in Table 12 and compared against Table 11. Overall, as the maximum width of the operator grows, the adaptive serial operator does less and less work. At 64 bits, the adaptive serial operator is competitive in throughput per transistor with the clocked bit-parallel architectures, but uses less than half the energy for the same computation.

Fig. 75: Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth.

Type                         Transistors   Condition   Token Frequency   Energy/Token   Latency
Integrated Serial Adaptive   221           ab          2.62 GHz          41.98 fJ       94 ps
XOR                                        a           2.64 GHz          28.92 fJ       92 ps
                                           b           2.66 GHz          29.50 fJ       91 ps
                                           cap         2.68 GHz          30.42 fJ       88 ps

Table 12. Raw performance measurements for the digit-serial bitwise operators.

Unfortunately, this approach introduces some redundant tokens into the encoding of the result. If two numbers are ANDed and one is shorter than the other, then the cap token is zero and the number will be sign extended. The bitwidth of the output then matches the bitwidth of the larger of the two inputs, but should really only match the shorter. Fig. 76 shows the distribution of the number of redundant bits introduced into the output encoding. Overall, about 50% of operations do not introduce any redundant bits. However, a non-negligible number of operations introduce a significant number of redundant bits. Tackling this will be important for future work.

Fig. 76: Probability distribution for the number of redundant bits introduced per operation by this implementation of the bitwise operators.

6.4.2 Shift Operators

Overall, Fig. 77 shows that the adaptive digit-serial shift operators developed here use half the energy for the same operations while remaining competitive with the circuits synthesized by Synopsys. The integrated QDI/BD shift operators are compared against three implementations of synchronous bit-parallel shifters: the standard circuit produced by Synopsys Design Compiler with a base-2 shift amount, a full custom pass transistor shifter with a base-2 shift amount, and a full custom pass transistor shifter with a base-4 shift amount.

Fig. 77: Performance and energy averaged over the distributions in Fig. 78 and Fig. 80 vs transistor count.
6.4.2 Shift Operators

Overall, Fig. 77 shows that the adaptive digit-serial shift operators developed here use half the energy for the same operations while remaining competitive with the circuits synthesized by Synopsys. The integrated QDI/BD shift operators are compared against three implementations of synchronous bit-parallel shifters: the standard circuit produced by Synopsys Design Compiler with a base-2 shift amount, a full-custom pass transistor shifter with a base-2 shift amount, and a full-custom pass transistor shifter with a base-4 shift amount. These circuits were each evaluated for a range of bitwidths. The performance of the full 64-bit shifters of these implementations is shown in Table 13.

Fig. 77: Performance and energy averaged over the distribution in Fig. 78 and Fig. 80 vs Transistor Count.

Table 14 shows the raw per-token measurements for each condition of the length-adaptive digit-serial shifters. In the "digit-shift" condition, the counter is not empty and the shifter is either producing zeros on the output in the case of the left shift or consuming the input digits in the case of the right shift. In the "bit-shift" condition, the counter is empty and the shifter is forwarding shifted input digits to the output. In the "extend" condition, the shifter has bits remaining in its token storage that are shifted into the data from the cap token. This requires the stream to be sign-extended. Finally, in the "cap" condition, the digit-stream is completed with a cap token and the next value is loaded into the counter in preparation for the next operation.

Type                                      Transistors  Frequency  Energy/Op
Clocked Parallel Synopsys Left (64-bit)   6918         1.89 GHz   1.603 pJ
Clocked Parallel 2x Left (64-bit)         3791         2.36 GHz   1.103 pJ
Clocked Parallel 4x Left (64-bit)         3192         3.34 GHz   0.911 pJ
Clocked Parallel Synopsys Right (64-bit)  5988         0.89 GHz   1.070 pJ
Clocked Parallel 2x Right (64-bit)        3808         2.36 GHz   1.128 pJ
Clocked Parallel 4x Right (64-bit)        3185         3.34 GHz   0.991 pJ

Table 13. Performance measurements for the bit-parallel shift operators.

Type                        Transistors  Condition    Token Frequency  Energy/Token  Latency
Integrated Serial Adaptive  450+441      digit-shift  2.67 GHz         52.28 fJ      59 ps
Left                                     bit-shift    2.34 GHz         61.00 fJ      152 ps
                                         extend       2.33 GHz         61.29 fJ      163 ps
                                         cap          2.0-2.20 GHz     127.37 fJ     150 ps
Integrated Serial Adaptive  479+441      digit-shift  2.40 GHz         82.70 fJ      -
Right                                    bit-shift    1.72 GHz         73.70 fJ      206 ps
                                         extend       2.35 GHz         56.18 fJ      149.5 ps
                                         cap          2.0-2.15 GHz     151.45 fJ     149 ps

Table 14. Raw performance measurements for the shift operators.

The frequency of the cap condition is determined primarily by the counter. The maximum frequency of the write command for the counter is ultimately 2.0 GHz. However, the counter can handle other commands in parallel while the write command is working. This means that as long as write commands are not issued consecutively, it effectively operates at around 3.0 GHz. At that point, the shifter logic becomes the limiting factor, setting a frequency around 2.20 GHz. The transistor counts for the digit-serial shifters list the count for the shifter unit plus the count for a 4-bit dwzn counter.

6.4.3 Shift Left

For the left shift, Fig. 78 shows the probability distribution for the bitwidth of the shifted input A and the shift amount B as measured from the SPEC2006 benchmark. The center plot shows the joint probability distribution while each histogram shows the associated individual probability and cumulative distributions for that axis. The bitwidth of the input operand A averages around 8.97 bits while the shift amount B averages around 8.65. The utilization is computed directly from this distribution as follows. max_bitwidth(A) represents the maximum bitwidth of the shifted input A and max(B) represents the maximum value on B, which is loaded into the counters.
packet represents the size of each digit in the stream, which is assumed to be 4. P(bitwidth(A) == w and B == s) samples the distribution in Fig. 78 at (w, s). u stores the computed utilization of each condition.

Fig. 78: Probability distribution for the bitwidth of the shifted value A and the shift amount B for the left shift operator.

u = {'digit-shift': 0, 'bit-shift': 0, 'extend': 0, 'cap': 0}
for w in range(1, max_bitwidth(A)+1):
    for s in range(0, max(B)):
        p = P(bitwidth(A) == w and B == s)
        if w == 1:
            u['digit-shift'] += 0.5 * int(s/packet) * p
        else:
            u['digit-shift'] += int(s/packet) * p
        u['bit-shift'] += int((w+packet-2)/packet) * p
        if s%packet > (max_bitwidth-w+1)%packet:
            u['extend'] += p
        u['cap'] += p

Ultimately, most of the shift amount is used to append digits to the front of the digit stream. This executes the digit-shift condition int(s/packet) times in most cases. However, if the shifted input value A is 0, then it would be redundant to append these digits to the front of the stream. In this case, the digit-shift condition is skipped. If the input bitwidth is 1, then the shifted value is 0 or -1; it is assumed that 0 happens about 50% of the time, hence the 0.5 factor in the w == 1 case above.

Once the digit-shift condition has run its course, the bit-shift condition takes over. This forwards each token received on A, shifted by the remaining shift amount. Ultimately, the number of non-cap digits in A is equal to int((w+packet-2)/packet), or equivalently ceil((w-1)/packet), since the sign bit itself is carried by the cap token. Then, if bits from a non-cap token would be shifted into the cap token, the extend condition executes. This happens when the remaining shift amount s%packet is greater than the bit-spaces left over in the last non-cap token, (max_bitwidth-w+1)%packet. Finally, every digit-stream executes the cap condition exactly once.

This computes the utilization of each condition as shown in Table 15. On average, each shift runs about 5.433 cycles in total, generating that many output digits as well.

Condition     Average Cycles/Stream
digit-shift   1.839
bit-shift     2.401
extend        0.193
cap           1.000
Total Cycles  5.433

Table 15. Utilization of each condition for left shift.

To compute these numbers for a given max bitwidth, the distribution is assumed to be effectively truncated to that bitwidth. Then, the raw performance numbers in Table 14 are multiplied with the utilization in Table 15 to produce the average performance of the left shift operator as shown in Fig. 79.

Fig. 79 shows the average performance of the length-adaptive digit-serial left shift operation against the other bit-parallel shifter implementations in Table 13 for a given maximum input bitwidth. Ultimately, the digit-serial shifter operates 1.83 times faster per transistor than the Synopsys shifter but 52% slower per transistor than the best bit-parallel full-custom design. However, the digit-serial shifter uses 77% and 58% less energy respectively to execute the same operations.

Fig. 79: Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth.
6.4.4 Shift Right

For the right shift, Fig. 80 shows the probability distribution for the input bitwidth and shift amount. The bitwidth averages around 16.2 bits while the shift amount averages around 14.3 bits. Once again, the utilization is computed from this distribution. wp represents the number of digits in the input digit stream. sp represents the maximum number of digits that could be consumed by the shift. p0 represents the actual number of digits consumed.

Fig. 80: Probability distribution for the bitwidth of the shifted value A and the shift amount B for the right shift operator.

u = {'digit-shift': 0, 'bit-shift': 0, 'extend': 0, 'cap': 0}
for w in range(1, max_bitwidth(A)+1):
    for s in range(0, max(B)):
        p = P(bitwidth(A) == w and B == s)
        wp = int((w+packet-2)/packet)
        sp = int((s+packet-1)/packet)
        p0 = min(wp, sp)
        u['digit-shift'] += p0 * p
        u['bit-shift'] += (wp - p0) * p
        if s < w and (s+packet-1)%packet < (w+packet-2)%packet:
            u['extend'] += p
        u['cap'] += p

First, the right shift operation executes the digit-shift condition. This consumes one more token than necessary so that the data from the next token is available to be shifted in. Either the whole shift can be executed or the shift amount is greater than the bitwidth of the input, hence min(wp, sp). Then, the remaining digits, wp-p0, are forwarded during the bit-shift condition. If the extra token consumed during the digit-shift condition was necessary, then the extend condition generates that token. Finally, the cap token finishes the digit-stream.

This computes the utilization of each condition as shown in Table 16. On average, each shift runs about 5.411 cycles in total, generating 2.897 digits on the output.

Condition     Average Cycles/Stream
digit-shift   2.514
bit-shift     1.632
extend        0.265
cap           1.000
Total Cycles  5.411

Table 16. Utilization of each condition for right shift.

Once again, the distribution is assumed to be effectively truncated to the max bitwidth. Then, the raw performance numbers in Table 14 are multiplied with the utilization in Table 16 to produce the average performance of the right shift operator as shown in Fig. 81.

Fig. 81 shows the average performance of the length-adaptive digit-serial right shift operation against the other bit-parallel shifter implementations in Table 13 for a given maximum bitwidth. The digit-serial shifter operates about 2.84 times faster per transistor than the Synopsys implementation but 60% slower than the best bit-parallel full-custom design. However, it once again uses 54% and 50% less energy respectively for the same operations.

Fig. 81: Throughput and energy metrics scaled by maximum bitwidth.
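For concreteness, the right shift utilization loop above can be made runnable by stubbing out P, max_bitwidth, and max(B). The two-point distribution below is a made-up placeholder purely for illustration, not SPEC2006 data.

    # Runnable toy version of the right shift utilization loop. 'dist' is a
    # made-up stand-in for the measured distribution P.
    packet = 4
    dist = {(9, 4): 0.6, (11, 5): 0.4}   # {(bitwidth(A), B): probability}

    u = {'digit-shift': 0.0, 'bit-shift': 0.0, 'extend': 0.0, 'cap': 0.0}
    for (w, s), p in dist.items():
        wp = (w + packet - 2) // packet   # non-cap digits in the input stream
        sp = (s + packet - 1) // packet   # digits the shift amount could consume
        p0 = min(wp, sp)                  # digits actually consumed
        u['digit-shift'] += p0 * p
        u['bit-shift'] += (wp - p0) * p
        if s < w and (s + packet - 1) % packet < (w + packet - 2) % packet:
            u['extend'] += p
        u['cap'] += p

    print(u)  # digit-shift ~1.4, bit-shift ~1.0, extend ~0.4, cap ~1.0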
CHAPTER 7
ADDITION AND SUBTRACTION

Addition and subtraction are ultimately the core operations executed in general-compute applications, accounting for 43% of all integer arithmetic operations as shown in Fig. 31. Therefore, these operations are also some of the most explored operations in a CPU. There has long been significant research toward a varied array of bit-parallel arithmetic circuitry [137]. The Ripple-Carry Adder is simple and energy efficient but ultimately slow, producing a result in worst-case linear time. The Manchester Carry Chain improves upon this structure using pass transistor logic along the carry chain [104]. Sacrificing area and energy for latency and throughput [139][140], there is a large class of carry-lookahead adders that produce a result in worst-case logarithmic time [99][100][101][102][103]. Finally, there are hybrid adders that mix multiple strategies, tying four-bit Manchester Carry Chains together using carry-lookahead techniques [105][106].

7.1 Addition

The fundamental algorithm for LSD-first serial addition is fairly simple. Both input streams are assumed to be aligned such that the first token in each stream represents the same digit-place. Digits arrive on the input channels A and B in the same order that the carry chain is propagated. So, they are added with the carry from the previous iteration, ci, to produce the sum on the output channel S and a new carry for the next iteration, co. The CHP below describes the algorithm.

ci:=0;
∗[s := (Ad + Bd + ci) % pow(2, N);
  co := (Ad + Bd + ci) / pow(2, N);
  S!s; A?,B?; ci:=co;
 ]

However, a real implementation must support finite-length streams. So, the not-cap/cap control bit is added to each token. To operate on two streams of differing lengths, the shorter stream is sign extended by skipping the acknowledgement of its cap token, repeating it until the cap token of the longer stream. Then, both are acknowledged, continuing to the next operation. Because streams can extend to an arbitrary length, they can represent arbitrarily large numbers with a fixed precision. This builds upon the sign-extend logic presented in Chapter 5.

For addition, finite-length streams also introduce overflow conditions. When both inputs are cap tokens, then two's complement dictates that their values repeat. So the output values must also repeat. However, if co ≠ ci, then the next sum token will be different from the current one. Extending the input streams by one more token on an overflow condition guarantees that co = ci on the next iteration and that consecutive sum bits will all be the same. Then, ci is reset and the output stream is completed by forwarding a cap token.

∗[s := (Ad + Bd + ci) % pow(2, N);
  co := (Ad + Bd + ci) / pow(2, N);
  [ !Ac ∨ !Bc → S!(s,0); ci:=co;
      [ !Ac → A? ▯ else → skip ],
      [ !Bc → B? ▯ else → skip ]
  ▯ Ac ∧ Bc ∧ co≠ci → S!(s,0); ci:=co
  ▯ Ac ∧ Bc ∧ co=ci → S!(s,1); ci:=0; A?,B?
  ]
 ]
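Before moving to the circuit, the stream-level behavior specified above can be mirrored with a small Python model, shown below. It is a behavioral sketch for illustration only, representing a stream as a list of (data, cap) tokens; it is not the circuit implementation.

    # Behavioral sketch of length-adaptive serial addition. A stream is a list
    # of (data, cap) tokens, least significant digit first; the cap token
    # repeats the sign digit (all zeros or all ones). N is the digit size.
    def add_streams(A, B, N=4):
        base = 1 << N
        out, ci, ia, ib = [], 0, 0, 0
        while True:
            (ad, ac), (bd, bc) = A[ia], B[ib]
            co, sd = divmod(ad + bd + ci, base)
            if not ac or not bc:
                out.append((sd, 0)); ci = co
                if not ac: ia += 1   # consume A only if it is not a cap token
                if not bc: ib += 1   # a held cap token sign-extends the shorter stream
            elif co != ci:
                out.append((sd, 0)); ci = co   # overflow: extend by one more token
            else:
                out.append((sd, 1))            # emit the cap token, completing the stream
                return out

    # 8 + 8 overflows the first 4-bit digit and extends the result by one token:
    # add_streams([(8,0),(0,1)], [(8,0),(0,1)]) == [(0,0),(1,0),(0,1)]   # 16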
This circuit implementation builds from the integrated QDI/BD sign extension unit described in Chapter 5. Once again, there are four cases defined by the intersection of Ac and Bc that need to be implemented. In the first case, neither input is a cap token. In the second, only A has a cap token, and in the third, only B. Finally, both inputs have cap tokens in the fourth case. Unfortunately, none of these cases line up in the forward driver or acknowledgement logic. In the forward drivers, conditions 0, 1, and 2 drive Sc0, and condition 3 drives Sc1. In the acknowledgement logic, A is only acknowledged for conditions 0, 2, and 3, and B for conditions 0, 1, and 3. To avoid this, SR latches store a static version of the input requests. These are placed before the delay lines on the input requests to give them time to stabilize before the QDI circuitry starts to operate, as in the sign extension unit.

Ax0 ∨ Ac0 → Ax1⇂
Ax1 ∨ Ac1 → Ax0⇂
¬Ax0 ∧ ¬Ac0 → Ax1↾
¬Ax1 ∧ ¬Ac1 → Ax0↾
Bx0 ∨ Bc0 → Bx1⇂
Bx1 ∨ Bc1 → Bx0⇂
¬Bx0 ∧ ¬Bc0 → Bx1↾
¬Bx1 ∧ ¬Bc1 → Bx0↾

Then, the input requests are combined before the delay lines. This reduces the number of delay lines by two and has zero overhead with respect to the rest of the control. After delaying AB, the rest of the control can be implemented as necessary using AB as its input.

Fig. 82: The architecture of the Adaptive Adder.

(Ac0 ∧ (Bc0 ∨ Bc1) ∨ Ac1 ∧ Bc0) → AB0↾
Ac1 ∧ Bc1 → AB1↾
¬Ac0 ∧ ¬Bc0 → AB0⇂
¬Ac1 ∧ ¬Bc1 → AB1⇂

The comparison logic for Ci and Co comes next. To reduce the overall gate area, a pass transistor XOR is used to determine whether Ci and Co are different. Because this XOR will be used in the QDI handshake, its output must remain high as Ci is transitioning between values through its neutral state, (1,1). This means that the usual pass transistor XOR is not sufficient. However, Co remains stable through the QDI handshake, and both Ci and Co are one-hot encodings.

@Cid1 ∧ ¬Cod1 ∨ @Cid0 ∧ ¬Cod0 → Dd1↾
¬@Cid0 ∧ Cod1 ∨ ¬@Cid1 ∧ Cod0 → Dd1⇂
Dd1 → Dd0⇂
¬Dd1 → Dd0↾

With the above setup, the main cycle can now be implemented, starting with the forward drivers. Luckily, it can be drastically simplified by a few key observations.

First, regarding the acknowledgement signals Ae and Be: if AB is not a cap, then a non-cap token is output on S, A is acknowledged if it is not a cap, and B is acknowledged if it is not a cap. However, if AB is a cap token, then there are two conditions. The overflow condition, when Co ≠ Ci, also outputs a non-cap token on the output. It acknowledges neither A nor B, and luckily both A and B are cap tokens. So the acknowledgement is automatically implemented by the same logic that handles the case in which AB is not a cap. The final case, in which Co = Ci, outputs a cap token on S and so must be handled as a separate set of logic anyway.

Second, on an overflow condition, both A and B are cap tokens but Co ≠ Ci. This means that the inputs are not acknowledged, the next operation does not pass through the delay lines, and the bundled-data timing assumption breaks. However, because cap tokens must be all ones or all zeros, if Co is not equal to Ci, then the data on A and B must be equal. If they were not, then the resulting addition would be all ones, and the value on Ci would be faithfully propagated to Co, making them equal. If A and B are all zeros, then Co is guaranteed to be 0, meaning Ci must be 1. In this case, only the least significant bit of the datapath changes. If A and B are all ones, then Co is guaranteed to be 1 and Ci must be 0. In this case, no bits are changed in the datapath. This means that the max delay required by the datapath in this case is constant at one bit in the carry chain, which is far less than the natural cycle time of the control process. This makes the forward driver and acknowledgement logic extremely simple.

Se ∧ (ABd0 ∨ ABd1 ∧ Dd1) → Sd0↾
Se ∧ ABd1 ∧ Dd0 → Sd1↾
Sd0 ∧ Ax0 ∨ Sd1 → Ae⇂
Sd0 ∧ Bx0 ∨ Sd1 → Be⇂

Third, if Co ≠ Ci, then setting Ci = Co will not cause any transition on Co. The only time Co is dependent upon the value of Ci is when all of the bits in the adder propagate the carry. However, in that case, Co is guaranteed to be equal to Ci. This means that the value of Ci can be both an input to the datapath and set by an output from the datapath without any extra control circuitry.

¬Cid0 ∨ ¬Se ∧ ¬_Sd0 ∧ ¬Cod0 → Cid1↾
¬Cid1 ∨ ¬Se ∧ (¬_Sd1 ∨ ¬_Sd0 ∧ ¬Cod1) → Cid0↾
Cid0 ∧ (Se ∨ _Sd0 ∨ Cod0) → Cid1⇂
Cid1 ∧ (Se ∨ _Sd1 ∧ (_Sd0 ∨ Cod1)) → Cid0⇂

Fourth, in the reset phase, the next Ci must have the correct value before resetting the forward drivers and the acknowledgement. Luckily, the overflow case does not acknowledge ABd1, so resetting Sd0 only has to make sure ABd0 is acknowledged and Ci = Co as evaluated by D.

¬Se ∧ ¬ABd0 ∧ ¬Dd1 → Sd0⇂
¬Se ∧ ¬ABd1 ∧ ¬Cid1 → Sd1⇂
(¬Sd0 ∨ ¬Ax0) ∧ ¬Sd1 → Ae↾
(¬Sd0 ∨ ¬Bx0) ∧ ¬Sd1 → Be↾

Fig. 83: Transistor diagram of LSB adder control circuitry.

For the datapath shown in Fig. 82, the input data for A and B are latched using Ae and Be respectively. The latched data and the Ci are fed into a Manchester Carry Chain which drives the output data, Sd, and the carry-out, Co.
7.2 Subtraction

Adaptive digit-serial subtraction can be implemented by simply inverting the second of the two inputs just after the latches, as in Fig. 84, and resetting the carry-in to one instead of zero. In a CGRA, this should be configured synchronously during initialization.

(¬Cid0 ∨ ¬Se ∧ (¬_Sd1 ∧ ¬_cfg ∨ ¬Cod0 ∧ ¬_Sd0)) → Cid1↾
(¬Cid1 ∨ ¬Se ∧ (¬_Sd1 ∧ ¬cfg ∨ ¬Cod1 ∧ ¬_Sd0)) → Cid0↾
Cid0 ∧ (Se ∨ (_Sd1 ∨ _cfg) ∧ (_Sd0 ∨ Cod0)) → Cid1⇂
Cid1 ∧ (Se ∨ (_Sd1 ∨ cfg) ∧ (_Sd0 ∨ Cod1)) → Cid0⇂
¬Se ∧ ¬ABd1 ∧ (¬Cid1 ∧ ¬cfg ∨ ¬Cid0 ∧ ¬_cfg) → Sd1⇂

Fig. 84: The architecture of the Adaptive Adder/Subtractor.

7.3 Evaluation

Aside from the Integrated Adaptive adder developed in this thesis, four other serial adders were developed for comparison: a clocked non-adaptive digit-serial adder, a clocked adaptive digit-serial adder synthesized by Synopsys Design Compiler, a BD adaptive serial adder, and a QDI adaptive serial adder. Furthermore, a set of parallel adders were built for comparison, including clocked Kogge-Stone [101], Han-Carlson [103], and Brent-Kung [102] carry-lookahead adders, a clocked Manchester-Carry adder [104], a clocked Ripple-Carry adder, and a QDI ripple-carry adder [68]. Table 17 shows the measured performance for the parallel adders.

Table 18 shows the raw per-token measurements for each condition of the digit-serial adders. In the "ab" condition, both inputs are non-cap tokens and are therefore both acknowledged. In the "a" condition, A is a non-cap token and B is a cap token, so only A is acknowledged. Similarly, in the "b" condition, only A is a cap token, so only B is acknowledged. Finally, in the "cap" condition, both inputs are cap tokens and are therefore both acknowledged.

The utilization of these behavioral conditions is directly determined by the joint bitwidth distribution of the two inputs as shown in Fig. 85, measured from the SPEC2006 benchmark. The center plot shows the joint probability distribution while each histogram shows the associated individual probability and cumulative distributions for that axis. However, this plot includes some operations that should not be handled by a digit-serial architecture. Specifically, there are significant spikes around 47 and 48 bits as discussed in Chapter 2 Section 4. These spikes represent memory address computations with a 48-bit wide memory bus. These operations have predictable bitwidth and should be handled by their own bit-parallel datapath. Therefore, ignoring those operations, the bitwidth of the input operand A averages around 11.01 bits while B averages around 7.63 bits.

Type                                   Transistors  Frequency  Energy/Op
Clocked Parallel Kogge-Stone (64-bit)  7846         4.72 GHz   0.955 pJ
Clocked Parallel Han-Carlson (64-bit)  6552         3.88 GHz   0.846 pJ
Clocked Parallel Brent-Kung (64-bit)   5832         1.87 GHz   0.799 pJ
Clocked Parallel Ripple (64-bit)       4830         0.39 GHz   0.736 pJ
Clocked Parallel Manchester (64-bit)   4958         0.94 GHz   0.865 pJ
QDI Parallel Ripple (64-bit)           8196         2.88 GHz   2.572 pJ

Table 17. Performance measurements for the bit-parallel addition operators.
Type                        Transistors  Condition  Token Frequency  Energy/Token  Latency
Integrated Serial Adaptive  302          ab         2.16 GHz         71.70 fJ      172 ps
(4-bit)                                  a          2.21 GHz         47.99 fJ      171 ps
                                         b          2.22 GHz         47.32 fJ      170 ps
                                         cap        2.20 GHz         52.53 fJ      167 ps
QDI Serial Adaptive         423          ab         2.09 GHz         117.46 fJ     140 ps
(1-bit)                                  a          2.11 GHz         106.49 fJ     129 ps
                                         b          2.11 GHz         106.42 fJ     127 ps
                                         cap        2.10 GHz         109.23 fJ     120 ps
BD Serial Adaptive          344          ab         2.07 GHz         110.87 fJ     222 ps
(4-bit)                                  a          1.99 GHz         65.82 fJ      227 ps
                                         b          2.08 GHz         65.28 fJ      218 ps
                                         cap        2.13 GHz         87.01 fJ      241 ps
BD Serial Adaptive          616          ab         1 GHz            247.56 fJ     305 ps
Synopsys (4-bit)                         a          1 GHz            221.35 fJ     305 ps
                                         b          1 GHz            216.87 fJ     305 ps
                                         cap        1 GHz            242.60 fJ     305 ps
Clocked Serial (4-bit)      288          -          3.65 GHz         66.26 fJ      -

Table 18. Raw performance measurements for the digit-serial addition operators.

Fig. 85: Joint probability distribution for the two input bitwidths.

The utilization is computed directly from this distribution as follows. max_bitwidth(A) and max_bitwidth(B) represent the maximum bitwidths of the input operands A and B. The probability at each coordinate in Fig. 85 is sampled with P(bitwidth(A) == a and bitwidth(B) == b), ignoring cases in which A or B are 47 or 48 bits wide. Then, the average number of cycles per stream is computed for each condition. "ab" is equal to the number of non-cap tokens before sign-extension is required. Then, if one stream is longer than the other, the difference is counted in the "a" and "b" conditions. Finally, every digit stream has exactly one cap token.

u = {'ab': 0, 'a': 0, 'b': 0, 'cap': 0}
for a in range(1, max_bitwidth(A)+1):
    for b in range(1, max_bitwidth(B)+1):
        # ignore memory address operations
        if a in [47,48] or b in [47,48]:
            tmpa = 44 if a in [47,48] else a
            tmpb = 44 if b in [47,48] else b
            p = P(bitwidth(A) == tmpa and bitwidth(B) == tmpb)
        else:
            p = P(bitwidth(A) == a and bitwidth(B) == b)
        atok = int((a+packet-2)/packet)
        btok = int((b+packet-2)/packet)
        u['ab'] += min(atok,btok)*p
        u['a'] += max(0, atok-btok)*p
        u['b'] += max(0, btok-atok)*p
        u['cap'] += p

Overall, the computed results in Table 19 show that the average digit stream is relatively short, at around 4.4 tokens, with the majority of the time spent in the "ab" condition. Furthermore, A tends to have longer digit streams, which could reasonably be explained by compiler behavior and human preference: the main result being operated on tends to be scheduled to the first input while the modifier tends to be scheduled to the second. These utilization values were computed for both 1- and 4-bit digit sizes.

Now, the average performance of the circuits can be computed using the raw performance data and the utilization data, as sketched below. Fig. 86 shows the average addition throughput per transistor versus the energy per add of each adder. For a 4-bit datapath, the integrated adaptive serial adder requires 302 transistors. This is competitive with the 288 transistors necessary for the clocked non-adaptive serial adder because the integrated adaptive adder is only half-buffered, using latches instead of flip-flops. While the integrated adaptive design operates at 60% of the frequency using 8% more energy per token, this overhead allows the adaptive adder to skip the majority of the tokens, whereas the non-adaptive design cannot. This translates to a 2.2x increase in throughput, from 228 MHz to an average of 494 MHz, and a 75% decrease in energy, from 1060 fJ to an average of 263 fJ.
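As a back-of-the-envelope check, the sketch below reproduces those averages from the 4-bit rows of Table 18 and Table 19. The 16-token count assumed for the clocked non-adaptive serial adder (64 bits at 4 bits per digit) is an inference from the operand width, not a number stated in the table.

    # Back-of-the-envelope check of the averages quoted above, using the 4-bit
    # rows of Table 18 and Table 19.
    util   = {'ab': 1.951, 'a': 1.152, 'b': 0.329, 'cap': 1.000}   # Table 19
    freq   = {'ab': 2.16,  'a': 2.21,  'b': 2.22,  'cap': 2.20}    # GHz, Table 18
    energy = {'ab': 71.70, 'a': 47.99, 'b': 47.32, 'cap': 52.53}   # fJ, Table 18

    time_ns = sum(util[c] / freq[c] for c in util)
    print(1000.0 / time_ns)                        # ~494 MHz average throughput
    print(sum(util[c] * energy[c] for c in util))  # ~263 fJ per operation

    # the non-adaptive serial adder always processes 64/4 = 16 tokens
    print(1000.0 * 3.65 / 16)                      # ~228 MHz
    print(16 * 66.26)                              # ~1060 fJ per operation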
The most competitive 64-bit parallel adder, the Han-Carlson, has 6552 transistors. Its operation throughput is 7.85 times the integrated adaptive adder's, at 3.876 GHz. However, given the same transistor count, multiple instances of the integrated adaptive adder would have an average aggregate throughput of 11 GHz, using 69% less energy per operation.

The synchronous adaptive adder synthesized by Synopsys uses twice as many transistors, at 616, and has 54% lower operation throughput, at 226 MHz. Furthermore, it uses 4 times the energy per operation. This difference is likely because the design is synthesized using a standard cell library while the rest are full custom. Adaptivity requires stateful control-flow, either in the form of a val-rdy interface or some asynchronous channel protocol. The devices best geared to implement stateful control flow are asymmetric C-elements. These do not really exist in any standard-cell library because of the sheer number of possible cells. For this reason, good self-timed circuits often use custom layout for these cells in each design. Synthesizers do not have this option, though. Instead, they cobble together stateful control from latches, flip-flops, and combinational logic, which is ultimately a poor fit. In the datapath, the integrated adaptive adder uses latches on the data while the synthesizer used flip-flops, dramatically increasing the transistor count. Furthermore, the integrated adaptive adder uses a 4-bit Manchester Carry Chain while the synthesized implementation uses full-adder cells from the standard-cell library, ultimately implementing a normal Ripple-Carry Adder. This means that while the integrated adaptive adder can operate at 2.16 GHz, the synthesized adder is limited to 1 GHz.

Condition     Average Cycles/Stream
              1-bit     4-bit
ab            6.377     1.951
a             4.631     1.152
b             1.250     0.329
cap           1.000     1.000
Total Cycles  13.258    4.432

Table 19. Utilization of each condition for the addition operator.

Fig. 86: Performance and energy averaged over the distribution in Fig. 85 vs Transistor Count.

All of these differences are fairly typical of synthesized versus full-custom designs, and the synthesis could be tuned to produce a better result. In recognition of this, a full-custom latched synchronous adaptive adder is also compared. In the end, it required a val-rdy interface because the input streams are variable-length. The circuitry required to implement a val-rdy interface is ultimately near-identical to the circuitry required to implement a bundled-data interface; the only difference is that for the val-rdy interface, the control signals are clocked instead of delayed with a delay line. So, this design ended up being the BD Adaptive adder. Ultimately, the architecture is very similar to the integrated design. At 344 transistors, it burns only 1.5 times the energy per operation with only 6% lower operation throughput.

The only adaptive self-timed adder in the literature is from Bitsnap [232]. This adder was not directly compared because the implementation of its adaptivity was not self-contained. The design of the adder is ultimately a single bit from the QDI bit-parallel ripple-carry adder with its carry-out fed back into the carry-in through a FIFO. The implementation of the control relied heavily upon the Bitsnap microprocessor architecture as a whole and was entirely inseparable from it. The QDI adaptive serial adder compared here is the self-contained version of this. It is ultimately more expensive than the other approaches due to acknowledgement requirements between the control and the datapath, implementing a 1-bit datapath with 423 transistors, 68% lower operation throughput, and 5.6 times as much energy per operation.
Fig. 87: Each point corresponds to the simulated energy per add averaged over multiple adds over the distribution in Fig. 85 for a given maximum bitwidth.

Fig. 88: Probability distribution for the number of redundant bits introduced per operation by the adder.

Fig. 87 shows the performance of these adders on average for a given maximum bitwidth using the bitwidth distribution from SPEC2006. At 32 bits, the Integrated Adaptive adder has the same average throughput efficiency as the Kogge-Stone adder. Above 32 bits, the average throughput efficiency of the Integrated Adaptive adder is significantly better than any other architecture's. For widths of more than 12 bits, the Integrated Adaptive adder uses significantly less energy on average.

Overall, the digit-serial addition operator can introduce some redundant tokens into the encoding of the result. Specifically, if the result of adding a positive and a negative number is smaller in magnitude than the widest input, then some number of bits near the cap token of the result will be redundant. Fig. 88 shows the probability distribution of these redundant bits over all addition and subtraction operations.

CHAPTER 8
COMPARISON AND CONDITIONALS

Control operations account for 22% of all instructions executed in the SPEC2006 benchmark. Efficient execution of these operations is one of the largest determining factors of a platform's performance. Modern systems employ complex branch predictors and large caches to mitigate the effect of control operations on performance. The goal of this chapter is not to design something that takes their place, but to reduce the load on these systems to help increase their accuracy for the control instructions that matter.

Overall, 68% of the control logic comes from branches. A branch operates directly on the program counter. However, in a dataflow platform like a CGRA, there is no program counter. Instead, a single branch instruction must be broken up into a conditional move for every piece of data used or modified in the body of the conditional. Unfortunately, time constraints have impeded further measurements regarding the distribution of the number of conditional moves required to implement each branch. Overall, it is safe to say that at least some branches would require only one or two conditional moves, and supporting this kind of operation in the CGRA accelerator would allow for larger basic blocks, reducing the load on the branch predictor circuitry. Fig. 31 shows that comparison operators account for 39% of all arithmetic operations, though many of these may not be implemented on the CGRA depending upon the branch/conditional-move distribution. Therefore, the implementation of the compare and conditional move operations is not guided by input data distributions. Luckily, these operations are relatively simple and efficient.

8.1 Compare to Zero

The comparison algorithm is fairly straightforward. Ultimately, this circuit must be able to resolve any one of six possible comparison operations: <, >, =, <=, >=, or ≠. Unfortunately, it is not possible to determine whether the input is greater than zero or less than zero until the sign has been received from the cap token at the very end of the stream. Furthermore, it is not possible to determine that the stream is equal to zero until all of the tokens have been received.
Therefore, the naive approach for encoding those six cases into output requests would be to assign an output request to each of <, =, and >. The other three cases can then be computed with an OR; for example, ≠ is < OR >. The comparison unit would keep track of whether all of the received tokens were zero, and then resolve the relation upon receiving the cap token.

v := 0;
∗[ L?l;
   [ lc = 0 → [ ld ≠ 0 → v := 1 ▯ else → skip ]
   ▯ lc = 1 → [ ld ≠ 0 → R!"<"
              ▯ v = 0 ∧ ld = 0 → R!"="
              ▯ v = 1 ∧ ld = 0 → R!">"
              ]
   ]]

However, it is ultimately possible to determine that the input is not equal to zero before receiving the cap token. In fact, the ≠ condition can generally be resolved by the very first token of the digit-stream. This early out would affect 73% of all arithmetic comparison operation executions.

Unfortunately, tackling this early-out opportunity yields quite an unusual architecture. The naive assignment for the output requests is insufficient, since it is not possible to resolve which of < or > the ≠ condition belongs to until the very end of the digit-stream. Therefore, the output requests must be inverted. This yields output requests for >=, ≠, and <= respectively. Unfortunately, while the naive approach ensures mutual exclusivity for the output requests, this approach does not. For example, < would require both <= and ≠ to be asserted simultaneously. This would force the output channel to use a 2of3 encoding on the output requests. Therefore, taking advantage of the early out for ≠ at the receiving side will be difficult, because the receiver would have to wait for a second signal before acknowledging the condition and then wait for both signals to be reset before enabling the channel again. Fortunately, it is possible to implement a non-standard channel protocol to allow for early acknowledgement and early enable. However, this means that the behavior of this circuit is no longer reasonably expressible using CHP.

(Re ∨ R1 ∨ R2) ∧ _Rv ∧ Lc1 ∧ ¬Ld0 → R0↾
(Re ∨ R0 ∨ R2) ∧ _Rv ∧ (Lc0 ∨ Lc1) ∧ z0 → R1↾
(Re ∨ R0 ∨ R1) ∧ _Rv ∧ Lc1 ∧ (_R1 ∨ Ld0) → R2↾

Starting with the forward drivers: R0 implements >=, R1 implements ≠, and R2 implements <=. R0 and R2 can only be determined using the cap token, as signalled by Lc1. Because all of the bits in the cap token have the same value, it is enough to simply check one. If the cap token is zero, signalled by ¬Ld0, then the sign is positive and L is greater than or equal to zero. This is enough to signal R0. If the cap token is not zero, signalled by Ld0, then the sign is negative and L must be less than zero. Unfortunately, this is not enough to signal R2, because it does not cover whether L is equal to zero. For that, the inverted sense of R1 can be used, since R1 will remain low if L is zero.

¬Ld0 ∧ ¬Ld1 ∧ ¬Ld2 ∧ ¬Ld3 → z1↾
Ld0 ∨ Ld1 ∨ Ld2 ∨ Ld3 → z1⇂
z1 → z0⇂
¬z1 → z0↾

The bits in the datapath are combined to generate a zero/not-zero signal z for each token. Because this process does not latch the datapath, this signal z is only stable after Lc0 or Lc1 goes high and before Le is lowered, acknowledging the channel. Then, if one of the bits is not zero, as signalled by z0, R1 is raised to signal the ≠ event.

R0 ∧ R1 ∨ R0 ∧ R2 ∨ R1 ∧ R2 → Rv↾

The forward drivers check Re. However, because Re can be lowered after any one of the output requests is raised, that check must be stabilized with the output requests themselves. Finally, an extra signal Rv ensures that the handshake remains stable.
Rv ∨ (z1 ∨ R1) ∧ Lc0 → Le⇂

Rv is only raised once the cap token has been received on L. This signals the completion of the comparison operation. Therefore, a different expression must handle the input acknowledgement on Le for the rest of the tokens in the digit stream. If z0 is high, then R1 must be raised, so Le waits for either R1 or z1. Finally, Le acknowledges the non-cap input request on Lc0 directly.

(¬Re ∨ ¬R1 ∧ ¬R2) ∧ ¬_Rv ∧ ¬Lc1 → R0⇂
(¬Re ∨ ¬R0 ∧ ¬R2) ∧ ¬_Rv ∧ ¬Lc1 → R1⇂
(¬Re ∨ ¬R0 ∧ ¬R1) ∧ ¬_Rv ∧ ¬Lc1 → R2⇂
¬R0 ∧ ¬R1 ∧ ¬R2 → Rv⇂
¬Rv ∧ (¬z1 ∧ ¬R1 ∨ ¬Lc0) → Le↾

In the reset phase, Re can be lowered at any time after the first output request is sent. However, because of the check on ¬_Rv in the reset phase of the forward drivers, none of the forward drivers will reset until two of the three output requests have been raised, completing the delay-insensitive encoding. Once that happens, the output requests are reset. However, as soon as any of them resets, the output enable Re can be raised. This means that the handshake has to be stabilized once again by checking the value of the output requests. Once all of the output requests have been reset, Rv is lowered and the input channel is enabled on Le, following the standard WCHB handshake.

Now, implementing the conversion to the six comparison conditions is similar to the naive approach. However, instead of combining them with OR, they are combined with AND. For example, the following implements < on C. C0 is connected directly to R0, because R0 goes high when L is greater than or equal to zero. Meanwhile, C1 is generated by combining R2, which is raised when L is less than or equal to zero, and R1, which is raised when L is not equal to zero.

C0 = R0
R2 ∧ R1 → C1↾
¬R2 ∨ ¬R1 → C1⇂

The non-standard handshake implemented by the comparison unit allows this conversion to be implemented directly with combinational logic instead of a collection of C-elements.

8.2 Conditional Sink

The conditional sink is not much more complex than a WCHB buffer. The input condition C determines whether or not a given digit-stream from L is forwarded through R. If C is 1, then the digit stream is forwarded, sending each token received from L on R. Otherwise, those tokens are discarded. When the cap token is finally received, the input condition C is also acknowledged, signalling the completion of the operation.

∗[L?l;
  [ C? = 1 → R!l ▯ else → skip ];
  [ lc = 1 → C? ▯ else → skip ]
 ]

This is implemented directly with the standard WCHB reshuffling, similar to a token split. C1 is used to condition the forward drivers controlling R. Meanwhile, one of the forward drivers for the skip can be bypassed directly to the input acknowledgement on Le. When the cap token is received, the condition is acknowledged on Ce, and everything resets.

Re ∧ C1 ∧ Lc0 → Rc0↾
Re ∧ C1 ∧ Lc1 → Rc1↾
C0 ∧ Lc1 → S1↾
Rc0 ∨ Rc1 ∨ C0 ∧ Lc0 ∨ S1 → Le⇂
Rc1 ∨ S1 → Ce⇂
¬Re ∧ ¬Lc0 → Rc0⇂
¬Re ∧ ¬C1 ∧ ¬Lc1 → Rc1⇂
¬C0 ∧ ¬Lc1 → S1⇂
¬Rc0 ∧ ¬Rc1 ∧ (¬C0 ∨ ¬Lc0) ∧ ¬S1 → Le↾
¬Rc1 ∧ ¬S1 → Ce↾
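For reference, both specifications above can be mirrored at the behavioral level with small Python sketches, assuming the same list-of-(data, cap)-token streams as before. These illustrate the token-level behavior only, not the handshakes.

    # Behavioral sketches of the naive compare-to-zero and the conditional sink.
    def compare_to_zero(L):
        v = 0
        for ld, lc in L:
            if lc == 0:
                if ld != 0:
                    v = 1                       # a nonzero token has been seen
            else:
                if ld != 0:
                    return '<'                  # nonzero cap token: negative sign
                return '=' if v == 0 else '>'   # zero cap token

    def conditional_sink(c, L):
        R = []
        for token in L:
            if c == 1:
                R.append(token)   # forward the token only when the condition is 1
        return R                  # C is acknowledged once the cap token arrives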
8.3 Evaluation

Unfortunately, it is difficult to evaluate these operations relative to typical synchronous architectures, because it is not really possible to separate their implementations from the architectures themselves. Comparison operations are built into other operations in the execute stage and often do not have their own pipeline stage. Meanwhile, there is nothing comparable to the conditional sink in a synchronous architecture. Overall, these circuits would really need to be evaluated architecture-wide, which is simply not possible within the scope of this thesis.

8.3.1 Comparison

In synchronous systems, the comparison operation is generally grafted onto the addition operation by examining the overflow flag. Effectively, the comparison operation generally occupies less than a full pipeline stage. This means that it should be difficult for any self-timed digit-serial approach to compete with the performance of the bit-parallel approach. Furthermore, because the comparison is less than a pipeline stage, it is difficult to determine how much of the power from that pipeline stage and from that part of the clock tree should be attributed to this particular operation. Therefore, four bit-parallel compare-to-zero architectures are compared both with and without the pipeline overhead. The real performance is ultimately somewhere in between.

Table 20 shows the performance for a set of possible synchronous bit-parallel comparison architectures. The Precharge architecture uses a distributed OR across all 64 bits to determine whether the input is equal to zero. This OR avoids a long transistor stack by precharging the node high in the first half of the clock cycle and resolving the OR in the second half. The Manchester architecture simulates what the performance might be if the comparison operation were fused with a 64-bit Manchester Carry adder. The Tree architecture simply uses a 3-stage gate tree. Finally, the Synopsys architecture was automatically synthesized by Synopsys Design Compiler from a Verilog specification. Ultimately, many of these architectures can operate well beyond the clock speed found in most synchronous systems, which is why they are generally allocated less than one pipeline stage. Therefore, it is not particularly easy to compare these results with the approach presented in this chapter.

There are four basic behaviors. During the "zero" condition, the input channel is sourcing zeros and the ≠ flag has not been resolved yet. During the "early-out" condition, all previous tokens have been zero and the current token is not. This is the condition during which the ≠ flag is sent early. Then, during the "internal" condition, the input channel is sourcing non-cap tokens and the ≠ flag has already been sent. Finally, in the "cap" condition, the cap token has arrived on the input channel and the final flags are sent on the output. The "zero" and "internal" conditions have identical performance metrics.

Type                                  Transistors  Token Frequency  Energy/Token
Gate Parallel Precharge (64-bit)      79           5.21 GHz         52.26 fJ
Gate Parallel Manchester (64-bit)     220          0.90 GHz         6.05 fJ
Gate Parallel Tree (64-bit)           174          5.68 GHz         6.07 fJ
Gate Parallel Synopsys (64-bit)       470          4.20 GHz         61.65 fJ
Clocked Parallel Precharge (64-bit)   1871         5.10 GHz         260.15 fJ
Clocked Parallel Manchester (64-bit)  2012         0.87 GHz         269.15 fJ
Clocked Parallel Tree (64-bit)        1966         5.56 GHz         253.80 fJ
Clocked Parallel Synopsys (64-bit)    2770         5.19 GHz         576.76 fJ

Table 20. Raw performance measurements for the bit-parallel comparison operators.

Type                        Transistors  Condition      Token Frequency  Energy/Token  Latency
Integrated Serial Adaptive  107          zero/internal  3.86 GHz         9.78 fJ       -
                                         early-out      3.08 GHz         15.01 fJ      149 ps
                                         cap            1.80 GHz         34.21 fJ      140 ps

Table 21. Raw performance measurements for the digit-serial comparison operator.

Interestingly, the overhead introduced by the integrated adaptive digit-serial approach is fairly limited.
Because there are no latches, the energy cost of each token is significantly less than that of most other operators. Meanwhile, the frequency of the comparison operator sits well above all of the other operators, meaning it will not be the bottleneck in any scenario.

To analyze the performance of this circuit, two things are required. The first is the bitwidth of the input digit stream, determining how many internal conditions execute before the cap token. The second is the alignment, or the number of zero bits before the least significant non-zero bit in the input. This determines the latency of the ≠ flag and the average number of early-out conditions executed. The joint distribution of these two features is shown in Fig. 89. One should note that this distribution includes any redundant tokens introduced by the add that generally precedes the comparison operation during the execution of a branch.

Fig. 89: Probability distribution for the input bitwidth.

The utilization is computed from this distribution as follows. max_bitwidth(A) and max_alignment(A) represent the maximum bitwidth and alignment respectively of the input operand A. The probability at each coordinate in Fig. 89 is sampled with P(bitwidth(A) == b and alignment(A) == a), and then the average number of cycles per stream is computed for each condition. With a token or packet size of 4, the zero bits that represent the alignment are tokenized to determine the number of tokens in the "zero" condition. If A is zero, then the alignment a will be equal to the bitwidth b, and all of the non-cap tokens will execute the "zero" condition. Then, if A is not zero, the comparison unit will execute an "early-out" condition, followed by some number of "internal" conditions determined by the number of tokens left in the digit stream. Finally, the "cap" condition is executed, sending out the remaining comparison results.

u = {'zero': 0, 'early-out': 0, 'internal': 0, 'cap': 0}
for b in range(1, max_bitwidth(A)+1):
    for a in range(0, max_alignment(A)):
        p = P(bitwidth(A) == b and alignment(A) == a)
        btok = int((b+packet-2)/packet)
        atok = int(a/packet)
        if atok < btok:
            u['zero'] += atok*p
            u['early-out'] += p
            u['internal'] += (btok-atok-1)*p
        else:
            u['zero'] += btok*p
        u['cap'] += p

For a max bitwidth of 64 bits, the average length of the digit stream on A is around 3.4 tokens. Furthermore, there are only around 1.1 tokens before the early-out condition is executed. This means that the ≠ flag has an average latency of 444 ps from the arrival of the first token on the input channel A, while the other flags have an average latency of 791 ps.

On average, when the overhead of the bit-parallel pipeline is ignored, the adaptive digit-serial comparison operator is competitive with the bit-parallel comparator generated by Synopsys Design Compiler, with only 10% lower throughput per transistor but 4% less energy per operation. Furthermore, while the throughput of the digit-serial operator is only 12% of the max throughput of the fastest bit-parallel comparator, the real throughput of the bit-parallel comparator will be determined by the system as a whole. Meanwhile, while the digit-serial operator uses 9.8 times as much energy as the most energy-efficient comparator, the absolute difference is 53 fJ per operation, which is thoroughly accounted for by the other arithmetic operators. When the overhead of the bit-parallel pipeline is accounted for, there is no longer any competition.
The digit-serial comparator has 2.84 times the throughput per transistor of the fastest bit-parallel comparator and uses 77% less energy than the most energy-efficient bit-parallel comparator. Overall, the digit-serial comparator is more energy efficient than all of the bit-parallel operators at any bitwidth. However, its throughput per transistor is lower until around 32 bits.

Condition     Average Cycles/Stream
zero          0.148
early-out     0.765
internal      1.242
cap           1.000
Total Cycles  3.155

Table 22. Utilization of each condition for the comparison operator.

Fig. 90: Performance and energy averaged over the distribution in Fig. 89 vs Transistor Count.

Fig. 91: Each point corresponds to the simulated energy per compare averaged over multiple compare operations over the distribution in Fig. 89 for a given maximum bitwidth.

8.3.2 Conditional Sink

Unfortunately, there is no comparable bit-parallel synchronous counterpart to the conditional sink. In synchronous control-flow architectures, conditional actions are implemented by branches, which are inseparably tied to the execution pipeline, the program counter, the branch predictors, and the branch target buffer. Furthermore, it would take multiple conditional sinks to implement all of the conditional routing implied by a single branch instruction. Meanwhile, synchronous data-flow architectures cannot implement a conditional sink because data must always flow through a clocked pipeline. Instead, they implement conditional merges and clock-gate the not-taken branch of the dataflow split. This means that the performance of that conditional action is tied to the system-wide implementation of the clock-gating circuitry. This means that it is impossible to evaluate the conditional sink against any particular baseline.

Fortunately, Table 23 shows that the conditional sink is relatively inexpensive. There are four behavioral conditions. The "internal" condition deals with non-cap tokens when the branch is taken. The "cap" condition deals with the cap token when the branch is taken. Similarly, the "skip-internal" and "skip-cap" conditions deal with the non-cap and cap tokens when the branch is not taken. All of these conditions can operate faster than 3 GHz and use around 20 fJ per token. Because the other operators execute in the 2 GHz range, this will never be the bottleneck in the system.

Type                        Transistors  Condition      Token Frequency  Energy/Token  Latency
Integrated Serial Adaptive  105          internal       3.23 GHz         19.63 fJ      48 ps
                                         cap            3.11 GHz         22.12 fJ      51 ps
                                         skip-internal  3.82 GHz         13.30 fJ      -
                                         skip-cap       3.55 GHz         21.09 fJ      -

Table 23. Raw performance measurements for the digit-serial conditional sink operator.

CHAPTER 9
MULTIPLICATION

Multiplication has been studied extensively for decades, with hundreds of people weighing in on the subject. In 1956, Kolmogorov conjectured multiplication to be an operation of complexity O(n^2) [108]. This conjecture was quickly disproved by Anatolii Karatsuba in 1960 when he introduced a trick that brought the complexity to O(n^log2(3)) [107]. Since then, there has been iterative progress, with a lower bound complexity of n*log(n) conjectured in [112] and finally proven in [114] and [109]. Unfortunately, while these multiplication algorithms are asymptotically superior, they also have successively larger constants, making them largely inapplicable to computer arithmetic circuitry.
Instead, research in multiplier circuitry for general compute has focused heavily on structures with O(n^2) complexity, while the asymptotically more efficient algorithms have been relegated to cryptography applications where larger multiplications are required [115].

Year  Algorithm                 Complexity
1960  Karatsuba [107]           O(n^log2(3))
1963  Toom [110]                O(n^log3(5))
1966  Toom-Cook [111]           O(n^logk(2k+1))
1971  Schönhage-Strassen [112]  O(n*log(n)*log(log(n)))
2007  Fürer [113]               O(n*log(n)*16^log*(n))
2019  Harvey-Hoeven [114]       O(n*log(n))

Table 24. Short history of multiplication algorithms and their complexity.

Today, bit-parallel multipliers fall into one of just four underlying architectures. Array multipliers directly implement the long multiplication algorithm, using n-bit adders to combine each successive row of partial products with the sum from the previous row [117]. Carry-save array multipliers use carry-save adders to sequentially sum the rows [124]. Tree multipliers build on this idea by removing the sequentiality and summing each column in parallel with a tree of adders [118][119][120][123]. While these multipliers use O(n^2) one-bit full adders to multiply two numbers in O(1) cycles, iterative multipliers time-multiplex the operation, typically allocating O(n) one-bit full adders for O(n) cycles and accumulating the sum of the rows in memory.

Digit-serial or "online" multipliers were first introduced in [125]. Like iterative multipliers, they allocate O(n) one-bit full adders for O(n) cycles. However, instead of completing a whole row per cycle, digit-serial multipliers complete a whole column per cycle. This allows them to emit digits on the output as digits are received on the input. This approach can be trivially made bit-parallel as in [126][127][129]. However, doing so requires the added overhead of shift registers. All of the digit-serial multiplier implementations in the literature are ultimately hybrid parallel-serial designs [131][134][130]. While they are all structured similarly, different strategies for input and carry digit routing in the multiplier array yield slight architectural variations [132][133][135]. While these architectures tend to focus on unsigned multiplication, the partial products may always be recoded to support multiplication of signed numbers with a two's complement encoding [116][121][122]. These strategies are demonstrated concretely for digit-serial multipliers in [128].

This chapter explores neither the development of the multiplication algorithm nor the datapath circuitry. In fact, [136] already endeavored to make self-timed control for a digit-serial multiplier. Instead, the goal of this chapter is to develop QDI control to make an existing digit-serial multiplier architecture length-adaptive. Specifically, in the context of a statically configured CGRA, there are two constraints that select a particular variation of the digit-serial multiplier. First, because the CGRA is statically configured, the route from one operation to the next must also be static. Second, because the arithmetic is designed to implement arbitrary-length operations, the multiplier must also support inputs of arbitrary length. These two constraints select the underlying architecture demonstrated in Fig. 92. Specifically, the node that receives the inputs also emits the output. This keeps the route through this operator static regardless of the input digit length. It also allows successively more multiplier units to be allocated higher on the stack as more input digits arrive.
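Before deriving the serial specification, it may help to recall what an arbitrary-length operand looks like. The sketch below is an illustrative Python reconstruction of the LSD-first two's complement digit-stream encoding with a sign-repeating cap token, together with the stream-level multiply behavior that the rest of this chapter implements serially; it is not the multiplier's process network, and the helper names are assumptions.

    # Illustrative encode/decode for LSD-first digit streams whose cap token
    # repeats the sign digit, plus a stream-level multiply reference.
    def to_stream(v, M=4):
        out = []
        while v != 0 and v != -1:
            out.append((v & ((1 << M) - 1), 0))   # emit the next non-cap digit
            v >>= M                               # arithmetic shift preserves the sign
        out.append((0 if v == 0 else (1 << M) - 1, 1))   # cap token carries the sign
        return out

    def to_int(stream, M=4):
        val = 0
        for k, (d, cap) in enumerate(stream):
            if cap and d:
                val -= 1 << (k * M)   # all-ones cap: weight of the infinite sign extension
            else:
                val += d << (k * M)
        return val

    def mul_streams(A, B, M=4):
        return to_stream(to_int(A, M) * to_int(B, M), M)   # Y!(a*b) at stream level

    assert to_int(mul_streams(to_stream(-3), to_stream(7))) == -21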
The architecture rendered in Fig. 92 shows the bit-serial variant as found in [131], while this chapter will develop the digit-serial variant. Simply put, in the digit-serial variant, the AND gates become digit multiplies and the one-bit adders become a multi-bit summation tree.

9.1 Behavioral Specification

Each node in the synchronous version of the architecture presented in Fig. 93 implements the following behavioral specification, assuming a digit size of M bits and a word size of N digits. Take note that B is loaded as determined by the count stored in c. The upper bits of the result are stored in x for the next digit and the lower bits are forwarded through Yo. Finally, data from Yi is delayed by one cycle through y, which is initialized to 0.

Fig. 92: The underlying multiplier architecture used in this chapter.

x := 0; y := 0; c := 0;
∗[[ c > 0 → s := A*B + y + x; A?, Yi?y; Yo!s[0:M], x := s[M:2M]; c := c-1
  ▯ c = 0 → B?; y := 0, x := 0; c := N
 ]]

Converting this to a length-adaptive specification requires the introduction of control data alongside the datapath storing whether this digit is a cap token. Doing so removes the need for c and N. Unfortunately, it also greatly increases the complexity of the specification. So, this specification will be derived from the original multiply algorithm to ensure correctness.

This derivation begins quite tautologically. The multiplier will receive values from its input channels A and B, multiply those values, and send the result on the output channel Y.

∗[ A?a, B?b; Y!(a*b) ]

The first step in the derivation is to convert A and Y to length-adaptive digit-serial channels Ai and Yo. Each digit that arrives on Ai is multiplied with the whole of B, and the result is accumulated in s. Every digit received on Ai allows the least significant digit of s to be shifted to the output channel Yo. Once the last digit on Ai has been received, that digit must be held in place until the end of the multiply. This sign-extends Ai for the rest of the operation, during which the remaining data in s is shifted out to Yo. Once that has completed, the last digit on Ai is acknowledged with B, and s is reset to zero. Keep in mind that this algorithm only implements unsigned multiplication. A Booth encoding will be applied to the datapath after synthesis to implement signed multiplication.

s:=0;
∗[ ∗[ Aic=0 → s:=Aid*B + s[M:N*M]; Ai?; Yo!(0,s[0:M]) ];
   i:=N;
   ∗[ i>0 → i:=i-1; s:=Aid*B[0:i*M] + s[M:(i+1)*M]; Yo!(i=0,s[0:M]) ];
   Ai?; B?; s:=0;
 ]

Next, the algorithm needs to be broken into simpler segments to facilitate circuit implementation. To do this, B should be split, yielding two new channels: Bi, carrying the least significant digit of B, and Bn, carrying the remaining digits. Similarly, s is split into s0, handling the computation for the least significant digit of B, and sn, handling the computation for the remaining digits.

s0:=0, sn:=0;
∗[ ∗[ Aic=0 → s0:=Aid*Bi + sn[0:M] + s0[M:2M]; sn:=Aid*Bn + sn[M:(N-1)*M]; Ai?, Yo!(0,s0[0:M]) ];
   i:=N-1;
   ∗[ i>0 → i:=i-1; s0:=Aid*Bi + sn[0:M] + s0[M:2M]; sn:=Aid*Bn[0:i*M] + sn[M:(i+1)*M]; Yo!(0,s0[0:M]) ];
   s0:=Aid*Bi + sn[0:M] + s0[M:2M]; Yo!(1,s0[0:M]);
   Ai?; Bn?; Bi?; s0:=0, sn:=0
 ]

Unfortunately, this does not implement length-adaptivity for B. So, the usual control bit Bic must be added to Bi to condition the rest of the computation on Bn. This moves the data on Bi to Bid.
s0:=0, sn:=0;
∗[ ∗[ Aic=0 → s0:=Aid*Bid + sn[0:M] + s0[M:2M];
      [ Bic=0 → sn:=Aid*Bn + sn[M:(N-1)*M] ▯ else → skip ];
      Ai?; Yo!(0,s0[0:M]) ];
   [ Bic=0 → i:=N-1;
      ∗[ i>0 → i:=i-1; s0:=Aid*Bid + sn[0:M] + s0[M:2M]; sn:=Aid*Bn[0:i*M] + sn[M:(i+1)*M]; Yo!(0,s0[0:M]) ]
   ▯ else → skip ];
   s0:=Aid*Bid + sn[0:M] + s0[M:2M]; Yo!(1,s0[0:M]);
   Ai?; [ Bic=0 → Bn? ▯ else → skip ]; Bi?;
   s0:=0, sn:=0;
 ]

In the next few steps, a process handling the least significant digit of B will be split from this one using projection. During that transformation, Ai, Yo, Bi, and s0 will be placed in one process while Bn and sn will be placed in the other. Unfortunately, the value assignment of s0 directly uses sn in the current specification. Therefore, in order to split this process using projection, a few extra channels, Ao and Yi, must be added to split up any interactions between these two internal variables. Ao forwards the input digit on Ai from the process handling the least significant digit of B to the process handling the remaining digits. Meanwhile, Yi returns the result of the digit multiply in the opposite direction.

s0:=0, sn:=0, yd:=0;
∗[ ∗[ Aic=0 → s0:=Aid*Bid + yd + s0[M:2M];
      [ Bic=0 → Ao!Ai; sn:=Aod*Bn + sn[M:(N-1)*M]; Ao?; Yi!(0,sn[0:M]), Yi?y ▯ else → skip ];
      Ai?; Yo!(0,s0[0:M]) ];
   [ Bic=0 → i:=N-1;
      ∗[ i>0 → i:=i-1; s0:=Aid*Bid + yd + s0[M:2M]; Ao!Ai; sn:=Aod*Bn[0:i*M] + sn[M:(i+1)*M]; Yi!(i=0,sn[0:M]), Yi?y; Yo!(0,s0[0:M]) ];
      Ao?
   ▯ else → skip ];
   s0:=Aid*Bid + yd + s0[M:2M]; Yo!(1,s0[0:M]);
   Ai?; [ Bic=0 → Bn? ▯ else → skip ]; Bi?;
   s0:=0, sn:=0, yd:=0;
 ]

Take note that Ao has been added such that it forwards the cap token from Ai multiple times. This will yield different process specifications for the least significant digit and every other digit. While only the least significant digit will be derived here, both specifications will be used in the multiplier. To avoid this, one could move Ao!Ai from inside the second loop to just before the second loop. However, this would yield a less efficient circuit.

With those channels added, the projection transformation can proceed. This yields two processes: the first handling the least significant digit of a length-adaptive digit-serial multiply, and the second which, aside from the extra receives on Ao in the second loop, is identical to the original digit-serial specification. This means that these transformations can be repeated, recursively pulling the next least significant digit from the specification to build the full digit-serial multiply.

s0:=0, yd:=0;
∗[ ∗[ Aic=0 → s0:=Aid*Bid + yd + s0[M:2M];
      [ Bic=0 → Ao!Ai; Yi?y ▯ else → skip ];
      Ai?; Yo!(0,s0[0:M]) ];
   [ Bic=0 → ∗[ yc=0 → s0:=Aid*Bid + yd + s0[M:2M]; Ao!Ai; Yi?y; Yo!(0,s0[0:M]) ]
   ▯ else → skip ];
   s0:=Aid*Bid + yd + s0[M:2M]; Yo!(1,s0[0:M]);
   Ai?; Bi?;
   s0:=0, yd:=0;
 ]∥
sn:=0;
∗[ ∗[ Aoc=0 → sn:=Aod*Bn + sn[M:(N-1)*M]; Ao?; Yi!(0,sn[0:M]) ];
   i:=N-1;
   ∗[ i>0 → i:=i-1; sn:=Aod*Bn[0:i*M] + sn[M:(i+1)*M]; Ao?; Yi!(i=0,sn[0:M]) ];
   Ao?; Bn?; sn:=0
 ]

The next goal is to flatten this specification for a multiply digit into the Dynamic Single-Assignment (DSA) format to facilitate circuit implementation. During this transformation, all of the internal loops are merged into the outer loop, and all of the conditions are combined into a single condition inside the outer loop. This will yield a set of conditions that can be implemented using standard WCHB techniques. For this process, there are two internal loops representing the two phases of the digit-serial multiply.
The first phase consumes the input digits from Ai, producing the first half of the results on the output channel Yo. The second phase drains the second half of the results from the multiplier without consuming any more digits from Ai. Aside from these two loops, there is also the final cap token and reset condition. For most of these conditions, there is enough internal state to create a unique predicate in the DSA. Unfortunately, this does not hold for the reset condition. Therefore, a state variable vo must be introduced to create unique predicates for the reset condition. The naive approach to this would require a three-valued internal memory, with 0 covering the first loop, 1 covering the second loop, and 2 covering the reset condition. However, because the process handling the next most significant digit is inactive during the reset condition of this process, its internal memory will remain stable. This means that vo can be reduced to a two-valued memory by taking advantage of the shared variable vi from the next most significant digit to disambiguate what would be the vo=1 case in the naive approach.

As a side note, not all of the bits in s0 need to be stored from cycle to cycle. Ultimately, only the upper half of s0 actually needs to be stored. Therefore, a new variable x will be introduced to act as this storage.

vo:=0, x:=0, yd:=0;
∗[ ∗[ Aic=0 → s0:=Aid*Bid + yd + x;
      [ Bic=0 → Ao!Ai; Yi?y ▯ else → skip ];
      Ai?; vo:=0; x:=s0M:2M, Yo!(0,s00:M) ];
   [ Bic=0 → ∗[ yc=0 → s0:=Aid*Bid + yd + x; Ao!Ai; Yi?y; vo:=0; x:=s0M:2M, Yo!(0,s00:M) ] ▯ else → skip ];
   s0:=Aid*Bid + yd + x; vo:=1; Yo!(1,s00:M);
   Ai?; Bi?;
   x:=0, yd:=0;
]

The first loop introduces the first two conditions in the DSA specification. Both are predicated by Aic=0, while the value of Bic selects between the two. The first condition, Aic=0 ∧ Bic=0, traces through just the first loop. First, it executes the datapath computation s0:=Aid*Bid + yd + x. Then, because Bic=0, it executes Ao!Ai; Yi?y from the selection statement. Finally, it finishes the loop by executing Ai?; vo:=0; x:=s0M:2M, Yo!(0,s00:M). All of the conditions are generated in this way, tracing out a particular section of the above program.

The second loop introduces the next condition. Because the first loop consumes all of the non-cap tokens on Ai, the second loop is predicated on Aic=1. This is combined with the explicit predicate of Bic=0 and the values of the internal memory vo and shared variable vi. vi=1 signifies two possible cases. If vo=0, then the cap token has propagated all the way up and down the multiplier and the next most significant digit has just executed its reset condition. However, if vo=1, then this is simply the first non-cap token being emitted by the multiplier after the reset condition of the last multiply operation. Then, if vi=0, this is one of many non-cap tokens emitted by the multiplier. The condition introduced by the second loop therefore selects for all of the non-cap tokens in the second phase with vi=0 ∨ vo=1.

Finally, as with the second loop, the reset condition is predicated by Aic=1. Then, if this is the last digit in the multiplier as signified by Bic=1, or the cap token has propagated through the whole multiplier and the next most significant digit has executed its reset condition as signified by vi=1 ∧ vo=0, the reset condition of this digit may proceed.
vo:=1, x:=0, yd:=0;
∗[[ Aic=0 ∧ Bic=0 → s0:=Aid*Bid + yd + x; Ao!Ai; Yi?y; Ai?; vo:=0; x:=s0M:2M, Yo!(0,s00:M)
 ▯ Aic=0 ∧ Bic=1 → s0:=Aid*Bid + yd + x; Ai?; vo:=0; x:=s0M:2M, Yo!(0,s00:M)
 ▯ Aic=1 ∧ Bic=0 ∧ (vi=0 ∨ vo=1) → s0:=Aid*Bid + yd + x; Ao!Ai; Yi?y; vo:=0; x:=s0M:2M, Yo!(0,s00:M)
 ▯ Aic=1 ∧ (Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0) → s0:=Aid*Bid + yd + x; vo:=1; Yo!(1,s00:M); Ai?; Bi?; x:=0, yd:=0;
]]

A bit of re-organization produces the final specification below. s0 is assigned the same expression in every condition and can therefore be pulled out of the conditional. Once this is done, the assignment order of the internal memories no longer matters, and those assignments can therefore be made parallel. In the third condition, vi=0 implies Bic=0, simplifying the predicate. Finally, in the fourth condition, vi=1 implies Aic=1, again simplifying the predicate.

vo:=1, x:=0, yd:=0;
∗[ s0:=Aid*Bid + yd + x;
   [ Aic=0 ∧ Bic=0 → Ao!Ai; Yi?y; Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=0 ∧ Bic=1 → Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ (vi=0 ∨ Bic=0 ∧ vo=1) → Ao!Ai; Yi?y; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0 → Ai?, Bi?; vo:=1, x:=0, yd:=0, Yo!(1,s00:M)
]]

To recap: in the first condition, the parallel data from Bi is not a cap token. This means that this multiplier unit is internal to the multiplier and must forward the input request from Ai to Ao and receive the result from Yi. Meanwhile, Ai is also not a cap token, so this condition behaves the most like the specification of the synchronous digit-serial multiplier.

In the second condition, the parallel data from Bi is a cap token. This means that this multiplier unit is the last one in the stack and should neither forward the request on Ao nor receive any data on Yi. This keeps yd equal to zero and leaves any other multiplier units beyond this one inactive.

In the third condition, the serial input on Ai is a cap token, but the parallel data on Bi is not. This means that the multiplier just needs to stream the high order digits of the multiply result to the output Yo. While these digits are being received on Yi and the digit is not a cap token, this condition will continue to execute. Importantly, this must sign-extend the input on Ai by not acknowledging it until a cap token arrives on Yi.

In the fourth condition, the serial input on Ai is a cap token, and either the parallel input on Bi is a cap token or the cap token has propagated through the multiplier and the next most significant digit has executed its reset condition. Therefore, no token is forwarded on Ao, the final digit is emitted on Yo, the parallel input on Bi is acknowledged, and all of the internal memories are reset.

If these transformations were to be applied again to pull out the next least significant digit from the overall multiplier specification, the result would differ only slightly. Specifically, the third condition would also receive on Ai, as shown below. This ensures that every send on Yo is paired with a receive on Ai and every receive on Yi with a send on Ao, allowing for the use of exchange channels in the implementation. This is a small enough difference that the two specifications can be combined and selected using a multiplexer fairly efficiently.
vo:=1, x:=0, yd:=0;
∗[ s0:=Aid*Bid + yd + x;
   [ Aic=0 ∧ Bic=0 → Ao!Ai; Yi?y; Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=0 ∧ Bic=1 → Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ (vi=0 ∨ Bic=0 ∧ vo=1) → Ao!Ai; Yi?y; Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0 → Ai?, Bi?; vo:=1, x:=0, yd:=0, Yo!(1,s00:M)
]]

9.2 Datapath

While the derived multiplier specification implements an unsigned multiplier, the goal of this work is to implement signed multiplication. Therefore, this specification must be modified to add a booth encoding. In a booth encoding, the most significant bit of each digit Bi is multiplied by -1. This accounts for the sign designated by the most significant bit of B. However, for internal digits, this must be reverted. Therefore, the most significant bit of each digit is also multiplied by 2 and added back in, which effectively shifts that bit into the least significant bit of the next digit. The value of each digit Bi in B is then as follows, assuming a four-bit digit.

-8*Bi,3 + 4*Bi,2 + 2*Bi,1 + Bi,0 + Bi-1,3

This means that Bi overflows at 8 and starts multiplying by negative values, but with a magnitude never greater than 8. Therefore, when designing the booth encoder, the magnitude is Bi,0:3+Bi-1,3 when Bi,3=0 and ¬Bi,0:3+¬Bi-1,3 when Bi,3=1. This value M will be used to generate the partial products for Aid*Bid as shown in Table 25.

M   Aid*Bid
0   0
1   1*Aid
2   2*Aid
3   4*Aid - 1*Aid
4   4*Aid
5   4*Aid + 1*Aid
6   4*Aid + 2*Aid
7   8*Aid - 1*Aid
8   8*Aid
Table 25. Booth encoding for a four-bit digit multiply.

If Bi,3=1, then these partial products must be negated. However, doing so directly would require a carry chain. Instead of negating the partial products, they can simply be inverted. Then, all that is left is to add 1, as initial data, to whichever of the partial products should be negative.
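The recoding can be sanity-checked exhaustively. The sketch below uses a hypothetical helper, booth_digit, that is not part of the design; it returns the sign and magnitude for one four-bit digit given the most significant bit of the previous digit, and asserts that sign*magnitude matches the digit value defined above.

def booth_digit(digit, prev_msb):
    # Recoded (sign, magnitude) of one 4-bit digit of B, assuming the
    # digit value -8*Bi,3 + 4*Bi,2 + 2*Bi,1 + Bi,0 + Bi-1,3 from the text.
    if digit & 0x8:                                # Bi,3 = 1: negative multiple
        return -1, ((~digit) & 0x7) + (1 - prev_msb)   # ¬Bi,0:3 + ¬Bi-1,3
    return 1, (digit & 0x7) + prev_msb                 # Bi,0:3 + Bi-1,3

# Exhaustive check against the digit value definition.
for d in range(16):
    for p in (0, 1):
        sign, mag = booth_digit(d, p)
        assert sign * mag == -8 * ((d >> 3) & 1) + (d & 0x7) + p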
Now this booth encoding may be added into the datapath specification as follows.

vo:=1, yd:=0;
∗[[ vo=1 → x:=booth_init(Bid) ▯ vo=0 → skip ];
   s0:=booth(Bid, Aid) + yd + x;
   [ Aic=0 ∧ Bic=0 → Ao!Ai; Yi?y; Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=0 ∧ Bic=1 → Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ (vi=0 ∨ Bic=0 ∧ vo=1) → Ao!Ai; Yi?y; Ai?; vo:=0, x:=s0M:2M, Yo!(0,s00:M)
   ▯ Aic=1 ∧ Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0 → Ai?, Bi?; vo:=1, yd:=0, Yo!(1,s00:M)
]]

Unfortunately, the datapath's clocking requirements introduce new difficulties. With Ai, Yi, and X, a naive approach to clocking these signals throughout the handshake might require up to six layers of latching, which would ultimately introduce a significant amount of latching overhead beyond what is required for the bit-parallel approach. Furthermore, because this circuit is length-adaptive and booth encoded, the reset conditions are quite a bit more complex than in the baseline synchronous bit-serial approach. The design of the datapath in Fig. 93 is able to cut the latching requirements down to three layers, but requires careful attention to detail regarding the clocking order and places constraints on the control circuitry.

To remove the three latching layers that would be required for the internal memory X, the latching of X and Yid can be folded together by adding them together before latching Yid. Normally, Yid would be latched using Yie. Unfortunately, this would still require one more n-latch layer along X beyond the p and n layers latching Yid. Removing this final layer requires the latched data from Aid to remain stable until after Yid+X has been latched in. This forces an ordering Yie⇂; Aie↾ or Yie⇂ • Aie↾ in the control.

Fig. 93: Datapath architecture for each digit unit of the multiplier.

This transforms the datapath specification as follows:

vo:=1;
∗[[ vo=1 → yd:=booth_init(Bid) ▯ vo=0 → skip ];
   s0:=booth(Bid, Aid) + yd;
   [ Aic=0 ∧ Bic=0 → Ao!Ai; yd:=Yid+s0M:2M; Yi?; Ai?; vo:=0, Yo!(0,s00:M)
   ▯ Aic=0 ∧ Bic=1 → yd:=0+s0M:2M; Ai?; vo:=0, Yo!(0,s00:M)
   ▯ Aic=1 ∧ (vi=0 ∨ Bic=0 ∧ vo=1) → Ao!Ai; yd:=Yid+s0M:2M; Yi?; Ai?; vo:=0, Yo!(0,s00:M)
   ▯ Aic=1 ∧ Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0 → Ai?, Bi?; vo:=1, Yo!(1,s00:M)
]]

Furthermore, Yie would only cycle when there is a token arriving on Yi. Meanwhile, X must be clocked regardless of an incoming token on Yi. This means that a new strategy must be taken to clock the combined Yid+X on every cycle, as triggered by Ae0, Ae1, and Ae2, while maintaining the ordering constraint. This strategy uses compound latches with the three clocking signals and pushes the synchronization problem to the control.

Next, while I[0:4] remains stable throughout the whole multiply operation, the multiplexer selection signal vo does not. During the first token of a multiply operation, the booth encoder's reset data must remain stable until Yid+X has been latched in with Ae⇂. Unfortunately, vo may only transition in the set phase of the WCHB handshake while its next value is available on the forward drivers, and Ae⇂ may only occur in the reset phase after the reset of the forward drivers acknowledges the incoming request on Yi. This means that there is a short time before Ae⇂ in which the signal multiplexing I[0:4] transitions. This is fixed by moving the multiplexer before the p-latch for Yid+X. This ensures that the reset data for the booth multiplier is held stable once Ae⇂ has occurred. This also means that the internal memory vo must wait for Ae⇂ before transitioning.

Finally, at the most significant digit of the multiplier, signified by a cap token on Bi, the input data from Yid must be zeroed. This is implemented by a multiplexer on Yid controlled by the input request on Bi. Since that request remains valid throughout the operation, this multiplexer will be stable.

9.3 Control

Now, the techniques in Chapter 3 are applied to this specification to generate efficient control circuitry. Given that only the control is being synthesized and that the control operates independently from any values on the datapath, much of the specification can be ignored. This leaves the control specification shown below.

vo:=1;
∗[[ Aic=0 ∧ Bic=0 → Aoc!Aic; Yic?; Aic?; vo:=0, Yoc!0
 ▯ Aic=0 ∧ Bic=1 → Aic?; vo:=0, Yoc!0
 ▯ Aic=1 ∧ (vi=0 ∨ Bic=0 ∧ vo=1) → Aoc!Aic; Yi?; Aic?; vo:=0, Yoc!0
 ▯ Aic=1 ∧ Bic=1 ∨ Bic=0 ∧ vi=1 ∧ vo=0 → Aic?, Bic?; vo:=1, Yoc!1
]]

Of the four conditions, the first and second can be combined into a single condition handling the first loop. The only difference between the two is whether to forward the request on Ao and receive the result on Yi, and that difference is quite easily conditioned using Bic since it remains stable throughout the operation. Therefore, the first two conditions may be combined into a single forward driver R0. From there, Aic0 is enough to disambiguate the first and second conditions from the third and fourth. However, because the first condition sends on Ao, the forward driver must also acknowledge the channel's enable Yic, as covered by Re. Because Re is low when Bic0 is not set, Bic1 must be added to unblock the second condition.
Because the first two conditions only deal with Aic=0, only the forward request Aoc0 needs to be predicated. Furthermore, the output acknowledge Yic can also be predicated on Bic to create an output enable Re.

Bic0 → _B0⇂   // delay
¬Bic0 → _B0↾
Bic1 → B1↾   // delay
¬Bic1 → B1⇂
¬_B0 ∧ ¬Ae0 → Aoc0↾
_B0 ∨ Ae0 → Aoc0⇂
¬Ae1 → Aoc1↾
Ae1 → Aoc1⇂
_B0 ∨ Yic0 ∨ Yic1 → Re⇂
¬_B0 ∧ ¬Yic0 ∧ ¬Yic1 → Re↾

The third condition always sends on Ao; thus R1 must acknowledge Re. Furthermore, it must check vi and vo to ensure that the multiply operation has not yet completed. R2 mostly implements the predicate of the fourth condition. However, because Aoc1 is acknowledged by Yic through Re during the final execution of R1, R2 must wait for the reset of that acknowledgement through Re, allowing Re to further acknowledge the reset of Bic0. Then, as an unfortunate side effect of the shared variable vi, R2 must be predicated by Re. Because vi is an internal variable of the next most significant digit, it will transition in the middle of the handshake, allowing R2 to transition before the data from that operation has been received from Yid. This is fixed with the Re predicate.

Each forward driver is then amplified and inverted to serve as clocking signals for the datapath. In order to guarantee the bundled-data assumption, the output requests on Ao and Yo are generated from the clocking signals in Ae. This ensures that the input data from the exchange channels remains stable until the input latches have been closed by Ae.

en1 ∧ Aic0 ∧ (Re ∨ B1) → R0↾
en1 ∧ Aic1 ∧ Re ∧ (vi0 ∨ vo1) → R1↾
en1 ∧ Aic1 ∧ (Re ∧ vi1 ∧ vo0 ∨ B1) → R2↾
R0 → Ae0⇂   // amplify
R1 → Ae1⇂   // amplify
R2 → Ae2⇂   // amplify
Bie = Ae2

There are two things to notice. First, the value from Yi? is never used in the specification. Instead, Yi communicates a valid value during the reset phase of this handshake, and its value is effectively recorded in the shared variable vi every cycle, so there is no way nor need to use the value received directly from the channel. However, the least significant digit of the multiplier still needs to produce the cap/not-cap value on Yo. Second, for only the first digit of the multiply, both Ai and Yo have enable signals Aie and Yoe. For Ai, the enable signal can be driven by pass-transistor logic. For digits of greater significance, this signal can simply be ignored without introducing any instability, due to the special QDI treatment for pass-transistor logic. For the first digit, it is an efficient way to generate the input enable from the clocking signals.

@Ae0 ∧ ¬_Ae2 ∨ @Ae2 ∧ ¬_Ae0 → Aie↾
¬@Ae0 ∧ _Ae0 ∨ ¬@Ae2 ∧ _Ae2 → Aie⇂

For Yo, en0 and en1 take the place of Yoe. For the first digit, both en signals are connected to Yoe as a synchronous configuration.

en1 = Yoe
en0 = Yoe

For the further digits, the en signals are connected to Vdd and GND respectively. This allows the handshake to bypass the acknowledgement check and simply operate using the acknowledgement from the exchange channels.

en1 = gVdd
en0 = gGND

Then, the request on Yo is assigned the cap/not-cap value using another pass-transistor OR on the amplified clocking signals Ae. The upgoing transitions are delayed to ensure the bundled-data constraint in the datapath. Yoc0 handles the first, second, and third conditions while Yoc1 handles the fourth.
@Ae0 ∧ ¬_Ae1 ∨ @Ae1 ∧ ¬_Ae0 → _Yo0↾
¬@Ae0 ∧ _Ae0 ∨ ¬@Ae1 ∧ _Ae1 → _Yo0⇂
_Yo0 → Yoc0⇂
¬_Yo0 → Yoc0↾   // delay
Ae2 → Yoc1⇂
¬Ae2 → Yoc1↾   // delay

In the standard WCHB handshake, vo0 would be set with _R0 or R1 and vo1 with _R2. However, vo plays a couple of further roles. First, it is the shared variable vi during the set phase of the next least significant digit. Since the set phase of the next least significant digit triggers the set phase in this digit, the transition on vo must be delayed until the input request Aic1 has been lowered. The other input request Aic0 does not need to be considered because only R0 drives Aic0, and R0 does not depend upon a stable value on vi. Unfortunately, the reset of R2 must still acknowledge Aic1⇂ to cover for when B1 is high.

Second, vo is a reset signal for the datapath, feeding the initialization of the booth multiplier into the datapath instead of the carry digit from Yid+X. This means that the initialization data must remain stable until after Yoe⇂ and Ae0↾, Ae1↾, Ae2↾. Unfortunately, it would be impossible to wait for the appropriate transitions on Ae because they happen after the reset phase of the handshake, and the transition on vo must happen before. Therefore, the datapath will have to latch the initialization data appropriately using Ae, and vo can replace _R with Ae to ensure that latch remains stable.

Third, vo is used in the forward drivers of this digit. To protect those forward drivers from instability, the transitions on vo must be guarded by the reset of Re. This also helps acknowledge the reset of Re during the reset phase of R2.

vo1 ∧ (Ae0 ∧ Ae1 ∨ Re) → vo0⇂
vo0 ∧ (Ae2 ∨ Re ∨ Aic1) → vo1⇂
¬vo1 ∨ (¬Ae0 ∨ ¬Ae1) ∧ ¬Re → vo0↾
¬vo0 ∨ ¬Ae2 ∧ ¬Re ∧ ¬Aic1 → vo1↾

R0 sends on Yo, receives on Ai, and sets vo0. However, it does not always send on Ao. This is covered by Re. If R0 sends on Ao, then Bic0 is high and Re is allowed to reflect Aoe. However, if R0 does not send on Ao, then Bic0 is low and Re is held low. So it is enough to just acknowledge Re. Remember, for the first digit, en0 is connected to Yoe. For the subsequent digits, en0 is simply tied to GND. R1 sends on Yo and Ao and sets vo0. Either it is the first digit and Ai is not acknowledged, but Yoe acknowledges the request on Yo through en1, or it is not the first digit and Ai is acknowledged, resetting Aic1 while en1 is tied to Vdd. R2 sends on Yo, receives on Ai and Bi, and sets vo1. Setting vo1 covers the acknowledgement of Bic0 through Re. Therefore, the reset phase of the forward driver just has to cover Yo, Bic1, and Aic1.

¬en0 ∧ ¬vo1 ∧ ¬Re ∧ ¬Aic0 → R0⇂
(¬en1 ∨ ¬Aic1) ∧ ¬vo1 ∧ ¬Re → R1⇂
¬en0 ∧ ¬vo0 ∧ ¬B1 ∧ ¬Aic1 → R2⇂
¬R0 → Ae0↾   // amplify
¬R1 → Ae1↾   // amplify
¬R2 → Ae2↾   // amplify

Finally, the input enables are reset following the typical WCHB reshuffling.

9.4 Evaluation

Overall, it seems that the vast majority of multiply operations executed on modern hardware are limited to 32 bits. This means that the performance for the length-adaptive digit-serial 64-bit operations would be inflated compared to the standard bit-parallel approaches. Therefore, Fig. 94 shows the overall performance comparison for both 32 and 64 bit multiplier datapaths.

Note that there are three performance numbers for the length-adaptive digit-serial multiplier. This multiplier consists of a collection of processes. The processes implementing the higher-order digits are only active when executing a multiply in which B is long enough.
Therefore, there are three contexts in which this multiplier may be used. If it stands alone in the overall architecture and the higher order digits cannot be allocated to another multiply when not in use, then the multiplier digits are “statically” allocated. In this case, the transistor count for a 64-bit multiply when computing throughput/transistor is simply 8 times the transistor count for a single digit, assuming a 4-bit datapath. However, in the context of a CGRA, the multiply digits are “dynamically” allocated as needed. This means that the transistor count for the throughput/transistor metric is now dependent upon the number of digits in B for a given multiply. Therefore, the transistor count for the dynamically allocated multiplier is averaged over the distribution of the bitwidth of B for the throughput/transistor metric. Third, the number of cycles executed by the allocated processes in the second phase of the multiply is dependent upon the length of B. Therefore, if it is possible to always assign the shorter of the two inputs to B, then some cycles can be avoided, thereby saving energy. This is the “sorted” multiplier.

Fig. 94 shows that the statically allocated length-adaptive digit-serial multiplier is 1.4 times faster per transistor than the fastest bit-parallel multiplier while using 31% less energy for a 32-bit multiply. Meanwhile, the dynamically allocated multiplier is 4.81 times faster per transistor assuming perfect scheduling. Of course, perfect scheduling is not possible. Therefore, the real performance of the dynamically allocated multiplier will be somewhere in between. Finally, the sorted dynamic multiplier is 6.1 times faster per transistor using 43% less energy.
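The averaging used for the dynamically allocated transistor count can be sketched as follows. This is a hedged sketch: P_B is an assumed helper giving the marginal probability of each bitwidth of B from Fig. 95, the 1152 default takes the transistor count from Table 27 and assumes it is per digit unit, and the digit accounting (btok internal digits plus one cap digit) matches the token calculations later in this section.

def dynamic_transistors(P_B, digit_transistors=1152, packet=4, max_bits=64):
    # Expected transistor cost of one multiply under dynamic allocation:
    # digits are only occupied while B is long enough to need them, so
    # the cost is averaged over the bitwidth distribution of B.
    avg = 0.0
    for b in range(1, max_bits + 1):
        btok = int((b + packet - 2) / packet)    # internal tokens in B
        avg += (btok + 1) * digit_transistors * P_B(b)
    return avg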
Two bit-parallel multipliers were synthesized for comparison. The first is a custom Dadda Tree followed by a Manchester Carry Chain Adder. The second is automatically synthesized from Verilog by Synopsys Design Compiler. Both implementations only latch the two inputs, leaving the rest of the multiply unpipelined.

Fig. 94: Performance and energy averaged over the distribution in Fig. 95 vs Transistor Count.

Type                                     Transistors   Frequency   Energy/Op   Latency
Clocked Parallel Dadda Ripple (64-bit)   118122        374 MHz     16.497 pJ
Clocked Parallel Synopsys (64-bit)       130402        272 MHz     80.314 pJ
Table 26. Performance measurements for the bit-parallel multipliers.

The digit-serial multiplier has three conditions. The “internal” condition covers the first two conditions in the behavioral specification, in which the input request on Ai is a non-cap token. The “extend” condition covers the third condition in the behavioral specification, in which the input request on Ai is a cap token but the second half of the output digits are still propagating out of the multiplier. Finally, the “cap” condition covers the final token to be emitted out of a given multiply digit, covering the final condition in the behavioral specification.

Ultimately, the multiplier is quite a bit slower than the other operations in this thesis, generally operating around 1.4 GHz. This will likely be a limiting factor for the throughput of a CGRA. Quite a bit more time should be spent optimizing the datapath to allow for shorter delay lines. Furthermore, this operator tends to require 200-300 fJ per token, which is quite a bit more than any other operation in this thesis. It is likely that this could be optimized through intelligent usage of pass-transistor logic. These per-token measurements are averaged over the input distribution found in Fig. 95.

Keep in mind that the bitwidth of the result of a multiply is the sum of the bitwidths of its inputs. This distribution has two interesting features. First, there are four cutoff points. The first seems to be around 16 bits, which would produce 32-bit results. This is likely from the algorithmic workload required by the program. The second seems to be 26 bits, with a pre-cutoff peak at around 24 bits, which would produce results around 48 bits wide. This matches the width of the memory bus on a 64-bit Intel processor and is therefore attributable to memory address calculations. The third is at 32 bits, which would produce 64-bit results, matching the width of the datapath on a 64-bit Intel processor. The final cutoff point is a bit odd, with one input around 32 bits and the other around 64 bits wide, which would produce 96-bit results. It is unclear what this might be from. The other interesting feature is the strong diagonal representing the multiplication of two numbers that have the same bitwidth. This is likely due to computing the square of a number, which is one of the main operations in the iterative algorithm implementing the pow() function.

Type                         Transistors   Condition   Token Frequency   Energy/Token   Latency
Integrated Serial Adaptive   1152          internal    1.46 GHz          243.48 fJ      561 ps
                                           extend      1.40 GHz          171.37 fJ      607 ps
                                           cap         1.29 GHz          284.91 fJ      550 ps
Table 27. Raw performance measurements for the digit-serial multiply operator.

Two things must be computed from this distribution. The average energy requires the total number of times a given condition was executed throughout the multiplier stack per operation. Meanwhile, the average throughput per operation requires the total number of times a given condition was executed by just the first digit in the multiplier stack.

For the average energy, the distribution in Fig. 95 is iterated over with a and b representing given bitwidths for their respective inputs A and B. p is the probability of that combination of bitwidths. Then, for a given packet size (4 in this case), the number of internal tokens is computed for each input as atok and btok. If the input has one bit (0 or -1), then the cap token will be the only token in the stream. Further bits are then allocated to internal tokens. The “internal” condition of the multiplier is executed once per digit allocated by a token in B per internal token in A, or atok*(btok+1) times. The “extend” condition is executed in one of the digits once per each digit of greater significance in the multiplier. This means that if there are two digits, then the top digit will execute it zero times and the bottom digit one time. Pick's Theorem is used to compute the number of discrete points in the resulting right triangle with side length btok. This comes out to int((btok*btok+btok)/2). Finally, the “cap” condition is executed once per allocated digit, or btok+1 times.

# The bitwidth helpers and the probability P(...) are taken from the
# measured distribution in Fig. 95; packet is the digit size in bits.
u = {'internal': 0, 'extend': 0, 'cap': 0}
for a in range(1, max_bitwidth(A)+1):
    for b in range(1, max_bitwidth(B)+1):
        p = P(bitwidth(A) == a and bitwidth(B) == b)
        atok = int((a+packet-2)/packet)
        btok = int((b+packet-2)/packet)
        u['internal'] += atok*(btok+1)*p
        u['extend'] += int((btok*btok+btok)/2)*p
        u['cap'] += (btok+1)*p

For the distribution in Fig. 95, this yields the token counts in Table 28, with each condition being executed 3 to 4 times for a total of 11.273 cycles.

Fig. 95: Probability distribution for the bitwidth of the left operand A and right operand B for multiplication.
Condition      Average Cycles/Stream
internal       4.191
extend         4.035
cap            3.047
Total Cycles   11.273
Table 28. Utilization of each condition for a multiply.

For throughput per transistor, the distribution is iterated over like before. The first digit in the multiplier executes the “internal” condition once per internal token in A, or atok times. The “extend” condition is executed once per digit of greater significance, or btok times, and the “cap” condition is executed once per multiply.

u = {'internal': 0, 'extend': 0, 'cap': 0}
for a in range(1, max_bitwidth(A)+1):
    for b in range(1, max_bitwidth(B)+1):
        p = P(bitwidth(A) == a and bitwidth(B) == b)
        atok = int((a+packet-2)/packet)
        btok = int((b+packet-2)/packet)
        u['internal'] += atok*p
        u['extend'] += btok*p
        u['cap'] += p

For the distribution in Fig. 95, this yields the token counts in Table 29, with each condition being executed 1 to 2 times for a total of 4.174 cycles.

Condition      Average Cycles/Stream
internal       1.127
extend         2.047
cap            1.000
Total Cycles   4.174
Table 29. Utilization of each condition for the least significant digit of the multiply circuit.

These token calculations are expanded to various maximum bitwidths, including 4, 8, 16, 32, and 64 bits, by truncating the distribution as needed. This allows for a comparison of the length-adaptive digit-serial multiplier against bit-parallel multipliers at varying bitwidths in Fig. 96. This shows that the digit-serial multiplier surpasses the best bit-parallel multiplier in both metrics shortly before 32 bits.

Fig. 96: Throughput/transistor (left) and energy/op (right) metrics scaled by maximum bitwidth.

Like the addition operator, the multiplication operator can introduce some redundant tokens into the encoding of the result, as shown in Fig. 97. These tokens come from the addition operations that are internal to the multiplier.

Fig. 97: Probability distribution for the number of redundant bits introduced per operation by the multiplier.

CHAPTER 10
EXAMPLE ARRAY ARCHITECTURE

The operators designed in this thesis show dramatic improvements against industry standard architectures and synthesis methods. While the implementation of each operator may be complicated, the final circuitry is elegant and robust with simple plug-and-play interfaces. They intelligently avoid unnecessary work, saving significant amounts of time and energy in each operation. Overall, this thesis has completed the most difficult tasks necessary for length-adaptive digit-serial computation. These operators form an underlying framework on which many highly performant architectures may be constructed.

As an example of such an architecture, this final chapter implements the Arithmetic Cube [266][268]. This architecture was designed in 1987 as a systolic digit-serial accelerator for the Discrete Fourier Transform and other related operations. While this is not a fully configurable CGRA, it is an array architecture that has some amount of configurability and showcases the behaviors of the presented circuitry and its benefits.

Overall, the architecture implements the operation found in Fig. 98. This operation multiplies the signal matrix X by a weight matrix B with values of only -1, 0, or 1. Then, it does an element-wise multiply with the scaling matrix H, as signified by the ⊛ operator. For more information about how this is derived and used to implement the Discrete Fourier Transform, see [266].
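The operation in Fig. 98 can be pinned down with a small functional model before walking through the hardware; arithmetic_cube is an illustrative name, and plain lists stand in for the matrices.

def arithmetic_cube(X, B, H):
    # Z = H ⊛ (B·X): multiply the n1 x n0 signal matrix X by the
    # n2 x n1 weight matrix B (entries -1, 0, or 1), then element-wise
    # multiply by the n2 x n0 scaling matrix H.
    n1, n0, n2 = len(X), len(X[0]), len(B)
    return [[H[k][i] * sum(B[k][j] * X[j][i] for j in range(n1))
             for i in range(n0)]
            for k in range(n2)]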
The Arithmetic Cube architecture, as seen in Fig. 99, was chosen because it allows for a drop-in replacement of the arithmetic operators with the operators developed in this thesis. It consists of a two dimensional array of adders/subtracters followed by a row of multipliers. Each adder/subtracter is configured using a two-bit element from B. The bottom row of adders is sent the first column of B and the left column of adders is sent the first row of B, etc. Given j in 0 ≤ j < n1 and k in 0 ≤ k < n2, if Bk,j is 1, then the node is configured as an adder. If Bk,j is -1, then the node is configured as a subtracter. If Bk,j is 0, then the node is bypassed in both directions. Overall, each node receives a row of elements Xj,* on the left and a series of partials P from above. For i in 0 ≤ i < n0, it forwards Xj,i to the right and Bk,j*Xj,i+Pi down. Each multiplier receives a row of bit-parallel elements Hk,* and multiplies each element Hk,i with the partial received from the adder/subtracter array, Pi. Therefore, Zk,i = Hk,i*sum(Bk,j*Xj,i for 0 ≤ j < n1).

Fig. 98: Operation implemented by the arithmetic cube.
Fig. 99: Architecture of the Arithmetic Cube.

10.1 Sum Process

The vast majority of the functionality required by the “sum” process in Fig. 99 is already implemented by the adder/subtracter unit described in Chapter 7, with Xj,i routed to B, Pi to A, and S to the next row's Pi. Ultimately, the most significant bit of Bk,j would be connected directly to cfg. When Bk,j is -1, then the most significant bit, and therefore cfg, is 1, configuring the unit to subtract Xj,i from Pi. When Bk,j is 1, then the most significant bit, and therefore cfg, is 0, configuring the unit to add Xj,i to Pi.

There are two functionalities that remain to be implemented. The first requires Xj,i, connected to B, to be forwarded to the right. This can be done by adding a simple buffer and re-using the p-latches that are already latching the data from B. This modifies the adder/subtracter unit very slightly, as shown in Fig. 100.

Boe ∧ Bc0 → Boc0↾
Boe ∧ Bc1 → Boc1↾
(Boc0 ∨ Boc1) ∧ (Sd0 ∧ Bx0 ∨ Sd1) → Be⇂
¬Boe ∧ ¬Bc0 → Boc0⇂
¬Boe ∧ ¬Bc1 → Boc1⇂
¬Boc0 ∧ ¬Boc1 ∧ (¬Sd0 ∨ ¬Bx0) ∧ ¬Sd1 → Be↾

The second requires the unit to be bypassed when the least significant bit of Bk,j is 0. This is simply done with a few multiplexers. The final circuit is shown in Fig. 101.

Fig. 100: Forwarding B.
Fig. 101: Sum process with bypassing multiplexers.

10.2 Mul Process

The “mul” process is entirely implemented by the multiplier discussed in Chapter 9, with Hk,i connected to B, Pi to A, and Zk,i to S. However, for the Arithmetic Cube architecture the scaling matrix is optional. Therefore, the “mul” process will also need multiplexers to bypass the unit. This can be seen in Fig. 102.

Fig. 102: Mul process with bypassing multiplexers.

10.3 Evaluation

This evaluation will demonstrate how data flows through this architecture. Fig. 103 shows the first few steps of a simulation of the first column in the Arithmetic Cube. On the left, each channel is labelled and the configuration information is given. During the simulation, the value of the internal memory is shown in each box. For the sum units, the internal memory represents the one-bit carry-in for the next cycle. For the multiplier units, it represents the carry digit for the next cycle and the initial carry for the booth encoder. In this simulation, B0,1 and B0,2 are both 0, bypassing those sum units entirely. Meanwhile, B0,0 and B0,3 are both 1, adding their respective X to the running total.

Fig. 103: Walk through of a simulation of channels along the first column of the Arithmetic Cube.

X[3] receives the hexadecimal value 07C3DBD3D, or 2,084,420,925; X[0] receives the hexadecimal value 02367, or 9,063; and H[0] receives the hexadecimal value F6B2, or -2,382. Overall, the value computed is (2,084,420,925+9,063)*-2,382 = -4,965,112,231,416. This comes out to the hexadecimal value FFB7BF83FCA08.
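The end-to-end arithmetic of this walkthrough is easy to verify; the check below reads H[0] as a 16-bit two's complement value and the 13-hex-digit result as a 52-bit two's complement value.

x3 = 0x7C3DBD3D            # X[3]: 2,084,420,925
x0 = 0x2367                # X[0]: 9,063
h0 = 0xF6B2 - 0x10000      # H[0] as 16-bit two's complement: -2,382
z = (x3 + x0) * h0
assert z == -4_965_112_231_416
assert z & ((1 << 52) - 1) == 0xFFB7BF83FCA08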
The carry digits in the multiplier are initialized per the booth encoding.

1. The parallel input for the first digit is 2, and the last bit from the previous digit is 0 by definition. This is lower than the threshold of 8 on the booth encoder. Following Table 25, the multiplier is 2*A and the initial carry is 0.
2. For the second digit, the parallel input is B and the last bit from the previous digit is 0. Because the parallel input is greater than the threshold, the multiplier is (¬B+¬0)*¬A=(4+1)*¬A=5*¬A. It follows that the initial carry is 5 to correctly implement the negation.
3. The parallel input for the third digit is 6 and the last bit from the previous digit is 1. Following the table, the multiplier is (6+1)*A=7*A and is implemented as 8*A + ¬A. To properly implement the negation, the initial carry is set to 1.
4. The final digit is F and the last bit of the previous digit is 0. F is greater than the threshold, so the multiplier is (¬F+¬0)*¬A=¬A, making the initial carry 1.
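These four initializations line up with the booth recoding from Chapter 9. Running the hypothetical booth_digit sketch from Section 9.2 over the digits of H[0] = F6B2 (least significant first) reproduces the signed multiples used here.

digits = [0x2, 0xB, 0x6, 0xF]          # H[0] = 0xF6B2, LSD first
prev = 0
for d in digits:
    print(booth_digit(d, prev))        # (1, 2), (-1, 5), (1, 7), (-1, 1)
    prev = (d >> 3) & 1                # carry the digit's MSB forward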
The simulation continues from this initial state as follows.

1. In the first step, the least significant digit of each input arrives on the associated channel. This is D for X[3], 7 for X[0], and all the values of H[0] in parallel. In the next step, the top sum unit has received all of its inputs and may therefore execute 0+D+0=D. This result bypasses the middle two sum units through the multiplexers to arrive at the input of the bottom sum unit. Meanwhile, the top sum unit enters its reset phase, acknowledging the left input and sign-extending the top input.
2. In the second step, the bottom sum unit has received all of its inputs and may execute D+7+0=14. This forwards 4 on the output and stores the carry in memory for the next cycle. Meanwhile, the top sum unit has exited the reset phase and is now receiving new inputs.
3. In the third step, the top sum unit has received a new token on its left input, allowing it to execute 0+3+0=3. Once again, this result bypasses the middle two sum units and arrives as an input to the bottom sum unit. Meanwhile, the multiplier has just received its first serial input token. The least significant digit in the multiplier now computes the booth encoded result 2*4+0+0=8, emitting 8 on the output and storing 0 in the internal memory. This puts that first multiplier unit in its reset phase.
4. In the fourth step, the bottom sum unit has received its second pair of input tokens, allowing it to execute 3+6+1=A. This is forwarded to the multiplier, putting the sum unit in its reset phase. Meanwhile, the first token has made its way to the second least significant digit in the multiplier, allowing it to compute the booth encoded 5*¬4+5+0=5*B+5+0=3C. C is returned to the least significant digit and 3 is stored in the digit carry.
5. In the fifth step, the top sum unit receives new inputs and computes 0+D+0=D. The bottom sum unit is in its reset phase. The least significant digit of the multiplier receives inputs and computes 2*A+0+C=20. The second multiplier digit is in its reset phase, and the third multiplier digit computes 8*4+¬4+1+0=8*4+B+1+0=2C. C is returned to the second multiplier digit and 2 is stored in the carry.
6. In the sixth step, the top sum unit is in its reset phase, and the bottom sum unit receives inputs and computes D+3+0=10. This forwards 0 to the multiplier and stores 1 in the carry. The first multiplier digit is in its reset phase, the second multiplier digit receives inputs and computes 5*¬A+3+C=5*5+3+C=28, the third multiplier digit is in its reset phase, and the final multiplier digit receives inputs and computes ¬4+1=B+1=C.

This is continued in the spice simulation of the full 4x4 Arithmetic Cube. The relevant part of this simulation is shown in Fig. 104. As previously discussed, each channel has three parts: the requests c0 (blue) and c1 (red), which signify not-cap or cap respectively; the enable e; and the bundled data d[0:4] (blue, red, yellow, green).

Fig. 104: Waveform (left) of channels along the first column (right) of the Arithmetic Cube.

There are a few things to notice here. First and foremost, the power consumption of the whole 4x4 array is fairly smooth, never going above 12 mW. While the multiplier is in its first phase, the adder network is allowed to compute. Meanwhile, when the multiplier is in its second phase, the adder network is stalled. This is reflected in the power by a slow increase and decrease in power usage.

Second, the operating frequency of a linear self-timed pipeline is limited by the slowest process. For this array, the slowest process is the multiplier, operating around 1.4 GHz. Normally, the differences in pipeline length caused by the bypassed nodes in the array would slow down the pipeline. Often, slack matching is employed to counter this problem. However, in this case slack matching is not necessary because the sum units in the network operate at around 2 GHz, which is 0.6 GHz faster than the multiplier. This gives enough slack that the array's operating frequency is not slowed beyond that of the multipliers.

Third, like most self-timed circuits, there is no single number that can be used to report performance or power consumption. Both of these metrics depend heavily on the operation being performed, and no single signal keeps a constant frequency. The closest signal to sample for operating frequency tends to be the enable. However, for conditional communication the enable signals can stall for significant amounts of time. Furthermore, if the multipliers were disabled, the array would then run at the operating frequency of the adders, which is around 2 GHz, and would burn significantly less power. This means that for larger circuits, the only way to really compare them is to check their average throughput and energy across a large testbench.

CHAPTER 11
CONCLUSION

This thesis proposed, implemented, and evaluated length-adaptive digit-serial arithmetic to optimize capacity, throughput, and energy consumption for coarse-grained reconfigurable arrays in the context of general compute. General compute workloads were analyzed to determine the viability of such architectures and identify avenues for optimization.
Novel micro-architectural optimizations for QDI process design were discussed in great detail, along with novel methods for integrated QDI/BD design. These methods were then used throughout this thesis to implement efficient arithmetic and supporting circuitry.

As summarized in Table 30, the length-adaptive digit-serial arithmetic circuitry developed in this thesis successfully demonstrates many of the hypothesized benefits outlined at the end of Chapter 2. Every operator reduces energy consumption by a factor of two or more compared to their clocked counterparts. While clocked circuits save some energy when the input values don't change, the majority of energy overhead in a clocked system ultimately comes from driving the clock to toggle the flops in each pipeline stage, and the circuits developed in this thesis specifically seek to optimize that energy expenditure. Furthermore, in larger system contexts like a CGRA, the routing network can introduce a significant overhead. These circuits will send less data over the network, saving even more energy beyond the metrics listed in Table 30.

The adder/subtracter and multiplier naturally increase throughput per transistor metrics compared to their clocked counterparts by significant factors. For the multiplier, this largely depends upon the scheduling efficiency of the dynamic allocation of multiplication resources in a CGRA. However, even without that dynamic allocation, compute density is increased by a factor of 1.4x compared to 32-bit multiplication operations. This means that while dynamic allocation in hardware is an exciting opportunity, it is not necessary to get compute density improvements.

At first glance, the other operators reduce compute density compared to their clocked counterparts. However, when making this comparison, one must keep in mind that the operating frequency of those operators in a clocked system is ultimately limited by the worst case. For modern processors, this is somewhere between 1 and 4 GHz. Therefore, many of the throughput numbers for the clocked comparison points are grossly inflated beyond what they would be in a larger system context, including 5 GHz for the counters, 10 GHz for the bitwise operators, 5.56 GHz for the comparison operator, and 3.34 GHz for the shift operators. The same is not true for the circuits presented in this thesis because self-timed circuits, by definition, aren't synchronized to any one signal.

Operator                          Throughput per Transistor    Energy per Operation
                                  32-bit       64-bit          32-bit        64-bit
Counters                                                       0.89          0.49
Add/Subtract (vs Han & Carlson)   1.4          2.8             0.61          0.31
Multiply (vs Dadda Ripple)        1.4 - 4.8    5.4 - 33.8      0.69          0.16
Bitwise                           0.56         0.90            0.57          0.35
Shift (vs Custom)                 0.34         0.44            0.84          0.46
Compare (vs Gate/Clk Tree)        0.12 - 1.3   0.25 - 2.8      16.4 - 0.45   9.75 - 0.23
Table 30. Comparison of average performance and energy for all operators against their closest clocked counterparts. The multiplier throughput numbers depend on the scheduling efficiency of the dynamic approach, and the compare numbers depend upon the clock overhead associated with the clocked operator comparison point.

Because these operators are digit-serial, the implementation of a full arithmetic-logic unit would require between 1000 and 2000 transistors with logic sharing. On a standard modern processor chip with a three billion transistor budget, this could mean up to a million execution nodes.
This is orders of magnitude beyond the capacity of industry standard bit-parallel CGRAs and well beyond the couple hundred thousand static instructions necessary to execute any program in the SPEC2006 benchmark. Industry standard digit-serial CGRAs have similar capacities, but sacrifice configurability and flexibility due to the lack of control flow and stream length management. Instead, they are limited to accelerating specific problems with simple systolic architectures in which data flow patterns are carefully baked into the architecture at design time. The circuits presented in this thesis allow for the configurability and flexibility often found in bit-parallel CGRAs while supporting capacities often found in digit-serial CGRAs.

For clocked systems, the next instruction must always wait for the next clock cycle to begin execution as long as the two instructions are in separate pipeline stages. This is not the case for self-timed systems. Instead, self-timed systems have a forward latency: the amount of time necessary to do a full digit computation for the first digit. Note that this value is independent of operating frequency. Because the next operation can begin computation as soon as it receives the first digit, there is a difference between the operation throughput of the circuit and a CGRA's effective throughput of sequential operations, as shown in Table 31. Furthermore, when comparing the effective sequential execution throughput, take note once again that the numbers for the clocked comparison points are inflated compared to the throughput of those operators in larger system contexts.

Operator                          Forward Latency   Effective Sequential Execution Throughput   Improvement
Add/Subtract (vs Han & Carlson)   172 ps            5.81 GHz                                    1.50
Multiply (vs Dadda Ripple)        561 ps            1.78 GHz                                    4.76
Bitwise                           94 ps             10.64 GHz                                   1.06
Shift (vs Custom)                 152 ps            6.58 GHz                                    1.97
Compare (vs Gate/Clk Tree)        538 ps            1.86 GHz                                    0.33
Table 31. Comparison of effective sequential execution throughput of all operators against their closest clocked counterparts. The comparison operator latency is an average of 444 ps from not-equal comparisons 73% of the time and 791 ps from the other comparisons the other 27% of the time.
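The throughput column of Table 31 is simply the reciprocal of the forward latency, and the compare latency is a workload-weighted average; a quick check, with the percentages and latencies taken from the table and its caption:

# Effective sequential throughput is the reciprocal of forward latency.
for ps, ghz in [(172, 5.81), (561, 1.78), (94, 10.64), (152, 6.58), (538, 1.86)]:
    assert abs(1000.0 / ps - ghz) < 0.01

# The compare latency averages the not-equal and other comparison cases.
assert round(0.73 * 444 + 0.27 * 791) == 538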
Finally, a memory designed for length-adaptive digit-serial systems will require 8 to 16 more read and write ports. However, those ports will have a much smaller bitwidth to match the digit-serial datapath and will need to be fed less often. This will ultimately introduce some overhead into the memory system. However, due to length-adaptivity, those buses would be able to source a higher throughput compared to a single bit-parallel bus by communicating less data overall. And given that modern memories have a large array of banks, an intelligent memory scheduling algorithm should be able to mitigate the possible contention issues. Memories are already addressed per-byte, and it is unlikely that would need to change, since the overhead associated with reading an extra token from the bank is ultimately not too large.

Even with all of these benefits, there is quite a bit of work to do.

1. Further workload characterization can be done to identify common instruction groupings and allow for tighter micro-op fusion within each execution node. This would allow multiple operations to be programmed to a single execution node.
2. The serial to parallel units can be redesigned to remove unnecessary pipelining. This would save significant amounts of energy for the shift and multiply operations when loading the second operand.
3. A constant-time buffer could allow for automatic slack matching. This would solve any and all deadlock problems in the CGRA network and enable the implementation of a length-adaptive digit-serial division circuit.
4. A skip condition could be added to the AND and OR operators. This would eliminate any redundant bits generated by the current implementation and nearly eliminate the need for stream compression.
5. MSB-first floating point digit-serial logic could be explored. This represents a significant opportunity, particularly for dealing with comparison operators and division.
6. Bundled-data circuitry should be explored using the other reshufflings, particularly PCHB. This might allow the delay lines from two stages to overlap each other, providing for a potential speedup.
7. Alternative methods for signal delay should be explored, particularly because configuring an array of delay lines will end up being fairly expensive.
8. Micropipelining can be used to split the datapath computation in two and distribute those two halves over the two phases of the QDI handshake. This would allow for the use of symmetric delay lines, removing the overhead of the asymmetric delay line implementation. This would also allow the delay lines to be shorter, increasing the overall throughput.

These kinds of optimizations could be made easier with sufficient tooling support. In particular, a systematic application of the outlined synthesis procedure would allow for faster iteration of designs and therefore quicker optimization. Verification tools targeting the integrated QDI/bundled-data design style could make it easier to resolve various clocking problems. And a more intuitive approach for visualizing the interacting event cycles in a QDI process could make it easier to identify potential optimization steps along the way.

Eventually, these operators should be built into a comprehensive CGRA architecture for general compute to explore the remaining hypotheses from Chapter 2. In particular, the process of communicating data into and out of the array is currently inefficient. How should memory be structured for such an architecture? How should memory be addressed for digit-serial processing? These questions will require a significant amount of time to answer.

While the design process is complex, the resulting circuitry is robust, fast, and extremely power efficient. These circuits present a new opportunity for highly efficient array architectures, and this thesis opens the door to many opportunities for further research.

APPENDIX A
CHP NOTATION

Communicating Hardware Processes (CHP) is a hardware description language used to describe clockless circuits, derived from C.A.R. Hoare's Communicating Sequential Processes (CSP) [61]. A full description of CHP and its semantics can be found in [65]. Below is an informal description of that notation, listed top to bottom in descending precedence.

Dataless vs Datafull: Dataless expressions operate on node voltages while datafull expressions operate on delay-insensitive encodings. Mixed expressions implicitly cast the datafull expression to dataless using the encoding's validity. Specifically, for a datafull expression e, its positive sense e is cast to a validity check while its negative sense ¬e is cast to a neutrality check. null is defined to be a neutral state of an encoding.

A channel X consists of a request Xr and either an acknowledge Xa or enable Xe. The acknowledge and enable serve the same purpose, but have inverted sense.
With these signals, a channel implements a network protocol to transmit data from one QDI process to another.

• Skip: skip does nothing and continues to the next command.
• Dataless Assignment: n↾ sets the voltage of the node n to Vdd and n⇂ sets it to GND.
• Assignment: v := E waits until the datafull expression E is valid, then assigns that value to the variable v.
• Send: X!E waits until the datafull expression E has a valid value, then sends that value across the channel X. Ultimately, a send is expanded into a handshake on its underlying signals. The standard four-phase send on channel X is Xr := E; [Xa]; Xr := null; [¬Xa] for an acknowledge channel or Xr := E; [¬Xe]; Xr := null; [Xe] for an enable channel.
• Receive: X?v waits until there is a valid value on the channel X, then assigns that value to the variable v. Ultimately, a receive is expanded into a handshake on its underlying signals. The standard four-phase receive on channel X is v := Xr; Xa↾; [¬Xr]; Xa⇂ for an acknowledge channel or v := Xr; Xe⇂; [¬Xr]; Xe↾ for an enable channel.
• Dataless Channel Action: If X is a dataless channel, then a send with an acknowledge channel is indistinguishable from a receive with an enable channel, and a send with an enable channel is indistinguishable from a receive with an acknowledge channel. Therefore, we can simplify the syntax for the dataless send X! or receive X? to X.
• Partial Send: X := E executes only the first statement in the protocol of a channel send, Xr := E, on channel X without executing the remaining protocol. The remaining protocol may then be executed by calling the send without providing data, X!.
• Probe: X? is used to determine if the channel is ready for a receive action, returning the value waiting on the request Xr without executing the receive. X! is used to determine if the channel is ready for a send action, expanding into either ¬Xa given an acknowledge or Xe given an enable. For dataless channels, the syntax is simplified to X.
• Simultaneous Composition: S • T executes the programs S and T at the same time.
• Internal Parallel Composition: S, T executes the programs S and T in any order.
• Sequential Composition: S; T executes the program S followed by T.
• Parallel Composition: S ∥ T executes the programs S and T in any order.
• Deterministic Selection: [G1 → S1 ▯ … ▯ Gn → Sn], where Gi, called a guard, is a dataless expression and Si is a program. The selection waits until one of the guards, Gi, evaluates to Vdd, then executes the corresponding program, Si. The guards must be stable and mutually exclusive. The notation [G] is shorthand for [G → skip].
• Non-Deterministic Selection: [G1 → S1 | … | Gn → Sn] is the same as deterministic selection except that the guards don't have to be stable or mutually exclusive. If two or more evaluate to Vdd simultaneously, then one is picked arbitrarily (not necessarily randomly).
• Repetition: ∗[G1 → S1 ▯ … ▯ Gn → Sn] or ∗[G1 → S1 | … | Gn → Sn] is similar to the selection statements. However, the action is repeated until no guard evaluates to Vdd. ∗[S] is shorthand for ∗[Vdd → S].

A.1 Examples

This is a Buffer. It implements a single pipeline stage. For every cycle, it reads the value v from the channel L and sends it on channel R.

∗[ L?v; R!v ]

This is a Conditional Split. It receives a control token from the channel C and a data value from channel L every cycle. The value from C determines on which of channels R0 or R1 the received data value is sent. If c = 0, then v is sent on channel R0.
If c = 1, then v is sent on channel R1.

∗[ C?c; L?v; [ c=0 → R0!v ▯ c=1 → R1!v ] ]

This is a Conditional Merge. Like the conditional split, this receives a control token from the channel C every cycle. This control token determines from which of channels L0 or L1 to receive the data value v. v is then forwarded on the channel R.

∗[ C?c; [ c=0 → L0?v ▯ c=1 → L1?v ]; R!v ]

This is a Non-deterministic Merge. If a token arrives on channel L0 before L1, then the data value v is read from L0. If it arrives on channel L1 first, then v is read from L1. If tokens arrive on both channels simultaneously, then one is selected arbitrarily. v is then sent on channel R.

∗[[ L0 → L0?v | L1 → L1?v ]; R!v ]

This is a Full Adder. It reads data from channels A, B, and Ci and sends the resulting sum on channel S and carry out on channel Co.

∗[ A?a, B?b, Ci?c; s := a+b+c; S!s0, Co!s1 ]

APPENDIX B
PRS NOTATION

In a Production Rule Set (PRS), a production rule is a compact way to specify a single pull-up or pull-down network in a circuit. An alias a = b aliases two names to one circuit node. A rule G → A represents a guarded action where G is a guard and A is a dataless assignment, as described in Appendix A. A gate is made up of multiple rules that describe the up and down assignments. The guard of each rule in a gate represents a part of the pull-up or pull-down network of that gate, depending upon the corresponding assignment. If the rules of a gate do not cover all conditions, then the gate is state-holding with a staticizer. For such a gate driving a node X, the internal node before the staticizer is referenced as _X. Finally, given a source S, a pass gate is specified with @S ∧ G → A or ¬@S ∧ G → A, depending upon the assignment.

B.1 Examples

Fig. 105 shows two examples of gates expressed by production rules.

Fig. 105: Resulting gate for asymmetric C-element (left) and pass-transistor XOR (right).

Asymmetric C-Element
(¬A ∨ ¬B) ∧ ¬C → S⇂
B ∧ (A ∨ D) → S↾

Pass-Transistor XOR
@B1 ∧ ¬A1 ∨ @B0 ∧ ¬A0 → S↾
¬@B1 ∧ A0 ∨ ¬@B0 ∧ A1 → S⇂

REFERENCES

R.1 History

[1] Raúl Rojas and Ulf Hashagen. “Reconstruction of the Atanasoff-Berry Computer.” MIT Press, 2002.
[2] David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. “Parallel Computer Architecture: A Hardware/Software Approach.” Gulf Professional Publishing, 1999, Pages 15-16.
[3] Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. “Quantitative analysis of culture using millions of digitized books.” Science, 2011.
[4] Stephen Dolan. “mov is Turing-complete.” Personal Publication, 2013.
[5] Chris Domas. “M/o/Vfuscator2.” Github, June 2015.
[6] Patrice Roussel, et al. “Method and apparatus for staggering execution of a single packed data instruction using the same circuit.” U.S. Patent No. 6,230,257. 8 May 2001.
[7] Ronny Ronen, Alexander Peleg, and Nathaniel Hoffman. “System and method for fusing instructions.” U.S. Patent No. 6,675,376. 6 Jan. 2004.
[8] Patrice Roussel, et al. “Method and apparatus for staggering execution of an instruction.” U.S. Patent No. 6,425,073. 23 Jul. 2002.
[9] Intel. “Energy-Efficient, High Performing and Stylish Intel-Based Computers to Come with Intel® Core™ Microarchitecture.” Intel Developer Forum, San Francisco CA, March 2006. (mirror)