A Reconfigurable Analog Substrate for Highly Efficient Maximum Flow Computation Gai Liu and Zhiru Zhang School of Electrical and Computer Engineering, Cornell University, Ithaca, NY {gl387, zhiruz}@cornell.edu Abstract We present the design and analysis of a novel analog reconfigurable substrate that enables fast and efficient computation of maximum flow on directed graphs. The substrate is composed of memristors and standard analog circuit components, where the on/off states of the crossbar switches encode the graph topology. We show that upon convergence, the steady-state voltages in the circuit capture the solution to the maximum flow problem. We also propose techniques to minimize the impacts of variability and non-ideal circuit components on the solution quality, enabling practical implementation of the proposed substrate. Our performance evaluation indicates two to three orders of magnitude improvements in speed and energy efficiency compared to a standard CPU implementation. In the last part of this report, we also discuss the major limitations of the current design, and suggest promising research directions. 1. Introduction Overcoming the performance and efficiency limitations of traditional von Neumann architecture has been one of the major themes in the recent computing research. Exploring novel architectures and computing paradigms becomes even more imperative as power-limited technology scaling fails to offer the same performance improvement that we have seen in the past few decades [14]. This trend is motivating researchers to investigate new technologies and alternative models of computations to provide the next wave of major improvements in computing. An active direction of research seeks to improve performance and energy efficiency through domain-/application-specific hardware customization using specialized accelerators such as ASICs, FPGAs, and GPGPUs [9]. Parallel to this direction, more disruptive computing paradigms are being examined, including approximate computing [20, 29], neuromorphic computing [33], and quantum computing [26]. These disruptive computing paradigms are expected to achieve significant improvements in performance and energy efficiency for certain domains of applications compared to the traditional approaches. In this work, we investigate using a physics-based analog computing system as an unconventional approach to solving an important optimization problem — the maximum flow (max-flow) problem. The max-flow problem is one of the most pervasive optimization problems in applications such as transportation [38], Internet traffic scheduling [40], VLSI CAD [10], and many other emerging applications including computer vision [6] and data mining [15]. These emerging applications often deal with large-scale graphs, which need to be analyzed under ever-demanding performance and energy efficiency requirements. Needless to say, a new method that solves the max-flow problem 1 efficiently in both time and energy consumption would be of great interest to the CAD community as well as the field of computing in general. However, the max-flow problem has been shown to be computationally demanding [19] and difficult to parallelize [18]. We take a complementary approach to solving the max-flow problem by constructing carefully engineered electrical circuit and mapping the steady-state behavior of the circuit to the solution of the corresponding max-flow problem. We propose a reconfigurable analog substrate composed of traditional analog devices including resistors, diodes and operational amplifiers (op-amps), as well as an emerging device called memristor. The substrate encodes graph topology using the on/off states of the memristor switches, and leverages Kirchhoff’s circuit laws to enforce flow constraints, which are represented in the form of circuit node voltage values. The max-flow objective, defined as the net flow out of the source vertex, is then driven to its maximum value. Upon convergence, the steady-state node voltages represent the solution to the max-flow problem. This work is inspired by recent proposals on solving optimization problems using analog circuits [42, 41, 24] and approximate algorithms for the max-flow problem [8, 3, 32]. In particular, several key circuit structures of our substrate build on Vichik and Borrelli’s seminal work on constructing problem-specific analog circuits to solve linear and quadratic programs [42]. Our work differs from [42] by specializing the analog circuits for the max-flow problem, which essentially is a more restricted linear program. In addition, our analog substrate is programmable and capable of solving different instances of the max-flow problem by reconfiguring the electrical properties of the underlying circuit components. More specifically, our main technical contributions are as follows: 1. We present a reconfigurable analog substrate that efficiently solves max-flow problems, and show that the proposed substrate optimally solves the max-flow problem under ideal assumptions. 2. We study the convergence time and power consumption of the reconfigurable analog substrate, and demonstrate promising results in performance and energy efficiency. 3. We analyze the influence of non-ideal circuit components and process variation on the solution quality, and propose techniques to mitigate the impact of these non-ideal factors. This report extends our previous paper in [30] by including more elaborate discussions on the comparison with related work, major limitations of our current proposal, as well as the potential research directions for addressing these limitations. The remainder of the report is structured as follows: Section 2 describes the circuit mechanism of our substrate; Section 3 introduces the reconfigurable crossbar architecture; Section 4 provides the implementation details of the substrate; Section 5 compares the performance of our proposed analog substrate with the software implementation on a CPU; Section 6 further discusses the circuit dynamic behaviors and points out the current limitations of this work and future directions; Related work is discussed in Section 7 followed by conclusions in Section 8. 2. Computing Max-flow using Analog Circuit The max-flow problem we are interested in solving is stated as follows: Given a directed graph1 G = (V, E) with n = |V | vertices and m = |E| edges, we distinguish two vertices, a source vertex s and sink vertex t, and assign each edge e a nonzero integral capacity ce. A s − t flow is a function f : E → R such that for every vertex ni other than s and t, the flow coming into ni is equal to the flow going out of ni, and 0 ≤ f (e) ≤ ce for all e ∈ E. The value |f | of a s − t flow is defined to be 1 Max-flow problem on an undirected graph can be converted to an equivalent problem on a directed graph by allocating two opposite edges with the same capacity as the edge in the original undirected graph. 2 the net flow out of the source vertex. The max-flow problem asks for finding a feasible s − t flow in G of the maximum value. The circuit used to compute max-flow is constructed in a direct mapping manner, i.e., for every edge, there is a circuit component to ensure edge capacity constraint; and for every node, there is another circuit component to enforce flow conservation constraint; there is also a circuit component for realizing the functionality of max-flow objective function. It is worth noting that several important circuit structures are based on the proposal in [42]. While the analog circuits in [42] are designed for generic linear and quadratic programming, ours are customized to fit the max-flow problem for enabling reconfigurability and reduced circuit complexity. 2.1 Edge Capacity Constraint Edge xi with capacity ci ci xi Corresponding circuit node for edge xi xi ci Figure 1: Circuit component for edge capacity constraint. For each edge capacity constraints on edge xi, we create a circuit component composed of two diodes and one voltage source, as shown in Figure 1, where the capacity on edge xi is ci. One of the two diodes is directly connected to the circuit node xi, with its anode connected to ground. Ideally, the diode will turn on whenever the voltage on circuit node xi is below zero2. Since the ideal diode in on-state behaves like an ideal wire, the voltage on circuit node xi will always be non-negative. Similarly, the other diode and a voltage source of value ci enforce that the circuit node voltage at xi never exceeds ci. Thus we have the following relation 0 ≤ Vxi ≤ ci (1) for every node xi that corresponds to an edge xi in the original graph, where Vxi is the voltage value on circuit node xi, and ci is the corresponding edge capacity on edge xi. 2.2 Flow Conservation Constraint … … … x1 y1 x2 nt y2 xk yj x1 P x1- x2 -R1 x2- … xk -R2 xk- -Rk nt -R y1 y2 yj Figure 2: Circuit component for flow conservation constraint. 2 Note that in practice we need to account for the turn-on voltage of the diode by adjusting the values of the voltage sources. 3 Flow conservation constraint is enforced using the circuit in Figure 2. For each outgoing edge yi (i = 1, 2, . . . , j) of a vertex nt in the original graph, the corresponding circuit node yi with voltage value Vyi is connected to node nt via a resistor; while for each incoming edge xi(i = 1, 2, . . . , k) of the vertex nt, the corresponding node of negated voltage value Vx−i is connected to node nt. All positive resistors have the same resistance value r. Negative resistors −R1, . . . , −Rk all have the same resistance value of −r/2, and the negative resistor connected to nt has resistance value R = r N , where N = j + k. Compared to the equality constraint circuit in [42], the main simplification is that we set all the positive resistances to the same value r, since the flow conservation constraints all have 1 or −1 coefficients. In the following, we show that the circuit in Figure 2 indeed enforces flow conservation constraint on node nt using Kirchhoff’s current law (KCL). If we denote the voltage value on node P in Figure 2 as VP , we have the following relation according to KCL: Vx1 − VP + Vx−1 − VP = VP r r −R1 (2) Since −R1 = −r/2, we have Vx1 = −Vx−1 Similarly, we have Vxi = −Vx−i for all i = 1, 2, . . . , k. Furthermore, based on KCL we also have the following relation on node nt: (3) k Vx−i − Vnt + j Vyi − Vnt = Vnt r r −R i=1 i=1 Plugging R = r k+j into Equation (4): (4) kj Vx−i + Vyi = 0 i=1 i=1 (5) Using the result from Equation (3), we obtain the following relation that satisfies flow conservation kj constraint: Vxi = Vyi. i=1 i=1 2.3 Max-flow Objective The objective function is implemented using the circuit in Figure 3, which is also similar to the one used in [42], but with the simplification that all positive resistors have the same resistance value of r. Intuitively, the voltage source Vflow will drive the node voltages as large as possible, subject to the capacity and conservation constraints. As a result, the steady-state voltages of the circuit nodes reflect the maximum possible flow through the network that does not violate the requirements of the constraints. In the following, we derive the steady-state solution of the circuit using basic circuit theorems, and show that the steady-state node voltages represent the solution for the max-flow problem. We will first prove that the value of s − t flow will strictly increase as Vflow increases by showing that the constructed resistor network is passive. Combining this fact with the flow conservation constraint and 4 … … …… … x1 s x2 xt Vflow x1 x2 xt Figure 3: Circuit component for implementing max-flow objective. x1 r S x2 -R1 Vflow xt -R2 -R3 (a) Original resistor network. x1 r S x2 -R1 Vflow xt -R2 -R3 (b) Resistor network after reducing the resistance of certain positive resistors to zero. Iflow Vflow x1 Req(1) S x2 Req(2) xt Req(t) (c) Lumped resistor network with equivalent resistance. Figure 4: Replacing resistor network with equivalent resistance. edge capacity constraint derived in Sections 2.1 and 2.2, we can show the optimality of the circuit solution. A formal proof of optimality for general LP and QP analog circuit has been proposed in [42]. Their proof of optimality also applies to our max-flow circuit. However, since the max-flow problem is a 5 restricted form of linear programs (i.e., all the coefficients in the objective function and the linear constraints are one, and all the inequality constraints only involve a single unknown variable), we can devise a more concise and intuitive proof of optimality for the max-flow problem, as detailed below. We first show that the equivalent resistance of the resistor network in the max-flow circuit is positive. Figure 4a shows a generic circuit for solving the max-flow problem, where nodes x1 to xt are the subset of nodes that are directly connected to Vflow. In Figure 4a, we mark the resistor that directly connects to node x1 with its resistance value r. We also list all the negative resistors in the network as −R1, −R2, . . . , −Rk. Note that according to the construction of the circuit, we have | − Ri| < r for all the negative resistors. Now we find the equivalent resistance of the resistor subnetwork S to the right of node x1, shown in the dashed box in Figure 4a. If we set all the resistance of the positive resistors except r in S to be zero, the resulting resistor network consists only of the resistors shown in Figure 4b. Since each of the resistance |Ri| is strictly less than r, all the parallel negative resistors together also have an equivalent resistance that is less than r. As a result, the equivalent resistance of S, which is the sum of r and the equivalent resistance of these negative resistors, is always positive. Figure 4c shows the equivalent circuit of Figure 4a, where Req(i) > 0 for all i = 1, 2, . . . , t. Each circuit branch in Figure 4c is a basic voltage divider. Since all the remaining positive resistors have a resistance value of r, we have Vxi = V Req(i) f low Req(i)+r for all i = 1, 2, . . . , t. Note that we have shown Req(i) > 0 for all i = 1, 2, . . . , t, thus the steady-state solution Vxi will strictly increase as Vflow increases, as long as edge capacity constraints and flow conservation constraints are satisfied. Summing up all the node voltages, we have t i=1 Vxi = Vf low t i=1 Req(i) Req(i) + r = Vf low t (1 − i=1 r Req(i) ) +r = tVflow − rIflow J (6) J in Equation (6) plays the role of objective function in the max-flow problem. Vflow in the circuit is set to a high voltage value, so that Vflow will try to maximize the node voltages in the circuit. Combining this result with the derived edge capacity constraint and flow conservation constraint, we have: t J = tVflow − rIflow = Vxi i=1 kj Vxi = Vyi i=1 i=1 0 ≤ Vxi ≤ ci (7a) (7b) (7c) where Equation (7b) holds for all vertices with incoming edges xi and outgoing edges yi, and Equation (7c) holds for all edges in the graph. Equation (7) is identical to the original max-flow problem. Thus we conclude that the circuit solution is equivalent to the solution of the corresponding max-flow problem. 2.4 An Example Figure 5b shows an example circuit that solves a max-flow problem instance in Figure 5a. A step function is first applied to Vflow that drives Vx1 to its maximum value 3V . Then the circuit components 6 2 n2 1 s 3 x1 n1 x2 1 x3 n3 x4 2 x5 t (a) An example max-flow problem. 2V 3V x2- x1 x1- x2 -R -R2 Vflow -R -R1 x3 1V -R x3- -R3 x4 1V 2V x5 Voltage (V) 3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.0 0.0E+00 (b) The corresponding circuit. V(x1) V(x2) V(x3) V(x4) V(x5) 5.0E-09 1.0E-08 1.5E-08 Time (s) 2.0E-08 2.5E-08 (c) Waveform of the node voltages. Figure 5: An example of solving max-flow using analog circuit. implementing flow conservation constraints start working together to bring the other nodes to steady state. Vx3 and Vx4 will reach their maximum values of 1V , corresponding to the case that flows on edge ex3 and ex4 reach their maximum value of 1. Then Vx1 and Vx2 in return bring Vx1 to its final value of 2V . Figure 5c shows the waveform of the node voltages as a function of time. 3. Reconfigurable Crossbar Architecture Constructing a custom circuit for each graph instance is generally impractical as fabricating a custom circuit is both expensive and time consuming. We propose to construct a reconfigurable substrate 7 n rows s Vflow n1 s n1 n2 n3 … t n2 n3 t -Rs -R1 -R2 -R3 -Rt n columns Memristor switch ni xij Circuit widget for eij -R nj Figure 6: Reconfigurable crossbar architecture. that can be configured to encode different graph topologies, enabling the solution of various problem instances using a single substrate. We propose to use memristors as the reconfigurable switches in our substrate. The memristor [22] is a two terminal device with a metal-insulator-metal sandwiched structure whose resistance (called memristance) can be modulated by external stimulus. Memristor can be switched between the high-resistance state (HRS) and the low-resistance state (LRS) by external voltage stimulus, and retains its resistance value when external stimulus is removed or below a certain threshold. Our reason for choosing memristors as switches is twofold: (1) memristor-based switches are more area-efficient than the traditional SRAM-based switches; (2) since resistors are widely used on our substrate, we can directly replace the resistors with memristors so that one single memristor acts both as a switch and as a resistor. Namely, A memristor in HRS represents a disconnected switch; A memristor in LRS represents both a connected switch and a resistor whose resistance is equal to the LRS memristance. Additionally, using LRS memristors as normal resistors enables the fine-tuning of memristance in LRS, which helps mitigate the impact of process variation and parasitic resistance, as will be discussed in Section 4.3.2. Our substrate is composed of an n × n crossbar as shown in Figure 6. At each intersection (ni, nj), there is a small circuit widget connecting into the crossbar through a memristor switch. This circuit widget represents an edge xij. Essentially, we can view the crossbar as a physical representation of the adjacency matrix of the graph. If edge (i, j) is present in the graph, the memristor switch at position (ni, nj) will be set to LRS. Otherwise, the memristor switch at position (ni, nj) will be set to HRS. When properly configured, the crossbar architecture implements the circuit described in Section 2. We implement the max-flow objective using the first row of the crossbar, where the memristor switch at position (s, ni) will be turned on if and only if edge (s, i) is present in the graph. We enforce the edge capacity constraint using two diodes at each intersection, and honor the flow conservation constraint by connecting column ni with all the edges incident to ni. Figure 7 shows the mapping of the example circuit to the crossbar. By examining the circuit widgets at the intersections, we can verify that the crossbar correctly implement the circuit in Figure 5b.3 3 Since there is no outgoing edge from sink t, the negative resistors in the circuit widgets of column t can be omitted without affecting the correctness of the circuit. 8 widget for es1 widget for e12 3V Vflow x1 -R x1-R1 widget for e13 widget for e2t 2V x2- x2 -R -R2 1V x3 -R x3--R3 widget for e3t x4 x5 Figure 7: Mapping the example circuit to the crossbar. 1V 2V 3.1 Configuring the Crossbar In the configuration stage, we program the memristors in the crossbar to the desired resistance state using voltage pulses. Vflow is set to zero in this stage. The programming stage takes n cycles to complete, one cycle for each row. At the ith programming cycle, row i is activated by setting the row wire to a low voltage Vlow, while keeping all the other rows at 0V. For every memristor that needs to be set to LRS, the corresponding column will be set to a high voltage Vhigh, while keeping all the other columns at 0V. By setting the voltage difference (Vhigh − Vlow) greater than the memristor threshold voltage, the desired memristors will be set to LRS, while all the other memristors stay in HRS. After n cycles of configuration, all the memristor switches are set to the desired state. 3.2 Computing Max-flow on the Crossbar At the beginning of the computing stage, a step function is applied to Vflow, and the circuit starts converging to steady state. To avoid the complexity of detecting steady-state condition, we run the circuit for a period of time t. t should be long enough to ensure that the circuit converges, which can be determined by profiling the worst-case execution time of the substrate. After convergence, we measure the current through the voltage source Vflow and calculate the objective value (i.e. value of max-flow) based on Equation (7a). 4. Implementation We have shown that the proposed substrate generates optimal solution for max-flow problems under several assumptions: (1) one voltage source for each edge with exact voltage level; (2) ideal negative resistors and diodes; (3) no process variations or parasitic resistance. In this section, we analyze the impact of these assumptions on the solution quality, and propose techniques that enable a practical implementation of the proposed substrate. 4.1 Voltage Level Quantization In practice, it is too expensive to use one distinct voltage source on the substrate for each edge. Also, the output voltage level of a voltage source cannot be arbitrarily large. To address these two issues, we propose the following quantization scheme that maps the edge capacities to a set of discrete voltage levels. Given a supply voltage of value Vdd, we uniformly divide the voltage interval [0, Vdd] into N 9 voltage levels: V = { 1 N Vdd , 2 N Vdd, . . ., N −1 N Vdd, Vdd }. For each voltage level in V, we construct one voltage source with the corresponding voltage value. One voltage source will be used for multiple edges if these edges share the same voltage level after quantization. The quantization function Q uniformly maps the edge capacities of a max-flow instance into the voltage levels in set V. Specifically, given an edge capacity x, Q maps x to one of the voltage levels in V: Q(x) = (x/C )N N Vdd, where C is the largest edge capacity of the max-flow instance. The solution Y = {Vx1, Vx2, . . . , Vxm} obtained from the circuit is then mapped back to the interval [0, C ]: Y˜ = Y× .Vdd C Y˜ is an approximate solution of the original problem with quantization errors. The worst-case quantization error e of the flow on one edge is bounded by the quantization step: e = C N with C being the largest edge capacity and N being the number of voltage levels. Note that the choice of N provides a trade-off between accuracy and cost— increasing N reduces the worst-case quantization error, but also increases circuit complexity and cost. Figure 8 illustrates the application of voltage quantization on the same graph in Figure 5a with N = 20 and Vdd = 1V , the result shows a 5% deviation in the final flow value |f |. 2 n2 1 Vdd=1V, N=20 0.65V n2 0.35V s 3 x1 n1 x2 1 x4 t 2 s 1V x1 n1 x2 0.35V x4 0.65V t x3 n3 x5 x3 n3 x5 Before quantization Exact solution: |f|=2 After quantization Circuit solution: 0.7V Approximate solution: |f|=2.1 Figure 8: An example of voltage level quantization. 4.2 Negative Resistors We can view a negative resistor as a current-controlled voltage source with the transresistance equal to −R, which can be implemented using an op-amp, as shown in Figure 9a. The precision of the negative resistor is determined by the open loop gain A of the op-amp. To derive the precision of the negative resistor, we calculate the effective resistance Reff of the negative resistor as Reff = −(1 + )R ,1 R0 A Rtarget target where R0 Rtarget is a constant that is close to 1. Thus the precision of the negative resistor is inversely proportional to the gain A of the op-amp. Since modern op-amps can easily achieve gain A > 1000, indicating a precision of ±0.1%, we conclude that the implementation of op-amp has negligible impact on the precision of the negative resistor. 4.3 Process Variation and Parasitic Resistance The actual resistance of the resistors deviates from their nominal values due to unavoidable process variation and parasitic effects, which degrade the solution quality. In this section, we first make an important observation that the solution quality depends only on the ratio of the resistance values instead of the absolution resistance values. This enables the use of layout matching techniques to minimize variations. We will also discuss post-fabrication resistance tuning to further reduce the impact of resistance variation. 10 Reff + R0 t Rtarget - R0 (a) Negative resistor using operational amplifier. x r1 P r2 x- -R3 Vx− = − r2 r1 Vx + VP(1 + r2 r1 − r2 R3 ) (b) Resistance tuning circuit. Figure 9: Negative resistor and the resistance tuning circuit. 4.3.1 Resistance Matching Although most integrated resistors have tolerances of ±20 to 30%, the tolerance of the ratio between two resistors can be controlled to be better than ±1%, and in many cases, to be within ±0.1% [21]. We have shown in Section 2 that the circuit node voltages are only functions of ratios of resistances. Since the circuit node voltages in steady state represent the flow value, we are encouraged to discover that the solution quality is also only a function of resistance ratios, independent of the absolute resistance values. Consequently, the substrate is expected to be highly insensitive to process variation. However, resistance compensation techniques are still needed to counter the other sources of variations such as parasitic resistance. 4.3.2 Resistance Tuning The fact that all the resistances in the circuit are realized by memristors makes it possible to conduct post-fabrication fine-grained resistance tuning [28]. This is especially useful to minimize the influence of parasitic resistance. We outline the tuning process as follows. To begin with, we configure the substrate to implement the tuning circuit shown in Figure 9b, which is a simple circuit that enforces Vx = −Vx−. For each of the tuning circuit, we first set Vx = 0 and modulate R3 until Vx− = 0. This step ensures 1 R3 = 1 r1 + 1 r2 . We then set Vx = 1V , and collectively change the values of r1 and r2 until Vx− = −1V . These two steps can be iterated for a few times for better tuning precision. We note that in principle the tuning process has to be done only once right after fabrication, thus no performance overhead is incurred afterwards. However, since the memristance may slowly drift over a long period of time, the tuning process can be repeated based on the endurance of the memristors. 5. Performance Evaluation We construct a 1000 × 1000 substrate in circuit-level netlist and simulate the substrate in SPICE [35]. The detailed parameters used in the simulation are listed in Table 1. The parameters for memristors 11 and op-amps are set to typical values from recent literature [5, 13]. We will first evaluate the execution time for various benchmarks, then estimate the power consumption of the substrate using an analytical method. Table 1: Design parameters for the max-flow computing substrate. Memristor LRS resistance (kΩ) Memristor HRS resistance (kΩ) Objective function voltage Vflow (V ) Open loop gain of op-amp Gain-bandwidth product of op-amp (GHz) Number of columns in the crossbar Number of rows in the crossbar Number of voltage levels 10 1000 3 1 × 104 10 to 50 1000 1000 20 5.1 Convergence Time The convergence time of the substrate is measured as the time interval between the rising edge of Vflow and the timestamp when the flow value is within 0.1% of the final value. To model the effect of parasitic capacitances, we add a parasitic capacitance of 20fF to each circuit net. We compare the convergence time against the execution time of the widely used push-relabel algorithm running on a server with 3GHz Intel Xeon processor and 16GB of RAM. The push-relabel algorithm is compiled using GCC 4.4.7 with -O3 optimization option. When measuring execution time on CPU, we exclude the time taken to read input file and the time taken to generate internal data structure for fair comparison. We use R-MAT synthetic graph generator [7] to generate both dense (|E| ∝ |V |2) and sparse (|E| ∝ |V |) graphs. The number of nodes in the generated graph ranges from 200 to 1000, and the number of edges ranges from 500 to 8000. We plot the convergence time of the substrate and the execution time of the push-relabel algorithm in Figure 10. Using op-amps with 10G gain-bandwidth product (GBW), our substrate achieves a range of speedups between 150X and 1500X compared to the software implementation. If we further increase the GBW to 50G, we can achieve an additional 10X speedup. Such dramatic speedup is not too surprising as the circuit operates vastly different and more efficient than the CPU — instead of sequentially executing millions of instructions, the circuit converges to the optimal solution in a massively concurrent fashion, with physics being the driving force, e.g., pulling up and pulling down of the node voltages. We also plot the relative error of the circuit solution compared to the optimal solution in Figure 10. The deviations from the optimal solutions are due to the quantized voltage levels and non-ideal circuit components. We can observe that in all cases the relative error is within 8%, with an average relative error of 3.7% for dense graphs and 5.4% relative error for sparse graphs. 5.2 Analysis of Power and Energy Consumption We analytically model the power consumption of the substrate with the goal of showing the general trend of power consumption as a function of circuit size. The model provides insights on how a given power budget impacts the overall design. Here we only consider the power consumption of the substrate itself, thus the power consumption of the analog-digital conversion between the digital computer and the analog substrate is not included. The majority of power is consumed by op-amps 12 Relative Error Convergence Time (s) 1.0E-02 Circuit GBW=50G Push-relabel Circuit GBW=10G Relative Error 10% 1.0E-03 8% 1.0E-04 6% 1.0E-05 4% 1.0E-06 2% 1.0E-07 0% 256 320 384 448 512 576 640 704 768 832 896 960 Number of Vertices (a) Dense graph: |V | ∝ |E|2. Circuit GBW=50G Circuit GBW=10G Push-relabel Relative Error 1.0E-02 10% 1.0E-03 8% 1.0E-04 6% 1.0E-05 4% 1.0E-06 2% 1.0E-07 0% 256 320 384 448 512 576 640 704 768 832 896 960 Number of Vertices (b) Sparse graph: |V | ∝ |E|. Figure 10: Convergence time of synthetic benchmarks. Relative Error Convergence Time (s) and resistors. For resistors, we observe that it is possible to proportionally scale up the resistor values without influencing the correctness of the solution (Section 4.3.1). Scaling up the resistance reduces the average current through the resistors, thus reducing the power consumption of the resistors. Consequently, the op-amps are the major source of power consumption. According to the circuit structure, one op-amp is needed for each edge that is present in the graph.4 Additionally, one opamp is needed for each node to enforce flow conservation constraint. Thus the power consumption for solving a graph of |V | nodes and |E| edges is roughly (|E| + |V |)Pamp, where Pamp is the average power consumption of an op-amp. The maximum graph size that a substrate can support is limited by the power budget. Given a power budget of Ptot and assuming |V | |E|, then the substrate can support roughly up to Ptot/Pamp number of edges. Assuming a 1V supply voltage and an average current of 500µA for the op-amp, which is typical for the 32nm technology node, Pamp evaluates to 500µW. Given a power budget of 4 Note that if an edge is not present in the graph, the corresponding op-amp can be power gated, thus consuming no power. 13 Ptot = 5W, which is typical for an embedded system, we can accommodate up to 104 active edges on the substrate. Increasing the power budget to Ptot = 150W, a typical value for high-end servers, the number of active edges that the substrate can support boosts to 3 × 105. This estimation shows that the substrate is indeed promising in solving large-scale max-flow problems. Although in this case power consumption of the substrate is comparable to CPUs, the energy efficiency is drastically higher because our substrate is around 150X to 1500X faster than the CPUs. 6. Discussions and Future Directions In this section, we shall discuss the major limitations of our current work, as well as the potential directions to mitigate some of these limitation. Specifically, we propose techniques to handle large graphs, including clustered architectures, the dual min-cut formulation, and graph decomposition. We also discuss the dynamic behavior of the circuit and propose to investigate its asymptotic rate of convergence. 6.1 Limitations of Current Work While our simulation results show that the proposed analog substrate can achieve substantial speedup over traditional CPU execution, several major challenges still remain that limit the scalability and accuracy of the current proposal: 1. Handling large graphs – Our current implementation cannot handle large graphs whose corresponding circuits exceed the area constraint of the analog circuit. We propose techniques that could improve the scalability of our approach through clustered architectures (Section 6.2) and graph decomposition (Section 6.4). 2. Asymptotic behavior of the circuit – Since our current performance evaluation is based on relatively small graphs, the asymptotic behavior of the circuit remains to be explored. To this end, we provide insights and initial studies on the dynamic circuit behavior in Section 6.5. 3. Reading out individual flow values – The current implementation can only output the final flow value. Reading out individual flow values on the edges require sensing the voltage values at the internal nodes in the substrate. Additional circuit-level techniques need to be developed to address this limitation. 4. Consideration of wire resistances and temperature effects of the resistors – The wire resistances pose significant challenge to the accuracy of the solution. We envision that techniques in Section 4 can be extended to account for wire resistances. In addition, the analog substrate relies on the precise resistance values of the resistor to generate highly accurate solutions. However, resistance of the resistors is temperature dependent. Compensation techniques need to proposed to mitigate variations due to temperature effects. 5. Digital-analog interface – In this work, we did not consider the interface between digital computer and the analog substrate. ADC and DAC modules need to be carefully designed to maximize system accuracy, as the precision of these modules will have an impact on the final solution quality. 6.2 Clustered Island-Style Architectures Because of its simple and regular nature, the mesh-based architecture permits very efficient graph to substrate mapping. However, the utilization of the circuit components in the mesh will be low for sparse graphs. To exploit graph sparsity and further improve the scalability of the substrate, we propose to explore clustered architectures, which are structured in a similar fashion to the popular 14 mesh Processing  island Processing  island ... Processing  island CB CB (a) CB Processing  island CB Processing  island CB CB SB CB SB Processing  island CB Processing  island CB CB Switch Box SB CB (b) SB Figure 11: Clustered architectures. (a) one-dimensional structure (CB = connection box); (b) twodimensional structure (SB = switch box). island-style field-programmable gate arrays (FPGAs) [4]. More specifically, a clustered architecture includes a collection of mesh-based processing islands which are interconnected by a routing network. For a sparse graph, the highly connected subgraphs will map to separate processing islands, while the connections between these subgraphs will use the routing network. Figure 11 illustrates two interesting kinds of clustered architectures with one- and two-dimensional routing structure, respectively. These two architectures exhibit different characteristics in flexibility and complexity. We hypothesize that the one-dimensional (1-D) architecture in Figure 11a is more amenable for fast graph mapping but less scalable for large-scale graphs due to limited routing 15 Subject to: ,∈ 0 , ∈ 1 0 ∈ 0 , ∈ Figure 12: Linear program of the min-cut problem. pi indicates the partition that vertex i belongs to, dij indicates if edge (i, j) is in the cut set, and cij captures the edge capacities. Vobj … 1/c11 d11 1/c12 d12 pi pi‐ ni ‐R1 ‐R2 1/cij dij dij pt pj pt‐ ni ‐R1 ‐R2 ps 1V ,∈ (a) 0 , ∈ (b) 1 (c) Figure 13: Circuit components for min-cut problem. (a) Objective function. (b) Graph cut constraint. (c) Source/sink constraint. flexibility. On the other hand, the two-dimensional (2-D) architecture in Figure 11b is potentially more flexible but requires more complex circuits with switch boxes and more sophisticated CAD algorithms for efficient clustering, placement, and routing. We plan to study these interesting yet intricate trade-offs by exploring various 1-D and 2-D architectures. Furthermore, we propose to explore a set of important architectural parameters to carefully navigate the tension between architectural flexibility and complexity. Examples include the granularity of each island (i.e., local mesh size), homogeneity versus heterogeneity in the island configurations, and hierarchical versus two-level clustered organization. 6.3 The Dual Formulation Another interesting direction is to investigate the min-cut problem, which is the dual formulation of max-flow. Studying the min-cut problem enables us to utilize existing works on dual decomposition to handle large graphs, as detailed in Section 6.4. Figure 12 shows the linear program of the mincut problem, which partitions the nodes into two sets such that the sum of the edge capacities across these two sets, is minimized. Figure 13 illustrates how to map a min-cut problem into analog circuits. The underlying mechanism is similar to the one for max-flow discussed in the previous section. The objective function in Figure 13a tries to drive down the node voltages while satisfying the constraints enforced by circuits in Figures 13b and 13c. It is worth noting that we no longer require additional voltage sources for enforcing the edge capacity constraints. Instead, the capacity values are reflected by the objective function in the dual formulation. 16 pi ‐R1 dij ni ‐R2 pj objective function logic circuit structure  for min‐cut  constraint  elementary  cell Figure 14: The mesh-based architecture exploiting the dual formulation – Switches in cell (i, j) are closed if the corresponding edge (i, j) is present in the graph. Leveraging the dual formulation, we propose a mesh-based architecture similar to the one in Section 3, which is shown in Figure 14. This architecture forms an n × n mesh where each elementary cell in the mesh implements one constraint of the min-cut problem and uses three switches to achieve reconfigurability. Additional circuit is also introduced to implement the objective function using the mechanism shown in Figure 13a. We note that this mesh-based architecture essentially encodes the adjacency matrix of the input graph, which requires O(n2) elementary cells for any graph with up to n nodes. 6.4 Graph Decomposition The clustered architecture with min-cut formulation can potentially allow the analog substrate to host reasonably large graphs with tens of thousands nodes and edges. To further extend beyond this limit, we propose to investigate advanced decomposition techniques to handle very-large-scale network flow problems in an iterative manner where the substrate is reconfigured and reused to efficiently solve the intermediate subproblems. In particular, we plan to explore dual decomposition techniques such as [39], which partition the dual problem of max-flow into overlapping subproblems and iteratively solve the subproblems to arrive at a global optimal solution with agreement on the overlapping variables. For example, given a min-cut objective minimize cijdij and a partition of variables i,j∈V into two sets M and N , the objective function can be reformulated as minimize (EM (x) + EN (y)), subject to xi = yi, i ∈ M ∩ N , where EM (x) = cij xij − 1 2 cijxij, and EN (y) = i,j∈M i,j∈M ∩N cij yij − 1 2 cijyij. Here variable xi belongs to set M and yi belongs to set N . The i,j∈N i,j∈M ∩N variables in the overlap are constrained to be equal. The Lagrange dual function of this optimization 17 problem is: g(λ) = minimize (EM (x) + EN (y) + λi(xi − yi)) = minimize (EM (x) + λixi) + minimize (EM (y) − i∈M ∩N λiyi). Evaluating this function g is thus equivalent to i∈M ∩N i∈M ∩N solving two independent min-cut problems, and strong duality ensures xi = yi, i ∈ M ∩ N when g is maximized. 6.5 Circuit Dynamic and Asymptotic Behavior We have shown that the steady-state voltage values of the circuit converge to the solution of the maxflow problem. We are also interested in learning the dynamic behavior of the circuit, namely, how the node voltages evolve as a function of time. The exact voltage trajectory depends on the drive voltage Vflow. The step function used in the previous sections induces complex transient behavior that require detailed numerical analysis. To gain intuition for circuit convergence, we examine a simpler scenario where we restrict Vflow to be a slow-varying function, so that at any given time step it is valid to approximate the circuit behavior using the corresponding steady-state solution, also known as the quasi-static approximation. Consider the max-flow problem shown in Figure 15a, with the objective function and constraints shown in Equation (8). To simply the discussion, two of the edges have infinite edge capacity so that these two edges will not restrain the final flow value. Figure 15b shows the corresponding circuit for the max-flow problem, where the diodes and voltage sources due to edge capacity constraints are omitted for brevity. Figure 15c shows a graphical illustration of the solution space and how the circuit reaches the final voltage values, with the feasible region shown in green, and the voltage trajectory shown in red. maximize x1 x1 = x2 + x3 0 ≤ x1 ≤ 4 0 ≤ x2 ≤ 1 0 ≤ x3 ≤ 4 (8a) (8b) (8c) (8d) (8e) Assuming initially Vflow = 0, we solve for the voltage values of this circuit and show that Vx1 = 2 9 Vf low and Vx2 = Vx3 = 1 9 Vf low . We slowly increase the value of Vf low until Vx2 reaches its maximum Vx2 = 1V , corresponding to point D in Figure 15c. At this point Vflow = 9V and the diode at x2 (not shown in the circuit diagram) comes into play to hard-wire Vx2 to 1V. We then add the constraint Vx2 = 1V to the circuit, and solve it again to obtain Vx1 = 1 + Vx3, Vx2 = 1V , and Vx3 = .Vf low −4 5 We increase Vf low all the way to 19V . At this point Vx1 = 4V and Vx3 = 3V . The circuit obtains the final solution Vx1 = 4V , Vx2 = 1V , and Vx3 = 3V , corresponding to point B in Figure 15c. At this point, the objective function cannot be further improved, thus the circuit reaches the final state, corresponding to the optimal solution of the max-flow problem. The trajectory in Figure 15c shows that instead of moving across the extreme points of the feasible region, the circuit actually drives the solution through the interior of the feasible region. We conjecture there is a loose connection between the analog circuit behavior and the algorithmic behavior of the interior-point method [36], and plan to further investigate the convergence rate of the analog circuit. 18 s 4 x1 1 n1 x2 x34 n2 + + n3 t (a) Example max-flow problem. Vflow x1 x1- x2 x3 (b) The corresponding circuit with diodes omitted for brevity. x1 C(4,0,4) x3 B(4,1,3) D(2,1,1) A(1,1,0) O x2 (c) Trajectory of node voltages. Figure 15: Example circuit dynamic behavior. 19 7. Related Work There has been a large body of work seeking to break the limit of the traditional von Neumann architecture and to achieve higher performance and efficiency. Silicon implementation of artificial neural network is one of the most well-known efforts in this field. Recently, a digital neuromorphic chip with one million spiking neurons was built with a power consumption of less than 70mW [34]. Analog implementation of the neuron network is also studied to achieve speedup and power reduction for specific applications [1]. Novel devices such as memristors are utilized to construct unconventional computing substrates. Memristor-based substrates are demonstrated to solve maze [37], conduct matrix multiplication [23], as well as general approximate computation [27]. Hierarchical memristor crossbar design is proposed in [31] to improve the efficiency and flexibility of the neuromorphic computing accelerator. Our proposed clustered island-style architecture shares the same spirit of their hierarchical crossbar design. A generalized concept of universal memcomputing machine is also proposed as an alternative to the von Neumann architecture [11]. In an effort to achieve reconfigurability in analog circuit, field-programmable analog arrays (FPAAs) integrate dedicated analog components with floating-gate transistors and programmable resistor/capacitor networks, enabling reconfigurable computing in the analog domain [2]. Most relevant to our work, Vichik and Borrelli design an analog circuit to solve linear and quadratic programming problems, where unknown variables are directly modeled by node voltages and each constraint is specified by a dedicated circuit component [42]. The node voltages representing unknown variables are driven towards their optimal value subjected to the circuit constraints. However, their approach is problem specific, i.e., different instances of the analog circuit need to be built for solving different linear or quadratic programs. The structure of circuit components in our paper are built upon their work, but tailored to exploit the specific properties of max-flow problems. In addition, our substrate is reconfigurable, which is capable of solving multiple instances of the max-flow problem using the same substrate. In the field of discrete algorithms, decades of effort has aimed to develop efficient max-flow algorithm. Apart from formulating the flow constraints and solving for the maximum flow as a linear programming problem, the Ford-Fulkerson algorithm based on augmenting path sets the foundation as a specialized max-flow algorithm [16]. Among the classical approaches, more efficient algorithms include Dinitz’s blocking flow algorithm [12] and Goldberg and Tarjan’s push-relabel algorithm [17]. More recently, Christiano et al. show that near-linear time approximate algorithm for the max-flow problem is possible, where the algorithm converts the max-flow problem to solving current flow in electrical circuit [8]. By modeling the behavior of the electrical circuit, efficient approximate solution to the original max-flow problem can be obtained. This result also inspired our study. Although an ASIC or FPGA implementation of max-flow accelerator can improve performance and efficiency on a pure software execution, we note that in practice the demonstrated performance benefit is typically a moderate 3-5X speedup over CPUs using an FPGA-based accelerator [25]. In fact, it has been shown that the max-flow problem is inherently hard to parallelize [18], implying that ASIC or FPGA based hardware acceleration can only achieve limited performance improvements. This motivates us to investigate more disruptive approaches in this research. 8. Conclusions We present the design and analysis of a reconfigurable analog substrate whose steady-state node voltages represent the solution to a given max-flow problem. We achieve reconfigurability by configuring 20 the on/off states of the circuit switches that encode the graph topology. Using SPICE simulation and analytical modeling, we demonstrate two to three orders of magnitude improvements in speed and energy efficiency compared to a standard CPU implementation. We further propose techniques to minimize the impacts of variability and non-ideal circuit components on the solution quality, and provide insights on the dynamic behavior of the circuit. We hope this work can provide new insights and spark new research in building programmable analog substrate to solve even more advanced optimization problems. Acknowledgements The authors thank Steve Dai and Prof. Eilyan Bitar for fruitful discussions in the early phase of this research project. References [1] R. S. Amant, A. Yazdanbakhsh, J. Park, B. Thwaites, H. Esmaeilzadeh, A. Hassibi, L. Ceze, and D. Burger. General-Purpose Code Acceleration with Limited-Precision Analog Computation. Int’l Symp. on Computer Architecture (ISCA), 2014. [2] A. Basu, S. Brink, C. Schlottmann, S. Ramakrishnan, C. Petre, S. Koziol, F. Baskaya, C. M. Twigg, and P. Hasler. A Floating-Gate-Based Field-Programmable Analog Array. IEEE Journal of Solid-State Circuits (JSSC), 45(9):1781–1794, 2010. [3] A. A. Benczu´r and D. R. Karger. Approximating s-t Minimum Cuts in O˜ (n2) Time. ACM Symposium on Theory of computing, pages 47–55, 1996. [4] V. Betz and J. Rose. FPGA Routing Architecture: Segmentation and Buffering to Optimize Speed and Density. Int’l Symp. on Field-Programmable Gate Arrays (FPGA), 1999. [5] D. Biolek, M. Di Ventra, and Y. V. Pershin. Reliable SPICE Simulations of Memristors, Memcapacitors and Meminductors. Radioengineering, 22(4), 2013. [6] Y. Boykov and V. Kolmogorov. An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 26(9):1124–1137, 2004. [7] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A Recursive Model for Graph Mining. Int’l Conf. on Data Mining (ICDM), 2004. [8] P. Christiano, J. A. Kelner, A. Madry, D. A. Spielman, and S.-H. Teng. Electrical Flows, Laplacian Systems, and Faster Approximation of Maximum Flow in Undirected Graphs. In ACM Symposium on Theory of computing, pages 273–282, 2011. [9] E. S. Chung, P. A. Milder, J. C. Hoe, and K. Mai. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? Int’l Symp. on Microarchitecture (MICRO), 2010. [10] J. Cong and Y. Ding. FlowMap: An Optimal Technology Mapping Algorithm For Delay Optimization in Lookup-Table Based FPGA Designs. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 13(1):1–12, 1994. [11] M. Di Ventra and Y. V. Pershin. The Parallel Approach. Nature Physics, 9(4):200–202, 2013. [12] E. Dinits. Algorithm of Solution to Problem of Maximum Flow In Network with Power Estimates. Soviet Mathematics Doklady, 194(4):754, 1970. [13] A.-T. Do, Z.-H. Kong, K.-S. Yeo, and J. Y. S. Low. Design and Sensitivity Analysis of a New Current-Mode Sense Amplifier for Low-Power SRAM. IEEE Trans. on Very Large-Scale Integration Systems (TVLSI), 19(2):196–204, 2011. 21 [14] H. Esmaeilzadeh, E. Blem, R. St Amant, K. Sankaralingam, and D. Burger. Dark Silicon and the End of Multicore Scaling. Int’l Symp. on Computer Architecture (ISCA), 2011. [15] G. W. Flake, S. Lawrence, C. L. Giles, and F. M. Coetzee. Self-Organization and Identification of Web Communities. IEEE Computer, 35(3):66–70, 2002. [16] L. R. Ford and D. R. Fulkerson. Maximal Flow Through a Network. Canadian journal of Mathematics, 8(3):399–404, 1956. [17] A. V. Goldberg and R. E. Tarjan. A New Approach to the Maximum-Flow Problem. Journal of the ACM (JACM), 35(4):921–940, 1988. [18] L. M. Goldschlager, R. A. Shaw, and J. Staples. The Maximum Flow Problem is Log Space Complete for P. Theoretical Computer Science, 21(1):105–111, 1982. [19] F. Halim, R. H. Yap, and Y. Wu. A MapReduce-Based Maximum-Flow Algorithm for Large Small-world Network Graphs. Int’l Conf. on Distributed Computing Systems (ICDCS), 2011. [20] J. Han and M. Orshansky. Approximate Computing: An Emerging Paradigm for Energy-Efficient Design. IEEE European Test Symposium (ETS), 2013. [21] A. Hastings. The Art of Analog Layout. 2006. [22] Y. Ho, G. M. Huang, and P. Li. Nonvolatile Memristor Memory: Device Characteristics and Design Implications. Int’l Conf. on ComputerAided Design (ICCAD), 2009. [23] M. Hu, H. Li, Q. Wu, and G. S. Rose. Hardware Realization of BSB Recall Function Using Memristor Crossbar Arrays. Design Automation Conf. (DAC), 2012. [24] M. P. Kennedy and L. O. Chua. Neural Networks for Nonlinear Programming. IEEE Transactions on Circuits and Systems, 35(5):554–562, 1988. [25] D. Kobori and T. Maruyama. An Acceleration of a Graph Cut Segmentation with FPGA. Int’l Conf. on Field Programmable Logic and Applications (FPL), 2012. [26] M. N. Leuenberger and D. Loss. Quantum Computing in Molecular Magnets. Nature, 410(6830):789–793, 2001. [27] B. Li, Y. Shan, M. Hu, Y. Wang, Y. Chen, and H. Yang. Memristor-Based Approximated Computation. Int’l Symp. on Low-Power Electronics and Design (ISLPED), 2013. [28] B. Liu, M. Hu, H. Li, Z.-H. Mao, Y. Chen, T. Huang, and W. Zhang. Digital-Assisted Noise-Eliminating Training for Memristor Crossbar-Based Analog Neuromorphic Computing Engine. Design Automation Conf. (DAC), 2013. [29] G. Liu, Y. Tao, M. Tan, and Z. Zhang. CASA: Correlation-Aware Speculative Adders. Int’l Symp. on Low-Power Electronics and Design (ISLPED), 2014. [30] G. Liu and Z. Zhang. A Reconfigurable Analog Substrate for Highly Efficient Maximum Flow Computation. Design Automation Conf. (DAC), 2015. [31] X. Liu, M. Mao, B. Liu, H. Li, Y. Chen, B. Li, Y. Wang, H. Jiang, M. Barnell, Q. Wu, et al. RENO: A High-Efficient Reconfigurable Neuromorphic Computing Accelerator Design. Design Automation Conf. (DAC), 2015. [32] A. Madry. Fast Approximation Algorithms for Cut-Based Problems in Undirected Graphs. IEEE Symposium on Foundations of Computer Science (FOCS), pages 245–254, 2010. [33] C. Mead. Neuromorphic Electronic Systems. Proceedings of the IEEE, 78(10):1629–1636, 1990. [34] P. A. Merolla et al. A Million Spiking-Neuron Integrated Circuit with A Scalable Communication Network and Interface. Science, 345(6197):668–673, 2014. 22 [35] L. W. Nagel and D. O. Pederson. SPICE: Simulation Program with Integrated Circuit Emphasis. 1973. [36] Y. Nesterov, A. Nemirovskii, and Y. Ye. Interior-point Polynomial Algorithms in Convex Programming. SIAM, 1994. [37] Y. V. Pershin and M. Di Ventra. Solving Mazes with Memristors: A Massively Parallel Approach. Physical Review E, 84(4):046703, 2011. [38] A. Schrijver. On the History of the Transportation and Maximum Flow Problems. Mathematical Programming, 91(3):437–445, 2002. [39] P. Strandmark and F. Kahl. Parallel and Distributed Graph Cuts by Dual Decomposition. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2010. [40] T. Szymanski. Max-Flow Min-Cost Routing in a Future-Internet with Improved QoS Guarantees. IEEE Trans. on Communications, 61(4):1485–1497, 2013. [41] D. W. Tank and J. J. Hopfield. Simple “Neural” Optimization Networks: An A/D Converter, Signal Decision Circuit, and a Linear Programming Circuit. IEEE Transactions on Circuits and Systems, 33(5):533–541, 1986. [42] S. Vichik and F. Borrelli. Solving Linear and Quadratic Programs with an Analog Circuit. Computers and Chemical Engineering, 70(5):160–171, 2014. 23