
A Survey of Spintronic Architectures for Processing-in-Memory and Neural Networks

Sumanth Umesh* and Sparsh Mittal†
*IIT Jodhpur, †IIT Hyderabad. E-mail: [email protected], [email protected]

Abstract: The rising overheads of data-movement and the limitations of general-purpose processing architectures have led to a huge surge of interest in the "processing-in-memory" (PIM) approach and "neural network" (NN) architectures. Spintronic memories facilitate efficient implementation of the PIM approach and NN accelerators, and offer several advantages over conventional memories. In this paper, we present a survey of spintronic architectures for PIM and NNs. We organize the works based on their main attributes to underscore their similarities and differences. This paper will be useful for researchers in the areas of artificial intelligence, hardware architecture, chip design and memory systems.

Index Terms: Review; "spin transfer torque RAM", "spin orbit torque", "domain wall memory", "processing-in-memory", "machine learning", "neural networks"


1 INTRODUCTION

As conventional von Neumann–style processors become progressively restricted by data-movement overheads [1], use of the processing-in-memory (PIM) approach has become not merely attractive, but imperative. Further, as machine learning algorithms are applied to cognitive tasks of ever-increasing complexity, their memory and computation demands are escalating fast. Since traditional processors are unable to meet these requirements, the design of domain-specific accelerators has become essential. These factors and trends call for research into novel memory technologies, architectures and design approaches.

Spintronic memories allow performing computations such as arithmetic and logic operations inside memory. They also allow efficient modeling of neurons and synapses, which makes them useful for accelerating neural networks [2]. These properties, along with the near-zero standby power and high density of spintronic memories, make them promising candidates for architecting future memory systems and even computing systems.

Use of spintronic memories, however, also presents key challenges. Compared to SRAM and DRAM, spintronic memories have higher latency and write energy. Also, most existing proposals implement simple neuron models, such as a neuron producing a "binary output" based on the sign of the input. However, NN architectures aimed at solving complex cognitive tasks require more realistic neuron models [2]. Further, since some spin neuron-synapse units cannot be connected through spin-signaling [3], they need to be connected using CMOS (complementary metal-oxide semiconductor) based charge-signaling. Evidently, the design of spintronic accelerators for PIM and NN is challenging, and yet rewarding. Several circuit, microarchitecture and system-level techniques have been recently proposed towards this end.

Footnote: Sumanth worked on this paper while working as an intern at IIT Hyderabad. Support for this work was provided by the Science and Engineering Research Board (SERB), India, award number ECR/2017/000622.

Contributions: In this paper, we present a survey of spintronic accelerators for PIM and NN. Figure 1 summarizes the contents of this paper. Section 2 provides a background on key concepts and a classification of research works on key parameters. Sections 3 and 4 present techniques for designing logic and arithmetic units, respectively. Section 5 discusses spintronic accelerators for a range of application domains. In these sections, we focus on qualitative insights and not on quantitative results.

Paper organization:
§2 Background and motivation: §2.1 Magnetic tunneling junction; §2.2 VCMA and VCMA-assisted STT devices; §2.3 Domain wall memory devices; §2.4 Skyrmions and skyrmion-based racetracks; §2.5 Spintronics v/s all-spin logic; §2.6 Complete logic set; §2.7 Classification
§3 Spintronic logic units: §3.1 Bitwise logic; §3.2 Programmable switch and logic element; §3.3 Multiplexer and encoder; §3.4 Random number generator
§4 Spintronic arithmetic units: §4.1 Adder designs; §4.2 Approximate adder designs; §4.3 Multiplier designs; §4.4 Majority gate based designs; §4.5 LUT designs
§5 Spintronic accelerators for various applications: §5.1 Neuromorphic computing; §5.2 Image processing; §5.3 Data encryption; §5.4 Associative computing
§6 Conclusion and future outlook

Fig. 1. Organization of the paper

Finally, Section 6 concludes this paper with a discussion of future challenges. This paper will be useful for researchers interested in the confluence of machine learning, hardware architecture and memory architectures. Table 1 shows the acronyms used in this paper. Input and output carry are shown as Ci and Co, respectively.

TABLE 1
Acronyms used frequently in this paper

ADC: analog to digital converter
AES: advanced encryption standard
ANN/CNN: artificial/convolutional neural network
CMOL: hybrid CMOS/nanowire/MOLecular
CMOS/pMOS/nMOS: complementary/P-type/N-type metal-oxide semiconductor
DPU: digital processing unit
DRAM: dynamic random access memory
DW: domain wall
DWM: domain wall memory
LSB/MSB: least/most significant bit
LSV: lateral spin valve
LUT: look-up table
MAC: multiply and accumulate
MCA: memristive crossbar array
MJG: majority gate
MTJ: magnetic tunnel junction
MUX/DEMUX: multiplexer/demultiplexer
NMC: nano-magnetic channel
NMOS: N-type metal-oxide-semiconductor logic
NVM: non-volatile memory
PIM: processing in memory
RRAM: resistive RAM
SA: sense amplifier
SRAM: static random access memory
TCAM: ternary content addressable memory
VCMA: voltage-controlled magnetic anisotropy

2 BACKGROUND AND MOTIVATION

We now discuss relevant concepts and refer the reader to prior works for a background on NVMs [4–7].

2.1 Magnetic tunneling junction

An MTJ is a device consisting of two "ferromagnetic layers" separated by a thin metallic-oxide tunneling layer [8, 9]. The relative angular momentum, or spin, of the two ferromagnetic layers is leveraged to store binary data. The layers can be in two possible orientations: one where both layers have the same, or parallel, spins and the other where both layers have opposing, or anti-parallel, spins, as shown in Figure 2(a). In the parallel orientation, the tunneling effect occurs in the oxide layer, resulting in low resistance across the MTJ. In the anti-parallel orientation, the tunneling of electrons across the oxide layer is hindered, resulting in high resistance. These two resistances are used to denote the binary logic states 'high' and 'low'. In an MTJ, the orientation of one of the ferromagnetic layers is fixed, and this layer is termed the 'reference' or 'fixed' layer. The second ferromagnetic layer is left free to change orientation and is termed the 'free' layer. Altering the orientation of the free layer provides a switching mechanism that toggles logic states, similar to that in a transistor.

Fig. 2. (a) “Parallel” and “anti-parallel” orientations of MTJs (b) STT-MTJ (c) SOT-MTJ
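This resistive encoding can be sketched with a toy model; the resistance values and TMR ratio below are illustrative assumptions, not measured device parameters:

```python
# Toy model of MTJ-based storage: the parallel (P) state has low
# resistance and the anti-parallel (AP) state has high resistance; a
# sense amplifier compares the cell against a reference between the two.
# R_P and the TMR ratio are illustrative assumptions, not device data.

R_P = 2000.0              # ohms, parallel state (logic '0')
TMR = 1.0                 # assumed tunneling magnetoresistance ratio (100%)
R_AP = R_P * (1 + TMR)    # ohms, anti-parallel state (logic '1')
R_REF = (R_P + R_AP) / 2  # sense-amplifier reference resistance

def read_mtj(resistance):
    """Return the stored bit: resistance above the reference reads as 1."""
    return 1 if resistance > R_REF else 0

assert read_mtj(R_P) == 0   # parallel orientation senses as logic '0'
assert read_mtj(R_AP) == 1  # anti-parallel orientation senses as logic '1'
```

The gap between R_P and R_AP sets the sensing margin; this is why the shrinking margin under feature-size scaling, discussed below, is a reliability concern.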

The switching mechanisms are mainly of two types: "spin transfer torque" (STT) and "spin orbit torque" (SOT). In STT switching [8], an unpolarized current is passed through the fixed layer, whose spin imparts an angular momentum that results in a spin-polarized current. This current, when passed through the free layer, transfers its angular momentum, resulting in a change in the free layer's orientation. MTJs switched using the STT effect are termed STT-MTJs; they are shown in Figure 2(b). In SOT switching [10], the free layer is attached to a strip of heavy metal. To write into the MTJ, the "spin Hall effect" is leveraged, where an unpolarized current through the "heavy metal layer" results in a "spin-polarized current" in a direction perpendicular to that of the unpolarized current. The spin current so produced transfers its angular momentum to the free layer, resulting in switching action. MTJs switched using the "spin Hall effect" are termed SOT-MTJs; they are shown in Figure 2(c).

Figure 3 shows standard STT-RAM and SOT-RAM bit-cells. An STT-RAM bit-cell uses the same set of terminals across the ferromagnetic layers for both read and write operations. In contrast, an SOT-RAM bit-cell has separate sets of terminals for read and write operations. Although the SOT-RAM bit-cell requires extra terminals for its operation, it has the advantage that the write and read operations can be independently optimized.

Fig. 3. (a) STT-RAM bit-cell and array (b) SOT-RAM bit-cell and array

Challenges of STT-RAM and SOT-RAM: Compared to SRAM, STT-RAM has higher write latency and energy [11]. With ongoing feature-size scaling, its "sensing margin" is further reduced [12]. Also, scaling leads to a decrease in the critical current required for switching, which reduces the write energy overhead [13]. However, the read current does not scale much and hence the write current approaches the read current, leading to the phenomenon of "read disturbance" [12, 14]. Both STT-RAM and SOT-RAM suffer from thermal instability, which leads to retention failures. The instability increases with scaling and poses a challenge to the use of both memories. By virtue of using separate paths for read and write, SOT-RAM does not suffer from "read disturbance"; however, due to this, its bit-cell area is higher than that of STT-RAM.

2.2 VCMA and VCMA-assisted STT devices

VCMA-switched MTJs rely on voltage for switching, rather than the current used by STT-MTJs and SOT-MTJs. VCMA MTJs have thicker oxide layers that act as capacitors [15, 16]. When a voltage pulse is applied across the terminals, charge is accumulated at the oxide-ferromagnetic layer interfaces, which in turn causes a change in the occupancy of the atomic orbitals. The change in orbital occupancy, combined with the STT effect, induces a change in the magnetic anisotropy of the MTJ. However, for voltages greater than the threshold voltage, the orientation of the free layer oscillates and the final orientation depends on the duration of the voltage pulse. To eliminate the dependency on pulse duration, the VCMA-assisted STT mechanism is used: a sufficiently large voltage is applied to induce oscillation in the free layer, and a smaller voltage pulse is applied for a longer duration to generate the STT effect and stabilize the final orientation of the free layer [15]. The use of voltage rather than current for switching results in significantly reduced power consumption due to minimal Joule heating and Ohmic losses, which are a major concern for STT-MTJs and SOT-MTJs. VCMA-MTJs also have higher packing densities than their STT and SOT counterparts.
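A toy model of this duration dependence follows; the precession period, and the assumption that each half-period of oscillation toggles the state, are illustrative simplifications rather than device physics:

```python
# Toy model of precessional VCMA switching: above the threshold voltage
# the free layer oscillates, so the final state depends on how many
# half-periods of precession fit into the voltage pulse. The STT assist
# pins the final state regardless of pulse duration.
# T_PRECESSION and all numbers are illustrative assumptions.

T_PRECESSION = 1.0  # ns, assumed precession period of the free layer

def vcma_switch(state, pulse_ns, stt_assist=False, stt_target=1):
    if stt_assist:
        return stt_target                  # STT bias stabilizes the orientation
    half_periods = int(pulse_ns // (T_PRECESSION / 2))
    return state ^ (half_periods % 2)      # odd number of half-turns: toggled

# Pure VCMA: the result flips with small changes in pulse width.
assert vcma_switch(0, 0.6) == 1
assert vcma_switch(0, 1.1) == 0
# VCMA-assisted STT: the duration no longer matters.
assert vcma_switch(0, 0.6, stt_assist=True) == 1
assert vcma_switch(0, 1.1, stt_assist=True) == 1
```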

2.3 Domain wall memory devices

A DWM device consists of a ferromagnetic nanowire in which opposing spins create a DW [17]. The DW thus formed can be moved through the use of spin-polarized currents. Similar to MTJs, domain wall devices can be operated using STT or SOT mechanisms. Here, instead of switching the orientation of a free layer, the STT and SOT techniques are used to displace the DW. Based on the mechanism used, the devices are termed STT-DWM or SOT-DWM devices. Racetracks are made up of ferromagnetic nanowires of lengths sufficient to accommodate multiple domain walls [18]. Each racetrack possesses nanoscale notches that stabilize the domain walls. This allows each racetrack to store multiple bits [4]. A read/write head is formed by placing a ferromagnetic layer on top of the racetrack to form an MTJ. The data to be accessed is brought under the read/write MTJ by shifting it through domain wall motion. The key challenge in the use of DWM is the latency and energy overhead of shift operations [4].
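The shift-before-access behavior, and why shift count is a useful proxy for latency and energy, can be sketched as follows (the track contents, head position and unit-cost model are illustrative assumptions):

```python
# Sketch of DWM racetrack access: data must be shifted under the fixed
# read/write head before it can be sensed, so access cost grows with
# the distance between the target bit and the current head alignment.
# Track length, contents and the unit shift cost are assumptions.

class Racetrack:
    def __init__(self, bits, head=0):
        self.bits = list(bits)
        self.head = head    # index currently aligned with the read/write head
        self.shifts = 0     # accumulated shift count (latency/energy proxy)

    def read(self, index):
        self.shifts += abs(index - self.head)  # shift the target under the head
        self.head = index
        return self.bits[index]

track = Racetrack([1, 0, 1, 1, 0, 0, 1, 0])
track.read(5)           # 5 shifts to bring bit 5 under the head
track.read(2)           # 3 more shifts back to bit 2
assert track.shifts == 8
```

Techniques such as placing hot data near the head, or the redundant "verify before shift" bits surveyed later, all target exactly this shift cost.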

2.4 Skyrmions and skyrmion-based racetracks

Magnetic skyrmions are topologically stable field configurations that possess particle-like properties [19]. They are created as a result of the competing effects of Dzyaloshinskii-Moriya interactions, magnetic anisotropy and ferromagnetic exchange coupling in bulk ferromagnets and magnetic thin films [15, 20]. Skyrmions have gained attention as candidates for racetracks due to their topological stability, low driving current and small size. A racetrack with skyrmions instead of domain walls stores data based on the presence or absence of skyrmions, and not on the orientation of layers as in domain-wall racetracks. Such racetracks require a read head for skyrmion detection, a write head for skyrmion creation and a nanowire for skyrmion motion, along with CMOS-based peripheral circuitry. Skyrmion-based racetracks can outperform domain-wall-based racetracks in terms of power consumption, packing density and robustness [20].

2.5 Spintronics v/s all-spin logic

Spintronics refers to devices that use CMOS and spin-based components. An example of a spintronic device is STT-RAM, which makes use of CMOS transistors and MTJs. These devices utilize both charge and spin-polarized currents: the spin-polarized currents are used for altering magnetic states, whereas the charge currents operate the transistors. By comparison, all-spin logic makes use of only the spin-polarized components of the currents. An example of an all-spin logic device is a "lateral spin valve" [21]. Lateral spin valves consist of ferromagnetic layers placed above conducting channels [21]. The input currents through the metallic channels are spin-polarized and exert the STT effect on the ferromagnetic layers, resulting in a change in the magnetic orientation. The resultant sum of the spin-polarized inputs is responsible for the switching of the ferromagnetic layers.


2.6 Complete logic set

A set of Boolean functions is said to form a complete logic set if all Boolean functions can be implemented as combinations of the members of the set. The most commonly used complete logic sets are (1) AND, OR and NOT, (2) NAND, (3) NOR, (4) implication and NOT, and (5) majority gate. Among these, the first three are grouped together under "reconfigurable" logic [22], because PIM architectures relying on the "reconfigurable logic" technique can implement AND, OR, NOT, NAND and NOR with almost the same ease; i.e., such designs are free to use at least three (AND, OR and NOT) of the aforementioned logic gates to implement Boolean functions. Reconfiguration is accomplished by varying the reference voltage provided to the SAs. By contrast, PIM architectures relying on the implication technique can use only IMP and NOT operations, while majority-gate-based architectures can use only one gate.
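As a quick sanity check of completeness, the implication-and-NOT set can synthesize the other basic gates; the following plain truth-table demonstration is independent of any particular PIM circuit:

```python
# Demonstrating that {implication, NOT} is a complete logic set:
# every other basic gate can be expressed with just these two operations.

def NOT(a): return 1 - a
def IMP(a, b): return NOT(a) | b           # material implication: a -> b

def OR(a, b):  return IMP(NOT(a), b)       # (NOT a) -> b   ==  a OR b
def AND(a, b): return NOT(IMP(a, NOT(b)))  # NOT(a -> NOT b) ==  a AND b
def XOR(a, b): return AND(OR(a, b), NOT(AND(a, b)))

for a in (0, 1):
    for b in (0, 1):
        assert OR(a, b)  == (a | b)
        assert AND(a, b) == (a & b)
        assert XOR(a, b) == (a ^ b)
```

The step counts visible here (e.g. three implication/NOT primitives for AND) are the reason implication-based PIM designs need more logic steps than reconfigurable ones, as discussed in Section 3.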

2.7 Classification

Table 2 first organizes the projects according to the type of memory used by them. Then, it underscores their optimization objective. Table 2 further shows the application domains of different spintronic architectures. Finally, it shows the projects that perform comparative evaluation of spintronic architectures against other platforms, e.g., CPU, FPGA, etc.

TABLE 2
A classification based on memory technology, optimization goal, application domain and comparative evaluation

Memory technology used:
- STT-RAM: [2, 22–43]
- SOT-RAM: [2, 44–52]
- VCMA: [16, 53, 54]
- DWM: [2, 3, 17, 45, 55–68]
- Skyrmion: [15, 20, 69–71]

Optimization objective:
- Performance: nearly all
- Energy: [2, 3, 17, 21–24, 26, 28, 29, 32, 33, 35, 36, 42–44, 46–48, 50–52, 55–67]
- Reliability: [22, 27, 30, 31, 47, 60, 64, 66]

Application area:
- Neuromorphic computing: [2, 3, 31, 32, 47, 48, 59–62, 68]
- Image processing: [24, 61, 66]
- Encryption: [29, 45, 50, 51, 64, 65, 67]
- Associative computing: [23]

Comparison of spintronic architectures with:
- CPU: [24, 50, 51, 58, 59, 64–67]
- GPU: [24, 48]
- FPGA: [48]
- ASIC: [29, 48, 50, 51, 64, 65, 67]
- CMOS: [2, 17, 21, 26, 28, 32, 36, 41, 43, 46, 52, 61, 63, 66]
- CMOL: [29, 50, 51, 64, 65, 67]

3 SPINTRONIC LOGIC UNITS

In this section, we discuss spintronic PIM architectures for bitwise operations (Section 3.1), programmable switch and logic element (Section 3.2), MUX and encoder (Section 3.3) and random number generators (Section 3.4).

Table 3 classifies the PIM architectures for performing logic operations based on their design features. It classifies the works as all-spin or spintronic logic. It then shows the bit-cell designs used by different works, the DWM device designs used in PIM accelerators, and the circuit modifications performed by PIM architectures. The approaches used for configuring the PIM architectures can be broadly divided into two groups. In the first approach, operations are configured by directly providing an appropriate voltage to the reference terminal of the SAs, whereas in the second approach, binary data is either fed dynamically or is stored in the memory bit-cells. The former approach is analog in nature, while the latter is digital. The works that use these approaches are also highlighted in Table 3.

TABLE 3
A classification based on design features of PIM accelerators

All-spin logic and spintronic logic designs:
- All-spin logic: [21, 42]
- Spintronic logic: nearly all others

Bit-cell designs used in PIM accelerators:
- 1T-1MTJ STT-RAM bit-cell with four terminals: [24–27, 29, 53]
- 1T-1MTJ STT-RAM bit-cell with programmable and read-only data: [26]
- 2T-1MTJ STT-RAM bit-cell with four terminals: [23]
- 2T-1MTJ STT-RAM bit-cell with five terminals: [26]
- 2T-1MTJ SOT-RAM bit-cell with five terminals: [45–49]

DWM devices used in PIM accelerators:
- Three-terminal DWM: [17, 36, 50, 55, 61, 62, 66, 68]
- Four-terminal DWM: [32, 64]
- Five-terminal DWM: [29]
- DWM racetracks: [18, 56–60, 63, 65, 67]

Circuit techniques/designs used for PIM operations:
- Modified bit-cell structure: [25, 26]
- Modified peripheral circuitry: [24, 26, 29, 45]
- Configuration using reference voltages: [24, 26, 29, 45]
- Configuration using binary data: [25–27, 57]

Table 4 first shows the PIM operations performed by different research works. Then, it summarizes the different logic sets used in PIM architectures. Further, it shows the works that use redundancy for various objectives. Finally, it highlights the strategies for reducing write overhead.

TABLE 4
A classification of features and optimization strategies of PIM accelerators

PIM operations:
- Basic logic operations: [22, 25–27, 29, 30, 45, 46, 53, 58]
- Programmable switch and logic element: [36]
- Multiplexer and demultiplexer: [39]
- Encoder and decoder: [63]

Logic set used for PIM operations:
- Reconfigurable logic: [24–26, 29, 38, 41, 42, 45, 49]
- Implication and NOT logic: [27, 53]
- Majority gate logic: [3, 21, 30, 42, 50, 52, 66]

Use of redundancy:
- Redundant MTJs to avoid the impact of variation [37] or to provide reliability in the majority voting circuit [30]
- Redundant bits to facilitate shifting in DWM: [58, 65]

Strategies for reducing write overhead:
- Achieving writes through shift operations: [55, 57]
- Verify before shift: [57]

3.1 Bitwise logic

Jain et al. [26] propose three STT-RAM based PIM accelerators that perform logic, arithmetic and vector operations. The first accelerator makes use of conventional STT-RAM arrays and modified peripheral circuitry. The current flowing through the 'source line' of the STT-RAM array represents the summation of the values in the MTJs along the column. The modified peripheral circuitry consists of an additional external input to configure the logic operation, a decoder that sets the signals for the reference current value, a reference generator that generates reference currents, and a row-decoder that can enable two wordlines simultaneously. They note that a vector operation leads to a vector output, and accessing this requires more than one access. Since reduction operations generally follow vector operations, they use a "reduce unit" which reduces the vector output to a scalar output so that it can be retrieved in a single access. The second accelerator modifies the bit-cell structure so that it simultaneously stores 1 bit of programmable data and 1 bit of read-only data, as shown in Figure 4. The MTJs attached to BL0 and BL1 store read-only '0' and '1', respectively. The peripheral circuitry comprises a "pre-charge circuit", two SAs and two "current sources", as shown in Figures 4(a) and 4(b). By pre-charging the "bit-lines" to either the reference voltage or zero, the programmable and read-only data are accessed, respectively. The read-only bits of the bit-cells are used to form an LUT to implement in-memory transcendental functions such as logarithmic, sigmoid and trigonometric functions.
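The reference-based configuration used by such reconfigurable designs can be sketched abstractly; the unit currents and threshold values below are idealized assumptions that ignore sensing margins and device variation:

```python
# Abstract sketch of reconfigurable in-memory logic: two wordlines are
# activated at once, the bitline "current" is the sum of the two stored
# bits, and the sense-amplifier reference selects the operation.
# Idealized unit currents; real designs must budget for sensing margins.

def bitline_sum(bit_a, bit_b):
    return bit_a + bit_b            # 0, 1 or 2 "units" of current

def sense(current, op):
    if op == "AND":                 # high reference: fires only when both are 1
        return 1 if current > 1.5 else 0
    if op == "OR":                  # low reference: fires when at least one is 1
        return 1 if current > 0.5 else 0
    raise ValueError(op)

for a in (0, 1):
    for b in (0, 1):
        assert sense(bitline_sum(a, b), "AND") == (a & b)
        assert sense(bitline_sum(a, b), "OR") == (a | b)
```

Changing the operation is just a matter of moving the reference, which is why these designs are called reconfigurable.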


Fig. 4. (a) Bit-cell proposed by Jain et al. [26] which stores read-only and programmable data (b) Array structure with read-only and programmable bit-cells (c) 2T-1MTJ bit-cell

The third accelerator, shown in Figure 4(c), makes use of a 2T-1MTJ cell array. When the "write or logic mode" (WLM) signal is set 'high', it works as a standard STT-RAM array that performs read and write operations. When WLM is set 'low', it works as a PIM accelerator. Two bit-cells store the operands and another bit-cell stores the output. The operand and output cells are enabled by the "bit logic line" (BLL) and connected through the common "logic line" (LL). By appropriately configuring the input cells and the reference voltage, NOT, AND/NAND, OR/NOR and majority functions are implemented. They show that their technique is more energy-efficient than CMOS-PIM accelerators and provides comparable throughput.

Mahmoudi et al. [27] propose an STT-RAM based PIM accelerator. The proposed architecture implements the implication and NOT logic operations, and all Boolean functions are implemented as combinations of these two operations. To perform the implication operation, a current is passed via the common "bit-line". The bit-cells holding the operands are selected by two unequal enabling voltages along their word-lines. These unequal voltages lead to different channel resistances among the transistors, and the resulting asymmetry realizes the implication operation. NOT is performed by directly changing the orientation of the MTJ. The proposed structure allows implementation of basic Boolean logic functions in a standard STT-RAM array without modification or extra peripheral circuitry. Their experiments confirm the efficacy of their technique.

Jaiswal et al. [53] propose a PIM platform using a VCMA-driven MTJ array. VCMA employs voltage to switch an MTJ, instead of the spin-polarized currents used by an STT-MTJ. Operations are effected such that the result is stored in one of the operand bit-cells. The computations of the array are based on implication and NOT logic.
In the case of implication logic, bit-cells containing the operands are connected through a bit-line, and the mid-point of the bit-line acts as a voltage divider. The asymmetry in voltages is exploited to obtain the result. The NOT operation is directly performed by toggling the state of the MTJ. The remaining Boolean functions are implemented as combinations of implication and NOT. Their technique does not rely on SAs and reference voltages for configuring the operations. The in-situ nature reduces the number of bit-cells required for operations, thereby decreasing both area and logic complexity.

Wang et al. [54] propose a VCMA MTJ capable of implementing stateful Boolean functions such as AND, OR and XNOR. The device consists of five layers (electrode, MgO, CoFeB, Co/Pd and Ta on a Si/Si-oxide substrate), as shown in Figure 5(a). The bias voltage Vb and the out-of-plane magnetic field Hex are leveraged as logic inputs. Based on critical points obtained from the R-H curve (resistance-magnetic field curve), Hex is encoded to represent logic 0 and 1 in the form of logic input q. Similarly, Vb is encoded to give logic 0 and 1 as input p.

(Figure 5(b) configuration table: Ri = 0 configures AND; Ri = 1 configures OR; Ri = q configures XNOR.)

Fig. 5. (a) VCMA MTJ proposed by Wang et al. [54] (b) Table for configuring operations

If Ri represents the current state of the MTJ, then the next state Ri+1 is given by Ri+1 = p̄·Ri + p·q. By setting the value of Ri as shown in Figure 5(b), it is possible to implement the AND, OR and XNOR operations. The result of the operation is stored in the MTJ. Their technique allows performing logic operations in a manner similar to memory read and write operations. Also, their proposed VCMA-MTJ has a write latency and energy consumption on the order of nanoseconds and femtojoules per bit, respectively.

Comments: The techniques of Jaiswal et al. [53] and Wang et al. [54] both employ VCMA-MTJs. The difference between them lies in the device being used: while the former makes use of a general VCMA-MTJ, the latter uses a specific five-layer VCMA-MTJ. Also, the former performs Boolean computations based on implication and NOT logic, while the latter performs AND, OR and XNOR based operations.

Fan et al. [45] present two dual-mode PIM accelerators. The first accelerator is based on an SOT-RAM array. Memory access operations are performed by activating the suitable "write wordline" (WWL). For computations, the two wordlines containing the operands are activated using a row-decoder. The operation to be performed (AND or OR) is determined by the value of the reference voltage on the SA. This is similar in operation to an STT-RAM PIM platform that uses reference voltages to configure operations. The second accelerator is based on racetrack memory and can implement a parallel XOR operation. It consists of perpendicularly coupled DWM racetracks made up of ferromagnetic nanowires, as shown in Figure 6. The nanowire mesh is equipped with spin polarizers and sensing MTJs that act as write and read heads, respectively. The bits present in the intersection region of the perpendicularly coupled nanowires labeled A and B are taken as inputs, while the resistance of the intersection MTJ gives the result of the XOR operation, which is stored in the intersection MTJ itself.
This allows very fast and parallel in-memory XOR computations, making it suitable for data encryption. They implement AES data encryption on the proposed PIM accelerators and show that their implementation consumes lower energy and area when compared to CPU, CMOL and ASIC designs.


Fig. 6. Perpendicularly-coupled racetracks design proposed by Fan et al. [45] which can perform XOR operations
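The intersection-MTJ behavior can be sketched as a bitwise operation over two tracks; the word width and bit patterns are illustrative, and the `intersection_xor` helper is a hypothetical model of the parallel/anti-parallel sensing:

```python
# Sketch of the perpendicularly-coupled racetrack XOR: the MTJ formed at
# each intersection of the two nanowire sets is low-resistance when the
# two bits beneath it are parallel and high-resistance when they are
# anti-parallel, so sensing it yields A XOR B for every bit position at
# once. The 8-bit word and bit patterns are illustrative assumptions.

def intersection_xor(track_a, track_b):
    """One intersection MTJ per bit pair: 1 iff the orientations differ."""
    return [int(a != b) for a, b in zip(track_a, track_b)]

plaintext = [1, 0, 1, 1, 0, 1, 0, 0]
keystream = [0, 1, 1, 0, 0, 1, 1, 0]
ciphertext = intersection_xor(plaintext, keystream)
# XOR is its own inverse, which is what makes this useful for encryption:
assert intersection_xor(ciphertext, keystream) == plaintext
```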

Parveen et al. [29] propose an STT-RAM based PIM architecture which can perform two-input logic operations, viz., AND, OR, XOR, NAND, NOR and XNOR, between operands in a memory array irrespective of their position. Figure 7(a) shows their PIM accelerator. The proposed design can work as both an NVM and a PIM accelerator. Traditional STT-RAM arrays perform the memory read/write operations, whereas the computation mode is implemented through an extension to the SAs using a 5-terminal DWM device, shown in Figure 7(b), and a differential latch. For Boolean operations, first the domain wall is set to its initial position and the operands are read using the SAs. Next, a sensing current is injected through the extension circuit. The current can flow between any two terminals out of R+, R1- and R2-, depending on the configuration of the extension circuit. The direction of this current and the reference value of the differential latch determine which operation is performed.


Fig. 7. (a) PIM accelerator proposed by Parveen et al. [29] (b) Five terminal DWM device used in their accelerator [29].

Their implementation consumes lower energy than racetrack and other MTJ-based PIM implementations. However, their implementation is slower due to the increased latency of individual Boolean computations. Compared to a CMOS-ASIC implementation, their proposed platform provides higher performance and better energy efficiency for bulk-bitwise operations. Also, their accelerator consumes less energy for AES data-encryption than CMOS-ASIC and CMOL implementations.

Comments: The STT-RAM and SOT-RAM based architectures proposed by most works [24–27, 45] require the operands to be located in a common row or column. By comparison, the design of Parveen et al. [29] does not have this limitation, since it reads the operands in two different cycles.

Kang et al. [25] present an STT-RAM based PIM accelerator that performs bulk bitwise operations. Their design has a complementary STT-RAM array structure and exploits the peripheral circuitry of the memory with minor modifications. No extra processing units are required. The two operands are stored in two different wordlines, while data from a third wordline is used to configure the logic operation. Therefore, programming is equivalent to writing to an MTJ. Using this design, AND and OR operations are performed. The design can be extended to incorporate NOT, NAND and NOR by adding a MUX after each SA; this, however, requires more wordlines to configure the operation. Results show that the latency of performing logic operations on the proposed PIM platform is nearly the same as that of reading from a bit-cell. Their design makes it possible to perform logic operations in a manner similar to a memory readout, without additional hardware.

Comments: The techniques of Kang et al. [25] and Jain et al. [26] use binary data for configuring the array to perform the required operation. The latter uses a special binary input which is provided dynamically, in a continuous manner, during operation. In the former case, the configuring inputs are stored in bit-cells of the STT-RAM array itself.

Zhang et al. [46] present a PIM accelerator based on a voltage-gated SOT-RAM array. The voltage-controlled "spin Hall effect" switching of the MTJ is exploited to perform in-situ logic operations. In the case of voltage-gated MTJs, two inputs are needed for changing the state: as shown in Figure 8(a), one input is the switching current and the other is the bias voltage. The critical switching current is modulated by the "bias voltage" across the MTJ. A single operating MTJ can evaluate the function Bi+1 = A·Z + Ā·Bi, where Bi is the original MTJ state and Bi+1 is the output that is stored as the new state of B. A is the bias voltage, such that a positive bias represents logic high and zero bias represents logic low. Z represents the polarity of the switching current, and by changing the value of Z, two-input AND, OR and XOR functions are implemented, as shown in Figure 8(b). Since the output is also stored in the same MTJ, the operations performed by their accelerator are in-situ in nature. These MTJs arranged in a cross-point fashion form a PIM accelerator suited for bulk-bitwise operations. The MTJs along a row are accessed concurrently using bit-lines, similar to a conventional SOT-RAM. Exploiting this feature allows bitwise operations to be performed with a high degree of parallelism.
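Taking the update rule to be Bi+1 = A·Z + Ā·Bi (a reconstruction; the bar placement in the source expression is ambiguous), the roles of Z for the OR and XOR configurations can be checked with a truth table; the AND configuration additionally depends on the current-polarity encoding, so it is omitted:

```python
# Truth-table check of the voltage-gated SOT update rule, assuming the
# reconstruction B_next = (A AND Z) OR (NOT A AND B): the bias voltage A
# gates whether the switching-current polarity Z overwrites the state B.
# This reading of the garbled source expression is an assumption.

def b_next(a, z, b):
    return (a & z) | ((1 - a) & b)

for a in (0, 1):
    for b in (0, 1):
        assert b_next(a, 1, b) == (a | b)      # Z fixed at 1     -> OR
        assert b_next(a, 1 - b, b) == (a ^ b)  # Z tied to NOT B  -> XOR
```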


Fig. 8. (a) Voltage-gated SOT-MTJ (b) Using Z as a control signal to implement AND, OR and XOR operations in the work of Zhang et al. [46]

Compared to CMOS logic gates, their proposed design has higher latency due to longer switching time of MTJs. However, the static power consumption is greatly reduced due to their non-volatile nature. Chang et al. [49] propose a PIM architecture through integration of SOT-RAM based memory and reconfigurable-logic. It consists of standard SOT-RAM array for memory, SOT reconfigurable logic similar to the SOT based PIM array [45], interconnections and a controller. The controller handles programming of SOT logic, instructions and address distribution in the SOT-RAM array. Interconnections facilitate data transfer between memory and logic. The proposed design makes use of identical memory and storage elements which avoids the issue of technological incompatibility between DRAM and SOT-RAM. Use of SOT-MTJ overcomes high latency of STT-MTJ and allows off-line programming and high speed operations. The proposed design provides higher performance than DRAM and STT-RAM based PIM accelerators. The performance advantage is even higher for iterative computations that require writing to memory frequently. Mahmoudi et al. [22] analyze the two main approaches to in-memory bitwise operations, viz., reconfigurable logic and implication logic. Reconfigurable logic refers to techniques that implement Boolean functions as combinations of spintronic AND/OR, NAND/NOR, XOR/XNOR and NOT gates. These gates are themselves implemented by applying appropriate reference voltages on SAs. On the other hand, implication logic refers to techniques in which Boolean functions are implemented as a combination of implication and NOT operations such as that used in [27]. In the case of implication implementations, multiple logic fan-outs are handled with the help of a combination of implication and NOT operations such that intermediate writing and sensing is eliminated. This results in higher reliability and lower power consumption for implication based systems than reconfigurable implementations. 
However, the number of logic steps needed for implementing complex functions is higher in the case of implication logic [22] than in the case of reconfigurable logic.
Fig. 9. Coupled STT-RAM array structure for PIM as proposed by Mahmoudi et al. [22]

They present two approaches to reduce the number of steps. The first approach combines implication and reconfigurable logic. Such a combination makes it possible to use AND, NAND, implication and NOT operations, which significantly lowers the number of steps needed to implement "complex logic functions". This approach provides higher performance and energy efficiency than implication-only implementations, but suffers from higher error probabilities. The second approach is based on parallelization of STT-RAM arrays so that multiple operations are performed simultaneously. It makes use of coupled STT-RAM arrays as shown in Figure 9. This approach does not reduce the number of steps, but provides faster execution due to parallelization.
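As an illustration (not drawn from the surveyed papers), the two synthesis styles can be contrasted in software: implication logic builds every function from material implication (p IMP q = NOT p OR q) and NOT only, while reconfigurable logic composes conventional AND/OR/NOT gates. The XOR identity used below is one standard choice, not necessarily the one used in [27]:

```python
# Sketch: the same Boolean function synthesized from implication-logic
# primitives (IMP, NOT) and from reconfigurable-logic gates (AND, OR, NOT).
def IMP(p, q):
    # material implication: p -> q  ==  (not p) or q
    return (1 - p) | q

def NOT(p):
    return 1 - p

def xor_implication(a, b):
    # a XOR b = (a IMP b) IMP NOT(b IMP a)  -- one of several known identities
    return IMP(IMP(a, b), NOT(IMP(b, a)))

def xor_reconfigurable(a, b):
    # direct gate-level composition, as a reconfigurable-logic design would use
    return (a | b) & (1 - (a & b))

for a in (0, 1):
    for b in (0, 1):
        assert xor_implication(a, b) == xor_reconfigurable(a, b) == a ^ b
```

The implication version needs three IMP steps plus a NOT, illustrating why implication logic can require more steps than a direct gate composition for the same function.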


Wang et al. [72] present a spintronic memory which brings together the advantages of STT-RAM and SOT-RAM while eliminating their disadvantages. In the case of STT-RAM and SOT-RAM, the critical current required for transitioning from the parallel to the anti-parallel state is higher than the critical current required for the transition from the anti-parallel to the parallel state. Also, the two critical currents are opposite in direction. These factors lead to source degradation. To compensate for this effect, sufficiently large access transistors need to be used, keeping in mind the worst case (parallel to anti-parallel) of write operations. This leads to high current and reduced reliability for the other case (AP to P). Also, STT-MTJ has high switching latency. On the other hand, SOT-RAM requires two access transistors, which leads to low packing density. Also, SOT-RAM does not address the problem of source degradation and requires higher current density than STT-RAM for write operations.


Fig. 10. Structure of a NAND like block proposed by Wang et al. [72]

Their technique combines STT-RAM and SOT-RAM into a NAND-flash-like structure, as shown in Figure 10. Each string or block comprises MTJs whose free layers are attached to the heavy metal layer. Each MTJ is accompanied by one access transistor and a pair of pMOS and nMOS select transistors. The write operation is carried out in two steps. First, a current Ierase is passed through the heavy metal layer, which sets all the elements of the block to AP. Second, the access transistors for the MTJs to be switched and the pMOS select transistor are turned on, the nMOS transistor is turned off and the bit-lines are grounded. The write current Iwrite induces switching through the STT effect. For a read operation, the access transistor is set, the pMOS transistor is turned off and the nMOS transistor is turned on. The current Iread through the MTJ is passed through a sense amplifier to obtain the bit-value. Their technique successfully addresses the problem of source degradation since both Iwrite and Ierase are unidirectional. It has lower power consumption than STT-RAM and occupies less area than SOT-RAM.
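The two-step write sequence can be sketched in software; the class name and state encoding below are illustrative assumptions, not part of the design of Wang et al.:

```python
# Sketch of the two-step, unidirectional write scheme: an erase current
# through the shared heavy-metal layer sets every MTJ in the block to the
# anti-parallel (AP) state, then an STT write current flips only the
# selected cells to the parallel (P) state.
AP, P = 1, 0   # logic encoding assumed for illustration

class NandLikeBlock:
    def __init__(self, size):
        self.cells = [P] * size

    def erase(self):
        # I_erase through the heavy metal layer: whole block -> AP
        self.cells = [AP] * len(self.cells)

    def write(self, data):
        # erase first, then enable access transistors only where a P is needed
        self.erase()
        for i, bit in enumerate(data):
            if bit == P:
                self.cells[i] = P   # I_write switches this MTJ via STT

blk = NandLikeBlock(4)
blk.write([P, AP, AP, P])
assert blk.cells == [P, AP, AP, P]
```

Because cells only ever switch AP-to-P during the selective step, both currents flow in a single direction, which is the property that avoids source degradation.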

3.2

Programmable switch and logic element

Hanyu et al. [36] present spintronic components for FPGAs. They present a programmable switch and a programmable logic element. The switch, shown in Figure 11(a), is made up of MTJs to hold the programmed data, write circuits and a SA. The switch is programmed by writing to the MTJs. The SA reads the programmed data ‘M’, which is maintained at location ‘Q’. The value at ‘Q’ is used to toggle the NMOS transistor connected to the routing track. Through this method, the NMOS transistor behaves as a programmable switch that can be used in an FPGA. Once programmed, the switch retains its state even when not actively powered, thereby reducing static power consumption. Their programmable logic element is illustrated in Figure 11(b). The operation to be performed is configured in this element by programming the configuration cells, which are made up of three-terminal DWM devices. Operands are provided through the select lines of the MUX and the result is sensed through the sensing circuit. The design can be extended to N bits through the addition of more configuration cells and MUXes to accommodate more select lines. The area of both proposed devices is smaller than that of their CMOS counterparts.



Fig. 11. (a) Programmable switch and (b) programmable logic element proposed by Hanyu et al. [36]

Hanyu et al. [37] present a spintronic LUT for use in FPGAs. The proposed LUT circuit is shown in Figure 12. It comprises a CMOS logic tree made up of combinational circuits, a reference tree, a SA and MTJs for holding the programmed data. This design has a number of CMOS transistors (in the logic tree) and MTJs connected in series. To remove the impact of variation in their characteristics, the proposed design employs redundant MTJs along each series path to control the operating point of the LUT. The proposed LUT is non-volatile and thus has no standby power consumption. Despite using redundant MTJs, their LUT consumes lower area than a CMOS-only LUT since the MTJs share a single write circuit.
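Functionally, any LUT (CMOS- or MTJ-based) evaluates an N-input function by using the inputs as an index into 2^N stored configuration bits. A minimal sketch, with ordinary Python data standing in for the programmed MTJ states:

```python
# Sketch of LUT evaluation: entry k of the configuration storage holds the
# function value for the input pattern whose binary encoding is k.
class LUT:
    def __init__(self, truth_bits):
        self.bits = truth_bits            # 2**N programmed bits for N inputs

    def evaluate(self, *inputs):
        index = 0
        for bit in inputs:                # inputs form the binary index, MSB first
            index = (index << 1) | bit
        return self.bits[index]

xor2 = LUT([0, 1, 1, 0])                  # program a 2-input XOR
assert [xor2.evaluate(a, b) for a in (0, 1) for b in (0, 1)] == [0, 1, 1, 0]
```

Reprogramming the stored bits changes the implemented function without changing the evaluation circuitry, which is exactly what makes the non-volatile MTJ storage attractive here.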


Fig. 12. LUT design proposed by Hanyu et al. [37]

3.3

Multiplexer and encoder

Kumar et al. [39] present a 2×1 MUX and a 1×2 DEMUX which are designed using STT-MTJs. Figure 13(a) shows the design of the MUX. It is made up of two MTJs whose free layers are connected by an NMC (nanomagnetic channel). The two inputs to the MUX are the currents I0 and I1. The directions of these currents denote the logic state which is stored in the MTJs. Once the orientation of the MTJs is set, the select-current Is is passed through the select-line, which determines the output, represented as the new state of MTJ A. If the select-current Is flows from S to ground, the value of MTJ B is transported to MTJ A due to communication between the MTJs via the NMC. On flow of current in the opposite direction, the state of MTJ A remains unchanged.


Fig. 13. (a) 2×1 MUX [39] where I0 and I1 are the inputs while S is the select line. (b) 1×2 DEMUX where S is the select line and I is the input which is routed to either of the outputs D0 or D1 . ‘0’ represents the current input corresponding to logic ‘0’.

Figure 13(b) shows the design of the DEMUX. The output is obtained as the logic state of either D0 or D1 depending on the select current Is, the input current I and the reference current Iref. The direction of Iref is kept constant and equal for both MTJs. The proposed design is an MTJ-only design and also demonstrates logic


communication between MTJs through NMCs, making it possible to implement more complex devices. It is superior to a CMOS MUX in terms of area and energy-delay efficiency. Deb et al. [63] propose two racetrack memory based encoder/decoder designs, one of which follows a dynamically reconfigurable encoding scheme while the other follows a fixed encoding scheme. Both designs require N racetracks to implement an N-bit design. In the first design, a control signal toggles the device between read and write modes. The encoding scheme is stored in the form of binary data on the racetrack. By changing this data, the encoding scheme can be changed dynamically. Data is written using the read/write MTJ, and then the DW is shifted so that the next domain is available for writing. In read mode, the encoded output is obtained at a SA by sensing the read/write MTJ on the racetrack. Their second design implements a fixed encoding scheme and lacks the write-circuit present in the first design; thus, the encoding scheme is manually written into the racetracks and cannot be changed dynamically. This design trades reconfigurability for lower area and better energy efficiency. The proposed devices are meant for use in interconnects and buses, where suitable encoding schemes can reduce power consumption. The results show that at each bit-width, the reconfigurable design consumes higher leakage power than the CMOS-only implementation. Larger bit-width designs are slower due to the increased size of the SA, whereas the smaller designs have greater operating speeds than CMOS-only designs. However, the non-reconfigurable design has both lower energy consumption and higher operating speeds than CMOS-only designs at all bit-widths. Huang et al. [57] present a racetrack memory based PIM accelerator to implement Boolean logic functions and basic devices like adders.
The basic design consists of three racetrack strips, two of which hold operands while the third ‘reference racetrack’ holds ‘reference data’. Operations to be performed are configured by programming the reference cells, while the ‘reserved cells’ provide an extra cell so that data is not lost while shifting. The use of binary data for configuration makes the proposed design easier to program than accelerators that use explicit voltage values, since programming is equivalent to writing into racetracks. Their design can implement AND, OR, NAND and NOR operations. Figure 14 shows the circuit of a two-input AND/OR gate. While the circuit for both operations is the same, they differ in the contents of the reference cells. When the reference data is ‘10’ the circuit behaves as an AND gate, and when the data is ‘01’ the circuit behaves as an OR gate. If the reference cells are removed, the resulting circuit acts as a buffer. Their design incorporates a “shift-only” approach such that data is written into the racetrack only once, while the rest of the operations are implemented using only bit-shifts. It also employs a “verify before shift” approach that stops shifting if the stored and input signals are of the same logic state. Both approaches reduce the number of write operations significantly.


Fig. 14. Configurations of the two-input gate proposed by Huang et al. [57] for achieving (a) AND and (b) OR operations
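The “verify before shift” rule described above can be sketched as a simple check that counts how many domains actually need to switch; the function below is an illustrative model, not the authors' circuit:

```python
# Sketch of "verify before shift": a domain is re-written/shifted only when
# the incoming bit differs from the stored bit, so runs of equal bits cost
# no write operations.
def writes_needed(stored, incoming):
    return sum(1 for s, i in zip(stored, incoming) if s != i)

stored   = [1, 0, 1, 1, 0, 0, 1, 0]
incoming = [1, 0, 0, 1, 0, 1, 1, 0]
assert writes_needed(stored, incoming) == 2   # only two domains actually switch
```

For data with long runs of repeated bits, the saved operations dominate, which is where the reported latency and energy gains come from.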

Their design is optimized for reconfigurability and allows three methods to program the operations. The first is initial configuration, similar to that of an FPGA, where all operations are configured before execution begins. Although simple to implement, this technique has limited flexibility. The second method involves using shifting to change the values of the reference cells. This method allows higher flexibility. The third method employs shifting the operands instead of the reference values. By virtue of using the “shift-only” and “verify before shift” strategies, their technique incurs low latency and energy consumption. Of the three above-mentioned methods of reconfiguration, the second method of dynamically reconfiguring


reference-cell values is the fastest. The proposed accelerator offers a low-power, reconfigurable and high-speed implementation of logic circuits within memory.

3.4

Random number generator

Random number generators work on the stochastic properties of spintronic circuits. Table 5 shows the specific MTJ parameter which is stochastic in nature and is leveraged for designing stochastic circuits. Table 5 also shows the parameters which are used as knobs to control the stochastic nature of MTJs. We now review works that propose spintronic random-number generators.

TABLE 5
Classification of stochastic computing architectures

Category                                                      Reference
The MTJ parameter which is stochastic:
  Switching delay                                             [35]
  Switching probability                                       [31, 34]
Parameters varied to exploit stochastic properties of MTJs:
  Write current                                               [31, 35]
  Operating current                                           [34]
  Variations in tunneling and free layer thickness            [34]

Naviner et al. [34] propose a random number generator using an STT-MTJ. The probability of switching of an MTJ depends on the switching time, operating current and critical current. Deterministic switching occurs when the operating current is greater than the “critical current”, whereas probabilistic behavior is obtained by keeping the operating current lower than the critical current. The variations in the thickness of the oxide layer and free layer of the MTJ result in a non-uniform tunneling magnetoresistance ratio, because of which STT switching is intrinsically stochastic. Both the above-mentioned phenomena are leveraged to achieve stochastic behavior in an MTJ. The stochastic behavior is used to implement a 1-bit random number generator. They show that an implementation of a polynomial function with their number generator consumes much less area than one using binary signals. This demonstrates the possibility of area optimization in stochastic logic circuits as compared to their binary counterparts. Wang et al. [35] propose a “true random number generator” based on the STT-MTJ. The proposed design exploits the fact that, due to thermal fluctuations and magnetizations, the switching delay of an MTJ is stochastic in nature. Figure 15 shows the architecture of their proposed random number generator. The random-write circuit consists of two MTJs, one for the reference value and the other for number generation. The SAs are equipped with generating circuitry that produces a write current according to the random number generation probability.
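The area advantage claimed for stochastic logic comes from encoding a value as the ones-density of a random bitstream, such as an MTJ-based generator would produce; multiplication then reduces to a single AND gate per bit. A sketch (the probabilities and stream length are chosen arbitrarily for illustration):

```python
# Sketch of stochastic-logic multiplication: each value is the probability
# of a '1' in a random bitstream, and a per-bit AND of two independent
# streams yields a stream whose ones-density approximates the product.
import random

def bitstream(p, n, rng):
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(7)
n = 100_000
a, b = bitstream(0.8, n, rng), bitstream(0.5, n, rng)
product = [x & y for x, y in zip(a, b)]       # one AND gate in hardware
estimate = sum(product) / n
assert abs(estimate - 0.8 * 0.5) < 0.01       # approximates 0.4
```

A binary multiplier of equivalent precision needs many gates; here the hardware cost is one gate plus the stream generators, at the price of long streams and approximate results.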


Fig. 15. Block diagram of the random-number generator presented by Wang et al. [35]

To achieve a bitstream with ideal randomness, i.e., 50% probability of ones and zeros, a correction circuit composed of counters and comparators is used. The circuit works in three phases: (1) in the “reset phase”, both MTJs are set to initial low-resistance values; (2) in the “writing phase”, the current generated by the generating circuitry is used to write to the MTJ; and (3) in the “sensing phase”, the random bit is sensed at the output of the SA. The generated bit-stream is passed to the correction circuit, which produces a control signal to tune the write current for the next cycle and thus helps in achieving an ideal probability of ones and zeros. The proposed design uses an intrinsic phenomenon instead of physical


imperfections as the source of entropy. This reduces the amount of post-processing required to ensure high reliability. Hence, it achieves high performance and tolerance to variability without additional area overhead.
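The counter/comparator feedback loop can be modeled in a few lines; the adjustment step and check interval below are illustrative assumptions, with a tunable probability standing in for the tuned write current:

```python
# Sketch of the correction loop: a counter tracks the ones-density of the
# generated bit-stream and a comparator nudges the write probability
# (standing in for the write current) toward 50% ones.
import random

rng = random.Random(1)
p_write = 0.30                 # initially biased generator
ones = total = 0
for cycle in range(20_000):
    bit = 1 if rng.random() < p_write else 0
    ones += bit
    total += 1
    if total % 64 == 0:        # periodic counter/comparator check
        if ones / total > 0.5:
            p_write -= 0.01    # too many ones: lower the write current
        else:
            p_write += 0.01    # too few ones: raise it
final_ratio = ones / total
assert 0.45 < final_ratio < 0.55
```

Even starting from a strongly biased device, the feedback drives the cumulative ones-density toward the ideal 50%, which is the role of the correction circuit in the design of Wang et al.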

4

SPINTRONIC ARITHMETIC UNITS

In this section, we discuss various arithmetic units such as the (precise) adder (Section 4.1), approximate adder (Section 4.2), multiplier (Section 4.3), majority gate-based designs (Section 4.4) and LUT designs (Section 4.5). Table 6 classifies these works on several important parameters. We now review these works.

TABLE 6
A classification of arithmetic units

Category                 Reference
Adder                    [17, 21, 43, 44, 52, 55–57, 66]
Multiplier               [56]
Arithmetic logic unit    [38, 40–42]
LUT                      For multiplication [58, 65, 67], transcendental functions [26, 59], Boolean functions [58], used in FPGA [37]

Approximate computing approaches:
  Approximate adder: ignoring carry-in (Ci) [33], inexact writing of one input [33], taking the complement of carry-out (Co) as the sum [66]
  Achieving transcendental functions with LUT [26, 59]
  Fixed-point instead of floating-point [60]

4.1

Adder designs

Roohi et al. [52] present a spintronic adder based on SOT-MTJ. A 1-bit full adder is implemented through the formation of MJGs. It consists of three SOT-MTJs, a SA and write circuits. Two of the MTJs form 3-input MJGs while the third MTJ forms a 5-input MJG. The SOT-MTJ has lower latency and energy requirements than the STT-MTJ. The adder can be extended to N bits and is intended for use in spintronic ALUs. They show that the proposed 1-bit adder has lower static and dynamic power consumption and smaller area than a CMOS-only adder. However, it is slower than the CMOS-only adder since the switching latency of SOT-MTJs is higher than that of CMOS transistors. Roohi et al. [17] present a full adder based on 3-terminal domain wall devices. The proposed design uses MJGs to formulate the sum and output-carry functions of the adder. For a 1-bit full adder, it makes use of one 3-input MJG and one 5-input MJG. It has two SAs to read the outputs, one for sum and the other for carry. The adder can work in two modes. If a low current is used, it functions with low power consumption but also lower speed. On the other hand, using a current of higher magnitude results in higher operating speed but also higher power consumption. Their proposed adder has lower area and design complexity than CMOS-only designs. Huang et al. [18] present an adder based on racetrack memory. One racetrack is used per input or output signal, and a DEMUX is used for sharing the inputs between the “sum” and “carry” operations. The 1-bit adder is made of two circuits, one for computing the sum as shown in Figure 16(a) and one for the output carry as shown in Figure 16(b). This design is extended to multiple bits by replacing the MTJs used for storing inputs with racetracks. The circuits for computing both sum and output carry are the same; the only difference is in the values of the reference voltage Rref used by them. The corresponding values of Rref for the two circuits are given in Figure 16.
By virtue of its sharing of racetracks and demultiplexing strategies, their adder consumes low area and energy. Trinh et al. [55] propose a racetrack memory based “multi-bit adder”. The building block of their multi-bit adder is a one-bit “full adder”, which is shown in Figure 17(a). The carry is evaluated as a majority function and implemented by connecting in series all the inputs in one branch and their complements in another branch. The proposed adder is different from spintronic adders such as [17, 28, 44, 52] since all the operands are stored in MTJs and no logic-tree-like circuitry is involved.
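The majority-gate synthesis underlying these adders can be checked in software: the carry is a 3-input majority of the operands, and one standard identity forms the sum with a 5-input majority that reuses the complemented carry twice. A sketch (this specific identity is an assumption, though it matches the 3-input-plus-5-input MJG structure described above):

```python
# Sketch of majority-gate full-adder synthesis:
#   Co  = MAJ3(A, B, Ci)
#   Sum = MAJ5(A, B, Ci, NOT Co, NOT Co)
def maj(*bits):
    return 1 if sum(bits) > len(bits) // 2 else 0

def full_adder_mjg(a, b, ci):
    co = maj(a, b, ci)                      # 3-input majority gate
    s = maj(a, b, ci, 1 - co, 1 - co)       # 5-input majority gate
    return s, co

for a in (0, 1):
    for b in (0, 1):
        for ci in (0, 1):
            assert full_adder_mjg(a, b, ci) == ((a + b + ci) & 1, (a + b + ci) >> 1)
```

The exhaustive check confirms the identity over all eight input patterns, which is why a single 3-input and a single 5-input MJG suffice for a 1-bit full adder.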



Fig. 16. 1-bit full-adder proposed by Huang et al. [18]. (a) Implementation of output carry (b) Implementation of summation


Fig. 17. (a) 1-bit full-adder proposed by Trinh et al. [55]. Sum and Co are the summation and output carry respectively, obtained from addition of A, B and Ci . (b) Multi-bit racetrack adder [55]

They further extend their single-bit adder to a multi-bit adder by replacing individual MTJs with racetracks, as shown in Figure 17(b). Multi-bit inputs and their complements are stored in separate racetracks. At the positive edge of a synchronizing clock pulse, the output carry is calculated and used as an input for the next operation. At the negative edge, all racetracks are shifted by one bit to bring a new set of inputs under the read and write heads. The racetracks allow storing multiple bits of data on the same racetrack, thereby making it easier to perform multi-bit operations. The multi-bit inputs are written onto the racetracks once and then shifted, thereby reducing the number of write operations. Their proposed multi-bit adder consumes lower area and energy than a CMOS-only adder. Comments: Unlike the adder proposed by Trinh et al. [55], the adder of Huang et al. [18] does not store the complements of the inputs on racetracks. Secondly, it shares the inputs between the carry and sum circuits through demultiplexing. Due to these strategies, the adder proposed by Huang et al. provides higher performance and energy efficiency than that proposed by Trinh et al. An et al. [21] present a full adder based on all-spin logic. Their design utilizes graphene-based LSVs to form majority logic gates which, in turn, implement the addition operation. The sum and carry are generated using conventional majority-gate synthesis. The proposed adder can be extended to N bits through simple cascading, similar to the strategy used in a ripple-carry adder. However, this causes an increase in the length of the NMC, resulting in higher operational delays. This limitation can be mitigated through the use of a carry look-ahead adder design. Compared to a CMOS adder, their proposed adder has higher dynamic energy consumption but lower area and near-zero standby power consumption. Matsunaga et al. [28] present an STT-MTJ based full adder. Its general architecture is illustrated in Figure 18.
It comprises a SA, a “dynamic current source” (DCS), a “logic tree” and two MTJs. The DCS


cuts off the flow of steady current to reduce power dissipation. The “logic tree” consists of a CMOS circuit that determines which operation is to be performed. By changing the logic tree, different operations such as the Boolean functions AND and OR are implemented. Of the three operands required for addition, only one is stored in the MTJ; the other two are provided dynamically during execution. Results show that the proposed circuit consumes lower area and energy than a CMOS-only implementation. This is because of the reduced number of current paths and reduced static power dissipation.


Fig. 18. Full adder proposed by Matsunaga et al. [28]

Deng et al. [44] propose a SOT-MTJ based full adder. The proposed adder circuit is based on a hybrid CMOS-MTJ model consisting of two MTJs to hold complementary inputs, write circuits to write into the MTJs, a logic-tree circuit that determines the operation to be performed, and a sensing circuit to read out the output. This structure bears similarity to that of the STT-MTJ based adder proposed by Matsunaga et al. [28]. However, their SOT-MTJ based design has lower write latency than the STT-MTJ based design. The proposed adder is capable of achieving sub-nanosecond switching with low write energy. Their design has lower latency and energy than conventional STT-MTJ based adders. Comments: In the designs of Deng et al. [44] and Matsunaga et al. [28], the STT-MTJs or SOT-MTJs are used only for storing operands, while most of the logic is implemented through the CMOS logic tree. Lokesh et al. [38] present a spintronic ALU based on a full adder. Bitwise operations such as AND, OR, XOR and XNOR are implemented by modifying the full adder. Figure 19 shows the truth table of a full adder and subtractor. From this, it is observed that for the first four combinations, when Z=0, (1) both the sum and the difference are equal to A XOR B, (2) the carry is equal to A AND B and (3) the borrow is equal to (NOT A) AND B. Similarly, for the last four combinations, when Z=1, (1) both the sum and the difference are equal to A XNOR B, (2) the carry is equal to A OR B and (3) the borrow is equal to (NOT A) OR B. Hence, by modifying a full adder and subtractor circuit, two-input AND, OR, XOR and XNOR functions are obtained. Input Z can be used as (1) a control signal to implement AND, OR and XOR/XNOR, (2) the input carry in the case of addition and (3) the input borrow in the case of subtraction. Their design offers the possibility of developing completely non-volatile computing systems with zero start-up time.


Fig. 19. Table describing the use of full adder to perform AND, OR, XOR and XNOR operations in the technique of Lokesh et al. [38]
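The observations of Lokesh et al. can be verified against the standard full-adder and full-subtractor equations; the sketch below pins the third input Z to 0 or 1 and checks which two-input function each output realizes:

```python
# Sketch verifying which two-input Boolean function each adder/subtractor
# output realizes when the third input Z is pinned (standard full-adder and
# full-subtractor equations for a+b+z and a-b-z assumed).
def full_add_sub(a, b, z):
    s = a ^ b ^ z                                   # sum == difference
    carry = (a & b) | (z & (a ^ b))                 # carry-out of a+b+z
    borrow = ((1 - a) & b) | (z & (1 - (a ^ b)))    # borrow-out of a-b-z
    return s, carry, borrow

pairs = ((0, 0), (0, 1), (1, 0), (1, 1))
# With Z = 0: sum is XOR, carry is AND, borrow is (NOT A) AND B
assert [full_add_sub(a, b, 0)[0] for a, b in pairs] == [0, 1, 1, 0]
assert [full_add_sub(a, b, 0)[1] for a, b in pairs] == [0, 0, 0, 1]
# With Z = 1: sum is XNOR, carry is OR
assert [full_add_sub(a, b, 1)[0] for a, b in pairs] == [1, 0, 0, 1]
assert [full_add_sub(a, b, 1)[1] for a, b in pairs] == [0, 1, 1, 1]
```

This exhaustive check shows how a single adder/subtractor datapath doubles as a small Boolean logic unit simply by fixing Z.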

Patil et al. [41] present a technique for efficiently combining spintronic logic units into larger blocks.


Using this technique, they propose a spintronic ALU based on the design of a “full adder”. The output of a spintronic logic circuit is determined by the state of every element in series. As a result, spintronic circuits cannot be combined in the same manner as CMOS-based logic circuits. For example, on designing a 1-bit adder-subtractor by directly combining a spintronic 1-bit subtractor and a 1-bit adder, two problems are observed. The first is the need for control inputs to distinguish between addition and subtraction. The second is the necessity of a switching circuit to select carry or borrow while extending the design to N bits. To mitigate these problems, they propose a “neutralization” technique which involves switching off a certain part of the combined circuit so that the desired result is obtained. The neutralization technique allows building complex spintronic logic circuits. Three methods are proposed to achieve neutralization, as illustrated in Figure 20: (1) using control inputs on the MTJ. This strategy is valid only for MTJs having a single input. Two control signals X and Y are used to control the state of the MTJ. For example, if X and Y are equal, then the state of R1 is the same as X; otherwise, R1 takes the state of input A. (2) Using an NMC to change the state of an MTJ so that both terminals of an SA have the same resistance. For example, the state of MTJ R1 is made equal to the state of MTJ R2 by transferring the state from R3 using the NMC.


Fig. 20. Illustration of different techniques of ‘neutralization’ proposed by Patil et al. [41]. (a) Neutralization using control signals (b) Neutralization using STT (c) Neutralization using logic

(3) The third method is based on specific observations which apply only to certain operations. For example, for achieving the XOR operation, it can be noted that if the MTJs R1 and R2 attached to the terminals of an SA have different states, the output is ‘1’, but if they have the same state, the output is ‘0’. Their proposed ALU uses these three methods to perform addition, subtraction and Boolean operations. The ALU can be extended to N bits. It consumes lower area and energy than a CMOS-only ALU and achieves comparable speed; however, it requires more control inputs. Ren et al. [43] present an energy analysis of a 1-bit STT-MTJ based adder circuit. They compare the MTJ adder with static and dynamic CMOS designs. The MTJ-based adder is similar in design to the logic-tree-based adders presented by Matsunaga et al. [28]. One of the inputs to the adder is stored in the MTJ. It is noteworthy that the static CMOS adder requires the least number of transistors while the MTJ-based adder requires the highest number of transistors. Simulations show that the dynamic-CMOS and MTJ-based adders have higher energy efficiency than the static-CMOS adder. However, the dynamic-CMOS adder provides a superior “energy-delay tradeoff” than the MTJ-based adder since MTJ switching consumes high energy.

4.2

Approximate adder designs

Angizi et al. [66] present a spintronic adder circuit which is designed with 3-terminal DWM devices. The proposed adder can perform both approximate and accurate computations. The DWM devices are used to implement MJGs as shown in Figure 21, which in turn form the adder circuit. While the accurate adder is implemented with conventional MJG synthesis, the approximate adder keeps Co = Majority(A, B, Ci) exact and approximates the sum as Sum = NOT(Co). Table 7 shows the truth table of their adder. Clearly, while Sum is wrong for two out of eight cases (marked ✗), Co is correct for all cases.


TABLE 7
Truth table of the adder proposed by Angizi et al. [66]. ✗ marks the two cases where the approximate Sum differs from the accurate Sum.

A  B  Ci   Accurate Co  Accurate Sum   Approx. Co  Approx. Sum
0  0  0    0            0              0           1 ✗
0  0  1    0            1              0           1
0  1  0    0            1              0           1
0  1  1    1            0              1           0
1  0  0    0            1              0           1
1  0  1    1            0              1           0
1  1  0    1            0              1           0
1  1  1    1            1              1           0 ✗
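The approximation in Table 7 can be replayed in a few lines: the carry is the exact 3-input majority and the sum is taken as its complement, which errs in exactly two of the eight input cases:

```python
# Sketch checking the approximation: Co is the exact 3-input majority, and
# Sum is approximated by NOT(Co), which is wrong only when A=B=Ci=0 or
# A=B=Ci=1.
def approx_adder(a, b, ci):
    co = 1 if a + b + ci >= 2 else 0     # exact majority carry
    return 1 - co, co                    # Sum approximated as NOT Co

errors = 0
for a in (0, 1):
    for b in (0, 1):
        for ci in (0, 1):
            s, co = approx_adder(a, b, ci)
            assert co == (a + b + ci) >> 1          # carry always exact
            errors += int(s != (a + b + ci) & 1)
assert errors == 2
```

Two wrong sums out of eight input patterns is the error level that makes this adder acceptable for error-tolerant workloads such as image processing.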


Fig. 21. 3-input MJG proposed by Angizi et al. [66]

Both approximate and accurate computation modes have nearly equal delays, but this delay is significantly greater than that of a CMOS-only adder. By using pipelining, the delay can be slightly reduced. The approximate mode consumes less power than the accurate mode, while in both modes, the proposed adder has lower energy consumption than its CMOS-only counterpart. They illustrate the use of their adder for evaluating the discrete cosine transform on images. The LSBs of pixel values are processed in approximate mode while the MSBs are processed in accurate mode. Changing the number of LSBs and MSBs that use approximate and accurate processing provides a varying degree of approximation [73]. They show that for all levels of approximation, an implementation of the discrete cosine transform on their proposed platform consumes lower energy than that on CPU and CMOS-only platforms. Comments: The adders proposed by Angizi et al. [66] and Roohi et al. [17] both use 3-terminal DW motion devices and rely on MJGs for functioning. However, the former design is able to perform both approximate and accurate computations, which makes it suitable for error-tolerant applications such as image processing. Cai et al. [33] present two approximate full adders based on STT-MTJs. Their first design is implemented using reduced logic complexity and is shown in Figure 22. Let A, B and Ci be the inputs, and Sum and Co be the sum and output carry, respectively. A is provided during computation while input B is written to the MTJ. Their first adder ignores the input carry Ci while calculating the sum. Thus, the sum is computed as Sum = A XOR B. The output carry Co is computed accurately without ignoring Ci. Their second design, also shown in Figure 22, can compute both accurate and approximate outputs. It operates on inexact writing of input B by providing an insufficient write current that is less than the “critical current” of the MTJ.
Unlike their first design, this design does not ignore Ci while evaluating the sum. To facilitate comparison, a parameter termed “error distance” is used, which provides a bit-by-bit comparison between the approximate output (x) and the accurate output (y) for all possible combinations of adder inputs. The error distance is computed as ED(x, y) = |Σ_p x[p]·2^p − Σ_q y[q]·2^q|, where p and q are the indices of the bits of x and y, respectively. They show that the error distances of the first and second adders are 4 and 6, respectively, and thus, the first adder is more accurate. Further, both approximate adders consume lower dynamic and leakage power than CMOS approximate adders. Among the two,


Fig. 22. Design of the second adder proposed by Cai et al. [33] implemented using low write current. The first adder, which is based on reduced logic complexity, is implemented by excluding the circuitry present inside the dotted lines.

the second approximate adder, which operates on inexact writing, consumes lower energy but has much higher delay than the adder operating on reduced logic.
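The reported error distance of the first adder can be reproduced from the formula above by summing |approximate − accurate| over all eight input combinations:

```python
# Sketch computing the total error distance of the first approximate adder
# of Cai et al., which computes Sum = A XOR B (ignoring Ci) while keeping
# the carry exact; the (Co, Sum) pair is read as the 2-bit value 2*Co + Sum.
def error_distance_adder1():
    ed = 0
    for a in (0, 1):
        for b in (0, 1):
            for ci in (0, 1):
                exact = a + b + ci                   # exact value of (Co, Sum)
                approx_sum = a ^ b                   # Ci ignored in the sum
                approx_co = 1 if a + b + ci >= 2 else 0
                ed += abs((2 * approx_co + approx_sum) - exact)
    return ed

assert error_distance_adder1() == 4   # matches the reported error distance
```

Each of the four Ci = 1 input patterns contributes an LSB error of 1, and the Ci = 0 patterns are exact, giving the total of 4.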

4.3

Multiplier designs

Luo et al. [56] present a DWM-based multiplier. The proposed design is based on radix-4 Booth multiplication since it provides highly efficient binary multiplication. Booth multiplication works by parallel calculation of partial products and their summation [74]. For this purpose, the multiplier bits are divided into groups of three bits such that they overlap by one bit. The partial product is generated based on the radix-4 encoding scheme given in Table 8, and an illustration is given in Figure 23. The multiplicand is stored on a single racetrack strip while each bit of the multiplier is stored on a different strip so that they can be accessed concurrently to provide a high degree of parallelism.


Fig. 23. (a) Illustration of Booth multiplication and (b) pipelined addition used in the multiplier design by Luo et al. [56]

The proposed design uses a pipelined approach for addition, where the partial products are stored in strips with multiple access ports, as shown in Figure 23(b). Each pair of strips is associated with three adders, two of which are located at the ends of the racetrack and one in the center. The adders on the left and right each take two operands, and their results are summed by the adder in the middle. This implements the addition of partial products with a minimal number of racetrack strips. By virtue of the Booth multiplication algorithm, parallel fetching of the multiplier and subsequent pipelined addition, their design performs high-speed in-memory multiplication with minimal hardware.
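The radix-4 Booth flow described above can be sketched in software (a behavioural model of the encoding and partial-product summation, not of the racetrack hardware; names are ours):

```python
# Radix-4 Booth encoding table: each overlapping 3-bit group (X Y Z) of
# the multiplier maps to a multiplication factor in {-2, -1, 0, +1, +2}.
BOOTH_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_factors(multiplier, nbits):
    # Append a padding 0 below the LSB, then scan 3-bit groups that
    # overlap by one bit, stepping two bit positions at a time.
    m = multiplier & ((1 << nbits) - 1)   # two's-complement bit view
    bits = [0] + [(m >> i) & 1 for i in range(nbits)]
    return [BOOTH_TABLE[bits[i] | (bits[i + 1] << 1) | (bits[i + 2] << 2)]
            for i in range(0, nbits, 2)]

def booth_multiply(multiplicand, multiplier, nbits=8):
    # Partial product i is (factor_i * multiplicand) << 2i; the hardware
    # sums these with the pipelined adders, here we simply add them up.
    return sum(f * multiplicand << (2 * i)
               for i, f in enumerate(booth_factors(multiplier, nbits)))

print(booth_multiply(-73, 90))  # -6570, the example shown in Figure 23
```

Note that only nbits/2 partial products are generated, which is the source of Booth multiplication's efficiency over a bit-serial approach.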

TABLE 8
Encoding scheme for calculating the partial product [56]. X, Y and Z are the multiplier bits, taken in overlapping groups of three. The partial product is obtained by multiplying the "multiplication factor" with the multiplicand.

X Y Z   Multiplication factor
0 0 0    0
0 0 1    1
0 1 0    1
0 1 1    2
1 0 0   -2
1 0 1   -1
1 1 0   -1
1 1 1    0

4.4 Majority gate based designs

Butzen et al. [30] propose an STT-MTJ based spintronic "majority voting" circuit which is used in "triple modular redundant" architectures for achieving fault tolerance. It consists of three MTJs to store the operands, an SA and writing circuits. Initially, the input voltages are compared with the reference voltage, which results in a current that writes the inputs to the MTJs. Next, the stored values are read and the majority function is evaluated by the SA through comparison with an empirically determined reference value. The proposed design has low power consumption since the MTJs have near-zero standby power dissipation. Due to the low read latency of the MTJs, their design has high performance. A fault in writing to one of the MTJs is tolerated, since the other two inputs provide the correct output while evaluating the majority function. This, combined with the process-variation tolerance of the devices, makes the proposed circuit highly reliable.

An et al. [42] present two all-spin logic based ALU designs. Both designs use 5-input MJGs as fundamental blocks and can perform the same set of add, subtract, increment, decrement and Boolean operations. The first design, shown in Figure 24(a), is constructed using the all-spin logic circuit design method. It uses three MJGs, two control signals and ten select lines for the MUX. The second design is constructed by realizing the basic functions as a combination of a full adder and a multiplexer. As shown in Figure 24(b), this design uses 14 MJGs, three control signals and only two select lines. Results show that, of the two designs, the former is superior in terms of energy efficiency, area and operational speed since it requires far fewer MJGs. However, configuring the first design is a challenging and tedious task since it has ten select lines on the MUX. Also, constructing a control unit to integrate the first design into a computing system leads to high design complexity.

Yao et al. [40] propose a spintronic ALU that can perform addition, subtraction and basic Boolean logic operations. It is built on a three-input MTJ element, as shown in Figure 25(a). All three input currents A, B and Z have magnitudes greater than the "critical switching current", and the direction of each current denotes the high '1' or low '0' logic state. The sum of the three currents is responsible for switching the MTJ. The MTJ state (M) is the output of the Boolean function M = A · B + (A + B) · Z. By using Z as a control signal and grounding the top electrodes, the MTJ can be used to implement AND (Z = 0) and OR (Z = 1). Three such MTJs are combined to form a "fundamental logic unit" of the ALU, as illustrated in Figure 25(b). The three MTJs are connected through two NMCs that act as media for transferring logic states between the MTJs. By activating the control signals M1 and M2, the input is communicated to the output via the NMCs. By combining the fundamental units, their proposed ALU can perform addition, subtraction and Boolean operations. The ALU operates in three steps: (1) the inputs are programmed, (2) the control signals are activated for the required operation, and (3) the output is read by the SA. Comment: Spin ALUs, such as the one proposed by An et al. [42], rely only on spin currents and not on charge-based currents. The ALUs proposed by Patil et al. [41], Yao et al. [40] and Lokesh et al. [38] are spintronic designs since they make use of both spin and charge based currents. Also, the design of An et al. [42] is capable of performing both 3-input and 2-input AND/NAND, OR/NOR and XOR/XNOR operations, while the other proposed ALUs [38, 40, 41] are restricted to 2-input operations.
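The control-signal behaviour of this three-input MTJ can be checked with a few lines (a truth-table model only; the physical device switches on the sum of the input currents):

```python
def mtj_output(a, b, z):
    # M = A*B + (A + B)*Z -- equivalently, the majority of A, B and Z
    return (a & b) | ((a | b) & z)

# With Z used as a control signal the same element acts as AND or OR:
for a in (0, 1):
    for b in (0, 1):
        assert mtj_output(a, b, 0) == (a & b)   # Z = 0 -> AND
        assert mtj_output(a, b, 1) == (a | b)   # Z = 1 -> OR
```

This majority behaviour is also why the same element can serve as the carry stage of a full adder, where Co = majority(A, B, Ci).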

Fig. 24. (a) ALU constructed using the all-spin logic circuit design method and (b) ALU constructed using majority gate synthesis, as proposed by An et al. [42]. Ipos, Ineg and Izero denote spin currents with positive, negative and zero spin, respectively.
Fig. 25. (a) MTJ proposed by Yao et al. [40] that performs Output = A · B + (A + B) · C (b) Fundamental logic unit of ALU [40]

4.5 LUT designs

Yu et al. [59] propose a DW nanowire based NN accelerator. They map an "extreme learning machine based super-resolution" (ELMSR) algorithm to their accelerator. The computations performed most frequently by this algorithm are weighted summation and the sigmoid function. To facilitate these two computations, the accelerator is equipped with two types of PIM units, XOR units and LUTs, both of which are designed with DWM nanowires [58]. Weighted summation is implemented by adders and multipliers composed of XOR units, while the sigmoid function is implemented with the help of a nanowire LUT. The overall PIM architecture has an H-tree structure similar to that used by Angizi et al. [64]. PIM logic elements are distributed and integrated with memory units so as to reduce communication with the external processor and provide thread-level parallelism. Their NN architecture achieves lower energy consumption and better throughput than processing on a CPU.

Wang et al. [58] present a DW nanowire based PIM accelerator that performs multiplication for big-data applications. The proposed architecture makes use of LUTs to implement Boolean logic functions, while DW shifting is used to directly implement XOR and bit-shift operations. The structure of the XOR logic unit is shown in Figure 26(a). The operands are stored in separate nanowires. The operation is performed on a special read-only cell that has a structure similar to an STT-MTJ, but in which both ferromagnetic layers are free. Each of the two operands (A and B) is connected to one of the ferromagnetic layers of the read-only cell and is shifted into it. The resulting orientation of the free layers gives A XOR B. An LUT is implemented on a single nanowire by dividing it into two segments, as shown in Figure 26(b): a "data segment" stores the programming of the LUT, while a "reserved segment" provides extra space so that the data is not lost while shifting. The read/write head acts as the sensing port.


Fig. 26. (a) XOR logic unit (b) LUT design proposed by Wang et al. [58]
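As a behavioural sketch, an LUT-based sigmoid of the kind used by Yu et al. [59] can be modelled as a table of pre-computed values indexed by a quantized input (the entry count and input range here are illustrative assumptions, not the paper's parameters):

```python
import math

def build_sigmoid_lut(n_entries=16, x_min=-4.0, x_max=4.0):
    # Programmed once into the nanowire "data segment": one sigmoid value
    # per input interval (entry count and range are illustrative choices).
    step = (x_max - x_min) / n_entries
    return [1.0 / (1.0 + math.exp(-(x_min + (i + 0.5) * step)))
            for i in range(n_entries)]

def lut_sigmoid(x, lut, x_min=-4.0, x_max=4.0):
    # At inference time, a table read replaces evaluating exp().
    idx = int((x - x_min) / (x_max - x_min) * len(lut))
    idx = max(0, min(len(lut) - 1, idx))   # clamp out-of-range inputs
    return lut[idx]
```

More entries (a longer data segment) trade nanowire area for approximation accuracy.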

An array of LUTs along with row and column decoders works as a PIM platform, as shown in Figure 27. Multiplication for big-data applications is implemented using the "MapReduce" technique, in which a single multiplication of large vectors is broken down into multiplications of smaller vectors, and the intermediate results are combined to give the final result. For this purpose, the LUTs in the array are divided into three groups: one group of LUTs is configured for multiplication, another group for Boolean logic operations, while a third group is configured to function as a controller.

Fig. 27. LUT array proposed by Wang et al. [58]

The mapping of multiplication is illustrated in Figure 28. The multiplication workload is compiled into a list of tasks and saved in memory to facilitate concurrent operations. The matrix M is broken into units comprising rows only, such that every task requires only a "dot product" of vectors. The controllers fetch tasks and the corresponding data from the queue and dispatch them to the mappers. This is an iterative process that continues until the queue is empty. Each result of a mapper is examined and combined with related results by the reducer until no further combination is possible. The final result is written back to memory. Their accelerator has higher latency but greater throughput compared to an implementation on a multicore platform. Also, it achieves higher density and lower power consumption by virtue of the non-volatile nature of the nanowires.
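The map/reduce decomposition described above can be sketched as follows (the task format and function names are ours, not the paper's):

```python
def map_tasks(matrix, vector):
    # "Map" phase: matrix M is broken into row units so that every task
    # is a dot product of two vectors (hypothetical task format).
    return [(i, row, vector) for i, row in enumerate(matrix)]

def mapper(task):
    # Each mapper computes one dot product.
    i, row, vec = task
    return (i, sum(a * b for a, b in zip(row, vec)))

def reduce_results(results, n_rows):
    # "Reduce" phase: combine the mappers' partial results into the
    # final output vector, which is written back to memory.
    out = [0] * n_rows
    for i, value in results:
        out[i] = value
    return out

tasks = map_tasks([[1, 2], [3, 4]], [5, 6])
print(reduce_results([mapper(t) for t in tasks], 2))  # [17, 39]
```

In the accelerator, each mapper task would run on an LUT group while the controller group iterates over the queue.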

5 Spintronic accelerators for various applications

In this section, we review spintronic architectures in terms of their application domains, such as neuromorphic computing (Section 5.1), image processing (Section 5.2), data encryption (Section 5.3) and associative computing (Section 5.4).


Fig. 28. Mapping of multiplication to PIM accelerator proposed by Wang et al. [58]

5.1 Neuromorphic computing

In neuromorphic architectures, synapses work as the memory element and store the weights of various inputs. The neurons process synaptic inputs to generate the output. Table 9 classifies the works based on their proposed NN architectures, neuron and synapse models, and strategies for performing thresholding. Table 9 also shows the benchmarks used for evaluation by different works.

TABLE 9
A classification of neuromorphic computing architectures

NN architecture
- Spiking neural network: [2, 31]
- ANN/CNN: [2, 3, 32, 47, 48, 60-62, 68]
- Binary CNN: [47, 48]
- Extreme-learning machine: [59]

Neuron model
- MTJ neuron: [2]
- LSV neuron: [2, 3]
- Domain wall motion neuron: [2, 32, 61]

Synapse model
- MTJ synapse: [31]
- Domain wall motion synapse: [2, 3]
- Domain wall motion neuron with MCA synapse: [32, 61]

Thresholding performed by
- Bennett clocking: [3]
- Spintronic comparator based on LSV and LUT: [68]
- Spin torque switches: [61]

Benchmarks used for evaluating NN accelerators
- Character/digit recognition: [3, 32, 47, 62, 68]
- Object detection/classification: [31, 48, 60, 68]
- Face/edge detection: [61, 68]
- Motion detection: [61]

Sharad et al. [3] propose a spintronic ANN accelerator based on a nano-magnetic neuron and a domain wall motion neural synapse. The neuron model utilizes "lateral spin valves" to form "majority logic" gates. The input currents flowing into the MJGs get spin-polarized by their respective magnets. These currents have three "spin components" (one each along the x, y and z axes) and one "charge component". The charge component flows to ground while the spin components induce the STT effect, and their resultant effect is responsible for switching the output magnet. A 3-terminal DW device is used to model the synapse. When a current flows vertically into the channel via the DW magnet, its degree of spin-polarization changes in proportion to the displacement of the DW from the center of the magnet. These variations in spin polarization are used to implement programmable weights. The extreme positions of a small DWM device are used as binary weights while longer nanowires are used for non-binary weights. In order to reduce the injection current for synapses, "Bennett clocking" is used [75], whereby the magnet is switched to a meta-stable state, from which it can be transitioned to either of the stable states with minimal current. The structure of the neuron with DW synapse is shown in Figure 29(a). A preset current forces the firing magnet (the free layer of the neuron MTJ) into its meta-stable configuration. Once this current is removed, the magnet orients itself based on the spin components of the input currents, and thus the firing-MTJ acquires the parallel or anti-parallel state. The weighted sum is computed as the summation of "spin-polarized currents" in the metallic channel while thresholding is achieved through "Bennett clocking". The number of possible input synapses is limited since the spin-polarizing strength of the current decays with the length of the NMC.


Fig. 29. (a) LSV neuron with DW synapse proposed by Sharad et al. [3]. (b) Charge based signaling [3]

The limitation of the proposed spin neuron-synapse units is that they cannot be networked through spin-signaling because nano-magnetic channels have very low spin-diffusion lengths. Hence, their proposed ANN uses CMOS based charge-signaling, as shown in Figure 29(b). One end of a differential latch is connected to a reference while the other is connected to the firing-MTJ of a neuron. The output through the transistors provides input currents to other fan-out neurons. For the "character recognition" benchmark, their proposed ANN accelerator consumes lower power than both analog and digital CMOS ANN accelerators. The area of their proposed design is lower than that of a digital CMOS ANN and comparable to that of an analog CMOS ANN. Sengupta et al. [2] present implementations of artificial neurons and neural synapses with spintronic devices. They model three different types of neurons: step, non-step and spiking neurons. For step neurons, they present three implementations. The first is an STT-MTJ based implementation in which the step functioning of the neuron is directly mapped to the switching of the MTJ. Such a neuron needs higher operating voltages, which, combined with the large critical current, leads to high energy consumption. The second implementation is based on an LSV, as shown in Figure 30(a). Here, magnets m2-m4 are input magnets. The "excitatory" and "inhibitory" currents through m2 and m3 get "spin-polarized" according to the polarity of the magnets. The two spin-polarized currents exert opposing STT effects on the output magnet m1, whose final state is determined by the difference in magnitude between the two currents. The preset current reduces the critical current for switching. The third implementation is based on a SOT-MTJ, where the step operation is performed in two steps. First, a current is sent via the heavy metal to orient the "free layer" along the "hard axis". Then, an input "synaptic current" is passed through the pinned layer, which leads to switching of the MTJ.
The "non-step neuron" is based on a 3-terminal SOT-driven DWM device, shown in Figure 30(b). During the write operation, a "synaptic current" across T2 and T3 displaces the DW in proportion to the current magnitude. During the read operation, T1 and T3 are enabled and an "axon circuit" is used to provide an output current. They further present an "integrate-fire spiking neuron", shown in Figure 30(c). It is implemented using the same DWM device used for the "non-step" neuron. In each time-interval, the DW is displaced in proportion to the magnitude of the synaptic current. The device continues to accumulate input pulses in the form of DW displacements until the DW reaches the opposite end, where the read circuitry detects it and utilizes the axon circuit to generate a spike.

Fig. 30. (a) LSV based step neuron proposed by Sengupta et al. [2] (b) 3-terminal DWM device (c) "Integrate-fire spiking" neuron

The displacement of the domain wall determines the resistance across T1 and T3. This property is exploited to model a programmable weight or neural synapse using the same device. They extend the idea of spintronic neurons to a spintronic neuromorphic processing architecture, which is based on the 3-terminal DWM device and has a crossbar structure. A spiking neural network is implemented using the integrate-fire spiking neuron and domain wall neural synapse mentioned above. Results show that the proposed non-step and spiking neurons consume lower area and energy than their CMOS counterparts. For the spiking neural network, both neuron and synapse are modeled using the same device, and the PIM capability of their model makes it superior to the CMOS implementation.

Fan et al. [32] propose a "soft-limiting non-linear neuron". It is based on a 4-terminal STT-driven DWM device, shown in Figure 31. It has two current paths, lateral and vertical. One port along the lateral path is maintained at a constant voltage while the other is used as a programming port. The neuron works in three phases. First, the total synaptic current, i.e., the weighted summation of inputs, is supplied to the programming port. This changes the position of the DW; the displacement varies with the magnitude, direction and duration of the applied programming current. In the second phase, a vertical current is passed and a "voltage divider circuit" is utilized to sense the state of the MTJ. In the third phase, the DW is reset to its initial position.
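The integrate-fire behaviour of such DW neurons can be captured by a short behavioural model (the nanowire length, gain and reset behaviour are illustrative assumptions, not device parameters from the papers):

```python
class DWIntegrateFireNeuron:
    # Behavioural sketch: the domain wall position integrates the synaptic
    # current; when the DW reaches the far end of the nanowire, the read
    # circuitry detects it and the axon circuit emits a spike.
    def __init__(self, length=10.0, gain=1.0):
        self.length = length    # nanowire length (arbitrary units, assumed)
        self.gain = gain        # DW displacement per unit current (assumed)
        self.position = 0.0

    def step(self, synaptic_current):
        self.position += self.gain * synaptic_current
        if self.position >= self.length:
            self.position = 0.0  # reset phase: DW driven back to the start
            return 1             # spike
        return 0
```

For example, a neuron of length 10 driven by a constant current of 3 fires on the fourth input pulse.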


Fig. 31. 4-terminal DWM device used by Fan et al. [32] to model “soft-limiting non-linear neuron”

They further present an ANN architecture consisting of an array of the proposed neurons coupled with an MCA which serves as the synapse. In this architecture, the functions of an "axon" are performed by transistors. The proposed soft-limiting neurons offer a continuous change in resistance corresponding to the inputs, resulting in improved accuracy and reduced network complexity compared to ANNs implemented using hard-limiting neurons. Compared to hard-limiting neurons, their proposed soft-limiting neurons also lead to a smaller area for the hidden layers in ANN models. The proposed neuron consumes significantly lower energy than CMOS-ANN implementations.

Vincent et al. [31] present an STT-MTJ based stochastic neural synapse. The switching time of an STT-MTJ is a stochastic quantity that depends on the switching current, and it determines the probability of switching. By keeping the magnitude of the write current lower than the "critical current", the stochastic nature of the STT-MTJ is exploited. MTJs are organized in a crossbar structure, each of them connecting an input neuron to an output neuron. When an input neuron spikes, currents are set up in the crossbar array and reach the output neurons. Through the use of SAs, the orientation or logic state of the synapse is determined. Anti-parallel oriented MTJs act as synapses with weight 'zero' while parallel oriented ones have weight 'one'. Firing of an output neuron leads to a "voltage pulse" on the crossbar array. This voltage pulse results in a switching probability of the MTJ synapse which is determined by the synapse's activity in the previous time-interval. This probabilistic switching mechanism is used to implement synaptic learning through a "spike timing dependent plasticity" model. Results show that the use of a controlled write current results in low energy consumption and that their proposed design is robust to device variations.

Angizi et al. [47] propose a SOT-RAM based accelerator for low bit-width CNNs. The design performs convolution in a bit-wise manner on binary inputs and corresponding weights using the PIM approach. Figure 32(a) shows the architecture of their accelerator. It consists of an "image bank", "kernel bank", "convolution engine" and a "digital processing unit" (DPU). The input vectors are mapped onto the "image bank" and the weights are mapped onto the "kernel bank". These vectors are quantized by the DPU. The dot product is evaluated as a combination of bit-count and AND operations. The AND operation is performed in-memory, in the SOT-RAM sub-array, using a reference-voltage based approach similar to that used by Fan et al. [45]. The bit-counter counts the number of ones in the resultant vectors of the AND operation and passes the count to the bit-shifter, which left-shifts the vectors. The result so obtained is the partial sum of the corresponding sub-array. The partial sums of all sub-arrays involved are combined to obtain the final result, which is passed to the DPU for batch normalization and evaluation of the activation function.
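The AND-plus-bit-count dot product at the heart of such binary convolution engines can be sketched in a few lines (a functional model only; in the accelerator the AND is performed inside the SOT-RAM sub-array):

```python
def binary_dot_product(inputs, weights):
    # Dot product of two binary vectors, packed as integers: a bitwise AND
    # (done in-memory in the accelerator) followed by a bit-count.
    return bin(inputs & weights).count("1")

print(binary_dot_product(0b1101, 0b1011))  # 2: positions 0 and 3 are 1 in both
```

For multi-bit inputs, each bit-plane contributes one such partial sum, left-shifted by its bit position before accumulation.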


Fig. 32. Design of the accelerators proposed by (a) Angizi et al. [47] and (b) Fan et al. [48]

AND, bit-count and bit-shift operations are performed in-memory in a fast and parallelized manner, thereby accelerating the MAC operations of the CNN. Running a binary-weight AlexNet on the proposed architecture shows that it consumes lower energy than an RRAM based implementation. Fan et al. [48] propose an accelerator for a binary CNN based on the SOT-RAM PIM platform [45, 50]. The design of the accelerator is depicted in Figure 32(b). Their accelerator is modeled after the XNOR-NET [76] architecture, which is a binarized AlexNet network. The input image and weights are mapped to the image bank and kernel bank, respectively. The weight tensors are converted from -1 and +1 to 0 and 1, respectively. Thus, convolution is achieved through bitwise AND and bit-count operations [77]. Bitwise AND is performed in the memory itself using the SOT-RAM based "convolution engine". This is followed by the "bit-count" operation. The DPU accompanying each block performs other computations such as batch normalization, scaling and pooling. The proposed accelerator achieves acceleration due to its ability to perform convolution within the memory itself, which greatly reduces the movement of data in and out of memory. Results show that the accelerator consumes lower area and energy than an RRAM based accelerator. Comments: The accelerator proposed by Angizi et al. [47] is similar in design and working to the accelerator presented by Fan et al. [48]. However, the former design utilizes the DPU to quantize inputs, while the latter accepts binarized inputs. Also, the in-memory bit-shift operation used by Angizi et al. is not used in the accelerator proposed by Fan et al., and the DPUs in the two accelerators are equipped to perform different functions.

Chung et al. [60] present a racetrack based accelerator for the convolution layer of CNNs. In CNNs, a large fraction of the computations are performed in the convolution layers, and the primary kernel of convolution layers is matrix multiplication. Thus, by accelerating matrix multiplication, the performance of CNNs can be greatly increased. The accelerator proposed by Chung et al. [60] leverages the PIM approach to compute dot products. It consists of nanowire input registers, a racetrack array, two accumulators and adders. The weights are stored in the nanowires while the inputs are provided dynamically through the transistors. The racetrack sub-array performs the dot product while the final result of multiplication is obtained after passing the partial dot-product through the ADC sub-array. Figure 33 shows a comparison between MCA, SRAM and DWM based dot-product engines. In Figures 33(b) and 33(c), the ADCs are included in the "dot-product" blocks. Compared to an MCA implementation, their proposed design provides comparable throughput while consuming lower energy. Their proposed PIM dot-product engine is especially useful as a CNN accelerator.


Fig. 33. Sixty-four “dot products” with sixty-four four-bit weights using dot-product engines implemented using (a) MCA (b) SRAM (c) DWM Racetracks [60]

Comments: By virtue of storing multiple bits in the nanowire racetracks, the technique of Chung et al. [60] achieves longer bit-width processing compared to that proposed by Sharad et al. [3] and Sengupta et al. [2]. Sengupta et al. [62] propose a spintronic ANN accelerator. Both neurons and synapses are designed with a 3-terminal SOT-driven DWM device similar to the one used in [2]. The functions of the neuron and synapses are mapped to separate DWM devices. A "spintronic axon circuit" is used to enable networking of neurons. They implement a "feed-forward ANN" having a hidden layer which is fully connected to the output layer; this design is illustrated in Figure 34. The hidden layer and the output layer are mapped to crossbar arrays and connected through "axons". Input voltages Vi proportional to the image pixels are applied along the rows, while the position of the DW at every cross-point represents the "synaptic weight". If Gij is the conductance of the synapse between the ith input and the jth neuron, and Rj is the resistance of the neuron's path, then the synaptic current flowing into the neuron is given by

Ij = (Σi Gij · Vi) / (1 + γ)    (1)

where γ = Rj Σi Gij. When Rj is very small compared to 1/(Σi Gij), i.e., γ ≪ 1, the voltage drop across the neurons can be neglected. In such a case, Ij gives the weighted sum of the inputs and weights, thereby providing the functionality of a neuron. Since the proposed spintronic neurons can be operated at much lower voltages than the crossbar array, their accelerator consumes much lower power than its analog counterpart. Ramasubramanian et al. [68] present a deep neural network accelerator which relies on a DWM-based implementation of neurons [2, 3]. These spin neurons are coupled with MCA synapses to form a network of neurons similar to the ones presented by Roy et al. [61] and Fan et al. [32]. The proposed architecture uses an array composed of 3-terminal DWM devices as the memory. Their design has a three-tier hierarchical structure, as depicted in Figure 35. In the lowest tier, "spin neuron arrays" (SNAs) are formed by combining the spin neuron network with peripheral circuitry. SNAs are used for performing thresholding and convolution operations. In the next tier, "spin neuromorphic cores" (SNCs) are formed by combining several SNAs with dispatch units and local memory. In the highest tier, "SNC clusters" are formed by combining several SNCs via a local bus. The three-level hierarchy allows the proposed architecture to match the nested parallelism of deep neural networks. The number of SNAs in each SNC and the number of SNCs in each SNC cluster can be changed
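Equation (1) can be evaluated numerically as follows (a sketch with illustrative conductance, voltage and resistance values):

```python
def neuron_currents(G, V, R):
    # I_j = (sum_i G[i][j] * V[i]) / (1 + gamma_j),  gamma_j = R[j] * sum_i G[i][j]
    # G: conductance matrix (inputs x neurons), V: input voltages,
    # R: resistance of each neuron's path.
    n_in, n_out = len(G), len(G[0])
    out = []
    for j in range(n_out):
        g_sum = sum(G[i][j] for i in range(n_in))
        weighted = sum(G[i][j] * V[i] for i in range(n_in))
        out.append(weighted / (1.0 + R[j] * g_sum))
    return out

# With R_j -> 0 (gamma << 1) the current reduces to the plain weighted sum:
print(neuron_currents([[1.0, 0.5], [2.0, 1.0]], [0.2, 0.1], [0.0, 0.0]))
```

A nonzero Rj makes γ > 0 and attenuates the current, which is why low-resistance neuron paths are desirable.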


Fig. 34. ANN accelerator design proposed by Sengupta et al. [62]


Fig. 35. Three-tier NN accelerator design proposed by Ramasubramanian et al. [68]

to obtain different points on the energy-speed tradeoff. Since spintronic crossbar arrays can operate at much lower voltages than their CMOS counterparts, their accelerator consumes much lower energy than an analog-CMOS based implementation.

5.2 Image processing

He et al. [24] propose an STT-RAM array that works as an NVM and as a reconfigurable PIM platform. It is based on a standard STT-RAM array and uses revised column and row decoders for enabling either a single line for memory read/write, or two lines for PIM. It has a modified sensing circuit consisting of two SAs, and a reference generator circuit which provides reference values to the two SAs. The two SAs evaluate NAND, AND, NOR and OR simultaneously. From these functions, XOR and XNOR are generated using a CMOL combinational circuit. A MUX selects the desired output from the six possible outputs. They further propose a novel edge detection algorithm that makes use of the proposed PIM array. For binary images, the entire image is stored in the memory array. Four neighboring bit-cells are simultaneously selected and the reference values of the SAs are set such that the edge-detection algorithm is intrinsically implemented; this is equivalent to a sliding window in conventional image processing. The algorithm is extended to N-bit grayscale images by dividing the image into N bit-planes from MSB to LSB and applying the algorithm to each bit-plane separately. The plane-wise results thus obtained are combined through an in-memory pixel-wise OR operation to obtain the final result. Their technique consumes lower energy than CMOS implementations of conventional edge detection algorithms. Their design allows executing complex algorithms on PIM platforms that are otherwise considered suitable only for basic Boolean functions.

Roy et al. [61] propose a DWM based architecture for non-Boolean computing. Three-terminal DWM devices are used to model thresholding neurons. These neurons are networked through a CMOS latch based signaling system similar to the one used by Sharad et al. [3]. The latches control the transistors which, in turn, supply synapse currents to the fan-out neurons. A transistor corresponding to a negative weight acts as a drain, while one corresponding to a positive weight acts as a current source. The proposed NN architecture combines the three-terminal DWM devices and an MCA. The DW devices serve as neurons, the MCA acts as the neural synapse, and the CMOS signaling system enables efficient connectivity between them. Simulations show that the thresholding operation is more efficient in spin-based neurons than in their CMOS counterparts. They evaluate their design using image processing algorithms such as edge detection, motion detection and digitization. Due to the low operating voltage of the neurons, their proposed architecture consumes lower energy than advanced mixed-signal CMOS implementations.

Natsui et al. [78] propose an automated design environment for MTJ-based large scale integration. The proposed design environment consists of a combination of standard EDA tools and newly developed customized tools and libraries. The flow diagram of the design procedure, along with a comparison of conventional and newly developed techniques, is shown in Figure 36(a). The "Nanolib" tool generates a technology file for a given circuit netlist; the output comprises functional, structural, environmental and timing information. "ns-spice mtj" is a SPICE simulation model for MTJs. The custom-developed circuit simulator combined with "ns-spice mtj" allows generation of MTJ instances using a single line of code, just like transistors. Specialized libraries are created using the circuit simulator and "Nanolib" for MOS/MTJ hybrid cells. Also included is an HDL preprocessor, "Vericonv", which converts an HDL netlist into a Verilog netlist.


Fig. 36. Architecture of the motion-vector prediction unit proposed by Natsui et al. [78]

Next, they design and fabricate a motion-vector prediction unit using the proposed design environment. The architecture of the unit, which uses an 8×8 candidate window and a 4×4 search window for 8-bit images, is shown in Figure 36. It consists of 25 processing elements (PEs), each of which contains 16 8-bit non-volatile adders. Power consumption of the PEs is reduced by precisely controlling the power supply in every operation cycle. This form of power gating is easy to implement, since the non-volatile nature of the MTJs allows the power supply to be cut without concern for data retention, and it reduces static power dissipation significantly. Power efficiency is further enhanced by increasing the granularity of power gating. Their motion-vector prediction unit has lower leakage and static power consumption than CMOS-only designs. Also, the higher packing density, and the ability to embed MTJs over CMOS structures to form 3-D circuits, reduce the area requirements.
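The minimum-SAD block matching that such a PE array performs can be sketched in software as follows. This is a behavioral sketch only (the function name and the sequential pure-Python formulation are ours, not from the paper); in the hardware, the candidate displacements are evaluated by the PEs concurrently and a separate minimum-SAD stage picks the winner.

```python
def min_sad_motion_vector(block, window):
    """Exhaustively slide `block` over `window` and return the
    displacement (motion vector) with the minimum sum of absolute
    differences (SAD), together with that SAD value."""
    bh, bw = len(block), len(block[0])
    wh, ww = len(window), len(window[0])
    best_sad, best_mv = None, None
    for dy in range(wh - bh + 1):      # each (dy, dx) is one candidate
        for dx in range(ww - bw + 1):  # displacement, i.e., one PE's job
            sad = sum(abs(window[dy + i][dx + j] - block[i][j])
                      for i in range(bh) for j in range(bw))
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```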

5.3 Data encryption

The PIM capability of spintronic memories offers a unique advantage for data encryption. Performing encryption directly in memory obviates the need to load the data into volatile memory, encrypt it using a logic unit, and store it back in memory. Thus, the energy/bandwidth overheads and security risks of data-movement [79] are avoided entirely. This also allows reaching the levels of throughput and energy efficiency required for big-data applications. Figure 37(a) shows the flow diagram of AES data-encryption, and Figures 37(b) and 37(c) illustrate the ShiftRows and


MixColumns operations, respectively. We now review several works that propose spintronic accelerators for data encryption.


Fig. 37. (a) Flow diagram of AES data-encryption for N iterations [50]; (b) ShiftRows and (c) MixColumns operations
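As a concrete reference for the two transformations in Figure 37, here is a plain-Python sketch using the standard AES definitions (the helper names are ours). Note how multiplication by 2 in GF(2^8) reduces to a shift plus a conditional XOR, and multiplication by 3 to a shift plus an XOR — the decomposition that several of the in-memory designs reviewed below exploit:

```python
def shift_rows(state):
    """AES ShiftRows: rotate row i of the 4x4 state left by i bytes."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]

def xtime(b):
    """Multiply a byte by 2 in GF(2^8), reducing by the AES polynomial 0x11B."""
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def mix_column(col):
    """AES MixColumns on one state column: multiply by the circulant matrix
    [02 03 01 01 / 01 02 03 01 / 01 01 02 03 / 03 01 01 02] over GF(2^8).
    Multiplication by 3 is decomposed as xtime(x) ^ x."""
    a, b, c, d = col
    return [xtime(a) ^ (xtime(b) ^ b) ^ c ^ d,
            a ^ xtime(b) ^ (xtime(c) ^ c) ^ d,
            a ^ b ^ xtime(c) ^ (xtime(d) ^ d),
            (xtime(a) ^ a) ^ b ^ c ^ xtime(d)]
```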

Angizi et al. [64] propose a PIM architecture for data encryption based on a four-terminal DWM device driven by the spin-Hall effect. The structure of the logic blocks is shown in Figure 38. Each H-tree-shaped sub-array is split into two blocks, and every block consists of four memory cells and four PIM logic units. The logic units are “threshold logic gates” (TLGs) and XOR gates. The TLG is used to implement majority, AND/NAND and OR/NOR functions. Since XOR accounts for the bulk of the operations performed in data encryption, Angizi et al. use specialized XOR gates even though XOR could be implemented with TLGs. In the computing mode, the TLG and XOR units perform PIM operations; in the memory mode, the TLG serves as another memory cell.


Fig. 38. Logic blocks with memory (Mem), XOR and TLG units organised into H-tree structures [64]
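Functionally, a TLG fires when the sum of its active inputs crosses a programmable threshold, so the same gate realizes majority, AND or OR merely by changing the threshold. A minimal behavioral sketch (the encoding and function names are ours, not from the paper):

```python
def tlg(inputs, threshold):
    """Threshold logic gate over binary inputs: output 1 iff the number
    of active inputs reaches `threshold` (modeling the device switching
    once the summed input current exceeds its critical current)."""
    return int(sum(inputs) >= threshold)

# One TLG, three functions, selected purely by the threshold:
def majority3(a, b, c): return tlg([a, b, c], 2)  # at least 2 of 3
def and3(a, b, c):      return tlg([a, b, c], 3)  # all inputs active
def or3(a, b, c):       return tlg([a, b, c], 1)  # any input active
```

A single threshold gate cannot compute XOR, since XOR is not linearly separable; realizing it would take a small TLG network, which is one reason the design provisions dedicated XOR gates for the XOR-heavy encryption workload.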

They further illustrate data encryption by implementing AES on the proposed architecture. The flow diagram for AES and an illustration of the ShiftRows and MixColumns transformations are shown in Figure 37. They propose three levels of parallelization for mapping the transforms: the first level uses 16 rows of a memory unit to store one 4×4, 16-byte state matrix, the second level uses two memory units simultaneously to hold two different state matrices, and this is extended analogously to higher levels of parallelization. Results show that with an increasing degree of parallelization, the speed increases at the cost of increased area and energy consumption. At the highest degree of parallelization, their architecture has a lower “energy-delay product” than CPU, ASIC, CMOS and baseline DWM implementations [80]; however, it has a higher area than CMOL and baseline DWM implementations.

He et al. [67] propose mapping AES data-encryption onto the DWM-nanowire-based PIM platform presented by Fan et al. [45]. The perpendicularly coupled nanowire crossbar array is equipped with row and column decoders to facilitate accessing individual cells. The crossbar array performs in-memory XOR [45] and bit-shifts through DW shifting. Each crossbar stores a single row of the 4×4 state matrix. In


the “AddRoundKey” step, the state matrix is loaded into four nanowires such that each nanowire holds one row. The key matrix is similarly loaded into the perpendicular nanowires, and the XOR of the matrices is retrieved from the intersections. In the “SubBytes” step, the state matrix undergoes an LUT-based transformation; the implementation of LUTs on nanowires is similar to the technique used by Wang et al. [58]. In the “ShiftRows” step, each row undergoes shifting, which is easily implemented through DW shifting on the nanowire. For the “MixColumns” step, the additions and the multiplications by 2 and 3 can be implemented either as a combination of XOR and bit-shifts, or with an LUT. Results show that the proposed implementation consumes lower energy than CPU, ASIC and CMOL implementations of AES.

Wang et al. [65] present a DWM-nanowire-based PIM architecture and map AES data encryption onto it. This method allows integration of AES ciphers and data encryption within memory. The 16-byte 4×4 state matrices are split into eight 4×4 bit-arrays, one per bit position from MSB to LSB. Each row of such an array is stored in a nanowire, along with a few reserved bits to facilitate shifting. The proposed design uses the nanowire-based XOR and LUT implementations proposed by Wang et al. [58]. In the “AddRoundKey” step, the state-array is bit-wise XORed with a “key-array” through in-memory XOR operations. The resultant state matrix is subjected to a non-linear transformation using an LUT in the “SubBytes” step. In the “ShiftRows” step, the rows of the state-array are shifted cyclically: the reserved bits in the nanowire are used to form a “virtual circle”, and each row is shifted by a different amount, i.e., the ith row is shifted by i − 1 bits. For the last transformation, viz. “MixColumns”, three operations are necessary: multiplication by 2, multiplication by 3, and bit-wise XOR. These can be implemented as a combination of left shifts and bit-wise XOR, or directly using an LUT.
All four transformations of the AES algorithm are thus implemented without moving data out of the memory. Their proposed design has higher throughput and energy efficiency, and lower area, than CPU, ASIC and memristive-CMOL implementations; however, its latency is larger than that of memristive CMOL and ASIC, since DW-based XOR and LUT operations need multiple cycles due to the shift operations.

Comments: The techniques of He et al. [67] and Wang et al. [65] are very similar in terms of mapping AES onto the proposed platforms. The main difference lies in the devices used: the former uses a simple crossbar array, while the latter uses specialized constructs based on nanowire racetracks. Owing to this difference, the design of He et al. is simpler in terms of logic complexity but requires extensive programming and control units, whereas the design of Wang et al. has higher logic complexity but requires a simpler control unit.

Fan et al. [50] propose a PIM platform based on three-terminal DWM devices. The array consists of memory (‘Mem’) cells and memory/function (‘Mem/Function’) cells, as shown in Figure 39. ‘Mem’ cells comprise two transistors and one DWM device. The ‘Mem/Function’ cells have an extra access transistor that is controlled by the “mode activation row decoder”. The ‘Mem’ cells in a row are used to store operands and are controlled by row decoders, whereas the ‘Mem/Function’ cells store the output. PIM is achieved by implementing a “majority logic” gate, since majority (together with inversion) forms a complete logic set; all other Boolean functions are implemented as combinations of majority logic.


Fig. 39. (a) The ‘Mem’ and ‘Mem/function’ cells [50]. (b) The PIM platform proposed by Fan et al. [50]

During computations, a current flows from the ‘Mem’ cells to the ‘Mem/Function’ cell; this current represents the weighted summation of the data stored in the ‘Mem’ cells. If the summation current is higher than the “critical current” of the DWM device, the DW moves to the opposite end; otherwise, it remains at its initial position. The result of the operation remains stored in the ‘Mem/Function’ cell in the form of the DW displacement. Their design can implement any 2-input logic gate as a combination of majority logic.
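The completeness argument can be made concrete: fixing one input of a majority gate yields AND or OR, and, with inversion available, any Boolean function follows. A small functional sketch (the function names are ours):

```python
def maj(a, b, c):
    """3-input majority gate: 1 iff at least two inputs are 1."""
    return int(a + b + c >= 2)

def and2(a, b): return maj(a, b, 0)  # tie the third input low
def or2(a, b):  return maj(a, b, 1)  # tie the third input high

def xor2(a, b):
    """XOR composed from majority gates plus inversion."""
    return and2(or2(a, b), 1 - and2(a, b))
```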


Implementation of AES on their accelerator achieves higher energy efficiency than that on CPU, ASIC and CMOL.

5.4 Associative computing

Associative computing is based on the concept of associative search, wherein data is accessed by content rather than by address. The searched data is processed in a massively parallel manner, which eliminates the need to access memory in each and every operation.

Guo et al. [23] present an STT-RAM based associative computing architecture. Their design leverages PIM to reduce the address-computation overhead associated with accessing data. The TCAM array is composed of 2T-1MTJ cells, as shown in Figure 40(a). Each cell of the array is capable of performing write, read and search operations. Search is performed by XORing the stored bit with the search-bit. The array configured for such a search scheme is shown in Figure 40(b): here, D or D’ is biased, and the search-bit is applied through the search-line.


Fig. 40. (a) 2T-1MTJ bit cell [23] (b) Array structure with XOR gates for search [23]

Their technique employs bit-serial search, i.e., it follows an iterative approach, searching the array column after column. Each row is equipped with additional logic circuitry such as SAs, flip-flops and multi-match circuits. The organization of the TCAM array is shown in Figure 41. After a search operation, on-chip microcontrollers perform summation, match-count and indexing operations; the results of these operations are stored in memory and subsequently retrieved by the processor. By virtue of using TCAM and PIM to implement the search operation, their architecture provides higher performance and energy efficiency than a DRAM-based system. The limitation of their technique is that its delay and energy consumption increase with the search-width.
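The bit-serial search proceeds one column per iteration, with each row's flip-flop accumulating whether that row still matches. A behavioral sketch of this scheme (pure Python; the function name and list encoding are ours):

```python
def bit_serial_search(rows, key):
    """Search a TCAM-like array bit-serially, one column per iteration.
    A cell comparison is stored_bit XOR search_bit (0 means match), and
    each row's match flag -- the per-row flip-flop -- is ANDed with the
    complement of that XOR. Returns the indices of all matching rows."""
    match = [1] * len(rows)              # every row starts as a candidate
    for col, key_bit in enumerate(key):  # delay grows with search-width
        for r, row in enumerate(rows):
            match[r] &= 1 - (row[col] ^ key_bit)
    return [r for r, m in enumerate(match) if m]
```

The outer loop over columns is why the delay and energy grow with the search-width, the limitation noted above.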


Fig. 41. Array organization of the PIM architecture proposed by Guo et al. [23]

Comments: The associative computing architecture proposed by Guo et al. [23] and the PIM accelerator proposed by Jain et al. [26] both make use of 2T-1MTJ cells. However, the structures of these cells are different. The bit-cell structure used by Guo et al. has four terminals while the bit-cell used by Jain et al. has five terminals. The extra terminal used by Jain et al. is the “write line mode” (WLM) terminal which is used to toggle between memory and PIM modes.

6 Conclusion and Future Outlook

Memory latency and bandwidth constraints have now become the key bottleneck in scaling the performance of modern processors. Although traditional techniques such as prefetching [81] and data-compression [82] can partially mitigate these overheads, approaches that provide much higher efficiency are required for architecting next-generation processors. In this paper, we presented a survey of spintronic architectures for enabling “processing-in-memory” and designing accelerators for “neural networks”. We conclude this paper with a discussion of future challenges.

Apart from performance, area and energy efficiency, other metrics such as reliability, yield at small feature sizes, security and cost effectiveness will also determine whether spintronic memories see wide-scale integration in production systems. Most research works, however, do not evaluate the proposed architectures on all these metrics. A comprehensive evaluation of spintronic architectures and management techniques is required to soundly establish their effectiveness.

While conventional memories and compute-centric architectures fall well short of meeting the grand challenges of AI, today's spintronic memories are also unable to meet these targets. Evidently, concerted efforts across the entire computing stack are needed to address these issues. From the device perspective, continuing feature-scaling while reducing fault rates will improve integration density. At the microarchitecture level, hiding the large latency of these memories will require techniques such as pipelining, prefetching and write-coalescing. As novel machine learning algorithms are proposed and deployed in various applications, designing spintronic accelerators customized for different algorithms and applications is required to ensure high efficiency.

With decreasing feature size, the error rate in processor components increases [83]. Due to this, the functional units may become slow and/or faulty. In such cases, the PIM approach becomes even more important. While previous works have studied the PIM capability of spintronic architectures primarily for energy and performance benefits, exploring the benefits of PIM for tolerating errors presents a promising research avenue in the near future.

References

[1] S. Mittal and S. Nag, “A survey of encoding techniques for reducing data-movement energy,” Journal of Systems Architecture, 2018.
[2] A. Sengupta and K. Roy, “A vision for all-spin neural networks: A device to system perspective,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, no. 12, pp. 2267–2277, 2016.
[3] M. Sharad, C. Augustine, G. Panagopoulos, and K. Roy, “Spin-based neuron model with domain-wall magnets as synapse,” IEEE Transactions on Nanotechnology, vol. 11, no. 4, pp. 843–853, 2012.
[4] S. Mittal, “A Survey of Techniques for Architecting Processor Components using Domain Wall Memory,” ACM Journal on Emerging Technologies in Computing Systems, 2016.
[5] S. Mittal, J. S. Vetter, and D. Li, “A Survey Of Architectural Approaches for Managing Embedded DRAM and Non-volatile On-chip Caches,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 26, no. 6, pp. 1524–1537, 2015.
[6] X. Chen, E. H.-M. Sha, Q. Zhuge, W. Jiang, J. Chen, J. Chen, and J. Xu, “A unified framework for designing high performance in-memory and hybrid memory file systems,” Journal of Systems Architecture, vol. 68, pp. 51–64, 2016.
[7] S. Mittal and J. S. Vetter, “A survey of software techniques for using non-volatile memories for storage and main memory systems,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 27, no. 5, pp. 1537–1550, 2016.
[8] S. Peng, Y. Zhang, M. Wang, Y. Zhang, and W. Zhao, “Magnetic tunnel junctions for spintronics: principles and applications,” Wiley Encyclopedia of Electrical and Electronics Engineering, pp. 1–16, 2014.
[9] M. Wang, W. Cai, K. Cao, J. Zhou, J. Wrona, S. Peng, H. Yang, J. Wei, W. Kang, Y. Zhang et al., “Current-induced magnetization switching in atom-thick tungsten engineered perpendicular magnetic tunnel junctions with large tunnel magnetoresistance,” Nature Communications, vol. 9, no. 1, p. 671, 2018.
[10] I. Ahmed, Z. Zhao, M. G. Mankalale, S. S. Sapatnekar, J.-P. Wang, and C. H. Kim, “A comparative study between spin-transfer-torque and spin-Hall-effect switching mechanisms in pMTJ using SPICE,” IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, vol. 3, pp. 74–82, 2017.
[11] S. Mittal, R. Wang, and J. Vetter, “DESTINY: A Comprehensive Tool with 3D and Multi-level Cell Memory Modeling Capability,” Journal of Low Power Electronics and Applications, vol. 7, no. 3, p. 23, 2017.
[12] W. Kang, Y. Cheng, Y. Zhang, D. Ravelosona, and W. Zhao, “Readability challenges in deeply scaled STT-MRAM,” in Non-Volatile Memory Technology Symposium (NVMTS), 2014, pp. 1–4.
[13] S. Mittal, “A survey of soft-error mitigation techniques for non-volatile memories,” Computers, vol. 6, no. 8, 2017.
[14] S. Mittal, J. Vetter, and L. Jiang, “Addressing Read-disturbance Issue in STT-RAM by Data Compression and Selective Duplication,” IEEE Computer Architecture Letters, vol. 16, no. 2, pp. 94–98, 2017.
[15] W. Kang, Z. Wang, H. Zhang, S. Li, Y. Zhang, and W. Zhao, “Advanced low power spintronic memories beyond STT-MRAM,” in Great Lakes Symposium on VLSI, 2017, pp. 299–304.
[16] J. G. Alzate, P. K. Amiri, P. Upadhyaya, S. Cherepov, J. Zhu, M. Lewis, R. Dorrance, J. Katine, J. Langer, K. Galatsis et al., “Voltage-induced switching of nanoscale magnetic tunnel junctions,” in IEEE International Electron Devices Meeting (IEDM), 2012, pp. 29–5.
[17] A. Roohi, R. Zand, and R. F. DeMara, “A tunable majority gate-based full adder using current-induced domain wall nanomagnets,” IEEE Transactions on Magnetics, vol. 52, no. 8, pp. 1–7, 2016.


[18] K. Huang, R. Zhao, and Y. Lian, “A low power and high sensing margin non-volatile full adder using racetrack memory,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 62, no. 4, pp. 1109–1116, 2015. [19] W. Kang, Y. Huang, X. Zhang, Y. Zhou, and W. Zhao, “Skyrmion-electronics: An overview and outlook.” Proceedings of the IEEE, vol. 104, no. 10, pp. 2040–2061, 2016. [20] X. Zhang, M. Ezawa, and Y. Zhou, “Magnetic skyrmion logic gates: conversion, duplication and merging of skyrmions,” Scientific reports, vol. 5, p. 9400, 2015. [21] Q. An, L. Su, J.-O. Klein, S. Le Beux, I. O’Connor, and W. Zhao, “Full-adder circuit design based on all-spin logic device,” in Nanoscale Architectures (NANOARCH), 2015 IEEE/ACM International Symposium on. IEEE, 2015, pp. 163–168. [22] H. Mahmoudi, T. Windbacher, V. Sverdlov, and S. Selberherr, “High performance MRAM-based stateful logic,” in Ultimate Integration on Silicon (ULIS), 2014 15th International Conference on. IEEE, 2014, pp. 117–120. [23] Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman, “AC-DIMM: associative computing with STT-MRAM,” ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 189–200, 2013. [24] Z. He, S. Angizi, and D. Fan, “Exploring STT-MRAM Based In-Memory Computing Paradigm with Application of Image Edge Extraction,” in Computer Design (ICCD), 2017 IEEE International Conference on. IEEE, 2017, pp. 439–446. [25] W. Kang, H. Wang, Z. Wang, Y. Zhang, and W. Zhao, “In-Memory Processing Paradigm for Bitwise Logic Operations in STT–MRAM,” IEEE Transactions on Magnetics, vol. 53, no. 11, pp. 1–4, 2017. [26] S. Jain, S. Sapatnekar, J.-P. Wang, K. Roy, and A. Raghunathan, “Computing-in-Memory with Spintronics,” DATE, pp. 1640– 1645, 2018. [27] H. Mahmoudi, T. Windbacher, V. Sverdlov, and S. Selberherr, “MRAM-based logic array for large-scale non-volatile logic-inmemory applications,” in Nanoscale Architectures (NANOARCH), 2013 IEEE/ACM International Symposium on. IEEE, 2013, pp. 26–27. 
[28] S. Matsunaga, J. Hayakawa, S. Ikeda, K. Miura, T. Endoh, H. Ohno, and T. Hanyu, “MTJ-based nonvolatile logic-in-memory circuit, future prospects and issues,” in Proceedings of the Conference on Design, Automation and Test in Europe. European Design and Automation Association, 2009, pp. 433–435. [29] F. Parveen, Z. He, S. Angizi, and D. Fan, “HielM: Highly flexible in-memory computing using STT MRAM,” in Design Automation Conference (ASP-DAC), 2018 23rd Asia and South Pacific. IEEE, 2018, pp. 361–366. [30] P. Butzen, M. Slimani, Y. Wang, H. Cai et al., “Reliable majority voter based on spin transfer torque magnetic tunnel junction device,” Electronics Letters, vol. 52, no. 1, pp. 47–49, 2015. [31] A. F. Vincent, J. Larroque, N. Locatelli, N. B. Romdhane, O. Bichler, C. Gamrat, W. S. Zhao, J.-O. Klein, S. Galdin-Retailleau, and D. Querlioz, “Spin-transfer torque magnetic memory as a stochastic memristive synapse for neuromorphic systems,” IEEE transactions on biomedical circuits and systems, vol. 9, no. 2, pp. 166–174, 2015. [32] D. Fan, Y. Shim, A. Raghunathan, and K. Roy, “STT-SNN: A spin-transfer-torque based soft-limiting non-linear neuron for low-power artificial neural networks,” IEEE Transactions on Nanotechnology, vol. 14, no. 6, pp. 1013–1023, 2015. [33] H. Cai, Y. Wang, L. A. Naviner, Z. Wang, and W. Zhao, “Approximate computing in MOS/spintronic non-volatile full-adder,” in Nanoscale Architectures (NANOARCH), 2016 IEEE/ACM International Symposium on. IEEE, 2016, pp. 203–208. [34] L. A. de Barros Naviner, H. Cai, Y. Wang, W. Zhao, and A. B. Dhia, “Stochastic computation with spin torque transfer magnetic tunnel junction,” in New Circuits and Systems Conference (NEWCAS), 2015 IEEE 13th International. IEEE, 2015, pp. 1–4. [35] Y. Wang, H. Cai, L. A. Naviner, J.-O. Klein, J. Yang, and W. 
Zhao, “A novel circuit design of true random number generator using magnetic tunnel junction,” in Nanoscale Architectures (NANOARCH), 2016 IEEE/ACM International Symposium on. IEEE, 2016, pp. 123–128. [36] T. Hanyu, D. Suzuki, N. Onizawa, S. Matsunaga, M. Natsui, and A. Mochizuki, “Spintronics-based nonvolatile logic-inmemory architecture towards an ultra-low-power and highly reliable VLSI computing paradigm,” in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 2015, pp. 1006–1011. [37] T. Hanyu, D. Suzuki, A. Mochizuki, M. Natsui, N. Onizawa, T. Sugibayashi, S. Ikeda, T. Endoh, and H. Ohno, “Challenge of MOS/MTJ-hybrid nonvolatile logic-in-memory architecture in dark-silicon era,” in Electron Devices Meeting (IEDM), 2014 IEEE International. IEEE, 2014, pp. 28–2. [38] B. Lokesh and M. Malathi, “Full adder based reconfigurable spintronic ALU using STT-MTJ,” in India Conference (INDICON), 2013 Annual IEEE. IEEE, 2013, pp. 1–5. [39] D. Kumar, M. SaW, and A. Islam, “Design of 21 multiplexer and 12 demultiplexer using magnetic tunnel junction elements,” in Emerging Trends in VLSI, Embedded System, Nano Electronics and Telecommunication System (ICEVENT), 2013 International Conference on. IEEE, 2013, pp. 1–5. [40] X. Yao, J. Harms, A. Lyle, F. Ebrahimi, Y. Zhang, and J.-P. Wang, “Magnetic tunnel junction-based spintronic logic units operated by spin transfer torque,” IEEE Transactions on Nanotechnology, vol. 11, no. 1, pp. 120–126, 2012. [41] S. R. Patil, X. Yao, H. Meng, J.-P. Wang, and D. J. Lilja, “Design of a spintronic arithmetic and logic unit using magnetic tunnel junctions,” in Proceedings of the 5th conference on Computing frontiers. ACM, 2008, pp. 171–178. [42] Q. An, S. Le Beux, I. O’Connor, J. O. Klein, and W. Zhao, “Arithmetic Logic Unit based on all-spin logic devices,” in New Circuits and Systems Conference (NEWCAS), 2017 15th IEEE International. IEEE, 2017, pp. 317–320. [43] F. Ren and D. 
Markovic, “True energy-performance analysis of the MTJ-based logic-in-memory architecture (1-bit full adder),” IEEE Transactions on Electron Devices, vol. 57, no. 5, pp. 1023–1028, 2010. [44] E. Deng, Z. Wang, J.-O. Klein, G. Prenat, B. Dieny, and W. Zhao, “High-frequency low-power magnetic full-adder based on magnetic tunnel junction with spin-hall assistance,” IEEE Transactions on Magnetics, vol. 51, no. 11, pp. 1–4, 2015. [45] D. Fan, Z. He, and S. Angizi, “Leveraging Spintronic Devices for Ultra-Low Power In-Memory Computing: Logic and Neural Network,” pp. 1109–1112, 2017. [46] H. Zhang, W. Kang, L. Wang, K. L. Wang, and W. Zhao, “Stateful Reconfigurable Logic via a Single-Voltage-Gated Spin Hall-Effect Driven Magnetic Tunnel Junction in a Spintronic Memory,” IEEE Transactions on Electron Devices, vol. 64, no. 10,


pp. 4295–4301, 2017. [47] S. Angizi, Z. He, F. Parveen, and D. Fan, “IMCE: energy-efficient bit-wise in-memory convolution engine for deep neural network,” in Proceedings of the 23rd Asia and South Pacific Design Automation Conference. IEEE Press, 2018, pp. 111–116. [48] D. Fan and S. Angizi, “Energy Efficient In-Memory Binary Deep Neural Network Accelerator with Dual-Mode SOT-MRAM,” in 2017 IEEE 35th International Conference on Computer Design (ICCD). IEEE, 2017, pp. 609–612. [49] L. Chang, Z. Wang, Y. Zhang, and W. Zhao, “Reconfigurable processing in memory architecture based on spin orbit torque,” in 2017 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH). IEEE, 2017, pp. 95–96. [50] D. Fan, S. Angizi, and Z. He, “In-Memory Computing with Spintronic Devices,” in VLSI (ISVLSI), 2017 IEEE Computer Society Annual Symposium on. IEEE, 2017, pp. 683–688. [51] F. Parveen, S. Angizi, Z. He, and D. Fan, “Low power in-memory computing based on dual-mode SOT-MRAM,” in Low Power Electronics and Design (ISLPED, 2017 IEEE/ACM International Symposium on. IEEE, 2017, pp. 1–6. [52] A. Roohi, R. Zand, D. Fan, and R. F. DeMara, “Voltage-based concatenatable full adder using spin Hall effect switching,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 12, pp. 2134–2138, 2017. [53] A. Jaiswal, A. Agrawal, and K. Roy, “In-situ, In-Memory Stateful Vector Logic Operations based on Voltage Controlled Magnetic Anisotropy,” Scientific reports, vol. 8, no. 1, p. 5738, 2018. [54] L. Wang, W. Kang, F. Ebrahimi, X. Li, Y. Huang, C. Zhao, K. L. Wang, and W. Zhao, “Voltage-controlled magnetic tunnel junctions for processing-in-memory implementation,” IEEE Electron Device Letters, vol. 39, no. 3, pp. 440–443, 2018. [55] H.-P. Trinh, W. Zhao, J.-O. Klein, Y. Zhang, D. Ravelsona, and C. Chappert, “Domain wall motion based magnetic adder,” Electronics letters, vol. 48, no. 17, pp. 1049–1051, 2012. [56] T. Luo, W. Zhang, B. He, and D. 
Maskell, “A racetrack memory based in-memory booth multiplier for cryptography application,” in Design Automation Conference (ASP-DAC), 2016 21st Asia and South Pacific. IEEE, 2016, pp. 286–291. [57] K. Huang and R. Zhao, “Magnetic domain-wall racetrack memory-based nonvolatile logic for low-power computing and fast run-time-reconfiguration,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 9, pp. 2861–2872, 2016. [58] Y. Wang, P. Kong, and H. Yu, “Logic-in-memory based big-data computing by nonvolatile domain-wall nanowire devices,” in Non-Volatile Memory Technology Symposium (NVMTS), 2013 13th. IEEE, 2013, pp. 1–6. [59] H. Yu, Y. Wang, S. Chen, W. Fei, C. Weng, J. Zhao, and Z. Wei, “Energy efficient in-memory machine learning for data intensive image-processing by non-volatile domain-wall memory,” in Design Automation Conference (ASP-DAC), 2014 19th Asia and South Pacific. IEEE, 2014, pp. 191–196. [60] J. Chung, J. Park, and S. Ghosh, “Domain wall memory based convolutional neural networks for bit-width extendability and energy-efficiency,” in ISLPED. ACM, 2016, pp. 332–337. [61] K. Roy, M. Sharad, D. Fan, and K. Yogendra, “Exploring Boolean and non-Boolean computing with spin torque devices,” in Computer-Aided Design (ICCAD), 2013 IEEE/ACM International Conference on. IEEE, 2013, pp. 576–580. [62] A. Sengupta, Y. Shim, and K. Roy, “Proposal for an all-spin artificial neural network: Emulating neural and synaptic functionalities through domain wall motion in ferromagnets,” IEEE TBioCAS, vol. 10, no. 6, pp. 1152–1160, 2016. [63] S. Deb, L. Ni, H. Yu, and A. Chattopadhyay, “Racetrack memory-based encoder/decoder for low-power interconnect architectures,” in SAMOS. IEEE, 2016, pp. 281–287. [64] S. Angizi, Z. He, N. Bagherzadeh, and D. Fan, “Design and Evaluation of a Spintronic In-Memory Processing Platform for Non-Volatile Data Encryption,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017. [65] Y. 
Wang, H. Yu, D. Sylvester, and P. Kong, “Energy efficient in-memory AES encryption based on nonvolatile domainwall nanowire,” in Proceedings of the conference on Design, Automation & Test in Europe. European Design and Automation Association, 2014, p. 183. [66] S. Angizi, Z. He, R. F. DeMara, and D. Fan, “Composite spintronic accuracy-configurable adder for low power digital signal processing,” in Quality Electronic Design (ISQED), 2017 18th International Symposium on. IEEE, 2017, pp. 391–396. [67] Z. He, S. Angizi, F. Parveen, and D. Fan, “Leveraging Dual-Mode Magnetic Crossbar for Ultra-low Energy In-Memory Data Encryption,” in Proceedings of the on Great Lakes Symposium on VLSI 2017. ACM, 2017, pp. 83–88. [68] S. G. Ramasubramanian, R. Venkatesan, M. Sharad, K. Roy, and A. Raghunathan, “SPINDLE: SPINtronic deep learning engine for large-scale neuromorphic computing,” in Proceedings of the 2014 international symposium on Low power electronics and design. ACM, 2014, pp. 15–20. [69] X. Chen, W. Kang, D. Zhu, X. Zhang, N. Lei, Y. Zhang, Y. Zhou, and W. Zhao, “A compact skyrmionic leaky–integrate–fire spiking neuron device,” Nanoscale, vol. 10, no. 13, pp. 6139–6146, 2018. [70] S. Li, W. Kang, Y. Huang, X. Zhang, Y. Zhou, and W. Zhao, “Magnetic skyrmion-based artificial neuron device,” Nanotechnology, vol. 28, no. 31, p. 31LT01, 2017. [71] Y. Huang, W. Kang, X. Zhang, Y. Zhou, and W. Zhao, “Magnetic skyrmion-based synaptic devices,” Nanotechnology, vol. 28, no. 8, p. 08LT02, 2017. [72] Z. Wang, L. Zhang, M. Wang, Z. Wang, D. Zhu, Y. Zhang, and W. Zhao, “High-density nand-like spin transfer torque memory with spin orbit torque erase operation,” IEEE Electron Device Letters, vol. 39, no. 3, pp. 343–346, 2018. [73] S. Mittal, “A Survey Of Techniques for Approximate Computing,” ACM Computing Surveys, 2016. [74] A. D. Booth, “A signed binary multiplication technique,” The Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236–240, 1951. [75] C. S. 
Lent, M. Liu, and Y. Lu, “Bennett clocking of quantum-dot cellular automata and the limits to binary logic scaling,” Nanotechnology, vol. 17, no. 16, p. 4240, 2006. [76] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542. [77] S. Mittal, “A Survey of ReRAM-based Architectures for Processing-in-memory and Neural Networks,” Machine learning and knowledge extraction, vol. 1, p. 5, 2018.


[78] M. Natsui, D. Suzuki, N. Sakimura, R. Nebashi, Y. Tsuji, A. Morioka, T. Sugibayashi, S. Miura, H. Honjo, K. Kinoshita et al., “Nonvolatile logic-in-memory lsi using cycle-based power gating and its application to motion-vector prediction,” IEEE Journal of Solid-State Circuits, vol. 50, no. 2, pp. 476–489, 2015. [79] S. Mittal and A. Alsalibi, “A survey of techniques for improving security of non-volatile memories,” Journal of Hardware and Systems Security, 2018. [80] Y. Wang, L. Ni, C.-H. Chang, and H. Yu, “DW-AES: A domain-wall nanowire-based AES for high throughput and energyefficient data encryption in non-volatile memory,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 11, pp. 2426–2440, 2016. [81] S. Mittal, “A Survey of Recent Prefetching Techniques for Processor Caches,” ACM Computing Surveys, 2016. [82] S. Mittal and J. Vetter, “A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems,” IEEE TPDS, vol. 27, no. 5, pp. 1524–1536, 2016. [83] S. Mittal and M. S. Inukonda, “A Survey of Techniques for Improving Error-Resilience of DRAM,” Journal of Systems Architecture, vol. 91, pp. 11–40, 2018.