
Parallel Programmable Asynchronous Neighborhood Mechanism for Kohonen SOM Implemented in CMOS Technology

Rafał Długosz, Marta Kolasa, Witold Pedrycz, Fellow, IEEE, and Michał Szulc

Abstract—We present a new programmable neighborhood mechanism for hardware-implemented Kohonen self-organizing maps (SOMs), with three different map topologies realized on a single chip. The proposed circuit is a fully parallel and asynchronous architecture. The mechanism is very fast: in a medium-sized map with several hundred neurons implemented in the complementary metal-oxide semiconductor (CMOS) 0.18 μm technology, all neurons start adapting their weights within 11 ns. The adaptation is then carried out in parallel. This is an evident advantage in comparison with commonly used software-realized SOMs. The circuit is robust against process, supply voltage, and ambient temperature variations. Owing to its simple structure, it features a low energy consumption of a few pJ per neuron per learning pattern. In this paper, we discuss different aspects of the hardware realization, such as a suitable selection of the map topology and the initial neighborhood range, as the optimization of these parameters is essential from the circuit-complexity point of view. For the optimal values of these parameters, the chip area and the power dissipation can be reduced by as much as 60% and 80%, respectively, without affecting the quality of learning.

Index Terms—Asynchronous and parallel circuits, complementary metal-oxide semiconductor implementation, Kohonen self-organizing map, low energy consumption, neighborhood mechanism.

I. INTRODUCTION

ARTIFICIAL neural networks (ANNs) realized as analog or analog-digital application-specific integrated circuits (ASICs) achieve much higher data rates than ANNs realized in software, while consuming much less energy.

Manuscript received September 20, 2010; revised July 18, 2011; accepted September 17, 2011. Date of publication October 28, 2011; date of current version December 1, 2011. R. Długosz is with the Faculty of Telecommunication and Electrical Engineering, University of Technology and Life Sciences, Bydgoszcz 85-796, Poland. He is also with the Electronics and Signal Processing Laboratory of the Swiss Federal Institute of Technology, Lausanne, Neuchâtel CH-2000, Switzerland, and the Mars Society Polska, Space Research Center of the Polish Academy of Sciences, Warsaw 00-716, Poland (e-mail: [email protected]). M. Kolasa is with the Institute of Electrical Engineering, University of Technology and Life Sciences, Bydgoszcz 85-796, Poland (e-mail: [email protected]). W. Pedrycz is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton AB T6G 2V4, Canada. He is also with the Systems Research Institute, Polish Academy of Sciences, Warsaw 01-447, Poland (e-mail: [email protected]). M. Szulc is with Computer Engineering, Poznań University of Technology, Poznań 60-965, Poland (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2011.2169809

The high data rate is due to the parallel data processing performed by such ANNs, in which each neuron operates as an autonomous processing unit. The low power dissipation, on the other hand, results from the problem-driven optimization of the structure of the modules of the ANN. This significantly reduces the number of transistors on the chip, which in turn reduces the chip area and the energy consumption. These features make ANNs realized as ASICs suitable for applications in entirely new areas, for example, in wireless body sensor networks (WBSNs) used in medical diagnostics. In a typical WBSN, small, ultralow-power sensors placed on the human body collect various biomedical data and transmit them to a base station for further analysis. As most of the collected data are transmitted, the radio-frequency (RF) communication blocks of the sensors consume up to 90% of the total energy, which reduces the battery life span [1], [2]. One optimization approach for WBSN devices is to optimize the RF block [2]. Another option is to reduce the amount of transmitted data, which can be realized by locating some data processing tasks directly at the sensor level. This concept calls for energy-efficient and low-chip-area signal processing units. The self-organizing map (SOM) proposed in this paper fits this concept, offering a data-rate-to-energy-consumption ratio up to 10 000 times higher than that of an ANN implemented on a typical PC.

In this paper, we propose a new, very fast, fully parallel, asynchronous neighborhood mechanism for Kohonen SOMs implemented at the transistor level. The proposed circuit enables a fast determination of the distances to all neighboring neurons, independently of the location of the winner, the neighborhood range, and the number of neurons forming the map. In the literature, one can find only a few implementations of such a mechanism; they are described in Section II. In comparison with them, the proposed programmable circuit is more flexible, as it enables operation with different map topologies on a single chip. Additionally, it can be used with either an analog or a digital SOM. The idea of the discussed circuit is introduced in Section III. Section IV presents detailed simulations performed by means of the software model of the SOM. The purpose of these investigations was to find network parameters that allow reducing the circuit complexity without affecting the learning capabilities of the SOM. The transistor-level implementation of the proposed neighborhood mechanism is described in Section V. In that section, we discuss various hardware realization issues and trade-offs present in the design and implementation of this mechanism for an example map with 8 × 8 neurons. We also study the influence of process, supply voltage, and environmental temperature (PVT) variations (the corner analysis) on the behavior of the circuit. The conclusions are formulated in Section VI.



II. STATE-OF-THE-ART IN HARDWARE-REALIZED SOMS

Competitive learning in SOMs is an iterative process consisting of learning cycles, in which training patterns, being vectors in an n-dimensional space of real numbers and coming from a given learning set, are presented to the SOM in a random fashion. In each learning cycle, l, the network computes the distance between a given pattern X(l) and the weight vectors W_j(l) of all neurons. The distance can be specified, for example, as the Manhattan (L1) or the Euclidean (L2) distance. The neuron whose weights resemble a given input pattern to the highest extent becomes the winner.

When selecting the learning algorithm for a hardware-realized SOM, we must consider trade-offs that differ from those discussed in the case of software realizations. The paramount features in this case are the power dissipation and the chip area, which in software systems are of secondary importance. For this reason, at the transistor level we have to use simple algorithms, or optimize existing algorithms in such a way that they require a minimum number of electronic components. One commonly used algorithm, which is relatively simple, is the winner-takes-most (WTM) algorithm, often referred to as the classic Kohonen SOM [3]. In this case, the adjustment of the weights is realized as follows:

$$W_j(l+1) = W_j(l) + \eta(k)\,G(R, d(i,j))\,[X(l) - W_j(l)] \qquad (1)$$

where η(k) is the learning rate in the kth training epoch and W_j(l) is the weight vector of the jth neuron in the map. The neighboring neurons are adjusted with different intensities that depend on the neighborhood function (NF) G(). The G() function, in turn, depends on the distance d(i, j) between the winning ith neuron and a given jth neighboring neuron in the map, and on the neighborhood range, R, in a given epoch.

In software realizations, the type of the NF is of limited importance when considering its implementation. In hardware realizations, on the other hand, particular NFs feature substantially different hardware complexity. For this reason, in this paper we carefully investigate the influence of particular NFs on the learning process. In the classical approach, the rectangular neighborhood function (RNF) is used [3]

$$G(R, d(i,j)) = \begin{cases} 1, & \text{for } d(i,j) \le R \\ 0, & \text{for } d(i,j) > R \end{cases} \qquad (2)$$

where R is the range of the neighborhood. As stressed in the literature, better results are achieved by using the Gaussian NF (GNF) [4], which is described in the form

$$G(R, d(i,j)) = \exp\left(-\frac{d^2(i,j)}{2R^2}\right). \qquad (3)$$

A realization of the GNF in software-based SOMs is simple, but the hardware realization of (3) is very complex [5], [6].

The problem becomes even more evident when all neurons operate in parallel, since each neuron must then contain a separate block that calculates the value of the GNF. Our investigations carried out at the transistor level show that the hardware complexity of the GNF is as much as two orders of magnitude larger than that of the RNF. For this reason, we have recently proposed an efficient digital, clockless hardware implementation of the triangular neighborhood function (TNF). This realization can be viewed as a compromise solution, as it requires only a single multiplication and is performed in a single step. Detailed simulations of the software model of the SOM demonstrate that the TNF forms a very good approximation of the GNF [7]. The TNF is defined as follows:

$$G(R, d(i,j)) = \begin{cases} -a(\eta_0)\cdot(R - d(i,j)) + c, & \text{for } d(i,j) \le R \\ 0, & \text{for } d(i,j) > R \end{cases} \qquad (4)$$

where a is the steepness of the function, while η0 is the value of the winner's learning rate. The parameters a, R, and η0 decrease to zero after each epoch.
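For reference, the three NFs can be summarized in a minimal behavioral sketch in C++ (the language of the software model described in Section IV); this models only (2)-(4), not the transistor-level circuits, and the signatures are our own.

```cpp
#include <cmath>

// Behavioral sketch of the three neighborhood functions; d is the
// topological distance d(i,j) and R the current neighborhood range.

// Rectangular NF, Eq. (2): every neuron within the range is adapted equally.
double rnf(int R, int d) {
    return (d <= R) ? 1.0 : 0.0;
}

// Gaussian NF, Eq. (3): smooth decay, but costly in hardware.
double gnf(int R, int d) {
    return std::exp(-static_cast<double>(d) * d / (2.0 * R * R));
}

// Triangular NF, Eq. (4): piecewise-linear approximation of the GNF,
// requiring a single multiplication; a (a function of the winner's
// learning rate eta0) and c decrease to zero after each epoch, as does R.
double tnf(int R, int d, double a, double c) {
    return (d <= R) ? (-a * (R - d) + c) : 0.0;
}
```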

One of the paramount benefits resulting from hardware realization is the possibility of parallel operation of all neurons in the map. However, in the case of certain algorithms, it becomes the source of specific problems not present in software realizations. To illustrate this issue, we refer to the SOM algorithm proposed by Lee and Verleysen in [8], called the fisherman rule. In the classic SOM, all neurons that belong to the winner's neighborhood are adapted so that their connections (weights) move toward the input pattern X(l), as shown in (1). In the algorithm proposed in [8], in the first iteration the winning neuron is adapted in the same way as in the "classic" update rule

$$W_i(l+1) = W_i(l) + \alpha_0\,[X(l) - W_i(l)]. \qquad (5)$$

For the distance d = 0, the α0 parameter is equal to the η(k)·G() term present in (1). In the fisherman rule, on the other hand, the neighboring neurons (for d = 1, ..., R) are trained in an iterative fashion according to the formula

$$W_d(l+1) = W_d(l) + \alpha_d\,[W_{d-1}(l+1) - W_d(l)]. \qquad (6)$$

In the second iteration, for d = 1, all neurons positioned on the first ring surrounding the winner are adapted in such a way that their weights move toward the weights of the winning neuron calculated in the first iteration. The neurons forming the second ring, i.e., for d = 2, are in the next iteration adapted toward the updated weights of the neurons of the first ring, and so forth. A detailed comparison between different learning rules, presented in [8], shows that the fisherman algorithm usually leads to better results than the classic one, although in many cases the results are comparable. In the case of software realization, in which the weights of particular neurons are calculated sequentially, the classic and the fisherman algorithms have comparable computational complexity and ease of realization, so there is a sound rationale behind using the fisherman rule. In hardware realization, on the other hand, the fisherman rule is significantly more complex, as the described iterative adaptation sequence has to be controlled by a clock.
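To make the difference concrete, the sketch below contrasts one adaptation step of the classic rule (1) with the fisherman rule (5)-(6). It keeps a single representative weight vector per ring (index d = 0 is the winner); this simplification and the container types are our assumptions, not the implementation of [8].

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Classic WTM step, Eq. (1): every ring moves toward the input pattern X.
// Each update depends only on X, so all rings can be adapted in parallel.
void classicStep(std::vector<Vec>& ring, const Vec& X,
                 const std::vector<double>& etaG) {  // etaG[d] = eta(k)*G(R,d)
    for (std::size_t d = 0; d < ring.size(); ++d)
        for (std::size_t l = 0; l < X.size(); ++l)
            ring[d][l] += etaG[d] * (X[l] - ring[d][l]);
}

// Fisherman rule, Eqs. (5)-(6): ring d moves toward the *updated* weights
// of ring d-1, so the rings must be processed strictly in sequence.
void fishermanStep(std::vector<Vec>& ring, const Vec& X,
                   const std::vector<double>& alpha) {
    for (std::size_t l = 0; l < X.size(); ++l)       // winner, Eq. (5)
        ring[0][l] += alpha[0] * (X[l] - ring[0][l]);
    for (std::size_t d = 1; d < ring.size(); ++d)    // neighbors, Eq. (6)
        for (std::size_t l = 0; l < X.size(); ++l)
            ring[d][l] += alpha[d] * (ring[d - 1][l] - ring[d][l]);
}
```

The data dependence between consecutive rings in fishermanStep is precisely what forces a clocked, ring-by-ring schedule in hardware.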



Fig. 1. Analog neighborhood mechanism reported by Peiris in [9] redesigned by us in the CMOS 0.18 μm technology.

The iterative nature of the second algorithm is the source of another disadvantage: the adaptation of neurons in each ring can be undertaken only after the adaptation in the preceding ring has been completed. This introduces a large delay that significantly slows down the adaptation phase. For this reason, we propose an implementation of the classic WTM algorithm, in which the adaptation of all neurons can be performed fully in parallel. In our opinion, hardware complexity is the underlying reason why the classic algorithm is used in all transistor-level realizations of the SOM reported to date.

A realization of a fast, fully parallel SOM requires the development of a special neighborhood mechanism, which is the main topic of this paper. This mechanism calls for two different circuits that must be clearly distinguished. The first one is the NF block that calculates the factor α in (5) and (6), or the term η(k)·G() in (1), for particular rings of neurons, using a signal that is proportional to the topological distance d(i, j). To determine this distance, another block is required; we will refer to it as the distance determination (DD) block. The majority of reported solutions address the realization of the first block only, while only a few circuit solutions reported in the literature deal with the realization of the DD mechanism. This problem is not trivial at all, especially for large maps.

The first known analog implementation of the DD block, based on a nonlinear diffusion network, was reported by Peiris in [9]. Since it is the only reported solution suitable for large maps, with R >> 1, we compare our design with this circuit in more detail. As this circuit was originally realized in the complementary metal-oxide semiconductor (CMOS) 2 μm technology, to make the comparison meaningful we redesigned it in the newer CMOS 0.18 μm process, introducing some improvements. The circuit, shown in Fig. 1, is able to determine the distances between neurons for different values of R. This solution offers a quasi-triangular function, in which the steepness a, as well as the neighborhood range R, are controlled by two DC bias voltages. The WSC signals come from the winner selecting circuit (WSC) that determines the winning neuron in a given cycle.

Peiris's solution employs the principle of the current divider. The input current Itr, which is provided to the winning neuron through a p-channel MOS (PMOS) current mirror, is shared with its neighboring neurons.


Each neuron has an individual connection to ground, realized using n-channel MOS (NMOS) transistors whose channel on-resistance is controlled by the VBIAS2 voltage. The control of the VBIAS1 and VBIAS2 voltages allows adjusting the neighborhood range and the shape of the NF. The voltages at particular nodes Nx,y, which represent the η·G() term in (1), depend on the on-resistances of the corresponding NMOS and PMOS transistors. This solution offers several advantages, such as programmability and low circuit complexity, as only a few transistors per neuron are required. On the other hand, it suffers from several drawbacks. In the case of large maps, the current mirrors used in this circuit are located far from one another, which becomes a source of large mismatch errors. To bring these errors down to only a few percent, the sizes (W/L) of the transistors should be at least 15/2 μm [10]. By comparison, in our digital solution we use transistors with minimal sizes of 0.6/0.18 μm. As a result, although our circuit requires a substantially larger number of transistors, the resulting chip area is comparable.

The current Itr must be high enough to enable the assumed value of the neighborhood range and a proper shaping of the NF. This current should increase with the square of the neighborhood range to keep the data rate constant. For example, in a map with 3 × 3 neurons and R = 1, the current is shared among only nine neurons, while in a map with 9 × 9 neurons and R = 4 it is shared among 81 neurons. If, in the second case, the current remained the same, it would have to recharge the parasitic capacitances associated with a larger number of nodes, thus reducing the data rate. Another disadvantage results from the fact that the currents in particular branches (NMOS and PMOS transistors) must flow until the adaptation phase for a given input pattern is completed, as the voltages at particular nodes Nx,y must be kept constant during this time. By comparison, in the solution proposed in this paper, after the determination of all topological distances and the values of the NFs in particular neurons, the power dissipation drops to zero. A further disadvantage of Peiris's solution is a drift of the voltages at particular nodes due to process, voltage, and temperature variation. This makes the overall learning process strongly dependent on external parameters. In the proposed digital solution, the variation of external parameters influences only the data rate.

Another DD circuit has recently been proposed by Chang et al. in [5]. This is the first known application of the analog GNF in hardware-realized SOMs, although GNF circuits as separate blocks have been reported before [6], [11]. The solution presented in [5] has been verified on small maps with only 2 × 2 neurons, with the maximum neighborhood range equal to one. The output current of this circuit represents a distance (ρc − ρi)², where ρc is an identifier of the winning neuron, while the ρi are unique identifiers of the particular neighboring neurons in the map. The ρi identifiers are DC currents whose values are specified off-line, before the learning process takes place. In the case of a map with 2 × 2 neurons, DC values of these currents are sufficient, since all distances are constant. In larger maps, the values of the ρi identifiers depend upon the location of the winner, so in our opinion this solution still needs a special circuit to determine


Fig. 2. Different SOM topologies: (a) rectangular grid with four neighbors (Rect4), (b) rectangular grid with eight neighbors (Rect8), and (c) hexagonal grid (Hex).

the distances in an on-line fashion. The DD circuit proposed in this paper could be used as a component of this SOM if a larger number of neurons were used.

Macq, Verleysen, et al. have proposed a fully analog Kohonen SOM with a digital DD block [12]. We do not focus on this concept here, as it is a simple solution with a small neighborhood range, R = 1. In such a case, the DD block is realized with only several logic gates per neuron.

In this paper, we present a novel digital neighborhood mechanism, which is an efficient, parallel, and asynchronous solution with a programmable radius R and, additionally, a programmable map topology. The parallel operation makes this circuit extremely fast. Even in the case of large maps with hundreds of neurons, the distances to all neighboring neurons are determined within approximately a dozen nanoseconds (for the circuit realized in the CMOS 0.18 μm technology). The proposed circuit is an asynchronous solution, which significantly simplifies the overall structure. The circuit operates with three different topologies, i.e., with four, six, and eight neighbors, as shown in Fig. 2. We will refer to them as Rect4, Hex, and Rect8, respectively. To switch the map between particular topologies, only two bits are required. The reprogramming can be performed even on-line, while the learning process is in progress, in less than 1 ns.

The proposed mechanism offers a universal solution that can be used in either analog SOMs or digital SOMs operating with either fixed- or floating-point numbers [13]. This is possible because the neighborhood mechanism itself is a separate block, not directly involved in the data processing in particular neurons. The output signals of this block at particular rings of neighbors are digital fixed-point numbers representing the distance from the winner. These signals can be used as the input signals of the NF block, which is one of the components of the adaptation mechanism. For example, in the analog current-mode winner-takes-all (WTA) NN recently presented in [14], the weight updates Δw = η(x − w) were represented as analog signals, with the learning rate η controlled by an external multi-bit signal through a multi-output current mirror with binary-weighted widths of the particular output transistors. This approach enables a direct link between the digital signals of the proposed mechanism and the analog adaptation block.

Fig. 3. General diagram of the proposed solution: placement of neurons together with the connection scheme.

To make the overall presentation complete, we briefly compare the proposed solution with digital SOMs realized with field-programmable gate arrays (FPGAs). The FPGA is an interesting design platform that enables a fast and inexpensive realization of any digital circuit, especially if a small number of devices is required. Nevertheless, in the considered case the FPGA platform is not suitable for several reasons. The reported implementations [15], [16] of both the WTA and the WTM SOMs show that the achieved performance is as much as two orders of magnitude worse than the performance of similar NNs realized as ASICs. A meaningful comparison is possible by considering the connection updates per second (CUPS). A figure-of-merit (FOM) can be defined as CUPS over the power dissipation (i.e., connection updates over the energy consumption) of a single device. In an example realization described in [15], a SOM with 25 neurons and 23 inputs, at a data rate of 10.86 kS/s, dissipates 32 mW of power. This SOM achieves 6.25 MCUPS, with an FOM equal to 192e6 CU/J. By comparison, the proposed SOM with 64 neurons and three inputs, at a data rate of 10 MS/s, dissipates 19 mW of power, i.e., it achieves 1920 MCUPS with an FOM equal to 100e9 CU/J. Updating a single connection consumes 10 pJ of energy, as compared with 5200 pJ in [15]. This makes the proposed SOM at least two orders of magnitude more efficient than the SOM described in [15]. Moreover, FPGA platforms support the realization of digital circuits only, while the proposed circuit can be used with both analog and digital NNs. Finally, although the proposed circuit is digital, it must be designed in the full-custom style at the transistor level, since a proper layout of particular components is one of the key design parameters here. All these reasons support the realization of the proposed SOM as an ASIC.
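As a quick sanity check of these figures (our arithmetic, using only the numbers quoted above):

$$\mathrm{FOM} = \frac{\mathrm{CUPS}}{P} = \frac{64 \times 3 \times 10\,\mathrm{MS/s}}{19\,\mathrm{mW}} = \frac{1.92 \times 10^{9}\,\mathrm{CU/s}}{19\,\mathrm{mW}} \approx 100\mathrm{e}9\,\mathrm{CU/J}$$

i.e., roughly 10 pJ per connection update, consistent with the quoted energy figures.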

Fig. 4. Schematic diagram of a single neuron used in the proposed NN.

III. PROPOSED PARALLEL AND ASYNCHRONOUS NEIGHBORHOOD MECHANISM

A transistor-level implementation calls for solving specific design problems [17]-[19]. One of them is a proper placement of neurons on the chip, so that the routing between them minimizes the silicon area. The connecting paths between particular neurons should be as short as possible to minimize the parasitic capacitances to the substrate. An arrangement of the chip layout that meets these requirements is shown schematically in Fig. 3. This approach offers a regular placement of neurons that allows for an efficient routing of all neurons to the WSC and, owing to its modular structure, enables the realization of maps of different sizes in a relatively short time, by simple replication of neurons.

Each neuron in the SOM contains several key components, such as the distance calculation block (L1/L2), the adaptation block, and, optionally, the conscience mechanism [14], [20]. In this paper, we mostly focus on the neighborhood mechanism itself, as the other blocks in the analog version have already been developed and verified by us in a prototype WTA NN through laboratory tests [14], while the realization of the digital version of these components is relatively simple. Although the new mechanism adds several new blocks to each neuron [18], it significantly increases the overall functionality of the SOM. The simplified structure of a single neuron is shown in Fig. 4. For better illustration, we present only those new blocks that are specific to the proposed neighborhood mechanism, namely the EN_PROP and the R_PROP circuits.

In the proposed solution, each neuron is connected with only the closest p neighbors, as shown in Figs. 2 and 3, where the value of p depends on the type of the topology and equals four, six, or eight. The connection between any pair of neighbors requires 2(q + 1) signal lines, where q is the number of bits representing the maximal value of the neighborhood range (radius) Rmax. In an example map with 16 × 16 neurons and the Rect8 topology, Rmax < 16, resulting in q = 4. In the same map with the Rect4 topology, Rmax < 32, i.e., q = 5. One additional line is required to transfer the EN signal. The factor of 2 is needed, as all signals are sent in both directions.
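The relation between Rmax, q, and the wiring can be captured in a short helper; the function name is ours, and the computation simply mirrors the 2(q + 1) rule stated above.

```cpp
// Number of signal lines between one pair of neighboring neurons:
// q bits for the radius plus one EN line, each sent in both directions.
int linesPerNeighborPair(int Rmax) {
    int q = 0;
    while ((1 << q) <= Rmax) ++q;  // smallest q with 2^q > Rmax
    return 2 * (q + 1);
}
// Example: Rmax < 16 (16 x 16 map, Rect8) gives q = 4 and 10 lines;
// Rmax < 32 (same map, Rect4) gives q = 5 and 12 lines.
```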


Fig. 5. Structure of the EN propagation (EN_PROP) block used in (a) Rect8, (b) Rect4, and (c) Hex topology.

The proposed mechanism can be described as follows. The winning neuron, viz. the neuron whose WSC identification signal becomes one, responds by sending a 1-bit EN signal in all directions using the EN_PROP circuit shown in Fig. 5. Particular ENout_i signals of this neuron become the ENin signals of its closest neighbors. Note that the WSC signal is a privileged signal that is allowed to activate all ENout_i signals. A given ENin_i signal, on the other hand, activates only selected ENout_i signals. As a result, the neighboring neurons always receive the ENin signals from only one direction and generate in response one or three ENout signals that are sent to the next ring of neighbors. Consider, for example, the Rect8 grid shown in Fig. 5(a). If the ENin signal comes from one of the N, S, W, E directions, then only one ENout signal is generated at the opposite side. On the other hand, when the ENin signal comes from one of the NE, NW, SE, SW directions, the neuron generates three ENout signals at the opposite side. Such a distinction is necessary, as the number of neurons in each successive ring increases by eight. The propagation of the EN signal resembles a wave that spreads asynchronously in all directions, originating concentrically from the winning neuron.


Fig. 6. Radius propagation (R_PROP) block.

Fig. 7. Propagation of signals in the neighborhood mechanism.

Fig. 8. Transition between particular map topologies realized on a single chip: (a) basic Rect8 topology, (b) Rect8 → Rect4, and (c) Rect8 → Hex.

The only delay results from the delay of several logic gates located at particular rings, but the overall process is very fast.

The EN_PROP block itself, shown for the different map topologies in Fig. 5, does not include any mechanism that could terminate the propagation of the EN signal at a required distance d(i, j). This problem has been solved using another block (R_PROP), shown in Fig. 6, and an additional signal r = R − d(i, j) that runs in parallel with the EN signal. A diagram illustrating the EN and r signals at particular rings of neighbors, for an example value of RPROG = 3, is shown in Fig. 7. After each learning epoch, all neurons are re-programmed in parallel by receiving a new value of the RPROG variable that determines the neighborhood range. For a given input pattern X(l), only the winning neuron is allowed to use the RPROG variable, as shown in Figs. 4 and 7. In this case, the RPROG signal, gated by the WSC = 1 signal, becomes the input signal, rin, of the R_PROP block in the winning neuron. This signal, once its value has been decreased by one, is then resent in all directions as the rout signal. The neighboring neurons also decrease the value of this signal by one and resend it in all directions. To avoid ambiguity, particular neurons can receive the rin signal only from the direction from which they have received an ENin signal equal to one, as illustrated in Fig. 4. The rin signals access the R_PROP blocks through switches operated by the ENin signals; the access to the R_PROP block can also be realized with AND logic gates. The propagation of the EN signal stops at the ring of neurons for which the signal r becomes 0 (STOPi = 0), as shown in Fig. 7. The propagation of the r signal proceeds asynchronously, with the delay of several logic gates in the R_PROP block. For an example value of RPROG = 16, in the CMOS 0.18 μm process, the delay between the first and the last neuron in the chain equals 11 ns, i.e., approximately 0.68 ns per ring. In newer technologies, this delay can be shortened even by one order of magnitude.

A separate design aspect concerns the realization of the NF block, as described in Section II. Note that for the RNF given by (2), the EN signal can directly be used to trigger the adaptation process in a given neuron, while the learning rate η is equal for all neighboring neurons and is re-programmed after each epoch in the same way as the RPROG variable. This solution produces the circuit of the lowest complexity. A different situation arises when the TNF or the GNF is used. In this case, the term η·G() in (1) is calculated in particular neurons on the basis of the local values of the r signal, which requires an additional NF block in each neuron and increases the complexity of the circuit.
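The joint behavior of EN_PROP and R_PROP can be modeled in software as a breadth-first wave. The sketch below is behavioral only (the circuit itself is asynchronous and unclocked) and assumes the Rect4 topology for brevity.

```cpp
#include <queue>
#include <utility>
#include <vector>

// Returns r = RPROG - d(i,j) for every neuron reached by the wave, or -1
// for neurons it never reaches. The winner (winRow, winCol) injects
// r = RPROG; each ring forwards EN together with r decreased by one, and
// the wave dies out at the ring where r reaches zero (the STOP condition).
std::vector<int> propagate(int rows, int cols,
                           int winRow, int winCol, int RPROG) {
    std::vector<int> r(rows * cols, -1);
    std::queue<std::pair<int, int>> wave;
    r[winRow * cols + winCol] = RPROG;
    wave.push({winRow, winCol});
    const int dr[4] = {-1, 1, 0, 0}, dc[4] = {0, 0, -1, 1};  // N, S, W, E
    while (!wave.empty()) {
        auto [row, col] = wave.front();
        wave.pop();
        int rout = r[row * cols + col] - 1;        // R_PROP decrements r
        if (rout < 0) continue;                    // STOP: r has reached zero
        for (int k = 0; k < 4; ++k) {
            int nr = row + dr[k], nc = col + dc[k];
            if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
            if (r[nr * cols + nc] >= 0) continue;  // each neuron accepts EN once
            r[nr * cols + nc] = rout;
            wave.push({nr, nc});
        }
    }
    return r;
}
// Note that the returned r is exactly the signal the TNF block consumes:
// by (4), G = -a*r + c for r >= 0, i.e., a single multiplication.
```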

TABLE I
TRANSITION SCHEME BETWEEN PARTICULAR MAP TOPOLOGIES

        c1   c2   c3   c4   c5   c6   c7   c8
Rect8   B    A    B    A    B    A    B    A
Hex     B    A    -    B    A    B    -    A
Rect4   -    B    -    A    -    B    -    A

A. Proposed Programmable Map Topology

The simulations performed with the use of the software model of the proposed SOM, presented in the next section, show that the network topology impacts the quality of the learning process, depending also upon the values of other parameters. For this reason, we propose a programmable solution that operates with all three topologies mentioned above [18]. The R_PROP circuit has the same structure in all cases and thus does not need to be re-programmed, which is an advantage here. The mechanism differs only in the structure of the EN_PROP block, as shown in Fig. 5. Note that in all topologies, particular input signals of the EN_PROP circuit activate either one or three ENout signals at the opposite side of this block. The two types of inputs, denoted in Fig. 8 as (A) and (B), are placed alternately. Fig. 8 presents how the transition between particular topologies can be achieved on a single chip. The transition between the Rect8 and Rect4 topologies requires only breaking the diagonal connections (c1, c3, c5, c7) and re-programming the status (A → B) of both vertical directions.


Fig. 9. Reconfigurable EN_PROP circuit used in the programmable version of the proposed neighborhood mechanism.

Fig. 10. Simulations illustrating the operation of the programmable EN_PROP circuit, which is responsible for reprogramming the topology of the SOM. Three topologies (Rect8, Hex, Rect4) are available on a single chip.

The transition between the Rect8 and the Hex topologies is realized by "shifting" every row by half of the distance between two adjacent neurons, breaking the two diagonal connections c3 and c7, and re-programming the c4, c5, and c6 outputs. This scheme is also shown in Table I.

The structure of the reconfigurable EN_PROP block is shown in Fig. 9. Each ENin → ENout path inside this block is drawn as a separate branch for better illustration. Each neuron contains eight OR gates, denoted O1-O8, and additionally the configuration AND gates that, together with the programming signals S4, S6, and S8, are used to switch on a given topology. The Sp signals are coded on only two bits. The WSC signal that identifies the winning neuron requires several additional AND gates, not shown for simplicity. Reprogramming is performed by setting one of the Sp bits, which control the directions in which the spreading wave of the EN signals and the decreasing r signal can flow. As a result, the Sp bits block the directions that are not permitted in a given topology. This mechanism is very simple.
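The content of Table I translates directly into a small configuration helper; the enum and function names are ours, and the three rows are transcribed from the table.

```cpp
#include <array>

enum class Topology { Rect8, Hex, Rect4 };

// Status of the eight links c1..c8 of a neuron: 'A'/'B' mark the two
// input types of the EN_PROP block, '-' an open (broken) link.
std::array<char, 8> linkConfig(Topology t) {
    switch (t) {
        case Topology::Rect8: return {'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A'};
        case Topology::Hex:   return {'B', 'A', '-', 'B', 'A', 'B', '-', 'A'};
        case Topology::Rect4: return {'-', 'B', '-', 'A', '-', 'B', '-', 'A'};
    }
    return {};
}
// Rect8 -> Rect4 opens the diagonals c1, c3, c5, c7 and switches the
// vertical inputs A -> B; Rect8 -> Hex opens c3 and c7 and re-programs
// c4-c6, matching the transitions described above.
```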


The results of transistor-level simulations illustrating how the programmable EN_PROP block operates are shown in Fig. 10. The top diagram illustrates a sequence of the ENin_i signals, the middle diagram shows the resultant ENout_i output signals, and the control Sp and WSC winning-neuron identification signals are shown at the bottom. In the first period (0-0.5 μs), the circuit operates in the Rect8 mode (S8 = 1). For the ENin_1, ENin_3, ENin_5, ENin_7 signals equal to one, three neighboring ENout signals are always equal to one, while in the remaining cases only one ENout signal is one. For WSC = 1, all ENout signals are one, as expected. In the second period (0.5-1 μs), the circuit is switched to the Hex mode (S6 = 1), and in the last period it works in the Rect4 mode (S4 = 1). Note that the WSC signal in each mode fires only those outputs that are allowed in the given topology. This solution is very efficient: the learning abilities of the SOM have been significantly increased at the expense of only a few additional logic gates per neuron.

IV. OPTIMIZATION OF THE SOM ON THE SYSTEM LEVEL

As described in Section II, various parameters influence the hardware complexity. To determine the influence of these parameters, as well as of the input data, on the learning process, we completed a series of simulations using a software model (in C++) of the network. Comprehensive tests have been carried out for the three topologies, map sizes varying between 4 × 4 and 64 × 64 neurons, different numbers of inputs, different values of the initial neighborhood size, Rmax, and different training sets. The network was trained with 2-D and 3-D data regularly placed in the input space, as shown further in Fig. 13(a)-(d), as well as with data centers randomly distributed in this space, as shown in Fig. 13(e)-(h). Centers representing particular data classes were surrounded by different numbers of patterns X, with the maximum distance from these centers treated as yet another parameter.

Investigations of this type have been carried out in [21] for small maps of sizes between 3 × 4 and 8 × 7 neurons and for a limited number of combinations of the mentioned parameters. In [21], two topologies have been compared, i.e., Rect4 and Hex. The learning process has been assessed using two criteria, namely the quantization and topographic errors given further by (8) and (9), respectively. The results reported in [21] show differences between the two topologies even in the case of small maps, which makes the implementation of the SOM as a programmable structure a legitimate solution. In our work, we performed much more detailed investigations, with much larger maps, as the objective is to build a chip in which no changes are possible after fabrication. To find the optimal settings of the network parameters, we performed more than six thousand simulations for their different combinations. In this section, we report selected results which can be regarded as representative of the overall suite of experiments.

The parameter that exhibits the main influence on the circuit complexity is Rmax, the neighborhood range R at the beginning of the learning process. This parameter determines the number of connecting paths between pairs of neurons, as well as the number of transistors in the R_PROP circuit. For this reason, most of the results are presented as a function of the Rmax parameter. It is often assumed that the neighborhood range Rmax at the beginning of the learning process should cover at least half of the map [3], [22] and then gradually decrease to zero.


The reduction of the value of this parameter can be realized in the following manner:

$$R_k = 1.00001 + (R_{max} - 1)\cdot\left(1 - k/L_{max}\right) \qquad (7)$$
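In a software model, (7) is a one-line schedule; the function name is ours.

```cpp
// Linear decrease of the neighborhood range over the ordering phase,
// Eq. (7); k is the current iteration, Lmax the total number of
// iterations in this phase. The 1.00001 offset keeps Rk just above 1.
double neighborhoodRange(int k, int Lmax, double Rmax) {
    return 1.00001 + (Rmax - 1.0) * (1.0 - static_cast<double>(k) / Lmax);
}
```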

where k stands for the kth iteration and Lmax is the total number of iterations in the ordering phase of the learning process. Using the software model, we have verified this observation for different values of the parameters listed above.

The effectiveness of the learning process of the SOM can be assessed using different criteria; widely used alternatives are the quantization error and the topographic error [8], [21], [23]-[26]. In this paper, we consider five criteria described in [8]. They allow assessing the learning process in two different ways, i.e., by expressing the quality of the vector quantization as well as the quality of the topographic mapping.

The quantization quality is assessed using two measures. One of them is the quantization error, defined as follows:

$$Q_{err} = \frac{1}{m}\sum_{j=1}^{m}\sqrt{\sum_{l=1}^{n}\left(x_{j,l} - w_{i,l}\right)^2}. \qquad (8)$$

In this formula, m is the total number of the patterns X in the input data set, while i denotes the winning neuron. Qerr is the error the SOM makes when approximating the input vectors by the weight vectors of the winning neurons; it illustrates how well the map fits the input data [21]. Its major disadvantage is the dependence of Qerr on the number of neurons: for larger maps the value of Qerr decreases, as the distances between neurons decrease [23]. The second measure used to assess the quantization quality is the percentage of dead neurons (PDN), i.e., the ratio of inactive neurons to the total number of neurons. Let us recall that dead neurons are neurons that never win and thus do not represent any input data. These errors are not useful in the assessment of the topological order of the map.

The quality of the topographic mapping is assessed using three measures [8]. The first one is the topographic error ET1, defined as follows:

$$E_{T1} = 1 - \frac{1}{m}\sum_{h=1}^{m}\lambda(X_h). \qquad (9)$$

This is one of the measures proposed by Kohonen [3], [21]. The value of λ(Xh) equals one when, for a given pattern X, the two neurons whose weight vectors resemble this pattern to the highest extent are also direct neighbors in the map; otherwise λ(Xh) equals zero. The lower the value of ET1, the better the SOM preserves the topology [21], [23].

The remaining two measures concerning the topographic mapping do not require knowledge of the input data. In the second criterion, we first calculate the Euclidean distances between the weights of the ρth neuron and all other neurons. Next, we check whether all p direct neighbors of neuron ρ are the nearest ones to this neuron in the sense of the Euclidean distance measured in the feature space. To express this requirement formally, let us assume that neuron ρ has p = |N(ρ)| direct neighbors, where p depends on the topology, and that a function g(ρ) returns the number of direct neighbors that are also the closest to neuron ρ in the feature space. The ET2 criterion for P neurons in the map can then be defined as

$$E_{T2} = \frac{1}{P}\sum_{\rho=1}^{P}\frac{g(\rho)}{|N(\rho)|}. \qquad (10)$$

The optimal value of ET2 equals one. In the third criterion, we build around each neuron ρ a Euclidean neighborhood in the feature space, defined as a sphere with the radius

$$R(\rho) = \max_{s \in N(\rho)} \|W_{\rho} - W_s\| \qquad (11)$$

where Wρ are the weights of a given neuron ρ, while Ws are the weights of its particular direct neighbors. Then we count the neurons that are not direct neighbors of neuron ρ but are located inside R(ρ). The ET3 criterion, with the optimal value equal to zero, is defined as follows:

$$E_{T3} = \frac{1}{P}\sum_{\rho=1}^{P}\left|\left\{ s : s \neq \rho,\ s \notin N(\rho),\ \|W_{\rho} - W_s\| < R(\rho) \right\}\right|. \qquad (12)$$
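Two of the five criteria are sketched below in the style of the C++ software model; twoBMUs() and the areNeighbors() adjacency test are illustrative helpers (adjacency depends on the chosen topology), not the exact code used for the experiments.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<double>;

// Squared Euclidean distance between an input pattern and a weight vector.
static double dist2(const Vec& x, const Vec& w) {
    double s = 0.0;
    for (std::size_t l = 0; l < x.size(); ++l)
        s += (x[l] - w[l]) * (x[l] - w[l]);
    return s;
}

// Indices of the two neurons closest to x (winner and runner-up).
static std::pair<std::size_t, std::size_t>
twoBMUs(const Vec& x, const std::vector<Vec>& W) {
    std::size_t b0 = 0, b1 = 1;
    if (dist2(x, W[1]) < dist2(x, W[0])) std::swap(b0, b1);
    for (std::size_t j = 2; j < W.size(); ++j) {
        double d = dist2(x, W[j]);
        if (d < dist2(x, W[b0]))      { b1 = b0; b0 = j; }
        else if (d < dist2(x, W[b1])) { b1 = j; }
    }
    return {b0, b1};
}

// Quantization error, Eq. (8): mean distance between each pattern X_j
// and the weight vector of its winning neuron.
double Qerr(const std::vector<Vec>& X, const std::vector<Vec>& W) {
    double q = 0.0;
    for (const Vec& x : X) q += std::sqrt(dist2(x, W[twoBMUs(x, W).first]));
    return q / X.size();
}

// Topographic error ET1, Eq. (9): lambda(X_h) = 1 when the two neurons
// closest to X_h are also direct neighbors in the map, 0 otherwise.
double ET1(const std::vector<Vec>& X, const std::vector<Vec>& W,
           bool (*areNeighbors)(std::size_t, std::size_t)) {
    double sum = 0.0;
    for (const Vec& x : X) {
        auto [b0, b1] = twoBMUs(x, W);
        sum += areNeighbors(b0, b1) ? 1.0 : 0.0;
    }
    return 1.0 - sum / X.size();
}
```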

The simulations based on the software model that illustrate the quality of the learning process, assessed on the basis of the five criteria described above, are shown in Fig. 11 for particular topologies and different values of the other parameters. The simulations have been performed for map sizes between 4 × 4 and 64 × 64, for different training sets, and for different numbers of inputs. Here we present selected results for 8 × 8 and 16 × 16 neurons for two example 2-D data sets. The results for 2-D sets have been selected for better illustration [8], [21], [24], [26]. In both sets, the data are divided into P classes (centers), where P equals the number of neurons in the map. Each center is represented by an equal number of learning patterns. In the first case, the centers are placed regularly in the input data space, while in the second case the centers, as well as the patterns X around them, are randomly distributed. To achieve comparable results in the case of the regular data, the input space is fitted to the input data, i.e., for 8 × 8 neurons the inputs are in the range of zero to one, while for 16 × 16 neurons they are in the range of zero to two. As a result, for both map sizes the optimal value of Qerr = 16.2e-3, while the optimal values of the remaining parameters (PDN/ET1/ET2/ET3) equal (0/0/1/0), respectively. The optimal nonzero value of Qerr results from the arrangement of the data: the regular arrangement allows for an ideal placement of neurons over the input data, assuming the training process is optimal. As a result, this scenario facilitates the comparison of the results for various combinations of the parameters [5]. In the case of the random data, the input signals are in the range [−1, 1], independently of the number of neurons. As a result, in larger maps individual neurons are closer to each other and the optimal value of Qerr becomes smaller. The values of Qerr in Fig. 11 are shown for all NFs, while the numbers for the remaining four criteria are reported for the RNF only.

Fig. 11. Quantization error as a function of Rmax for particular NFs, for (a)-(d) Rect4, (e)-(h) Rect8, and (i)-(l) Hex topologies. (a) Map with 8 × 8 neurons and 2-D data regularly distributed. (b) Map with 8 × 8 neurons and 2-D data randomly distributed. (c) Map with 16 × 16 neurons and 2-D data regularly distributed. (d) Map with 16 × 16 neurons and 2-D data randomly distributed. (e)-(h) As in (a)-(d). (i)-(l) As in (a)-(d).

Fig. 11 shows that the optimal values of Rmax are usually smaller than four or eight. This means that, especially in the case of large maps, the neighborhood range at the beginning of the learning process covers only 2-10% of the area of the map, and as such it can be programmed using only two or three bits. This effect is additionally shown in Fig. 12, which presents the value of Qerr as a function of the iteration for selected single simulations. The results are shown for an example map with 20 × 20 neurons, the RNF, and the Rect4 topology, but this effect was commonly observed in other cases.

Fig. 12. Qerr for an example map with 20 × 20 neurons, Rect4 topology, for (a) Rmax = 38, (b) Rmax = 5, i.e., 1/8 of the map size, and (c) Rmax = 2.

In Fig. 12(a), for Rmax = 38, Qerr starts decreasing only around the 700th iteration, i.e., for R = 11, which is about 1/4 of the map size. The best results have been achieved for smaller values of Rmax, between three and five, as shown in Fig. 12(b). If Rmax is too small or equal to zero, as in the WTA algorithm, the learning process is not effective, as shown in Fig. 12(c). This conclusion is important: in comparison with the case in which the neighborhood covers the entire map, it reduces the number of transistors in the R_PROP circuit by as much as 40-60%, while the number of connecting lines between neurons is reduced by half. This also significantly reduces the energy consumed in the learning process of the SOM. This issue is discussed in more detail in the next section.

Fig. 13 shows the input data and the final placement of neurons in the input space for selected cases from Fig. 11. The comparison of the results reveals that even small differences in the values of particular criteria (8)-(12) impact the learning quality. For example, in the case shown in Fig. 13(e), Qerr and PDN have smaller values than in (f), but the ET1/ET2/ET3 parameters are significantly worse, which is visible in the arrangement of the map. The results in Fig. 11 show that it is difficult to identify one of the topologies as optimal for all cases, and therefore the programmable circuit implemented on a single chip arises as the most reasonable solution. It is also not possible to clearly identify one NF as the most efficient in all studied cases. Nevertheless, for the majority of the tested data sets and map sizes, the optimal results were found for the RNF. This feature is important, as it shows that the TNF block can be switched off in the majority of cases, thus reducing the energy consumption of the neighborhood mechanism by 80% and increasing the speed of this block by 30% in those cases.


Fig. 13. Quality (Qerr/PDN/ET1/ET2/ET3) of the learning process reported for the following cases: (a) Rect8/RNF/REG/Rmax = 1 (16.2/0/0/1/0). (b) Rect8/RNF/REG/Rmax = 7 (25.1/1.56/0/0.947/0.48). (c) Rect8/TNF/REG/Rmax = 3 (17.29/0.391/0/0.989/0.285). (d) Rect8/RNF/REG/Rmax = 13 (24.0/2.73/0.066/0.807/5.55). (e) Rect8/RNF/RAND/Rmax = 1 (18.3/28.9/0.058/0.738/14.3). (f) Rect8/RNF/RAND/Rmax = 11 (20.5/33.2/0.006/0.841/3.61). (g) Rect4/RNF/RAND/Rmax = 1 (15.81/25/0.212/0.587/11.69). (h) Rect4/RNF/RAND/Rmax = 3 (14.94/24.61/0.15/0.698/3.50).

V. HARDWARE REALIZATION OF THE PROPOSED CIRCUIT AND PERFORMANCE ANALYSIS

To evaluate the main hardware parameters of the proposed circuit, the programmable neighborhood mechanism for an example map with 8 × 8 neurons has been designed in the CMOS 0.18 μm technology. The circuit complexity strongly depends upon the values of particular parameters, as discussed in the previous section. For example, if only the RNF is used, the proposed DD circuit without the NF blocks is sufficient. In this case, the total number of transistors in the neighborhood mechanism equals 12 500 or 18 500 for two or three bits in the RPROG signal, respectively. The TNF block that can be connected to the DD circuit enlarges the number of transistors by 1300 per neuron. As a result, for the map with 8 × 8 neurons the total number of transistors is about 100 000 (64 × 1300 ≈ 83 000 transistors in the NF blocks alone). This makes the overall mechanism up to eight times more complex than in the case of the RNF, but it is still much less complex than in the case when the GNF block is used.

Fig. 14. Performance of the proposed circuit presented over time.

The ability to use small values of Rmax for different map sizes enables the use of a constant number of connection lines between neurons, as shown in Fig. 4. This simplifies the design of the chip, as the neuron can be laid out as a fixed cell and simply duplicated across the chip.

The proposed circuit is a digital feed-forward solution, and thus it offers stable performance under different external conditions and for different transistor models. In such a case, the circuit can be reliably verified by means of so-called corner analysis, i.e., a series of transistor-level simulations performed for extreme values of the process, supply voltage, and temperature (PVT) parameters. The system has been verified for the slow, fast, and typical (SS, FF, TT) transistor models, for supply voltages varying between 0.6 and 1.8 V, and for temperatures varying in the range of −40 to +120 °C. Corner analysis is a typical procedure used to verify commercial chips.

To evaluate the performance of the circuit in the worst-case scenario, i.e., for the longest possible path of the EN signal, we activated the upper-left corner neuron (1, 1) by setting its WSC11 signal to 1. In this case, the maximum distance in the map equals 7 or 15 for the Rect8 and Rect4 topology, respectively. Illustrative time-domain simulations for the Rect4 topology and the RNF are shown in Fig. 14. In the top diagram, the ENout signals in the first column of the map are activated one after another, from top to bottom. Once the EN81 signal becomes logical 1, the particular ENout signals in the bottom row are activated from left to right, as shown in the middle diagram. This transition scheme is in good agreement with Fig. 2(a). The delay of the DD circuit equals 11 ns. The bottom diagram shows the supply current IDD. For an average value of this current of 1.75 mA and a supply voltage of 1.8 V, the energy consumed during this period equals 35 pJ for the entire map, i.e., 0.54 pJ per single neuron. Note that the maximum current flow appears in the middle of this period, i.e., when eight neurons (1, 8), (2, 7), (3, 6), . . . , (8, 1) are being switched over at the same time. After this period, i.e., after the distances to all neighboring neurons have been determined, the mechanism enters the standby mode and dissipates only a negligible amount of power. This is a visible advantage over Peiris's solution [9], which dissipates power also during the following adaptation phase.
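The propagation scheme just described can be mimicked in software by a breadth-first wavefront. The sketch below is a purely behavioral model that assumes one identical delay per hop; it is not a transistor-level description of the DD circuit:

    from collections import deque

    def en_wavefront(rows, cols, winner, topology="rect4"):
        # The EN signal spreads from the winner one hop per gate delay;
        # the arrival step of each neuron equals its topological
        # distance from the winner in the selected map topology.
        if topology == "rect4":
            hops = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        else:  # rect8
            hops = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0)]
        dist = {winner: 0}
        queue = deque([winner])
        while queue:
            r, c = queue.popleft()
            for dr, dc in hops:
                nb = (r + dr, c + dc)
                if 0 <= nb[0] < rows and 0 <= nb[1] < cols and nb not in dist:
                    dist[nb] = dist[(r, c)] + 1
                    queue.append(nb)
        return dist

    # Winner in the upper-left corner (the worst case discussed above):
    d = en_wavefront(8, 8, (0, 0), "rect4")
    print([d[(7, c)] for c in range(8)])  # [7, 8, ..., 14]: arrival steps
    # of the bottom row, mirroring the EN81...EN88 sequence in Fig. 14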

VDD_TT = 1.8[V] VDD_TT = 1.3[V] VDD_TT = 1.0[V] VDD_TT = 0.8[V] VDD_SS = 1.8V] VDD_SS = 0.8[V] VDD_FF = 1.8[V] VDD_FF = 0.8[V]

1 0

2

4

6 R 8

10

12

14

16

Fig. 15. Energy consumption per single input pattern and delay between the two extreme neurons in the map versus the neighborhood range R, for the Rect4 topology, for different supply voltages and transistor models, at 20 °C.

The comparative results for different PVT parameters, regarded as a function of R, are shown in Fig. 15 for 20 °C. For other values of temperature the system operates properly, while the results differ by only ±10%, and therefore they are not presented. Note that the worst values of both the energy consumption and the delay occur only at the early stage of the learning process, while for smaller values of R they are significantly reduced. The results are presented for the RNF. In the case of the TNF, in which the NF blocks are required, the energy consumed in the worst case equals 208 pJ per single input pattern, i.e., 3.25 pJ per single neuron.

Comparing the results for the same supply voltage, VDD, and different transistor models, we note that the energy consumption is almost equal for the particular models, while the delay times differ significantly. This effect results from the different power dissipation in these particular cases. For slow transistor models the switching process takes more time, especially for small values of VDD, but during this period the average power dissipation is proportionally smaller. Significant power is drawn in CMOS logic gates while the devices are switching between the two logic states, as the on-resistances of their output channels then reach their minimum values. To reduce the power dissipation, the switching process should therefore be fast. In the case of the slow transistor model this time is longer, but this is compensated by the larger threshold voltage, VTH, which for a given voltage VDD increases the on-resistance of the transistors. For a given transistor model, depending on the neighborhood range, the delay time t varies between 3 and 95 ns for the Rect4 topology and between 3 and 60 ns for the Rect8 topology. The Rect4 topology is less effective from this point of view, but since the majority of the learning process is performed for small values of R, this problem is not significant.
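All the energy figures in this section follow from simple bookkeeping: the average supply current times the supply voltage times the active period (conveniently, mA · V · ns = pJ). A minimal sketch reproducing the quoted numbers:

    def energy_pj(i_avg_ma, vdd_v, t_ns):
        # Supply energy over the active period; mA * V * ns = pJ.
        return i_avg_ma * vdd_v * t_ns

    e_map = energy_pj(1.75, 1.8, 11)   # ~34.7 pJ for the whole map (RNF)
    print(e_map, e_map / 64)           # ~0.54 pJ per neuron
    print(208 / 64)                    # TNF worst case: 3.25 pJ per neuron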


A. Computational Power of the Proposed SOM

An interesting aspect is the assessment of the computational power of the overall network. In the proposed SOM, most of the operations are performed in parallel, regardless of the number of neurons, but some operations have to be serialized, especially in the case of a fully digital realization. In this case, for a multi-bit representation of the input signals, these signals cannot be provided in parallel due to the limited number of connecting lines. For this reason, we propose a solution in which the neighborhood mechanism works in two modes. In the first mode, it works as described in Section III; the connections between particular pairs of neurons are then separated from the others. In the second mode, all the lines are disconnected from the R_PROP and the EN_PROP blocks and are shorted together, enabling a parallel re-programming of all neurons. In this mode, the particular input signals x_i from a given pattern X(l) are delivered to all neurons at the same time. After a given i-th signal has been delivered, where i = 0, . . . , n, all neurons perform one of the operations ϕ_L1(i) = ϕ_L1(i−1) + |x_i − w_{i,j}| or ϕ_L2(i) = ϕ_L2(i−1) + (x_i − w_{i,j})², depending on which distance measure (L1 or L2) is being used. For i = n, the ϕ_L1 or ϕ_L2 signals become the distance measures between the particular neurons and X(l). The software-model simulations show that the learning quality is rather independent of the distance measure, and therefore the L1 measure is preferred in our realization, as it does not require the squaring operation. After the distance measures have been determined for all neurons, the same lines are used, in the same mode, by the WSC block to determine the winning neuron. We developed a new bitwise WSC block that compares the signals bit by bit, independently of the number of neurons in the map. The sampling frequency at the input of the proposed SOM can be calculated as follows:

f_S = 1/(T_DW · n + T_WSC + T_DD + T_NF + T_AD · n)        (13)

where T_DW is the time required to provide a single input signal and to update the ϕ_L1 or ϕ_L2 signals that, after all components x of X(l) have been provided, become the input signals to the WSC block, T_WSC is the time needed to determine the winning neuron, T_DD is the time required to determine the distances to all neighboring neurons, and T_NF is the time the NF block requires to calculate the η · G() term in (1). The adaptation of the particular weights is performed serially in the particular neurons. It could be performed in parallel, but this would increase the complexity of the circuit, as each weight would require its own multi-bit multiplier. In the case of the serial approach, the time needed to realize the adaptation process is given as T_AD · n. The simulations performed at the transistor level show that a data rate of 10 MS/s is achievable for the 8 × 8 map with three inputs. The computational power can be estimated to be larger than 25 · 10⁹ operations/s in this case, and 400 · 10⁹ operations/s for the map with 32 × 32 neurons. An important parameter is the power dissipation. The simulations of the overall SOM show that a single neuron consumes 25–30 pJ per single pattern X(l).
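A behavioral sketch of this serial mode is given below. The 8-bit quantization of the distance codes and the block delays substituted into (13) are illustrative assumptions, not measured values:

    import numpy as np

    def l1_accumulate(x, W):
        # Serial L1 mode: each component x_i is broadcast on the shared
        # lines once; every neuron j then updates, in parallel,
        # phi_L1(i) = phi_L1(i-1) + |x_i - w_ij|.
        phi = np.zeros(W.shape[0])
        for i, xi in enumerate(x):       # n serial bus transfers
            phi += np.abs(xi - W[:, i])  # all neurons update at once
        return phi

    def bitwise_wsc(codes):
        # Bitwise winner search: starting from the MSB, every neuron
        # whose current bit is 1 withdraws whenever at least one still
        # active neuron shows a 0, so the minimum-distance neuron is
        # found in as many steps as there are bits, independently of
        # the number of neurons.
        codes = np.asarray(codes)
        alive = np.ones(len(codes), dtype=bool)
        for b in range(int(codes.max()).bit_length() - 1, -1, -1):
            zero_here = alive & (((codes >> b) & 1) == 0)
            if zero_here.any():
                alive = zero_here
        return int(np.flatnonzero(alive)[0])

    W = np.random.rand(64, 3)            # 8 x 8 map, three inputs
    x = np.random.rand(3)
    # Scale to 8-bit codes (phi <= n for inputs and weights in [0, 1]):
    codes = np.round(255 * l1_accumulate(x, W) / len(x)).astype(int)
    winner = bitwise_wsc(codes)

    # Sampling rate from (13) with assumed block delays in ns:
    T_DW, T_WSC, T_DD, T_NF, T_AD, n = 10, 20, 11, 5, 10, 3
    f_S = 1e9 / (T_DW * n + T_WSC + T_DD + T_NF + T_AD * n)  # ~10 MS/s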


Fig. 16. Estimated power dissipation regarded as a function of the number of neurons and the sampling frequency at the input (the worst case).

Fig. 16 shows the estimated power dissipation in the worst-case scenario, i.e., for the neighborhood range R covering the entire map and all neurons being adapted with the maximum value of η. In practice, since the values of R are usually small, the power dissipation will be up to five times smaller. For the map with 8 × 8 neurons with the TNF in use, the average power dissipation equals 19 mW at the maximum data rate, while at 1 kHz it equals only 1.9 μW. For the map with 32 × 32 neurons, the power dissipation equals about 300 mW at the maximum data rate. Note that these results have been obtained for the CMOS 0.18 μm process; for the latest technologies below 65 nm, we expect a further improvement of the results.
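The scaling behind these estimates is linear in both the map size and the sampling frequency. A first-order sketch that assumes the 30 pJ per-neuron figure quoted above and neglects leakage:

    def som_power_w(n_neurons, f_s_hz, e_neuron_pj=30):
        # Worst-case power: every neuron spends e_neuron_pj per input
        # pattern, so P = N * E * f_s (leakage neglected).
        return n_neurons * e_neuron_pj * 1e-12 * f_s_hz

    print(som_power_w(64, 10e6))    # ~0.019 W   -> the 19 mW figure
    print(som_power_w(64, 1e3))     # ~1.9e-06 W -> the 1.9 uW figure
    print(som_power_w(1024, 10e6))  # ~0.31 W    -> 32 x 32 map, ~300 mW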


VI. CONCLUSION

A new programmable ultralow power neighborhood mechanism has been proposed for the Kohonen SOM realized in the CMOS technology. The proposed circuit is a flexible solution that allows for an easy realization of three different map topologies on a single chip. As all neurons in the SOM operate in parallel, the achieved data rate is even three orders of magnitude larger than in the case of a similar network realized on a PC. Extensive transistor-level simulations, which for such circuits provide reliable results, show that the SOM with 64 neurons realized in the CMOS 0.18 μm process achieves a data rate of 10 MS/s. For comparison, a single software-model test, during which the SOM processes two million input patterns, takes about 20 minutes, i.e., the data rate equals 1.66 kS/s. One of the main advantages is a very low power dissipation and the matching of the power dissipation to the data rate. A prospective application of the NN is the on-line classification of ECG signals performed in a low power WBAN, where low energy consumption is one of the paramount features. It has been shown in [27], [28] that, depending on the type of the NN, a number of neurons between 30 and 130 is sufficient to perform the classification of the ECG complexes. The sampling frequency required in this case does not exceed 1 kHz [29]. The proposed SOM with 130 neurons realized in the CMOS 0.18 μm technology will in this case occupy an area of ca. 5 mm², dissipating about 10 μW at f_S = 1 kHz. At the full data rate, the SOM with 64 neurons consumes about 20 mW.

To verify the common opinion that the neighborhood range at the beginning of the training should cover at least half of the map, we performed more than six thousand simulations for different network parameters and different map sizes. Our key conclusion is that the neighborhood range can in most cases be very small. This conclusion is important for hardware implementations, as it allows the chip area and the power dissipation to be reduced even by 60%. The chip area and the power dissipation strongly depend on the NF and on the initial value of the neighborhood range, Rmax. For example, in the case of the TNF the neighborhood mechanism occupies even eight times larger area than in the case of the RNF, while dissipating six times more power. Fortunately, the simulations carried out by means of the software model show that the RNF is sufficient in many cases.

REFERENCES

[1] I. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “Wireless sensor networks: A survey,” Comput. Netw., vol. 38, no. 4, pp. 393–422, Mar. 2002.
[2] A. Bereketli and O. Akan, “Communication coverage in wireless passive sensor networks,” IEEE Commun. Lett., vol. 13, no. 2, pp. 133–135, Feb. 2009.
[3] T. Kohonen, Self-Organizing Maps (Information Sciences), 3rd ed. New York: Springer-Verlag, 2001.
[4] I. Mokriš and R. Forgáč, “Decreasing the feature space dimension by Kohonen self-organizing maps,” in Proc. 2nd Slovakian–Hungarian Joint Symp. Appl. Mach. Intell., Herľany, Slovakia, 2004, pp. 153–164.
[5] F. Li, C.-H. Chang, and L. Siek, “A compact current mode neuron circuit with Gaussian taper learning capability,” in Proc. IEEE Int. Symp. Circuits Syst., Taipei, Taiwan, May 2009, pp. 2129–2132.
[6] D. Masmoudi, A. Dieng, and M. Masmoudi, “A subthreshold mode programmable implementation of the Gaussian function for RBF neural networks applications,” in Proc. IEEE Int. Symp. Intell. Control, Feb. 2002, pp. 454–459.
[7] R. Długosz, M. Kolasa, and W. Pedrycz, “Programmable triangular neighborhood functions of Kohonen self-organizing maps realized in CMOS technology,” in Proc. 18th Eur. Symp. Artif. Neural Netw., Comput. Intell. Mach. Learn., Bruges, Belgium, Apr. 2010, pp. 529–534.
[8] J. Lee and M. Verleysen, “Self-organizing maps with recursive neighborhood adaptation,” Neural Netw., vol. 15, nos. 8–9, pp. 993–1003, Oct. 2002.
[9] V. Peiris, “Mixed analog digital VLSI implementation of a Kohonen neural network,” Ph.D. dissertation, Dépt. Électr., École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2004.
[10] J. Croon, M. Rosmeulen, S. Decoutere, W. Sansen, and H. Maes, “An easy-to-use mismatch model for the MOS transistor,” IEEE J. Solid-State Circuits, vol. 37, no. 8, pp. 1056–1064, Aug. 2002.
[11] J. Madrenas, M. Verleysen, P. Thissen, and J. Voz, “A CMOS analog circuit for Gaussian functions,” IEEE Trans. Circuits Syst. II, vol. 43, no. 1, pp. 70–74, Jan. 1996.
[12] D. Macq, M. Verleysen, P. Jespers, and J.-D. Legat, “Analog implementation of a Kohonen map with on-chip learning,” IEEE Trans. Neural Netw., vol. 4, no. 3, pp. 456–461, May 1993.
[13] A. Savich, M. Moussa, and S. Areibi, “The impact of arithmetic representation on implementing MLP-BP on FPGAs: A study,” IEEE Trans. Neural Netw., vol. 18, no. 1, pp. 240–252, Jan. 2007.
[14] R. Długosz, T. Talaska, W. Pedrycz, and R. Wojtyna, “Realization of the conscience mechanism in CMOS implementation of winner-takes-all self-organizing neural networks,” IEEE Trans. Neural Netw., vol. 21, no. 6, pp. 961–971, Jun. 2010.
[15] K. B. Khalifa, B. Girau, F. Alexandre, and M. Bedoui, “Parallel FPGA implementation of self-organizing maps,” in Proc. 16th Int. Conf. Microelectron., Dec. 2004, pp. 709–712.
[16] W. Kurdthongmee, “A novel hardware-oriented Kohonen SOM image compression algorithm and its FPGA implementation,” J. Syst. Archit., vol. 54, no. 10, pp. 983–994, Oct. 2008.
[17] C.-Y. Wu and W.-K. Kuo, “A new analog implementation of the Kohonen neural network,” in Proc. Int. Symp. VLSI Technol., Syst., Appl., Taipei, Taiwan, May 1993, pp. 262–266.
[18] R. Długosz and M. Kolasa, “CMOS programmable asynchronous neighborhood mechanism for WTM Kohonen neural network,” in Proc. 15th Int. Conf. Mixed Design Integr. Circuits Syst., Poznań, Poland, Jun. 2008, pp. 197–201.
[19] A. Rajah and M. K. Hani, “ASIC design of a Kohonen neural network microchip,” in Proc. IEEE Int. Conf. Semicond. Electron., Dec. 2004, pp. 1–4.
[20] D. DeSieno, “Adding a conscience to competitive learning,” in Proc. IEEE Int. Conf. Neural Netw., vol. 1. San Diego, CA, Jul. 1988, pp. 117–124.
[21] E. Uriarte and F. Martin, “Topology preservation in SOM,” Int. J. Appl. Math. Comput. Sci., vol. 1, no. 1, pp. 19–22, 2005.
[22] F. Bação, V. Lobo, and M. Painho, “The self-organizing map, the GeoSOM, and relevant variants for geosciences,” Comput. Geosci., vol. 31, no. 2, pp. 155–163, Mar. 2005.
[23] D. Beaton, I. Valova, and D. MacLean, “CQoCO: A measure for comparative quality of coverage and organization for self-organizing maps,” Neurocomputing, vol. 73, nos. 10–12, pp. 2147–2159, Jun. 2010.
[24] J. A. Lee, N. Donckers, and M. Verleysen, “Recursive learning rules for SOMs,” in Proc. Workshop Self-Organiz. Maps, Jun. 2001, pp. 67–72.
[25] M. Sheikhan, V. T. Vakili, and S. Garoucy, “Codebook search in LD-CELP speech coding algorithm based on multi-SOM structure,” World Appl. Sci. J., vol. 7, pp. 59–68, 2009.
[26] M.-C. Su, H.-T. Chang, and C.-H. Chou, “A novel measure for quantifying the topology preservation of self-organizing feature maps,” Neural Process. Lett., vol. 15, no. 2, pp. 137–145, 2002.
[27] O. Inan, L. Giovangrandi, and G. Kovacs, “Robust neural-network-based classification of premature ventricular contractions using wavelet transform and timing interval features,” IEEE Trans. Biomed. Eng., vol. 53, no. 12, pp. 2507–2515, Dec. 2006.
[28] L. He, W. Hou, X. Zhen, and C. Peng, “Recognition of ECG patterns using artificial neural network,” in Proc. 6th Int. Conf. Intell. Syst. Design Appl., vol. 2. Jinan, China, Oct. 2006, pp. 477–481.
[29] J. Segura-Juarez, D. Cuesta-Frau, L. Samblas-Pena, and M. Aboy, “A microcontroller-based portable electrocardiograph recorder,” IEEE Trans. Biomed. Eng., vol. 51, no. 9, pp. 1686–1690, Sep. 2004.

Rafał Długosz received the M.Sc. degree in control and robotics and the Ph.D. degree in telecommunications (with distinctions) from the Poznań University of Technology, Poznań, Poland, in 1996 and 2004, respectively. He is currently with the University of Technology and Life Sciences, Bydgoszcz, Poland. He was with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada, within the framework of his fellowships from 2005 to 2008. He then joined the Electronics and Signal Processing Laboratory, Institute of Microtechnology, Swiss Federal Institute of Technology, Lausanne, Switzerland. He has published over 110 research papers and book chapters. His current research interests include ultralow power reconfigurable analog and analog-digital circuits, analog filters, analog-to-digital converters, artificial neural networks, and others. Dr. Długosz was a fellow of the Foundation for Polish Science. He also received the Marie Curie International Outgoing Fellowship under the EU 6th Framework Program.

Marta Kolasa received the M.Sc. degree in telecommunication from the Institute of Telecommunication, University of Technology and Life Sciences, Bydgoszcz, Poland, in 2005. She is currently an Assistant with the University of Technology and Life Sciences. She is a co-author of more than 20 research papers. Her current research interests include energy-efficient analog-digital integrated circuits, artificial neural networks and their hardware implementation, especially analog-digital application-specific integrated circuit self-organizing neural networks.


Witold Pedrycz (M’88–SM’90–F’99) received the M.Sc., Ph.D., and D.Sci. degrees from the Silesian University of Technology, Gliwice, Poland, in 1977, 1980, and 1984, respectively. He is a Professor and Canada Research Chair in computational intelligence with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. He is with the Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland. He is an author of 12 research monographs. He has published numerous papers in his areas of expertise. His current research interests include computational intelligence, fuzzy modeling, knowledge discovery and data mining, fuzzy control including fuzzy controllers, pattern recognition, knowledge-based neural networks, granular and relational computing, and software engineering. Dr. Pedrycz has been a member of numerous program committees of IEEE conferences in the area of fuzzy sets and neurocomputing. He serves as the Editor-in-Chief of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART A: SYSTEMS AND HUMANS and an Associate Editor of the IEEE TRANSACTIONS ON FUZZY SYSTEMS. He is also the Editor-in-Chief of Information Sciences. He is a recipient of the prestigious Norbert Wiener Award from the IEEE Systems, Man, and Cybernetics Society and an IEEE Canada Silver Medal in Computer Engineering.

Michał Szulc received the M.Sc. degree in robotics from the Poznań University of Technology, Poznań, Poland, in 2002. He is currently with the Chair of Computer Engineering, Poznań University of Technology. He has participated in two research grants sponsored by the Polish government. He has published over 30 research papers and book chapters. His current research interests include field-programmable gate array-based implementations of biologically realistic models of neural networks as well as artificial neural networks, and the engineering of biomedical systems based on human gait analysis and biomedical image processing.