ADVANCES IN NEURAL NETWORKS Igor Aizenberg

These Class Notes were prepared under the support of the Tempus Project JEP-16160-2001

CONTENTS

PREFACE
I. INTRODUCTION. A BRIEF HISTORICAL OBSERVATION
II. MATHEMATICAL MODEL OF A NEURON
III. A NEURON WITH THE THRESHOLD ACTIVATION FUNCTION
IV. LEARNING PROCESS AND LEARNING OF THE THRESHOLD NEURON
V. IMPLEMENTATION OF THE NON-THRESHOLD BOOLEAN FUNCTIONS USING FEEDFORWARD NEURAL NETWORK FROM THE THRESHOLD NEURONS
VI. IMPLEMENTATION OF THE NON-THRESHOLD BOOLEAN FUNCTIONS USING THE UNIVERSAL BINARY NEURON
VII. P-REALIZABLE BOOLEAN FUNCTIONS AND UNIVERSAL BINARY NEURON
VIII. MULTIPLE-VALUED THRESHOLD FUNCTIONS AND MULTI-VALUED NEURON
IX. LEARNING ALGORITHMS FOR MULTI-VALUED AND UNIVERSAL BINARY NEURONS
X. A CLASSICAL MULTILAYER FEEDFORWARD NEURAL NETWORK
XI. A MULTILAYER FEEDFORWARD NEURAL NETWORK BASED ON MULTI-VALUED NEURONS (MLMVN)
XII. APPLICATION OF A MULTILAYER FEEDFORWARD NEURAL NETWORK BASED ON MULTI-VALUED NEURONS (MLMVN) IN TIME SERIES PREDICTION
XIII. CELLULAR NEURAL NETWORKS: BASIC PRINCIPLES
XIV. CELLULAR NEURAL NETWORKS: APPLICATION IN IMAGE PROCESSING
REFERENCES


ADVANCES IN NEURAL NETWORKS

PREFACE

Artificial neural networks, or simply neural networks, represent an emerging technology rooted in many disciplines. This popular and important area of science and technology has been developing extensively in recent years. Neural networks are endowed with some unique attributes, such as the ability to learn from and adapt to their environment and the ability to approximate very complicated mappings.

We will consider here some basic principles of neural networks and several novel ideas related to them. After a brief historical observation we will consider what a neuron is and what a neural network is. We will observe the basic ideas of threshold logic, and thus the threshold neuron (the classical perceptron) will be considered. It will be observed that historically the first neural networks were developed for the implementation of the non-threshold Boolean functions. We will consider the basic principles of the learning process and of error-correction learning in particular. A classical network - the feedforward neural network - and its backpropagation learning will be presented.

Then we will move to several novel and advanced solutions in neural networks, which are the main subject of this course. We will consider two types of complex-valued neurons whose activation functions are functions of the argument of the weighted sum. These are the universal binary neuron, which can implement non-threshold Boolean functions, and the multi-valued neuron, which can implement those mappings that are described by the threshold functions of multiple-valued logic and by continuous multiple-valued functions. The learning process for both neurons will be considered. Special attention will be paid to the implementation of nonlinearly separable Boolean functions, including the XOR and parity-n functions, on the single universal binary neuron. We will also consider a feedforward neural network based on multi-valued neurons and its backpropagation learning, which does not require a derivative of the activation function. Cellular neural networks, their architecture, their different types, and their applications in image processing will be studied in the concluding part of the course.

The author is deeply indebted to Professor Claudio Moraga, who has given freely of his time to read through the Class Notes, for his very useful and helpful comments.

I. INTRODUCTION. A BRIEF HISTORICAL OBSERVATION

Work on artificial neural networks, commonly referred to simply as "neural networks", has been motivated right from its inception by the recognition that the brain computes in an entirely different way from the conventional digital computer. The strong interest in understanding the brain owes much to the pioneering work of Ramon y Cajal (1911) [1], who introduced the idea of neurons as structural constituents of the brain. Typically, neurons are five to six orders of magnitude slower than silicon logic gates. Events in a silicon chip happen in the nanosecond ($10^{-9}$ s) range, whereas neural events happen in the millisecond ($10^{-3}$ s) range [2]. However, the brain makes up for the relatively slow rate of operation of a neuron by having a truly staggering number of neurons (nerve cells) with massive interconnections among them. It is estimated that there must be on the order of 10 billion neurons in the human cortex and 60 trillion synapses (connections) [3]. The net result is that the brain is an enormously efficient structure. Specifically, the energetic efficiency of the brain is approximately $10^{-16}$ joules per operation per second [2], whereas the corresponding figure for the best modern computers in use today is about $10^{-6}$ joules per operation per second [4].

Today we can conclude that the brain is a highly complex, nonlinear, and parallel computer (information-processing system) [2]. It has the capability of organizing neurons so as to perform certain computations (e.g., pattern recognition, perception, and motor control) incomparably faster than the fastest digital computer in existence today. Consider, for example, human vision, which is a complicated information-processing task. It is the function of the visual system to provide a real-time representation of the environment around us and to supply the information we need to interact with the environment. To be more specific, the brain routinely accomplishes perceptual recognition tasks (e.g., recognizing a familiar face embedded in an unfamiliar scene) in something of the order of 100-200 ms, whereas tasks of much lesser complexity can take hours even on a huge conventional computer [2].

How does a human brain do it? At birth, a brain has great structure and the ability to build up its own rules through what we usually refer to as "experience". Indeed, experience is built up over the years, with the most dramatic development of the human brain taking place in the first two years from birth, but the development then continues well beyond this stage. During this early stage of development, about one million synapses are formed per second. Synapses are the elementary structural and functional units that mediate the interactions between neurons (Fig. 1). In the widely and traditionally used model of neural organization, it is assumed that a synapse is a simple connection that can impose excitation or inhibition. Plasticity permits the developing nervous system to adapt to its environment. In the brain, plasticity consists of two mechanisms: the creation of new synaptic connections between neurons, and the modification of existing synapses.

Fig. 1 Interconnections between neurons

Just as plasticity appears to be essential to the functioning of neurons as information-processing units in the human brain, so it is with neural networks made up of artificial neurons. In its most general form, a neural network is a machine that is designed to model the way in which the brain performs a particular task or function of interest. This definition is given by S. Haykin in [2]. The network is usually simulated in software on a conventional digital computer or implemented using electronic components. To achieve good performance, neural networks employ a massive interconnection of elementary cells (neurons). The following definition of a neural network, given by I. Aleksander and H. Morton in [5], is very successful and widely used:

Definition 1 A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use.

It means that: 1) knowledge is acquired by the network through a learning (training) process; 2) the strength of the interconnections between neurons is implemented by means of the synaptic weights used to store the knowledge. The learning process is a procedure of adapting the weights with a learning algorithm in order to capture the knowledge. More mathematically, the aim of the learning process is to map a given relation between the inputs and output (outputs) of the network.

It is clear from the definition and the above consideration that a neural network obtains its computing power, first, through its massively parallel distributed structure and, second, through its ability to learn and therefore to generalize. Generalization refers to the neural network producing reasonable outputs for inputs not encountered during training (learning) [2]. The following important properties of neural networks and neurons should be distinguished [2]:

1) Nonlinearity. A neuron is basically a nonlinear device. Consequently, a neural network, made up of an interconnection of neurons, is itself nonlinear. Moreover, the nonlinearity is of a special kind in the sense that it is distributed throughout the network. Nonlinearity is a highly important property, particularly if the underlying physical mechanism responsible for the generation of an input signal (e.g., a speech or video signal) is inherently nonlinear.

2) Input-Output Mapping. A popular paradigm of learning called supervised learning involves the modification of the synaptic weights of a neural network by applying a set of labeled training samples. Each example consists of a unique input signal and the corresponding desired response. The network is sequentially presented the examples, and the synaptic weights of the network are modified so as to minimize the difference between the desired response and the actual response of the network produced by the input signal. The training of the network is repeated for the selected examples in the set until the network reaches a steady state, where there are no further significant changes in the synaptic weights. Thus the network learns from the examples by constructing an input-output mapping for the problem at hand.

3) Adaptivity. Neural networks have a built-in capability to adapt their synaptic weights to changes in the surrounding environment. In particular, a neural network trained to operate in a specific environment can be easily retrained to deal with minor changes in the operating environmental conditions.

4) Evidential Response. In the context of pattern classification, a neural network can be designed to provide information not only about which particular pattern to select, but also about the confidence in the decision made. This latter information may be used to reject ambiguous patterns, should they arise, and thereby improve the classification performance of the network.

Usually one distinguishes feedforward and feedback neural networks. A typical example of a feedforward network is the multilayer feedforward neural network (MLF), often referred to as a multilayer perceptron (MLP) (see, e.g., [2]). The MLF consists of an input layer of neurons, one or more hidden layers, and an output layer. The outputs of the neurons of a previous layer are connected only with the inputs of the neurons of the following layer (feedforward connections). In a feedback network the signals from the outputs of the neurons may be transmitted to the inputs of the neurons of the same layer (a network fully connected in such a way is a recurrent network); the Hopfield network is an example [14]. Another example of a neural network with feedback connections is the cellular neural network, introduced by L. Chua and L. Yang in [17]. It is necessary to underline here the similarity with biological neural networks, where feedback connections are always present [2].

To implement a mapping between the inputs and outputs of a network, and, on the other hand, a mapping between the inputs and output of each neuron (in other words, to accumulate the knowledge), a learning process is used. Summarizing many definitions of learning, beginning from Hebb [7] and continuing with Minsky and Papert [12], Novikoff [9], Aleksander and Morton [5], and Haykin [2] (many others may also be added), we can define learning in the following way. Learning is a process by which the weighting parameters of a neural network are adapted through some process of corrections to implement a corresponding mapping between the inputs and output of the network (or between the inputs and output of a neuron). The type of learning is determined by the method in which the correction of the weights is organized. In supervised learning a training (learning) set of input/output data is given, and the interconnection structure of the neurons is known in advance. Learning in such a case consists in the iterative minimization of the error between the actual and desired outputs (with some given precision, or until the precise coincidence of the actual and desired output values) according to some learning rule for adapting the weights. So an "external teacher" is always present in supervised learning. In unsupervised learning the desired outputs of the network are not available. In reinforcement learning the weights implementing a corresponding input/output mapping are obtained through a process of trial and error. In the context of this course we will deal only with supervised learning.

We want to mention here the most important events that present a point-by-point historical evolution of the field of neural networks:

• 1943 – W. McCulloch and W. Pitts – the first nonlinear mathematical model of the neuron (a formal neuron) [6].
• 1948 – D. Hebb – the first learning rule: one can memorize an object by adapting weights [7].
• 1958 – F. Rosenblatt – the concept of the perceptron as a machine that can learn and classify patterns [8].
• 1963 – A.B.J. Novikoff – a significant development of learning theory: a proof of the theorem on the convergence of the learning algorithm applied to the solution of the pattern recognition problem using the perceptron [9].
• 1960s – the extensive development of threshold logic, initiated by the previous results in perceptron theory; a deep study of the properties of threshold Boolean functions, one of the most important objects considered in the theory of perceptrons and neural networks. The most complete summaries are given by M. Dertouzos [10] and S. Muroga [11].
• 1969 – M. Minsky & S. Papert – the potential limits of the perceptron as a computing system are shown [12].
• 1977 – T. Kohonen – consideration of the associative memory as a content-addressable memory that is able to learn [13].
• 1982 – J. Hopfield shows by means of energy functions that neural networks are capable of solving a large number of problems; revival of extensive research in the field [14].
• 1982 – T. Kohonen describes the self-organizing maps [15].
• 1986 – D.E. Rumelhart & J.L. McClelland – introduction of the feedforward neural network and learning with backpropagation; consideration of a neural network as a universal approximator.
• Present – more and more scientists and research centers are devoted to research in the field of neural networks and their applications in pattern recognition, classification, prediction, image processing, and others.

We will consider here some basic principles on which neurons and neural networks are based. Then we will observe some widely used neural architectures, such as feedforward and cellular neural networks. We will also pay special attention to recent solutions in neural networks that are based on the use of complex-valued neurons.

II. MATHEMATICAL MODEL OF A NEURON

A neuron is an information-processing unit that is fundamental to the operation of a neural network. A commonly used mathematical model of a neuron is shown in Fig. 2. We may distinguish the following three basic elements of the neuron model:

1) A neuron has a set of n synapses associated with its inputs. Each of them is characterized by a weight $w_i$, $i = 1, \ldots, n$. A signal $x_i$, $i = 1, \ldots, n$, at the ith input is multiplied (weighted) by the weight $w_i$.

2) The weighted input signals are summed. Thus a linear combination of the input signals $w_1 x_1 + \ldots + w_n x_n$ is obtained. A "free weight" (or bias) $w_0$, which does not correspond to any input, is added to this linear combination, and this forms the weighted sum

$$z = w_0 + w_1 x_1 + \ldots + w_n x_n.$$

3) A nonlinear activation function $\varphi$ is applied to the weighted sum. The value $y = \varphi(z)$ of the activation function is the neuron's output.

Fig. 2 Mathematical model of a neuron: the inputs $x_1, \ldots, x_n$ are weighted by $w_1, \ldots, w_n$ and summed together with the bias $w_0$ into $z = w_0 + \sum_i w_i x_i$; the output is $\varphi(z) = f(x_1, \ldots, x_n)$

If the input/output mapping is described by some function $f(x_1, \ldots, x_n)$, the following equality holds:

$$\varphi(z) = \varphi(w_0 + w_1 x_1 + \ldots + w_n x_n) = f(x_1, \ldots, x_n).$$

A nonlinear activation function limits the amplitude of the output of a neuron. Usually the interval of the outputs and inputs of a neuron is [0, 1] or [-1, 1]. Three types of activation functions can be recognized as classical ones and are the most widely used. They are the following functions.

1) Threshold Function.

$$\varphi(z) = \operatorname{sign}(z) = \begin{cases} 1, & \text{if } z \ge 0 \\ -1, & \text{if } z < 0. \end{cases}$$

(a) Threshold Activation Function

It is clear that the threshold activation function determines a binary output of a neuron: the output is equal to 1 if the weighted sum is non-negative, and it is equal to -1 if the weighted sum is negative.

2) Piecewise-Linear Function.

$$\varphi(z) = \begin{cases} 1, & \text{if } z \ge 1 \\ z, & \text{if } -1 < z < 1 \\ -1, & \text{if } z \le -1. \end{cases}$$

(b) Piecewise-Linear Activation Function

The piecewise-linear function determines a multiple-valued output of a neuron. Another function, which also determines a multiple-valued output of a neuron, and which is perhaps the most popular and widely used one, is the sigmoid function:

$$\varphi(z) = \frac{1}{1 + \exp(-az)} \quad (a \text{ is a slope parameter}).$$

(c) Sigmoid Activation Function

There are some other types of activation functions. For example, in applications of modeling and control the hyperbolic tangent function $\varphi(z) = (1 - \exp(-z))/(1 + \exp(-z))$ is normally used (it is an analog of the sigmoid function, but it takes values in the interval [-1, 1] instead of [0, 1]). We will study below the complex-valued activation functions that make possible a significant increase of the neuron's functionality.
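A minimal sketch in Python of the neuron model and the three classical activation functions described above (the function names are illustrative choices, not notation from the text):

```python
import math

def threshold(z):
    # threshold (sign) activation: 1 for a non-negative weighted sum, -1 otherwise
    return 1 if z >= 0 else -1

def piecewise_linear(z):
    # clips the weighted sum to the interval [-1, 1]
    return max(-1.0, min(1.0, z))

def sigmoid(z, a=1.0):
    # a is the slope parameter
    return 1.0 / (1.0 + math.exp(-a * z))

def neuron_output(weights, inputs, phi):
    # weights = (w0, w1, ..., wn); z = w0 + w1*x1 + ... + wn*xn
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return phi(z)

print(neuron_output((1, 1, 1), (1, -1), threshold))        # sign(1) = 1
print(neuron_output((0, 0.5, 0.5), (0.4, 0.2), sigmoid))   # about 0.57
```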

III. A NEURON WITH THE THRESHOLD ACTIVATION FUNCTION

As was mentioned above, the threshold activation function supposes that the inputs and outputs of the corresponding neuron and neural network are binary. This means that the input/output mapping of a neuron, and in general of a neural network consisting of such neurons, is described by a Boolean function. Moreover, if we consider only one neuron, such a Boolean function has to be threshold, or linearly separable.

Definition 2 The Boolean function $f(x_1, \ldots, x_n)$ is called a threshold (linearly separable) function if it is possible to find a real-valued weighting vector $W = (w_0, w_1, \ldots, w_n)$ such that the equation

$$f(x_1, \ldots, x_n) = \operatorname{sign}(w_0 + w_1 x_1 + \ldots + w_n x_n) \tag{1}$$

holds for all the values of the variables x from the domain of the function f.

Of course, this is one of the possible equivalent definitions, and it is possible to find many of them (see, e.g., [10], [11], etc.). The geometrical interpretation of the threshold functions is very simple: the weights define the equation of a hyperplane, which separates the "-1s" of the function from its "1s" (or the "0s" from the "1s", if the classical Boolean alphabet {0, 1} is used). We have to make a remark: here and below we will use the Boolean alphabet {1, -1}, which corresponds to the classical Boolean alphabet {0, 1} as follows:

$$0 \to 1;\ 1 \to -1; \qquad y \in \{0, 1\},\ x \in \{1, -1\} \Rightarrow x = 1 - 2y = (-1)^y.$$


Fig. 3 Disjunction $f(x_1, x_2) = (1, -1, -1, -1)$ is an example of a threshold (linearly separable) Boolean function: its "-1s" are separated from its "1" by a line

Fig. 4 XOR $f(x_1, x_2) = (1, -1, -1, 1)$ is an example of a non-threshold (not linearly separable) Boolean function: it is impossible to separate its "1s" from its "-1s" by any single line

Fig. 3 shows an example of a threshold function (the disjunction of two variables). Fig. 4 shows the typical and most popular example of a non-threshold function (XOR, i.e., mod 2 addition of two variables, or their multiplication in the alphabet {1, -1}). Table I contains the values of the variables and the corresponding values of both functions. It should be noted that the weight $w_0$ is often called a threshold. It is important to mention that any threshold function has an infinite number of different weighting vectors. It is clear from Fig. 3 that there is an infinite number of lines that can separate the "-1s" of the threshold function from its "1s".

Table I. Examples of a threshold and a non-threshold function

 x1    x2    x1 ∨ x2   XOR
  1     1       1        1
  1    -1      -1       -1
 -1     1      -1       -1
 -1    -1      -1        1

The number of all Boolean functions of n variables is equal to $2^{2^n}$, but the number of the threshold ones is substantially smaller. Indeed, for n = 2, fourteen of the sixteen functions (all except XOR and not-XOR) are threshold; for n = 3 there are 104 threshold functions out of 256; but for n > 3 the fraction of threshold functions tends to zero (T is the number of threshold functions of n variables):

$$\frac{T}{2^{2^n}} \xrightarrow[n \to \infty]{} 0.$$

For example, for n = 4 there are only about 2000 threshold functions out of 65536, as was shown by Muroga in 1971 [11]. The important and natural conclusion is that

Any threshold Boolean function may be implemented using a single neuron with the threshold activation function. Another natural conclusion is that 10

It is not possible to implement a non-threshold function using a neuron with the threshold activation function. A non-threshold function can be implemented only using a neural network built from neurons with the threshold activation function. One of the most popular types of such networks is the MLF (MLP); we will consider it below in more detail. So there is a very important observation that we have to make: the functionality of a neural element with the threshold activation function is limited to the threshold Boolean functions. To implement any non-threshold function using the threshold activation function, a network should be designed.

Definition 3 A neuron with the threshold activation function is called the Threshold Neuron (Threshold Element or Perceptron).
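The scarcity of threshold functions can be checked directly for n = 2: every weighting vector realizes some threshold function, so enumerating a small grid of integer weights and collecting the distinct truth tables they produce recovers exactly the fourteen linearly separable functions. A brute-force sketch in Python (the grid [-3, 3] is an assumption that happens to suffice for n = 2; it is not taken from the text):

```python
from itertools import product

# Domain of two Boolean variables in the {1, -1} alphabet, in Table I order.
inputs = [(1, 1), (1, -1), (-1, 1), (-1, -1)]

realized = set()
for w0, w1, w2 in product(range(-3, 4), repeat=3):
    # The truth table realized by sign(w0 + w1*x1 + w2*x2).
    table = tuple(1 if w0 + w1*x1 + w2*x2 >= 0 else -1 for x1, x2 in inputs)
    realized.add(table)

print(len(realized))               # 14 of the 16 functions of two variables
print((1, -1, -1, 1) in realized)  # False: XOR is not a threshold function
print((-1, 1, 1, -1) in realized)  # False: neither is not-XOR
```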

IV. LEARNING PROCESS AND LEARNING OF THE THRESHOLD NEURON

Among the many interesting properties of a neuron and of a neural network, the property of primary significance is the ability of the network to learn from its environment and to improve its performance through learning. A neuron (a neural network) learns about its environment through an iterative process of adjustments applied to its synaptic weights. Ideally, the network becomes more knowledgeable about its environment after each iteration of the learning process. There are different learning algorithms based on different learning rules. Let us consider in detail error-correction learning, which is one of the most popular and widely used learning rules.

To begin, let us consider for simplicity the learning process for a single neuron. Let $T(t)$ denote some desired response, or target response, for the neuron at time t. Let the corresponding value of the actual response of this neuron be denoted by $Y(t)$. This response is produced by the input (stimulus) vector $X(t) = (x_1, \ldots, x_n)$ applied to the inputs of the neuron at time t. The input vector $X(t)$ and the desired response $T(t)$ determine a particular example presented to the network at time t. It is assumed in general that this example and all other examples presented to the network (neuron) are generated by an environment that is probabilistic in nature, but the underlying probability distribution is unknown [2]. These examples form a learning set. If $Y(t) = T(t)$, which means that the actual response is equal to the desired response, no learning is required. However, typically the actual response is different from the desired one. Hence, it is possible to define an error, which is the difference between the desired response $T(t)$ and the actual response $Y(t)$:

$$\delta(t) = T(t) - Y(t). \tag{2}$$

The ultimate purpose of error-correction learning is to minimize a cost function based on the error signal $\delta(t)$, such that the actual response of each output neuron in the network approaches the target response for that neuron in some statistical sense [2]. The same is true for a single neuron. Once a cost function is selected, error-correction learning becomes strictly an optimization problem, to which the usual tools may be applied. A criterion commonly used for the cost function is the mean square error criterion, defined as the mean square value of the sum of squared errors for all the output neurons of the network. For a single neuron the criterion is the same: the mean square value of the sum of squared errors over all the examples of the learning set must be minimized. If the error must be minimized to zero, this is called the zero-error case. This case is natural, for example, for threshold neurons and networks built from them. According to the error-correction learning rule (or delta rule, as it is sometimes called), the adjustment $\Delta w_i(t)$ made to the synaptic weight $w_i$ is given by [2]:

$$\Delta w_i(t) = \alpha \delta(t) x_i(t), \tag{3}$$

where $\alpha$ is a constant that determines the learning rate. Let us consider how this works for the threshold neuron. Let $f(x_1, \ldots, x_n)$ be a threshold Boolean function to be implemented using the threshold neuron. If this function is fully defined, it takes exactly $2^n$ values, and therefore its learning set contains exactly $2^n$ input vectors $X_j = (x_1^j, \ldots, x_n^j),\ j = 1, \ldots, 2^n$, and the corresponding values of the function $T^j = f^j,\ j = 1, \ldots, 2^n$. Let us recall that since $f(x_1, \ldots, x_n)$ is a Boolean function, it and its variables take binary values from the set $\{1, -1\}$. Equation (2), which determines the error, is transformed as follows:

$$\delta^j = T^j - Y^j, \quad j = 1, \ldots, 2^n. \tag{4}$$

Respectively, equation (3) is transformed as follows:

$$\Delta w_i^j = \alpha^j \delta^j x_i^j = \alpha^j (T^j - Y^j) x_i^j; \quad j = 1, \ldots, 2^n;\ i = 1, \ldots, n. \tag{5}$$

Thus, according to (4) and (5) the correction of the weights is performed in the following way:

~ j = w j + ∆w j = w j + α j δ j x j = w j + α j (T j − Y j ) x j ; j = 1,...,2 n ; i = 1,..., n; w i i i i i i i . j j j j j j j n ~ wo = w0 + ∆w0 = w0 + α (T − Y ); j = 1,...,2 .

12

(6)

The learning process consists in the consecutive application of the learning rule (6) until there is no error for any element of the learning set: $\delta^j = 0,\ j = 1, \ldots, 2^n$. The learning process should start from random weights, for example, random numbers taken from the interval [0, 1]. An iteration of the learning process consists of a complete pass through all the elements of the learning set. The convergence of the learning algorithm based on the rule (6) has been proven many times (see, for example, [8]-[11] and [18]). Let us check how it works in practice.

Example 1. Let us implement the function $f(x_1, x_2) = x_1 \vee x_2$ (the disjunction of two variables) using the threshold neuron.

 #    x1    x2    x1 ∨ x2
 1)    1     1       1
 2)    1    -1      -1
 3)   -1     1      -1
 4)   -1    -1      -1

The table shows the function values and the entire learning set, containing 4 input vectors and the 4 corresponding values of the function. Let us start the learning process from the weighting vector $W^0 = (1, 1, 1)$.

Iteration 1.

1) Inputs (1, 1). The weighted sum is equal to $z = 1 + 1 \cdot 1 + 1 \cdot 1 = 3$; $\varphi(z) = \operatorname{sign}(3) = 1$. Since $f(1, 1) = 1$, no correction of the weights is needed.

2) Inputs (1, -1). The weighted sum is equal to $z = 1 + 1 \cdot 1 + 1 \cdot (-1) = 1$; $\varphi(z) = \operatorname{sign}(1) = 1$. Since $f(1, -1) = -1$, we have to correct the weights. According to (4), $\delta = -1 - 1 = -2$. Let $\alpha = 1$ in (5) and (6). (It should be mentioned that if we had chosen a very small value of $\alpha$, the obtained weighting vector $\tilde{W}$ might still be unable to produce the right output, and a new correcting iteration would be needed.) Then we correct the weights according to (5) and (6): $\tilde{w}_0 = 1 - 2 = -1$; $\tilde{w}_1 = 1 + (-2) \cdot 1 = -1$; $\tilde{w}_2 = 1 + (-2) \cdot (-1) = 3$. Thus $\tilde{W} = (-1, -1, 3)$. The weighted sum after the correction is equal to $z = -1 + (-1) \cdot 1 + 3 \cdot (-1) = -5$; $\operatorname{sign}(-5) = -1 = f(1, -1)$, so no further correction is needed.

3) Inputs (-1, 1). The weighted sum is equal to $z = -1 + (-1) \cdot (-1) + 3 \cdot 1 = 3$; $\varphi(z) = \operatorname{sign}(3) = 1$. Since $f(-1, 1) = -1$, we have to correct the weights. According to (4), $\delta = -1 - 1 = -2$; let $\alpha = 1$ in (5) and (6). Then $\tilde{w}_0 = -1 - 2 = -3$; $\tilde{w}_1 = -1 + (-2) \cdot (-1) = 1$; $\tilde{w}_2 = 3 + (-2) \cdot 1 = 1$. Thus $\tilde{W} = (-3, 1, 1)$. The weighted sum after the correction is equal to $z = -3 + 1 \cdot (-1) + 1 \cdot 1 = -3$; $\operatorname{sign}(-3) = -1 = f(-1, 1)$, so no further correction is needed.

4) Inputs (-1, -1). The weighted sum is equal to $z = -3 + 1 \cdot (-1) + 1 \cdot (-1) = -5$; $\operatorname{sign}(-5) = -1 = f(-1, -1)$, so no correction is needed.

Iteration 2.

1) Inputs (1, 1). The weighted sum is equal to $z = -3 + 1 \cdot 1 + 1 \cdot 1 = -1$; $\operatorname{sign}(-1) = -1$. Since $f(1, 1) = 1$, we have to correct the weights. According to (4), $\delta = 1 - (-1) = 2$; let $\alpha = 1$ in (5) and (6). Then $\tilde{w}_0 = -3 + 2 = -1$; $\tilde{w}_1 = 1 + 2 \cdot 1 = 3$; $\tilde{w}_2 = 1 + 2 \cdot 1 = 3$. Thus $\tilde{W} = (-1, 3, 3)$. The weighted sum after the correction is equal to $z = -1 + 3 \cdot 1 + 3 \cdot 1 = 5$; $\operatorname{sign}(5) = 1 = f(1, 1)$, so no further correction is needed.

2) Inputs (1, -1). The weighted sum is equal to $z = -1 + 3 \cdot 1 + 3 \cdot (-1) = -1$; $\operatorname{sign}(-1) = -1 = f(1, -1)$. No correction is needed.

3) Inputs (-1, 1). The weighted sum is equal to $z = -1 + 3 \cdot (-1) + 3 \cdot 1 = -1$; $\operatorname{sign}(-1) = -1 = f(-1, 1)$. No correction is needed.

4) Inputs (-1, -1). The weighted sum is equal to $z = -1 + 3 \cdot (-1) + 3 \cdot (-1) = -7$; $\operatorname{sign}(-7) = -1 = f(-1, -1)$. No correction is needed.

This means that the iteration process has converged: there are no errors on any element of the learning set, and the disjunction of two variables is implemented on the threshold neuron using the weighting vector $\tilde{W} = (-1, 3, 3)$ obtained as the result of the learning process.
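The procedure traced by hand above is mechanical, and a minimal sketch of the learning rule (6) can be written in a few lines of Python (the function name and the list-based representation are illustrative choices, not the author's notation). With α = 1 and the starting vector (1, 1, 1), the sketch retraces Example 1 step by step and returns the same weighting vector:

```python
def sign(z):
    return 1 if z >= 0 else -1

def learn_threshold_neuron(samples, w, alpha=1):
    # Repeat passes over the learning set until a full pass needs no correction.
    while True:
        corrected = False
        for x, t in samples:                 # x = (x1, ..., xn), t = f(x)
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            delta = t - sign(z)              # the error (4)
            if delta != 0:                   # the correction rule (6)
                w[0] += alpha * delta
                w[1:] = [wi + alpha * delta * xi for wi, xi in zip(w[1:], x)]
                corrected = True
        if not corrected:
            return w

disjunction = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
print(learn_threshold_neuron(disjunction, [1, 1, 1]))   # [-1, 3, 3]
```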

Example 2. Let us implement the function $f_1(x_1, x_2) = \bar{x}_1 \,\&\, x_2$ using the threshold neuron.

 #    x1    x2    f1
 1)    1     1     1
 2)    1    -1    -1
 3)   -1     1     1
 4)   -1    -1     1

The table shows the function values and the entire learning set, containing 4 input vectors and the 4 corresponding values of the function. Let us start the learning process from the same weighting vector $W^0 = (1, 1, 1)$ as in Example 1.

Iteration 1.

1) Inputs (1, 1). The weighted sum is equal to $z = 1 + 1 \cdot 1 + 1 \cdot 1 = 3$; $\varphi(z) = \operatorname{sign}(3) = 1 = f_1(1, 1)$. No correction of the weights is needed.

2) Inputs (1, -1). The weighted sum is equal to $z = 1 + 1 \cdot 1 + 1 \cdot (-1) = 1$; $\operatorname{sign}(1) = 1$. Since $f_1(1, -1) = -1$, we have to correct the weights. According to (4), $\delta = -1 - 1 = -2$. Let $\alpha = 1$ in (5) and (6). Then $\tilde{w}_0 = 1 - 2 = -1$; $\tilde{w}_1 = 1 + (-2) \cdot 1 = -1$; $\tilde{w}_2 = 1 + (-2) \cdot (-1) = 3$. Thus $\tilde{W} = (-1, -1, 3)$. The weighted sum after the correction is equal to $z = -1 + (-1) \cdot 1 + 3 \cdot (-1) = -5$; $\operatorname{sign}(-5) = -1 = f_1(1, -1)$, so no further correction is needed.

3) Inputs (-1, 1). The weighted sum is equal to $z = -1 + (-1) \cdot (-1) + 3 \cdot 1 = 3$; $\operatorname{sign}(3) = 1 = f_1(-1, 1)$. No correction is needed.

4) Inputs (-1, -1). The weighted sum is equal to $z = -1 + (-1) \cdot (-1) + 3 \cdot (-1) = -3$; $\operatorname{sign}(-3) = -1$. Since $f_1(-1, -1) = 1$, we have to correct the weights. According to (4), $\delta = 1 - (-1) = 2$; let $\alpha = 1$. Correcting the weights according to (5) and (6) we obtain $\tilde{W} = (1, -3, -3)$. The weighted sum after the correction is equal to $z = 1 + (-3) \cdot (-1) + (-3) \cdot (-1) = 7$; $\operatorname{sign}(7) = 1 = f_1(-1, -1)$, so no further correction is needed.

Iteration 2.

1) Inputs (1, 1). The weighted sum is equal to $z = 1 + (-3) \cdot 1 + (-3) \cdot 1 = -5$; $\operatorname{sign}(-5) = -1$. Since $f_1(1, 1) = 1$, we have to correct the weights: $\delta = 1 - (-1) = 2$, and $\tilde{w}_0 = 1 + 2 = 3$; $\tilde{w}_1 = -3 + 2 \cdot 1 = -1$; $\tilde{w}_2 = -3 + 2 \cdot 1 = -1$. Thus $\tilde{W} = (3, -1, -1)$. The weighted sum after the correction is equal to $z = 3 + (-1) \cdot 1 + (-1) \cdot 1 = 1$; $\operatorname{sign}(1) = 1 = f_1(1, 1)$, so no further correction is needed.

2) Inputs (1, -1). The weighted sum is equal to $z = 3 + (-1) \cdot 1 + (-1) \cdot (-1) = 3$; $\operatorname{sign}(3) = 1$. Since $f_1(1, -1) = -1$, we have to correct the weights: $\delta = -1 - 1 = -2$. Correcting the weights according to (5) and (6) we obtain $\tilde{W} = (1, -3, 3)$. The weighted sum after the correction is equal to $z = 1 + (-3) \cdot 1 + 3 \cdot (-1) = -5$; $\operatorname{sign}(-5) = -1 = f_1(1, -1)$, so no further correction is needed.

3) Inputs (-1, 1). The weighted sum is equal to $z = 1 + (-3) \cdot (-1) + 3 \cdot 1 = 7$; $\operatorname{sign}(7) = 1 = f_1(-1, 1)$. No correction is needed.

4) Inputs (-1, -1). The weighted sum is equal to $z = 1 + (-3) \cdot (-1) + 3 \cdot (-1) = 1$; $\operatorname{sign}(1) = 1 = f_1(-1, -1)$. No correction is needed.

Iteration 3.

1) Inputs (1, 1). The weighted sum is equal to $z = 1 + (-3) \cdot 1 + 3 \cdot 1 = 1$; $\operatorname{sign}(1) = 1 = f_1(1, 1)$, so no correction is needed.

Since the elements from 2) till 4) of the learning set were already checked for the current weighting vector $\tilde{W} = (1, -3, 3)$ (see Iteration 2), the iteration process has converged: there are no errors on any element of the learning set, and the Boolean function $f_1(x_1, x_2) = \bar{x}_1 \,\&\, x_2$ of the two variables is implemented on the threshold neuron using the weighting vector $\tilde{W} = (1, -3, 3)$ obtained as the result of the learning process.

Example 3. Let us implement the function $f_2(x_1, x_2) = x_1 \,\&\, \bar{x}_2$ using the threshold neuron.

 #    x1    x2    f2
 1)    1     1     1
 2)    1    -1     1
 3)   -1     1    -1
 4)   -1    -1     1

The table shows the function values and the entire learning set, containing 4 input vectors and the 4 corresponding values of the function. Let us start the learning process from the same weighting vector $W^0 = (1, 1, 1)$ as in Example 1 and Example 2.

Iteration 1.

1) Inputs (1, 1). The weighted sum is equal to $z = 1 + 1 \cdot 1 + 1 \cdot 1 = 3$; $\operatorname{sign}(3) = 1 = f_2(1, 1)$. No correction of the weights is needed.

2) Inputs (1, -1). The weighted sum is equal to $z = 1 + 1 \cdot 1 + 1 \cdot (-1) = 1$; $\operatorname{sign}(1) = 1 = f_2(1, -1)$. No correction is needed.

3) Inputs (-1, 1). The weighted sum is equal to $z = 1 + 1 \cdot (-1) + 1 \cdot 1 = 1$; $\operatorname{sign}(1) = 1$. Since $f_2(-1, 1) = -1$, we have to correct the weights. According to (4), $\delta = -1 - 1 = -2$. Let $\alpha = 1$ in (5) and (6). Then $\tilde{w}_0 = 1 - 2 = -1$; $\tilde{w}_1 = 1 + (-2) \cdot (-1) = 3$; $\tilde{w}_2 = 1 + (-2) \cdot 1 = -1$. Thus $\tilde{W} = (-1, 3, -1)$. The weighted sum after the correction is equal to $z = -1 + 3 \cdot (-1) + (-1) \cdot 1 = -5$; $\operatorname{sign}(-5) = -1 = f_2(-1, 1)$, so no further correction is needed.

4) Inputs (-1, -1). The weighted sum is equal to $z = -1 + 3 \cdot (-1) + (-1) \cdot (-1) = -3$; $\operatorname{sign}(-3) = -1$. Since $f_2(-1, -1) = 1$, we have to correct the weights: $\delta = 1 - (-1) = 2$, and $\tilde{w}_0 = -1 + 2 = 1$; $\tilde{w}_1 = 3 + 2 \cdot (-1) = 1$; $\tilde{w}_2 = -1 + 2 \cdot (-1) = -3$. Thus $\tilde{W} = (1, 1, -3)$. The weighted sum after the correction is equal to $z = 1 + 1 \cdot (-1) + (-3) \cdot (-1) = 3$; $\operatorname{sign}(3) = 1 = f_2(-1, -1)$, so no further correction is needed.

Iteration 2.

1) Inputs (1, 1). The weighted sum is equal to $z = 1 + 1 \cdot 1 + (-3) \cdot 1 = -1$; $\operatorname{sign}(-1) = -1$. Since $f_2(1, 1) = 1$, we have to correct the weights: $\delta = 1 - (-1) = 2$, and $\tilde{w}_0 = 1 + 2 = 3$; $\tilde{w}_1 = 1 + 2 \cdot 1 = 3$; $\tilde{w}_2 = -3 + 2 \cdot 1 = -1$. Thus $\tilde{W} = (3, 3, -1)$. The weighted sum after the correction is equal to $z = 3 + 3 \cdot 1 + (-1) \cdot 1 = 5$; $\operatorname{sign}(5) = 1 = f_2(1, 1)$, so no further correction is needed.

2) Inputs (1, -1). The weighted sum is equal to $z = 3 + 3 \cdot 1 + (-1) \cdot (-1) = 7$; $\operatorname{sign}(7) = 1 = f_2(1, -1)$. No correction is needed.

3) Inputs (-1, 1). The weighted sum is equal to $z = 3 + 3 \cdot (-1) + (-1) \cdot 1 = -1$; $\operatorname{sign}(-1) = -1 = f_2(-1, 1)$. No correction is needed.

4) Inputs (-1, -1). The weighted sum is equal to $z = 3 + 3 \cdot (-1) + (-1) \cdot (-1) = 1$; $\operatorname{sign}(1) = 1 = f_2(-1, -1)$. No correction is needed.

This means that the iteration process has converged: there are no errors on any element of the learning set, and the function $f_2(x_1, x_2) = x_1 \,\&\, \bar{x}_2$ of the two variables is implemented on the threshold neuron using the weighting vector $\tilde{W} = (3, 3, -1)$ obtained as the result of the learning process.

Notice that Example 3 may be obtained by changing the order of the variables in Example 2. Therefore a weighting vector $\tilde{W}$ for Example 3 may be obtained from the solution known for Example 2 by reordering the weights in the same way. This would lead to $\tilde{W} = (1, 3, -3)$ as a solution for Example 3. It is easy to check that this weighting vector gives a correct realization of the function. The learning algorithm applied to Example 3 gave a quite different weighting vector, namely $\tilde{W} = (3, 3, -1)$. As shown in Fig. 3, and as was mentioned above, a threshold function has infinitely many weighting vectors; the learning algorithm stops at the first solution it finds.
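Both weighting vectors can be checked mechanically; a small sketch (the helper name is illustrative):

```python
def realizes(w, table):
    # Does sign(w0 + w1*x1 + w2*x2) reproduce the desired output on every row?
    return all((1 if w[0] + w[1]*x1 + w[2]*x2 >= 0 else -1) == t
               for (x1, x2), t in table)

f2 = [((1, 1), 1), ((1, -1), 1), ((-1, 1), -1), ((-1, -1), 1)]
print(realizes((3, 3, -1), f2))   # True: the vector found by the learning algorithm
print(realizes((1, 3, -3), f2))   # True: the vector obtained by reordering
```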

V. IMPLEMENTATION OF THE NON-THRESHOLD BOOLEAN FUNCTIONS USING FEEDFORWARD NEURAL NETWORK FROM THE THRESHOLD NEURONS

Non-threshold, or nonlinearly separable, Boolean functions are of common occurrence. As was mentioned above, these functions form the great majority. However, they cannot be implemented using a single neuron with the threshold activation function. A classical example of a non-threshold Boolean function, or, in other words, of a nonlinearly separable problem, is the XOR (Exclusive OR) problem, or modulo 2 addition of two Boolean variables: $f(x_1, x_2) = x_1 \oplus x_2$ (see Table I and Fig. 4). Geometrically, this problem, like any problem described by a Boolean function, is the classification of the points of a hypercube (see Fig. 3 and Fig. 4). Each point of the hypercube is either in class "1" or in class "-1". In the case of the XOR problem, the input patterns (1, 1) and (-1, -1), which are in class "1", are at opposite corners of the square (the 2D hypercube). On the other hand, the input patterns (1, -1) and (-1, 1) are also at opposite corners of the same square, but they are in class "-1". It is clear from this that the function XOR is not linearly separable, because it is impossible to find a line that would separate its "1s" from its "-1s". To solve this problem within a "threshold basis", it is necessary to build a network. Let us consider a network of three neurons (see Fig. 5). This network contains the input layer, which distributes the input signals $x_1$ and $x_2$, one hidden layer containing Neurons 1 and 2, and one output layer containing the single Neuron 3. This is a typical example of a feedforward neural network. We will consider and study this network and its learning in more detail later.

Fig. 5 A network for solving the XOR problem

Let us recall that the XOR function may be presented in the disjunctive normal form as follows:

$$x_1 \oplus x_2 = \bar{x}_1 x_2 \vee x_1 \bar{x}_2 = f_1(x_1, x_2) \vee f_2(x_1, x_2),$$

where $f_1(x_1, x_2)$ and $f_2(x_1, x_2)$ are the functions whose implementation on the threshold neuron was considered in Example 2 and Example 3, respectively. Let us also recall that the implementation of the disjunction was considered in Example 1. This means that if Neuron 1 implements the function $f_1(x_1, x_2)$, Neuron 2 implements the function $f_2(x_1, x_2)$, and Neuron 3 implements the disjunction, then the network presented in Fig. 5 implements the XOR function.

Let us consider how it works. We will use the weighting vectors obtained in Example 1 - Example 3. Thus Neuron 1 operates with the weighting vector $\tilde{W} = (1, -3, 3)$, Neuron 2 operates with the weighting vector $\tilde{W} = (3, 3, -1)$, and Neuron 3 operates with the weighting vector $\tilde{W} = (-1, 3, 3)$ (see Fig. 6).

Fig. 6 Neurons with the synaptic weights

The network performs in the following way. The input signals $x_1$ and $x_2$ are accepted in parallel from the input layer by both neurons of the hidden layer (N1 and N2). Their outputs come to the corresponding inputs of the single neuron of the output layer (N3). The output of this neuron is the output of the entire network. The results are summarized in Table II.

Table II Implementation of the XOR function using the neural network presented in Fig. 6

          Neuron 1,       Neuron 2,       Neuron 3,
          W = (1,-3,3)    W = (3,3,-1)    W = (-1,3,3)
 #  x1 x2   Z   sign(Z)     Z   sign(Z)     Z   sign(Z) = output   XOR = x1 ⊕ x2
 1)  1  1    1     1         5      1         5        1                 1
 2)  1 -1   -5    -1         7      1        -1       -1                -1
 3) -1  1    7     1        -1     -1        -1       -1                -1
 4) -1 -1    1     1         1      1         5        1                 1

For all three neurons, their weighted sums and outputs are shown. To confirm that the network indeed implements the XOR function, its actual values are shown in the last column of Table II.
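The computation summarized in Table II is a plain feedforward pass through the network of Fig. 6 and can be sketched as follows (the function names are illustrative; the weighting vectors are those obtained in Examples 1-3):

```python
def sign(z):
    return 1 if z >= 0 else -1

def threshold_neuron(w, x):
    return sign(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

W1, W2, W3 = (1, -3, 3), (3, 3, -1), (-1, 3, 3)    # Neurons 1-3 of Fig. 6

def xor_network(x1, x2):
    h1 = threshold_neuron(W1, (x1, x2))    # hidden Neuron 1 implements f1
    h2 = threshold_neuron(W2, (x1, x2))    # hidden Neuron 2 implements f2
    return threshold_neuron(W3, (h1, h2))  # output Neuron 3 implements disjunction

for x in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(x, xor_network(*x))   # 1, -1, -1, 1, as in the last column of Table II
```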

The natural questions that follow from the above considerations are: Is it possible to design a neuron (or, more exactly, an activation function) that could implement the non-threshold Boolean functions on a single neuron? Could this implementation be similar to the implementation (1), i.e., is it possible to express a value of the function as a value of the activation function evaluated on the value of the weighted sum?

For many years the prevailing view was that there is no positive answer to these questions. After the conclusion of M. Minsky and S. Papert [12], made in 1969, about the principal limitation of the classical perceptron and the impossibility of implementing the non-threshold functions using the threshold neuron, the major efforts of the neural community were directed to the development of different networks. At the same time, a positive answer to both questions was recently obtained.

VI. IMPLEMENTATION OF THE NON-THRESHOLD BOOLEAN FUNCTIONS USING THE UNIVERSAL BINARY NEURON

Let us return to Definition 2 of the threshold Boolean function and to equation (1), which is the key part of this definition: $f(x_1, \ldots, x_n) = \operatorname{sign}(w_0 + w_1 x_1 + \ldots + w_n x_n)$. We will now consider how it is possible to produce another representation of a Boolean function with n+1 weights and to implement non-threshold Boolean functions using a single neuron. The sign function, which is used as the activation function of the threshold neuron, may be considered as a binary predicate. However, it is possible to consider a different binary predicate, which has different properties and a totally different nature of its nonlinearity. This can be the way to expand the set of functions that can be implemented using a single neuron. The classical and common view on neurons and neural networks was for many years concentrated on the consideration of real-valued weights. However, the extension of the basic space of the weights offers a principal opportunity to define a different activation function, which can make it possible to implement nonlinearly separable functions using a single neuron. In the papers [19]-[20] a mathematical background of the neural element with "complete functionality" (later called the universal binary neuron (UBN) [21]) was proposed. It was shown there for the first time that the XOR problem can be solved on a single neuron by moving to the complex domain. The following approach was proposed: the weights are complex, and the activation function of the neuron is a function of the argument of the weighted sum. This approach is considered in detail in [21]. Let us consider the following activation function:

$$\varphi(z) = P_B(z) = \begin{cases} 1, & \text{if } 0 \le \arg(z) < \pi/2 \text{ or } \pi \le \arg(z) < 3\pi/2 \\ -1, & \text{if } \pi/2 \le \arg(z) < \pi \text{ or } 3\pi/2 \le \arg(z) < 2\pi. \end{cases} \tag{7}$$

Equation (7) is illustrated in Fig. 7. The function (7) separates the complex plane into four equal sectors (quadrants): in two of them $P_B = 1$, and in the other two $P_B = -1$, depending on the argument of the variable z on which the function depends.

Fig. 7 Activation function (7)

The activation function (7) determines the Universal Binary Neuron (UBN). It works according to the traditional model (see Fig. 2). But it is very important that this neuron operates with complex-valued weights, and its activation function is a function of the argument of the weighted sum, which is a complex number.

Example 4. Let us show that using the weighting vector W = (0, 1, i), where i is the imaginary unit, a single UBN implements the XOR function.

Table III Solution of the XOR problem on the single universal binary neuron, W = (0, 1, i)

 #   x1   x2   z = w0 + w1 x1 + w2 x2   φ(z) = PB(z)   XOR = x1 ⊕ x2
 1)   1    1          1 + i                  1               1
 2)   1   -1          1 - i                 -1              -1
 3)  -1    1         -1 + i                 -1              -1
 4)  -1   -1         -1 - i                  1               1

Table III shows the weighted sums and the corresponding outputs of the UBN. To confirm that the neuron indeed implements the XOR function, its actual values are shown in the last column of Table III. This means that the traditional view that the XOR problem cannot be solved on a single neuron (see, for example, [2], [5], [12]) is not true: the XOR problem can be solved on the single universal binary neuron (UBN), as we see from Example 4.
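A minimal sketch of the UBN in Python: the activation (7) is implemented directly from its sector definition and applied to the weighting vector of Example 4 (the helper name P_B is an illustrative choice):

```python
import cmath, math

def P_B(z):
    # Activation (7): +1 in the 1st and 3rd quadrants of the complex plane,
    # -1 in the 2nd and 4th, judged by the argument of z mapped into [0, 2*pi).
    arg = cmath.phase(z) % (2 * math.pi)
    if arg < math.pi / 2 or math.pi <= arg < 3 * math.pi / 2:
        return 1
    return -1

W = (0, 1, 1j)   # the complex weighting vector of Example 4
for x1, x2 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    z = W[0] + W[1] * x1 + W[2] * x2
    print((x1, x2), z, P_B(z))   # reproduces Table III: 1, -1, -1, 1
```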

VII. P-REALIZABLE BOOLEAN FUNCTIONS AND UNIVERSAL BINARY NEURON

As was just shown, it is possible to implement non-threshold (nonlinearly separable) Boolean functions using the UBN. This means that such an implementation using the activation function (7) determines a new class of Boolean functions.

Definition 4 [20], [21]. The Boolean function $f(x_1, x_2)$ of two variables is called P-realizable over the field of the complex numbers C if it is possible to define a predicate $P_B$ on this field and to find a weighting vector $W = (w_0, w_1, w_2)$, $w_j \in C$, $j = 0, 1, 2$, such that the equality

$$f(x_1, x_2) = P_B(w_0 + w_1 x_1 + w_2 x_2) \tag{8}$$

holds for all the values of the variables x in the domain of the function f.

It follows from Example 4 that if the predicate $P_B$ is defined by equation (7), the Boolean function XOR, which is not a threshold function, becomes a P-realizable function exactly according to Definition 4. Let us consider some features [21] that follow from Definition 4 of a P-realizable Boolean function and from the equality (7) for the function $P_B$. For simplicity, we will consider here Boolean functions of two variables, and then we will generalize the corresponding features to functions of n variables. Let us suppose that a Boolean function $f(x_1, x_2)$ is P-realizable (the function $P_B$ being defined by (7)). It means that if $f(\alpha_1, \alpha_2) = f(\alpha) = 1$, then a number $r_\alpha \in C$ exists such that

$$w_0 + w_1 \alpha_1 + w_2 \alpha_2 = W(\alpha) = r_\alpha f(\alpha_1, \alpha_2) = r_\alpha f(\alpha) \tag{9}$$

and evidently

$$(0 \le \arg(r_\alpha) < \pi/2) \vee (\pi \le \arg(r_\alpha) < 3\pi/2). \tag{10}$$

If $f(\alpha_1, \alpha_2) = f(\alpha) = -1$, then the equality (9) is true, but for $r_\alpha$ we have the following:

$$(\pi/2 \le \arg(r_\alpha) < \pi) \vee (3\pi/2 \le \arg(r_\alpha) < 2\pi). \tag{11}$$

It is clear from (9)-(11) that if $f(\alpha) = 1$, then

$$r_\alpha = W(\alpha); \quad |W(\alpha)| = |r_\alpha|, \tag{12}$$

and if $f(\alpha) = -1$, then

$$r_\alpha = -W(\alpha); \quad |W(\alpha)| = |r_\alpha|. \tag{13}$$

Theorem 1 [21].¹ The Boolean function $f(x_1, x_2)$ is P-realizable with the weighting vector $W = (w_0, w_1, w_2)$ if for any values $(\alpha_1, \alpha_2) = (\alpha)$ from the domain of the function f there exists a number $r_\alpha \in C$ such that for $f(\alpha) = 1$ the conditions (9), (10), (12) are true, and for $f(\alpha) = -1$ the conditions (9), (11), (13) are true.

¹ Here and below we will not give the proofs of the theorems; in each case we give a reference to the paper or book where the corresponding theorem is proven.

Let us denote the vector of values of the Boolean function $f(x_1, \ldots, x_n)$ by $f = (f_0, f_1, \ldots, f_{2^n-1})$ and the column vector of values of the Boolean variable $x_i$ by $X_i^T = (x_i^0, \ldots, x_i^{2^n-1})^T$, $i = 1, \ldots, n$, where "T" denotes the transposition of a row vector into a column vector (see Table IV). Let also $X_0 = (\underbrace{1, 1, \ldots, 1}_{2^n \text{ times}})^T$ be a $2^n$-dimensional column vector all of whose components are equal to 1.

X0

X1

X2



Xn

f

1

x10

x 20



x n0

f0

1 …

x11 …

x12 …

… …

x1n …

f1

2 n −1 1

2 n −1 2



2 n −1 n

1

x

x

x

… f 2n −1

Theorem 2 [21]. The Boolean function $f(x_1, x_2)$ is P-realizable if and only if a complex-valued vector $r = (r_0, r_1, r_2, r_3)$, $r_j \ne 0$, $j = 0, 1, 2, 3$, exists such that the equality

$$\bigl((r \circ f),\ (X_1 \circ X_2)\bigr) = 0, \tag{14}$$

where $\circ$ denotes the component-by-component multiplication of vectors and $(a, b)$ is the scalar product of the vectors a and b, is true, and all the components $r_j$, $j = 0, 1, 2, 3$, of the vector r satisfy the conditions (9), (10), (12) if $f_j = 1$, or the conditions (9), (11), (13) if $f_j = -1$.

Theorem 3 [21]. If the Boolean function $f(x_1, x_2)$ is P-realizable, then the components of its weighting vector may be obtained in the following way:

$$w_j = (r^* \circ f,\ X_j), \quad j = 0, 1, 2, \tag{15}$$

where $r^* = (r_0^*, r_1^*, r_2^*, r_3^*)$ is a solution of the equation (14) with respect to r.

Theorem 1-Theorem 3 determine a process of synthesis for the UBN. This is an alternative way (to learning) of finding a weighting vector implementing the corresponding input/output mapping. We will also consider a learning algorithm for the UBN later. It is important to mention that if a Boolean function is P-realizable, then there are infinitely many different complex-valued vectors $r = (r_0, r_1, r_2, r_3)$, $r_j \ne 0$, $j = 0, 1, 2, 3$, that satisfy (14). This means that if some Boolean function is P-realizable, then it has infinitely many different weighting vectors. Let us now apply the results of Theorem 1-Theorem 3 to obtain a weighting vector for a Boolean function, in order to implement it on a single UBN.

Example 5. Let us take the XOR function, which is the favorite smallest example of a nonlinearly separable problem: $\mathrm{XOR}(x_1, x_2) = X_1 \circ X_2 = (1, -1, -1, 1)$. The equation (14) for this function is the following: $\bigl(((r_0, r_1, r_2, r_3) \circ (1, -1, -1, 1)),\ (1, -1, -1, 1)\bigr) = 0$; evaluating the scalar product: $r_0 + r_1 + r_2 + r_3 = 0$. According to the conditions (9)-(13), $r_0$ and $r_3$ have to lie within the 1st or 3rd quadrant of the complex plane, while $r_1$ and $r_2$ have to lie within the 2nd or 4th one. One possible and acceptable solution is, for example, the following: $r_0^* = 1+i;\ r_1^* = 1-i;\ r_2^* = -1+i;\ r_3^* = -1-i$. Notice that any possible scaling of this vector is also a solution. According to (15), $w_0 = 0,\ w_1 = 4i,\ w_2 = 4$, or, dividing all the weights by 4, we obtain the more "beautiful" components of the weighting vector: $w_0 = 0,\ w_1 = i,\ w_2 = 1$. Table V illustrates that the weighting vector (0, i, 1) solves the XOR problem for two inputs. One may compare Table V with Table III, which illustrates the solution of the XOR problem with another weighting vector, (0, 1, i), but with the same activation function (7).

Table V Solution of the XOR problem by synthesis

 #   x1   x2   z = w0 + w1 x1 + w2 x2   PB(z)   XOR = x1 ⊕ x2
 1)   1    1          1 + i               1          1
 2)   1   -1         -1 + i              -1         -1
 3)  -1    1          1 - i              -1         -1
 4)  -1   -1         -1 - i               1          1
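The synthesis of Example 5 can be verified mechanically. A minimal sketch, assuming the row ordering of Table IV for the value vectors (all names are illustrative):

```python
f  = [1, -1, -1, 1]                      # XOR in the {1, -1} alphabet
X0 = [1, 1, 1, 1]                        # the all-ones vector
X1 = [1, 1, -1, -1]                      # values of x1, rows ordered as in Table IV
X2 = [1, -1, 1, -1]                      # values of x2
r  = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]  # a solution of (14): sum(r) == 0

def w(Xj):
    # (15): w_j = (r o f, X_j), with o the component-by-component product
    return sum(rk * fk * xk for rk, fk, xk in zip(r, f, Xj))

print(w(X0), w(X1), w(X2))   # 0j 4j (4+0j), i.e. (0, 4i, 4) -> scaled: (0, i, 1)
```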

It is easy to confirm by a simple computational experiment that the equation (14) has an infinite number of acceptable solutions for any of the sixteen Boolean functions of two variables. This means that all of them are P-realizable with the activation function (7). Table VI contains examples of the weighting vectors for all the Boolean functions of two variables. All the vectors were obtained in the same way as for the XOR function in Example 5 (i.e., using Theorem 1-Theorem 3). The Boolean functions are numbered by decimal numbers, which are the decimal representations of the binary numbers created by the concatenation of all the values of the function in the alphabet {0, 1} ($f_0 = (0,0,0,0)$; $f_1 = (0,0,0,1)$, etc.).


Table VI. Weighting vectors of the Boolean functions of two variables relative to the predicate (7)

 Func #   0   1   2   3   4   5   6   7   8   9   10   11   12   13   14   15
 w0       1   1   0   1   0   i   i   1   i   i   -1    i   -1   -1    i    i
 w1       0   1   1  -i   0   1   1   i   i   i    0    1    i    i   -1    0
 w2       0   0   i   1  -1  -1   i   i   i   i    i   -1    0   -1   -1    0

We have just seen that it is possible to define the activation function $P_B$ over the field of the complex numbers as the alternating sequence 1, -1, 1, -1 (see (7)) such that all the Boolean functions of two variables, including the non-threshold functions XOR and not-XOR, become P-realizable according to Definition 4. Let us consider the general case of this activation function, studied in depth in [20]-[21]. The equation (7), which defines a predicate implementing all the functions of two variables, separates the complex plane into 2 x 2 = 4 equal sectors. Let us generalize this definition. We will define the predicate $P_B$ over the field of the complex numbers in the following way. Let k be a natural number. We will separate the complex plane into m = 2k equal sectors, and the predicate $P_B$ will be defined by the following equation:

$$P_B(z) = (-1)^j, \ \text{if}\ 2\pi j/m \le \arg(z) < 2\pi(j+1)/m, \quad m = 2k,\ k \in N, \tag{16}$$

where m is an even positive integer and j is a non-negative integer, $0 \le j < m$. So, if we separate the complex plane into m equal sectors, the function $P_B$ is equal to 1 for the complex numbers in the even sectors 0, 2, 4, ..., m-2, and it is equal to -1 for the numbers in the odd sectors 1, 3, 5, ..., m-1. The equation (16) is illustrated in Fig. 8. The equation (16) does not define the value $P_B(0)$. Without loss of generality, let us set $P_B(0) = 1$.

Fig. 8 Definition of the function $P_B$ (see the equation (16)): the complex plane is separated into m sectors numbered 0, 1, ..., m-1 counterclockwise, and $P_B(z)$ alternates between 1 and -1 from sector to sector

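A sketch of the general predicate (16) under the same conventions (sectors counted counterclockwise from the positive real axis; the convention P_B(0) = 1 as set above):

```python
import cmath, math

def P_B(z, m):
    # (16): separate the complex plane into m = 2k equal sectors; the output
    # is (-1)**j, where j is the index of the sector containing arg(z).
    if z == 0:
        return 1                              # the convention P_B(0) = 1
    arg = cmath.phase(z) % (2 * math.pi)      # arg(z) mapped into [0, 2*pi)
    j = int(arg * m / (2 * math.pi))          # sector index, 0 <= j < m
    return (-1) ** j

print(P_B(1 + 1j, 4), P_B(-1 + 1j, 4))  # m = 4 reproduces (7): 1 -1
print(P_B(1j, 6))                       # m = 6: arg = pi/2 lies in sector 1 -> -1
```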

Generalizing now Definition 4 for the case of n variables, we can define a P-realizable Boolean function of n variables and the Universal Binary Neuron in general.

Definition 5 [20], [21]. The Boolean function $f(x_1, \ldots, x_n)$ is called P-realizable over the field of the complex numbers C if it is possible to define a predicate $P_B$ on this field and to find a weighting vector $W = (w_0, w_1, \ldots, w_n)$, $w_j \in C$, $j = 0, 1, \ldots, n$, such that the equality

$$f(x_1, \ldots, x_n) = P_B(w_0 + w_1 x_1 + \ldots + w_n x_n) \tag{17}$$

holds for all the values of the variables x in the domain of the function f.

Definition 6 [21]. The Universal binary neuron (UBN) over the field of the complex numbers is an element, which performs according to (17) with the activation function (16) for a given input/output mapping described by a Boolean function of n variables. The important conclusion is that the functionality of the UBN with the activation function (16) is

always higher than the functionality of the neuron with the threshold activation function for k>1 and m>2 in (16). For example, for k=2, m=4 UBN has complete functionality for n=2 (which means that all Boolean functions of two variables are P-realizable and implementable on the single UBN, as it was shown above, see Example 5 and Table VI), and its functionality for n>2 is always much higher than functionality of the threshold neuron or traditional perceptron. The following important existence theorem is proven in [21]: Theorem 4 [21]. For any value of n it is possible to define the predicate PB in such a way that it ensures a P-realization of all the Boolean functions of n variables. For example, all 256 Boolean functions of 3 variables are P-realizable and may be implemented on the single UBN with m=6 in (16); all 65536 Boolean functions of 4 variables are P-realizable 5

and may be implemented on the single UBN with m=8 in (16); all 2 2 = 2 32 Boolean functions of 5 variables are P-realizable and may be implemented on the single UBN with m=10 in (16). A parity problem, which is a generalization of the XOR problem for the case of n variables is of a special interest. The parity n function is the result of the modulo 2 addition of n Boolean variables: f ( x1 ,..., x n ) = x1 ⊕ ... ⊕ x n (or the multiplication of the same variables in the Boolean alphabet {1, -1}). This is a non-threshold function and it cannot be implemented on the single 26

threshold neuron. This function is widely and commonly used as a benchmark for training the feedforward neural networks. However, the parity n function can be implemented on the single

UBN! For example, it was experimentally shown using the learning algorithm [21] the parity 11 function can be implemented on the single UBN with m=26 in (16), the parity 10 function can be implemented on the single UBN with m=24 in (16), the parity 9 function can be implemented on the single UBN with m=22 in (16), the parity 8 function can be implemented on the single UBN with m=16 in (16), the parity 7 function can be implemented on the single UBN with m=14 in (16), the parity 6 function can be implemented on the single UBN with m=12 in (16), etc.

VIII. MULTIPLE-VALUED THRESHOLD FUNCTIONS AND MULTI-VALUED NEURON Only a limited set of recognition/classification and prediction problems can be reduced to the problem of two classes, whose representatives are described by the binary parameters. Exactly this is the class of the problems that can be described by Boolean functions. However, there are many problems that are described by multiple-valued functions. In the terms of neural networks, a piecewise linear and sigmoid activation functions are typical representatives of the continuous multiple-valued activation functions. Those neurons that are operating with these functions have continuous (but band-limited, usually to the interval [0, 1] or [-1, 1]) multiple-valued inputs and outputs. These neurons are commonly and widely used. They are described in detail, for example, in [2]. We will consider here a different neuron, namely, a multi-valued neuron, which is based on the theory of multiple-valued threshold functions over the field of the complex numbers. The basic principles of this theory were proposed in 1971 by N. Aizenberg in [22] and then they were developed in [23] and [21]. Let f ( y1 ,..., y n ) be a function of n variables in k-valued logic ( y i ∈ {0,1,...k − 1} , i =1, ..., n). Let us consider the set of the kth roots of unity. We consider the univalent mapping j ↔ ε j , where

ε j = exp(i 2πj/k ) . We will code the values of k-valued logic by the complex numbers, which are exactly the kth roots of unity. In other words integer-valued variables y i ∈ {0,1,...k − 1} = E k are

~ mapped onto complex-valued variables xi ∈ {ε 0 , ε, ..., ε k -1 } = E k , i=1, ..., n. Let us consider the following nonlinear function P, which is a function of the argument of the complex number z and is simultaneously a k-valued predicate defined on the field of the complex numbers C:

P( z )= exp(i 2πj/k ) , if 2πj/k ≤ arg z < 2π ( j+1) /k

27

(18)

In other words, if the complex plane is partitioned in k equal sectors by continuing of the rays corresponding to the kth roots of unity, and the argument of the number z is between arguments of the roots of unity ε j and ε j +1 , then P(z) = ε j . Equation (18) is illustrated in Fig. 9. i 1

0 k-1

J-1

Z

J

k-2 J+1

P( z ) = ε j Fig. 9 Geometrical interpretation of the MVN activation function (18) Now we can define a multiple-valued threshold function over the field of the complex numbers and multi-valued neuron.

Definition 7 [22], [21]. A function f ( x1 ,..., x n ) of the k-valued logic is called a multiple-valued threshold function, if it is possible to find a complex-valued weighting vector W = ( w0 , w1 ,..., wn ) such that equality

f ( x1 ,...x n ) = P ( w0 + w1 x1 + ... + wn x n )

(19)

(where P is the function (18)) holds for all values of the variables xi , i = 1,..., n from the domain of the function f.

Definition 8 [22], [21]. A neuron with the threshold activation function (18), which operates according to (19), is called the discrete-valued Multi-Valued Neuron (MVN).

The MVN is working according a traditional model (see Fig. 2). But its complex-valued activation function, which is a function of the argument of the weighted sum, and complex-valued weights are the issue of some important specific properties and advantages. The MVN has several wonderful properties that distinguish it among other neurons. The most important is that this neuron has much higher functionality in comparison with other neurons, for example, with a commonly and widely used neuron with the sigmoid activation function. A very interesting property is that the state of "absolute inhibition" coincides with the state of "absolute excitation". Indeed, the state of 28

"absolute inhibition" corresponds to the weighted sum with the argument 0, while the state of "absolute excitation" corresponds to the weighted sum with the argument 2π (See equation (18), which defines the activation function and Fig. 9). It is very interesting to mention a similarity of MVN and UBN and their activation functions (one may compare (16) with (18) and Fig. 8 with Fig. 9). Both activation functions are the functions of the argument of the weighted sum. The next very important and interesting property is the simplicity of MVN learning. It is based on the simple generalization of the error-correcting learning rule, and it will be considered in the next Section. Definition 8 determines a discrete-valued MVN. Evidently, the activation function (18) is discrete. More exactly, it is piece-wise discontinuous, it has discontinuities on the borders of the sectors. Let us modify the function (18) in order to generalize it for the continuous case in the following way. Let us consider, what will happen, when k → ∞ in (18). It means that the angle value of the sector (see Fig. 9) will go to zero. It is easy to see that the function (18) is transformed in this case as follows:

P( z ) = exp(i (arg z )) = e iArg z =

z , |z|

(20)

where z is the weighted sum, Arg z is a main value of its argument and |z| is a modulo of the complex number z. The function (20) maps the complex plane into the whole unit circle, while the function (18) maps a complex plane just on a discrete subset of the points belonging to the unit circle. The function (18) is discrete, while the function (20) is continuous. The MVN with the activation function (20) is a continuous-valued MVN. It was recently introduced in [24]. The existence of both types of the MVN (discrete-valued and continuous-valued MVN) is very important, because there are many problems of both natures (discrete and continuous) that may be solved using MVN and networks based on them.

IX. LEARNING ALGORITHMS FOR MULTI-VALUED AND UNIVERSAL BINARY NEURONS MVN learning is reduced to the movement along the unit circle. This movement does not require a derivative of the activation function (which is required for the neurons with the sigmoid activation function), because it is impossible to move in the incorrect direction. Any direction of movement along the circle will lead to the target. The shortest way of this movement is completely determined by the error that is a difference between the "target" and the "current point", i.e. between 29

the desired and actual outputs, respectively. This MVN property is very important for the further development of the learning algorithm for a multilayer network. Let us consider how it works. Let ε be a desired output of the neuron (see Fig. 10). Let ε = P (z ) be an actual output of q

s

the neuron. The MVN learning algorithm based on the error-correction learning rule is defined as follows [21]:

Wm+1 = Wm+

Cm (ε q -ε s ) X (n+1)

,

(21)

where X is an input vector2, n is the number of neuron inputs, X is a vector with the components complex conjugated3 to the components of vector X, m is the number of the learning iteration, is a current weighting vector (to be corrected),

Wm

Wm +1 is the following weighting vector (after

correction), C m is a learning rate. The convergence of the learning process based on the rule (21) is proven in [21]. What is the sense of the rule (21)? It ensures such a correction of the weights that a weighted sum is moving from the sector s to the sector q (see Fig. 10). The direction of this q s movement is completely defined by the difference δ = ε − ε . Thus δ = ε − ε determines the

q

MVN error. According to (21) a correcting term ∆wi =

s

Cm (ε q -ε s ) xi , i = 0,1,..., n , which is added (n+1)

to the corresponding weight in order to correct it, is proportional to δ . The learning rule (21) for MVN can be considered as a generalization of the learning rule (6) for the threshold neuron (the perceptron). The key point of both rules is the weights adjustment using a correcting term ∆wi , which is proportional to the error δ .

i q

εq

s

εs

Fig. 10 Geometrical interpretation of the MVN learning rule

2

We will add to the n-dimensional vector X an n+1th component x 0 ≡ 1 realizing a bias, in order to simplify

mathematical expressions for the correction of the weights Here and further x is a number complex conjugated to x and X is a vector with the components complex conjugated to the components of X.

3

30

The correction of the weights according to (21) changes the value of the weighted sum exactly on δ . Indeed, let z = w0 + w1 x1 + ... + wn xn be a current weighted sum. Let us correct the weights according to the rule (21) (we take C=1): ~ =w + w 0 0

δ ~ = w + δ x ; ... ; w ~ =w + δ x . ; w 1 1 1 n n n (n + 1) (n + 1) (n + 1)

The weighted sum after the correction is obtained as follows:

δ δ δ ) + ( w1 + x1 ) x1 + ... + ( w1 + xn ) x n = ( n + 1) (n + 1) ( n + 1) δ δ δ + ... + wn xn + = = w0 + + w1 x1 + ( n + 1) ( n + 1) ( n + 1) = w0 + w1 x1 + ... + wn xn + δ = z + δ . ~ +w ~ x + ... + w ~ x = (w + ~ z =w 0 1 1 n n 0

Equation (22) shows the importance of the factor

(22)

1 in the learning rule (21). This factor shares n +1

the error δ uniformly among the neuron's inputs. The learning rule (21) can be modified for the continuous-valued case in the following way:

Wm+1 = Wm+

C ⎛ Cm z ⎞ (ε q -e iArg z ) X = Wm+ m ⎜⎜ ε q ⎟X (n+1) ( n+1) ⎝ | z | ⎟⎠

.

(23)

The UBN learning can be reduced to the MVN learning. This approach is based on the following Theorem 5 [21]. If some Boolean function f ( x1 ,..., x n ) is P-realizable with the predicate PB then ~ a partially defined multiple-valued function f ( x1 ,..., x n ) , which is defined only on the of Boolean inputs, and takes the values defined like follows ~ ⎧ f (α1 ,..., α n ) = P(rα* ), if f (α1 ,..., α n ) = 1 ⎨~ * ⎩ f (α1 ,..., α n ) = P(−rα ), if f (α1 ,..., α n ) = −1, is the m-valued threshold function. The weighting vector of such a partially defined m-valued threshold function is simultaneously the weighting vector of the corresponding Boolean function. ~ It is evident that a partially defined multiple-valued function f ( x1 ,..., x n ) is defined in this case exactly on the set of Boolean variables E 2n , where E 2 = {1,−1} , but takes its values in the set E m = {0,1,..., m − 1} . This m-valued function can be built during the learning process. The learning algorithm is based on the following considerations. An incorrect output of the UBN for some input vector X from the learning set means that a weighting sum has fallen into an “incorrect” sector. It is evident from the equality (16), which

31

establishes the activation function of UBN, and Fig. 8 that if the weighted sum gets into the “incorrect” sector, both of the neighborhood sectors are “correct” in such a case because PB ( z1 ) = PB ( z 2 ) = − PB ( z ), z ∈ (s) z1 ∈ ( s + 1), z 2 ∈ ( s − 1) (Fig. 11).

s+1 s s-1

i

Fig. 11 Geometrical interpretation of the MVN learning rule Thus, the weights should be corrected to direct the weighted sum into one of the neighborhood sectors. A natural choice of the “correct” sector (left or right) is based on the closeness of the current value of the weighted sum. To correct the weights, we can use the same learning rule (21) that have been used for MVN learning, and q (the number of the desired sector in (21) ) has to be chosen based on the following rule [21]: q = s- 1 (mod m), if Z is closer to (s- 1)-th sector q = s+1 (mod m), if Z is closer to (s+1)-th sector.

(24)

Let us illustrate how the learning algorithm for UBN and MVN works returning to the implementation of the XOR function, for this time, using learning.

Example 6. Let us implement the XOR function using the UBN and learning algorithm (21)(24). XOR= #

x1

x2

= x1 ⊕ x 2

1) 2) 3) 4)

1 1 -1 -1

1 -1 1 -1

1 1 -1 1

The table shows the function values and the entire learning set containing 4 input vectors and 4 values of the function, respectively. Let us put m=4 in the UBN activation function (16). This means that the activation function will take a form (7) (it is also illustrated in Fig. 7).

32

Let us start the learning process from the same weighting vector W 0 = (1,1,1) as in Example 1, Example 2 and Example 3. Iteration 1. 1) Inputs (1, 1). The weighted sum is equal to z = 1 + 1 ⋅ 1 + 1 ⋅ 1 = 3 ; PB ( z ) = PB (3) = 1 . Since f(1, 1)=1, no further correction of the weights is needed. 2) Inputs (1, -1). The weighted sum is equal to z = 1 + 1 ⋅ 1 + 1 ⋅ (−1) = 1 ; PB ( z ) = PB (1) = 1 . Since f(1, -1)= -1, we have to correct the weights. According to (24) ε q = ε 3 = i . δ = −i − 1 . Let C=1 in (21). Then we have to correct the weights according to (21):

~ 2 = 1 + 1 (−i − 1) ⋅ 1 = 2 − 1 i; w ~ 2 = 1 + 1 (−i − 1) ⋅ (−1) = 4 + 1 i . Thus ~ 2 = 1 + 1 (−i − 1) = 2 − 1 i; w w 0 1 2 3 3 3 3 3 3 3 3 3 ~ ⎛2 1 2 1 4 1 ⎞ W = ⎜ − i , − i, + i ⎟ . ⎝3 3 3 3 3 3 ⎠ The weighted sum after the correction is equal to z =

2 1 ⎛2 1 ⎞ ⎛4 1 ⎞ − i + ⎜ − i ⎟ ⋅ 1 + ⎜ + i ⎟ ⋅ (−1) = −i ; 3 3 ⎝3 3 ⎠ ⎝3 3 ⎠

PB ( z ) = PB (−i ) = −1 . Since f(1, -1)= -1, no further correction of the weights is needed.

3) Inputs (-1, 1). The weighted sum is equal to

z=

2 1 ⎛2 1 ⎞ 4 1 ⎛4 1 ⎞ ⎛4 1 ⎞ − i + ⎜ − i ⎟ ⋅ (−1) + ⎜ + i ⎟ ⋅ 1 = + i ; PB ( z ) = PB ⎜ + i ⎟ = 1 . Since f(-1, 1)= -1, we 3 3 ⎝3 3 ⎠ 3 3 ⎝3 3 ⎠ ⎝3 3 ⎠

have to correct the weights. According to (24) ε q = ε 3 = i . δ = −i − 1 . Let C=1 in (21). Then we have to correct the weights according to (21): ~ 3 = 2 − 1 i + 1 (−i − 1) = 2 − 1 − 1 i − 1 i = 1 − 2 i; w 0 3 3 3 3 3 3 3 3 3 ~ 3 = 2 − 1 i + 1 (−i − 1) ⋅ (−1) = 2 − 1 i + 1 i + 1 = 1; . w 1 3 3 3 3 3 3 3 ~ 3 = 4 + 1 i + 1 (−i − 1) ⋅ (1) = 4 + 1 i − 1 i − 1 = 1. w 2 3 3 3 3 3 3 3

1 2 ~ Thus W = ( − i, 1, 1) . 3 3 The weighted sum after the correction is equal to

z=

1 2 1 2 − i + 1 ⋅ (−1) + 1 ⋅ 1 = − i ; 3 3 3 3

⎛1 2 ⎞ PB ( z ) = PB ⎜ − i ⎟ = −1 . Since f(-1, 1)= -1, no further correction of the weights is needed. ⎝3 3 ⎠ 4) Inputs (-1, -1). The weighted sum is equal to z =

1 2 5 2 − i + 1 ⋅ (−1) + 1 ⋅ (−1) = − − i ; 3 3 3 3

⎛ 5 2 ⎞ PB ( z ) = PB ⎜ − − i ⎟ = 1 . Since f(-1, -1)= 1, no correction of the weights is needed. ⎝ 3 3 ⎠ 33

Iteration 2.

z=

1) Inputs (1, 1). The weighted sum is equal to

1 2 7 2 − i + 1 ⋅1 + 1 ⋅1 = − i ; 3 3 3 3

⎛7 2 ⎞ PB ( z ) = PB ⎜ − i ⎟ = −1 . Since f(1, 1)= 1, we have to correct the weights. According to (24) ⎝3 3 ⎠ ε q = ε 0 = 1 . δ = 1 − (−i ) = 1 + i . Let C=1 in (21). Then we have to correct the weights according to (21): ~ 1 = 1 − 2 i + 1 (1 + i ) = 1 − 2 i + 1 + 1 i = 2 − 1 i; w 0 3 3 3 3 3 3 3 3 3 ~ 1 = 1 + 1 (1 + i ) ⋅ 1 = 4 + 1 i; . w 1 3 3 3 ~ 1 = 1 + 1 (1 + i ) ⋅ 1 = 4 + 1 i. w 2 3 3 3

~ ⎛2 1 4 1 4 1 ⎞ Thus W = ⎜ − i, + i, + i ⎟ . ⎝3 3 3 3 3 3 ⎠ The weighted sum after the correction is equal to z =

2 1 ⎛4 1 ⎞ 10 2 ⎛4 1 ⎞ − i + ⎜ + i ⎟ ⋅1 + ⎜ + i ⎟ ⋅1 = + i; 3 3 ⎝3 3 ⎠ 3 3 ⎝3 3 ⎠

⎛ 10 2 ⎞ PB ( z ) = PB ⎜ + i ⎟ = 1 . Since f(1, 1)= 1, no correction of the weights is needed. ⎝3 3 ⎠ 2)

z=

Inputs

(1,

-1).

The

weighted

sum

is

equal

to

2 1 ⎛4 1 ⎞ 2 1 ⎛4 1 ⎞ ⎛2 1 ⎞ − i + ⎜ + i ⎟ ⋅ 1 + ⎜ + i ⎟ ⋅ (−1) = − i ; PB ( z ) = PB ⎜ − i ⎟ = −1 . Since f(1, -1)= -1, no 3 3 ⎝3 3 ⎠ 3 3 ⎝3 3 ⎠ ⎝3 3 ⎠

correction of the weights is needed. 3)

z=

Inputs

(-1,

1).

The

weighted

sum

is

equal

to

2 1 ⎛4 1 ⎞ 2 1 ⎛4 1 ⎞ ⎛2 1 ⎞ − i + ⎜ + i ⎟ ⋅ (−1) + ⎜ + i ⎟ ⋅ 1 = − i ; PB ( z ) = PB ⎜ − i ⎟ = −1 . Since f(-1, 1)= -1, no 3 3 ⎝3 3 ⎠ 3 3 ⎝3 3 ⎠ ⎝3 3 ⎠

correction of the weights is needed. 4)

z=

Inputs

(-1,

-1).

The

weighted

sum

is

equal

to

2 1 ⎛4 1 ⎞ ⎛4 1 ⎞ − i + ⎜ + i ⎟ ⋅ (−1) + ⎜ + i ⎟ ⋅ (−1) = −2 − i ; PB ( z ) = PB (2 − i ) = 1 . Since f(-1, -1)= 1, no 3 3 ⎝3 3 ⎠ ⎝3 3 ⎠

correction of the weights is needed. This means that the iteration process converged, there are no errors for all elements from the learning set, and the XOR function of the two variables is implemented on the single universal

~ ⎛2 1 4 1 4 1 ⎞ binary neuron using the weighting vector W = ⎜ − i, + i, + i ⎟ obtained as the result of ⎝3 3 3 3 3 3 ⎠ the learning process. 34

The last example illustrates again that nonlinearly separable functions that cannot be implemented on the threshold neuron (or the classical perceptron) can be implemented on the single universal binary neuron. Please, pay also attention to the importance of the factor

1 in the learning rule (21). It is n +1

easy to check that without it the learning iterative process presented in the Example 6 will be at least much longer or (depending on the starting weights) cannot converge at all, while the presence of this factor, which may be considered as an important part of the learning rate, ensures the quickest convergence of the learning algorithm.

X. A CLASSICAL MULTILAYER FEEDFORWARD NEURAL NETWORK We will consider here a network with may be the most popular architecture, namely a multilayer feedforward neural network (MVF), widely referred as a multilayer perceptron (MLP) [2]. We will also consider the basic principles of the backpropagation learning algorithm for this network. Typically, a MLF consists of a set of sensory units (source nodes) that constitute the input layer, one or more hidden layers of computation nodes, and an output layer of computation nodes [2]. The input signals progresses through the network in a forward direction, on a layer-by-layer basis. Usually, outputs of all neurons from the previous layer are connected to the corresponding inputs of all neurons from the following layer. This means a full connection between consecutive layers (see Fig. 12). This architecture is the result of a "universal approximator" computing model based on Kolmogorov's Theorem [25]. It has been shown in [26] that these neural networks are universal approximators. 1

2

N

Input layer

Hidden layers (1...m-1)

Fig. 12 Feedforward neural network 35

Output layer

The simplest case of a MLF is presented in Fig. 13. This network contains the input layer, a single hidden layer and the output layer consisting from a single neuron. By the way we used exactly this network consisted of the threshold neurons (classical perceptrons) for the implementation of the XOR function using threshold neurons (see Section 5, Fig. 5 and Fig. 6).

Input layer

Hidden layer

Output layer

Fig. 13 The simplest feedforward neural network So there are two main ideas behind a feedforward neural network. The first idea is a full connection architecture: the outputs of neurons from the previous layer are connected with the corresponding inputs of all neurons of the following layer. The second idea is a backpropagation learning algorithm, when the errors of the neurons from the output layer are being sequentially backpropagated through all the layers from the "right hand" side to the "left hand" side, in order to calculate the errors of all other neurons. Basically, the error backpropagation process consists of two passes through the different layers of the network: a forward pass and a backward pass. In the forward pass the input vector is applied to the sensory nodes (input layer) of the network, and its effect propagates through the network, layer by layer. Finally, a set of outputs is produced as the actual response of the network. Evidently, during the forward pass the synaptic weights of the network are all fixed. During the backward pass the synaptic weights are all adjusted in accordance with the learning rule. One complete iteration of the backpropagation process (a forward pass and a backward pass together) is often called an epoch. This algorithm is based on the generalization of the error-correcting learning rule for the case of MLF. Specifically, the actual response of the network is subtracted from a desired response to produce an error signal. This error signal is then propagated backward through the network, against the direction of synaptic connections – hence the name "backpropagation". The synaptic weights are adjusted so as to make the actual response of the network move closer to the desired response. The basic ideas of the backpropagation learning were proposed in 1986 by D.E. Rumelhart, G.E. Hilton and R.J. Williams [27]. One more common property of a major part of the feedforward networks is the use of sigmoid activation functions for its neurons. 36

Let us describe the backpropagation learning algorithm. Let us consider a multilayer neural network with traditional feedforward architecture (see Fig. 12), when the outputs of neurons of the input and hidden layers are connected with the corresponding inputs of the neurons from the following layer. Let us suppose that the network contains one input layer, m-1 hidden layers and one output layer. We will use here the following notations. Let

T km

- be a desired output of the kth neuron from the output (mth ) layer

Ykm

- be the actual output of the kth neuron from the output (mth ) layer.

Then a global error of the network related to the kth neuron of the output (mth ) layer can be calculated as follows:

δ*km = Tkm − Ykm - error for the kth neuron from output (mth ) layer.

(25)

δ* km will denote here and further a global error of the network. We have to distinguish it from the local errors δkm of the particular output neurons because each output neuron contribute to the global error equally with the hidden neurons. The learning algorithm for the classical feedforward network is derived from the consideration that the global error of the network in terms of the square error (SE) must be minimized. The (normalized) square error is defined as follows:

Ε (W ) =

1 Nm * 2 ∑ (δ km ) (W ) 2 k =1

(26)

where w denotes the weighting vectors of all the neurons of the network, N m indicates the number of output neurons and the factor

1 is used so as to simplify subsequent derivations resulting from 2

the minimization of E (W ) . It is a principal assumption that the error depends not only on the weights of the neurons at the output layer, but on all neurons of the network. The functional of the error may be defined as follows:

Εms

1 = N

N

∑E s =1

s

,

(27)

where Ems denotes the (normalized) mean square error, N is the total number of patterns in the training set and Es denotes the (normalized) square error of the network for the sth pattern. The minimization of the functional (27) is reduced to the search for those weights for all the neurons that ensure a minimal.

37

The most important problem for network training is to make the correction of the weights of all neurons in such a way that each weight wi has to be corrected by an amount ∆wi , which must be proportional to the gradient

∂E of the error function E(W) with respect to the weights [2]. ∂wi

For the next analysis, the following notation will be used. Let wijk denote the weight corresponding to the ith input of the jth neuron at the kth layer. Furthermore let

z jk , y jk and

Y jk = y jk ( z jk ) represent the weighted sum (of the input signals) the activation function and the output signal of the jth neuron at the kth layer, respectively. Let N k be the number of neurons in the kth layer (notice that this means that neurons of the k+1st layer will have exactly N k inputs.) Finally recall that

x1 ,..., x n

denote the inputs to the network (and as such, also the inputs to the neurons of

the first layer.) Then applying the chain rule we have:

∂E (W ) ∂E (W ) ∂y jm ∂z jm = , i = 0,1,..., N m −1 , ∂wijm ∂y jm ∂z jm ∂wijm where

∂ ∂E (W ) = ∂y jm ∂y jm

⎛1 1 ∂ ∗ 2⎞ ∗ 2 (δ km ) = ) ⎟= ∑ ⎜ ∑ (δ km y 2 ∂ 2 k ⎝ k ⎠ jm 1 ∂ ∂ ∂ = (δ ∗jm ) 2 = δ *jm (δ ∗jm ) = δ ∗jm (T jm − Y jm ) = −δ ∗jm ; 2 ∂y jm ∂y jm ∂y jm

∂y jm ∂z jm

= y′jm ( z jm )

and

∂z jm ∂wi

jm

=

∂ w0jm + w1jmY1( m−1) + ... + wnjmYn ( m−1) = Yi ( m−1) , i = 0, 1, ..., N m−1 jm ∂wi

(

)

Then we obtain the following:

∂E (W ) ∂E (W ) ∂y jm ∂z jm = = −δ∗jm y′jm ( z jm )Yi ( m −1) , i = 0, 1, ..., N m −1 ; Y0 ( m −1) ≡ 1 jm jm ∂wi ∂y jm ∂z jm ∂wi and

∆wi

jm

∗ ∂E (W ) ⎧⎪αδ jm y′jm ( z jm )Yi ( m −1) =⎨ ∗ = −α ∂wijm ⎪⎩ αδ jm y′jm ( z jm )

38

i = 1,..., N m −1 i = 0,

(28)

where α > 0 is a coefficient representing a learning rate. The part of the rate of change of the square error E(W) with respect to the input weight of a neuron, which is independent of the value of the corresponding input signal to that neuron, will be called the local error (or simply the error) of that neuron. Accordingly, the local error of the jth neuron of the output layer, denoted by δ jm , is given by

δ jm = y′jm ( z jm ) ⋅ δ∗jm ,

(29)

To propagate the error to the neurons of all hidden layers, a sequential error backpropagation through the network from the mth layer to the m-1st one, from the m-1st to the m-2nd one, ..., and from the 3rd to the 2nd one will be done. When the error is propagated from the layer k+1 to the layer k, the local error of each neuron of the k+1st layer is multiplied by the weight of the path connecting the corresponding input of this neuron at the k+1st layer with the corresponding output of the neuron at the kth layer. For example, the error

δ j ( k +1)

the lth neuron at the kth layer, multiplying

of the jth neuron at the k+1st layer is propagated to

δ j ( k +1)

with wlj ( k +1) , namely the weight corresponding to

the lth input of the jth neuron at the k+1st layer. This analysis leads to: N k +1

δ lk = ylk′ ( zlk ) ∑ δi ( k +1) wli ( k +1) , k=1, …, m-1- error for the lth neuron from the kth layer.

(30)

i =1

It should be mentioned that equations (28)-(30) are obtained for the general case, without the connection with some specific activation function. On the other hand, as it was mentioned above a sigmoid activation function is commonly used as the activation function for the MVF. A derivative of the sigmoid activation function

y ( z ) = ϕ( z ) =

1 1 + e−z

is the following:

′ 1 ⎞ ⎛ − z −1 ′ y ′( z ) = ϕ′( z ) = ⎜ e ( 1 ) = + = −(1 + e − z ) − 2 ⋅ (−e − z ) = ⎟ −z ⎝1+ e ⎠ e −z e−z y z = = ( ) = y ( z )(1 − y ( z ) ) (1 + e − z )(1 + e − z ) (1 + e − z )

(

)

because 1 1 + e−z −1 e−z = 1 − y( z) = 1 − = . 1 + e −z 1 + e −z 1 + e −z Thus y ′( z ) = ϕ′( z ) = y ( z )(1 − y ( z ) ) and substituting this to (30) we obtain the equation for the error of the MVF hidden neurons with the sigmoid activation function: 39

N k +1

δ lk = ylk ( zlk ) ⋅ (1 − ylk ( zlk ) )∑ δ i ( k +1) wli ( k +1) , k=1, …, m-1 i =1

th

th

(31)

error for the l neuron from the k layer of the MVF based on the neurons with the sigmoid activation function.

XI. A MULTILAYER FEEDFORWARD NEURAL NETWORK BASED ON MULTIVALUED NEURONS (MLMVN) Multilayer feedforward networks have been applied successfully to solve a number of difficult and diverse problems [2]. At the same time, there is at least one important problem, which is still open. Although it is proven that a feedforward network is a universal approximator, a practical implementation of learning often is a very complicated task. It depends on several factors: the complexity of the mapping to be implemented, the chosen structure of the network (the number of hidden layers, the number of neurons on each layer), and the control over the learning process (usually this control is being implemented using the learning rate). Increasing both the number of hidden layers and neurons on them, we can make the network more flexible to the mapping to be implemented. However, increasing the number of hidden neurons increases the risk of overfitting. An alternative solution, which preserves a traditional MLF architecture, is based on the use of the different basic neurons. We will consider here a multilayer neural network based on multivalued neurons (MLMVN) described in [24]. It was mentioned above that the functionality of a MVN is higher than the functionality of a neuron with the sigmoid activation function [21]. We also convinced that a MVN has an effective and simple learning algorithm based on the error-correction learning rule. So it is very attractive to consider a multilayer neural network with the same architecture as a MLF, but with MVN as a basic neuron. A backpropagation learning algorithm for this network is based on the same background as traditional backpropagation learning, but its implementation has some important distinctive features. It is also necessary to say that its efficiency is very high. The MVN activation function (20) is not differentiable. It means that the formulas (29)-(30) that determine the error calculation for a MLF and (28) that determine the weights correction for a MLF cannot be applied for the case of MLMVN because all of them contain the derivative of the activation function. However, for the MVN-based network this is not a problem! Let us consider a MLMVN with the standard architecture (Fig. 12). As it was shown above for the single neuron, the differentiability of the MVN activation function is not required for learning. Since MVN learning is reduced to the movement along the unit circle (see Section 9), the correction 40

of the weights is completely determined by the neuron's error. The same property is true not only for the single MVN, but for the network (MLMVN). The errors of all the neurons from MLMVN are completely determined by the global errors of the network (25). As well as classical MLF learning, MLMVN learning is based on the minimization of the error functional (27). Let us generalize all considerations that we made above for the single neuron to the case of a MLMVN. To make things easier, let us start from the simplest case, which is a network with two neurons (one hidden neuron and one output neuron) with a single input (see Fig. 14).

x1

Y11

11

12

Y12

Fig. 14 The simplest MLMVN The error on the network output is equal to δ = T − Y12 , where T is a desired output. We need *

to understand, how we can obtain the errors for each particular neuron, backpropagating the error

δ* form the right-hand side to the left-hand side. Let us suppose that the errors for the neurons 11 and 12 are already known. We will use the learning rule (23) for the correction of the weights. Let us suppose that the neuron 11 from the 1st layer is already trained. Let us now correct the weights for the neuron 12 (the output neuron) and estimate its weighted sum (recall eq. (22)):

~z = ( w12 + 1 δ ) + ⎛⎜ w12 + 1 δ (Y + δ ) ⎞⎟(Y + δ ) = 12 0 12 1 12 11 11 11 11 2 2 ⎝ ⎠ 1 1 1 1 12 = w012 + δ12 + w112Y11 + w112 δ11 + δ12 = w12 δ12 + δ12 + w112 δ11 = 0 + w1 Y11 + 14243 2 2 2 2

(32)

z12

= z12 + where

1 1 δ12 + δ12 + w112 δ11 = z12 + δ12 + w112 δ11 , 2 2

( w012 , w112 ) is an initial weighting vector of the neuron 12, Y11 is an initial output of the

neuron 11, Y12 is an initial output of the neuron 12, before the correction,

z12

is the weighted sum on the neuron 12

δ11 is the unknown error of the neuron 11 and δ12 is the unknown error of

the neuron 12. To ensure that after the correction procedure the weighted sum of the neuron 12 will be exactly equal to z12 + δ , it is clear from (32) that we need to satisfy the following: *

δ12 + w112 δ11 = δ* .

41

(33)

Of course, if we consider (33) as a formal equation, we will not get something useful. This equation has an infinite number of solutions, while we have to find among them a single solution, which will correctly represent the local errors of each neuron through the global error of the network and through the errors of the neurons of the "oldest" layers. Let us come back to the learning rule (23) for the single MVN. In this rule ∆W =

Cm ⎛ q z ⎞ 1 ⎜⎜ ε ⎟⎟ X . It contains a factor in order to distribute the contribution of the (n+1) ⎝ | z | ⎠ (n + 1)

correction uniformly among all n+1 weights w0 , w1 ,..., wn . It is easy to see that if we would omit this factor then the corrected weighted sum would not be equal to z + δ (see (22)), but to

z + (n + 1)δ . On the other hand, since all the inputs are equitable, it will be correct and natural that during the correction procedure ∆W will be distributed among the weights uniformly, so it must be shared among the weights. This makes clear the important role of the factor

1 in (23). (n + 1)

If we have not a single neuron, but a feedforward network, we have to take into account the same property. It has to be used, to implement properly a backpropagation of the error through the ~ ~ network. It means that if the error of a neuron on the layer j is equal to δ , this δ must contain a factor equal to

1 , where s j = N j −1 + 1 is the number of neurons whose outputs are connected to the sj

inputs of the considered neuron ( N i is the number of neurons on the layer i, and all of this neurons are connected to the considered neuron) incremented by 1 (the considered neuron itself). This ensures sharing of the error among all the neurons on which the error of the considered neuron depends. In other words, the error of each neuron is uniformly distributed among the neurons connected to it and itself. It should be mentioned that for the 1st hidden layer s1 = 1 because there is no previous hidden layer, and there are no neurons, with which the error may be shared, respectively. If we apply this rule to our network (see Fig. 14), then we will conclude that contain in its expression through a global error δ a factor *

1 (the input of the neuron 12 is 2

connected with 1 neuron and we have to share the error with the neuron 12 itself), while not contain the mentioned factor "de-facto", because

δ12 must

δ11 will

1 = 1 . It is clear now that for the neuron 12 we 1

have

1 δ12 = δ* . 2

42

(34)

We have to take also into account that the error δ11 is a result of backpropagation of the error δ12 to the first layer (to the neuron 11). On the other hand, from (33) we obtain:

δ* − δ12 δ11 = = (δ* − δ12 ) w112 12 w1

( )

But from (34)

−1

=

(δ* − δ12 ) w112 w112

2

,

(35)

δ* = 2δ12 and therefore from (35) we obtain the following expression for δ11 : δ11 =

(δ* − δ12 ) w112 12 2 1

w

=

(2δ12 − δ12 ) w112 12 2 1

w

(36) is not only a formal expression for

=

δ12 w112 12 2 1

w

( )

= δ12 w112

−1

.

(36)

δ11 , but it leads us to the following important conclusion:

during a backpropagation procedure the backpropagated error must be multiplied by the inverse values of the corresponding weights. Let us now substitute

δ11 and δ12 from (34) and (36) into (32):

( )

* * ⎛ 12 −1 δ* ⎞ 1 * 12 12 ⎜ w1 ~ ⎟ = z12 + δ + δ = z12 + δ* . z12 = z12 + δ12 + w1 δ11 = z12 + δ + w1 ⎜ ⎟ 2 2 2 2 ⎝ ⎠

So, we obtain exactly the result that is our target:

(37)

~z = z + δ* . 12 12

Now it will not be difficult to generalize everything that we considered for the simplest network (Fig. 14) to the network with an arbitrary number of layers and an arbitrary number of neurons in each layer. We will obtain the following. The global errors of the whole network are determined by (25). For the errors of the mth (output) layer neurons:

δ km =

1 * δ km , sm

(38)

where km specifies a kth neuron of the mth layer; s m = N m−1 + 1 (the number of all neurons on the previous layer (m-1, to which the error is backpropagated) incremented by 1). For the errors of the hidden layers neurons:

1 δ kj = sj

N j +1

∑ i =1

1 ij +1 2 k

w

ij +1 k

δ ij +1 ( w

1 )= sj

N j +1

∑δ i =1

ij +1

( wkij +1 ) −1 ,

(39)

where kj specifies a kth neuron of the jth layer (j=1,…,m-1); s j = N j −1 + 1, j = 2,..., m; s1 = 1 (the number of all neurons on the previous layer (previous to j, to which the error is backpropagated) incremented by 1). It should be mentioned that the backpropagation rule (39) is based on the same 43

heuristic assumption as for the classical backpropagation. According to this assumption we suppose that the error of each neuron from the previous (jth) layer depends on the errors of all neurons from the following (j+1st) layer. After calculation of the errors, the weights for all neurons of the network must be corrected. To do it, we can use the learning rule (23) applying it sequentially to all layers of the network from the first hidden layer to the output one. We have also take into account the following consideration. For the output layer neurons, we have the exact errors calculated according to (25), while for all the hidden neurons the errors are obtained according to the heuristic rule. This may cause a situation, where either the weighted sum for the hidden neurons (more exactly, the absolute value of the weighted sum) may become a not-smooth function with dramatically high jumps, or the hidden neuron output will be close to some constant with very small variations around it. In both cases thousands and even the hundreds of thousands of additional steps for the weights adjustment will be required. To avoid this situation, we can use a factor

1 (the inverse value to the current value of z

the weighted sum for this given neuron) as a variable part of the learning rate for the hidden neurons. Thus we obtain finally the following rules for the MLMVN weights adjustment [24]: Correction rule for the neurons from the mth (output) layer (kth neuron of mth layer):

~ km = wkm + Ckm δ Y~ , i = 1,..., n w km im −1 i i (n + 1) ~ km = wkm + Ckm δ w 0 0 km (n + 1)

(40)

Correction rule for the neurons from the 2nd till m-1st layer (kth neuron of the jth layer (j=2, …, m-1):

~ kj = wkj + w i i ~ kj = wkj + w 0 0

Ckj (n + 1) | z kj | Ckj (n + 1) | z kj |

δ kj xi , i = 1,..., n (41)

δ kj

Correction rule for the neurons from the 1st hidden layer:

~ k 1 = wk 1 + w i i

Ck 1 δ k1 xi , i = 1,..., n (n + 1) | zkj |

Ck 1 ~ k 1 = wk 1 + w δ k1 0 0 (n + 1) | zkj | 44

(42)

XII. APPLICATION OF A MULTILAYER FEEDFORWARD NEURAL NETWORK BASED ON MULTI-VALUED NEURONS (MLMVN) IN TIME SERIES PREDICTION The time series prediction is a popular applied problem, which often it is possible to solve using neural networks. Any time series consists of data that represent some process, which is changing in time. The data samples are taken within the equal time intervals. A nature of this data can be very different: it can be a financial time series (currency exchange rate, stock indexes), meteorological time series (temperature, pressure, wind speed), etc. A problem of time series prediction is usually formulated as follows. Let series. It is necessary to predict the next values function of the previous n values:

x1 ,..., x n be a time

xn+1 ,..., xn+ k in the assumption that a value xi is a

xi −n ,..., xi −1 : xi = f ( xi − n ,..., xi −1 ) .

In these terms a problem is to predict the following values:

xn+1 = f ( x1 ,..., xn ) xn+ 2 = f ( x2 ,..., xn+1 ) ... xn+ k = f ( xk −1 ,..., xn+ k −1 ). Since a function f is unknown, and its existence is only our assumption, this means that its analytic form can not be found. Let us suppose that n+s members of the time series are known. A commonly used solution is to train a neural network using these known samples x1 ,..., x n , x n +1 ,...., x n + s to predict the values x n +1 ,...., x n + s as follows:

xn+1 = f ( x1 ,..., xn ) xn+ 2 = f ( x2 ,..., xn+1 ) ... xn+ s = f ( xs −1 ,..., xn+ s −1 ). Then it will be possible to predict the unknown values

x n + s +1 ,...., x n + s + k ,...

A feedforward neural network usually is used for solving the problem. We will show here, how it is possible to solve the problem using a feedforward neural network based on multi-valued neurons (MLMVN). A traditional and commonly test for the time series prediction is the prediction of Mackey-Glass time series. This time series is generated by the chaotic Mackey-Glass differential delay equation defined as follows [28]: 45

dx(t ) 0.2 x(t − τ ) = − 0.1x(t ) + n(t ), dt 1 + x10 (t − τ )

(43)

where n(t) is a uniform noise (it is possible that n(t)=0). x(t ) is quasi-periodic, and chosing τ = 17 it becomes chaotic [28]. This means that only short term forecasts are feasible. We will use here exactly τ = 17 . To integrate the equation (43) and to generate the data, we used an initial condition

x(0) = 1.2 and a time step ∆t = 1 . The Runge-Kutta method was used for the integration of the equation (43). The data is sampled every 6 points. The task of prediction is to predict x(t + 6) from

x(t ), x(t − 6), x(t − 12), x(t − 18) . We generated 1000 points data set. The first 500 points were used for training (a fragment of the first 250 points is shown in Fig. 15a) and the next 500 points were used for testing (a fragment of the last 250 points is shown in Fig. 15b). The true values of x(t + 6) were used as the target values during training. The generated data was transformed linearly in order to change its range to [0,2 π[ and therefore, to transform it to the form acceptable for the MVN. This transformation is performed as follows: if y ∈ [a, b] then yˆ = yˆ min +

y−a ( yˆ max − yˆ min ) , b−a

(44)

where yˆ min and yˆ max determine minimum and maximum values of the arguments of the transformed data (as the arguments of the complex numbers): yˆ min = 0 + η1 ; yˆ max = 2π − η2 , where η1 and η 2 are the values that make it possible to create a not used zone in the region of the border " 0 = 2π ". This is necessary to do, to avoid the permutation of the values that are close to a and b after their transformation to the complex numbers that are lying on the unit circle. Finally any y ∈ [a, b] is transformed to the complex numbers that are lying on the unit circle as follows:

x = exp( iyˆ ) , where



is defined .in (44). The inverse transformation of the data to their natural real-valued form (to the number belonging to the interval [a, b ] ) is performed as follows:

y=a+

yˆ − yˆ min (b − a) . yˆ max − yˆ min

Since the root mean square error (RMSE) is a commonly used estimation of the quality for the Mackey-Glass time series prediction, we also use it here. We do not require a convergence of the 46

training algorithm to the zero error. Since RMSE is a usual estimator for the prediction quality, it also was used for the training control. Thus instead of the MSE criterion (27), we used the following RMSE criterion for the convergence of the training algorithm:

1 N

N

∑∑ (δ s =1

) (W ) =

* 2 km s

k

1 N

N

∑E s =1

s

≤λ,

(45)

where ε determines a maximum possible RMSE for the training data.

(a) Mackey-Glass time series: a fragment of the first 250 points of the training data

(b) Mackey-Glass time series: a fragment of the last 250 points of the testing data

(c) training error (RMSE=0.0032)

(d) testing error (RMSE=0.0063)

Fig. 15 Mackey-Glass time series prediction A neural network with one hidden layer and one output layer containing a single output neuron (see Fig. 13) was used in these experiments. Their results are summarized in Table VII. For each of the three series of experiments we made 30 independent runs of training and prediction. Our 47

experiments show that choosing a smaller

λ in (45) it is possible to decrease the RMSE for the

testing data significantly. The results of training and prediction are illustrated in Fig. 15c and Fig. 15d, respectively. Since both the testing and prediction errors are very small and it is practically impossible to show the difference among the actual and predicted values at the same graph, Fig. 15c and Fig. 15d show not the actual and predicted data, but the error among them. Fig. 15c presents the error on the training set after the convergence of the training algorithm for the network containing 50 hidden neurons. ε = 0.0035 was used as a maximum possible RMSE in (45). An actual RMSE on the training set at the moment, when training was stopped, was equal to 0.0032. Fig. 15d presents the error on the testing set (RMSE=0.0063, which is a median value of the 30 independent experiments). To estimate a training time, one can base on the following data for the networks containing 50 and 40 hidden neurons, respectively. 100000 epochs require 50 minutes for the first network and 40 minutes for the second one on a PC with a Pentium-III 600 MHz CPU. Table VII The results of Mackey-Glass time series prediction using MLMVN

λ

# of neurons on the hidden layer

- a maximum possible RMSE in (45) Actual RMSE for the training set (min - max) Min Max RMSE for the testing Median set Average SD Min Number of Max training Median epochs Average

50

50

40

0.0035

0.0056

0.0056

0.0032 - 0.0035

0.0053 – 0.0056

0.0053 – 0.0056

0.0056 0.0083 0.0063 0.0066 0.0009 95381 272660 145137 162180

0.0083 0.0101 0.0089 0.0089 0.0005 24754 116690 56295 58903

0.0086 0.0125 0.0097 0.0098 0.0011 34406 137860 62056 70051

Comparing the results of the Mackey-Glass time series prediction using MLMVN to the results obtained using the classical MLF, we have to conclude that MLMVN outperforms it (see Table VIII).

Table VIII Comparison of Mackey-Glass time series prediction using MLMVN with other models MLMVN MLMVN Classical MLF, average RMSE (taken from [29]) min average 0.02 0.0056 0.0066 The additional important advantage of MLMVN is an opportunity to control a level of the prediction error by choosing an appropriate value of the maximum training error λ in (45).

48

XIII. CELLULAR NEURAL NETWORKS: BASIC PRINCIPLES It is well known that many useful algorithms of the spatial domain image filtering are reduced to the convolution of a local window with some weighting kernel. This processing within a local window around a pixel might be organized simultaneously for all the pixels, independently on each other. Thus it is natural to perform this process using an appropriate neural network. The most appropriate neural network for solving of these problems is the Cellular Neural Network (CNN). The CNN has been introduced in [17] as a special high-speed parallel neural structure for image processing and recognition. The originality of the CNN in comparison with other neural networks is first of all in its local connectivity. All other neural networks usually are the fully connected networks (e.g., the Hopfield network), or multilayer neural networks with full connections between neurons of the neighboring layers (e.g., multilayer feedforward neural network considered above).

(i, j)

Fig. 16 CNN of a dimension 3x5 with the local connections in a 3x3 neighborhood: each neuron is connected with 8 neurons around it and with itself A CNN concept supposes a cellular structure of the network: each neuron is connected only with the neurons from its nearest neighborhood (see Fig. 16). This means that the corresponding inputs of each neuron are connected with outputs of neurons from the nearest rxr neighborhood including the own output (a feedback connection). On the other hand, the output of each neuron is connected with the inputs of neurons from the same neighborhood. The neurons of such a network are also often called cells. Depending on the type of neurons, from which the CNN is built, it is possible to distinguish continuous-time CNN (CTCNN) [17], discrete-time CNN (DTCNN) [31] (oriented especially on the binary image processing), CNN based on multi-valued neurons (CNN-MVN) [21] and CNN based on universal binary neurons (CNN-UBN) [21]. The CNN-MVN implements those image processing algorithms that may be described by multiple-valued threshold functions. The CNNUBN, respectively, implements those image processing algorithms that may be described by some Boolean function, which is not necessarily the threshold one. CNN local connectivity (see Fig. 16) is a key point for the implementation of the different spatial domain image filtering algorithms. 49

The dynamics of the continuous time CNN (CTCNN) neuron is described by the following equation [30]: r r ⎡ r r ⎤ yij (t + 1) = F ⎢ ∑∑ Akl yi+k, j+l (t ) +∑∑ Bkl xi+k, j+l (t ) + I ⎥ . k=-r l=-r ⎣ k=-r l=-r ⎦

(46)

The dynamics of the discrete time CNN (DTCNN) neuron is described by the following equation [31]: r r ⎤ ⎡ r r yij (t + 1) = Fd ⎢ ∑∑ Akl yi+k, j+l (t ) +∑∑ Bkl xi+k, j+l (t ) + I ⎥ . k=-r l=-r ⎦ ⎣ k=-r l=-r

(47)

Here i, j are the coordinates of the ijth neuron, y is the neuron’s output, x is the neuron’s input, r is the size of a neuron’s nearest neighborhood (2r x 2r in the terms of equations (46) and (47)), t, t+1 are the time slots t and t+1, respectively, A, B are the 2r x 2r matrices, which define the synaptic weights (feedback and control templates, respectively), I is a bias, F and Fd are the activation functions of the CTCNN and DTCNN neuron, respectively. It is seen from equations (46) and (47) that a CNN can implement the iterative processing, performing the same processing algorithm several times at the time slots t, t+1, t+2, …, applying this algorithm every time to the result of the previous iteration. For the CTCNN the activation function is the following:

F ( z) =

| z + 1 | - | z -1 | . 2

(48)

For the DTCNN the activation function is the following:

Fd = sgn( z ) .

(49)

The activation function (48) is a piecewise linear function. The activation function (49) is a simple sign function, and therefore DTCNN is based on the traditional threshold neurons. Evidently, the linear part of the function (48) ensures the implementation of any algorithm of linear 2D filtering in spatial domain, which is reduced to the linear convolution with the weighting window, on the CTCNN neuron. For example, the following 3 x 3 templates ⎛ 0.11 0.11 0.11⎞ ⎛ 0 0 0⎞ ⎟ ⎟ ⎜ ⎜ A = ⎜ 0 0 0 ⎟; B = ⎜ 0.11 0.11 0.11⎟ ⎜ 0.11 0.11 0.11⎟ ⎜ 0 0 0⎟ ⎠ ⎠ ⎝ ⎝

50

⎛ 1 1 ⎞ ⎜ ∑ ∑ xi + k , j + l ⎟ ⎝ k = −1l = −1 ⎠ implement a simple mean filter y ij = (take into account that 0.11=1/9). 9 The next pair of templates

⎛ 0 0 0⎞ ⎛1 2 1⎞ ⎛ 0 0 0⎞ ⎛ −1 0 1⎞ ⎜ ⎟ 1 ⎜ ⎟ ⎜ ⎟ 1 ⎜ ⎟ 1 A = ⎜ 0 0 0 ⎟; B = ⎜ − 2 0 2 ⎟ ; A = ⎜ 0 0 0 ⎟; B = ⎜ 0 0 0 ⎟ ⎜ 0 0 0⎟ ⎜ − 1 2 − 1⎟ ⎜ 0 0 0⎟ ⎜ −1 0 1⎟ ⎝ ⎠ ⎝ ⎠ ⎝ ⎠ ⎝ ⎠ 1

implement the two parts of the well known Sobel edge detection operator, which performs a 2D spatial gradient measurement on an image and so emphasizes regions of high spatial gradient that correspond to edges. Typically it is used to find the approximate absolute gradient magnitude at each point in an input grayscale image. The example of edge detection using this technique is shown in Fig. 19d. It should be mentioned that we just used a new method for obtaining the weights (the CNN weighting templates). This method is called design [30]. Indeed, we did not use here any learning or synthesis algorithm. We obtained our weights from the consideration that they have to implement an priori known algorithm. This method is often used for the CNN in order to transfer a particular image processing algorithm to the corresponding weighting template, which implements this algorithm on the CNN. A number of other different templates that implement some useful image processing algorithms can be found in [30]. DTCNN ensures the implementation of binary image processing algorithms that may be described by the threshold Boolean functions. Let us now consider CNN-MVN and CNN-UBN. Both these networks have the same local connection structure (see Fig. 16) of a dimension N x M. Consider MVN or UBN as a basic neuron of such a network. Each cell of the CNN-MVN and CNN-UBN, respectively, performs the following correspondence between neuron’s inputs and output: r r ⎡ ⎤ Yi j (t+1)= P ⎢ w0 + ∑ ∑ wkli j xkli j (t ) ⎥ ; k =− rl =− r ⎣ ⎦

(50)

r r ⎡ ⎤ Yi j (t+1)= PB ⎢ w0 + ∑ ∑ wkli j xkli j (t ) ⎥ , k =− rl =− r ⎣ ⎦

(51)

where: i, j are the two-dimensional coordinates of a neuron; Yij is the neuron's output; wklij is the synaptic weight corresponding to the input of ijth neuron, to which a signal from the output of the klth neuron is transferred (one may make a comparison to the elements of B-template in the traditional CNN (see (46) and (47) ); xklij is the input of the ijth neuron, to which a signal from the 51

output of the klth neuron is transferred; P is either the activation function (18) of the discrete-valued MVN or (20) of the continuous-valued MVN, PB is the activation function (16) of the UBN; r



means that each neuron is only connected with the neurons from its nearest r-neighborhood,

m=− r

as it is usual for CNN. A feedback template for the CNN-MVN and CNN-MVN is not used. Thus a single weighting template, which is used for the CNN-MVN and CNN-MVN may be written in the following form:

⎛ wiij− r , j − r ⎜ w0 ; W = ⎜ ... ⎜ ij ⎝ wi + r , j − r

... w ijij ...

wiij− r , j + r ⎞ ⎟ ... ⎟ , ⎟ wiij+ r , j + r ⎠

where W is a rxr matrix.

XIV. CELLULAR NEURAL NETWORKS: APPLICATION IN IMAGE PROCESSING Let us show how the CNN-MVN and the CNN-UBN can implement several original and effective image processing algorithms. 1) Multi-valued nonlinear filtering. 2D multi-valued nonlinear filter (MVF) was introduced in [21] and then developed in [32] [33]. It is based on the nonlinearity of the multi-valued neuron activation function, which determines a specific nonlinear averaging and can be effectively used for the noise reduction. Let 0 ≤ B ≤ k − 1 be a dynamic range of a 2D signal. Let us consider a set of the kth roots of unity. We can put a root of unity to the correspondence with the integer-valued brightness B as follows: (52)

e B = exp(i 2πB/k )=Y . Thus, we have the univalent mapping

B ↔ ε B , where ε is a primitive kth root of unity.

A two-dimensional Multi-Valued Filter (MVF) is defined by the following formula:

) Bij = P( w0 +

∑w Y

kl kl i − n ≤ k ≤i + n j − m≤ l ≤ j + m

)

(53)

where P is either the activation function (18) of the discrete-valued multi-valued neuron or the activation function (20) of the continuous-valued multi-valued neuron, Ykl is obtained from Bkl according to (52), I ,j are the coordinates of the filtered pixel, n × m is a filter window, wkl are the weighting coefficients (complex-valued in general). Evidently, (16) defines a class of filters. Each of them is defined by the particular by the particular weighting template. 52

Comparing (53), which defines a MVF, with (50), which defines the CNN-MVN it is easy to conclude that the CNN-MVN is the most appropriate structure for the MVF implementation. The most simple, but effective template for the reduction of the additive and multiplicative (speckle) Gaussian noise is the following: ⎛1 1 1⎞ ⎜ ⎟ W = ⎜1 G 1⎟ , ⎜1 1 1⎟ ⎝ ⎠

(54)

where G is a parameter. The template (54) is designed from the following considerations. If G = 0 and all other weights are equal to 1 then a maximal signal averaging is achieved. If G > 1 then the image boundaries will be preserved with more accuracy. At the same time for G > 32 the filter practically will not change a signal. Let us consider an example. It shows how zero-mean Gaussian noise with a dispersion equal to 0.3σ (σ is a dispersion of the original clean image) can be reduced using a MVF implemented on the CNN-MVN software simulator. The results are presented in Fig. 17. The enhanced differences between a noisy image and the resulting images are also shown.

Fig. 17. Noise reduction using multi-valued filtering implemented on the CNN-MVN: (a) the original image; (b) noisy image (Gaussian noise, $0.3\sigma$), PSNR = 20.69; (c) MVF, 3×3, $G=4$ in (54), PSNR = 29.05; (d) difference between images (b) and (c); (e) MVF, 3×3, $G=16$, 2 iterations, PSNR = 28.50; (f) difference between images (b) and (e).
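For reference, the PSNR values quoted here follow the usual definition (stated for completeness under the assumption of 8-bit images with the peak value 255):

$$\mathrm{PSNR}=10\log_{10}\frac{255^{2}}{\mathrm{MSE}}\ \mathrm{dB},$$

where MSE is the mean squared error between the processed image and the original clean image.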

This example shows that the MVF is a highly effective noise suppressor. A significant increase of the PSNR (peak signal-to-noise ratio) and a very accurate preservation of the image boundaries (see Fig. 17d, f) after the processing are the most important features of the MVF. In terms of PSNR the MVF outperforms, for example, a rank-order filter and other order-statistic filters.

2) Multi-valued frequency correction. A multi-valued filter is also a highly effective means for solving such problems as image sharpening and the extraction of image details against a prevailing background. To achieve these effects, it is necessary to amplify the high and medium frequencies while preserving the low ones. The corresponding multi-valued filters are defined in the following way [32], [33]:

$$\hat{B}_{ij}=P\!\left[c+(G_1+G_2)\,Y_{ij}+(G_3-G_2)\sum_{Y_{kl}\in R_{ij}}Y_{kl}\right], \qquad (55)$$

where $Y_{kl}$ is obtained from $B_{kl}$ according to (52), $R_{ij}$ is a local window around the processed pixel $Y_{ij}$; $B_{ij}$ ($Y_{ij}$) and $\hat{B}_{ij}$ are the signal values in the $ij$th pixel before and after the processing, respectively; $G_1, G_3$ are the coefficients, which define the correction of the medium frequencies, $G_2$ defines the correction of the high frequencies, and $c$ is a constant. The filter (55) solves a global frequency correction problem. Its application to an image leads to the extraction of details whose sizes are approximately equal to the size of the filter window. The estimation for the values of the weighting coefficients in (55) is the following [33]:

$$\frac{nm}{2}\le G_2<nm,\quad G_2-1<G_3<G_2,\quad -G_3<G_1<2nm,$$

where $n \times m$ is the filter window. The MVF for the high-frequency correction may be obtained from (55) with $G_3=0$:

$$\hat{B}_{ij}=P\!\left[c+(G_1+G_2)\,Y_{ij}-G_2\sum_{Y_{kl}\in R_{ij}}Y_{kl}\right], \qquad (56)$$

where $Y_{kl}$ is obtained from $B_{kl}$ according to (52), $R_{ij}$ is a local window around the processed pixel $Y_{ij}$; $B_{ij}$ ($Y_{ij}$) and $\hat{B}_{ij}$ are the signal values in the $ij$th pixel before and after the processing, respectively; $G_1$ and $G_2$ are the correction coefficients, and $c$ is a constant. The filter (56) extracts the smallest image details; thus, the most appropriate size for the window of this filter is 3×3. The estimation for the values of the weighting coefficients in (56) is the following [33]:

$$\frac{nm}{2}\le G_1<2nm,\quad 0<G_2<1.$$

The filters (55) and (56) may be implemented on the CNN-MVN using the following weighting template ($G_3=0$ for (56)):


$$w_0=c;\quad W=\begin{pmatrix}G_3-G_2&\cdots&G_3-G_2\\ \vdots&G_1+G_3&\vdots\\ G_3-G_2&\cdots&G_3-G_2\end{pmatrix}. \qquad (57)$$
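To make the estimates concrete, here is a minimal sketch that builds the template (57) (the helper name and the particular coefficients are illustrative; for a 3×3 window $nm=9$, so, e.g., $G_2=5$, $G_3=4.5$, $G_1=10$ satisfy the bounds above):

```python
import numpy as np

def freq_correction_template(G1, G2, G3, n=3, m=3, c=0.0):
    """Build the weighting template (57) for the frequency-correction filter
    (55): every off-center weight equals G3 - G2, the center weight equals
    G1 + G3. Setting G3 = 0 yields the high-frequency filter (56)."""
    W = np.full((n, m), G3 - G2, dtype=complex)
    W[n // 2, m // 2] = G1 + G3                # center of the window
    return c, W                                # w0 = c and the template W

# Example for a 3x3 window (nm = 9):
# nm/2 <= G2 < nm, G2 - 1 < G3 < G2, -G3 < G1 < 2nm
w0, W = freq_correction_template(G1=10.0, G2=5.0, G3=4.5)
```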

Fig. 18 shows that, using the frequency correction algorithm and its CNN-MVN implementation, it is possible to obtain an image with clearly visible details from an initial image of very poor quality.

Fig. 18. Extraction of image details using multi-valued filtering implemented on the CNN-MVN: (a) the original X-ray image (tumor of lung); (c) MVF global frequency correction, window size 35×35, $G_1=-5.0$, $G_2=700.0$, $G_3=699.5$ in (57).

There are some interesting and useful image processing algorithms that can be described by Boolean functions. For example, there is an original edge detection algorithm, which is called precise edge detection [21], [32], [33]. The edges on a binary image corresponding to the upward brightness jumps may be detected using the following Boolean function, which operates with the values from a local 3×3 window around each pixel of the image:

$$f\begin{bmatrix}x_1&x_2&x_3\\ x_4&x_5&x_6\\ x_7&x_8&x_9\end{bmatrix}=x_5\,\&\,(\bar{x}_1\vee\bar{x}_2\vee\bar{x}_3\vee\bar{x}_4\vee\bar{x}_6\vee\bar{x}_7\vee\bar{x}_8\vee\bar{x}_9). \qquad (58)$$

Respectively, the edges on a binary image corresponding to the downward brightness jumps may be detected using the following Boolean function, which operates with the values from a local 3×3 window around each pixel of the image:

$$f\begin{bmatrix}x_1&x_2&x_3\\ x_4&x_5&x_6\\ x_7&x_8&x_9\end{bmatrix}=\bar{x}_5\,\&\,(x_1\vee x_2\vee x_3\vee x_4\vee x_6\vee x_7\vee x_8\vee x_9). \qquad (59)$$
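A minimal sketch of this precise edge detection on a binary image (the function name is illustrative, and the code follows the reading of (58) and (59) given above: a pixel is an upward edge when it equals 1 and at least one of its eight neighbors equals 0, and a downward edge in the symmetric case):

```python
import numpy as np

def precise_edges(binary, upward=True):
    """Apply the Boolean function (58) (or (59) for downward jumps)
    in every 3x3 window of a binary (0/1) image."""
    padded = np.pad(binary, 1, mode='edge')
    H, W = binary.shape
    out = np.zeros_like(binary)
    for i in range(H):
        for j in range(W):
            window = padded[i:i + 3, j:j + 3]
            center = window[1, 1]                        # x5
            neighbors = np.delete(window.ravel(), 4)     # x1..x9 without x5
            if upward:
                out[i, j] = center & np.any(neighbors == 0)        # (58)
            else:
                out[i, j] = (1 - center) & np.any(neighbors == 1)  # (59)
    return out
```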

To process a gray-scale image using the same technique, there are two possibilities. It is possible to use threshold Boolean filtering, where the image is decomposed into binary images using the threshold decomposition, each binary plane is processed separately by the Boolean function, and then the resulting binary images are summed component by component into the resulting gray-scale image. Another possibility is cellular Boolean filtering [21], where the image is decomposed into binary planes directly: 1st bit (1st binary slice), 2nd bit (2nd binary slice), etc. Then each binary plane is processed separately by the Boolean function, and the resulting binary planes are merged into the resulting gray-scale image (a sketch of this scheme is given below). The corresponding binary images may be processed using the CNN-UBN. It is not possible to obtain the weighting templates for this case by design, as it was done for the multi-valued filters. However, it is possible to do it by learning! Only a few iterations are required for the convergence of the UBN learning algorithm (21)-(24) with the UBN activation function (16), $m=4$, for both Boolean functions (58) and (59). For example, the following weighting template is obtained using the learning algorithm for the function (58) [21]:

$$w_0=(-6.3,-5.6);\quad W=\begin{pmatrix}(-0.82,\ 0.32)&(-0.95,-0.16)&(-0.04,\ 0.01)\\ (0.25,-1.40)&(-0.32,-0.05)&(-0.03,\ 0.01)\\ (0.00,\ 0.10)&(0.63,\ 0.60)&(-0.02,\ 0.01)\end{pmatrix}.$$
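As promised above, here is a minimal sketch of cellular Boolean filtering (assuming 8-bit images and the hypothetical `precise_edges` helper sketched earlier; the merging step simply reassembles the processed bits):

```python
import numpy as np

def cellular_boolean_filter(gray, boolean_filter, bits=8):
    """Cellular Boolean filtering [21]: split a gray-scale image into its
    bit planes, apply the Boolean filter to each plane separately, and
    merge the processed planes back into a gray-scale image."""
    result = np.zeros_like(gray, dtype=np.int64)
    for b in range(bits):
        plane = (gray >> b) & 1                  # b-th binary slice
        processed = boolean_filter(plane)        # e.g., precise_edges
        result |= processed.astype(np.int64) << b
    return result

# Usage: edges = cellular_boolean_filter(image, lambda p: precise_edges(p))
```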

Fig. 19. Edge detection using the CNN-UBN and using a Sobel operator: (a) input image from the electronic microscope; (b) edge detection using function (58) and cellular Boolean filtering; (c) edged image (b) with the enhanced contrast; (d) edge detection using a Sobel operator: edges are smoothed.

The results of edge detection using the described algorithm and its CNN-UBN implementation on the software simulator are illustrated in Fig. 19. Here the technique of cellular Boolean filtering just described was used for the gray-scale image processing. The main advantage of the described algorithm is that all brightness jumps are detected and their intensity on the edged image is proportional to the level of the corresponding jump. At the same time, other edge detection techniques are less sensitive to the small brightness jumps and, as a result, the edges detected on a low-contrast image (like the one in Fig. 19a) are smoothed (like those obtained using a Sobel operator, Fig. 19d).

REFERENCES

[1] S. Ramón y Cajal, Histologie du Système nerveux de l'homme et des Vertébrés. Paris: Maloine; Edition Française Revue: Tome 1, 1952; Tome 2, 1955; Madrid: Consejo Superior de Investigaciones Científicas (1911).
[2] S. Haykin, Neural Networks: A Comprehensive Foundation (2nd Edition), Prentice Hall, 1998.
[3] G.M. Shepherd and C. Koch, "Introduction to Synaptic Circuits", in The Synaptic Organization of the Brain (G.M. Shepherd, ed.), New York: Oxford University Press, 1990, pp. 30-31.
[4] F. Faggin, "VLSI Implementation of Neural Networks", Tutorial Notes, International Joint Conference on Neural Networks, Seattle, WA, 1991.
[5] I. Aleksander and H. Morton, An Introduction to Neural Computing, London: Chapman & Hall, 1990.
[6] W.S. McCulloch and W. Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity", Bull. Math. Biophys., Vol. 5, 1943, pp. 115-133.
[7] D.O. Hebb, The Organization of Behavior, Wiley & Sons, New York, 1949.
[8] F. Rosenblatt, Principles of Neurodynamics, Spartan Books, New York, 1958.
[9] A.B.J. Novikoff, On Convergence Proofs for Perceptrons, Stanford Research Institute, report prepared for the Office of Naval Research under Contract No 3438(00), 1963.
[10] M.L. Dertouzos, Threshold Logic: A Synthesis Approach, The MIT Press, Cambridge, Massachusetts, 1965.
[11] S. Muroga, Threshold Logic and its Applications, Wiley & Sons, New York, 1971.
[12] M.L. Minsky and S.A. Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, Massachusetts, 1969.
[13] T. Kohonen, Associative Memory – A System-Theoretical Approach, Springer-Verlag, Berlin, 1977.
[14] J. Hopfield, "Neural Networks and Physical Systems with Emergent Collective Computational Abilities", Proc. of the National Academy of Sciences of the USA, Vol. 79, 1982, pp. 2554-2558.
[15] T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1984.
[16] D.E. Rumelhart and J.L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, 1986.
[17] L.O. Chua and L. Yang, "Cellular Neural Networks: Theory", IEEE Transactions on Circuits and Systems, Vol. CAS-35, No 10, 1988, pp. 1257-1290.
[18] M.L. Minsky, "Steps Toward Artificial Intelligence", Proceedings of the Institute of Radio Engineers, Vol. 49, 1961, pp. 8-30.
[19] I.N. Aizenberg, "Model of the Element with Complete Functionality", Izvestia AN SSSR. Technicheskaya Kibernetika ("The News of the USSR Academy of Sciences. Technical Cybernetics"), No. 2, 1985, pp. 188-191 (in Russian; the journal is translated into English by Kluwer/Plenum Publishers, New York).
[20] I.N. Aizenberg, "The Universal Logical Element over the Field of the Complex Numbers", Kibernetika (Cybernetics and Systems Analysis), No. 3, 1991, pp. 116-121 (in Russian; the journal is translated into English by Kluwer/Plenum Publishers, New York).
[21] I. Aizenberg, N. Aizenberg and J. Vandewalle, Multi-valued and Universal Binary Neurons: Theory, Learning, Applications, Kluwer Academic Publishers, Boston/Dordrecht/London, 2000.
[22] N.N. Aizenberg, Yu.L. Ivaskiv and D.A. Pospelov, "About one Generalization of the Threshold Function", Doklady Akademii Nauk SSSR ("The Reports of the Academy of Sciences of the USSR"), Vol. 196, No 6, 1971, pp. 1287-1290 (in Russian).
[23] N.N. Aizenberg and Yu.L. Ivaskiv, Multiple-Valued Threshold Logic, Naukova Dumka Publishing House, Kiev, 1977 (in Russian).
[24] I. Aizenberg and C. Moraga, "Multi-Layered Neural Network based on Multi-Valued Neurons (MLMVN) and a Backpropagation Learning Algorithm", Technical Report No CI 171/04 (ISSN 1433-3325) of the Collaborative Research Center for Computational Intelligence of the University of Dortmund (SFB 531), available online at http://sfbci.cs.uni-dortmund.de/home/English/Publications/Reference/Downloads/AM04.pdf
[25] A.N. Kolmogorov, "On the Representation of Continuous Functions of Many Variables by Superposition of Continuous Functions and Addition", Doklady Akademii Nauk SSSR, Vol. 114, 1957, pp. 953-956 (in Russian).
[26] K. Hornik, M. Stinchcombe and H. White, "Multilayer Feedforward Networks are Universal Approximators", Neural Networks, Vol. 2, 1989, pp. 359-366.
[27] D.E. Rumelhart, G.E. Hinton and R.J. Williams, "Learning Internal Representations by Error Propagation", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition (D.E. Rumelhart and J.L. McClelland, eds.), Vol. 1, Chapter 8, Cambridge, MA: MIT Press, 1986.
[28] M.C. Mackey and L. Glass, "Oscillation and Chaos in Physiological Control Systems", Science, Vol. 197, 1977, pp. 287-289.
[29] S.-H. Lee and I. Kim, "Time Series Analysis using Fuzzy Learning", in Proc. Int. Conf. Neural Inform. Processing, Seoul, Korea, Vol. 6, Oct. 1994, pp. 1577-1582.
[30] L.O. Chua and T. Roska, Cellular Neural Networks and Visual Computing, Cambridge University Press, 2002.
[31] H. Harrer and J.A. Nossek, "Discrete-Time Cellular Neural Networks", International Journal of Circuit Theory and Applications, Vol. 20, 1992, pp. 453-467.
[32] I. Aizenberg, N. Aizenberg, J. Hiltner, C. Moraga and E. Meyer zu Bexten, "Cellular Neural Networks and Computational Intelligence in Medical Image Processing", Image and Vision Computing (Elsevier), Vol. 19, Feb. 2001, pp. 177-183.
[33] I. Aizenberg and C. Butakoff, "Image Processing Using Cellular Neural Networks Based on Multi-Valued and Universal Binary Neurons", Journal of VLSI Signal Processing (Kluwer), Vol. 32, 2002, pp. 169-188.
