Fuzzy Neural Networks for Pattern Recognition

Fuzzy Neural Networks for Pattern Recognition*

Andrea Baraldi
IMGA-CNR, Via Gobetti 101, 43100 Bologna, Italy
Ph.: +39-59-362388; Fax: +39-59-374506
e-mail: [email protected]

Palma Blonda
IESI-CNR, Via Amendola 163, 70126 Bari, Italy
Ph.: +39-80-5481612; Fax: +39-80-5484311
e-mail: [email protected]

Alfredo Petrosino
INFM-University of Salerno, Via S. Allende, Baronissi (Salerno), Italy
Ph.: +39-89-761167; Fax: +39-89-761189
e-mail: [email protected]

Abstract

The objective of this paper is to discuss the state of the art of methodology and algorithms for integrating fuzzy sets and neural networks in a unique framework for dealing with pattern recognition problems, in particular classification procedures. Methods of pattern recognition are studied in two mainstreams, namely supervised and unsupervised learning. We propose our own definition of fuzzy neural integrated networks. This criterion is proposed as a unifying framework for the comparison of algorithms. In the first part of this paper, classification methods based on rule sets or numerical data are reviewed, together with specific methods for handling classification in image processing. In the second part of this paper, several fuzzy neural clustering models are reviewed and compared. These models are: i) Self-Organizing Map (SOM); ii) Fuzzy Learning Vector Quantization (FLVQ); iii) Carpenter-Grossberg-Rosen Fuzzy Adaptive Resonance Theory (CGR Fuzzy ART); iv) Growing Neural Gas (GNG); and v) Fully self-Organizing Simplified Adaptive Resonance Theory (FOSART).

*The introduction and Part I of this work have been authored by A. Petrosino, whereas Part II has been authored by A. Baraldi and P. Blonda.


Introduction

A lot of scientific effort has been dedicated to pattern recognition problems, especially classification procedures. These include linear classifiers, knowledge-based classification paradigms (e.g., heuristic classification stemming from the field of artificial intelligence), fuzzy logic based systems and neural networks. This underlines that the field of pattern recognition closely follows recent achievements in different areas of science and engineering, incorporating them to form more efficient and reliable classification algorithms. In particular, from the very beginning of the development of fuzzy sets it has been obvious that they have a strong impact on techniques of pattern recognition, both from the methodological viewpoint, i.e., the treatment of fuzzy sets as a well-suited theory within which one can establish a plausible tool for modeling the cognitive processes of the human being, and from the innovation viewpoint, in the sense of the development of many novel algorithms which, with suitable modifications, are useful for designing classification procedures. On the other side, neural networks today offer valuable tools for classification, both in the sense of nonparametric statistical pattern recognition and in their connotation of knowledge acquisition abilities, i.e., extracting knowledge from a family of training patterns and distributing it along the connections of the structure during learning. The general tendency of merging fuzzy sets and neural networks appeared quite early in the development of fuzzy sets [1]. Bearing in mind an evident overlap in the architectures characteristic of fuzzy sets and neural networks, these two complementary technologies can produce an efficient synergy for the design of "intelligent" classification procedures. In doing so, the merits and demerits of each method with respect to the featuring elements of information processing in intelligent systems must be considered and analyzed. Table 1 reports such a comparison. The table shows that neural networks enjoy four major advantages over many classical pattern recognition techniques: adaptivity, i.e., the ability to adjust when given new information; speed via massive parallelism; fault tolerance to missing, confusing and/or noisy data; and optimality as regards error rates in classification systems. On the other hand, they suffer from a limited capability to clearly formulate the knowledge extracted from data and to treat imprecise and/or incomplete knowledge. Several researchers have considered the possibility of integrating the advantages of neural networks with the no less known advantages of fuzzy sets, giving rise to fuzzy neural integrated networks. Since the variety of approaches found nowadays is quite impressive, the interested reader may refer to [2,3] as two representative sources in this regard. Here we will look at the synergy of fuzzy sets and neural networks from a broader perspective of methodology and methods in classification procedures. The paper is subdivided into two parts, relating to algorithms for supervised and unsupervised recognition tasks. The first part covers methods for classification in the presence of supervision, together with a broader view of methodologies for integrating fuzzy sets and neural networks in the same framework. Afterwards, methods of fuzzy neural clustering (classification with partial or no supervision) are studied in detail in the second part of the paper. We also provide an extensive and updated bibliography covering publications dealing with fuzzy neural synergism in pattern recognition.


                            FIS   ANN
Mathematical model          SG    SG
Learning ability            B     G
Knowledge representation    G     B
Expert knowledge            G     B
Nonlinearity                G     G
Optimization ability        B     SG
Fault tolerance             G     G
Uncertainty tolerance       G     G
Real-time operation         G     SG

Table 1: Comparison of Fuzzy Inference Systems (FIS) and Artificial Neural Networks (ANN). The terms for grading are good (G), slightly good (SG), slightly bad (SB) and bad (B).

PART I: FUZZY NEURAL NETWORKS FOR CLASSIFICATION

1 Fuzzy Pattern Recognition

To better understand the role of fuzziness in the design of pattern recognition systems, let us consider a decision-theoretic approach to pattern classification. Where conventional probabilistic and deterministic classifiers are concerned, the features characterizing the input vectors are quantitative, i.e., numerical in nature. Features having an imprecise or incomplete specification are usually either ignored or discarded from the design and test sets. The concepts of fuzzy set theory [4,5] can be introduced into the pattern recognition process to cope with impreciseness arising from various sources. For example, instrumental error or noise corruption in the experiment may lead to only partially reliable information being available on a feature measurement. Again, in some cases the expense incurred in extracting a very precise value of a feature may be high, or it may be difficult to decide on the most relevant features to be extracted. In these cases it may become convenient to use linguistic variables and hedges (small, medium, high, very, more or less, etc.) in order to describe the feature information. Again, uncertainty in classification may arise from the overlapping nature of classes; realistically speaking, the feature vector characterizing a specific pattern can and should be allowed to have degrees of membership in more than one class. Similarly, another problem arises in pattern recognition when it is necessary to estimate the exact shape of a region based upon a boundary that contains some or all of the sample points. Since the boundary may contain obscured portions not


represented in the sampled points, each point is characterized by a multivalued or fuzzy degree of belonging to certain classes. How the uncertainty should be described depends on the context considered:

(a) involving the feature space:
    IF an object is elongated AND small AND it moves fast THEN it is in class $\omega_i$

(b) involving the classification stage:
    IF an object is black AND cubic THEN it possibly is in class $\omega_i$

(c) a combination of both:
    IF an object is elongated AND small AND it moves fast THEN it possibly is in class $\omega_i$

where elongated, small, fast and possibly are linguistic variables (see the sketch below).
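To make the use of linguistic variables concrete, the following minimal sketch evaluates a rule antecedent such as the one in (a) with trapezoidal membership functions and min as the AND operator. This is our illustration, not code from the paper; the feature names and trapezoid breakpoints are hypothetical.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Hypothetical linguistic terms over three features (breakpoints are illustrative).
elongated = lambda ratio: trapezoid(ratio, 1.5, 3.0, 10.0, 12.0)  # length/width ratio
small     = lambda area:  trapezoid(area, 0.0, 0.0, 20.0, 50.0)   # area in cm^2
fast      = lambda speed: trapezoid(speed, 2.0, 5.0, 30.0, 40.0)  # speed in m/s

# Rule (a): IF elongated AND small AND fast THEN class w_i.
# With min as the AND operator, the rule's firing strength is:
firing_strength = min(elongated(2.0), small(30.0), fast(6.0))
print(firing_strength)  # degree to which the pattern satisfies the antecedent
```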

Formally speaking, the design of a fuzzy pattern recognition system may be described as follows.

Problem formulation

Let $X$ denote the pattern from the environment observed via the data acquisition process. The pattern is described in linguistic terms rather than in numerical fashion, i.e., the $i$th feature of $X$, $x_i$, is represented as a fuzzy set with membership function $\mu_{x_i} \in [0,1]$. We further assume that there exists a finite number of classes $W = \{\omega_1, \omega_2, \ldots, \omega_c\}$. The goal is to find a fuzzy set $\Omega$ with membership function $\mu_\Omega$ such that the value $\mu_\Omega(\omega_i)$, $i = 1, \ldots, c$, denotes the grade of membership of the pattern $X$ to the $i$th class. If there is a rule base consisting of a collection of fuzzy IF-THEN rules describing the classification process, e.g.,

$R_j$ : IF $x_1$ is $A_1^j$ AND ... AND $x_n$ is $A_n^j$ THEN $(x_1, \ldots, x_n)$ possibly is in $\omega_l$

the better way to classify an unknown pattern is to infer the class it belongs to on the basis of the rule base and the description of the pattern in linguistic terms or, more generally, characterized by numerical information. The aim of a fuzzy inference system (FIS) is to use these fuzzy IF-THEN rules to determine a mapping from fuzzy sets in the input universe of discourse $U \subset \mathbb{R}^n$ to fuzzy sets in the output universe of discourse $V \subset \mathbb{R}$ based on fuzzy logic principles. As mentioned above, if the pattern is described in numerical fashion, a fuzzifier at the input and a defuzzifier at the output of the fuzzy logic system are added. The fuzzifier maps crisp points (i.e., points with $\{0,1\}$-valued membership) in $U$ to fuzzy sets in $U$, and the defuzzifier maps fuzzy sets in $V$ to crisp points in $V$. The relevance of using a fuzzifier and a defuzzifier also lies in the capability of the fuzzy logic system to integrate the processing of both numerical and linguistic information.


Generally and formally speaking, in the framework of classification procedures, a fuzzy logic system with fuzzifier and defuzzifier is a multi-input-single-output (MISO) fuzzy inference system which computes a function $f : \mathbb{R}^n \to \mathbb{R}$, described by the following collection of single-output systems:

Fact: $x_1$ is $A'_1$ AND ... AND $x_n$ is $A'_n$

Rule 1: IF $x_1$ is $A_1^1$ AND ... AND $x_n$ is $A_n^1$ THEN $y$ is $B^1$ ELSE
...
Rule m: IF $x_1$ is $A_1^m$ AND ... AND $x_n$ is $A_n^m$ THEN $y$ is $B^m$ ELSE

Conclusion: $y$ is $B'$

where $(x_1, x_2, \ldots, x_n)^T \in U$ and $y \in V$. In the setting of the MISO fuzzy inference system, the fuzzifier, inference and defuzzifier get the following meanings:

Fuzzification:

The input $x_i$ maps to the fuzzy set $A'_i$, $i = 1, \ldots, n$.

Inference:

Each fuzzy rule defines a fuzzy set $(A_1^j \text{ AND } \cdots \text{ AND } A_n^j) \to B^j$ in the product space $U \times V$. The conclusion $B'$ can be deduced as

$$B' = (A'_1 \text{ AND } \cdots \text{ AND } A'_n) \bullet R$$

where $\bullet$ denotes the max-product compositional rule of inference and $R$ is the fuzzy set union of the rules, i.e.,

$$\mu_{B'}(y) = \max_{j=1}^{m} \left( \prod_{i=1}^{n} \mu_{A_i^j}(x_i) \right) \mu_{B^j}(y)$$



Defuzzification:

The output y" is a representative point of B', i.e. if the center of area (COA) method is adopted

The main theoretical result which characterizes the MISO fuzzy inference system is


Result

For any given real continuous function $f$ on a compact set $U \subset \mathbb{R}^n$ and arbitrary $\epsilon > 0$, there exists a FIS computing a function $g$, with COA defuzzifier, product inference rule and singleton fuzzifier, such that

$$\sup_{x \in U} |f(x) - g(x)| < \epsilon$$

• ... with $m_0 > m_f > 1.1$ recommended in [45].

• In [45], it is clarified that FLVQ, like SOM, does not optimize any known objective function, and that it is expected to reach termination when the FCM objective function is approximately minimized [46]. In [67], [68], [69], EFLVQ-F learning schemes are formally derived to minimize a given functional when $m$ is constant. It is also shown that FLVQ updating can be seen as a special case of the EFLVQ-F learning schemes for a restricted range of the weighting exponent. This does not mean, however, that FLVQ minimizes the EFLVQ-F functional, since the hypothesis $m = \text{constant}$ does not hold true for FLVQ. We conclude that despite recent advances in the field, the objective function minimized by FLVQ is still unknown, as is the one minimized by SOM [72]. In line with Section 3, we believe that the absence of a known cost function may no longer be perceived as a structural drawback affecting either SOM or FLVQ.
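Since FLVQ termination is tied to the approximate minimization of the FCM objective function $J_m = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^m \|X_k - V_i\|^2$, the following minimal sketch computes the standard FCM memberships and objective for a fixed weighting exponent $m$. This is our illustration of the textbook FCM definitions, not code from [45]; variable names are ours.

```python
import numpy as np

def fcm_memberships(X, V, m):
    """Standard FCM membership update: u_ik proportional to d_ik^(-2/(m-1))."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # squared distances (n, c)
    d2 = np.maximum(d2, 1e-12)                                # guard against zero distance
    u = d2 ** (-1.0 / (m - 1.0))
    return u / u.sum(axis=1, keepdims=True)                   # each row sums to one

def fcm_objective(X, V, m):
    u = fcm_memberships(X, V, m)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return np.sum((u ** m) * d2)

# Toy data: n = 6 patterns, c = 2 prototypes, weighting exponent m = 2.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [9., 9.], [9., 8.], [8., 9.]])
V = np.array([[0.5, 0.5], [8.5, 8.5]])
print(fcm_objective(X, V, m=2.0))   # J_m decreases as the prototypes fit the data
```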


• ... termination criterion is removed.

• With respect to SOM, FLVQ requires a smaller set of input parameters.

13 CGR Fuzzy ART

In recent years, several ART models have been presented: ART 1 [52], ART 2 [53], and CGR Fuzzy ART [50]. ART 1 categorizes binary patterns in a way that varies with the order of their training sequences. This experimental evidence led to the development of the Improved ART 1 system (IART 1) [75]. ART 2, developed to detect regularities in analog data sequences, employs a computationally expensive architecture affected by difficulty in parameter selection. The CGR Fuzzy ART system, which can process binary as well as analog patterns while featuring none of the ART 2 drawbacks, was developed starting from a "fuzzification" of the choice function and match function employed in the ART 1 model. This means, however, that ART 1-based structural problems may also affect the CGR Fuzzy ART system design. The structured organization of ART systems is made up of two subsystems. The ART attentional subsystem is a completely generic and homogeneous network performing learning by examples. It employs a neuron activation function providing a normalized degree of similarity between the input pattern and one template vector. It is worth noting that the activation function is "unidirectional", i.e., this inter-pattern similarity measure does not satisfy the commutative law. Then, the attentional subsystem selects the winner neuron as the best matching unit. The ART orienting subsystem models the responses of the external environment to the learning activities of the attentional subsystem. In particular, this subsystem employs a "unidirectional" match function to perform a vigilance test. This match function does not satisfy the commutative law either, and is complementary to the activation function. The vigilance test checks whether the degree of match between the input pattern and the winner template is above a user-defined normalized vigilance threshold. If the vigilance test is satisfied (i.e., resonance occurs), then the winner template is updated; otherwise a mismatch reset condition is managed as follows. The orienting subsystem inhibits the winner neuron and submits the second winner neuron to the vigilance test. This search process is repeated until either the vigilance test is passed or no more neurons are available for testing. In this second case, a new neuron is created such that its template (receptive field center) is equal to the input pattern. Coarser grouping of input patterns is obtained when the vigilance parameter is lowered.

13.1 Input parameters

• CGR Fuzzy ART employs parameter $\alpha \in [0.001, 1)$ to break ties in inter-pattern similarity measures (see (14)).

• It requires vigilance threshold $\rho \in [0, 1]$ to control neuron proliferation (see (15)). Parameters $\rho$ and $\alpha$ are interrelated as illustrated in [76] (when $\rho$ decreases, $\alpha$ must also decrease). A typical $\alpha$ value is 0.001 [51].


The left side of the vigilance test is a match function that measures to what normalized degree $T_{w1}$ matches $X_k$, while it does not assess the reverse situation, i.e., $MF_1(T_{w1}, X_k)$ does not satisfy the commutative law. If the vigilance test is satisfied, then the attentional subsystem is activated to sequentially adjust the winner template according to (16). Otherwise, the mismatch reset condition and search process are activated (see above).

• CGR Fuzzy ART employs a WTA strategy. The weight transformation law applied to the winner template is [50]:

$$T_{w1} = (1 - \beta) \cdot T_{w1} + \beta \cdot (X_k \wedge T_{w1}) \qquad (16)$$

where $\wedge$ denotes the component-wise fuzzy intersection.

• When no existing neuron satisfies the vigilance test and a new processing unit $w1$ is allocated by the orienting subsystem, (16) is substituted by $T_{w1} = X_k$ to initialize the uncommitted neuron.

• Unlike Kohonen's clustering neural networks, CGR Fuzzy ART employs no cooling schedule, because the learning rate for committed neurons is constant in time.

• CGR Fuzzy ART substitutes the operators employed in the ART 1-based activation function and match function with fuzzy-like operations (intersection and cardinality, see (14) and (15)). To be correctly interpreted as fuzzy operations, these operations would have to be applied to fuzzy set membership values, rather than to the parameters (pattern and template vectors) of the fuzzy set membership function [47]. Since CGR Fuzzy ART employs no fuzzy membership function, it cannot employ fuzzy operations derived from fuzzy logic either. Moreover, CGR Fuzzy ART adopts a WTA adaptation strategy; thus, it cannot be termed fuzzy according to Section 2. A sketch of one complete presentation step is given below.
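The sketch below assembles the pieces discussed above (choice function, vigilance test, search process, template update) into one presentation step. It is our rendering of the standard Fuzzy ART equations, with (14), (15) and (16) as referenced above; parameter defaults are illustrative.

```python
import numpy as np

def fuzzy_art_step(x, templates, rho=0.75, alpha=0.001, beta=1.0):
    """Present one pattern x; return the updated template list and the
    index of the resonating neuron."""
    if not templates:
        templates.append(x.copy())                 # first pattern becomes first template
        return templates, 0
    # Choice function (cf. (14)): T_j = |x ^ w_j| / (alpha + |w_j|), ^ = component-wise min.
    choice = [np.minimum(x, w).sum() / (alpha + w.sum()) for w in templates]
    # Search process: test neurons in decreasing order of choice value.
    for j in np.argsort(choice)[::-1]:
        w = templates[j]
        # Vigilance test (cf. (15)): |x ^ w| / |x| >= rho.
        if np.minimum(x, w).sum() / x.sum() >= rho:
            # Resonance: update the winner template (cf. (16)).
            templates[j] = (1 - beta) * w + beta * np.minimum(x, w)
            return templates, j
    # All neurons failed the vigilance test: allocate a new (uncommitted) neuron.
    templates.append(x.copy())
    return templates, len(templates) - 1

# Usage: complement coding maps x in [0,1]^d to (x, 1-x) to limit category proliferation.
x = np.array([0.2, 0.9])
x_cc = np.concatenate([x, 1.0 - x])
templates, winner = fuzzy_art_step(x_cc, templates=[])
```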

13.3 Architectural features

The main features of CGR Fuzzy ART are summarized:

1) on-line learning;
2) hard competitive learning (i.e., CGR Fuzzy ART does not satisfy the definition of fuzzy clustering NN adopted herein);
3) dynamic-sizing (dynamic neuron generation: yes; dynamic neuron removal: no);
4) no-linking (i.e., this model does not employ synaptic links to make neighboring neurons mutually coupled);
5) neuron receptive field information is represented as a single point identifying the receptive field center (cluster prototype);
6) the size of the neuron receptive field is upper bounded by a user-defined parameter;
7) network-based variables to be updated: time;
8) neuron-based variables to be updated: receptive field center;
9) connection-based variables to be updated: none;
10) the learning rate is constant with time (i.e., it does not satisfy Kohonen's first learning constraint);
11) the size of the update neighborhood is not computed (i.e., it does not satisfy Kohonen's second learning constraint);
12) it does not minimize any known objective function, i.e., termination is not based on optimizing any model of the process or its data [46];
13) no mathematical tool provided by fuzzy set theory is employed;
14) CGR Fuzzy ART models the responses of the external environment to its learning activities (i.e., it satisfies the econet definition adopted herein);
15) it does not develop independent subsystems (disjointed subnets); and
16) it does not perform perfect topology preserving mapping.

13.4 Limitations

• Experiments reveal that ART 1-based models are affected by pattern mismatching sensitive to the order of presentation of the input sequence [54], [75]. This is due to the fact that two non-commutative inter-pattern measures are analyzed in sequence to compute the output values of the activation and match functions, respectively. In the first stage, winner neuron $w1$ is detected while ignoring the degree to which $T_{w1}$ matches $X_k$. In the second stage, vigilance decisions are taken while ignoring the degree to which $X_k$ matches $T_{w1}$.

• CGR Fuzzy ART is time consuming. Because two "unidirectional" similarity measures are analyzed in sequence, ART 1-based models must schedule a search process. Sorting required by this search process increases computation time as $c \log c$ in a serial implementation [48]. (For example, Fuzzy SART, by employing an activation function which satisfies the commutative law, no longer requires any search process [54].)

• It requires input data preprocessing (e.g., normalization or complement coding) to prevent category proliferation [50].

• Because it is incapable of removing neurons, it may fit the noise, not just the signal, leading to overfitting.

• It cannot be employed in topology preserving mapping.

13.5 Advantages

• Feed-back interaction between the attentional and orienting subsystems allows CGR Fuzzy ART to self-adjust its size depending on the complexity of the clustering task.

• CGR Fuzzy ART can be employed as a vector requantization system.

• Distinct sample vectors are employed to initialize reference vectors. This choice reduces the risk of dead unit formation and may reduce computation time with respect to random initialization.

14 GNG

GNG combines the growth mechanism inherited from the earlier proposed Growing Cell Structures [80] with the topology generation rule CHR [62]. GNG is capable of generating/removing synaptic links and neurons, i.e., GNG belongs to the class of FSONN models (see Section 4). In particular, starting with very few units (generally, two), one new unit is inserted every $\lambda$ adaptation steps near the unit featuring the largest local error measure. In [82], it is anticipated that the future development of GNG will employ, besides the neuron insertion criterion described above, the following rule for neuron removal: every $\lambda$ adaptation steps, the unit featuring the lowest utility for error reduction is removed.


This utility measure is defined as the increase in overall distortion error occurring if the unit of interest were removed from the set of templates (codebook). The utility $U(T_{w1})$ of codebook vector $T_{w1}$, $w1 \in \{1, c\}$, where $c$ is the total number of neurons, is defined as follows [83]:

$$U(T_{w1}) = \sum_{X_k \in M_{w1}} d(X_k, T_{w2}) - d(X_k, T_{w1}) \qquad (17)$$

where $k \in \{1, n\}$, such that $n$ is the total number of input patterns, and $d(X_k, T_{w1})$ is the Euclidean distance. In (17), $T_{w2}$ is the second best matching template for vector $X_k$, where $w2 \neq w1$, and $M_{w1}$ is the subset of input patterns featuring template $T_{w1}$ as the closest reference vector, i.e., $M_{w1} = \{X_k, k \in \{1, n\} : \text{neuron } w1 \text{ satisfies the condition } d(X_k, T_{w1}) \leq d(X_k, T_j), \forall j \in \{1, c\}\}$.
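A direct transcription of (17) follows; it computes the utility of every codebook vector from each pattern's best and second-best templates. This is our illustration; array names are hypothetical.

```python
import numpy as np

def gng_utilities(X, T):
    """Utility U(T_w1) of each template, per (17): for the patterns in M_w1,
    accumulate the extra distortion d(X_k, T_w2) - d(X_k, T_w1) incurred
    if template w1 were removed from the codebook."""
    d = np.linalg.norm(X[:, None, :] - T[None, :, :], axis=2)  # (n, c) distances
    order = np.argsort(d, axis=1)
    w1, w2 = order[:, 0], order[:, 1]        # best and second-best template per pattern
    n_patterns, c = d.shape
    U = np.zeros(c)
    rows = np.arange(n_patterns)
    np.add.at(U, w1, d[rows, w2] - d[rows, w1])  # sum over M_w1 for each unit w1
    return U

# The unit with the lowest utility is the candidate for removal.
X = np.random.rand(100, 2)
T = np.random.rand(5, 2)
print(np.argmin(gng_utilities(X, T)))
```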

14.1 Input parameters

• GNG employs two user-defined learning rates: $\epsilon_{w1} \in (0,1)$ is applied to the winner neuron, while $\epsilon_n \in (0,1)$, such that $\epsilon_n < \epsilon_{w1}$, is applied to neurons belonging to the update neighborhood. Typical values are: $\epsilon_{w1} = 0.05$, $\epsilon_n = 0.0006$ [56].

• Parameter $\lambda \in \mathbb{N}^+$ controls neuron generation at adaptation steps (see above).

• Parameters $\alpha$ and $\beta$ are used to decrease the error variables, so that more recent signals are weighted more strongly than previous ones [80]. Typical values are: $\alpha = 0.5$, $\beta = 0.0005$ [56].

• Parameter $a_{max} \in \mathbb{N}^+$ is the maximum age of a synaptic link. A typical value is $a_{max} = 88$ [56].

• GNG employs a convergence parameter, such as the maximum number of neurons $c_{max}$ [56].

14.2 Description of the algorithm

• In line with SOM, GNG employs geometrical (distance) relationships in the input (measurement) space to enforce competitive learning mechanisms. Let us consider: a) winner neuron $w1$ as the one featuring the shortest inter-pattern distance, $d(X_k, T_{w1}) \leq d(X_k, T_i), \forall i \in \{1, c\}$, where $c$ is the total number of neurons, $T_{w1}$ is the template pattern of neuron $w1$, and $d(X_k, T_i)$ is the Euclidean distance; and b) the second best neuron $w2$ as the one featuring activation value $d(X_k, T_{w2}) \leq d(X_k, T_i), \forall i \in \{1, c\}, i \neq w1$. The exploitation of the Euclidean inter-pattern distance in competitive learning shapes neuron receptive fields as Voronoi polyhedra (see Section 4).

• As soon as winner neuron $w1$ is detected, its local error variable is updated; for example, for requantization tasks, the increase of the accumulated error $E_{w1}$ is defined as

$$E_{w1} = E_{w1} + d(X_k, T_{w1})^2 \qquad (18)$$


otherwise, for probability density function estimation,

$$E_{w1} = E_{w1} + 1. \qquad (19)$$
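The following minimal sketch combines the steps above into one GNG adaptation iteration: winner/second-winner search, error accumulation per (18), winner-and-neighbors update (cf. (20) below), and Hebbian edge creation. It is our simplified illustration following [56]; edge aging, neuron insertion and removal are omitted.

```python
import numpy as np

def gng_adapt_step(x, T, E, edges, eps_w1=0.05, eps_n=0.0006):
    """One GNG adaptation step. T is the (c, dim) template array, E the (c,)
    accumulated-error array, `edges` a set of frozensets {i, j} of linked units."""
    d = np.linalg.norm(T - x, axis=1)
    w1, w2 = np.argsort(d)[:2]                   # best and second-best units
    E[w1] += d[w1] ** 2                          # quantization-error increment, cf. (18)
    T[w1] += eps_w1 * (x - T[w1])                # winner update
    for j in range(len(T)):                      # neighbors linked to w1 by a synaptic edge
        if j != w1 and frozenset((w1, j)) in edges:
            T[j] += eps_n * (x - T[j])
    edges.add(frozenset((w1, w2)))               # link the two best-matching units
    return w1, w2

# Usage with a toy codebook of three units and no initial links.
T = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.0]])
E = np.zeros(len(T))
edges = set()
gng_adapt_step(np.array([0.2, 0.1]), T, E, edges)
```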

• In line with SOM, topological relationships among neurons belonging to an output neuron lattice are employed to detect neighboring (contextual, softmax) effects. In particular, GNG applies an update equation to the winner neuron $w1$ and to its (topologically) adjacent neighbors, i.e., the resonance neighborhood consists of all neurons adjacent to the winner (such that synaptic links $Lnk_{w1,i}$, $i \in \{1, c\}$, $i \neq w1$, do exist). Therefore, due to the dynamic generation/removal of neurons and synaptic links, the resonance domain changes with time to include neurons that are topologically adjacent to winner $w1$. Nonetheless, this behavior does not strictly satisfy the Kohonen constraint requiring the size of the update neighborhood to decrease with time monotonically.

• Templates belonging to the update neighborhood are adapted according to the Kohonen weight adaptation rule. The update equation is

$$T_i = T_i + \epsilon \cdot (X_k - T_i) \qquad (20)$$

where $\epsilon = \epsilon_{w1}$, which is user-defined (see Subsection 9.1), if $i = w1$, and $\epsilon = \epsilon_n$ if vertex $i$ is adjacent to the winner (see above).

• With regard to the generation of neurons, GNG employs the following strategy: one new unit is inserted every $\lambda$ adaptation steps near the unit featuring the largest local error measure, identified as unit $q$. In particular, the receptive field center of the new unit is located halfway between the templates of units $q$ and $f$, where $f$ identifies the neuron featuring the maximum