Clustering Large Data Sets Described With Discrete Distributions and An Application on TIMSS Data Set

Simona Korenjak-Černe, Vladimir Batagelj, Barbara Japelj Pavešič

January 29, 2010

University of Ljubljana, Faculty of Economics, Department of Statistics, [email protected] (corresponding author)
University of Ljubljana, Faculty of Mathematics and Physics, Department of Mathematics, [email protected]
The Educational Research Institute, Slovenia, [email protected]

Abstract

Symbolic Data Analysis is based on special descriptions of data – symbolic objects. Such descriptions preserve more detailed information about the data than the usual representations with mean values. A special kind of symbolic object is the representation with distributions. In the clustering process this representation enables us to consider variables of all types at the same time. We present two clustering methods based on data descriptions with discrete distributions: the adapted leaders method and the adapted agglomerative hierarchical Ward's method. The methods are compatible – they can be viewed as two approaches to solving the same clustering optimization problem. In the obtained clustering each cluster is assigned its leader. The descriptions of the leaders offer a simple interpretation of the clusters' characteristics. The leaders method enables us to efficiently solve clustering problems with a large number of units, while the agglomerative method, applied to the obtained leaders, enables us to decide upon the right number of clusters on the basis of the corresponding dendrogram. Both methods were successfully applied in analyses of different data sets. In this paper an application on the TIMSS data set is presented.

Keywords: symbolic object, mixed units, clustering, large data sets, leaders method, agglomerative hierarchical clustering method, k-means method, Ward's method, TIMSS.

1 Introduction

Although there are many clustering methods ([1], [2], [3], [4]), most of them do not scale well to large data sets. Besides the classical k-means and dynamic clustering methods, some new methods for clustering large data sets were developed recently. In this paper we propose an approach for clustering large data sets of mixed units based on discrete distributions.

One possible solution to the problem of mixed units is to choose an appropriate description for such units. We selected the description of units with discrete distributions because such a description has many advantages: it enables us to deal with all types of variables, it reduces a large data set, and it preserves more detailed information about the data than mean values. The description with distributions represents a special kind of symbolic object ([5], [6], [7]).

Focusing on the problem of large data sets, we adapted the well-known leaders method (see [3], page 74) to the selected uniform representations of units and clusters. To reveal the internal structure of the reduced data set we also adapted the agglomerative hierarchical clustering method. We proved that for the selected dissimilarity the adapted method is a version of Ward's hierarchical clustering method ([8]). Different applications of the proposed methods were presented at several conferences and some of them were also published in proceedings ([9], [10], [11], [13]).

Each of the methods can be used separately. Since they are compatible, they can also be used to perform clustering in two stages: first, to efficiently cluster a large data set with the adapted leaders method (to reduce the size of the data set), and afterward to cluster the obtained clusters of units with the agglomerative hierarchical method to reveal the internal structure among them.

2 The description based on discrete distributions

In the vector description of the data each component of the vector corresponds to a descriptor – a variable. Variables can be measured on different scales: nominal, ordinal, numerical. When not all variables are of the same type we are dealing with


the so-called mixed data. For the description based on distributions the domain of each variable $V^j$ ($j = 1, \dots, m$) is partitioned into $k_j$ subsets $\{V_i^j,\ i = 1, \dots, k_j\}$. For a cluster $C$ we denote with $f(i, C; V^j)$ the frequency and with
\[
\pi(i, C; V^j) = \frac{f(i, C; V^j)}{\mathrm{card}(C)}
\]
the relative frequency of the values of variable $V^j$ in the $i$-th subset $V_i^j$. For all variables $V^j$ ($j = 1, \dots, m$) it holds
\[
\sum_{i=1}^{k_j} \pi(i, C; V^j) = 1. \tag{1}
\]
The description $C(V^j)$ of the cluster $C$ for the variable $V^j$ is a distribution, i.e. the vector of the relative frequencies on the subsets $V_i^j$ ($i = 1, \dots, k_j$). The cluster $C$ is described with the vector $\mathbf{C}$ of the distributions, which is a special kind of symbolic object ([6]):
\[
\mathbf{C} = [C(V^1), \dots, C(V^m)], \qquad C(V^j) = [\pi(1, C; V^j), \dots, \pi(k_j, C; V^j)].
\]
A unit is considered as a special cluster with only one element and in our approach can be represented for the variable $V^j$ ($j = 1, \dots, m$) either with a single value or with a distribution over the subsets $V_i^j$ ($i = 1, \dots, k_j$). Such a description has the following important properties:

1. it produces a uniform description for all types of variables;

2. it requires a fixed space per variable, which is especially important when dealing with large data sets;

3. it is compatible with merging of disjoint clusters: knowing the descriptions of clusters $C_1$ and $C_2$, $C_1 \cap C_2 = \emptyset$, we can, without additional information, produce the description of their union (see the sketch after this list)
\[
\pi(i, C_1 \cup C_2; V^j) = \frac{\mathrm{card}(C_1)\, \pi(i, C_1; V^j) + \mathrm{card}(C_2)\, \pi(i, C_2; V^j)}{\mathrm{card}(C_1) + \mathrm{card}(C_2)};
\]

4. the distributions in the description enable us to reduce large data sets;

5. it preserves more information about the cluster than the usual value of an appropriate statistic – e.g. the mean value used in the usual approach.
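The merging rule in property 3 is simple enough to check directly. The following sketch is our own illustration (the helper names describe and merge are hypothetical, not part of any package mentioned in the paper): it builds the relative-frequency description of one nominal variable and merges two disjoint clusters without going back to the units.

    from collections import Counter

    def describe(values, categories):
        """Relative-frequency description of one variable for a cluster.

        values     : one category index per unit in the cluster
        categories : the k_j possible category indices of this variable"""
        counts = Counter(values)
        n = len(values)
        return [counts[c] / n for c in categories]

    def merge(desc1, n1, desc2, n2):
        """Description of the union of two disjoint clusters (property 3)."""
        return [(n1 * p1 + n2 * p2) / (n1 + n2)
                for p1, p2 in zip(desc1, desc2)]

    # toy check on a single 4-category variable
    c1, c2, cats = [1, 1, 2, 4], [2, 2, 3], [1, 2, 3, 4]
    d1, d2 = describe(c1, cats), describe(c2, cats)
    print(merge(d1, len(c1), d2, len(c2)))   # matches describe(c1 + c2, cats) up to rounding

Running it shows that the merged description agrees with the description computed from the pooled units, which is exactly what property 3 asserts.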


2.1 Example: TIMSS data set description using distributions

The representation with distributions over possible values is especially useful for nominal data, which are typical of data sets in the social sciences (for example, data obtained by interviews). To present the advantages of such a data representation we applied the proposed approach in exploratory data analyses of different data sets from social science research, such as ego-centered networks. In ego-centered networks the objects of interest (units) are the respondents of the interview, called egos; persons related to them (called alters) are considered as well. Alters can be described with the same variables as egos and/or with some (other) variables which describe the relationship between the ego and its alters and the properties of the alters [13]. This approach was successfully applied to the data set on social support in Ljubljana, collected by the Faculty of Social Sciences at the University of Ljubljana between March and June 2000 by computer assisted telephone interviews (CATI) and computer assisted personal interviews (CAPI) [10].

A more complex approach was used on the data set from TIMSS, the Trends in International Mathematics and Science Study, a large-scale study measuring trends in the mathematics achievement of 4th- and 8th-grade elementary school students from different countries. The survey is repeated every four years. In TIMSS, student achievement is measured by tests of mathematics items, while related factors on the student, teacher, and school level are collected with questionnaires for each group of participants. The aim of TIMSS is to measure the knowledge of students and to explain differences between students by factors related to teaching, learning, and class activities. The main problem of analyzing such hierarchical data is to use linked data from students and their teacher simultaneously to study activities in the class. The main goal of the study is to find efficient methods of teaching and learning, independent of the cultural background, with respect to high student achievement.

Since representations with distributions of nominal data make it possible to analyze two-level data, the clustering method, applied to the TIMSS data on teachers and students translated into an ego-centered network, was found to be an appropriate and new way to explore which factors describe typical teaching methods across countries. With the adapted leaders clustering method, similar teachers were assigned to the same cluster according to their reports about teaching and, at the same time, their students' reports on the work in classes. Typical characteristics of the clusters were found. With the compatible adapted hierarchical clustering method the most appropriate numbers of clusters were determined. Clusters were also compared among themselves by the measured mathematics achievement of the students assigned to each cluster through their teacher and by other variables on learning mathematics. The results were presented at the 1st IEA (International Association for the Evaluation of Educational Achievement) International Research Conference IRC-2004 in Cyprus [11], and at IFCS 2006 in Ljubljana [12].

We applied our methods to the data from the years 1999 and 2003 from most of the participating countries. Each data set included more than 30 countries [15], [16]. The TIMSS data sets, prepared for the application of the adapted leaders method, consisted of data on a teacher and data on his/her students. The units of the analysis were teachers, described by their variables: gender, age, education, their work in classes, pedagogical approaches used in the class, opinions about mathematics, classroom activities, use of IT, and issues on homework. Each teacher's description also includes the distributions of students' answers, describing students' attitudes toward mathematics (such as valuing mathematics, enjoyment of learning, self-confidence in mathematics) and activities in class (such as students' use of IT, their participation in learning mathematics, and their strengths in mathematics). These values were collected with separate questionnaires for students. Teachers are considered as egos and their students as alters. For each ego (teacher) the values of his/her variables T1, T2, T3, ... are represented with single values – the indices of the appropriate subsets in the partitions of their domains – while each of the alters' (students') variables S1, S2, ... is represented with a frequency distribution of the alters' answers. For example, the symbolic object corresponding to the teacher with id 4567 is

    SO_4567 = [4567, 4, 6, 3, ..., [47, 16, 37, 0, 0], [0, 0, 100, 0], ...]
                id   T1 T2 T3 ...        S1                 S2         ...

The values for the alters' variable S1 are divided into the following four subsets: 1 = strongly agree, 2 = agree, 3 = disagree, and 4 = strongly disagree. The meaning of the distribution for the variable S1 is that among all students of the teacher with id 4567, 47 students strongly agree with statement S1, 16 agree, 37 disagree, and none strongly disagree. The last number in the distribution represents the number of missing values; in this case there were no missing values for this question.
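To make the example concrete, a symbolic object like SO_4567 could be stored, for instance, as a nested structure with single values for the teacher (ego) variables and frequency distributions for the student (alter) variables. The layout below is our own illustration and is not the format used by TIMSS or by the Clamix package.

    # A possible in-memory layout (illustrative only) of the symbolic object for
    # the teacher with id 4567: ego variables T1, T2, T3 keep single category
    # indices, alter variables S1, S2 keep frequency distributions whose last
    # component counts missing answers.
    so_4567 = {
        "id": 4567,
        "teacher": {"T1": 4, "T2": 6, "T3": 3},
        "students": {
            "S1": [47, 16, 37, 0, 0],   # strongly agree, agree, disagree, strongly disagree, missing
            "S2": [0, 0, 100, 0],
        },
    }

    def to_relative(freq):
        """Frequencies -> relative frequencies, as used by the clustering methods."""
        total = sum(freq)
        return [f / total for f in freq] if total else freq

    print(to_relative(so_4567["students"]["S1"]))   # [0.47, 0.16, 0.37, 0.0, 0.0]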

3 Dissimilarity

Based on the introduced descriptions of the clusters and of the units we can use two dissimilarities, both defined as a weighted sum of the dissimilarities on each variable:
\[
d(C_1, C_2) = \sum_{j=1}^{m} \alpha_j\, d(C_1, C_2; V^j), \qquad \sum_{j=1}^{m} \alpha_j = 1 \tag{2}
\]
where $d(C_1, C_2; V^j)$ is either
\[
d_{abs}(C_1, C_2; V^j) = \frac{1}{2} \sum_{i=1}^{k_j} |\pi(i, C_1; V^j) - \pi(i, C_2; V^j)| \tag{3}
\]
or
\[
d_{sqr}(C_1, C_2; V^j) = \frac{1}{2} \sum_{i=1}^{k_j} \big(\pi(i, C_1; V^j) - \pi(i, C_2; V^j)\big)^2. \tag{4}
\]
Here, $\alpha_j \ge 0$ ($j = 1, \dots, m$) denote weights, which can be equal for all variables or different if we have some additional information about the importance of the variables. Both dissimilarities $d_{abs}$ and $d_{sqr}$ have range $[0, 1]$. For the dissimilarity $d_{abs}$ the triangle inequality also holds – it is a semidistance.
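Written out in code, the two dissimilarities and their weighted combination (2)–(4) amount to a few lines. The sketch below is our own minimal rendering (the names d_abs, d_sqr and d are illustrative); a unit or cluster description is stored as a list of per-variable distributions, as in Section 2.

    def d_abs(p1, p2):
        """Dissimilarity (3): half the L1 distance between two distributions."""
        return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))

    def d_sqr(p1, p2):
        """Dissimilarity (4): half the squared Euclidean distance."""
        return 0.5 * sum((a - b) ** 2 for a, b in zip(p1, p2))

    def d(desc1, desc2, weights, per_variable=d_sqr):
        """Weighted sum (2) over all variables; the weights must sum to 1."""
        return sum(w * per_variable(p1, p2)
                   for w, p1, p2 in zip(weights, desc1, desc2))

    # two descriptions on m = 2 variables
    A = [[1.0, 0.0, 0.0], [0.5, 0.5]]
    B = [[0.0, 1.0, 0.0], [0.5, 0.5]]
    print(d(A, B, weights=[0.5, 0.5], per_variable=d_abs))   # 0.5
    print(d(A, B, weights=[0.5, 0.5]))                       # 0.5 (d_sqr here)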

4 The adapted leaders method

One of the most popular clustering methods for large data sets is the k-means method, which is a special version of the leaders method. The k-means method can be applied only to numerical variables. The leaders method, as a variant of the dynamic clustering method [17], can be described with the following procedure:

    determine an initial clustering
    repeat
        determine the leaders of the clusters in the current clustering;
        assign each unit to the nearest new leader – producing a new clustering
    until the leaders do not change any more.

The leaders method is a local optimization method and it solves the following optimization problem: find a clustering $\mathbf{C}^*$ in a set of feasible clusterings $\Phi$ for which
\[
P(\mathbf{C}^*) = \min_{\mathbf{C} \in \Phi} P(\mathbf{C}) \tag{5}
\]
with the criterion function
\[
P(\mathbf{C}) = \sum_{C \in \mathbf{C}} p(C), \qquad \text{where} \qquad p(C) = p(C, L_C) = \sum_{X \in C} d(X, L_C). \tag{6}
\]
The set of feasible clusterings $\Phi$ is the set of partitions of the finite set of units $\mathcal{U}$ into $k$ clusters.
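A minimal sketch of this loop could look as follows; it is our own illustration (the names leaders_method, dissim and leader_of are hypothetical), with a random initial clustering and a simple stopping rule. The per-cluster leader computation is passed in as a function, e.g. the optimal leader for $P_{sqr}$ derived in Theorem 2 below.

    import random

    def leaders_method(units, k, dissim, leader_of, max_iter=100, seed=0):
        """Generic leaders (dynamic clustering) loop sketched in Section 4.

        units     : list of unit descriptions
        dissim    : dissimilarity d(unit, leader) from Section 3
        leader_of : function mapping a cluster (list of units) to its leader"""
        rng = random.Random(seed)
        # initial clustering: assign every unit to one of k clusters at random
        clusters = [[] for _ in range(k)]
        for u in units:
            clusters[rng.randrange(k)].append(u)
        assignment, leaders = None, []
        for _ in range(max_iter):
            leaders = [leader_of(c) for c in clusters if c]   # skip empty clusters
            new_assignment = [min(range(len(leaders)),
                                  key=lambda i: dissim(u, leaders[i]))
                              for u in units]
            if new_assignment == assignment:                  # leaders did not change
                break
            assignment = new_assignment
            clusters = [[] for _ in range(len(leaders))]
            for u, i in zip(units, assignment):
                clusters[i].append(u)
        return clusters, leaders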


The initial clustering can be obtained randomly with a selected number of clusters $k$, or it can be determined from the units by selecting the maximal allowed dissimilarity between a unit and the nearest leader.

The optimal leader $L_C$ of the cluster $C$ has to satisfy the relation
\[
L_C = \mathop{\mathrm{argmin}}_{L \in \Psi} p(C, L),
\]
where $\Psi$ is a set of feasible representations of the clusters. In our approach the leader $L$ (a representative element) of the cluster $C$ is also described as a vector of distributions
\[
\mathbf{L} = [L(V^1), \dots, L(V^m)], \qquad L(V^j) = [\lambda(1, L; V^j), \dots, \lambda(k_j, L; V^j)],
\]
where
\[
\sum_{i=1}^{k_j} \lambda(i, L; V^j) = 1, \qquad (j = 1, \dots, m).
\]
The error $p(C, L)$ of the cluster $C$ for the leader $L$ can be split into variables' parts:
\[
p(C, L) = \sum_{X \in C} d(X, L) = \sum_{X \in C} \sum_{j=1}^{m} \alpha_j\, d(X, L; V^j)
        = \sum_{j=1}^{m} \alpha_j \sum_{X \in C} d(X, L; V^j) = \sum_{j=1}^{m} \alpha_j\, p(C, L; V^j).
\]

Because of the additivity of the model it is enough to find the appropriate distribution of the optimal leader for a single variable, denoted by $V$.

Theorem 1. For the criterion function $P_{abs}$, where in expression (6) the dissimilarity $d_{abs}$ (3) is used and where all units are represented with a single value for each variable (the usual vector description of a unit), the optimal leader is determined by the maximal frequencies. We select the 'symmetric' solution
\[
\lambda(i, L; V) = \begin{cases} \frac{1}{t} & \text{if } i \in M \\ 0 & \text{otherwise} \end{cases}
\]
where $M = \{j : f(j, C; V) = \max_l f(l, C; V)\}$ and $t = \mathrm{card}(M)$.

Proof: Let $j(X) = i \Leftrightarrow V(X) \in V_i$. Since each unit is represented with a single value for each variable we have
\[
\pi(i, X; V) = \begin{cases} 1 & \text{if } i = j(X) \\ 0 & \text{otherwise} \end{cases}
\]
Therefore
\[
d_{abs}(X, L; V) = \frac{1}{2} \sum_{i=1}^{k_V} |\pi(i, X; V) - \lambda(i, L; V)|
= \frac{1}{2} \Big[ \big(1 - \lambda(j(X), L; V)\big) + \sum_{i \ne j(X)} \lambda(i, L; V) \Big].
\]
Since $\sum_{i=1}^{k_V} \lambda(i, L; V) = 1$, the dissimilarity $d_{abs}(X, L; V)$ can be written as
\[
d_{abs}(X, L; V) = 1 - \lambda(j(X), L; V).
\]
There are $f(i, C; V)$ units in each subset $V_i$ with values in this subset. The sum over all subsets equals the number of all units in the cluster, $\sum_{i=1}^{k_V} f(i, C; V) = \mathrm{card}(C)$. So we have
\[
p(C, L; V) = \sum_{X \in C} d_{abs}(X, L; V) = \mathrm{card}(C) - \sum_{i=1}^{k_V} f(i, C; V)\, \lambda(i, L; V).
\]
This expression is smallest when the sum on the right-hand side is largest. This means that for each variable $V$ the subset $V_i$ with the largest frequency $f(i, C; V)$ determines the distribution $L_C(V)$ of the optimal leader $L_C$. If $V_i$ is the only subset with the maximal frequency, then $\lambda(i, L_C; V) = 1$ and $\lambda(j, L_C; V) = 0$ for $j = 1, \dots, k_V$, $j \ne i$. For polymodal distributions the value of $p(C, L; V)$ is the same if we select only one of the subsets with the maximal frequency or any combination of them, so in this case the optimal leader is not uniquely determined. We select the 'symmetric' solution
\[
\lambda(i, L; V) = \begin{cases} \frac{1}{t} & \text{if } i \in M \\ 0 & \text{otherwise} \end{cases}
\]
where $M = \{j : f(j, C; V) = \max_l f(l, C; V)\}$ and $t = \mathrm{card}(M)$. □

The important advantages of the criterion $P_{abs}$ are the fast convergence of the procedure and the very simple interpretation of the clustering results. The main drawback, besides the restriction to the usual vector description of the units (the values of the variables can still be nominal, because the distributions of values for each variable are considered), is that the leaders are not uniquely determined for polymodal distributions.
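Computed from the frequencies of a single variable, the 'symmetric' leader of Theorem 1 takes only a few lines; the function below is our own illustrative sketch (its name is hypothetical).

    def symmetric_modal_leader(freqs):
        """Leader component for P_abs (Theorem 1), computed from the frequencies
        f(i, C; V) of one variable: the mass 1 is spread evenly over all subsets
        with maximal frequency."""
        fmax = max(freqs)
        modal = [i for i, f in enumerate(freqs) if f == fmax]
        return [1.0 / len(modal) if i in modal else 0.0
                for i in range(len(freqs))]

    print(symmetric_modal_leader([5, 2, 5, 1]))   # [0.5, 0.0, 0.5, 0.0]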

Theorem 2. For the criterion function $P_{sqr}$ with the dissimilarity $d_{sqr}$ (4) the optimal leaders are uniquely determined by the averages of the relative frequencies
\[
\lambda(i, L; V) = \frac{1}{\mathrm{card}(C)} \sum_{X \in C} \pi(i, X; V).
\]

The following well-known lemma will simplify the proof of the theorem:

Lemma 3. The expression
\[
F(a) = \sum_{i=1}^{N} (x_i - a)^2,
\]
where $x_i$ are known numbers, has the smallest value when $a$ is equal to
\[
a^* = \frac{1}{N} \sum_{i=1}^{N} x_i.
\]

Proof (of Theorem 2): We are looking for a leader $L_C$ such that the value of the expression
\[
p(C, L_C; V) = \sum_{X \in C} d_{sqr}(X, L_C; V)
= \frac{1}{2} \sum_{X \in C} \sum_{j=1}^{k_V} \big(\pi(j, X; V) - \lambda(j, L_C; V)\big)^2
= \frac{1}{2} \sum_{j=1}^{k_V} \sum_{X \in C} \big(\pi(j, X; V) - \lambda(j, L_C; V)\big)^2
\]
is the smallest. By Lemma 3 this is the case when for the components $\lambda(j, L_C; V)$ ($j = 1, \dots, k_V$) the averages of the relative frequencies are chosen:
\[
\lambda(j, L_C; V) = \frac{1}{\mathrm{card}(C)} \sum_{X \in C} \pi(j, X; V). \qquad \square
\]

Lemma 4. The sum of the components from Theorem 2 is 1.


Proof:
\[
\sum_{j=1}^{k_V} \lambda(j, L_C; V) = \frac{1}{\mathrm{card}(C)} \sum_{j=1}^{k_V} \sum_{X \in C} \pi(j, X; V)
= \frac{1}{\mathrm{card}(C)} \sum_{X \in C} \sum_{j=1}^{k_V} \pi(j, X; V) = 1.
\]
In the last equality we again used the property that the sum of the relative frequencies is 1. □

The main advantages of the method with the criterion function $P_{sqr}$ are:

• the input units can also be represented with nontrivial distributions, and not only with a single value for each variable,

• the optimal leaders are uniquely determined,

• the clusters' characteristics obtained from the distributions have a simple interpretation.

The leaders method based on Theorem 2 is an extended version of the well-known k-means method [3]. As all such methods with similar definitions of the dissimilarity [18], this one also requires several iterations and therefore the corresponding procedure can be expensive.

Lemma 5. When each unit is represented only with a single value for each variable (the usual vector description) the optimal leader $L_C$ has the same description as the cluster $C$ to which it belongs: $\lambda(j, L_C; V) = \pi(j, C; V)$.

Proof: If we have the usual vector description of the unit then
\[
\pi(i, X; V) = \begin{cases} 1 & \text{if } i = j(X) \\ 0 & \text{otherwise} \end{cases}
\]
Because of that, the component $j$ contains a 1 exactly as many times as there are units in the subset $V_j$, which equals the frequency $f(j, C; V)$ on that subset. So it holds
\[
\lambda(j, L_C; V) = \frac{1}{\mathrm{card}(C)} \sum_{X \in C} \pi(j, X; V) = \frac{f(j, C; V)}{\mathrm{card}(C)} = \pi(j, C; V). \qquad \square
\]
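Combined with the loop sketched in Section 4, Theorem 2 gives a complete adapted leaders method for $P_{sqr}$. The leader update below is our own minimal rendering (mean_leader is a hypothetical name); the commented call shows how it would plug into the earlier sketch together with the dissimilarity d from Section 3.

    def mean_leader(cluster):
        """Optimal leader for P_sqr (Theorem 2): the per-variable averages of
        the units' relative-frequency distributions.

        cluster : non-empty list of unit descriptions; each description is a
                  list of per-variable distributions, as in Section 2."""
        n = len(cluster)
        m = len(cluster[0])                              # number of variables
        return [[sum(unit[j][i] for unit in cluster) / n
                 for i in range(len(cluster[0][j]))]
                for j in range(m)]

    # plugged into the loop sketched in Section 4 (all names are assumptions):
    # clusters, leaders = leaders_method(units, k=7,
    #                                    dissim=lambda u, L: d(u, L, weights),
    #                                    leader_of=mean_leader)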

5 Building a hierarchy

Besides local optimization clustering methods, hierarchical clustering methods are also very popular. Their advantage is the well-known tree representation of the results – a dendrogram. Their main drawback is that they are computationally too expensive for large data sets (the number of units is limited to some hundreds). We developed the adapted agglomerative clustering method based on the descriptions of units, clusters and clusters' leaders by discrete distributions. Such a description itself reduces a large data set. On the other hand, the adapted agglomerative clustering method can also be used to reveal the internal structure among the clusters (obtained by the leaders method).

The standard agglomerative hierarchical clustering method is described with the following procedure:

    each unit forms a cluster: C_n = {{X} : X ∈ U};
    they are at level 0: h({X}) = 0, X ∈ U;
    for k = n − 1 downto 1 do
        determine the closest pair of clusters
            (p, q) = argmin_{i,j: i ≠ j} { D(C_i, C_j) : C_i, C_j ∈ C_{k+1} };
        join the closest pair of clusters: C_(pq) = C_p ∪ C_q;
        C_k = (C_{k+1} \ {C_p, C_q}) ∪ {C_(pq)};
        h(C_(pq)) = D(C_p, C_q);
        determine the dissimilarities D(C_(pq), C_s), C_s ∈ C_k
    endfor

$\mathbf{C}_k$ is a partition of the finite set of units $\mathcal{U}$ into $k$ clusters. The level $h(C_{(pq)})$ of the cluster $C_{(pq)} = C_p \cup C_q$ is determined by the dissimilarity between the joined clusters $C_p$ and $C_q$: $h(C_{(pq)}) = D(C_p, C_q)$. In our approach we take as the units $X$ the clusters from the clustering obtained by the leaders method, represented with their leaders. So $h(C) = 0$ for all $C$ from the initial clustering. The initial clustering can be obtained, for example, with the adapted leaders method.

The hierarchical clustering methods can also be viewed as greedy methods for solving the clustering problem (6), (5), (4) [19]. In that context another definition is needed:

Definition 1. The criterion function $P$ is compatible with the dissimilarity $D$ iff

(i) $P(\mathbf{C}) = \bigoplus_{C \in \mathbf{C}} p(C)$,

(ii) $p(C) = \min_{\emptyset \subset C' \subset C} \big( p(C') \oplus p(C \setminus C') \oplus D(C', C \setminus C') \big)$,

(iii) $p(\{X\}) = 0$ for all $X \in \mathcal{U}$,

and $(\mathbb{R}^+_0, \oplus, 0, \le)$ is an ordered Abelian monoid.

For example, the minimum, maximum and Ward's hierarchical clustering methods satisfy Definition 1, as can be seen from Table 1.

Table 1: Some compatible criterion functions and dissimilarities

METHOD        | $P(\mathbf{C})$                | $p(C)$                                                                      | $D(C_1, C_2)$
maximal (CL)  | $\max_{C \in \mathbf{C}} p(C)$ | $\max_{X,Y \in C} d(X, Y)$                                                  | $\max_{X \in C_1, Y \in C_2} d(X, Y)$
minimal (SL)  | $\sum_{C \in \mathbf{C}} p(C)$ | the value of the minimal spanning tree over $C$ with edge values $d(X, Y)$ | $\min_{X \in C_1, Y \in C_2} d(X, Y)$
Ward          | $\sum_{C \in \mathbf{C}} p(C)$ | $\sum_{X \in C} \|X - \bar{C}\|^2$                                          | $\frac{n_1 \cdot n_2}{n_1 + n_2} \|\bar{C}_1 - \bar{C}_2\|^2$

where $n_i = \mathrm{card}(C_i)$, $\bar{C}_i = \frac{1}{n_i} \sum_{X \in C_i} X$, and $\|\cdot\|$ is the Euclidean distance.

For the criterion function $P$, compatible with the dissimilarity $D$, the following relation holds (see [20]):
\[
\min_{\mathbf{C} \in \Phi_k} P(\mathbf{C}) = P(\mathbf{C}^*_k) = \min_{\mathbf{C} \in \Phi_{k+1}} \big( P(\mathbf{C}) \oplus \min_{C_i, C_j \in \mathbf{C}} D(C_i, C_j) \big),
\]
where $\Phi_k$ is the set of partitions of $\mathcal{U}$ into $k$ clusters. Based on this relation the following greedy approximation is obtained:
\[
P(\mathbf{C}^\bullet_k) \approx P(\mathbf{C}^\bullet_{k+1}) \oplus D(C_p, C_q), \tag{7}
\]
where
\[
D(C_p, C_q) = \min_{C_i, C_j \in \mathbf{C}^\bullet_{k+1}} D(C_i, C_j)
\]
and $\mathbf{C}^\bullet_{k+1}$ is the greedy optimal clustering from the previous step.

Lemma 6. The dissimilarity between clusters $D(C_p, C_q)$ that measures the change of the value of the criterion function produced by merging the clusters $C_p$ and $C_q$,
\[
D(C_p, C_q) = P(\mathbf{C}_k) - P(\mathbf{C}_{k+1}) = p(C_p \cup C_q) - p(C_p) - p(C_q), \tag{8}
\]
is compatible with the criterion function $P$ (6).

Proof: In our case the operation $\oplus$ is ordinary addition and the leader of the cluster with only one unit $\{X\}$ is the unit $X$ itself. The consequence of this is that the first and the last requirements of Definition 1 are satisfied. By the definition (8) of the dissimilarity $D$,
\[
p(C') + p(C \setminus C') + D(C', C \setminus C') = p(C') + p(C \setminus C') + p(C) - p(C') - p(C \setminus C') = p(C)
\]
for all subsets $\emptyset \subset C' \subset C$, so the second requirement in the definition of compatibility is satisfied as well. □

In our adapted agglomerative clustering method we use the relation (8) for the definition of the dissimilarity $D$. The computation of the dissimilarity $D$ by the relation (8) could be quite expensive, but fortunately the following theorem holds:

Theorem 7. For the criterion function $P_{sqr}$ the dissimilarity $D(C_p, C_q)$ (8) can be calculated using Ward's relation:
\[
D(C_p, C_q) = \frac{\mathrm{card}(C_p) \cdot \mathrm{card}(C_q)}{\mathrm{card}(C_p) + \mathrm{card}(C_q)}\, d_{sqr}(L_p, L_q). \tag{9}
\]

The proof of this theorem is rather technical and is therefore given in the appendix of the article. This method represents a special kind of generalized Ward clustering method ([21]). The dissimilarity in the Ward method (also in the general case) satisfies the well-known Lance-Williams recursive formula ([22])
\[
D(C_p \cup C_q, C_s) = \alpha_p D(C_p, C_s) + \alpha_q D(C_q, C_s) + \beta D(C_p, C_q) + \gamma |D(C_p, C_s) - D(C_q, C_s)|, \tag{10}
\]
where
\[
\alpha_p = \frac{n_p + n_s}{n_p + n_q + n_s}, \qquad \alpha_q = \frac{n_q + n_s}{n_p + n_q + n_s}, \qquad \beta = -\frac{n_s}{n_p + n_q + n_s}, \qquad \gamma = 0
\]
and $n_i = \mathrm{card}(C_i)$, $i = p, q, s$. Because the recursive relation is satisfied with the same values of the coefficients $\alpha_p$, $\alpha_q$, $\beta$ and $\gamma$ also in the adapted hierarchical clustering method, monotonic clusterings are guaranteed by the following theorem:

Theorem 8 ([23]). The clustering level $h(C_p, C_q) = D(C_p, C_q)$ of the clusters in the dendrogram, where the dissimilarity $D$ satisfies the Lance-Williams formula (10), is monotonic iff at each step of the clustering procedure the following conditions hold:
\[
\alpha_p + \alpha_q \ge 0, \qquad \gamma + \min\{\alpha_p, \alpha_q\} \ge 0, \qquad \alpha_p + \alpha_q + \beta \ge 1.
\]
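Because the agglomeration runs on the (relatively few) leaders produced by the leaders method, even a naive search for the closest pair at every step is affordable. The sketch below is our own illustration (it reuses the d and d_sqr helpers assumed in the Section 3 sketch, and adapted_ward is a hypothetical name): it merges clusters using the dissimilarity of relation (9), updating the leader of the union as the size-weighted mean of the two leaders. One could equally maintain the dissimilarity matrix with the Lance-Williams update (10) instead of recomputing it from the merged leaders.

    def adapted_ward(leaders, sizes, weights):
        """Agglomerate the clusters returned by the leaders method, using
        relation (9): D(Cp, Cq) = np*nq/(np+nq) * d_sqr(Lp, Lq).

        leaders : list of cluster leaders (lists of per-variable distributions)
        sizes   : list of cluster cardinalities
        Returns the merges as a list of (level, members_p, members_q)."""
        items = [{"leader": L, "n": n, "members": frozenset([i])}
                 for i, (L, n) in enumerate(zip(leaders, sizes))]
        merges = []
        while len(items) > 1:
            # naive search for the closest pair of current clusters
            best = None
            for a in range(len(items)):
                for b in range(a + 1, len(items)):
                    p, q = items[a], items[b]
                    D = (p["n"] * q["n"] / (p["n"] + q["n"])
                         * d(p["leader"], q["leader"], weights, per_variable=d_sqr))
                    if best is None or D < best[0]:
                        best = (D, a, b)
            level, a, b = best
            p, q = items[a], items[b]
            merges.append((level, p["members"], q["members"]))
            # leader of the union: size-weighted mean of the two leaders (property 3)
            union_leader = [[(p["n"] * x + q["n"] * y) / (p["n"] + q["n"])
                             for x, y in zip(dp, dq)]
                            for dp, dq in zip(p["leader"], q["leader"])]
            items = [it for i, it in enumerate(items) if i not in (a, b)]
            items.append({"leader": union_leader, "n": p["n"] + q["n"],
                          "members": p["members"] | q["members"]})
        return merges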

5.1 An application on the TIMSS data set

As mentioned in Section 2.1, we applied the adapted leaders method to the TIMSS data for the years 1999 and 2003. The units of analysis were symbolic objects described with discrete distributions. In the first stage, we applied the adapted leaders method with different numbers of initial clusters (50, 40, 30 clusters). Afterwards, the adapted hierarchical clustering method was applied to these clusters to determine the most appropriate number of them. Seven well separated clusters of teachers were detected. In the final stage, we applied the adapted leaders method to the TIMSS data sets again with seven initial clusters and used the clustering results to study differences in teaching and learning mathematics in classes over the world. The results of the analysis of the TIMSS 2003 data set are presented in this paper. The data include 6552 teachers and 131000 students, representing more than 10 million students of the 8th grade in 30 countries.

5.1.1 Cluster characteristics

From the descriptions of the obtained clusters, differences between clusters in pedagogical approaches to teaching mathematics were found. The achievement of students belonging to the clusters of teachers decreases on the scale: cluster 6 (541 points), cluster 2 (532 points), cluster 7 (523 points), cluster 3 (516 points), cluster 5 (515 points), cluster 1 (499 points). Cluster 4 grouped teachers with missing data. Characteristic variables were defined as variables that have the same value for at least 50% of the units in the cluster. Typically, teacher variables reached higher percentages than student variables. For each cluster a different set of characteristic variables was found. From their most frequent values, descriptions of the differences in cluster characteristics were drawn. Here is their summary.

In cluster 6, with the highest achievement, teachers report no cooperation with other teachers; they almost never discuss how to teach, observe other teachers, or prepare teaching materials together. It seems that teachers who teach weaker students feel more need to discuss and cooperate with other teachers. Teachers in cluster 6 also do not participate in in-service teacher training, while teachers in clusters with middle achievement participate in all kinds of professional development programs. Teachers in the lowest achieving cluster are trained on the mathematics curriculum only. In higher achieving clusters students are permitted to use calculators and computers, but in cluster 6 students almost never use them, either for routine calculations or for solving complex problems, while in cluster 5 they use them at least in every second lesson for different routine and non-routine tasks.

There are also differences in assigning and monitoring homework. In cluster 6, students get homework at every lesson, and it is almost always corrected by the teacher. For homework, these students are always asked to solve problems, as opposed to practicing skills. Students of cluster 1 always have to review their homework by themselves. In cluster 5 homework is checked by the teacher but reviewed or corrected by students only. These students often have to find uses of mathematical content for their homework or collect data, while in other clusters students are not asked to do this so often. Half of the cluster 5 students have to explain their answers at every lesson, while students of other clusters are not required to do this so often. In mathematics tests in cluster 6, there are always items on applications of mathematics. Most students at least sometimes have to justify or explain the solutions of their homework. In cluster 3 students typically have to practice computational skills more than in other clusters and they are rarely asked to work on more challenging content. Teachers in cluster 2 feel limited in their teaching by students' academic abilities more than in other clusters. These students are most often asked by the teacher to do a lot of computational practice without calculators and little problem solving.

5.1.2 The relationship between the extent of the topics taught and students' achievement

Since the representation with distributions joins the teachers' and the students' data sets, we could, through cluster memberships, also study further relationships between variables or between other factors not included in the clustering procedure. For example, teachers were asked which mathematics content topics had already been taught to their students. Figure 1 presents the differences in mathematics achievement and in the number of topics taught in school to students (before taking the TIMSS tests) by clusters. The graph shows the tendency that the achievement of students is higher when teachers reported fewer topics taught in school. It seems that teachers of higher achieving students more often reported a topic as taught when the topic was really known by the students, while teachers in lower achieving clusters reported many topics taught without their students knowing them well. It can be said that teachers could be more critical in declaring what has already been taught, or that the curriculum could be more concentrated on important basic knowledge. In the TIMSS study students are assigned to different benchmark levels of mathematical knowledge. The cluster analysis revealed that students of teachers in cluster 6 have reached higher benchmarks in larger percentages than students of teachers in other clusters. Figure 2 presents the distribution of students reaching the benchmarks for three clusters, each with a different form of its line graph. Cluster 6 has reached the largest percentage of students at the advanced level of mathematical knowledge, while cluster 1 has the most students staying below the low benchmark level and the fewest students achieving the advanced benchmark level.

Figure 1: Percentages of TIMSS test topics taught to students before the test and mathematics achievement on the TIMSS test, by clusters

5.1.3 The participation of the countries in the clusters

While the descriptions of clusters provide a deeper look into the methods of teaching in each cluster, the distribution of teachers over clusters also allows further analyses of variables not included in the clustering procedure. For example, we observed the countries of the teachers inside each cluster. Interestingly, the majority of teachers from the same country were assigned to the same one or two clusters, although no variable indicating the country or cultural background of teachers was used in the clustering process. The results for the TIMSS 2003 data are presented in Table 2. It can be seen that some clusters consist of teachers or countries from a specific geographical or cultural background. An English speaking group of countries, Far East, Eastern and Western Europe clusters can be recognized. Considering also the differences in cluster characteristics, mathematics was found to be taught differently and with different efficiency in these geographical areas. Since the data from 1999 showed similar distributions of countries over clusters, we believe it might be said that country specifics and cultural background implicitly influence teaching and learning mathematics.

Table 2: Clusters by country of teacher, TIMSS 2003


Figure 2: Benchmark levels of mathematics achievement reached by students, for three clusters

6 Conclusion

In this paper two clustering methods adapted for clustering units described with discrete distributions are presented. The representations of units, clusters and clusters' representatives (leaders) with discrete distributions are a special kind of symbolic objects and as such preserve more information than classical representations with mean values. Such representations also enable us to include all types of variables in the clustering process at the same time. Both methods can be used separately or in combination, since they are compatible. We applied both methods to the TIMSS data sets from the years 1999 and 2003. Some of the results revealed expected relations, but with both methods some unexpected results were also found. We can therefore say that the new clustering methods enable us to draw some important conclusions about differences in teaching and learning mathematics in classes over the world that could not be seen with the traditional approaches for analyzing such data. The analyses were done using the program package Clamix. It is available at http://educa.fmf.uni-lj.si/datana/pub/clamix/.


A Appendix

Proof of Theorem 7: The errors of the clusters can be split by the variables:
\[
p(C_p \cup C_q) - p(C_p) - p(C_q) = \sum_{j=1}^{m} \alpha_j \big( p(C_p \cup C_q; V^j) - p(C_p; V^j) - p(C_q; V^j) \big).
\]
Because of the additivity of the model it is enough to consider only one variable $V$. By definition the error of the cluster is
\[
p(C; V) = \sum_{X \in C} d_{sqr}(X, L_C; V).
\]
For the dissimilarity $d_{sqr}$ (4) and the optimal leader $L_C$ (Theorem 2) we get
\[
p(C; V) = \frac{1}{2} \sum_{i=1}^{k_V} \sum_{X \in C} \Big( \pi(i, X; V) - \frac{1}{\mathrm{card}(C)} \sum_{Y \in C} \pi(i, Y; V) \Big)^2
\]
and further
\[
p(C; V) = \frac{1}{2} \sum_{i=1}^{k_V} \Big( \sum_{X \in C} \pi(i, X; V)^2 - \frac{1}{\mathrm{card}(C)} \Big( \sum_{Y \in C} \pi(i, Y; V) \Big)^2 \Big). \tag{11}
\]
On the other hand, the sum $\frac{1}{\mathrm{card}(C)} \sum_{X \in C} \sum_{Y \in C} d_{sqr}(X, Y; V)$ can be written as
\[
\frac{1}{\mathrm{card}(C)} \sum_{X \in C} \sum_{Y \in C} d_{sqr}(X, Y; V)
= \frac{1}{2 \cdot \mathrm{card}(C)} \sum_{i=1}^{k_V} \sum_{X \in C} \sum_{Y \in C} \big( \pi(i, X; V) - \pi(i, Y; V) \big)^2
\]
\[
= \frac{1}{2 \cdot \mathrm{card}(C)} \sum_{i=1}^{k_V} \Big( 2 \cdot \mathrm{card}(C) \cdot \sum_{X \in C} \pi(i, X; V)^2 - 2 \cdot \Big( \sum_{Y \in C} \pi(i, Y; V) \Big)^2 \Big).
\]
Comparing these two equalities we get a new expression for the cluster's error:
\[
p(C; V) = \frac{1}{2 \cdot \mathrm{card}(C)} \sum_{X \in C} \sum_{Y \in C} d_{sqr}(X, Y; V). \tag{12}
\]
Equalities (11) and (12) are used in the proof of the equality
\[
D(C_p, C_q) = p(C_p \cup C_q) - p(C_p) - p(C_q) = \frac{n_p \cdot n_q}{n_p + n_q}\, d_{sqr}(L_p, L_q). \tag{13}
\]
Assume that the clusters $C_p$ and $C_q$ are disjoint. Based on this assumption also $\mathrm{card}(C_p \cup C_q) = n_p + n_q$ holds. Using the equality (12) we get
\[
p(C_p \cup C_q; V) = \frac{1}{2(n_p + n_q)} \Big( \sum_{X \in C_p} \sum_{Y \in C_p} d_{sqr}(X, Y; V)
+ 2 \sum_{X \in C_p} \sum_{Y \in C_q} d_{sqr}(X, Y; V) + \sum_{X \in C_q} \sum_{Y \in C_q} d_{sqr}(X, Y; V) \Big)
\]
\[
= \frac{1}{2(n_p + n_q)} \Big( 2 n_p\, p(C_p; V) + 2 \sum_{X \in C_p} \sum_{Y \in C_q} d_{sqr}(X, Y; V) + 2 n_q\, p(C_q; V) \Big).
\]
Also the symmetry of the dissimilarity $d_{sqr}$ was used. The expression can be simplified with (12) into
\[
p(C_p \cup C_q; V) - p(C_p; V) - p(C_q; V)
= \frac{1}{n_p + n_q} \Big( \sum_{X \in C_p} \sum_{Y \in C_q} d_{sqr}(X, Y; V) - n_q\, p(C_p; V) - n_p\, p(C_q; V) \Big).
\]
To prove the equality (13) for the variable $V$ the following equivalence still has to be proven:
\[
\sum_{X \in C_p} \sum_{Y \in C_q} d_{sqr}(X, Y; V) - n_q\, p(C_p; V) - n_p\, p(C_q; V) = n_p \cdot n_q \cdot d_{sqr}(L_p, L_q; V). \tag{14}
\]
By the definition of the dissimilarity $d_{sqr}$
\[
\sum_{X \in C_p} \sum_{Y \in C_q} d_{sqr}(X, Y; V) = \frac{1}{2} \sum_{i=1}^{k_V} \sum_{X \in C_p} \sum_{Y \in C_q} \big( \pi(i, X; V) - \pi(i, Y; V) \big)^2.
\]
Therefore
\[
\sum_{X \in C_p} \sum_{Y \in C_q} d_{sqr}(X, Y; V) = \frac{1}{2} \sum_{i=1}^{k_V} \Big( n_q \sum_{X \in C_p} \pi(i, X; V)^2
- 2 \sum_{X \in C_p} \pi(i, X; V) \cdot \sum_{Y \in C_q} \pi(i, Y; V) + n_p \sum_{Y \in C_q} \pi(i, Y; V)^2 \Big).
\]
The expression
\[
\frac{1}{2} \sum_{i=1}^{k_V} n_q \sum_{X \in C_p} \pi(i, X; V)^2
\]
by (11) equals
\[
n_q\, p(C_p; V) + \frac{1}{2} \sum_{i=1}^{k_V} \frac{n_q}{n_p} \Big( \sum_{X \in C_p} \pi(i, X; V) \Big)^2.
\]
Similarly the expression
\[
\frac{1}{2} \sum_{i=1}^{k_V} n_p \sum_{Y \in C_q} \pi(i, Y; V)^2
\]
can be substituted by the sum
\[
n_p\, p(C_q; V) + \frac{1}{2} \sum_{i=1}^{k_V} \frac{n_p}{n_q} \Big( \sum_{Y \in C_q} \pi(i, Y; V) \Big)^2.
\]
And finally
\[
\sum_{X \in C_p} \sum_{Y \in C_q} d_{sqr}(X, Y; V) - n_q\, p(C_p; V) - n_p\, p(C_q; V) =
\]
\[
\frac{1}{2} \sum_{i=1}^{k_V} \Big( \frac{n_q}{n_p} \Big( \sum_{X \in C_p} \pi(i, X; V) \Big)^2
- 2 \sum_{X \in C_p} \pi(i, X; V) \cdot \sum_{Y \in C_q} \pi(i, Y; V)
+ \frac{n_p}{n_q} \Big( \sum_{Y \in C_q} \pi(i, Y; V) \Big)^2 \Big).
\]
On the other hand also the following equivalence holds:
\[
n_p \cdot n_q \cdot d_{sqr}(L_p, L_q; V) = n_p \cdot n_q \cdot \frac{1}{2} \sum_{i=1}^{k_V} \Big( \frac{1}{n_p} \sum_{X \in C_p} \pi(i, X; V) - \frac{1}{n_q} \sum_{Y \in C_q} \pi(i, Y; V) \Big)^2.
\]
After simplification and comparison of the last two equalities we finally get
\[
D(C_p, C_q) = p(C_p \cup C_q) - p(C_p) - p(C_q)
= \sum_{j=1}^{m} \alpha_j \big( p(C_p \cup C_q; V^j) - p(C_p; V^j) - p(C_q; V^j) \big)
\]
\[
= \sum_{j=1}^{m} \alpha_j \cdot \frac{n_p \cdot n_q}{n_p + n_q} \cdot d_{sqr}(L_p, L_q; V^j)
= \frac{n_p \cdot n_q}{n_p + n_q} \sum_{j=1}^{m} \alpha_j \cdot d_{sqr}(L_p, L_q; V^j)
= \frac{n_p \cdot n_q}{n_p + n_q} \cdot d_{sqr}(L_p, L_q). \qquad \square
\]

Acknowledgment

This work was partially supported by the Ministry of Higher Education, Science and Technology of Slovenia, Grant P1-0294.

References

[1] P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy. Freeman, San Francisco, 2nd edition, 1973.

[2] M. R. Anderberg, Cluster Analysis for Applications. Academic Press, New York, 1973.

[3] J. A. Hartigan, Clustering Algorithms. Wiley, New York, 1975.

[4] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.

[5] H. H. Bock and E. Diday, Symbolic Objects. In Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Heidelberg, 2000.

[6] L. Billard and E. Diday, Symbolic Data Analysis. Conceptual Statistics and Data Mining. Wiley Series in Computational Statistics. Wiley, 2006.

[7] E. Diday and M. Noirhomme-Fraiture, Symbolic Data Analysis and the SODAS Software. John Wiley & Sons, Ltd., 2008.

[8] J. H. Ward, Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, volume 58, 1963, pp. 236–244.

[9] S. Korenjak-Černe and V. Batagelj, Symbolic clustering of large datasets. In Classification, Clustering and Data Analysis: Recent Advances and Applications (Studies in Classification, Data Analysis, and Knowledge Organization). Springer, Berlin, Heidelberg, New York, 2002, pp. 319–327.

[10] S. Korenjak-Černe, Symbolic Data Analysis Approach to Clustering Large Ego-Centered Networks. Metodološki zvezki 17: Developments in Statistics, FDV Ljubljana, 2002, pp. 45–53.

[11] B. Japelj Pavešič and S. Korenjak-Černe, Differences in teaching and learning mathematics in classes over the world: the application of adapted leaders clustering method. In Proceedings of the IRC-2004: IEA International Research Conference. Nicosia: University of Cyprus, Department of Education, 2004, pp. 85–101.

[12] B. Japelj Pavešič, Comparison of teaching mathematics: An application of the Adapted Leaders Clustering Method. Presented at the international conference IFCS 2006 in Ljubljana.

[13] N. Kejžar, S. Korenjak-Černe, and V. Batagelj, Clustering of temporal citation profiles. Preprint. Presented at SUNBELT 2007, Corfu Island, Greece, and at COMPSTAT 2008, Porto, Portugal.

[14] C. Müller, B. Wellman, and A. Marin, How to use SPSS to study ego-centered networks. Bulletin de méthodologie sociologique, number 64, 1999, pp. 63–76.

[15] The TIMSS International Database. TIMSS & PIRLS International Study Center, Boston College, Lynch School of Education, USA. http://timss.bc.edu

[16] The Educational Research Institute, Slovenia. http://www.pei.si/

[17] E. Diday, Optimisation en classification automatique, Tome 1, 2. INRIA, Rocquencourt (in French), 1979.

[18] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, The analysis of a simple k-means clustering algorithm. In Proceedings of the Symposium on Computational Geometry, 2000, pp. 100–109.

[19] A. Ferligoj and V. Batagelj, Clustering with Relational Constraint. Psychometrika, volume 47, number 4, 1982, pp. 413–425.

[20] V. Batagelj, Agglomerative Methods in Clustering with Constraints. Presented at the Joint European Meeting of the Psychometric and Classification Societies, Jouy-en-Josas, France, July 6–8, 1983.

[21] V. Batagelj, Generalized Ward and related clustering problems. In Classification and Related Methods of Data Analysis, North-Holland, Amsterdam, 1988, pp. 67–74.

[22] G. N. Lance and W. T. Williams, A general theory of classificatory sorting strategies, 1. Hierarchical systems. The Computer Journal, volume 9, 1967, pp. 373–380.


[23] V. Batagelj, Note on Ultrametric Hierarchical Clustering Algorithms. Psychometrika, volume 46, number 3, 1981, pp. 351–352.
