Knowledge Discovery in Large Data Sets

Tiago Simas, Gabriel Silva, Bruno Miranda, Andre Moitinho, and Rita Ribeiro
AIP Conference Proceedings 1082, 196 (2008); doi: 10.1063/1.3059044
View online: http://dx.doi.org/10.1063/1.3059044
Published by AIP Publishing

This article is copyrighted as indicated in the article. Reuse of AIP content is subject to the terms at: http://scitation.aip.org/termsconditions. Downloaded to IP: 193.136.132.10 On: Fri, 23 Jan 2015 18:37:38

Knowledge Discovery in Large Data Sets

Tiago Simas*, Gabriel Silva*, Bruno Miranda*, Andre Moitinho^ and Rita Ribeiro*

*Uninova/CA3, Universidade Nova de Lisboa, Portugal
^SIM, Universidade de Lisboa, Portugal

Abstract. In this work we briefly address the problem of unsupervised classification on large datasets, of magnitude around 100,000,000 objects. The objects are variable objects, which are around 10% of the 1,000,000,000 astronomical objects that will be collected by the GAIA/ESA mission. We tested unsupervised classification algorithms on known datasets such as the OGLE and Hipparcos catalogs. Moreover, we are building several templates to represent the main classes of variable objects, as well as new classes, to build a synthetic dataset of this dimension. In the future we will run the GAIA satellite scanning law on these templates to obtain a testable large dataset.

Keywords: Variable Objects, Machine Learning, Knowledge Discovery, Exploratory Data Analysis
PACS: 90, 95.75.-z, 95.75.pq, 97.30.-b

INTRODUCTION

The objective of the computer science researchers of Uninova-CA3 is to support astronomers and other scientists by providing intelligent algorithms, techniques and methods, particularly from the knowledge discovery field, for determining catalogues' properties. We are selecting several knowledge discovery algorithms that can deal with large amounts of ill-known data (order of 10^8). The usual approach for data exploration is to use tools (algorithms) to analyze and visualize the data: examples of such tools are Principal Component Analysis (PCA), with time complexity O(nm), and Self Organizing Maps (SOM), with time complexity O(n log(n)). Capturing clusters in the data in order to classify it into groups is not an easy task for large datasets. There are many unsupervised classification algorithms; however, not all of them are suitable for large datasets. We found in our study that we have to choose algorithms with time complexity not larger than O(n log(n)). A preliminary set of algorithms is given in Table 1 [1]. In a preliminary study we applied PCA, SOM and DBSCAN to the OGLE [2][3] dataset, of size 10^4, to test the performance/accuracy of the algorithms before applying them to large datasets, of size around 10^8. We now intend to apply the other algorithms described in Table 1 to the same catalogues, to select the set of algorithms whose accuracy best fits our problem. After this we have to test whether they perform well on large datasets. To test the performance/accuracy of the algorithms on large datasets we intend to build a main synthetic dataset based on the GAIA scanning law. The reason is that most of the attributes that describe the objects come from analysis of photometric time series captured by telescopes (satellite or ground-based). In our case GAIA will be a satellite collecting photometric data with a given scanning law.
In order to capture most biases originated by the scanning law we intend

CP1082, Classification and Discovery in Large Astronomical Surveys, edited by C. A. L. Bailer-Jones © 2008 American Institute of Physics 978-0-7354-0613-1/08/$23.00


TABLE 1. Large-dataset clustering algorithms, based on [1]

Algorithm   Running time
k-means     O(n)
EM Alg.     O(n)
BIRCH       O(n)
mrkd-EM     O(n log(n))
DBSCAN      O(n log(n))
DENCLUE     O(n)
DBCLASD     O(n log(n))

The survey [1] also compares these algorithms on whether they estimate k, find arbitrary shapes, handle noise, need only one scan of the data, and are guaranteed to stop.

TABLE 2. Dataset attribute names [3]

Attribute name   Meaning
log-f1           log of the first frequency
log-f2           log of the second frequency
log-af1h1-t      log amplitude, first harmonic, first frequency
log-af1h2-t      log amplitude, second harmonic, first frequency
log-af1h3-t      log amplitude, third harmonic, first frequency
log-af1h4-t      log amplitude, fourth harmonic, first frequency
log-af2h1-t      log amplitude, first harmonic, second frequency
log-af2h2-t      log amplitude, second harmonic, second frequency
log-crf10        amplitude ratio between harmonics of the first frequency
pdf12            phase difference between harmonics of the first frequency
varrat           variance ratio before and after first-frequency subtraction
B-V              color index
V-I              color index

to create templates of given classes of variable stars and run the scanning law on them to create a synthetic data set.

PRELIMINARY RESULTS

In this section we present the results of PCA and SOM, and the accuracy of the DBSCAN algorithm.

Datasets

Using the OGLE dataset [2], Luis Sarro [3] selected a set of attributes that best represents the variable objects. These attributes are based on the photometric data collected by the telescopes; they are listed in Table 2. The dataset is classified into 10 classes of variable stars, shown in Table 3 together with the class distribution.
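The harmonic-amplitude attributes of Table 2 come from fitting a Fourier series to each light curve at its dominant frequencies. The exact recipe is described in [3]; the sketch below only illustrates the idea, assuming the frequency is already known and using a plain least-squares fit (the function name and toy signal are ours, not from the paper).

```python
import numpy as np

def harmonic_amplitudes(t, y, freq, n_harmonics=4):
    """Least-squares fit of a harmonic series at a known frequency:
    y(t) ~ c + sum_k [ a_k sin(2*pi*k*freq*t) + b_k cos(2*pi*k*freq*t) ].
    Returns the amplitude sqrt(a_k^2 + b_k^2) of each harmonic."""
    cols = []
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * freq * t))
        cols.append(np.cos(2 * np.pi * k * freq * t))
    A = np.column_stack(cols + [np.ones_like(t)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sines = coef[0:2 * n_harmonics:2]
    cosines = coef[1:2 * n_harmonics:2]
    return np.hypot(sines, cosines)

# Toy light curve with first frequency 2.0: first and second harmonics only.
t = np.linspace(0.0, 10.0, 500)
y = 0.5 * np.sin(2 * np.pi * 2.0 * t) + 0.1 * np.sin(2 * np.pi * 4.0 * t)

amps = harmonic_amplitudes(t, y, freq=2.0)
log_af1h1 = np.log10(amps[0])  # analogue of the log-af1h1-t attribute
```

The log of each fitted amplitude, and ratios between amplitudes of different harmonics, then play the role of the log-af*h*-t and log-crf10 attributes.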


TABLE 3. Class names and number of observations [3]

Class   Name      Observations
1       cep       1313
2       rrlyr     2558
3       lpv       2735
4       dmcep     71
5       ecl       2467
6       ell-ecl   80
7       ell       613
8       new ecl   162
9       ptcep     14
10      rrd       50

TABLE 4. Some results of DBSCAN

Number   K    Eps    Outliers (-1)   Cluster 1   Cluster 2   Cluster 3   Cluster 4
1        20   1.5    987             9076
2*       20   0.95   4168            1055        2593        2247
3        15   0.95   4168            1055        2593        2263
4        10   0.95   3979            1055        2593        189         2247

Before applying the algorithms PCA, SOM and DBSCAN, we normalized the data using the z-score,

    z = (x - mu) / sigma,    (1)

where mu is the mean and sigma the standard deviation of each attribute.
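Equation (1) applied column-wise can be sketched as follows (a minimal NumPy version; the function name is ours):

```python
import numpy as np

def z_score(X):
    """Column-wise z-score of Eq. (1): subtract the mean of each
    attribute and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Xn = z_score(X)  # every column now has mean 0 and unit variance
```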

PCA, SOM and DBSCAN

The first exploratory analysis done in this work was Principal Component Analysis [4]. We reduced our 13-dimensional space to 4 dimensions and then plotted several projections on the principal components. In figure 1 we present the projections on principal components 1 and 2, and also 1 and 3. In the PC1 vs PC2 projection we can distinguish 2 to 3 big clusters, and in the PC1 vs PC3 projection we can distinguish 4 clusters. Figure 2 shows how the classes are distributed in this space. The next visualization technique tested was Self Organizing Maps [4]. We can see in figure 3 that SOM identified 5 clusters. However, these 5 clusters are in reality a mixture of several classes, the more frequent classes being the ones identified with this algorithm. The results of DBSCAN [5] are presented in tables 4 and 5. In table 4 we show results for several tunings of the initial parameters K (minimum number of points in a neighborhood) and Eps (radius of the neighborhood). After analyzing the clusters obtained, we found that result 2 best fits our dataset.
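The PCA-then-DBSCAN pipeline described above can be sketched with scikit-learn; the two-blob data below is a stand-in of our own, not the OGLE set, and the eps/min_samples values simply mirror run 2 of Table 4:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Stand-in for the normalized 13-attribute data: two synthetic blobs.
X = np.vstack([rng.normal(0.0, 0.3, (200, 13)),
               rng.normal(2.0, 0.3, (200, 13))])

# Reduce the 13-dimensional space to 4 principal components.
X4 = PCA(n_components=4).fit_transform(X)

# DBSCAN: min_samples plays the role of K (minimum points in a
# neighborhood), eps is the neighborhood radius; label -1 marks outliers.
labels = DBSCAN(eps=0.95, min_samples=20).fit_predict(X4)
n_clusters = len(set(labels) - {-1})
```

Varying eps and min_samples and inspecting the resulting cluster sizes and outlier count is exactly the tuning exercise summarized in Table 4.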


FIGURE 1. PCA, non-supervised: projections on principal components 1 vs 2 and 1 vs 3

FIGURE 2. PCA supervised with 10 classes

DISCUSSION AND CONCLUSIONS

We found in this preliminary study that with PCA we can identify 4 main classes out of the 10 classes. However, as we can see in Table 3, these 4 classes correspond to the 4 most

TABLE 5. Result of DBSCAN number 2 (best) and its distribution over the classes

Class   Cluster 1   Cluster 2   Cluster 3
1       1055
2                   2482
3                   62
4                   1
5                               2094
6                               34
7                               117
8                   2           2
9
10                  46


FIGURE 3. SOM of the dataset

populated classes: classes 1, 2, 3 and 5. Classes such as 4, 6, 8, 9 and 10 are difficult to detect because they are poorly populated. DBSCAN captures 3 main classes: 1, 2 and 5. Classes 2 and 5, identified in clusters 2 and 3 respectively, are mixed with other classes (Table 5). However, as we have already seen with PCA, these other classes are the less populated ones, mixed in with the classes that have more representative points. With DBSCAN we were unable to detect class 3. A possible explanation can be found in the PCA results: in figure 2 we see that the cluster containing class 3 is less dense than the others, and therefore it is not captured by DBSCAN. From the results obtained so far, the analyzed methods, PCA, SOM and DBSCAN, are good candidates for exploratory data analysis of large datasets.

FUTURE WORK

We will carry out the same study with BIRCH, mrkd-EM, DENCLUE and DBCLASD, and apply these algorithms to a synthetic data set with a scale of 10^8.
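The synthetic dataset will be built by evaluating variable-star templates at the epochs dictated by the GAIA scanning law. As a toy illustration, the sketch below samples a purely sinusoidal template at irregular epochs; the uniformly random epochs are our stand-in, since the real scanning law is a deterministic schedule, and all names and values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def observe_template(period_days, amplitude_mag, t_obs, noise_sigma=0.01):
    """Evaluate a sinusoidal variable-star template at the epochs given by
    a scanning law, adding Gaussian photometric noise."""
    phase = 2.0 * np.pi * t_obs / period_days
    return amplitude_mag * np.sin(phase) + rng.normal(0.0, noise_sigma, t_obs.size)

# Stand-in scanning law: 80 irregular epochs over a 5-year mission.
t_obs = np.sort(rng.uniform(0.0, 5.0 * 365.25, 80))
mags = observe_template(period_days=0.56, amplitude_mag=0.3, t_obs=t_obs)
```

Running many templates through the actual scanning law, instead of this random stand-in, is what lets the synthetic dataset carry the sampling biases discussed in the introduction.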

REFERENCES

1. D. P. Mercer, Clustering large datasets, Tech. rep., Linacre College (2003).
2. I. Soszynski, A. Udalski, M. Szymanski, et al., VizieR On-line Data Catalog, J/other/AcA/53.93 (2003).
3. P. Herrera and L. M. Sarro, Assessment of the validity of Autoclass for unsupervised classification, Tech. rep., UNED (2008).
4. W. L. Martinez and A. R. Martinez, Exploratory Data Analysis with MATLAB, Chapman and Hall/CRC, New York, 2004.
5. G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications, SIAM, 2007.
