
A Web-based Interactive Data Visualization System for Outlier Subspace Analysis

Dong Liu, Qigang Gao
Computer Science, Dalhousie University, Halifax, NS, B3H 1W5 Canada
[email protected] [email protected]

Hai Wang
Sobey School of Business, Saint Mary’s University, Halifax, NS, B3H 3C3 Canada
[email protected]

Ji Zhang
Mathematics & Computing, University of Southern Queensland, Toowoomba, QLD, 4350 Australia
[email protected]

Abstract

Detecting outliers in high-dimensional data is a challenging task because outliers mainly reside in various low-dimensional subspaces of the data. To tackle this challenge, subspace-analysis-based outlier detection approaches have been proposed recently. Detecting the outlying subspaces in which a given data point is an outlier facilitates a better characterization process for detecting outliers in high-dimensional data streams, and makes outlier mining for large high-dimensional data sets more manageable. In this paper, to facilitate outlier subspace analysis from the perspective of human perception, in support of developing efficient solutions for high-dimensional data, we propose a web-based interactive data visualization system that displays various low-dimensional outlier subspaces so that users can observe and analyze the distributions of outliers. The proposed visualization tool can help developers of outlier detection applications directly examine the distributions of outliers in various low-dimensional subspaces to validate their experimental results.

1 Introduction

Outliers in a database or data stream are data objects that are grossly different from or inconsistent with the rest of the data, and that reflect abnormal behaviours in the real world. Outliers may stand for toxin spills in chemical sensor data, network intrusions in network log data, cancers in medical data, or simply errors or noise caused by human mistakes, sensor damage, etc. [11, 12, 13]. Outliers should be treated differently in different situations: error and noise outliers should be removed, whereas intrusion and cancer outliers are targets and should be detected for analysis and event prevention. In other situations, outliers must be detected and classified properly.

Traditional outlier detection methods are mainly designed using a whole-dimensionality analysis approach. They work well for low-dimensional data sets. However, more and more real applications now involve high-dimensional data. Detecting outliers in high-dimensional data is a challenging task: traditional methods become infeasible because of the curse of dimensionality, in that outliers hidden in low-dimensional subsets of the data disappear as the dimensionality increases when whole-dimensionality analysis methods are used [2]. The new strategy for dealing with high-dimensional data is to detect outliers in possible lower-dimensional subspaces of the high-dimensional data, as introduced in [1]. The idea is to convert the problem of outlier detection in the high-dimensional data space into the problem of detecting low-dimensional outlying subspaces, since an exhaustive search of all subspaces of a high-dimensional data space is not tractable. In this paper, we propose a data visualization system that facilitates analysis and solution development for projected outlier subspace finding, and that helps developers and users gain insight by observing the data distributions in various low-dimensional outlier subspaces of the data.

Visualization has proved to be a useful tool for data analysis. With the development of computer hardware and software, visualization techniques can use computer graphics to create visual images that aid in understanding complex, often massive representations of data. A number of visualization tools are available, such as SequoiaView [3], GGobi [6], OpenViz [7], VisuMap [8] and ADVIZOR [9]. Some tools, such as Manyeyes [4] and Drillet [5], are web-based systems, which makes them convenient to access for broad user groups. However, there is no data visualization system for directly analyzing projected outlier subspaces. In this paper, we present a visualization system for outlier subspace analysis whose features and interface tools are specially designed to support human observation and exploration of large-volume high-dimensional data, in order to gain insight into outlier detection on such complex data sets.

2 System Design and Implementation

The proposed visualization system is designed to support outlier analysis on high-dimensional data, so that human perception can play a role in gaining insight into outlier subspaces. It is based on the concept of the “Stream Projected Outlier Detector (SPOT)” [1]. In the SPOT system, the problem of detecting projected outliers from high-dimensional data streams is formulated as follows. Given a data stream $D$ with a potentially unbounded size of $\varphi$-dimensional data points, each data point $p_i = \{p_{i1}, p_{i2}, \ldots, p_{i\varphi}\}$ in $D$ will be labeled as either a projected outlier or a regular data point. If $p_i$ is a projected outlier, its associated outlying subspace(s) will be given as well. The results to be returned will be a set of projected outliers and their associated outlying subspace(s), which indicate the context in which these projected outliers exist. The results, denoted by $A$, can be formally expressed as $A = \{\langle o, S \rangle,\ o \in O \text{ and } S \text{ is the outlying subspace set of } o\}$, where $O$ denotes the set of outliers detected.

The visualization system aims to help users examine the detected outlying subspaces of a high-dimensional data set. Users can adjust the parameters of the outlier detection algorithms and visualize the intermediate detection results. A set of visualization tools is designed to support human exploration in projected outlier subspace analysis.

2.1 System Architecture

The architecture of the visualization system is illustrated in Figure 1. The data to be displayed can include both the original high-dimensional data set and the outlier detection results after data pre-processing, which includes the standard steps of data cleaning, data transformation and data normalization. Data cleaning removes incorrect records from the dataset. Data transformation corrects inconsistent data formats and converts continuous attribute values into a finite set of intervals with minimal loss of information. In data normalization, we find the minimum and maximum values of each dimension and convert each value to the range between 0 and 1.

Figure 1 System Architecture

For the prepared high-dimensional data, one data point may be considered an outlier in many subspaces; therefore the outlier detection result may be very large. To handle large outlier detection results, the system uses a database to store the datasets and the information on outlying subspaces. After the data preparation stage, the datasets and the outliers are stored in two tables in the database. By doing so, the database server can quickly retrieve the selected data and feed it into the visualization system for display. With the prepared data sets, the user can access the system over the Internet with a web browser. The system allows the user to select different subspaces and views to display. According to the user’s subspace selection, the system connects to the database server through JDBC and sends queries to it. The retrieved data and outlier information for the selected subspaces are transmitted to the client machine over the Internet and displayed in the user’s web browser.

The database and web application services are on the server side. On the client side, the user can access the web services and visualize data and outliers for the selected subspaces in the web browser. The system also allows the user to visualize different datasets by reading a data file, whose name is specified by the user, from the user’s local machine. The system is implemented in Java. The client machine needs J2SE 5 and Java 3D 1.5, or higher versions, installed to run the system.

Field        Type      Description
linenumber   int(11)   Primary Key. Row number of data.
valume1      double    Attribute 1
valum2       double    Attribute 2
...          ...       ...
valume15     double    Attribute 15

Table 1 Schema of Data in Database

Below is a sample of the first two detected outlying subspaces in the outlier detection result file.

    Outlierness Threshold: 3
    *****************************************
    Top outlier: data #1
    In subspace: 11
    Cell index: 1
    Outlier-ness: 3.3557
    Top outlier: data #2
    In subspace: 1 6
    Cell index: 15 6
    Outlier-ness: 3.57143
    ... ...
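
As an illustration of how the result listing shown above could be turned into rows of the outlier table (Table 2), the following Python sketch parses each record, sorts the subspace attributes in ascending order and pads unused dimensions with None (standing in for NULL). It is not part of the described Java system. The names OutlierRow and parse_result_listing are hypothetical, and the sketch assumes that the number after "data #" is the row number of the data point and that the "Cell index" lines are not stored.

```python
from typing import List, NamedTuple, Optional


class OutlierRow(NamedTuple):
    """One row of the outlier table; field names follow Table 2."""
    linenumber: int
    dimension1: int
    dimension2: Optional[int]
    dimension3: Optional[int]
    outlierness: float


def parse_result_listing(text: str) -> List[OutlierRow]:
    """Parse 'Top outlier / In subspace / Outlier-ness' records into rows."""
    rows: List[OutlierRow] = []
    linenumber: Optional[int] = None
    dims: List[int] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Top outlier:"):
            # Assumption: the number after '#' is the row number of the point.
            linenumber = int(line.split("#", 1)[1])
            dims = []
        elif line.startswith("In subspace:"):
            # Attribute indices of a subspace are kept in ascending order.
            dims = sorted(int(tok) for tok in line.split(":", 1)[1].split())
        elif line.startswith("Outlier-ness:") and linenumber is not None:
            if not 1 <= len(dims) <= 3:
                raise ValueError(f"expected 1 to 3 attributes, got {dims}")
            # Pad to three columns; None plays the role of SQL NULL.
            d1, d2, d3 = dims + [None] * (3 - len(dims))
            value = float(line.split(":", 1)[1])
            rows.append(OutlierRow(linenumber, d1, d2, d3, value))
            linenumber = None
        # Header, separator and 'Cell index' lines are skipped.
    return rows


SAMPLE = """Outlierness Threshold: 3
*****************************************
Top outlier: data #1
In subspace: 11
Cell index: 1
Outlier-ness: 3.3557
Top outlier: data #2
In subspace: 1 6
Cell index: 15 6
Outlier-ness: 3.57143
"""

rows = parse_result_listing(SAMPLE)
for row in rows:
    print(row)
```

Padding every record to three dimension columns keeps one fixed-width table for one-, two- and three-dimensional subspaces, at the cost of NULL checks in later queries.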

2.2 Synthetic Datasets

In the experiments, both synthetic and real data sets are used. The synthetic data is generated randomly by the high-dimensional data generator used in the SPOT research [1]. The nature of the data is close to that of real-life data: it exhibits different data characteristics in projections onto different subsets of features. It consists of 15 attributes and 10,000 lines of data. The outlier detection result produced directly by the SPOT method [1] consists of 426,513 outliers in one-dimensional to three-dimensional subspaces.

Since the outlier detection result contains only one-, two- and three-dimensional outlying subspaces, the corresponding data table and outlier table are created in the database. The detailed schema of the data table is given in Table 1, and that of the outlier table in Table 2. The attribute values of an outlying subspace are sorted in ascending order. For one-dimensional outlying subspaces, the values of dimension2 and dimension3 are NULL. Similarly, for two-dimensional outlying subspaces, the value of dimension3 is NULL. For three-dimensional outlying subspaces, none of the dimension values is NULL.

Field         Type      Description
id            int(11)   Primary Key; identifies each outlying subspace.
linenumber    int(11)   Row number of data.
dimension1    int(11)   Attribute 1 of outlying subspace.
dimension2    int(11)   Attribute 2 of outlying subspace.
dimension3    int(11)   Attribute 3 of outlying subspace.
outlierness   double    Outlierness of outlier.

Table 2 Schema of Outlier Information in Database

2.3 Real-life Datasets

The experiments also include a real-life data set, the KDD Cup 1999 data [10], which is a connection traffic log data set from MIT Lincoln Lab. It contains details of the connections in its network, such as protocol type, duration, service used and much related information. We use the first 5000 lines of the corrected data with labels for our visualization. In the pre-processing stage, we separate the label information from the dataset into a separate file. The label names are transformed into numbers: each type of network intrusion is mapped to one number. Four types of network intrusion (shown in Table 3) are labelled in the first 5000 lines of data. We use the number of the outlier type as the outlierness value. In this way, we can visualize the distribution of the different kinds of network intrusion.

Table 3 Label Mapping

3 Experiments and System Demonstration

The experiments are developed based on sample cases for both the synthetic datasets and the KDD 1999 network log data. The visualization system can help to answer questions about outlier detection. For example:

1. In a two-dimensional subspace of the synthetic datasets, is a selected outlier data point also an outlier in other two-dimensional subspaces?
2. What is the distribution of “smurf” network attacks in the KDD 1999 data?

Case 1: In a two-dimensional subspace, find out whether a selected outlier data point is also considered an outlier in other two-dimensional subspaces.

To answer this question, we visualize four two-dimensional subspaces (as shown in Figure 2): (Dim 4, Dim 6), (Dim 3, Dim 6), (Dim 12, Dim 10) and (Dim 2, Dim 4). We click one outlier (index #174) in subspace (Dim 4, Dim 6), and then click the “Concurrent” button in the other two-dimensional subspace display windows. We can easily observe that the outlier data point (index #174) in (Dim 4, Dim 6) is also considered an outlier in (Dim 3, Dim 6) and (Dim 2, Dim 4). Moreover, we may change the outlierness threshold by moving the slide bar in these two windows. We find that the outlierness value of the data point (index #174) is 19.37 in both (Dim 3, Dim 6) and (Dim 2, Dim 4).

Figure 2 Case 1: Two-Dimensional Subspaces Concurrent Display

Case 2: Visualize the distribution of “smurf” network attacks in the KDD 1999 data.

An example of visualizing the distribution of outliers in three-dimensional subspaces is shown in Figure 3. We find that the “smurf” network attacks mainly reside close together in the marked area of the selected three-dimensional subspace. Figure 4 is an example of using the concurrent display of two-dimensional subspaces. The system reports that the outlier selected in the subspace in the left window is also marked as an outlier in the other subspace in the right window.

Figure 3 Case 2: 3D Display

Figure 4 Case 2: 2D Concurrent Display
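
The “Concurrent” check of Case 1 can be pictured as a query against the outlier table of Table 2. The Python sketch below is only an approximation under stated assumptions: it uses an in-memory SQLite database instead of the system's JDBC connection to a database server, the table name outlier and the function other_outlying_subspaces are hypothetical, and the sample rows are illustrative. Only the value 19.37 for point #174 in (Dim 3, Dim 6) and (Dim 2, Dim 4) comes from the text; every other value is made up for the example.

```python
import sqlite3


def other_outlying_subspaces(conn, linenumber, selected, threshold):
    """List the other 2-D subspaces where the point reaches the threshold."""
    d1, d2 = sorted(selected)  # subspace attributes are stored ascending
    cur = conn.execute(
        """
        SELECT dimension1, dimension2, outlierness
        FROM outlier
        WHERE linenumber = ?
          AND dimension2 IS NOT NULL AND dimension3 IS NULL  -- 2-D rows only
          AND NOT (dimension1 = ? AND dimension2 = ?)        -- skip selected
          AND outlierness >= ?                               -- slide bar value
        ORDER BY dimension1, dimension2
        """,
        (linenumber, d1, d2, threshold),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE outlier (
        id INTEGER PRIMARY KEY,
        linenumber INTEGER NOT NULL,
        dimension1 INTEGER NOT NULL,
        dimension2 INTEGER,
        dimension3 INTEGER,
        outlierness REAL NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO outlier (linenumber, dimension1, dimension2, dimension3,"
    " outlierness) VALUES (?, ?, ?, ?, ?)",
    [
        (174, 4, 6, None, 12.0),    # hypothetical value for the clicked subspace
        (174, 3, 6, None, 19.37),   # value reported in Case 1
        (174, 2, 4, None, 19.37),   # value reported in Case 1
        (174, 2, 4, 6, 5.0),        # hypothetical 3-D row, must be ignored
        (58, 10, 12, None, 4.2),    # hypothetical row for a different point
    ],
)

print(other_outlying_subspaces(conn, 174, (4, 6), threshold=3.0))
```

Raising the threshold argument plays the role of moving the slide bar: subspaces whose outlierness falls below the chosen value drop out of the result.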

4 Conclusion and Future Work

The proposed web-based visualization system helps users observe subspaces of high-dimensional datasets interactively. The system enables the user to evaluate the performance of an outlier detection algorithm by visually verifying the correctness of the results, and to determine proper parameters for better outlier detection results. By visualizing datasets and their labelled results, the user can gain visual insight into the nature of the data distribution and the outlier distribution. The system is also useful for comparing the effectiveness of different algorithms. The user may also adjust the values of different parameters of the algorithms to compare the changes in performance. The system can currently visualize datasets and their labelled outlier information. It can interact with the user and help to explore the datasets and outlier subspaces. In future work, we may extend the system to allow users to label outliers directly in selected subspaces. Users may also manually adjust the outlierness value of selected outlier data points to observe the sensitivity of the data. Moreover, the system may be integrated with different outlier detection algorithms, such as the SPOT algorithm in [1].

5 References

[1] J. Zhang, Q. Gao and H. Wang. SPOT: A System for Detecting Projected Outliers from High-dimensional Data Streams. IEEE 24th International Conference on Data Engineering (ICDE’08), Cancun, Mexico, pp. 1628-1631, 2008.

[2] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.

[3] Data Visualization system: SequoiaView, http://www.win.tue.nl/sequoiaview/.

[4] Data Visualization system: Manyeyes, http://manyeyes.alphaworks.ibm.com/manyeyes/.

[5] Drillet Visual Tool for interactive data analysis, http://drillet.appspot.com/.

[6] Data Visualization system: GGobi, http://www.ggobi.org/.

[7] Data Visualization system: OpenViz, http://www.avs.com/.

[8] Data Visualization system: VisuMap, http://www.visumap.net/.

[9] Data Visualization system: ADVIZOR, http://www.advizorsolutions.com/default.htm.

[10] KDD data source: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

[11] B. Aleskerov, E. Freisleben and B. Rao. Cardwatch: A Neural Network Based Database Mining System for Credit Card Fraud Detection. Computational Intelligence for Financial Engineering (CIFEr), 1997.

[12] J. F. Costa. Reducing the Impact of Outliers in Ore Reserves Estimation. Mathematical Geology, 35(3), 2003.

[13] J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann Publishers, 2006.