
ARTICLE IN PRESS


Knowledge-Based Systems xx (xxxx) 1–10 www.elsevier.com/locate/knosys


Measuring similarity in feature space of knowledge entailed by two separate rule sets


Gursel Serpen *, Maheshkumar Sabhnani

Electrical Engineering and Computer Science, The University of Toledo, 2801 West Bancroft, MS308, Toledo, OH 43606, USA

* Corresponding author. Tel.: +1 419 530 8158; fax: +1 419 530 8146. E-mail address: [email protected] (G. Serpen).

Received 1 August 2003; accepted 10 November 2003

Abstract

This paper addresses the task of comparing two rule sets induced within the same feature space for measuring the knowledge entailed jointly by the two. A procedure that quantifies the similarity of knowledge entailed by two separate rule sets in a given feature space is proposed. A formalized description of the proposed procedure along with its computational complexity analysis, applicability and utility is presented. Application of the proposed procedure is demonstrated using two rule sets from the computer security domain. © 2005 Elsevier B.V. All rights reserved.

Keywords: Measuring knowledge similarity; Learned rule set; Rule set post analysis; Rule set validation; Computer security

0950-7051/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.knosys.2003.11.001

1. Introduction

Rule-based systems may utilize automated techniques, human expertise, or a combination of both to develop their knowledge bases in the form of sets of rules. In problem domains where human expertise exists and can be captured through a set of rules formulated by the human expert, it might be desirable to compare learned rules (i.e. those obtained through a rule induction algorithm [1–3]) against those formulated by a human expert to cross-validate or assess the ‘relevance’ between corresponding rule sets. In other words, post-analysis of two rule sets to determine the degree of overlap among the knowledge entailed by each rule set might be needed. For instance, it is often necessary for a human expert to analyze the set of induced rules to see how well it matches the set of expected rules implicitly conceived by the same expert. In another instance, human experts are often charged with sifting through a large set of rules, which might have been induced from a data set through a machine learning algorithm, to ‘filter’ a much smaller and more compact rule set that faithfully represents the knowledge embedded in the original larger rule set. In the presence of hundreds or thousands of rules, most of which might be insignificant, a human expert must analyze the set of rules to determine the significant subset to be subsequently employed by a rule-based system.

Most of the relevant research reported in the literature concentrates on generation of induced rule sets from a given data set and implicitly assumes that the rule set generated can be directly utilized with little or no post-processing by either machines or humans. There appear to be very few reported studies in the literature on quantitatively measuring similarity between knowledge entailed by two or more separate rule sets. One noteworthy exception is by Liu and Hsu [4], who proposed a fuzzy matching technique for rule comparison in the context of classification of rules. They envisioned that a human expert, having conceptualized implicitly a set of rules, could compare his or her rule set with one generated through a rule induction algorithm. This comparison would be done to assess how similar the knowledge is and possibly to assess whether time has introduced any changes to the knowledge represented by the original data since the last learning episode. The technique by Liu and Hsu is promising, yet it requires the user to provide his or her beliefs in fuzzy terms, which is not necessarily a trivial process.

In another notable study, Adomavicius and Tuzhilin [5] introduced an approach for human expert driven validation of a very large number of rules in e-commerce applications. A human expert is expected to analyze a large set of rules using a set of validation operators to determine a subset of rules that faithfully represents the knowledge in the original, much larger rule set. The authors did not present a methodology to measure how well the reduced subset of rules represents the original knowledge in the raw rule set: they defer this issue to the skills of the human expert who makes the decisions to validate the rules.

Another closely aligned field of study is filtering or ranking a mined rule set through a measure of ‘interestingness’ or usefulness, as recently discussed by Liu et al. [6,7]. The idea is to extract a smaller subset of rules from a given large set of rules through techniques that measure the computable utility of a given rule, which is somewhat like computing the principal components (dimensionality reduction). Once the smaller subset is computed, it is conceivable that one might be interested in precisely computing the knowledge entailed by the original set and comparing it with the knowledge entailed by the smaller subset to determine how much ‘knowledge loss’ occurred in the process of ‘compression’.

The perceived void in the current literature, which suggests a need for techniques to measure the correlation between or the similarity in the knowledge entailed by two rule sets, serves as the rationale and basis for the current study. Therefore, the primary goal of this study is to develop quantitative procedures to compute the similarity of knowledge embedded in two separate rule sets.

The outline of the presentation is as follows. The proposed procedure for measuring the ‘similarity’ between the knowledge entailed by two rule sets is presented in Section 2. Application of the proposed procedure to a benchmark problem in the computer security domain is demonstrated in Section 3. Conclusions and future work are discussed in Section 4.

2. Proposed methodology

The feature or attribute space provides the medium for computing the ‘similarity’ between the knowledge entailed by two rule sets. The proposed technique relies upon the following fundamental idea:

Given two or more rule sets to be compared, map these rule sets to the feature or attribute space, compute the subspaces covered by each rule set (which represent the ‘knowledge’ entailed by this rule set), and finally compute the overlap of these subspaces. The volume of the overlap subspace in relation to the actual subspaces covered by rule sets suggests the ‘similarity’ of knowledge represented by these rule sets.

The proposed technique will be explained using an example in a two-dimensional feature space. Assume two rule sets, $R_1$ and $R_2$, for a two-class problem ($C_1$ for class 1 and $C_2$ for class 2, where $C_1 \cup C_2$ is the universe of discourse and $C_1 \cap C_2 = \emptyset$, suggesting that $C_1$ and $C_2$ are partitions of the two-dimensional feature space) are as shown in Table 1. Further assume that there are two features, $f_1$ and $f_2$, with associated ranges given by the intervals [0, 50] and [0.0, 3.0], respectively, while noting that the first feature is of type integer and the second feature is of type real. Rule set 1 consists of two rules: $r_1^1$ (rule 1 of rule set 1) and $r_2^1$. Similarly, rule set 2 consists of two rules: $r_1^2$ and $r_2^2$. Additionally, there is an implicit rule that defines the subspace belonging to $C_2$ for both rule sets: it is the subspace not covered by either of the two rules in each rule set. The feature space is two-dimensional and occupies a portion of the first quadrant in the Cartesian coordinate system. The partitioning of the overall feature space as induced by rule set 1 is shown in Fig. 1(a), where there are two types of subspaces: one covered by rules in the rule set (subspace belonging to class 1) and a second one that is the complement of the first (subspace belonging to class 2). The first rule $r_1^1$ covers the subspace bounded by $0 \le f_1 \le 15$ and the second rule $r_2^1$ covers the subspace bounded by $(35 < f_1 \le 50) \land (f_2 \le 1.5)$, which collectively form the subspace covered by rule set 1. Similarly, subspaces covered by rule set 2 are shown in Fig. 1(b).

Table 1
Example rule sets

Rule set 1: $R_1$
  Rule 1: IF $f_1 \le 15$ THEN $C_1$
  Rule 2: IF $f_1 > 35 \land f_2 \le 1.5$ THEN $C_1$
Rule set 2: $R_2$
  Rule 1: IF $f_1 > 15 \land f_2 \le 1.5$ THEN $C_1$
  Rule 2: IF $f_1 \le 15 \land f_2 > 2.7$ THEN $C_1$

Fig. 1. Subspaces covered by rule sets in feature space: (a) subspace covered by rule set 1; (b) subspace covered by rule set 2. Horizontal axis: $f_1$ (ticks at 0, 15, 35, 50); vertical axis: $f_2$ (ticks at 0.0, 1.5, 2.7, 3.0).

Fig. 2. Partitioned feature space for example problem. Partitions P1–P9 form a 3 × 3 grid; horizontal axis: $f_1$ (ticks at 0, 15, 35, 50); vertical axis: $f_2$ (ticks at 0.0, 1.5, 2.7, 3.0).

The technique initially requires an exhaustive scanning of both rule sets in order to determine the set of features and associated threshold values. It next partitions the complete feature space into a number of subspaces. The partitioning along each feature axis is implemented through the set of threshold values for that feature. This partitioning procedure ensures that any subspace is covered in exactly one of four possible ways as follows:

(a) subspace covered by rule set 1 only,
(b) subspace covered by rule set 2 only,
(c) subspace covered by both rule sets jointly, or
(d) subspace covered by none of the rule sets.

For the current example, Fig. 2 shows the various subspaces labeled P1–P9 in the complete feature space. The threshold values, including the minimum and maximum values for each feature, establish the boundaries of each partition. As an example, feature 1 appears in all four rules with the set of threshold values {15, 35}. These values, along with the boundaries of the range of this feature, give {0, 15, 35, 50} and establish the boundaries of partitions (which are areas for this example but hyper-volumes in the general case) through lines (hyperplanes in high-dimensional spaces) perpendicular to the associated feature axis. Likewise, given the threshold set {0.0, 1.5, 2.7, 3.0} for feature 2, the partitions appearing in Fig. 2 materialize for the four rules in the two rule sets. Since partitions P1, P3, P4 and P7 are covered by rule set 1, and partitions P2, P3 and P7 by rule set 2, partitions P3 and P7 form the overlap subspace covered by both rule sets.

In general terms, for each subspace in the feature space, the proposed technique first checks whether rule set 1 covers this subspace. Next, it checks whether rule set 2 covers this subspace. Finally, it checks for overlap. The results are shown in Table 2, where the entire range for each feature is scaled to the unit interval [0.0, 1.0]. Results in Table 2 indicate that rule sets 1 and 2 cover 45% and 38% of the feature space, respectively, while the total overlap in the feature space is 18%. The discrepancy in size between the subspaces covered by the two rule sets is noticeable yet not very large, suggesting a difference in the knowledge entailed by these two rule sets. The fact that the size of the overlap subspace is approximately half of each subspace covered by the two rule sets indicates that the knowledge entailed by these two rule sets is somewhat correlated or similar, with notable dissimilarity.

Table 2
Progression of computation for hyper-volume of subspaces

Scanned     Partition   Coverage by   Subspace volume   Subspace volume   Overlap subspace
partition   volume      rule sets     for R1            for R2            volume
P1          0.15        R1            0.15              –                 –
P2          0.20        R2            –                 0.20              –
P3          0.15        R1 ∧ R2       0.15              0.15              0.15
P4          0.12        R1            0.12              –                 –
P5          0.16        –             –                 –                 –
P6          0.12        –             –                 –                 –
P7          0.03        R1 ∧ R2       0.03              0.03              0.03
P8          0.04        –             –                 –                 –
P9          0.03        –             –                 –                 –
Total       1.00                      0.45              0.38              0.18

Towards formulating a generalized and formal statement of the procedure for assessing the ‘similarity’ of knowledge represented by two rule sets, adoption of the following notation will prove to be useful:

(a) the subspace covered by rule set 1, which denotes the knowledge entailed by rule set 1, is represented by $S_{R_1}$,
(b) the subspace covered by rule set 2, which signifies the knowledge entailed by rule set 2, is represented by $S_{R_2}$, and
(c) the overlap subspace, which stands for the measure of ‘similarity’ of knowledge entailed by the two rule sets, is represented by $S_{R_1 \cap R_2}$.

The ideal case of perfect ‘similarity’ of knowledge represented by the two rule sets is given by $S_{R_1} \equiv S_{R_2} \equiv S_{R_1 \cap R_2}$. However, in general, equality will not hold and an interpretation will be required to assess the ‘degree of similarity’.
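The worked example of Tables 1 and 2 can be reproduced with a short script. The following is only a minimal sketch in Python, not the authors' implementation: the dictionary-based rule encoding, the half-open interval convention, and the names `RANGES`, `thresholds`, `covers` and `similarity_volumes` are assumptions introduced here for illustration. Each rule is stored as a mapping from a feature to an interval, and a partition counts as covered when it lies entirely inside at least one rule of the rule set.

```python
from itertools import product

# Feature ranges of the worked example: f1 in [0, 50], f2 in [0.0, 3.0].
RANGES = {"f1": (0.0, 50.0), "f2": (0.0, 3.0)}

# Assumed encoding: each rule maps a feature to an interval (low, high);
# a feature absent from a rule is unconstrained. Covered space is class C1.
R1 = [{"f1": (0.0, 15.0)}, {"f1": (35.0, 50.0), "f2": (0.0, 1.5)}]
R2 = [{"f1": (15.0, 50.0), "f2": (0.0, 1.5)}, {"f1": (0.0, 15.0), "f2": (2.7, 3.0)}]


def thresholds(feature, rule_sets):
    """Sorted cut points of one feature: range bounds plus all rule bounds."""
    cuts = set(RANGES[feature])
    for rules in rule_sets:
        for rule in rules:
            if feature in rule:
                cuts.update(rule[feature])
    return sorted(cuts)


def covers(rules, cell):
    """True if the partition `cell` lies inside at least one rule."""
    return any(
        all(rule.get(f, RANGES[f])[0] <= lo and hi <= rule.get(f, RANGES[f])[1]
            for f, (lo, hi) in cell.items())
        for rule in rules
    )


def similarity_volumes(r1, r2):
    """Normalized volumes covered by r1, by r2, and by both jointly."""
    features = sorted(RANGES)
    cuts = {f: thresholds(f, (r1, r2)) for f in features}
    intervals = [list(zip(cuts[f][:-1], cuts[f][1:])) for f in features]
    vol_r1 = vol_r2 = vol_both = 0.0
    for combo in product(*intervals):          # one iteration per partition
        cell = dict(zip(features, combo))
        volume = 1.0
        for f, (lo, hi) in cell.items():       # each axis scaled to [0, 1]
            volume *= (hi - lo) / (RANGES[f][1] - RANGES[f][0])
        in1, in2 = covers(r1, cell), covers(r2, cell)
        vol_r1 += volume * in1
        vol_r2 += volume * in2
        vol_both += volume * (in1 and in2)
    return vol_r1, vol_r2, vol_both


v1, v2, both = similarity_volumes(R1, R2)
print(round(v1, 2), round(v2, 2), round(both, 2))  # 0.45 0.38 0.18
```

Under these assumptions the totals agree with the last row of Table 2. Because the partition boundaries are built from every threshold in both rule sets, each partition is either fully inside or fully outside a rule, which is why a simple containment test suffices.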

2.1. Algorithmic description of generalized procedure

The algorithm for the proposed methodology can be generalized for two rule sets that are assumed to utilize an identical set of features or attributes. A high-level description of the proposed procedure is presented in Fig. 3 and is followed by a more rigorous treatment in detailed pseudocode format in Fig. 4. In Fig. 3, steps 1, 2, and 3 process rules to introduce missing attributes (without modifying the meaning of those rules) in order to facilitate mapping each and every rule to the common feature space.

1. Consider two separate rule sets belonging to a common feature space.
2. Determine the set of features appearing in all the rules in both rule sets.
3. Identify and incorporate missing features into rules.
4. Determine the set of thresholds along with the range for each feature.
5. Determine each and every partition subspace and compute its hyper-volume.
6. Determine if the partition subspace is covered by any rule set individually, jointly, or none.
7. Compute total hyper-volume of resultant subspaces covered by two rule sets separately as well as jointly.
8. Continue until all partition subspaces are processed.
9. Assess similarity of “knowledge” based on hyper-volumes of subspaces covered by each rule set separately and jointly.

Fig. 3. High-level description of generalized procedure.

Fig. 4. Pseudocode for generalized procedure.

Formally, assume $F_1$ and $F_2$ represent the sets of attributes (features) with associated ranges in $R_1$ and $R_2$, respectively. For any attribute $f_k \in F_1 \cup F_2$ with range $[a, b]$, where $a$ and $b$ are reals with $a \le b$, that is missing in rule $r_m$, expand rule $r_m$ with attribute $f_k$: perform a logical AND with the existing premises of $r_m$ using the following format for the missing attribute: $a \le f_k \le b$. Eventually, the set of attributes in any rule in all rule sets will include all of the attributes in $F_1 \cup F_2$. However, for features with non-numeric values, the process of introducing missing features into rules needs to be slightly altered: assuming that the $k$th feature $f_k$ takes on non-numeric values from the set $\{value_1, value_2, \ldots\}$, perform a logical AND with the existing premises of $r_m$ using the following format for the missing attribute: $(f_k = value_1 \lor f_k = value_2 \lor \cdots)$. Steps 4–8 implement the computations that lead to the values in Table 2. Step 9 leverages the computed values for hyper-volumes to develop an assessment of the degree of ‘similarity’.
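The expansion of rules with missing attributes can be illustrated with a small sketch. It is written in Python under assumptions that are not part of the original text: rules are dictionaries keyed by feature name, numeric premises are `(a, b)` tuples standing for $a \le f_k \le b$, symbolic premises are sets of admissible values, and the names `expand_rule`, `NUMERIC_RANGES` and `SYMBOLIC_VALUES`, as well as the two-valued `flag` feature, are hypothetical.

```python
# Hypothetical feature catalogue (assumed for illustration only).
NUMERIC_RANGES = {"f1": (0, 50), "f2": (0.0, 3.0)}
SYMBOLIC_VALUES = {"flag": {"SF", "REJ"}}


def expand_rule(rule, numeric_ranges, symbolic_values):
    """Return a copy of `rule` in which every feature has an explicit premise.

    Existing premises are left untouched, so the rule's meaning is unchanged.
    """
    expanded = dict(rule)
    for feature, (a, b) in numeric_ranges.items():
        # Missing numeric feature: AND in the non-restrictive premise a <= f_k <= b.
        expanded.setdefault(feature, (a, b))
    for feature, values in symbolic_values.items():
        # Missing symbolic feature: AND in (f_k = value_1 OR f_k = value_2 OR ...).
        expanded.setdefault(feature, frozenset(values))
    return expanded


# Rule 1 of rule set 1 from Table 1 constrains only f1.
full = expand_rule({"f1": (0, 15)}, NUMERIC_RANGES, SYMBOLIC_VALUES)
print(full)
```

After expansion every rule mentions every feature, so the coverage test for a partition can iterate over the same feature list for all rules without special cases.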



2.2. Computational complexity analysis

The computation time of this procedure depends on various factors, including the number of dimensions in the feature space (as dictated by the feature count) and the number of thresholds for each feature (partitioning resolution). As observable in the pseudocode in Fig. 4, the dominant term for determining the computational cost is the nested loop structure that implements scanning along each feature axis at a resolution imposed by the set of thresholds for the same feature. Floating point operations (flops) performed within the innermost looping construct may be used to measure the computational cost of the proposed methodology reasonably accurately, and accordingly the following three observations can be made:

1. Attribute thresholds lead to a nested loop structure with the following computing cost: $|T_1|$ is the number of times the outermost loop is executed, $|T_2|$ is the number of times the next-to-outermost loop is executed, and so on, up to $|T_K|$, which is the number of times the innermost loop is executed. As a result, the number of times the overall loop structure is executed is given by $|T_1| \times |T_2| \times \cdots \times |T_K|$.
2. There are $K-1$ multiplications for computation of the hyper-volume of a partition subspace within the innermost loop.
3. Again within the innermost loop, there are two independent looping structures with repetition counts related to the cardinality of the rule sets: $|R_1|$ is the number of times the first loop is executed for all the rules in $R_1$, and $|R_2|$ is the number of times the second loop is executed for all the rules in $R_2$.

Therefore, the overall computational cost can be approximated through the bound given by

$$\text{flops} = O\big([\,|T_1| \times |T_2| \times \cdots \times |T_K|\,]\,[\,|R_1| + |R_2| + (K-1)\,]\big).$$

Let $\lambda$ represent the quantity $|R_1| + |R_2| + (K-1)$, which is a constant related to the sum of the cardinalities of the two rule sets and the number of features. Further letting $|T_j| = \max\{|T_1|, \ldots, |T_K|\}$ yields

$$\text{flops} = O(\lambda |T_j|^K).$$

The computational cost might be prohibitive for very high-dimensional feature spaces where the partitioning resolution is also relatively high. However, this computational cost profile still allows problems of reasonable size to be tackled by the proposed methodology, as demonstrated in the next section through a benchmark problem from the computer security literature.

2.3. Critique of proposed procedure: fundamental assumptions and limitations

The proposed procedure for measuring the similarity of knowledge entailed by two rule sets is inherently limited to a degree in its applicability and utility by a number of assumptions. Significant assumptions, along with their implications for the applicability and utility of the proposed procedure, are as follows:

• The procedure assumes a finite feature space. If the feature space is infinite, it will be necessary to incorporate a method of dealing with an infinite feature space.
• The procedure assumes uniform importance of all parts of the feature space. However, it is conceivable that the density of the instances in the feature space may not be uniform for certain problems. In such a case, the fact that two rule sets agree exactly on the results for a relatively small part of the feature space where, say, 99.9% of the instances are expected to lie would be more important than the volume-based comparison.
• The procedure assumes that the rules are (hyper) ‘rectangular’ in that all descriptions of conditions correspond to a rectangular volume bounded for each attribute by a minimum and a maximum value. This would not match the type of hyper-spherical regions considered by clustering algorithms in which the maximum distance from a central point is measured, taking into account all the attributes.
• Each rule set is assumed to be unambiguous and to cover the entire feature space. Thus, rules such as ‘if $x_1 > 3$ then $C$ = true’ and ‘if $x_2 > 2$ then $C$ = false’, which are ambiguous with regard to the region ‘$x_1 > 3 \land x_2 > 2$’, are assumed not to be present, or to have the ambiguity resolved by some other means such as rule ordering.
• Definition of ranges for each feature, which might not be well defined in all cases, might pose challenges since subspace computations and the similarity measurement computations rely heavily on such ranges. Expert domain knowledge (and, if available, access to the source data) might be needed to specify reasonable and realistic values for attribute ranges.
• The procedure assumes that the default value for any area not covered by a rule is ‘false’, although this, on rare occasions, might conceivably vary between rule sets.
• The procedure details pairwise comparison of two rule sets. Dealing with three or more rule sets may require additional decisions about whether each rule set should be compared to all others individually (brute force), to some type of union of all others, or to groups of them to reduce computational cost.

Against the backdrop of these outstanding issues, it is worth noting that the proposed procedure, in its current form, does offer nontrivial applicability and utility (as demonstrated on a benchmark problem in the computer security domain in the next section) and forms a viable basis for a more sophisticated algorithm through future improvements and enhancements.
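Returning to the cost bound of Section 2.2, the two expressions can be evaluated numerically to judge whether a given problem size is tractable. The sketch below is in Python and is an illustration only: the function name `flop_bound` is hypothetical, and it assumes that $|T_k|$ is supplied as the number of loop iterations (intervals) along feature axis $k$.

```python
from math import prod


def flop_bound(interval_counts, n_rules_1, n_rules_2):
    """Evaluate the two cost expressions of Section 2.2.

    interval_counts: assumed loop counts |T_1|, ..., |T_K|, one per feature.
    Returns (product-form bound, looser lambda * max|T_k|**K bound).
    """
    k = len(interval_counts)
    lam = n_rules_1 + n_rules_2 + (k - 1)       # lambda = |R1| + |R2| + (K - 1)
    product_form = prod(interval_counts) * lam  # [|T_1| x ... x |T_K|] * lambda
    max_form = lam * max(interval_counts) ** k  # lambda * |T_j|^K
    return product_form, max_form


# Two features with three intervals each and two rules per rule set,
# as in the example of Table 1.
print(flop_bound([3, 3], 2, 2))
```

The second value grows exponentially with the number of features, which is the reason the text cautions about very high-dimensional feature spaces with fine partitioning.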


3. Application in computer security

A benchmark problem from the domain of computer security will be employed to demonstrate application of the proposed procedure to measure the similarity of knowledge entailed by two rule sets. The two rule sets, also known as ‘KDD rule sets’, were derived from the KDD data set [8]; extensive prior empirical studies reported in the literature [9–12] indicate that these two rule sets do not entail similar knowledge. A brief discussion of the KDD data set is presented in the Appendix with references to the literature for further reading.

3.1. KDD rule sets

KDD rule sets utilized for demonstrating the application of the proposed procedure are presented in Tables 3 and 4. Table 3 shows the significant rules, partitioned into so-called testing and training rule sets, with 80% or more detection performance for the user-to-root (U2R) attack. The two rule sets for the remote-to-local (R2L) attacks with no less than 80% detection performance are presented in Table 4. The if–then rules in both tables, i.e. if precedent 1 ∧ precedent 2 ∧ … ∧ precedent M then antecedent, utilize the following template:

precedent 1
precedent 2
⋮
precedent M
antecedent

Table 3
Significant C4.5 rules to detect U2R attack class (training and testing rule sets)

service = other ∧ flag = REJ ∧ rerror_rate > 0.99 ∧ dst_host_rerror_rate ≤ 0.19 → U2R
service = ftp_data ∧ logged_in = 1 ∧ dst_bytes > 2931 ∧ dst_host_srv_diff_host_rate ≤ 0.03 → U2R
service = telnet ∧ hot > 0 ∧ root_shell = 1 ∧ is_hot_login = 0 ∧ dst_host_same_srv_port_rate ≤ 0.02 → U2R
service = ftp_data ∧ logged_in = 1 ∧ dst_bytes > 2931 ∧ count > 1 → U2R
service = other ∧ flag = SF ∧ dst_host_srv_rerror_rate > 0.61 → U2R
src_bytes ≤ 533 ∧ num_file_creations > 0 ∧ num_access_files ≤ 0 ∧ dst_host_same_srv_port_rate > 0.11 → U2R
srv_count ≤ 1 ∧ dst_host_diff_srv_rate ≤ 0.02 ∧ dst_host_rerror_rate ≤ 0.12 ∧ dst_host_srv_rerror_rate > 0.61 → U2R
dst_bytes ≤ 11972 ∧ root_shell = 1 ∧ hot ≤ 2 ∧ num_root > 1 ∧ num_file_creations > 0 → U2R
urgent > 0 ∧ root_shell = 1 → U2R
service = ftp ∧ is_guest_login = 0 ∧ src_bytes ≤ 2203 ∧ dst_bytes > 195 ∧ is_hot_login = 0 → U2R
num_file_creations > 0 ∧ same_srv_rate ≤ 0.41 ∧ dst_host_same_srv_rate > 0.89 → U2R
service = ftp_data ∧ dst_bytes > 195 → U2R
num_compromised > 1 ∧ root_shell = 0 ∧ num_root > 0 ∧ dst_host_same_srv_port_rate > 0.13 ∧ dst_host_srv_diff_host_rate ≤ 0.33 → U2R
service = telnet ∧ root_shell = 1 ∧ hot > 2 ∧ num_root ≤ 5 → U2R

Table 3 indicates that for the U2R category, a total of 24 out of 41 original features (as they existed in the KDD data sets) were used by all the rules listed. Among those 24 features, many appear in only one of the two rule sets. For example, features like logged_in, num_compromised, num_root, num_file_creations, num_access_files, count, same_srv_rate, dst_host_same_srv_rate, and dst_host_srv_diff_host_rate are present in the training rule set only. These features are not utilized by the testing rule set. There are a total of nine features present only in the training rule set and not in the testing rule set. Likewise, there are a total of nine features present only in the testing rule set. Hence, out of the 24 features used across both rule sets, only six features are common to both sets of rules.

Similar observations are applicable for the R2L attack category. Table 4 shows that for the R2L attack category, a total of 17 out of 41 features are used by either of the two rule sets. Four features are present only in the training rule set and not in the testing rule set. Similarly, six features are present only in the testing rule set. The remaining seven features are common to both rule sets. Range and type information for each feature employed in rules for a given attack class is presented in Table 5, where the range values are determined through analysis of the rules and the source data as well as heuristics incorporated by a domain expert in computer security.

3.2. Application of proposed procedure


Information required (i.e. thresholds along with the ranges in Table 5 for each feature) for computation of the number of subspaces for the U2R attack category is shown in Table 6. All entries in this table are based on two rule sets for the U2R attack category as given in Table 3. A ‘*’ entry under columns labeled by ‘Train’ and ‘Test’ represents that the corresponding feature is present in the training rule set or the testing rule set, respectively. ‘thresholds/attribute values’ column shows the different thresholds each feature assumes across all the rules in

676 677 678 679 680 681 682 683 684

ARTICLE IN PRESS

+ model


691 692

serviceZftp_data src_bytesO333 src_bytes!Z334 R2L

693 694 695 696 697

serviceZftp_data same_srv_rateO0.87 dst_host_same_srv_port_rateO0.99 dst_host_srv_diff_host_rateO 0.11 R2L

700 701 702 703 704 705

durationO88 is_guest_loginZ1 R2L

698 699

serviceZpop_3 src_bytesO11 src_bytes!Z37 R2L

logged_inZ1 durationO3 src_bytesO333 dst_bytes!Z63 dst_host_same_srv_port_rateO 0.99 R2L num_failed_loginsO0 dst_host_diff_srv_rate!Z0 R2L

706 707 708 709 710 711 712 713 714 715

serviceZftp_data src_bytesO11 src_bytes!Z14 dst_host_srv_count!Z34 R2L serviceZprivate dst_bytesO125 srv_count!Z1 dst_host_srv_count!Z252 dst_host_diff_srv_rate!Z 0.01 R2L Duration!Z2 src_bytesO11 src_bytes!Z133 dst_bytesO125 srv_count!Z3 dst_host_same_srv_rate!Z 0.99 R2L src_bytesO11 src_bytes!Z133 dst_bytesO52 srv_count!Z2 Same_srv_rateO0.58 srv_diff_host_rate!Z0.01 dst_host_diff_srv_rateO0 dst_host_same_srv_port_rate!Z 0 R2L

717 718 719 720 721 722 723

R

724 725

731 732 733 734 735 736 737 738 739 740 741

R

O

730

C

729

N

727 728

both rule sets. Similarly, Table 7 presents the thresholds for all features for R2L rules as depicted in Table 4. Details of how each feature axis is labeled with the values (thresholds) of the associated feature, especially in the presence of both symbolic and numeric features, and of how these labels are used in interval and subhyper-volume computations, are described below. There are two main types of features: symbolic (multi-valued or binary) and numeric (discrete multi-valued and continuous). For symbolic multi-valued features like ‘service’, the total number of values the feature can assume within the context of the KDD data sets was counted; for example, the ‘service’ feature assumes 71 different values. Additionally, the number of unique values used in a given rule set pair was counted: in the case of the R2L rule sets, three different values are employed for the ‘service’ feature, namely ftp_data, private, and pop_3.

Table 4 Significant C4.5 rules to detect R2L attack class (left column: training rule set, right column: testing rule set; the column assignment of the rules below is not recoverable)

duration > 144, service = ftp_data → R2L
service = private, duration ≤ 2, src_bytes > 40, src_bytes ≤ 104, count ≤ 5 → R2L
dst_bytes ≤ 3299, hot > 25 → R2L

Table 5 Type and range information for features for U2R and R2L attack classes

No  Feature name                  Type                     Range
1   duration                      Numeric: discrete        [0, 58329]
2   protocol_type
3   service                       Symbolic: multi-valued   71 values
4   flag                          Symbolic: multi-valued   11 values
5   src_bytes                     Numeric: discrete        [0, 1379963888]
6   dst_bytes                     Numeric: discrete        [0, 1309937401]
7   land
8   wrong_fragment
9   urgent                        Numeric: discrete        [0, 15]
10  hot                           Numeric: discrete        [0, 101]
11  num_failed_login              Numeric: discrete        [0, 6]
12  logged_in                     Symbolic: binary         {0, 1}
13  num_compromised               Numeric: discrete        [0, 7479]
14  root_shell                    Symbolic: binary         {0, 1}
15  su_attempted
16  num_root                      Numeric: discrete        [0, 7469]
17  num_file_creations            Numeric: discrete        [0, 101]
18  num_shells
19  num_access_files              Numeric: discrete        [0, 10]
20  num_outbound_cmds
21  is_hot_login                  Symbolic: binary         {0, 1}
22  is_guest_login                Symbolic: binary         {0, 1}
23  count                         Numeric: discrete        [0, 511]
24  srv_count                     Numeric: discrete        [0, 511]
25  serror_rate
26  srv_error_rate
27  rerror_rate                   Numeric: continuous      [0.0, 1.0]
28  srv_rerror_rate
29  same_srv_rate                 Numeric: continuous      [0.0, 1.0]
30  diff_srv_rate
31  srv_diff_host_rate            Numeric: continuous      [0.0, 1.0]
32  dst_host_count
33  dst_host_srv_count            Numeric: discrete        [0, 255]
34  dst_host_same_srv_rate        Numeric: continuous      [0.0, 1.0]
35  dst_host_diff_srv_rate        Numeric: continuous      [0.0, 1.0]
36  dst_host_same_srv_port_rate   Numeric: continuous      [0.0, 1.0]
37  dst_host_srv_diff_host_rate   Numeric: continuous      [0.0, 1.0]
38  dst_host_serror_rate
39  dst_host_srv_serror_rate
40  dst_host_rerror_rate          Numeric: continuous      [0.0, 1.0]
41  dst_host_srv_rerror_rate      Numeric: continuous      [0.0, 1.0]

Each value is then assumed to
represent an interval that is 1/71 of the entire range for this feature, which is normalized to the unit interval [0,1] for ease of subhyper-volume computations. Note that there are four intervals to be considered in subhyper-volume computations for this feature: three intervals, one associated with each feature value used, have a length of 1/71, and a fourth interval, which corresponds to the 68 unused values for this feature, has a length of 68/71. Symbolic features with binary values, like ‘logged_in’, are processed similarly to multi-valued symbolic features. The total number of intervals for a symbolic binary feature is then 2, where each value occupies one half of the entire range, which may also be normalized to the unit interval [0,1] for ease of subhyper-volume computations.
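The interval assignment described above for symbolic features can be illustrated with a minimal Python sketch. The function name `symbolic_interval_lengths` and the `UNUSED` placeholder label are hypothetical names introduced only for this illustration; the sketch assumes that every value of a symbolic feature receives an equal share of the unit interval and that all values not mentioned in the rule set pair are lumped into a single interval, as the text describes for the ‘service’ feature.

```python
from fractions import Fraction

UNUSED = "<unused>"  # hypothetical label for the lumped interval of unused values


def symbolic_interval_lengths(total_values, used_values):
    """Normalized interval lengths on the [0, 1] axis of a symbolic feature.

    total_values: number of values the feature can assume in the data set.
    used_values:  values that actually appear in the rule set pair.
    """
    used = list(dict.fromkeys(used_values))  # drop duplicates, keep order
    if len(used) > total_values:
        raise ValueError("more used values than the feature can assume")
    unit = Fraction(1, total_values)  # every value gets an equal share
    lengths = {value: unit for value in used}
    unused = total_values - len(used)
    if unused > 0:
        # All values never mentioned by the rules share one lumped interval.
        lengths[UNUSED] = unused * unit
    return lengths


# 'service': 71 possible values, three of them used by the R2L rule sets.
service = symbolic_interval_lengths(71, ["ftp_data", "private", "pop_3"])
# 'logged_in': binary feature, assuming both values are used by the rules.
logged_in = symbolic_interval_lengths(2, [0, 1])

print(service)
print(logged_in)
```

Exact fractions are used here only so that the interval lengths of a feature can be checked to sum to one without floating-point rounding.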

Table 6 Threshold sets for features in U2R attack category

No  Feature name                  Thresholds
3   service                       telnet, ftp_data
4   flag                          REJ, SF
5   src_bytes                     533, 2203
6   dst_bytes                     195, 2931
9   urgent                        0
10  hot                           0
12  logged_in                     1
13  num_compromised               1
14  root_shell                    0
16  num_root                      0
17  num_file_creations            0
19  num_access_files              0
21  is_hot_login                  0
22  is_guest_login                0
23  count                         1
24  srv_count                     1
27  rerror_rate                   0.99
29  same_srv_rate                 0.41
34  dst_host_same_srv_rate        0.89
35  dst_host_diff_srv_rate        0.02
36  dst_host_same_srv_port_rate   0.02
37  dst_host_srv_diff_host_rate   0.03
40  dst_host_rerror_rate          0.12
41  dst_host_srv_rerror_rate      0.61
Feature count: Train 15, Test 15 (per-feature Train/Test marks not recoverable)
Further threshold values whose rows are not recoverable: ftp, Other, 2, 11972, 1, 1, 5, 0.11, 0.33, 0.19, 0.13

Table 7 Threshold sets for features in R2L attack category

No  Feature name                  Thresholds
1   duration                      2, 3, 88, 144
3   service                       ftp_data, private, pop_3
5   src_bytes                     11, 14, 37, 40, 104, 133, 333, 334
6   dst_bytes                     52, 63, 125, 3299
10  hot                           25
11  num_failed_login              0
12  logged_in                     1
22  is_guest_login                1
23  count                         5
24  srv_count                     1
29  same_srv_rate                 0.58
31  srv_diff_host_rate            0.01
33  dst_host_srv_count            34
34  dst_host_same_srv_rate        0.99
35  dst_host_diff_srv_rate        0.0
36  dst_host_same_srv_port_rate   0.0
37  dst_host_srv_diff_host_rate   0.11
Feature count: Train 11, Test 13 (per-feature Train/Test marks not recoverable)
Further threshold values whose rows are not recoverable: 2, 0.87, 252, 3, 0.01, 0.99

Numeric features with discrete or continuous values require the following steps to compute the intervals and, consequently, the subhyper-volumes. Let the set of thresholds for the feature $f_k$, together with the lower and upper bounds of its range, be represented as $T_k = \{t^k_1, t^k_2, \ldots, t^k_{\varphi_k}\}$, where the ordering from left to right is ascending and $\varphi_k$ denotes the number of elements in $T_k$. Subintervals on the associated feature axis are then defined through the following partitioning:

$$[t^k_1, t^k_2],\ [t^k_2, t^k_3],\ [t^k_3, t^k_4],\ \ldots,\ [t^k_{\varphi_k - 1}, t^k_{\varphi_k}],$$

where there will be $\varphi_k - 1$ subintervals for the entire range of the feature $f_k$. The normalized length of each subinterval can be computed as follows:

$$\frac{t^k_{i+1} - t^k_i}{t^k_{\varphi_k} - t^k_1} \quad \text{for } i = 1, \ldots, \varphi_k - 1.$$

Note that the hyper-volume of the entire feature space is normalized to a value of 1.0 once the interval lengths for all features are also normalized. This greatly simplifies the subhyper-volume computations and was implemented for the analysis reported herein. Results of applying the procedure presented in Fig. 4 to the rule sets for the R2L and U2R attack categories are presented in Table 8.

Table 8 Similarity computation results for U2R and R2L attack categories

Attack category   Training subspace (%)   Testing subspace (%)   Overlap subspace (%)
U2R               18.18                   46.67                  2.08
R2L               0.83                    49.92                  0.41

As Table 8 shows, the training rule sets for both attack categories, U2R and R2L, entail far less knowledge in the feature space than the testing rule sets do: the subspace size is 18% for the training and 46% for the testing rule set for the U2R attack class, while the same numbers are even more dramatically far apart for the R2L attack class (0.8 vs. 49%). The overlap subspace is only 2% of the entire feature space for the U2R attack category, although the training and testing rule sets cover 18 and 46% of it, respectively. The overlap subspace is even less significant, a mere 0.4%, for the R2L attack category. The subspace covered by the training rule set is approximately 0.8% of the entire feature space for the R2L class, which is very small compared to the approximately 50% of the feature space covered by the testing rule set. Given how small the knowledge entailed by the KDD training rule sets for both the U2R and R2L attack categories is, and how small the overlap subspace is, it is apparent why a machine learning algorithm cannot be trained as a misuse detector (pattern classifier) through the KDD data sets [12–14]. The subspace calculations for the U2R attack category took 2 h and 40 min on a Sun Ultra Workstation with a 500 MHz processor and 256 MB RAM running Solaris 2.7. On the same machine, the subspace calculations for the R2L attack category required 3 min and 17 s.

4. Conclusions and future work

A procedure was proposed for comparing two rule sets to assess the similarity of the knowledge they entail. The procedure quantitatively measures the ‘similarity’ of knowledge entailed by two rule sets in the feature space. A formal framework for the proposed procedure was presented, and its application to a benchmark problem in the computer security domain was demonstrated. In the context of knowledge-based systems, quantitative comparison of two rule sets for the knowledge entailed by each can be performed to assess whether the two rule sets are ‘equivalent’. The proposed algorithm, in its current form, is the result of an initial attempt to address the knowledge similarity measurement task and accordingly lends itself to improvement in a multitude of ways. The fundamental assumptions and limitations pertaining to the applicability and utility of the proposed procedure, as detailed in Section 2.3, also indicate potential improvements.

Appendix. KDD data set: user-to-root and remote-to-local attacks

In 1998, the United States Defense Advanced Research Projects Agency (DARPA) funded an ‘intrusion detection evaluation program (IDEP)’ administered by the Lincoln Laboratory at the Massachusetts Institute of Technology. The goal of this program was to build a data set that would help evaluate different intrusion detection systems (IDS) in order to assess their strengths and weaknesses. The objective was to survey and evaluate research in the field of intrusion detection. The computer network topology employed for the IDEP program involved two subnetworks: an ‘inside’ network consisting of victim machines and an ‘outside’ network consisting of simulated real-world Internet traffic. The victim machines ran Linux, SunOS™, and Solaris™ operating systems. Seven weeks of training data and two weeks of testing data were collected. Testing data contained a total of 38 attacks, 14 of which did not exist in the training data. This was done to facilitate the evaluation of potential IDSs with respect to their anomaly detection performance. Three kinds of data were collected: transmission control protocol (TCP) packets using the ‘tcpdump’ utility, basic security module (BSM) audit records using the Sun Solaris™ BSM utility, and system file dumps. This data set is popularly known as the DARPA 1998 data set [15].

One of the participants in the 1998 DARPA IDEP [9,10] used only TCP packets to build a processed version of the DARPA 1998 data set [15]. This data set, named in the literature as the KDD intrusion detection data set [8], was used for the 1999 KDD cup competition, which allowed participants to employ it for developing IDSs. The KDD data set was subsequently submitted to the ‘Knowledge Discovery in Databases’ archive at the University of California at Irvine, and consists of approximately 5 million training and 0.3 million testing records. Both training and testing data subsets cover four major attack categories: probing (information gathering attacks), denial-of-service (denial of legitimate requests to a system), user-to-root (unauthorized access to local super-user or root), and remote-to-local (unauthorized local access from a remote machine). Each record consists of 41 features [11], where 38 are numeric and 3 are symbolic, defined to characterize individual TCP sessions. Data mining techniques and domain knowledge were utilized to formulate features for different connections using the TCP packets [9,10].

A user-to-root (U2R) attack is characterized by a process whereby a normal system user illegally gains access to super-user privileges. Generally, a system defect or bug is exploited to execute a successful privilege transition from user level to root level. Buffer overflows are the most common type of attack mechanism in this category. Other U2R attacks take advantage of root programs that do not manage temporary files in the system properly. Some U2R attacks occur because of an exploitable race condition in a single program, or in two or more programs executing concurrently. Though these defects or bugs can be patched relatively easily, any new attack with a previously unknown mechanism can result in insurmountable damage to the system, as the malicious user attains full control of the victim machine at the root level.

Generally, most machines are accessible over the network through the Internet, but only authorized users are intended to be able to access the machines remotely. A remote-to-local
(R2L) attack occurs when an attacker who does not have an account on the victim machine gains local access as a user of the victim machine by sending network packets through standard protocols like TCP/IP or user datagram protocol (UDP). There are many ways in which an R2L attack can be executed. Buffer overflow vulnerabilities in some networking programs like sendmail, imap, or named can result in local access on the victim. Attacks like dictionary, guest, ftp-write, and Xsnoop exploit system misconfigurations. Some attacks involve social engineering, an example being the Xlock attack, in which a Trojan horse program is used to initiate the screensaver on Solaris machines in order to capture the user’s password and send it to the attacker.

The KDD training data subset has 52 U2R and 1126 R2L records, while the testing data subset has 228 U2R and 16189 R2L records. Four new U2R attacks are present only in the KDD testing data subset, and records for these new attacks constitute around 80% of all U2R records in the testing data subset. Similarly, seven new R2L attacks are present only in the KDD testing data subset, and more than 60% of R2L records in the KDD testing data subset belong to these new R2L attacks.

References

[1] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishing, Los Altos, CA, 1992.
[2] P. Clark, T. Niblett, The CN2 induction algorithm, Journal of Machine Learning 3 (1992) 261–284.
[3] R. Michalski, Pattern recognition as rule-guided inductive inference, IEEE Transactions on Pattern Analysis and Machine Intelligence 2 (1980) 349–361.
[4] B. Liu, W. Hsu, Post-analysis of learned rules, in: Proceedings of the Thirteenth National Conference on Artificial Intelligence AAAI-96, AAAI Press/MIT Press, Menlo Park, CA, 1996, pp. 828–834.
[5] G. Adomavicius, A. Tuzhilin, Expert-driven Validation of Rule-based User Models in Personalization Applications, vol. 5, 2001, pp. 33–58.
[6] B. Liu, Y. Ma, C.-K. Wong, P.-S. Yu, Scoring the data using association rules, Applied Intelligence 18 (2) (2003) 119–135.
[7] B. Liu, Y. Ma, R. Lee, Analyzing the interestingness of association rules from the temporal dimension, IEEE International Conference on Data Mining (ICDM-2001), Silicon Valley, CA, December 2001, pp. 377–384.
[8] KDD data set, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999, cited September 2005.
[9] W. Lee, S.J. Stolfo, K.W. Mok, A data mining framework for building intrusion detection models, IEEE Symposium on Security and Privacy, Oakland, California, May 1999, pp. 120–132.
[10] W. Lee, S.J. Stolfo, K.W. Mok, Mining in a data-flow environment: experience in network intrusion detection, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, August 1999, pp. 114–124.
[11] I. Levin, KDD-99 classifier learning contest LLSoft’s results overview, SIGKDD Explorations, ACM SIGKDD 1 (2) (2000) 67–75.
[12] M.R. Sabhnani, G. Serpen, On failure of machine learning algorithms for detecting misuse in KDD intrusion detection data set, Intelligent Data Analysis 8 (4) (2004) 403–415.
[13] M.R. Sabhnani, G. Serpen, KDD feature set compliant heuristic rules for R2L attack detection, Proceedings of the International Conference on Security and Management, Las Vegas, NV, June 2003, pp. 310–316.
[14] M.R. Sabhnani, G. Serpen, Applying pattern recognition and machine learning algorithms to an intrusion detection dataset, Proceedings of International Conference on Machine Learning: Models, Technologies, and Applications, Las Vegas, NV, June 2003, pp. 209–215.
[15] DARPA data set, http://www.ll.mit.edu/IST/ideval/data/1998/1998_data_index.html, 1998, cited September 2005.