IP Traffic Classification for QoS Guarantees: the Independence of Packets

Maurizio Dusi, Francesco Gringoli, Luca Salgarelli
DEA, Università degli Studi di Brescia, via Branze, 38, 25123 Brescia, Italy
E-mail: @ing.unibs.it

Abstract— The classification of IP flows according to the application that generated them has become a popular research subject in the last few years. Several recent papers based their studies on the analysis of features of flows such as the packet size and inter-arrival time, which are then used as input to classification techniques derived from various scientific areas such as pattern recognition. In this paper we analyze the impact on flow classification of a hypothesis that is often overlooked, i.e., the tenet that the features of consecutive packets of a given IP flow can be considered statistically independent. We compare two approaches, one based on a technique that considers consecutive packets statistically independent, and one that relies on the opposite assumption. These techniques are then applied to three different sets of traffic traces. Experimental results show that while assuming the independence of consecutive packets has relatively few effects on true positives, it can have a significant negative impact on the false positive and true negative rates, therefore lowering the precision of the classification process.

I. INTRODUCTION

Accurate classification of traffic according to the applications that generated it is at the basis of all network mechanisms that implement Quality of Service (QoS) guarantees. The assignment of a given IP flow to the correct traffic class is necessary before any QoS mechanism can take place, be it based on advanced queueing policies, traffic engineering, or other types of techniques. Historically, traffic classification techniques have been based on transport port numbers, or on the inspection of packet payloads. Both of these techniques are rapidly becoming ineffective, because of the emergence of applications that do not use standard ports, such as Peer-to-Peer (P2P), and because of the resource-intensiveness of payload inspection techniques, especially when applied to high-speed links. In the last few years several research groups have published the results of their research efforts in designing behavioral-based classification techniques. These mechanisms rely on the analysis of statistical features of traffic flows, rather than on deterministic properties such as port numbers or the presence of specific patterns in the packet payloads. The features used in these new techniques are usually a combination of packet size, inter-arrival times, flow duration, etc. We have recently published a few papers on the subject, including [1]. Other approaches use just the packet size as a feature: see for example [2]. All of these behavioral techniques, as well as

This work was supported in part by a grant from the Italian Ministry for University and Research (MIUR), under the PRIN project RECIPE.

the ones that we will refer to in Section II, have at least two common goals: (i) real-time operation even on fast links, which requires lightweight classification algorithms, and (ii) the ability to classify flows as soon as the first few packets are seen. A common technique used to reduce the complexity of the classifiers is to consider consecutive packets of each flow as statistically independent. This assumption, although justified by favorable experimental results, might have a significant impact on the precision of the classifiers. In this paper we analyze the extent of such impact, by comparing two classification approaches that differ only in the way they treat the statistical variables that represent the flow's features. In one case, the classifier builds its training information packet by packet, without considering joint information between consecutive packets. In the other case, the classifier builds multi-packet fingerprints, jointly evaluating the information contained in packet trains. Experimental results of the application of the two approaches to three data sets show that assuming independence between consecutive packets has no dramatic effect on the reliability of the classifier with respect to true positive rates, i.e., on its ability to recognize traffic for which it has received training. However, experiments also show significantly better results for the version of the classifier that jointly considers consecutive packets, especially when dealing with some types of false positives, e.g., with the classification of types of traffic that the classifier has received no training for. These results can be useful to precisely tune classification algorithms with respect to the balance between precision and computational complexity: where the precision of the algorithm in terms of false positives is not an issue, one might choose classifiers that treat consecutive packets as independent.
On the other hand, when the objective is minimizing false positives caused by traffic classes other than the ones included in the training set, classification techniques that consider the dependencies within multi-packet trains should be preferred. The rest of the paper is organized as follows: in Section II we review related work. We present our algorithms in Section III, where we introduce the features we capture and the training and evaluation phases, along with the tuning of the parameters involved. Section IV presents the classification results we obtained on three different network traces captured

in the access and core networks. Section V concludes the paper.

II. RELATED WORK

The pioneering studies by Paxson [3] on the statistical characterization of Internet traffic stimulated the development of new approaches to the problem of traffic classification. Since then, several encouraging results have been reported (see [4], [5], [6], to cite a few), showing that Machine Learning (ML) techniques can overcome the increasing limitations of classical approaches based on port analysis and packet inspection. Other recent works, such as the ones by Roughan et al. [7], McGregor et al. [8] and Li et al. [9], persuaded the scientific community to accept the idea that the behavior of network traffic is so heavily influenced by the underlying application protocols as to enable the separation of flows from an aggregate by observing a very limited set of features, using techniques such as Nearest Neighbor, Linear Discriminant Analysis and Support Vector Machines. Thanks to their work, several other ML approaches are now being considered to investigate traffic similarities and to determine, by means of supervised or unsupervised algorithms, which features should be selected to divide flows into clusters depending on the application that generated them: e.g., Williams et al. [10] focused their contributions on several ML approaches, including Naive Bayes, Bayesian Networks, Nearest Neighbor, Multilayer Perceptron Networks, Sequential Minimal Optimization, and Support Vector Machines. Bernaille et al. in [2] presented a classification approach based on Gaussian Mixture Models and Hidden Markov Model techniques to assign flows to clusters: they show how to set up a classification procedure that uses the determined clusters to assign a new flow to the application that labels the matched model.
More interesting is the fact that the presented results were obtained by mapping flows to an n-dimensional space depending on features such as the size and the direction of their first n packets. Similarly to this last work, we have recently introduced a supervised approach based on the notion of protocol fingerprints [1], simple and compact objects that provide a statistical behavioral description of the protocol they model. In that paper we described how we set up a classification experiment based on the probability density estimation (the fingerprint) of the pair of joint variables packet size and inter-arrival time. To obtain a simple classification algorithm that could be easily implemented on real networks, we decided to characterize the behavior of each packet inside a flow separately, so as to reduce the complexity of the mechanism. We left the evaluation of the impact of estimating multi-packet trains jointly to future work, which stimulated the analysis presented in this paper.

III. STATISTICAL APPROACHES TO TRAFFIC CLASSIFICATION: THE IMPACT OF JOINT ESTIMATION

In this paper we compare two simple statistical mechanisms aimed at detecting the application layer protocols behind

a given TCP session. Following recent approaches to this problem, the mechanisms are based on the analysis of a few features of each flow, such as the size and the direction of the packets in the order they appear at a traffic capture device. This choice stems from the intuition that the finite state machine that drives a given application layer protocol has a strong impact on the way the information is broken into packets, especially in the early steps of the exchange. For example, if we consider the authentication phase in a mail retrieval connection, the POP3 protocol requires the client to send a couple of commands carrying the username and the password: this means that the corresponding flows are located in a specific position in the feature space. This intuition is well supported by experimental results. We also assume that packets that do not carry TCP payload do not introduce any additional information useful for classification purposes, and therefore we do not consider them. In fact, empty TCP packets carry only signaling at the transport layer, and their statistical characteristics should have little to do with the application layer. From this point on we define a flow as the bi-directional ordered sequence of packets carrying the same pair of source and destination tuples of IP addresses and TCP port numbers. Starting from this sequence, we create a feature vector F that will be used by the classification algorithms in place of the original TCP session:

F = \{s_1, s_2, s_3, \ldots, s_N\}, \qquad (1)

where N is the dimension of the space where each flow is mapped, i.e., the number of packets considered, and each s_i is a function of the size of the i-th packet, expressed by the following rule:

s_i = \begin{cases} +\mathrm{size}(pkt_i) + K & \text{if } pkt_i \text{ sent by client,} \\ -\mathrm{size}(pkt_i) - K & \text{if } pkt_i \text{ sent by server.} \end{cases}
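As an illustration, the construction of F can be sketched as follows (a minimal sketch: the packet sizes and directions are invented, while K = 1000 and N = 4 match the values used later in the paper):

```python
# Sketch of the feature extraction of Section III: each payload-carrying
# packet maps to a signed, shifted size that also encodes its direction.
K = 1000  # separation constant between counter-propagating packets
N = 4     # number of packets considered per flow

def feature_vector(packets, n=N, k=K):
    """packets: list of (size, from_client) pairs for payload-carrying
    packets, in capture order. Returns F = [s_1, ..., s_n]."""
    features = []
    for size, from_client in packets[:n]:
        s = size + k if from_client else -(size + k)
        features.append(s)
    return features

# A hypothetical POP3-like exchange: server banner, client USER command,
# server reply, client PASS command (sizes include TCP/IP headers).
flow = [(93, False), (52, True), (71, False), (54, True)]
print(feature_vector(flow))  # [-1093, 1052, -1071, 1054]
```

Note how the sign encodes direction and the shift by K keeps client and server packets in the disjoint intervals described above.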

The reason behind the sign transformation is to include the direction of the original packets in the feature sequence; constant K, instead, is added to increase the separation between counter-propagating packets. For example, when the analyzed packets are captured on 802.3 segments, the maximum packet size, including TCP/IP headers, is 1500 bytes: with the constant set to K = 1000 we end up with values lying in the intervals [−2500, −1040] and [+1040, +2500].

A. Training the classifier: statistical fingerprinting

The algorithms that we present here fall into the class of "supervised approaches": before any classification activity can be put into practice, the mechanisms need to be trained on previously collected data sets, each one grouping flows generated by the same application protocol. This preliminary training phase ends up with statistical descriptions of the application classes under observation. Building on our previous work [1], the description of each application is based on the estimation of the Probability Density Functions (PDF) calculated on the features derived from the corresponding

training set. We use symbol ωi , 1 ≤ i ≤ M , to indicate each of the M application protocols under observation, or, indifferently, their PDF estimates. In addition, we use symbol "x to indicate a feature vector representing a generic flow, while θ (ωi ) identifies the training set of class ωi . Once the flows that compose the training set of a given application ωi are converted into their equivalent pattern representations "x, a histogram method can be adopted to provide a non-parametric density estimation of the class. Since the N recorded features are integer numbers ranging in R = [−2500, 2500], every feature vector "x comes from the set defined by S=

N "

R

(2)

k=1

where symbol ⊗ denotes a Cartesian product. The PDF of application ωi can then be calculated by counting how many flows in its training set θ (ωi ) fall in each of the equally-sized cells centered at the points defined by Equation 2: n ("a| ωi )dV xj |ωi ) dV ! xj ∈S n ("

pˆ("a|ωi ) = #

(3)

Here n(\vec{a}\,|\,\omega_i) is the number of samples from the training set that fall in the cell of volume dV that straddles the point \vec{a} ∈ R^N, and the \vec{x}_j ∈ S are the centers of all the cells. This approach requires [R]^N cells to store each of the joint probability functions \hat{p}(\vec{x}\,|\,\omega_i), where [R] is the cardinality of the set R: this makes it inapplicable in practice because of its memory requirements (with [R] = 5001 and N = 4, one lookup table per class would need on the order of 5001^4 ≈ 6.3 × 10^14 cells). There are two possible solutions to the problem of memory scalability of the PDF evaluation mechanism. In one case, it might be possible to reduce the memory requirements of Equation 3 by directly estimating the "distance" of the observed flow from each flow composing the training set, without building the lookup table of the PDF (see Section III-B). Another, more radical approach is to assume that the variables that represent the features of consecutive packets are statistically independent (see Section III-C). While the first approach takes into consideration the joint information between consecutive packets, and could potentially allow more precise classification, the second has the advantage of decreasing both the memory and the computational requirements of the classification, therefore making it more amenable to implementation on embedded systems. The focus of this paper is to compare these two approaches with respect to their precision in classifying traffic classes that are present in the training sets, as well as classes that the classifier was not trained for.

B. First approach: joint conditional densities estimation

To compute the value of the density conditioned to class ω_i at a generic point \vec{a} ∈ R^N, we consider a different formulation, alternative to that of Equation 3:

\hat{p}(\vec{a}\,|\,\omega_i) = \frac{1}{[\theta(\omega_i)]} \sum_{\vec{x}_j \in \theta(\omega_i)} \delta(\vec{a} - \vec{x}_j), \qquad (4)

where [θ(ω_i)] is the number of samples \vec{x}_j in the training set, and δ(\vec{x}) is the discrete Dirac delta function. Each time a new flow arrives, we first evaluate its feature vector \vec{x} and then compute the summation in Equation 4 at \vec{a} = \vec{x}. Therefore, there is no need to estimate the entire PDF for each possible observable vector, which solves all memory problems at the expense of a slower algorithm. However, we underline that in this paper we are not investigating how to reduce the computational complexity of the presented algorithms, but only evaluating their classification precision.

To address issues concerning the possible sparseness of the estimated functions, a filtering step is carried out. The resulting \hat{p}(\vec{a}\,|\,\omega_i) can, in fact, display wide high-valued regions punctured by a number of tight null-valued subsets for small variations of \vec{a} at the boundaries of cells. This issue may mislead the description of the target class ω_i or, even worse, compromise its robustness to "noise" factors, i.e., small deviations from the behavior of standard PDFs. Such noise factors could be slight variations among different implementations of a given standard protocol, varying MTUs in network paths, etc. If we limit the density estimation to Equation 4, the subsequent classification will be vulnerable to this kind of noise: new flows belonging to class ω_i could, in fact, fall in a zero-valued cell even if the surrounding ones are populated with non-zero values. To counter this issue, following the well-known kernel (or Parzen) method, we filter the PDF using the simplest possible Gaussian kernel¹. To take the filtering into account, we modify Equation 4 as follows, using a zero-mean Gaussian kernel with the same variance σ² along all disjoint axes:

\hat{p}(\vec{a}\,|\,\omega_i) = \frac{1}{[\theta(\omega_i)]\left(\sqrt{2\pi\sigma^2}\right)^{N}} \sum_{\vec{x}_j \in \theta(\omega_i)} \exp\left(-\frac{\|\vec{a}-\vec{x}_j\|^2}{2\sigma^2}\right). \qquad (5)

We cannot use Equation 5 in its current form yet: a final normalization step is needed. The last equation, in fact, provides a PDF which extends over R^N, while the observable domain S given by Equation 2 is finite. To this end we replace the coefficient of the exponential terms with a "global" normalization constant Γ and rewrite the equation as follows:

\hat{p}(\vec{a}\,|\,\omega_i) = \frac{1}{\Gamma} \sum_{\vec{x}_j \in \theta(\omega_i)} \exp\left(-\frac{\|\vec{a}-\vec{x}_j\|^2}{2\sigma^2}\right). \qquad (6)

Now we limit variable \vec{a} to the discrete domain S: this allows us to compute Γ by imposing that the probability mass function estimate of Equation 6 sums to 1 over S, that is:

\sum_{\vec{x} \in S} \hat{p}(\vec{x}\,|\,\omega_i) = \frac{1}{\Gamma} \sum_{\vec{x} \in S,\; \vec{x}_j \in \theta(\omega_i)} \exp\left(-\frac{\|\vec{x}-\vec{x}_j\|^2}{2\sigma^2}\right) = 1.

This equation leads to:

\Gamma(\omega_i) = \sum_{\vec{x}_j \in \theta(\omega_i)} \prod_{k=1}^{N} \sum_{n=-2500}^{+2500} \exp\left(-\frac{\left(n - \langle\vec{x}_j, \vec{u}_k\rangle\right)^2}{2\sigma^2}\right), \qquad (7)

where \vec{u}_k is the unit vector along the k-th axis and we explicitly introduce a dependency of the normalization constant on the application class.

¹The Gaussian is not only the simplest kernel one might employ: as we will see, we also take advantage of some of its interesting properties, e.g. its separability along the axes, in the design of our technique.

C. Second approach: marginal conditional densities estimation

If we consider the components of the feature vectors as statistically independent, we can compute their density estimates separately and express the density estimate of the whole feature vector as the product of the marginal ones:

\check{p}(\vec{a}\,|\,\omega_i) = \prod_{k=1}^{N} \sum_{\vec{x}_j \in \theta(\omega_i)} \frac{1}{[\theta(\omega_i)]\sqrt{2\pi\sigma^2}} \exp\left(-\frac{|\langle\vec{a}-\vec{x}_j, \vec{u}_k\rangle|^2}{2\sigma^2}\right), \qquad (8)

where the Gaussian filter is already in action, as explained in the previous section. This is quite similar to the approach we followed in [1]. With a representative training set θ(ω_i) and under the assumption of statistical independence, the following equation holds: \hat{p}(\vec{a}\,|\,\omega_i) = \check{p}(\vec{a}\,|\,\omega_i). We remark that evaluating the PDFs in this way is only based on the assumption of independence: we are not interested in proving such independence. Actually, we are quite sure that there are indeed statistical dependencies between consecutive packets. We are simply trying to assess the effects that building the PDFs on such an assumption has on the precision of the classification process, since it can lead to a classifier that is much less memory- and resource-intensive than the one described in the previous section. We now repeat steps similar to those of the previous section and finally get the following expression for the alternative, marginal-based, PDF estimation:

\check{p}(\vec{a}\,|\,\omega_i) = \prod_{k=1}^{N} \frac{1}{\Gamma_k} \sum_{\vec{x}_j \in \theta(\omega_i)} \exp\left(-\frac{|\langle\vec{a}-\vec{x}_j, \vec{u}_k\rangle|^2}{2\sigma^2}\right), \qquad (9)

where each Γ_k, one per dimension, is given by:

\Gamma_k(\omega_i) = \sum_{\vec{x}_j \in \theta(\omega_i)} \sum_{n=-2500}^{+2500} \exp\left(-\frac{\left(n - \langle\vec{x}_j, \vec{u}_k\rangle\right)^2}{2\sigma^2}\right).

Once again, we consider the same variance σ² for all axes, as we did in [1].
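To make the difference between the joint estimator of Equation 6 and the marginal estimator of Equation 9 concrete, both can be sketched as follows (an illustrative sketch: the training vectors, the variance and N = 2 are toy values, and the normalization constants are computed by summing the kernels over the discrete feature range, in the spirit of Equation 7):

```python
import math

SIGMA2 = 50.0 ** 2          # toy kernel variance (sigma^2)
RANGE = range(-2500, 2501)  # discrete feature domain R

def joint_density(a, training):
    """Joint estimate (Eq. 6): one N-dimensional Gaussian kernel per
    training vector, normalized over the discrete domain S."""
    num = sum(math.exp(-sum((ai - xi) ** 2 for ai, xi in zip(a, x))
                       / (2 * SIGMA2))
              for x in training)
    # Normalization constant (Eq. 7): per training vector, product over
    # axes of the kernel summed over the discrete range.
    gamma = sum(math.prod(sum(math.exp(-(n - xi) ** 2 / (2 * SIGMA2))
                              for n in RANGE)
                          for xi in x)
                for x in training)
    return num / gamma

def marginal_density(a, training):
    """Marginal estimate (Eq. 9): product over axes of independently
    estimated one-dimensional densities."""
    p = 1.0
    for k, ak in enumerate(a):
        num = sum(math.exp(-(ak - x[k]) ** 2 / (2 * SIGMA2)) for x in training)
        gamma_k = sum(math.exp(-(n - x[k]) ** 2 / (2 * SIGMA2))
                      for x in training for n in RANGE)
        p *= num / gamma_k
    return p

training = [(-1093, 1052), (-1090, 1055), (-1500, 2040)]  # toy N=2 vectors
probe = (-1092, 1053)
print(joint_density(probe, training), marginal_density(probe, training))
```

The joint version keeps one kernel per whole training vector, while the marginal version factors the density per axis, which is what makes its memory and computational footprint grow linearly rather than exponentially in N.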

D. A common classification algorithm

The classification algorithm has to decide whether an unknown flow \vec{x} conforms to at least one of the target classes, which are described by the PDFs calculated by either of the approaches introduced in the previous sections. To accomplish this task we define an algorithm that combines a maximum a-posteriori probability rule with a rejection schema. We compute the probability that \vec{x} belongs to each class and take the maximum value:

\omega_t = \arg\max_{\omega_i} \{ p(\vec{x}\,|\,\omega_i) \},

where the class giving such value represents the candidate class ω_t. If we consider a joint estimation of the N packets composing the flow, the value p(\vec{x}\,|\,\omega_i) is given by the application of Equation 6. Conversely, under the hypothesis of statistical independence, p(\vec{x}\,|\,\omega_i) is computed by applying Equation 9. It is next to impossible to train a classifier with all possible types of traffic: even the most dedicated network managers would probably be able to profile only a subset of the applications that run on their networks. Therefore, we need to handle the case where a flow does not belong to any of the traffic classes used in the training phase. In this case the flow should be properly discarded, i.e., assigned to the rejection region ω_r. In addition to the maximum a-posteriori probability rule we adopt the following threshold-based rejection schema:

\vec{x} \in \begin{cases} \omega_t & \text{if } p(\vec{x}\,|\,\omega_t) \ge T_t, \\ \omega_r & \text{otherwise.} \end{cases}

The value T_t is the threshold associated with the candidate class ω_t, and it can be computed following the optimization process described in the following section.

E. Optimization of parameters

We can act on three parameters to tune the classifier: the minimum number of packets N to consider before emitting a classification verdict, the variance of the Gaussian filter, and the thresholds T_t related to each target class. The minimum number N of packets to consider before classifying a flow has an impact on the complexity of the model. While under the independence hypothesis the evaluation of an additional packet only involves a new marginal density estimation, in the case of joint density estimation the value N directly corresponds to the size of the feature space, thus increasing the complexity of the overall estimation process. In order to compare the results given by the two models, we set the same value N = 4 for both models.
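The decision rule above, combining the maximum-likelihood choice with per-class rejection thresholds, can be sketched as follows (the likelihood functions and thresholds here are arbitrary placeholders; in the paper the thresholds come from the optimization of Section III-E):

```python
UNKNOWN = "unknown"  # the rejection region omega_r

def classify(x, class_pdfs, thresholds):
    """class_pdfs: dict mapping class name -> callable returning p(x|omega_i).
    thresholds: dict mapping class name -> T_t. Returns the candidate class
    if its likelihood clears the class threshold, else 'unknown'."""
    # Maximum a-posteriori step (uniform priors): pick the best class.
    best_class, best_p = max(
        ((name, pdf(x)) for name, pdf in class_pdfs.items()),
        key=lambda item: item[1],
    )
    # Rejection step: accept only if p(x|omega_t) >= T_t.
    return best_class if best_p >= thresholds[best_class] else UNKNOWN

# Toy example with hand-made likelihood functions.
pdfs = {"http": lambda x: 0.8 if x[0] > 0 else 0.1,
        "pop3": lambda x: 0.7 if x[0] < 0 else 0.05}
ts = {"http": 0.5, "pop3": 0.5}
print(classify((1040, -1093), pdfs, ts))   # candidate http, 0.8 >= 0.5
print(classify((0, 0), pdfs, ts))          # best value 0.1 < 0.5 -> unknown
```

Setting every threshold to zero recovers the plain maximum a-posteriori rule, which is exactly the degenerate case discussed below.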
This also affects the completeness of the classifier, since it will not be able to consider flows that terminate before exchanging N = 4 packets. However, especially for QoS applications, this should not pose any problems, since any traffic engineering mechanism is usually focused on dealing with longer-lived flows. The variance of the Gaussian filter and the thresholds related to each class are strictly connected and they need to be

optimized jointly. The threshold influences the size of the acceptance region: if it is set to zero, the algorithm exactly follows a maximum a-posteriori probability rule, losing any capacity to detect unknown flows, i.e., flows not belonging to any of the classes the classifier has received training for. Let us define as True Positives (TP) the number of flows actually generated by one specific class that was included during training, which are correctly classified by our technique. Conversely, False Positives (FP) will refer to flows that are incorrectly assigned to one of the classes that the classifier received training for. Finally, we will define True Negatives (TN) as the number of flows that do not belong to any of the classes included in the training set, and that the classifier correctly marks as "unknown". We use these definitions not only to evaluate our classifier, but also to run the optimization procedure on its parameters, as follows. To compute the optimum variance of the Gaussian filter and the thresholds, we use a set of flows for each target class. These sets have to be disjoint from the sets used for density estimation or testing. We then iteratively run the classification algorithm of Section III-D on these sets, varying the variance of the Gaussian filter and the threshold values. The optimum values are the ones that satisfy the following criterion:

\max_{\sigma_1^2, \ldots, \sigma_M^2,\; T_1, \ldots, T_M} \{ TPr - FPr \},

where TPr and FPr are the True Positive rate and the False Positive rate, respectively.

IV. EXPERIMENTAL RESULTS

Here we report the classification results achieved by the mechanisms described in the previous section. We compare the results provided by a joint density estimation of the application protocol, as defined in Equation 6, with the ones given under the independence assumption of the features composing the flow, as defined in Equation 9. We apply our classification algorithms to three datasets. The idea is to assess the techniques on different protocols and collection points, thus emphasizing how the differences between the two approaches impact the classification results in different operating conditions.

A. UNIBS dataset

The packet traces from this set were collected at the border router of our Faculty network. Having full monitoring access to this router, we can apply pattern-matching mechanisms to assess the actual application that generated each TCP flow, in some cases with the help of manual inspection. Because of this, we consider both the training and evaluation sets derived from UNIBS relatively reliable with respect to the pre-classification information, i.e., with respect to knowing, independently from our classifier, which application generated each flow ("ground truth"). Our network infrastructure comprises several 1000Base-TX segments routed through a Linux-based dual-processor box and includes about a thousand workstations with different operating systems. All the traffic traces were gathered on the 100Mb/s link connecting the edge router to the Internet over a period of three weeks: a total of 50GB of traffic was collected by running Tcpdump [11] for fifteen minutes every hour. The training set is composed of six protocol classes, and for each class we select exactly one thousand flows. We use the same number of flows for each of the trained classes since we do not make any assumption on the a-priori probability of each class. The evaluation set is instead composed of the six protocols mentioned above (the ones at the top left of Table I) and another set of protocols, named other (the ones at the bottom left of the table): we use this last set to verify the classifier's ability to recognize protocols different than those used during the training phase. Note that the training and evaluation sets were collected in two different, and consecutive, time frames during the course of several weeks.

TABLE I
PROTOCOL TYPES/PORT NUMBERS AND NUMBER OF FLOWS COMPOSING THE EVALUATION SETS. TOP SECTION: CLASSES USED FOR TRAINING.

  UNIBS                  |  LBNL             |  CAIDA
  Protocol   sessions    |  Port   sessions  |  Port   sessions
  http          5000     |    80     38000   |    80     12500
  pop3         20000     |   110      1400   |   110      2400
  ftp          14500     |    25      1300   |    21     16200
  smtp         19500     |   139      3300   |    25     20700
  msn           1000     |   443     10100   |   443      7900
  torrent       7400     |   993       410   |  4662      2500
  -----------------------+-------------------+-----------------
  edonkey       4100     |   445      1400   |  1214       860
  nntp             3     |   389      1000   |  6346       400
  ssl            390     |  5308       500   |   119        60
  imap            14     |   631       470   |    53        35
  gnutella        10     |    21       470   |    22        16
  smb            120     |   995       400   |   139         6
                         |   515       150   |
                         |    22       120   |
                         |   143        70   |

B. LBNL dataset

The LBNL traffic traces we used were collected at the Lawrence Berkeley National Laboratory under the Enterprise Tracing Project [12]. The packet traces were obtained at the two central routers of the LBNL network and contain more than one hundred hours of traffic generated by several thousand internal hosts. The traffic traces are public, but they are completely anonymized, so ascertaining the "ground truth" about the application behind each recorded flow is not possible. Therefore, for this set, we built protocol sets according to the TCP destination port number of each flow. We used the traffic traces captured on December 15 and 16, 2004 to obtain the training set, and those captured on January 6 and 7, 2005 to build the evaluation set. Table I (center) reports the selection of TCP ports and the composition of the evaluation set.
Once again, each class of the training set (top center part of Table I) is composed of a separate dataset of one thousand flows, while the ports under the central line belong to the other set.

C. CAIDA dataset

We built this data set starting from three one-hour traces obtained by the Cooperative Association for Internet Data Analysis (CAIDA) [13], collected at the AMES Internet Exchange (AIX) along an OC48 link on August 14, 2002. We used flows extracted from the first hour (corresponding to the interval 16.15-17.00 UTC) to build the training set and from the third hour (18.00-18.10 UTC) to create the evaluation set. As with the previous sets, these traces are also anonymized, so port numbers are used as indicators of each protocol class. In Table I (right) we report the list of TCP ports and the number of flows composing the CAIDA evaluation set. The selection of flows composing the training set and the ports belonging to the other set follow the same considerations mentioned for the LBNL and UNIBS datasets.
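Since the LBNL and CAIDA traces are anonymized, the class label has to be derived from the server port. A minimal sketch of grouping packet records into bidirectional flows keyed by the address/port tuples and labeling them this way follows (the record fields and the port-to-protocol map are illustrative, not the paper's actual tooling):

```python
from collections import defaultdict

# Map a few well-known server ports to class labels; anything else is "other".
PORT_CLASSES = {80: "http", 110: "pop3", 21: "ftp", 25: "smtp"}

def group_flows(records):
    """records: iterable of (src_ip, src_port, dst_ip, dst_port, size).
    Returns {flow_key: [sizes...]} where the key is direction-agnostic,
    so both directions of a TCP session map to the same flow."""
    flows = defaultdict(list)
    for src_ip, src_port, dst_ip, dst_port, size in records:
        # Sort the two endpoints so client->server and server->client
        # packets share one key.
        key = tuple(sorted([(src_ip, src_port), (dst_ip, dst_port)]))
        flows[key].append(size)
    return flows

def label(flow_key):
    """Label a flow by the lower (assumed server-side) port of its endpoints."""
    server_port = min(port for _ip, port in flow_key)
    return PORT_CLASSES.get(server_port, "other")

records = [("10.0.0.1", 40210, "192.0.2.5", 80, 120),
           ("192.0.2.5", 80, "10.0.0.1", 40210, 1500)]
flows = group_flows(records)
for key, sizes in flows.items():
    print(label(key), sizes)   # http [120, 1500]
```

Taking the lower port as the server side is a heuristic that works for well-known ports; it is an assumption of this sketch, not a statement about the authors' pre-processing.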

D. Numerical results

In Table II we present the classification results of the mechanisms described in the previous section. On the left are the results achieved when the PDF is estimated under the independence assumption of the features composing the flow (see Section III-C); on the right, the results achieved by a joint estimation of the PDF (see Section III-B). The algorithm described in Section III-D is then applied to emit a classification verdict, after tuning its parameters according to the optimization process described in Section III-E. Numbers in bold represent the True Positive rates in the top part of each table, i.e., the fraction of flows belonging to classes that the classifier had received training for which are correctly classified. Conversely, True Negative rates (the fraction of flows correctly assigned to the "unknown" category) are the numbers in bold in the bottom part of each table.

Experimental results show that there are only minor differences in classification precision between the two approaches when dealing with traffic classes which the system received training for (top part of each table). Besides a few isolated instances, the marginal estimation method seems to perform just as well as the joint estimation one on trained classes. One such exception is port 80 traffic of the LBNL dataset, for which the joint estimation classifier achieves more than 10% better results in terms of true positive rate than the marginal estimation one. Also, there are no marked differences between the application of these two mechanisms to environments as different as UNIBS, LBNL and CAIDA: both seem able to achieve good results in all environments when dealing with trained classes.

A much bigger difference can be seen when comparing data related to traffic that belongs to the other class, i.e., traffic generated by applications (or port numbers) unknown to the classifier with respect to the protocols it received training for. In this case, especially for the UNIBS dataset (see the bottom part of the top table), there is a marked improvement in the ability of the classifier to correctly assign such traffic to the "unknown" class when using the joint estimation method vs. the marginal one. For example, the True Negative rate for SSL traffic² goes from a lowish 69.57% with the marginal estimation method to a good 94.88% with the joint estimation one. This indicates that when the training set is certified with respect to the ground truth behind each traffic flow, as with the UNIBS dataset, the performance of the algorithm can be significantly improved by adopting the joint estimation mechanism. Such differences are less marked with the LBNL and CAIDA datasets, but in these two cases a big factor that needs to be considered is that there is no "ground truth" associated with each traffic class other than the port numbers. Therefore, one can expect that even a significant fraction of what is in class "TCP PORT 80" of these sets might actually belong to protocols quite different from HTTP. The unexciting results on the other CAIDA set (see the 1214, 6346 and 53 ports) can also be at least partially explained by the same token.

V. CONCLUSIONS

In this paper we have applied two statistical techniques to the issue of traffic classification. The techniques aim at profiling an application protocol by estimating the probability density function of the flows generated by the protocol itself. The considered features are the packet size and the direction of each packet within the flow (from client to server or vice-versa). One of the techniques relies on the independence of the features, and is directly derived from our previous work [1]. The other technique considers the features jointly. The experimental results presented in this paper tell two interesting stories. On one hand, the True Positive rates of flows that belong to trained classes do not seem to be greatly affected by the choice of the technique. In other words, judging by this parameter alone might make a network manager choose the simpler of the two techniques (the one based on marginal estimation), which is also the less memory-intensive one. On the other hand, the joint estimation technique shows markedly improved results when dealing with the ability of the classifier to assign unknown traffic (i.e., traffic belonging to classes which the classifier did not receive training for) to the correct class. Therefore, in order to achieve the best results possible, it seems that joint estimation algorithms have an edge over marginal estimation ones. In future work we plan to investigate the reason behind this behavior, i.e., to analyze why joint estimation has such an evident effect only on the ability of the classifier to recognize "unknowns".

REFERENCES

[1] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli, "Traffic Classification through Simple Statistical Fingerprinting," ACM SIGCOMM Computer Communication Review, vol. 37, pp. 5–16, Jan. 2007.
[2] L. Bernaille, R. Teixeira, and K. Salamatian, "Early Application Identification," in The 2nd ADETTI/ISCTE CoNEXT Conference, (Lisboa, Portugal), Dec. 2006.
[3] V.
Paxson, “Empirically derived analytic models of wide-area TCP connections,” IEEE/ACM Transactions on Networking, vol. 2, no. 4, pp. 316–336, 1994. 2 Remember

that we did not train the classifier for SSL traffic.

[4] S. Dharmapurikar, P. Krishnamurthy, T. Sproull, and J. Lockwood, "Deep packet inspection using parallel bloom filters," IEEE Micro, vol. 24, no. 1, pp. 52–61, 2004.
[5] T. Kocak and I. Kaya, "Low-power bloom filter architecture for deep packet inspection," IEEE Communications Letters, vol. 10, no. 3, pp. 210–212, 2006.
[6] A. Broder and M. Mitzenmacher, "Network applications of bloom filters: a survey," Internet Mathematics, vol. 1, no. 4, pp. 485–509, 2003.
[7] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification," in IMC '04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, (Taormina, Sicily, Italy), pp. 135–148, Oct. 2004.
[8] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, "Flow Clustering Using Machine Learning Techniques," in Proceedings of the 5th Passive and Active Measurement Workshop (PAM 2004), (Antibes Juan-les-Pins, France), pp. 205–214, Mar. 2004.
[9] R. Y. Z. Li and X. Guan, "Accurate Classification of the Internet Traffic Based on the SVM Method," in Proceedings of the 42th IEEE International Conference on Communications (ICC 2007), June 2007.
[10] N. Williams, S. Zander, and G. Armitage, "A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification," SIGCOMM Computer Communication Review, vol. 36, no. 5, pp. 7–15, 2006.
[11] "Tcpdump/Libpcap." http://www.tcpdump.org.
[12] LBNL/ICSI Enterprise Tracing Project. http://www.icir.org/enterprise-tracing.
[13] The Cooperative Association for Internet Data Analysis (CAIDA). http://www.caida.org.

marginal  http    pop3    ftp     smtp    msn     torr    edon    nntp    ssl     imap    gnut    smb
http      93.75   –       –       –       –       0.04    –       –       28.39   –       70.00   –
pop3      –       93.69   11.44   0.53    –       –       –       33.33   –       –       –       –
ftp       –       4.92    88.02   0.60    –       –       –       –       –       –       –       –
smtp      –       1.38    –       98.64   –       –       0.07    66.67   –       85.71   –       –
msn       –       –       –       –       93.80   0.23    95.28   –       2.05    –       –       33.05
torr      –       –       –       0.01    4.94    97.81   –       –       –       –       –       –
unknown   6.25    0.01    0.54    0.22    1.26    1.92    4.65    –       69.57   14.29   30.00   66.95

joint     http    pop3    ftp     smtp    msn     torr    edon    nntp    ssl     imap    gnut    smb
http      93.53   –       –       –       –       0.08    0.02    –       5.12    –       40.00   26.27
pop3      –       95.64   2.82    0.21    –       –       –       –       –       –       –       –
ftp       –       3.71    87.64   0.45    –       –       0.02    –       –       –       –       –
smtp      –       0.05    0.30    98.95   –       –       –       –       –       50.00   –       –
msn       –       –       –       –       91.87   –       0.02    –       –       –       –       –
torr      –       –       –       –       0.39    92.77   2.47    –       –       –       –       –
unknown   6.47    0.60    9.24    0.39    7.74    7.15    97.46   100.00  94.88   50.00   60.00   73.73

(a) UNIBS evaluation set.

marginal  80      110     25      139     443     993     445     389     5308    631     21      995     515     22      143
80        85.34   –       –       –       2.57    0.98    6.84    13.60   2.98    12.47   –       –       –       –       –
110       –       97.41   –       –       –       –       –       –       –       –       79.23   –       –       –       –
25        –       1.26    96.40   –       –       –       –       –       –       –       0.64    –       –       –       –
139       –       –       –       99.91   –       –       –       17.83   –       –       –       –       –       –       –
443       0.03    –       –       –       76.70   10.00   –       –       –       –       –       15.97   –       –       –
993       0.05    –       –       –       0.03    88.54   –       –       –       –       –       74.69   –       –       –
unknown   14.58   1.33    3.60    0.09    20.70   0.49    93.16   68.57   97.02   87.53   20.13   9.34    100.00  100.00  100.00

joint     80      110     25      139     443     993     445     389     5308    631     21      995     515     22      143
80        96.60   –       –       –       0.45    0.98    17.61   0.39    2.98    –       –       –       –       –       –
110       –       97.55   –       –       –       –       –       –       –       –       79.23   –       –       –       –
25        –       2.38    99.54   –       –       –       –       –       –       –       0.64    –       –       –       11.76
139       –       –       –       99.91   –       –       –       7.98    –       –       –       –       –       –       –
443       –       –       –       –       75.47   1.46    –       –       –       –       –       16.71   –       –       –
993       0.01    –       –       –       0.44    97.56   –       –       –       –       –       75.68   –       –       –
unknown   3.39    0.07    0.46    0.09    23.64   –       82.39   91.63   97.02   100.00  20.13   7.62    100.00  100.00  88.24

(b) LBNL evaluation set.

marginal  80      110     21      25      443     4662    1214    6346    119     53      22      139
80        94.78   –       –       –       1.32    0.04    71.16   63.37   –       65.71   –       –
110       –       95.14   0.70    2.16    –       –       –       –       13.33   –       –       –
21        –       4.19    96.62   3.87    –       –       –       –       55.00   –       –       –
25        –       0.34    0.07    92.90   –       –       –       –       10.00   –       12.50   –
443       –       –       –       –       92.41   –       –       –       –       –       –       –
4662      –       –       –       –       –       99.76   3.02    5.06    5.00    –       –       –
unknown   5.22    0.34    2.61    1.06    6.27    0.20    25.81   31.57   16.67   34.29   87.50   100.00

joint     80      110     21      25      443     4662    1214    6346    119     53      22      139
80        94.80   –       –       –       0.61    –       63.72   48.19   –       48.57   –       –
110       –       90.44   0.71    0.81    –       –       –       –       11.67   –       –       –
21        –       9.01    97.86   0.21    –       –       –       –       –       –       –       –
25        –       0.50    0.33    98.45   –       –       –       –       30.00   –       12.50   –
443       –       –       –       –       93.29   –       –       –       –       –       –       –
4662      –       –       –       –       –       99.20   1.05    3.37    –       –       –       –
unknown   5.20    0.04    1.10    0.53    6.10    0.80    35.23   48.43   58.33   51.43   87.50   100.00

(c) CAIDA evaluation set.

TABLE II
CLASSIFICATION RESULTS (OPTIMAL PARAMETERS). IN BOLD, TRUE POSITIVE RATES (TOP SECTIONS) AND TRUE NEGATIVE RATES (BOTTOM SECTIONS).
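Each column of Table II describes how the flows of one evaluation protocol were distributed across the classifier's output classes, so every column sums to roughly 100%. A small hypothetical helper (function name and the sample column are illustrative; the values are the "ssl" column of the joint UNIBS subtable, with "–" entries read as zero) shows how the headline rates are read off:

```python
# Classes the classifier was trained on (UNIBS setup from the paper).
TRAINED = {"http", "pop3", "ftp", "smtp", "msn", "torr"}

# One table column: percentage of SSL flows assigned to each output class.
ssl_joint = {"http": 5.12, "pop3": 0.0, "ftp": 0.0,
             "smtp": 0.0, "msn": 0.0, "torr": 0.0, "unknown": 94.88}

def headline_rate(column, protocol):
    """True Positive rate for trained protocols (the diagonal entry),
    True Negative rate for untrained ones (the 'unknown' entry)."""
    if protocol in TRAINED:
        return column[protocol]      # correctly labeled with its own class
    return column["unknown"]         # correctly rejected as unknown

# SSL was not in the training set, so its headline number is a TN rate;
# the 5.12% routed to "http" counts as false positives for that class.
print(headline_rate(ssl_joint, "ssl"))
```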