Conference Title

The Second International Conference on Digital Security and Forensics (DigitalSec2015)

Conference Dates

November 15-17, 2015

Conference Venue

University of Cape Town, Cape Town, South Africa

ISBN

978-1-941968-28-4 ©2015 SDIWC

Published by

The Society of Digital Information and Wireless Communications (SDIWC)
Wilmington, New Castle, DE 19801, USA
www.sdiwc.net

Proceedings of The Second International Conference on Digital Security and Forensics, Cape Town, South Africa 2015

Forensic Network Traffic Analysis

Noora Al Khater
Department of Informatics, King's College London, London, United Kingdom
[email protected]

ABSTRACT

Because the nature of information in a network is volatile and dynamic, precious evidence may be missed. Real-world situations require a quick classification decision before a flow finishes, especially for security and network forensic purposes. Monitoring network traffic therefore requires real-time, continuous analysis in order to collect valuable evidence, such as transient evidence that would be missed by post-mortem analysis (dead forensics). Network traffic classification is considered the first line of defence, where malicious activity can be filtered, identified and detected. It is also the core component in evidence collection and analysis: it supplies filtered evidence and helps to reduce redundancy. However, most existing approaches to collecting evidence from networks are based on post-mortem analysis. This research therefore investigates different classification techniques using Machine Learning (ML) algorithms, seeking to identify ways to improve classification methods from a forensic investigator's standpoint.

KEYWORDS

Digital forensics; cyber security; network traffic classification; digital evidence; network analysis

Richard E. Overill
Department of Informatics, King's College London, London, United Kingdom
[email protected]

1 INTRODUCTION

Analysing network traffic is considered a proactive investigation in network forensics. Monitoring and classifying network traffic correctly can help organisations avoid a great deal of harm, and provide them with an intelligible view of traffic patterns, enabling a good incident response plan to be put in place. In fact, organisations are looking for more effective methods to solve the problems caused by the increasing number of cyber-crimes that inflict genuine harm. There have been great historical developments in classification techniques; however, they are not free from drawbacks, which can be outlined in two points. First, traditional techniques (port-based classification) are unreliable because many applications use random or non-standard ports. Second, although payload-based classification is considered more accurate, this accuracy diminishes when dealing with encrypted traffic; furthermore, this technique has a very high computational cost and may be thwarted by privacy policies because it inspects packet contents. These problems have driven the research community to shift to statistical and behavioural classification using Machine Learning (ML) techniques, which depend on the analysis of application patterns and do not require payload inspection. Additionally, the increasing volume of traffic and rising transmission rates push researchers to look for lightweight algorithms, and the persistence of application developers in inventing new ways to evade traffic filtering and detection mechanisms is another motivating factor. This paper is structured as follows: Section 2 defines digital forensics and provides background on the forensic model. Section 3 discusses the use of ML algorithms in the field of network traffic analysis and reviews the existing studies. Section 4 outlines the


challenges in the field of network traffic classification using ML. Section 5 illustrates the proposed method for forensic network traffic analysis. Section 6 summarises our research.

2 WHAT IS DIGITAL FORENSICS?

The Digital Forensics Research Workshop (DFRWS) [1] has defined digital forensics as: “The use of scientifically derived and proven methods toward the preservation, collection, validation, identification, analysis, interpretation, documentation, and presentation of digital evidence derived from digital sources for the purpose of facilitation or furthering the reconstruction of events found to be criminal, or helping to anticipate unauthorized actions shown to be disruptive to planned operations.”

Network forensics is a subdomain of digital forensics. It deals with the analysis of network traffic as a proactive investigation, and it differs from other areas of digital forensics because it involves volatile information that requires real-time, continuous analysis. Network forensics has been defined by DFRWS [1] as: “The use of scientifically proven techniques to collect, fuse, identify, examine, correlate, analyze, and document digital evidence from multiple, actively processing and transmitting digital sources for the purpose of uncovering facts related to the planned intent, or measured success of unauthorized activities meant to disrupt, corrupt, and or compromise system components as well as providing information to assist in response to or recovery from these activities.”

2.1 Background on Forensic Model

Several models can be used for investigations in digital forensic science. For a clear explanation we take a formal, comprehensive approach from the DFRWS. The DFRWS model consists of sequential steps in the process of digital forensic analysis [1].


The steps help practitioners and researchers to conceive of the situation and understand where they need to focus. These steps can be seen in Table 1. The linear approach begins with the process of identifying potential evidence, which involves locating the digital evidence and determining how to find that location. For example, in network forensics, Intrusion Detection Systems (IDS) can be sources of digital evidence; an IDS recognises and detects unusual flow patterns in the network traffic. The second step is preservation, a critical phase for increasing the likelihood of a successful investigation. This process runs from acquiring, seizing, and preserving the evidence to creating a digital image of the evidence and maintaining the chain of custody. The following steps are collection, examination and analysis of the digital evidence, culminating in the final presentation. Network forensics requires real-time collection of digital evidence so that no critical information is missed. However, most methods in the collection process are based on post-mortem analysis. The huge amount of collected data needs automated techniques to reduce redundancy, and consequently to reduce the analysis time of the evidence [2]. Manual analysis of a complex network attack involving a large amount of data cannot be done in a timely fashion, so an automated method is a basic element in identifying and collecting evidence and linking it with the criminal action. In addition, the massive amount of traffic requires a huge, and costly, storage capacity; this data needs to be filtered to extract the relevant parts. In practice, the capture and analysis of network traffic is normally based on sniffing tools, which face difficulties and whose effectiveness diminishes when they deal with encrypted packets. However, analysis techniques based on pattern recognition using ML algorithms have shown promising results.
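As a toy illustration of the automated redundancy reduction mentioned above, duplicate captured packets can be dropped by hashing their identifying fields. The record format and field names here are invented for the sketch and are not drawn from any cited tool.

```python
import hashlib

# Hedged sketch of automated data reduction during evidence collection:
# drop exact duplicates of captured packets by hashing the fields that
# identify them. Packet records and field names are hypothetical.

def packet_key(pkt: dict) -> str:
    """Hash the identifying fields of a captured packet record."""
    ident = "|".join(str(pkt[k]) for k in ("src", "dst", "proto", "payload"))
    return hashlib.sha256(ident.encode()).hexdigest()

def deduplicate(packets):
    """Keep the first copy of each distinct packet, preserving order."""
    seen, kept = set(), []
    for pkt in packets:
        key = packet_key(pkt)
        if key not in seen:
            seen.add(key)
            kept.append(pkt)
    return kept

capture = [
    {"src": "10.0.0.5", "dst": "10.0.0.9", "proto": "TCP", "payload": "GET /"},
    {"src": "10.0.0.5", "dst": "10.0.0.9", "proto": "TCP", "payload": "GET /"},
    {"src": "10.0.0.7", "dst": "10.0.0.9", "proto": "UDP", "payload": "query"},
]
print(len(deduplicate(capture)))  # 2 distinct packets remain
```

Hashing rather than comparing raw records also fits the chain-of-custody idea: the digest of each retained item can be logged when it enters the evidence store.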
Finally, the last stage of digital forensic investigation is presentation, which entails dealing with legal aspects of the

2

Proceedings of The Second International Conference on Digital Security and Forensics, Cape Town, South Africa 2015

case and presenting the investigation's findings in court.

Table 1. Digital Forensic Investigation Process [1]

Identification: Event/Crime Detection; Resolve Signature; Profile Detection; Anomalous Detection; Complaints; System Monitoring; Audit Analysis

Preservation: Case Management; Imaging Technologies; Chain of Custody; Time Synchronization

Collection: Preservation; Approved Methods; Approved Software; Approved Hardware; Legal Authority; Lossless Compression; Sampling; Data Reduction

Examination: Preservation; Traceability; Validation Techniques; Filtering Techniques; Pattern Matching; Hidden Data Discovery; Hidden Data Extraction

Analysis: Preservation; Traceability; Statistical; Protocols; Data Mining; Timeline; Link; Spatial

Presentation: Documentation; Expert Testimony; Clarification; Mission Impact Statement; Recommended Countermeasure; Statistical Interpretation

3 NETWORK TRAFFIC CLASSIFICATION USING ML TECHNIQUES
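Port-based classification, whose unreliability is discussed below, amounts to little more than a lookup against the registered-port list, so any application on a random or non-standard port falls through. The sketch below uses a tiny excerpt of the IANA registry; the function name and the example flows are ours.

```python
# Toy sketch of traditional port-based classification and its blind spot:
# any flow on a non-standard port falls through to "unknown".
# The port-to-application mapping is a small excerpt of the IANA registry.

IANA_PORTS = {
    22: "ssh",
    25: "smtp",
    53: "dns",
    80: "http",
    443: "https",
}

def classify_by_port(dst_port: int) -> str:
    """Return the registered application for a destination port, else 'unknown'."""
    return IANA_PORTS.get(dst_port, "unknown")

# An application deliberately running over port 6881, or over a random
# ephemeral port such as 49152, is invisible to this method.
flows = [80, 443, 6881, 49152]
print([classify_by_port(p) for p in flows])  # ['http', 'https', 'unknown', 'unknown']
```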


Although port-based classification is the fastest and simplest of the earlier techniques for monitoring and reporting network traffic activity, its unreliability has been demonstrated in several published works. For instance, one study [3] stated that the port-based analysis method could not classify 30-70% of the internet traffic used in its work. Another investigation of the accuracy of port-based classification [4] demonstrated that this traditional technique, using the official IANA list, classifies no better than 70% of bytes correctly. Nevertheless, this classification is still used in situations where accuracy is not critical. The problems of port-based classification methods and the drawbacks of payload-based classification techniques have motivated researchers to find alternative ways to analyse network traffic using statistical and behavioural properties of the flow, without relying on inspection of packet contents. ML techniques have shown promising results in analysing network traffic based on the extracted features of the flow.

3.1 Machine Learning Concepts

ML is a core branch of artificial intelligence. It is the science of making machines able to obtain new information, develop new skills, assess and reorganise existing knowledge, and identify new information. In the early 1990s, Shi [5] noted that the distinctive characteristic of intelligence is the ability of machines to learn from experience automatically. Additionally, in 2000 Witten and Frank observed that "Things learn when they change their behavior in a way that makes them perform better in the future" [6].


ML techniques are used in several applications, such as medical diagnosis, search engines, and marketing. ML algorithms were developed and their applications disseminated over various fields. The first remarkable work on ML techniques in the field of telecommunications networking was in 1990; this work was an attempt to maximise call completion in a circuit-switched network [7]. 1994 saw the start of the use of ML algorithms to classify internet flows for intrusion detection [8]. This was the spark for a great deal of research on applying ML techniques to network traffic classification.

3.2 Types of Machine Learning

There are four different types of machine learning according to Witten and Frank [6]: supervised learning (classification), unsupervised learning (clustering), association, and numeric prediction. Supervised learning uses a number of pre-classified examples to build classifier rules that identify unknown flows. Unsupervised learning automatically classifies network flows into groups (clusters) of instances that have the same properties, without any kind of pre-guidance. Association learning explores the links among features. In numeric prediction, the prediction result is a numeric quantity, not a discrete class. Most studies in the field of network traffic classification have used supervised learning; a few, however, have applied unsupervised learning (clustering) or investigated the use of hybrid (semi-supervised) techniques in their analysis of network traffic.

3.2.1 Supervised Classification

Supervised classification techniques use pre-classified (pre-labelled) samples during the training phase to build a classifier (model) with a set of rules in order to classify new samples [9]. The information learnt can be presented in the form of classification rules, a decision tree, a flowchart, etc. This information can then be


used to identify similar examples. Two main processes are inherent in supervised classification: the training process, which builds the model, and the testing process, in which the model (classifier) is used to identify new flows. Supervised techniques are suitable for identifying specific types of applications. The effectiveness of these techniques depends on the training phase (training set), because these classification methods focus on forming input/output relationships. There are a number of different ML algorithms that can be used in classification; each algorithm is distinct in its model construction process and its output. Examples of these algorithms are the Nearest Neighbours (NN), Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) algorithms. The authors in [10] applied a statistical signature-based approach using these ML algorithms to classify IP traffic. Another study [11] used the supervised Naive Bayes ML technique to classify network traffic. In the training phase they used 248 full-flow-based features, and achieved 65% flow accuracy with a simple Naive Bayes technique. They then enhanced the classification results by using the Naive Bayes Kernel Estimation (NBKE) and Fast Correlation-Based Filter (FCBF) techniques, raising overall flow accuracy to 95%. The results of this study were improved upon in [12] by applying a Bayesian neural network technique: the authors achieved 99% accuracy when the classifier was trained and tested at the same time, and 95% when the classifier was tested eight months later. In [13], the authors applied the Naive Bayes ML algorithm to classify network traffic based on features extracted from sub-flows instead of full flows. With this technique they avoided the need for the classifier to find the start of the flow, which could be missed. The authors in


[14] extended their previous study [13] by using Naive Bayes and decision-tree algorithms. They affirmed once again that the performance of classifiers is enhanced when they are trained on sub-flows, using the same datasets as in [13]. They also compared their results with the poor results of classification based on statistical features extracted from full bidirectional flows. Another work [15] used a Genetic Algorithm (GA) for feature selection, applied three different algorithms (the Naive Bayes classifier with Kernel Estimation (NBKE), the J48 decision tree, and the Reduced Error Pruning Tree (REPTree)) and compared their classification results. The outcomes show that J48 and REPTree perform better than NBKE, with high accuracy in the results. This study also noted the influence of using data from several sites for the training and testing processes. Although the reported accuracy is an overall figure, it is noteworthy that high accuracy is obtained from just the first 10 packets used in the classification. This study raises the question: if different applications were used, how different would the results be?
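The supervised studies above all share the same train-then-test shape. As a rough, self-contained illustration (not the features, datasets, or implementations of the cited works; the flow statistics and values below are invented), a Gaussian naive Bayes classifier over two per-flow features can be sketched as follows:

```python
import math
from collections import defaultdict

# Minimal Gaussian naive Bayes flow classifier, illustrating the
# train/test workflow described above. Features (mean packet size in
# bytes, mean inter-arrival time in ms) and values are hypothetical.

def train(samples):
    """samples: list of (feature_vector, label). Returns per-class stats."""
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append(x)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # per-feature variance, floored to avoid division by zero
        variances = [
            max(sum((v - m) ** 2 for v in col) / max(n - 1, 1), 1e-9)
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n, means, variances)
    return model

def log_gaussian(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def predict(model, x):
    """Pick the class with the highest log-posterior under the model."""
    total = sum(n for n, _, _ in model.values())
    best, best_score = None, -math.inf
    for label, (n, means, variances) in model.items():
        score = math.log(n / total) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances)
        )
        if score > best_score:
            best, best_score = label, score
    return best

# Toy training set of pre-labelled flows.
training = [
    ([1400.0, 5.0], "bulk"), ([1350.0, 7.0], "bulk"), ([1450.0, 4.0], "bulk"),
    ([90.0, 40.0], "interactive"), ([120.0, 55.0], "interactive"),
    ([100.0, 60.0], "interactive"),
]
model = train(training)
print(predict(model, [1380.0, 6.0]))   # a bulk-transfer-like flow
print(predict(model, [110.0, 50.0]))   # an interactive-like flow
```

Real classifiers in the literature use far richer feature sets (e.g. the 248 flow features of [11]) and kernel density estimates rather than a single Gaussian per feature; this sketch only shows the shape of the method.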

3.2.2 Unsupervised Classification

In contrast to supervised classification, unsupervised classification techniques automatically classify flows without any guidance, trying to find internalised heuristics from unlabelled data by discovering patterns [16]. They do this by clustering together instances that have relatively similar characteristics, or are identical in terms of input data. Some instances are placed in only one group, which can be called a limited group; in other situations groups overlap, and the same instance can be found in different groups. The basic unsupervised techniques are the k-means algorithm, probability-based clustering, and incremental clustering [6]. Unsupervised techniques were studied in [17] by applying the Expectation Maximisation (EM) algorithm [18] to classify network traffic based on features extracted from full flows. Using this technique, the researchers grouped applications into several groups based on their characteristics. Their technique can be taken as an initial step in identifying unknown network traffic. Other researchers, such as [19], proposed a method to classify TCP flows by applying unsupervised ML (the Simple K-Means algorithm). Their proposed classification technique exploits features taken from the first packets of the TCP flow. This contrasts with the previous study [20], which used the features of full flows. This approach allows quick identification of the applications flowing over the network by examining only the first few packets. However, the classification method rests on the hypothesis that the classifier can always see the beginning of every flow. This is not the case in reality, as in real-world network traffic the start of the flow might be missed. Therefore, the classifier needs to consider different conditions and scenarios; otherwise its capability decreases in situations different from the studied conditions.

3.2.3 Semi-Supervised Classification

Semi-supervised classification combines supervised and unsupervised techniques, taking advantage of both: it can detect new applications, like unsupervised techniques, which reduces the performance degradation supervised classification suffers when new applications emerge in conditions different from those studied. One of the significant pieces of research on semi-supervised techniques was conducted by [21]. The proposed classification method in this study started by supplying a clustering algorithm with a training dataset that combined labelled and unlabelled flows. The


provided labelled flows help the clustering algorithm to map clusters to labels. Unknown flows are allocated to the closest cluster based on the distance metric used. The initial results with a semi-supervised approach using the K-Means clustering algorithm are promising. Further details of the results can be found in [22].

4 RESEARCH CHALLENGES

After considering the most recent developments in the field of network traffic analysis using ML, it is clear that there are problems with the methodologies used to analyse and classify network traffic in most published works; there is, therefore, still space for significant contributions. Real-world situations require prompt analysis-based decision-making before the network flow ends, especially in critical situations involving the identification and detection of security incidents. It is important to mention that the majority of the proposed techniques deteriorate when dealing with conditions different from those for which they are best suited, such as coping with a huge amount of network traffic (i.e. real situations). This means that many applications might not be classified and identified correctly; hence, valuable forensic evidence might be missed. It is also notable that most studies applied ML techniques using features extracted from full flows, whereas the results of classifying using sub-flows outperformed those using full flows; very few studies, including [13, 23], investigated this. Indeed, for security analysis purposes, speed in identifying network traffic is required as well as accuracy of the results. The problem with classifying using full flows is that if the beginning of the flow is missed, the performance of the classifier diminishes; and losing the beginning of the flow is a very common scenario in real-world network traffic.
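The sub-flow idea can be made concrete with a small sketch: statistical features are computed over any window of N packets, so the classifier does not depend on seeing the start of the flow. The record format and feature choice here are illustrative, not taken from [13] or [23].

```python
from statistics import mean, pstdev

# Hedged sketch of sub-flow feature extraction: statistics computed from
# only the first n packets of an observed window, rather than the complete
# flow. Packet records are hypothetical (timestamp_seconds, payload_bytes).

def subflow_features(packets, n=10):
    """Return (mean size, size std-dev, mean inter-arrival) over n packets."""
    window = packets[:n]
    sizes = [size for _, size in window]
    times = [t for t, _ in window]
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
    return mean(sizes), pstdev(sizes), mean(gaps)

packets = [(0.00, 60), (0.02, 1500), (0.05, 1500), (0.09, 1500), (0.15, 60)]
print(subflow_features(packets, n=3))
```

Because the window can start anywhere in the observed traffic, the same function works even when the capture misses the first packets of the flow.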


From a network forensics and security standpoint, delay, fragmentation, packet loss, and the emergence of new applications are very important aspects that need to be considered when evaluating a classifier, for more effective and realistic classification results; however, most studies in this field have not investigated these issues. Furthermore, unsupervised techniques have not been widely used in classifying network traffic, despite the fact that these techniques can identify unknown applications if used properly, or can be combined with supervised algorithms to enhance their effectiveness in analysing network traffic.

5 THE PROPOSED TECHNIQUE

Taking account of these gaps in the field, and of the importance of correctly and effectively classifying network traffic as a kind of proactive investigation, this research will investigate different network traffic classification techniques relying on the analysis of statistical flow properties using ML algorithms, without inspecting packet contents or using traditional techniques (port-based classification). The sequential steps of our work are illustrated in Figure 1. The first step involves capturing network traffic to create an appropriate dataset for our investigation. The second step consists of extracting features, such as the duration and length of packets and inter-packet arrival times, from the sub-flow. We chose to work with features extracted from sub-flows because we believe this is the fastest way to analyse network traffic, compared with using full flows, especially for security and forensic purposes. The extracted features will be used to train the classifier in the supervised-techniques stage, with the end goal of analysing network traffic smoothly. There is a high expectation that the classifier would not be able to distinguish a new application properly if we were to use several datasets other than that on which the classifier


was trained; this study looks to investigate this hypothesis by adding unseen examples to various datasets. The performance of the supervised classifier will be evaluated based on its ability to identify unseen applications; this is an important aspect that needs to be considered for a more realistic scenario. In the next stage, the objective will be to enhance the performance of the classifier by combining supervised and unsupervised learning techniques. The aim in this stage will be to leverage the ability of unsupervised algorithms to identify unclassified applications. We will then evaluate the results of using both types of technique, to determine any differences in performance. In the third stage, we will investigate the capability of unsupervised techniques to work independently. Through this work, we will look to find a way of improving the performance of classification techniques without relying solely on supervised techniques. Additionally, we also look to emphasise the importance of applying unsupervised techniques in network traffic analysis, and to investigate the ability of these techniques to distinguish new and unseen applications that could be malicious in nature. The idea behind this research is that we can evaluate the extent of their effectiveness in coping with real-world situations: situations that require both the prompt recognition of application type in the network before the flow ends, and the making of quick decisions. Such timeliness is critical, as many criminal incidents could be detected and prevented before they inflict extensive damage. This impetus explains our motivation for choosing to apply unsupervised techniques to identify unseen applications in forensic network traffic analysis.
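The combination of supervised and unsupervised learning planned for the second stage can be sketched along the lines of the cluster-then-label approach of [21]: cluster all flows, then name each cluster by majority vote over its labelled members, so that unlabelled flows inherit a label from their cluster. Everything below (features, labels, values, the tiny k-means) is an invented toy example, not the authors' implementation.

```python
# Hedged sketch of the cluster-then-label idea: a tiny k-means over all
# flows (labelled and unlabelled), then each cluster is named by majority
# vote of the labelled flows it contains. All values are hypothetical.

def kmeans(points, k, iters=20):
    """Very small k-means; initial centroids are the first k points."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

# Flows: (mean packet size, mean inter-arrival time); None = unlabelled.
flows = [
    ([1400.0, 5.0], "bulk"), ([1450.0, 4.0], None),
    ([100.0, 50.0], "interactive"), ([110.0, 55.0], None),
]
points = [f for f, _ in flows]
_, assign = kmeans(points, k=2)

# Map each cluster to the majority label among its labelled members.
names = {}
for (f, label), c in zip(flows, assign):
    if label is not None:
        names.setdefault(c, []).append(label)
cluster_label = {c: max(set(ls), key=ls.count) for c, ls in names.items()}

# Unlabelled flows inherit their cluster's label.
print([cluster_label.get(c, "unknown") for c in assign])
```

A cluster with no labelled members stays "unknown", which is exactly the property that makes the approach attractive for flagging previously unseen, possibly malicious, applications.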


Figure 1. The Workflow of Forensic Network Traffic Analysis

6 CONCLUSION

Network traffic is a source of evidence that needs to be classified correctly and immediately; detection and response must occur before an attack can succeed. As a result, real-time analysis of applications and automated classification are required to analyse the network, which is a valuable source of forensic evidence, and this in turn supports structured forensic investigations of network security incidents. However, most evidence collection techniques for network forensics rely on post-mortem analysis. Hence, a lot of precious evidence for an investigation might be lost, which can lead to inaccurate conclusions due to weak evidence. In fact, to build a clear and strong case lawyers need more corroborating evidence, which calls for real-time collection.


Therefore, this research aims to identify methods that can improve classification techniques using ML algorithms and strike a balance among accuracy, efficiency, and cost, because a large amount of network traffic remains unclassified.

REFERENCES

[1] Digital Forensics Research Workshop. (2001, November). A road map for digital forensic research. Retrieved from http://www.dfrws.org

[2] W. Wang and T. Daniels, "Building evidence graphs for network forensics analysis," in Proceedings of the 21st Annual Computer Security Applications Conference (ACSAC 2005), September 2005.

[3] A. Madhukar and C. Williamson, "A longitudinal study of P2P traffic classification," in 14th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, September 2006.

[4] A. W. Moore and K. Papagiannaki, "Toward the accurate identification of network applications," in Passive and Active Measurement, Boston, MA, March 2005, pp. 41–54.

[5] Z. Shi, Principles of Machine Learning. International Academic Publishers, 1992.

[6] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann Publishers, 2005.

[7] B. Silver, "Netman: A learning network traffic controller," in Proceedings of the Third International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Association for Computing Machinery, 1990.

[8] J. Frank, "Machine learning and intrusion detection: Current and future directions," in Proceedings of the 17th National Computer Security Conference, Washington, D.C., October 1994.

[9] Y. Reich and J. S. Fenves, "The formation and use of abstract concepts in design," in Concept Formation: Knowledge and Experience in Unsupervised Learning, D. H. Fisher and M. J. Pazzani, Eds. Morgan Kaufmann, 1991.

[10] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification," in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC) 2004, Taormina, Sicily, Italy, October 2004.

[11] A. Moore and D. Zuev, "Internet traffic classification using Bayesian analysis techniques," in ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS) 2005, Banff, Alberta, Canada, June 2005.

[12] T. Auld, A. W. Moore, and S. F. Gull, "Bayesian neural networks for Internet traffic classification," IEEE Transactions on Neural Networks, vol. 18, no. 1, pp. 223–239, January 2007.

[13] T. Nguyen and G. Armitage, "Training on multiple sub-flows to optimise the use of machine-learning classifiers in real-world IP networks," in Proceedings of the IEEE 31st Conference on Local Computer Networks, Tampa, FL, pp. 369–376, November 2006.

[14] T. Nguyen and G. Armitage, "Synthetic sub-flow pairs for timely and stable IP traffic identification," in Proceedings of the Australian Telecommunication Networks and Applications Conference, Melbourne, Australia, December 2006.

[15] J. Park, H.-R. Tyan, and C.-C. J. Kuo, "GA-based Internet traffic classification technique for QoS provisioning," in Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Pasadena, CA, December 2006.

[16] D. H. Fisher, M. J. Pazzani, and P. Langley, Concept Formation: Knowledge and Experience in Unsupervised Learning. Morgan Kaufmann, 1991.

[17] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, "Flow clustering using machine learning techniques," in Proceedings of the Passive and Active Measurement Workshop 2004, Antibes Juan-les-Pins, France, pp. 205–214, April 2004.

[18] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.

[19] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, "Traffic classification on the fly," ACM SIGCOMM Computer Communication Review, vol. 36, no. 2, pp. 23–26, 2006.

[20] S. Zander, T. Nguyen, and G. Armitage, "Automated traffic classification and application identification using machine learning," in IEEE 30th Conference on Local Computer Networks 2005, Sydney, Australia, pp. 250–257, November 2005.

[21] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, "Semi-supervised network traffic classification," ACM SIGMETRICS Performance Evaluation Review, vol. 35, no. 1, pp. 369–370, 2007.


[22] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, "Offline/realtime traffic classification using semi-supervised learning," Performance Evaluation, vol. 64, nos. 9–12, pp. 1194–1213, 2007.

[23] T. Nguyen, G. Armitage, P. Branch, and S. Zander, "Timely and continuous machine-learning-based classification for interactive IP traffic," IEEE/ACM Transactions on Networking, vol. 20, no. 6, pp. 1880–1894, December 2012.


Toward Forensics Aware Web Design: Case Study: Low Fare Local Hajj Booking Web Services

Khalid A. A. Al-Shalfan
Al-Imam Muhammed Ibn Saud Islamic University
[email protected]

ABSTRACT

Muslims who would like to take part in the Hajj (the Islamic pilgrimage to Mecca) have to register with the Hajj ministry or agency in their country in order to book a Hajj permit, for organisational and security reasons. The Saudi government organises Hajj registration via a set of web services that enables registration on the Low Fare Local Hajj (LFLH) program. The Hajj registration system exhibits several flaws in preserving forensic requirements, especially identity and privacy; consequently, if electronic evidence is required for an investigation, it will not be admissible, or at best admissible with low probative power. In this paper, we apply the Fi4SOA framework to the LFLH system as a real motivating example. First, in the design phase, we depict and extract the forensic and business requirements of the LFLH example, and we establish the SABSA matrix including all requirements and the strategic and physical operations. Secondly, in the run-time phase, we translate some LFLH rules and events into TESLA rules and events, and show how to infer over them and detect any forensic or business malfeasance.

KEYWORDS
Hajj booking, web services, forensics and business requirements, digital forensics.

1 LOW FARE HAJJ RESERVATION
The Hajj is the Islamic pilgrimage to Mecca (in Saudi Arabia) and the largest annual gathering of Muslims. It is one of the five pillars of Islam and a religious duty that must be carried out at least once in a lifetime by every Muslim who is physically and financially capable of undertaking the journey and can support his family during his absence. A Muslim who wishes to perform the Hajj has to register with the Hajj ministry or agency in his country in order to obtain a Hajj permit, for organizational and security reasons. Citizens of the KSA must likewise register to obtain a permit. The government organizes Hajj registration through a set of web services that enable enrollment in the Low Fare Local Hajj (LFLH) program, launched by the KSA Hajj ministry and characterized by its low cost. The registration capacity of LFLH in 2014 was about 41,000 pilgrims [1] in a country of more than 28 million citizens. The LFLH web service opens one month before the Hajj days; since capacity is limited, all available places are booked within the first few days. Figure 1 shows the structure of the LFLH web services. Booking a Hajj permit in LFLH begins by logging into the local Hajj web site [2]. The user then chooses between requesting a new booking, registering the payment confirmation after the booking is accepted, or cancelling the reservation in the case of withdrawal.


Figure 1. Low Fare Local Hajj web services Structure.

Requesting a new booking starts with reading and confirming the general rules and policies of the Hajj, including who is authorized, the quality of services, and the costs. In the next step, the user has to input his identity number to verify his right to the Hajj, since each KSA citizen has the right to perform the Hajj once every five years. The government also imposes the following conditions and policies:
- The applicant must be at least fifteen (15) years old.
- A woman must submit her request together with her Mahram.
- A single booking request must not exceed ten (10) members.
In addition, the agreement covers the LFLH service program and its price list according to the categories of camps. Once the pilgrim satisfies the registration conditions, he or she is asked to choose the city of residence, so that the search is restricted to Hajj companies in that city. The available Hajj companies are then listed, together with the cost and the category offered by each one. The user chooses the one matching his preferences and confirms the initial booking. The Hajj portal then redirects the user to the Hajj company's web page on a site common to all companies, tawaf.com.sa [3]. The company first informs the user whether places are still available. The user is asked to provide the required information (name, birth date, ID number, phone number). Finally, the user receives an acceptance SMS and an electronic preliminary confirmation invoice. The user has to pay the amount to the Hajj company within 48 hours; otherwise the registration is cancelled automatically. After paying, the user enters the payment details into the Hajj company system, which enables him to track the payment status. The Hajj company sends a payment confirmation SMS once the user is satisfied and agrees with everything. Should a registered pilgrim wish to withdraw, the Hajj portal system enables him to cancel his reservation by entering his ID number and confirming the cancellation by typing the code received by SMS into the dedicated field of the Hajj company system.
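The booking flow just described is essentially a small state machine: request, initial booking, payment within 48 hours, confirmation, with cancellation possible along the way. As a minimal sketch of the 48-hour auto-cancellation rule, with hypothetical state and event names (the real system's internals are not documented here):

```python
from datetime import datetime, timedelta

# Hypothetical booking states and transitions sketching the LFLH flow
# described above; the names are illustrative, not from the real system.
TRANSITIONS = {
    ("requested", "initial_booking_confirmed"): "awaiting_payment",
    ("awaiting_payment", "payment_registered"): "confirmed",
    ("awaiting_payment", "cancelled_by_user"): "cancelled",
    ("confirmed", "cancelled_by_user"): "cancelled",
}

PAYMENT_DEADLINE = timedelta(hours=48)  # pay within 48 hours or lose the booking

def next_state(state, event, booked_at=None, now=None):
    """Advance the booking state; auto-cancel if the payment deadline passed."""
    if state == "awaiting_payment" and booked_at and now:
        if now - booked_at > PAYMENT_DEADLINE:
            return "cancelled"  # registration cancelled automatically
    return TRANSITIONS.get((state, event), state)

booked = datetime(2014, 9, 24, 0, 35)
# A payment attempt 50 hours after booking arrives too late:
print(next_state("awaiting_payment", "payment_registered",
                 booked, booked + timedelta(hours=50)))  # cancelled
```

A payment registered within the window would instead move the booking to the confirmed state.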


The described Hajj registration system exhibits several weaknesses with respect to the preservation of forensics requirements, in particular identity and privacy; consequently, should electronic evidence be required for an investigation, it would be inadmissible, or at best admissible with low probative power. Registration scenarios that breach forensics requirements are described below.

Scenario 1: Suppose that user 1 makes a mistake when entering his ID number in the Hajj portal, and the portal by coincidence finds that this ID number satisfies the registration conditions and therefore redirects him to the Hajj company system. The user then enters his other details and confirms the registration. In this case, user 1 will not be able to obtain the Hajj permit, since the information he provided is contradictory; and the person who actually holds the entered ID number will not be able to register, since he or she already appears as registered by the first user. Both lose the Hajj permit, which matters all the more because the number of places is limited and millions of people are booking at the same time.

Scenario 2: Now suppose that user 1 intentionally enters the ID number of user 2, knowing it is not his own, into the Hajj company system. User 1 simply wants to block user 2 from registering, making him miss the low-cost Hajj permit opportunity. In this case, user 2 must wait 48 hours before he can book again, by which time almost all available places are reserved. The Hajj system provides no procedure to track and catch user 1.

Scenario 3: The payment tracking feature of the Hajj system lets users follow their reservation status: a user enters his ID number into the Hajj system and is automatically redirected to the Hajj company system, which presents the payment and reservation status associated with that ID number. On this page, the user can change the ID number and thereby see the reservation status and private information (name, birth date, phone number…) of other users, including whether or not they registered in the low-cost Hajj program.


In scenario 1, the identity of the user is breached accidentally and no procedure is provided to detect the mistake; the wrong identity is then used to complete the remaining steps successfully, and the wrong ID number is detected only when hard copies are handed to the Hajj company. Scenario 2 illustrates that the Hajj system does not take the forensics dimension into account in order to detect and track suspect users. In scenario 3, the privacy of users is breached, their data being accessible to other users. In all these cases, a forensics layer that builds forensics features into the services at design time, monitors the transactions, detects any forensics breach, and enables tracking of the suspect is a primordial necessity.

2 Fi4SOA FRAMEWORK OVERVIEW
In previous work, our research team proposed the Fi4SOA framework [4]. It aims to handle data in a forensically sound way and to automatically find, detect, and track forensics or business breaches. In this section, we recall the main phases of the Fi4SOA framework. The first phase, called design-time forensics and business requirements integration, covers the extraction of forensics and business requirements, the establishment of rules and drivers to preserve them, and practical strategies and recommendations to integrate them into the targeted application. To this end, we used and adapted the Sherwood Applied Business Security Architecture (SABSA) security methodology to extract, establish, and integrate forensics and business requirements. Through SABSA's detailed layers, we easily determined the forensics and business properties and the manner of applying them without conflicts or degradation of the application's quality of service. In addition, the methodology provides a set of case-specific rules and recommendations for each involved party (hardware, software, or humans) so that data is processed in a forensically sound manner.


The second Fi4SOA phase is named run-time events monitoring. In this phase, based on the forensics and business rules, we monitor events (logs, transactions, etc.) and detect any forensics or business malfeasance. Whenever a threat is detected, the system alerts the administrator and logs all information and data related to the incident. For this purpose, we use TESLA as an event specification language that enables the formalization of events and rules; it also provides mechanisms to detect events matching predefined rules. Through the Fi4SOA framework, we first prepare and improve the forensics readiness of the targeted application, so that at any time we can obtain forensically sound information about an incident or any other event. Second, the framework provides real-time event monitoring, detecting any forensics or business malfeasance at an early stage and thereby increasing the flexibility of intervention and response.

3 APPLICATION OF Fi4SOA TO LFLH
In this section, we apply the Fi4SOA framework to the real-world LFLH example. First, in the design phase, we extract the forensics and business requirements of the LFLH example and establish the SABSA matrix covering all requirements and the strategic and physical operations. Second, in the run-time phase, we translate some LFLH rules and events into TESLA rules and events and show how to infer them and detect any forensics or business malfeasance.

SABSA [5] is an open-standard methodology for designing and developing a risk-driven information security architecture for enterprises. It provides a set of guidelines and solutions supporting critical business initiatives. The SABSA methodology consists of a 6x6 matrix. Vertically, the matrix comprises six layers: the contextual, conceptual, logical, physical, component, and operational security architectures, representing respectively the business view, architect's view, designer's view, builder's view, tradesman's view, and facility manager's view (details are omitted for space and can be found in [5]). Horizontally, SABSA uses six questions (what, why, how, who, where, and when) to analyze the six layers in detail. Within the Aman System Research Team, we applied the SABSA methodology to digital forensics in a web-services-based infrastructure in order to determine forensics and business requirements without conflicts. In the assets column, we determine the different forensics and business attributes to be protected and preserved. The motivation column provides case-specific guidelines for maintaining and achieving business and forensics goals. The SABSA process column identifies formal and technical solutions for the encountered problems. The SABSA people, location, and time columns determine, respectively, the parties involved in the business or forensics matters, their location, and the time of execution or availability. Returning to the LFLH example, Table 1 lists the rules and recommendations required for forensics and business issues.
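The 6x6 structure lends itself to a simple lookup table. A minimal sketch of the layer-by-question indexing (the two cell texts shown are abbreviated examples; the full matrix content is not reproduced here):

```python
# Sketch of the SABSA 6x6 matrix as a lookup table: rows are the six
# layers, columns the six questions. Cell texts are illustrative
# placeholders, not the complete matrix.
LAYERS = ["contextual", "conceptual", "logical",
          "physical", "component", "operational"]
QUESTIONS = ["what", "why", "how", "who", "where", "when"]

matrix = {(layer, q): None for layer in LAYERS for q in QUESTIONS}
matrix[("contextual", "who")] = "KSA citizens and subscribed Hajj company employees"
matrix[("contextual", "when")] = "Service open only the month before the Hajj days"

def cell(layer, question):
    """Look up one of the 36 SABSA cells, rejecting invalid coordinates."""
    if (layer, question) not in matrix:
        raise KeyError(f"not a SABSA cell: {layer}/{question}")
    return matrix[(layer, question)]

print(cell("contextual", "who"))
```

Representing the matrix this way makes it easy to check, cell by cell, that no layer/question combination has been left unaddressed during design.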

3.1 Design phase

Table 1. LFLH-Related SABSA Matrix (flattened here into one block per layer; the original columns are Assets (What), Motivation (Why), Process (How), People (Who), Location (Where), and Time (When)).

Contextual layer:
- Assets / Motivation: Ensuring the preservation and readiness of forensics features and properties across the whole business area; defining internal and external policies that determine the working procedures of every party; ensuring the authentication of every user and attributing a role and responsibility to each one; protecting the business environment; conducting investigations against any violation of enterprise policy.
- Process: Framework of operational processing for digital investigation.
- People: KSA citizens and subscribed Hajj company employees.
- Location: Only KSA citizens are allowed to subscribe to the service.
- Time: The service is available only during the month before the Hajj days.

Conceptual layer:
- Assets / Motivation: Auditing and evaluating the investigation is a high-level requirement, especially in SOA, owing to the complex, dynamic, distributed, and heterogeneous nature of the interconnected services, which increases the possibility of making errors and collecting evidence in the wrong way. Investigation in an SOA environment requires high-level qualifications and skills, which makes the monitoring and documentation of all investigation activities very important. Obtaining the same test conditions and the same results is almost impossible given the changing, complex nature of an SOA environment; however, repeatability and reproducibility can be understood as obtaining the same result appearance under close conditions, so results are considered repeatable and reproducible if they are accurate (having the same appearance) even when the test conditions change. The assets include the SLA, all transactions conducted by the service, and the different persons and tools in contact with the data.
- Process: Forensics process according to standards; organizational policy.
- People: Forensics responsibilities assignment matrix.

Logical layer:
- Assets / Motivation: Authentication on the LFLH web site is based only on the identity number, which is not private and can be known to several persons (scenarios 1 and 2). To ensure that the registered person is the one who holds the entered identity, authentication should use a combination of parameters, for example identity number, birthday, and password; another solution is fingerprint or face recognition, whose supporting technologies are now widely available. Moreover, once a user is logged into the Hajj company site via the Hajj ministry web site, he or she can change the identity number and gather private information of other registered users (scenario 3), which is a privacy breach. A simple remedy is to disable any request for information through the Hajj company web site and to present only the information related to the identity number entered on the Hajj ministry site.
- Process: Forensics service; forensics trusted framework.

Physical layer:
- Assets / Motivation: In the LFLH motivating example, to ensure that registered persons can access only their own private information, we can disable any query sent from the Hajj company web page, warn users (for example through information banners) that any attempt to access other users' private information is prohibited and leads to judicial follow-up, and log every action taken by the user in a secure location.
- Process: Role-based access control; forensics policy authority domain map.

Component and operational layers:
- Motivation: Inform the user about privacy policies; minimize the handling and corruption of original data; do not exceed your knowledge; record any changes you make to the data.
- Process: Chain of custody.

3.2 Run time phase
In this section, we apply TESLA reasoning to some of the LFLH forensics and business rules and requirements and show how to detect an integrity breach incident.

Event set
The event set includes all user requests, service responses, messages, security alerts, and transactions occurring during the service composition. Events are extracted from these resources, and for each event we identify all related information such as the event type, record, task, and contact. For instance, Figure 2 shows a SOAP request message whose body represents a method call at the service, preceded by optional WS-Addressing headers that provide the URIs of the target service and action, and a unique message identifier:

MessageID: uuid:6B29FC40-CA47-1067-B31D-00DD010662DA
To: http://interiorministry.gov.sa/identityservice
Action: http://interiorministry.gov.sa/SubmitId
Body (SubmitId): 62518396

Figure 2. Soap Message Request.


This message is translated into the following TESLA event notification:

name SoapRequestMessage
number to = "http://interiorministry.gov.sa/identityservice" and action = "http://interiorministry.gov.sa/SubmitId" and GetConfirmation = "62518396" and MessageID = "6B29FC40-CA47-1067-B31D-00DD010662DA"
type soapmessage
order nothing
timeStamp "2014-09-24 00:35:35"
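The translation from a SOAP request into a flat event record can be sketched with the standard library. The envelope text below is a hypothetical reconstruction mirroring the Figure 2 values (namespaces, element names, and the `SubmitId` body shape are assumptions, not taken from the real service):

```python
import xml.etree.ElementTree as ET

# Hypothetical SOAP request mirroring Figure 2: WS-Addressing headers
# plus a SubmitId body. The exact envelope shape is an assumption.
SOAP = """<s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope"
            xmlns:a="http://www.w3.org/2005/08/addressing">
  <s:Header>
    <a:MessageID>uuid:6B29FC40-CA47-1067-B31D-00DD010662DA</a:MessageID>
    <a:To>http://interiorministry.gov.sa/identityservice</a:To>
    <a:Action>http://interiorministry.gov.sa/SubmitId</a:Action>
  </s:Header>
  <s:Body><SubmitId><id>62518396</id></SubmitId></s:Body>
</s:Envelope>"""

NS = {"s": "http://www.w3.org/2003/05/soap-envelope",
      "a": "http://www.w3.org/2005/08/addressing"}

def to_event(soap_text, timestamp):
    """Extract the WS-Addressing headers and body value into a flat event record."""
    root = ET.fromstring(soap_text)
    return {
        "name": "SoapRequestMessage",
        "to": root.findtext("s:Header/a:To", namespaces=NS),
        "action": root.findtext("s:Header/a:Action", namespaces=NS),
        "messageID": root.findtext("s:Header/a:MessageID", namespaces=NS),
        "id": root.findtext("s:Body/SubmitId/id", namespaces=NS),
        "timeStamp": timestamp,
    }

event = to_event(SOAP, "2014-09-24 00:35:35")
print(event["action"])  # http://interiorministry.gov.sa/SubmitId
```

Such a flat record is exactly what a notification like the one above carries: a name, a set of attributes, and a timestamp.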

Application domain rules
The application domain rules component essentially gathers the policies and rules from the Service Level Agreement (SLA) and the web service description (.wsdl). It translates them into rules and policies that organize the relations between the different partners and determine the duties and rights of each one. For instance, in the LFLH example the government stipulated four conditions for registering with the service. The conditions and their TESLA rule representations follow.

- The applicant must not have obtained a Hajj booking during the past five years:

define BeneathPeriod(periode: "val")
from Hajj(year=$y) and period($currentYear - $y < 5)
where val = $periode

15

Proceedings of The Second International Conference on Digital Security and Forensics, Cape Town, South Africa 2015

- A woman must submit her request together with her Mahram:

define NoMuharram()
from gender($gender = "women") and muharram($exist = "no")

- A single booking request must not exceed ten (10) members:

define OverBookingNumber(number: "val")
from bookingvisit() and each (bookingnumber > 10) within 1 visit from bookingvisit()
where val = bookingnumber
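The same registration conditions can be checked procedurally. A minimal sketch, with hypothetical applicant field names (the real portal's data model is not documented here), that returns every violated condition:

```python
from datetime import date

def registration_violations(applicant, today=None):
    """Return the list of LFLH registration conditions the applicant breaks.
    The applicant dict fields are illustrative assumptions."""
    today = today or date.today()
    problems = []
    age = today.year - applicant["birth_year"]
    if age < 15:
        problems.append("applicant younger than 15")
    last = applicant.get("last_hajj_year")
    if last is not None and today.year - last < 5:
        problems.append("performed Hajj within the past five years")
    if applicant["gender"] == "woman" and not applicant.get("mahram_id"):
        problems.append("woman registering without her Mahram")
    if applicant["party_size"] > 10:
        problems.append("booking exceeds ten members")
    return problems

ok = {"birth_year": 1980, "last_hajj_year": 2005, "gender": "man", "party_size": 4}
bad = {"birth_year": 2002, "last_hajj_year": 2012, "gender": "woman", "party_size": 12}
print(registration_violations(ok, date(2014, 9, 24)))   # []
print(registration_violations(bad, date(2014, 9, 24)))  # all four conditions broken
```

Returning the full list of violations, rather than failing on the first one, mirrors how each TESLA rule fires independently.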

Returning to the LFLH example, we assume the following rules, each accompanied by its TESLA formulation. Entered data (ID number, gender, name, birthday, and phone number) may be used only for registration purposes by the Hajj portal and Hajj company web sites:

define privacyDataDeliveryBreaches(service: "val")
from registration() and usedIn(serviceId=$x) and ($x != "www.tawaf.com.sa") and ($x != "www.locallowfare.haj.gov.sa")
where val = serviceId

The serviceId holds the domain name of the service currently using the entered data. Entered data must be removed in the case of registration cancellation:

define privacyDataRemoveBreaches(service: "val", data: "val1")
from registrationCancelation() and (serviceID=$x) and notRemove(data="dataname") within ts from registrationCancelation()
where val = serviceId and val1 = dataname

Here ts is the maximum time, in seconds, within which data must be removed after a registration cancellation. Data entered for a confirmed registration must be retained only by the Hajj portal, and not by the Hajj company, beyond 10 days after the accomplishment of the Hajj:

define privacyDataRetentionBreaches(service: "val", data: "val1")
from hajjAccomplished() and (serviceID = "www.tawaf.com.sa") and notRemoveData(data="dataname") within 10 days from hajjAccomplished()
where val = serviceId and val1 = dataname

Digital forensics properties
This component deals with the formulation of the general, essential digital forensics properties that make the gathered evidence admissible with high probative power. It is not tied to a particular application domain, since all digital areas share the same admissibility requirements; only the manner of proceeding or collecting data differs. The forensics policies specific to the application domain are determined in the application domain rules component; the digital forensics properties component looks only at the relations between properties and at whether any requirement is missing or unconsidered when handling data. The digital forensics component contains four classes. The forensicProperty class represents the smallest property defining a specific requirement; for example, the identity of a record (or piece of evidence) is a specific requirement of the class type authenticity (see Figure 1). To validate a record, it must satisfy all authenticity requirements, namely identity, integrity, and authentication. The second class, classType, contains all the essential types of forensics requirements, such as authenticity, reliability, privacy, and comprehensiveness. Some of them relate to the admissibility requirement and others to the weight of the gathered evidence, which is represented by the third class, category. The last class, severity, determines the priority and relevance of each forensics property; it essentially attributes to each property or relation a value expressing how severe the loss of that property would be. For instance, consider the formulation of the authenticity requirement, which encompasses three properties: identity, integrity, and authentication. The identity requires the definition of the different attributes that characterize the evidence, such as the date and time of issuing, the creation time, the author, the addressee, and the subject. The identity is converted to TESLA as follows:

define identityBreaches(MissingIdentityAttribute: "attr", record: "ref")
from record(recordref) and (notsignature(val) or notsignedby(val) or nothascreationtime(val) or nothassubject(val))
where attr = val, ref = recordref

The integrity property aims to preserve the original data and to keep the saved copy free from alteration, in order to avoid any contamination of the data presented in court. Data collection should apply integrity-preserving methods during storage and transmission, so as to avoid alteration and maintain authenticity. The integrity property thus aims to ensure that the collected data are protected, complete, and untampered with, using hashing or cryptographic techniques. All actions related to the record must be logged and preserved, including the names of all persons handling and responsible for record keeping over time, all technical modifications and annotations, and all actions and policies related to data retention, disposition, and transfer. The integrity can then be verified, enhancing the record's trustworthiness. The integrity is converted to TESLA as follows:

define integrityBreaches(MissingIntegrityAttribute: "attr", record: "ref")
from record(recordref) and (nothashashvalue(value) or isAltered(val) or notSignedby(val) or nothascreationtime(val) or nothassubject(val))
where attr = val, ref = recordref
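The hash-based part of this integrity check can be sketched in a few lines; SHA-256 is chosen here for illustration, since the text does not fix a particular algorithm:

```python
import hashlib

def fingerprint(message: bytes) -> str:
    """Hash taken at collection time and stored with the record."""
    return hashlib.sha256(message).hexdigest()

def integrity_intact(message: bytes, stored_hash: str) -> bool:
    """Re-hash the preserved copy and compare with the stored value."""
    return fingerprint(message) == stored_hash

original = b"GetConfirmation=62518396"
h = fingerprint(original)
print(integrity_intact(original, h))                      # True: record unaltered
print(integrity_intact(b"GetConfirmation=99999999", h))   # False: tampering detected
```

Any alteration of the preserved copy changes the digest, so a stored hash taken at collection time is enough to demonstrate later that the record was not modified.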

The authentication property aims to allow only authorized persons or software to access a document, and serves to prove the document's authenticity at one particular moment. The authentication is converted to TESLA as follows:

define authenticationBreaches(MissingAuthenticationAttribute: "attr", record: "ref")
from record(recordref) and (isNotauthorized(value) or isNotSkilledin(val) or isNotCompetentin(val))
where attr = val, ref = recordref

Finally, the authenticity property, which requires proof of identity, integrity, and authentication, is formulated in TESLA as follows:

define AuthenticityBreaches(MissingAuthenticityAttribute: "attr", record: "ref")
from record(val) and (integrityBreaches(val) or identityBreaches(val) or authenticationBreaches(val))
where attr = val, ref = recordref

The rest of the digital forensics properties are formalized in the same way as the authenticity property, forming the essential policies and rules that monitor the forensic soundness of the system. Event notifications and rules must share the same formalization in order to avoid any incoherence between them.

TESLA reasoning system
The system reasoner infers rules and events in order to detect any matched pattern and generate the corresponding notifications. It uses the application domain and digital forensics rules and policies, which form the knowledge database, together with the rules and requirements provided by the users or investigators who subscribe to the events. Besides the knowledge database, the reasoner holds predefined rules describing possible attack or forensics-breach scenarios, extracted from the history of generated event notifications or provided by experts. These rules include patterns of forensics violations within a specified period, or ordered sequences of events. The reasoner also considers newly subscribed rules provided by end users (service requesters, forensics staff, service providers, etc.) who wish to inquire about certain events. Subscribing to events is very easy with TESLA, as explained in section 3, offering users high flexibility to achieve their goals in real or near-real time without having to define new rules containing their request beforehand. In the following, we present some instances of events and rules and show how the system reasoner infers them, based on the LFLH motivating example. Consider the following event scenario: a user enters his ID number into the Hajj portal system in order to register for the Hajj. The system sends the identity, in a SOAP request message, to a second private service linked to the interior ministry databases, in order to verify the Hajj permit right for the current year. The interior ministry then replies with a SOAP response message containing the result. This transaction triggers the following actions in the background:


The above transaction comprises six events, starting with the input of the ID number and ending with its verification by the interior ministry web service. As a forensics requirement, the transaction must first inform the user about the parties that use his information; then, while the information is being sent and received by the services, the data must be protected against any violation or tampering in order to preserve its integrity, privacy, and confidentiality. Afterwards, the transaction must be saved and stored in the services' databases by accurate software and processes and by authenticated, skilled persons. Each of the six events generates a TESLA event notification:

at t=0:
name InputingIDNumber
number service = "hajj portal" and action = "DataInputting"
type userInputingData
order nothing
timeStamp "2014-09-24 00:35:35"

at t=1:
name SoapRequestMessage
number to = "http://interiorministry.gov.sa/identityservice" and action = "http://interiorministry.gov.sa/SubmitId" and GetConfirmation = "IDnumber" and MessageID = "6B29FC40-CA47-1067-B31D-00DD010662DA" and Hashvalue = "GUODWDS54SDFF98FSS53SFIUID36FS"
type soapmessageCall
order nothing
timeStamp "2014-09-24 00:35:40"

at t=2:
name SoapRequestMessageReceiving
number from = "hajj portal" and MessageID = "6B29FC40-CA47-1067-B31D-00DD010662DA" and VerifyHashvalue = "GUODWDS54SDFF98FSS53SFIUID36FS"
type soapmessageReceivingAndVerification
order nothing
timeStamp "2014-09-24 00:35:55"

at t=3:
name VerifyIDnumber
number GetConfirmation = "IDnumber" and MessageID = "6B29FC40-CA47-1067-B31D-00DD010662DA"
type VerifIDnumber
order nothing
timeStamp "2014-09-24 00:36:00"

at t=4:
name SoapRequestMessageResponse
number relatedto = MessageID "6B29FC40-CA47-1067-B31D-00DD010662DA" and confirmationresponse = "true"
type soapmessageresponse
order nothing
timeStamp "2014-09-24 00:36:10"

at t=5:
name ConfirmorRejectIDnumber
number confirmationresponse = "true"
type ConfirmorRejectIDnumber
order nothing
timeStamp "2014-09-24 00:36:12"

Now suppose that the knowledge database contains the following integrity checking rule:

define integrityNotChecked(messageId: "id", serviceId: "ref", missingAttribute: "val")
from SoapMessage(id) as SM
  or not generatehashvalue(val) as HM within 1s from SM
  or not Signedby(serviceId) as Sby within 1s from SM
  or not hascreationtime(val) as CT within 1s from SM
  or not hassubject(val) as S within 1s from SM
where messageId = id, ref = serviceId, missingAttribute = val

The above rule includes only two successive conditions: the first, SoapMessage(), defines the type of message; the second is one of the integrity requirement attributes (hash value, signed by, creation time, and subject), since a single missing attribute already decreases the integrity. Figure 3 depicts the different event detection models for the rule integrityBreaches:
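The integrityNotChecked pattern is a temporal one: a SOAP message event that is not followed by a hash-generation event within one second. A minimal procedural sketch over a timestamped event history (the event dict shapes and type names are illustrative assumptions):

```python
from datetime import datetime, timedelta

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

def integrity_not_checked(events, window=timedelta(seconds=1)):
    """Flag SOAP call events with no hash-generation event inside the window."""
    breaches = []
    for e in events:
        if e["type"] != "soapMessageCall":
            continue
        hashed = any(h["type"] == "hashValueGenerated"
                     and h["messageID"] == e["messageID"]
                     and timedelta(0) <= parse(h["ts"]) - parse(e["ts"]) <= window
                     for h in events)
        if not hashed:
            breaches.append(e["messageID"])
    return breaches

history = [
    {"type": "soapMessageCall",    "messageID": "m1", "ts": "2014-09-24 00:35:40"},
    {"type": "hashValueGenerated", "messageID": "m1", "ts": "2014-09-24 00:35:40"},
    {"type": "soapMessageCall",    "messageID": "m2", "ts": "2014-09-24 00:36:00"},
    # no hash event for m2 -> integrity breach
]
print(integrity_not_checked(history))  # ['m2']
```

This batch scan gives the same verdict as the rule; TESLA instead evaluates the window incrementally, as the automata in Figure 3 illustrate.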


Figure 3. Event Detection Models for the Rule integrityBreaches.

Each transition from state s1 to state s2 is labeled with the set of constraints that an incoming event of type s2 has to satisfy to trigger the transition, plus the maximum time within which the transition must fire. In Figure 3, all models share the first state SM, since they all relate to the SOAP message request event; any attribute missing within one second of the creation of the SOAP request triggers a notification of an integrity-requirement breach. To describe the behavior of the event detection automata, we consider, for simplicity, only the sequence of events captured by model M1 of the rule integrityBreaches.

Briefly, event processing starts by creating a single instance of each automaton. For each incoming event, the engine either creates new automaton instances, moves existing ones from state to state, or deletes some of them. For instance, when the automaton instance M1, in state SM, detects an event e that satisfies the constraints of the transition exiting SM (in our example, the non-generation of the message hash value within one second of SM), it first duplicates itself, creating a new instance M11, and then uses e to move M11 to the next state (while M1 remains in state SM). Events that do not satisfy any currently enabled transition are simply ignored, while automaton instances are deleted if they are unable to progress within the maximum time associated with each transition. The integrityBreaches rule is triggered when an instance of the corresponding automaton reaches its accepting state, represented with a double circle (see Figure 4).

Figure 4. Event Processing Example of the Rule integrityBreaches.

Table 2. An Example of Event History Occurrences.
- T=0: Input the ID number.
- T=1: Generate the call SOAP message; add the integrity hash value.
- T=2: Receive the SOAP message; verify the integrity.
- T=3: Verify the ID number.
- T=4: Send the SOAP response message.
- T=5: Confirm or reject the ID number.
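The instance-duplication behavior described above can be sketched with a tiny automaton class. States and event names are illustrative, and the per-transition timeouts that real TESLA automata carry are omitted for brevity:

```python
class Automaton:
    """Linear detection automaton: a chain of expected event types."""
    def __init__(self, chain, pos=0):
        self.chain = chain   # e.g. ["SM", "no_hash"]
        self.pos = pos

    def accepting(self):
        return self.pos == len(self.chain)

def process(instances, event):
    """On each event, matching instances duplicate and the copy advances;
    the originals stay put, as in the M1 -> M11 example above."""
    fired = []
    new = []
    for a in instances:
        if not a.accepting() and event == a.chain[a.pos]:
            child = Automaton(a.chain, a.pos + 1)  # duplicated instance
            (fired if child.accepting() else new).append(child)
    instances.extend(new)
    return fired

# Model M1 of integrityBreaches: a SOAP message then a missing hash value.
instances = [Automaton(["SM", "no_hash"])]
assert process(instances, "SM") == []      # a copy advances; M1 stays in SM
accepted = process(instances, "no_hash")   # the copy reaches the accepting state
print(len(accepted))  # 1 -> integrityBreaches triggered
```

Keeping the original instance in its state is what allows the same automaton to keep matching later occurrences of the pattern in the event stream.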

4 CONCLUSION
In this paper, we treated the case of a web service's non-conformity with forensics requirements using an accurate modeling technique, SABSA. Using SABSA, we identified the implementation malfeasance and suggested several recommendations to solve the identified issues. The importance of this paper resides in its modeling of a local, very critical, and important set of web services in Saudi Arabia: the Low Fare Local Hajj booking web services. Furthermore, we translated the rules and recommendations extracted with SABSA into the TESLA reasoning language in order to detect and respond to any forensics issue at execution time, and we provided an example of TESLA event detection for one forensics property of the Low Fare Local Hajj booking web services. In future work, the defined methods and rules will be implemented and integrated into real web services.

ACKNOWLEDGMENTS
This paper is a partial result of a research project granted by King Abdul Aziz City for Science and Technology (KACST), Riyadh, Kingdom of Saudi Arabia, under grant number 11-INF1787-08. We thank the AMAN System team (www.amansystem.com) for their help and recommendations.

REFERENCES
[1] Hajj Ministry of KSA, 2014. [Online]. Available: http://www.haj.gov.sa/ar-sa/SitePages/NewsDetail.aspx?newsid=1671
[2] Hajj Ministry of KSA, 2014. [Online]. Available: http://www.locallowfare.haj.gov.sa/LF/home2.xhtml
[3] Hajj company page, 2014. [Online]. Available: http://www.tawaf.com.sa
[4] G. Cugola and A. Margara, "TESLA: A Formally Defined Event Specification Language," in Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems, Cambridge, United Kingdom, 2010.
[5] J. Sherwood, A. Clark, and D. Lynas, Enterprise Security Architecture: A Business-Driven Approach, San Francisco: CMP Books, 2005.