Multi-dimensional Knowledge Integration for Efficient Incident Management in a Services Cloud

Rajeev Gupta*, Hima Karanam*, Laura Luan**, Daniela Rosu**, Chris Ward**
* IBM Research Lab, India
** IBM T.J. Watson Research Center, USA

Abstract

The increasing complexity and dynamics in IT infrastructure and the emerging Cloud services present challenges to timely incident/problem diagnosis and resolution. In this paper we present a problem determination platform with multi-dimensional knowledge integration (e.g., configuration data, system vitals data, log data, related tickets) and enablement for efficient incident and problem management in the enterprise. Three features of the platform are discussed: automated ticket classification, the automated association of resources with tickets based on integration with the configuration database, and the collection of the system vitals relevant to the ticket through integration with monitoring systems. In response to the emerging Cloud services and their highly dynamic service operation context, we identify the need for a proactive service management approach which incorporates configuration and deployment of incident management tools, policies, and templates throughout the service life cycle in order to enable effective and efficient incident management in service operation.

1. Introduction

A recognized library of best practices in IT Service Management (ITSM), the Information Technology Infrastructure Library (ITIL) [1], strongly advocates that effective incident and problem management is critical to managing IT service operation. This criticality derives from the high potential for service disruption to the business and customers and from the significant contribution to operation cost. The increasing complexity and dynamics in IT infrastructure and IT services present challenges to timely diagnosis and resolution of incidents and problems. For instance, today an enterprise application typically consists of multiple interconnected functional components such as a front-end web server, a user authentication server, an application server, and a back-end database server, and each component might be deployed in multiple instances for scalability under increased workload. Errors in any of the components and their instances affect the application performance and cause problems for the end users. Comprehensive understanding of the underlying IT infrastructure and the integration of knowledge related to multiple dimensions of the service operations (e.g., configuration, system vitals) are critical for effective

incident and problem diagnosis.

The emerging Cloud computing [2] and the new enterprise data center [3] model bring an additional level of complexity and dynamics. In a Cloud environment, resources are virtualized, shared, and allocated dynamically upon workload demands in order to achieve flexibility and cost efficiency. Therefore, the operation context of a service changes dynamically at run time, e.g., resources being added or removed. For fast resource allocation, Cloud services employ installation image technology, where an installation image comprises a set of related resources, e.g., OS, middleware, and applications. When an installation image is installed, the comprised set of resources is instantiated. Effective incident/problem diagnosis in such an environment requires knowledge of the mappings from virtual to physical resources and traceability of operation context changes.

While there are many commercial tools that provide operations insight and consolidate events generated by the environment (e.g., TEC [4] or Patrol [5]), these tools tend to lack timely acquisition of relevant knowledge in the context of the current incident/problem, which can only be provided by augmenting traditional service request management environments with a full life-cycle view.

In this paper, we present a problem determination platform which enables automatic association and collection of relevant multi-dimensional knowledge (e.g., configuration data, system vitals data, log data, related tickets) on service operation within the context of a given incident/problem ticket in order to facilitate timely incident/problem diagnosis. The integrated operation knowledge derives from diverse sources, through specialized tools and configurations.

Contributions and organization of the paper: First, Section 2 describes the background and related work. Then Sections 3, 4, and 5 discuss three key extensions that demonstrate the context-aware knowledge integration and the required management system configuration in the service design and transition phases. The three extensions are (1) automatic incident classification, (2) automatic configuration item (CI) association in conjunction with the Configuration Management Database (CMDB), and (3) system and application vitals collection in interaction with monitoring systems. We implement our approach within IBM Tivoli Service Request Manager (TSRM) [6]. Initial results showing how our approach leads to quicker and better incident management are presented in Section 6. Finally, Section 7 summarizes the paper. In the rest of

the paper, we mostly focus on incident management, while the principles apply to problem management as well. Table 1 lists examples of management tasks related to incident/problem determination relevant in each phase of the service life cycle. Our contributions are mainly related to the tasks emphasized in bold in Table 1.

Table 1: Service life cycle and problem determination tasks (* = capability in existing tooling)

Service Design — the design management plan should include:
• Templates for monitoring agents *
• Policies for event handling *
• Incident classification tree *
• Templates for CI association
• Templates for service/system vitals
• Templates for automatic fault recovery

Service Transition:
• Service mapping information to virtual resources and to physical resources
• Update CMDB
• Configuration of external data sources, e.g., system vitals sources

Service Operation:
• Detect incidents via events/reports *
• Diagnose/resolve incidents/problems leveraging CI association, vitals, knowledge bases, search, and correlation
• Track resource changes (virtual and physical) if resources are dynamically allocated, e.g., in a Cloud environment
• Update CMDB

Service Termination:
• Resource release (virtual and physical)
• Update CMDB

2. Background and Related Work

The approach to problem determination proposed in this paper is distinguished from prior work by the extent of data integration and automation (Section 2.1) and by the flexible architecture for integrating data-source- and service-specific components in the problem determination platform (Section 2.2).

2.1 Data Integration

In recent years, serious efforts have been made in collecting various management data and integrating management systems to provide coherent access to relevant data. Monitoring systems are widely deployed to detect system/application events and collect performance data [16]. Some monitoring systems today are integrated with service request management systems to automatically open incident tickets upon detection of certain events [11]. Configuration information is consolidated in Configuration Management Databases (CMDB) with up-to-date information on Configuration Items (CI) and their

relationships. Service request management systems can be integrated with the CMDB to provide problem analysts with the capability to search and view CI information [9]. Data integration across multiple sources creates opportunities for efficiency improvement in the process of incident resolution. For instance, previous work [8] identified the need for system vitals details within the context of ticket resolution. Various approaches have been proposed to address this integration, but none could provide a highly efficient, fully automated solution because they build on limited data integration. For instance, IBM's TSRM [6] allows the analyst to open a window to the ITM monitoring tool [10] or to the managed server in order to manually collect the system vitals details. The Service Delivery Portal (SDP) [8], an environment that supports the daily activity of server system administrators, collects a fixed set of vitals related to a given server. Similarly, the IBM Virtualization Engine [7] integrates system configuration, system vitals, and task execution tools, across all servers in the environment, in order to facilitate on-demand collection. These solutions cannot provide the incident analyst with the system vitals that are highly relevant for the current incident because they do not integrate fully with the ticketing system. In this paper we demonstrate that broader data integration enables us to address this limitation. For instance, based on ticketing and configuration information, one can identify automatically, with high confidence, the type of ticket and the CIs related to the ticket. Further, based on the CI types and the monitoring relationships in the CMDB, the system automatically collects the system vitals that best practices indicate as relevant for the current incident type and related CI types. When the analyst opens the ticket for investigation, all of the necessary data is available.

Figure 1: Problem determination platform with extensions for multi-dimensional information integration (incident management system extensions for incident classification, CI association, CI vitals collection, knowledge search, and related-ticket retrieval, integrated with the CMDB, monitoring systems, the knowledge DB, historical vitals, and the ticket DB, among others)

2.2 Problem Determination Platform

Figure 1 depicts the problem determination platform with multi-dimensional knowledge integration for incident diagnosis. Built on top of an incident ticket management system [6], the platform contains a set of extensions that provide automated processing capabilities and/or additional relevant diagnostic

information from various external sources. Each extension has access to the ticket data and performs within the context of a given ticket, as sketched below. The structure is flexible enough to accommodate other extensions specific to data sources or service domains. Overall, the integrated collection of extensions makes it possible to customize and automate elements of the problem determination according to the service model and the configuration. In the context of the highly dynamic Services Cloud, a key element for continuous effectiveness is to ensure that when a new service or service change is deployed, the problem determination platform is automatically configured to address the incidents specific to the new or changed service. We advocate that this automated configuration requires considering the aspects related to incident and problem management in every phase of the service life cycle (see Table 1), namely in service design, transition, operation, and termination. Our approach departs from today's model of considering incident and problem management only in the service operation phase of the service life cycle. We start with automated incident classification in the next section.
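The extension structure just described lends itself to a simple plugin-style organization. The sketch below is purely illustrative: the class names, fields, and the handle_new_ticket driver are assumptions for exposition and are not the TSRM extension API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class Ticket:
    """Minimal incident-ticket context shared with every extension (hypothetical fields)."""
    ticket_id: str
    description: str
    attributes: dict = field(default_factory=dict)   # e.g. reporter, severity, event class
    enrichments: dict = field(default_factory=dict)  # results attached by extensions

class Extension(ABC):
    """An extension runs within the context of a single ticket and attaches its findings to it."""
    @abstractmethod
    def process(self, ticket: Ticket) -> None: ...

class IncidentClassifier(Extension):
    def process(self, ticket: Ticket) -> None:
        # placeholder result; Section 3 describes the actual classification techniques
        ticket.enrichments["incident_class"] = "End User Issue/Server/Operating System"

class CIAssociator(Extension):
    def process(self, ticket: Ticket) -> None:
        # would query the CMDB using the classification and ticket text (Section 4)
        ticket.enrichments["associated_ci"] = None

def handle_new_ticket(ticket: Ticket, extensions: list[Extension]) -> Ticket:
    for ext in extensions:  # each extension sees the ticket plus earlier enrichments
        ext.process(ticket)
    return ticket
```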

3. Automated incident classification

Incident classification is usually among the initial steps of incident management. The classification is used for routing an incident to a suitable subject matter expert (SME), assigning priority, assessing impact, etc. In a typical ticket system, classification is done manually by selecting one of many preconfigured classes. Preconfigured classes are usually hierarchical, e.g., End User Issue/ Hardware/ File System/ File System Full. A user can select a class at any level. This manual classification is time consuming and error prone. In our experience, a majority of service desk personnel select the default classes, making the classification process redundant. By automating the classification process, we can improve classification accuracy and reduce the resource wastage caused by delays in classification and wrong routing of incidents due to incorrect classification.

Service desk tickets are of two types: 1) user reported and 2) system generated. User reported tickets have mainly three types of information: 1) user information: details about the user, contact information, etc.; 2) problem-related information: which service is being accessed by the user, the user's assigned priority, etc.; 3) problem description: a natural language description giving the user's view of the problem. The description can vary in length and can be very dirty (e.g., using acronyms and service-specific expressions). System generated tickets are opened by automated tools in response to system events depending on the event severity (e.g., normal, warning, error, critical, etc.). These tickets are more structured, including attributes such as the resource which generated the event, the severity of the event, the event class, etc. Also, the description field in these tickets is either absent or automatically generated. Given the different types of information comprised in these two types of tickets, different techniques are required for their classification.

There are two popular approaches to text classification. The first approach is knowledge engineering, in which a set of rules written by subject matter experts is used to encode the classification. The second approach is machine learning, in which a classifier is built automatically using a set of pre-classified documents and is then used to classify new documents. We foresee that the machine-learning approach is more suitable for user reported incidents, whereas the rule-based approach is suitable for system-generated incidents. This is because it is very difficult for humans to create rules for the kind of (unreliable and unpredictable) text that is present in user-reported incidents, while an automatically created classifier can be iteratively updated with new sets of incidents.

3.1 Machine learning approach

For text categorization of a given incident d_i using machine learning, the aim is to find the class c_j such that the numerical score assigned to <d_i, c_j> in D x C is maximum, where D is the incident domain and C = {c_1, c_2, ..., c_n} is the set of predefined incident classes. Figure 2 gives an example of such classes. A value p_ij is assigned to <d_i, c_j> indicating the probability of classifying incident d_i under class c_j. The function for assigning the probability to an incident-class pair is tuned using training. We use the Naïve Bayesian method [15] for incident classification. In this method, a set of words or sequences of words are identified as incident features. Each incident is represented as a bag of these features. During training, a count n_ij for each pair of feature f_i and class c_j is maintained. The Naïve Bayesian probability of finding a feature f_i in an incident of class c_j is given by:

p(f_i \mid c_j) = \frac{n_{ij} + 1}{\sum_i n_{ij} + N}

where N is the total number of features. The value of p(f_i | c_j) can be calculated at run time based on n_ij. For an incoming incident, we create its textual description by concatenating selected incident fields and count the number of times each feature f_i occurs in the description. Let us denote that count by η_i. The probability of categorizing the incoming incident into class c_j is given by

p_j = \sum_{i=1}^{N} \eta_i \log p(f_i \mid c_j)

We calculate probabilities for all of the classes at the lowest level of the hierarchy and assign the document to the class having the maximum probability. As a variation, we assign the class only if the maximum probability is more than a certain given threshold. This ensures that we suggest the classification to the user only if we are reasonably sure about it. Otherwise, we assign a higher-level (less refined) class (having higher probability) to the incident. In the text classification literature, class labels are usually considered symbolic. For incident classification, however, they are meaningful concepts related to the incident. Experts can add features to the incident classes in order to bootstrap and/or complement the automatically identified features. For example, linux, Windows, OS, etc., can be used to bootstrap features for the class Operating System. In our application, a system administrator can see the features for an incident class and can remove or add features as s/he finds suitable. As a result, our technique intelligently combines automation with expert knowledge in order to build a classifier with good accuracy.
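As a concrete illustration of this scoring, the following is a minimal sketch of a Naïve Bayesian incident classifier using the add-one smoothed estimate above; the class and method names are illustrative and not the production implementation.

```python
import math
from collections import defaultdict

class NaiveBayesIncidentClassifier:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[class][feature] = n_ij
        self.features = set()

    def train(self, incidents):
        """incidents: iterable of (list_of_features, class_label) pairs."""
        for feats, cls in incidents:
            for f in feats:
                self.counts[cls][f] += 1
                self.features.add(f)

    def _log_p(self, f, cls):
        n_fc = self.counts[cls][f]
        total = sum(self.counts[cls].values())
        # add-one smoothing: p(f_i|c_j) = (n_ij + 1) / (sum_i n_ij + N)
        return math.log((n_fc + 1) / (total + len(self.features)))

    def classify(self, feats, threshold=None):
        """Return the leaf class maximizing p_j = sum_i eta_i * log p(f_i|c_j)."""
        if not self.counts:
            return None
        eta = defaultdict(int)
        for f in feats:
            if f in self.features:
                eta[f] += 1
        scores = {cls: sum(cnt * self._log_p(f, cls) for f, cnt in eta.items())
                  for cls in self.counts}
        best = max(scores, key=scores.get)
        if threshold is not None and scores[best] < threshold:
            return None  # defer to a higher-level (less refined) class
        return best
```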

Figure 2: Example class hierarchy

3.2 Knowledge engineering approach

In the knowledge-engineering approach, rules are written by experts to map attributes of system generated events to corresponding incident classes. The events have various attributes (obtained by the monitoring system by parsing logs and configuration files) which can be used for incident classification. For example, as a simple scheme, event classes can be mapped to incident classes; e.g., any critical event with class NT_Event_Log will result in a service desk incident with class End User Issues/ Server/ Operating System. We use a rule language for writing rules in which different rules can be combined using AND/OR relations to get composite rules. Simple rules can use regular expressions (regex) over event attributes. Figure 3 shows a snapshot of rules configured for TEC events. According to these rules, any event of class DB2_Memory results in an incident of the class indicated by 10405.

class eq ITM_NT_Process message contains CPU_critical  10203
class eq DB2_Memory  10405
class eq WAS_connection_refused message contains database  10407
...

Figure 3: Rule-based classification
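The rules of Figure 3 pair simple attribute predicates with target incident class identifiers. The evaluator below is a hedged sketch of how such rules could be applied to an event; the rule representation and the supported operators (eq, contains, regex) are assumptions for illustration, not the actual TEC/TSRM rule language.

```python
import re

# Each rule: a list of (attribute, operator, value) conditions ANDed together -> incident class id
RULES = [
    ([("class", "eq", "ITM_NT_Process"), ("message", "contains", "CPU_critical")], "10203"),
    ([("class", "eq", "DB2_Memory")], "10405"),
    ([("class", "eq", "WAS_connection_refused"), ("message", "contains", "database")], "10407"),
]

def matches(event: dict, conditions) -> bool:
    for attr, op, value in conditions:
        actual = event.get(attr, "")
        if op == "eq" and actual != value:
            return False
        if op == "contains" and value not in actual:
            return False
        if op == "regex" and not re.search(value, actual):
            return False
    return True

def classify_event(event: dict):
    """Return the incident class id of the first matching rule, or None."""
    for conditions, incident_class in RULES:
        if matches(event, conditions):
            return incident_class
    return None

# e.g. classify_event({"class": "DB2_Memory", "message": "heap exhausted"}) -> "10405"
```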

4. Configuration Item Association

During problem determination, a failing component responsible for the incident is identified so that it can be further monitored or probed to resolve the incident. For user reported incidents, the natural language text describing the user's perception of the problem is the basis for identifying the failing component directly or indirectly. In order to identify this component, one can perform a keyword search over the CMDB to get a ranked list of CIs with their relevancy scores, and the service desk personnel can select the responsible CI from the list, as done in [8]. However, a simple keyword-search-based scheme may not identify the faulty CI, as the user may not directly interact with the failing component and/or may not mention details of the failing component in the description. For example, when an application fails because of a database failure, the user may report an incident referring to the application without mentioning the database. With the increasing focus on dynamic Cloud infrastructure, one needs suitable methods to identify the virtual and physical resources responsible for the incident. We define the context of a CI in the form of its related objects and use it to identify the failing component as explained next.

Figure 4: Example enterprise architecture (a Billing Application deployed on an Apache Web Server and an Application Server; Oracle and DB2 databases; AIX machines hpux1 and hpux2; connected by deployed-on, depends-on, runs-on, and transactional-dependency relationships)

4.1 Contextual search on CMDB

The CMDB can be viewed as a directed graph having CIs as nodes and the relationships among them as edges. A sample CMDB instance is shown in Figure 4. For a given configuration item C we define its context as follows:
• Source set S_C is defined as the set of CIs from which C can be reached using forward relationships.
• Destination set D_C is defined as the set of CIs that can be reached from C using forward relationships.
• Context Φ_C is defined as the union of S_C and D_C, i.e., Φ_C = S_C ∪ D_C.

For example, in Figure 4, the context of the Application Server can be computed as {Billing Application, DB2 Database, hpux2}. For contextual search, we maintain a mapping between incident classes and the types of CIs that might be responsible for the incident. The mapping can be created manually by SMEs. The steps involved in associating a configuration item with a given incident are:
1. Perform incident classification to determine the incident class.
2. Determine the set T of possible CI types for the associated CI of the incident using the mapping of incident class to CI types.
3. Perform keyword search over the CIs of types T and get result set S1.
4. Perform keyword search over all CI types which are present in the context of the associated CIs. Such CI types can be obtained from the CMDB data model. Let the resultant set be S2.
5. Obtain the set S3 of CIs of types T that are related to CIs in S2 using relationship browsing.
6. Assign the CI with the highest score in the union of S3 and S1 as the CI associated with the incident.

This procedure ensures that the search results contain the CIs having the keywords as attributes and, also, the CIs that are related to them. For example, looking for CIs of type database, we obtain the keyword search results for CIs of type database, and the keyword search results for CIs that are of types VirtualServer, ApplicationServer, and Application. From these CIs we find the related CIs of type database and merge them with the initial set of database CIs to get the ranked list of results. A relevance ranking is assigned to each search result using its search score. The score of a resultant CI C depends on (1) K(C), the keyword search score of C, (2) K(C_i), the keyword search scores of its related CIs, and (3) R(C_i), the relationship weights of the relationships (direct as well as indirect) that connect contextual CIs with C:

S(C) = K(C) + \frac{\sum_{i=0}^{n} K(C_i) \cdot R(C_i)}{n}

where n is the number of related CIs in the context of C. We compute the scores for all the CIs using the above formula and associate the one with the highest score with the incident.
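To make the context computation and the scoring formula concrete, the sketch below evaluates S(C) over a toy CMDB graph; the graph, the keyword scores K, and the relationship weights R are all illustrative placeholders.

```python
from collections import defaultdict

# Toy CMDB: directed "forward" relationships (loosely following Figure 4)
EDGES = {
    "BillingApplication": ["ApacheWebServer", "ApplicationServer"],
    "ApplicationServer": ["DB2Database", "hpux2"],
    "ApacheWebServer": ["hpux1"],
}

def destination_set(ci, edges):
    """D_C: all CIs reachable from ci over forward relationships."""
    seen, stack = set(), list(edges.get(ci, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen

def source_set(ci, edges):
    """S_C: all CIs from which ci can be reached."""
    reverse = defaultdict(list)
    for src, dsts in edges.items():
        for dst in dsts:
            reverse[dst].append(src)
    return destination_set(ci, reverse)

def context(ci, edges):
    return source_set(ci, edges) | destination_set(ci, edges)

def score(ci, keyword_scores, rel_weights, edges):
    """S(C) = K(C) + sum_i K(C_i) * R(C_i) / n over the n related CIs in C's context."""
    related = context(ci, edges)
    if not related:
        return keyword_scores.get(ci, 0.0)
    contrib = sum(keyword_scores.get(r, 0.0) * rel_weights.get((ci, r), 1.0) for r in related)
    return keyword_scores.get(ci, 0.0) + contrib / len(related)
```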

4.2 Search templates

Given the rich relationships that are typically captured in the CMDB, the overhead of contextual search can be prohibitive because the search over the contextual CI types (step 4) can involve searching over a large number of object types. For scalable performance, we define a mechanism whereby the search context for a given incident class can be limited to a set of selected CI types and relationships. Namely, a search template captures the subset of the context for a given CI type that is relevant for contextual search. This template helps reduce search time and prevents searching objects which may bear no relation to the given incident class. Figure 5 gives an example search template for the CI type Database Server installed on a virtual server. One or more search templates can be specified for an incident class, which will limit the context Φ_C used for search. For system generated incidents, we use incident attributes along with search templates to identify the responsible CI for the incident. The set of attributes as well as the templates can be configured for a given class of system generated incidents.

Figure 5: Example search template (CI types Database Server, Application Server, Virtual Server, and Application connected by dependency, installed-on, and deployed-on relationships)

Figure 6: Example output template (the search template of Figure 5 extended with a Computer System reached via a runs-on relationship from the Virtual Server)
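One plausible way to represent a search template is as the set of CI types and relationship types that contextual search is allowed to range over, as in the sketch below; the representation and the keyword_search callback are assumptions for illustration, not the product's template format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchTemplate:
    root_ci_type: str
    allowed_ci_types: frozenset       # CI types searched in step 4 of the association procedure
    allowed_relationships: frozenset  # relationship types followed when browsing context

# Template corresponding to Figure 5: Database Server installed on a virtual server
DB_SERVER_ON_VIRTUAL = SearchTemplate(
    root_ci_type="DatabaseServer",
    allowed_ci_types=frozenset({"ApplicationServer", "VirtualServer", "Application"}),
    allowed_relationships=frozenset({"dependency", "installedOn", "deployedOn"}),
)

def contextual_keyword_search(keywords, template, keyword_search):
    """Restrict the contextual search (step 4) to the CI types named by the template.
    `keyword_search(keywords, ci_type)` is assumed to be supplied by the CMDB search layer."""
    results = {}
    for ci_type in template.allowed_ci_types:
        results[ci_type] = keyword_search(keywords, ci_type)
    return results
```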

4.3 Finding related CIs

In a typical help desk scenario, related CIs are used to understand the scope of a given incident. For example, if the associated CI for an incident is a database server, then the user may want to see all the applications running on that server. Existing service desk tools determine the related CIs based on manual identification among the CIs immediately neighboring the CI associated with the incident. We address the limitations of this approach with an automatic context-oriented approach that identifies related CIs in a suitably deep sub-graph of CI relationships. Namely, for an incident class, the context is defined by an associated CI type and a related output template. As with search templates, output templates consist of CI types and their relationships. The top CI type is the same as the associated CI type (also the root CI type of the corresponding search template). Figure 6 illustrates a sample output template. For details on implementation and performance, one can refer to our earlier work [9].
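Under the same assumed representation as the search-template sketch above, an output template can be walked from the associated CI to gather the related CIs shown to the help-desk user; the neighbors callback below is a hypothetical CMDB accessor.

```python
def related_cis(associated_ci, output_template, neighbors):
    """
    Collect CIs reachable from the associated CI along the relationship types
    listed in the output template (e.g., Figure 6).
    `neighbors(ci, rel_type)` is assumed to return the CIs linked to `ci` by `rel_type`.
    """
    found, frontier = set(), [associated_ci]
    while frontier:
        ci = frontier.pop()
        for rel_type in output_template.allowed_relationships:
            for nxt in neighbors(ci, rel_type):
                if nxt not in found:
                    found.add(nxt)
                    frontier.append(nxt)
    return found
```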

5. System Vitals Collection

In the process of problem determination, the analyst very often has to learn about the status of selected system or application parameters at given moments in time. For instance, in order to resolve an incident regarding slow response of a Web application, the analyst has to investigate the variation of system vitals such as CPU utilization, thread context switches, memory availability and swapping activity, and I/O volume and wait times. The analyst has to focus on the variation of system vitals at several time horizons, including short-term history for the time the incident occurred, longer-term history for intervals of time with similar workload characteristics, like a day ago or a week ago, and also the current state. The system vitals information is provided by specialized tools for monitoring and collection of monitoring information, such as IBM Tivoli Monitoring (ITM) [10]. Some tools, like ITM, collect, aggregate, and store vitals long term, providing for short- and long-term history, while other, simpler home-grown tools collect only the most recent snapshot of the system vitals, providing only the real-time state.

5.1 Vital collection

In a traditional operations mode, the analyst interacts directly with the monitoring tools in order to collect the required information. The analyst must identify the monitoring systems that control the (virtual) server(s) related to the incident and specify the necessary information using tool-specific naming conventions. The overhead and tool-specific skills required to perform these tasks increase significantly with the complexity of the IT infrastructure. For instance, in large IT environments, the space of managed servers might be monitored by several monitoring systems, possibly distinct with respect to brand or version; thus, the analyst must have the skills to interact with each type of tool. More critically, in the emerging cloud-based environments, the analyst must use server configuration tools to identify the current and historical mapping of virtual servers to physical servers and thus the related monitoring infrastructures. We aim to improve the efficiency of incident resolution through automated collection of system vitals upon ticket creation. In this novel approach, the systems and applications for which vitals are collected are determined based on the automated incident classification described in Section 3, and the collected information is associated with the ticket for later review by the analyst. Besides reduced overhead for the system analyst, an added benefit

is that a system vitals snapshot as close as possible to the state that triggered the incident is collected and available for problem determination. Figure 7 illustrates a sample ticket-management application in which the system vitals are collected automatically, associated with the ticket, and provided to the analyst on demand, in an integrated view.

Figure 7: Ticket Management System with view of ticket and related system

Towards this end, we propose a framework for collection of system vitals that integrates with incident management tools based on the following principles:
• Extensible integration of monitoring tools based on a generic API, in order to accommodate concurrently multiple types of monitoring systems, as is likely in complex IT infrastructures.
• Automatic tracking of the server mapping to monitoring systems, in order to eliminate the manual work that precedes the process of vitals collection in Cloud environments.
• Generic taxonomy of system vital metrics, in order to reduce the analyst's skills requirements and to facilitate knowledge sharing across platforms.
• Configurable mapping of incident and CI types to a best-practices set of relevant system vitals, in order to reduce the analyst's work with vitals selection and boost the performance of lower-skilled analysts.
• Multiple levels of granularity and time spans for vitals collection, including real-time and historical sampling, in order to provide for a wide range of vitals analysis during problem determination.
• User's own customization of vital collection, expressed as generic or source-specific metrics, in order to enable a detailed exploration of system vitals based on the user's advanced skills.

5.2 Vital collection architecture

The architecture of the proposed framework for system vitals collection is illustrated in Figure 8. The core of the framework is the VitalsManager component, which facilitates the integration between the ticket management application and various sources of vitals data.

Figure 8: Architecture of the System Vitals Collection Framework (sample integration with the ITM monitoring system)

Each source of vitals is represented by a MonitoringSystemProxy object, which hides the details of the interactions (e.g., SOAP URL and/or DB connection) and the taxonomy of system vitals metrics. The vitals collection APIs allow the specification of the vitals of interest in both the generic taxonomy and the vitals source-specific taxonomy. The set of vital attributes that can be collected for a given CI type depends on the related monitoring system. For instance, ITM has different sets of system vitals for Windows vs. Linux operating systems. The mapping of vitals identified in the generic taxonomy to the actual source-specific vitals is done by the MonitoringSystemProxy based on internal configuration and CI vital collection attributes. For each CI, these vital collection attributes identify the related vitals data source(s), the related host system, and the unique CI identifier within the scope of the host system (e.g., file system name, CPU id). This information can be loaded into the CMDB automatically, during the discovery process, based on vitals-source-specific patterns for mapping the real-world system components observed in its scope to CI types and configuration attributes collected by other tools.

The system integrates best-practice vitals sets to be collected for a given incident class and/or CI type. The vitals set for an incident class is defined as a collection of vitals sets for the CI types relevant for the incident class. Figure 9 illustrates a sample of best-practice vitals sets defined for the incident class 'File System Overflow' and for the CI type 'Server'. For the incident class 'File System Overflow', the vitals set is comprised of sets related to the CI types 'Server' and 'File System'. For each CI type, the set includes the specific generic vitals identifiers that are relevant for the CI type. For instance, for the 'Server' CI type, the set includes system, user, and overall CPU utilization; and for the 'File System' type, the set includes only file system utilization. Further, each CI type can be associated with a default best-practice vitals set that can be used to query the state of the CI independent of any incident. Figure 9 illustrates the default set for the CI type 'Server', comprising all of the vitals related to CPU and memory. The automated vital collection uses the best-practice vitals set associated with the current incident class to collect vitals across all of the CIs identified as being related to the current incident (see the techniques described in the previous section).

Vital collection can target the current time or a time window in the past, thus providing real-time and historical sampling, respectively. Real-time vitals provide a snapshot of the system/application state at the time of the request or at a time shortly prior to the current time, depending on the features of the vitals data source. Historical vitals can span longer time intervals and are represented at multiple time granularities; the actual time granularities and time span of the history depend on the configuration of the monitoring system.

Figure 9: Sample best-practice vitals set for incident 'File System Overflow' and CI type 'Server'

Overall, the collection of abstractions embedded in the proposed system-vitals collection framework allows for efficient yet flexible collection of vitals, as required in different stages of the problem-resolution process.
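The names VitalsManager and MonitoringSystemProxy above come from the framework architecture; the interface sketch below is a speculative rendering of the generic-to-source-specific mapping and the best-practice vitals sets, with all method signatures being assumptions rather than the actual product API.

```python
from abc import ABC, abstractmethod

class MonitoringSystemProxy(ABC):
    """Hides connection details and the source-specific vitals taxonomy of one monitoring system."""
    @abstractmethod
    def resolve(self, generic_metric: str, ci_attributes: dict) -> str:
        """Map a generic vitals identifier (e.g. 'Server.CPU.Utilization') to a source-specific one."""
    @abstractmethod
    def collect(self, source_metric: str, ci_attributes: dict, start_time=None, end_time=None) -> list:
        """Return vitals samples; real-time if no time window is given, historical otherwise."""

class VitalsManager:
    def __init__(self, proxies: dict, vitals_sets: dict):
        self.proxies = proxies          # monitoring-system id -> MonitoringSystemProxy
        self.vitals_sets = vitals_sets  # incident class -> {CI type: [generic vitals ids]}

    def collect_for_incident(self, incident_class: str, cis: list) -> dict:
        """Collect the best-practice vitals set for every CI related to the incident."""
        collected = {}
        per_type = self.vitals_sets.get(incident_class, {})
        for ci in cis:  # ci: dict with 'id', 'type', 'monitoring_system', and vital-collection attributes
            proxy = self.proxies[ci["monitoring_system"]]
            for metric in per_type.get(ci["type"], []):
                source_metric = proxy.resolve(metric, ci)
                collected[(ci["id"], metric)] = proxy.collect(source_metric, ci)
        return collected

# Sample best-practice vitals set for 'File System Overflow' (cf. Figure 9)
VITALS_SETS = {
    "FileSystemOverflow": {
        "Server": ["Server.CPU.Utilization", "Server.CPU.User%", "Server.CPU.System%"],
        "FileSystem": ["Server.FileSystem.Utilization"],
    }
}
```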

5.3 Vital collection in Cloud environment

The integration of tools for automation of the vitals collection process, based on the exploitation of a generic vocabulary of vitals and best-practice models for integrated analysis of the vitals, helps reduce the impact on analyst productivity that derives from the complexity and dynamics of the Cloud-based environment. Namely, the mapping of the CIs to related monitoring systems is performed automatically, and the analyst is not required to be skilled in the specific monitoring system related to a given instantiation of the controlled Cloud service. Furthermore, the use of best-practice models for vitals analysis in the context of incident types that are highly specific to a Cloud environment reduces the time taken to solve the incidents and fosters learning.

6. Performance Results

We implemented the multi-dimensional knowledge integration with IBM TSRM [6]. In this section we present initial performance results based on a limited data set. As explained in Section 4, CI search results are presented to the service desk personnel to identify the responsible CIs. Measurements using a set of tickets related to server management show that simple keyword search [9] results in a mean reciprocal rank (MRR) of 0.3 for the correct result, which increases to up to 0.7 for context-oriented CI search. Our incident classification approach resulted in 70% accuracy with 1000 features. Regarding the overall benefits that derive from the integration of data sources and automated vitals collection, analysis of data collected from IBM internal tools built on similar principles indicates reductions in ticket resolution times of over 25%.
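For reference, mean reciprocal rank averages the reciprocal rank of the correct CI over the evaluated tickets; the helper below merely illustrates the metric and is not the evaluation harness used for these measurements.

```python
def mean_reciprocal_rank(results):
    """
    results: iterable of (ranked_ci_list, correct_ci) pairs, one per ticket.
    A missing correct CI contributes 0 to the average.
    """
    total, n = 0.0, 0
    for ranked, correct in results:
        n += 1
        if correct in ranked:
            total += 1.0 / (ranked.index(correct) + 1)
    return total / n if n else 0.0
```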

7. Conclusions

The increasing complexity and dynamics in IT infrastructure and the emerging Cloud services present challenges to timely incident/problem diagnosis and resolution. In this paper we presented a problem determination platform with multi-dimensional knowledge integration and enablement for efficient incident/problem management; specifically, we discussed three features: automated incident/problem classification, integration with the configuration database for automated CI association of an incident or problem, and integration with monitoring systems for collection of relevant system vitals. With the insights from developing these features, we realized that effective incident/problem management is not just a service operation issue; it is also an issue in each phase of the service life cycle. In particular, incident/problem management considerations and tasks should be addressed and incorporated into the service design and deployment phases, e.g., design and deploy service request

management tools, policies, and templates for new services so that the relevant management tools and operation data are properly configured and enabled for service operation. We proposed a proactive approach that addresses and integrates incident/problem management throughout service life cycle management.

References

[1] ITIL v3 Service Operation: Information Technology Infrastructure Library. http://www.itlibrary.org/index.php?page=ITIL_v3
[2] Micro Focus white paper, "Enterprise Cloud Services: Deriving Business Value from Cloud Computing", http://cloudservices.microfocus.com/main/uploaded/doc/MFECS-WP-deriving-business-value.pdf
[3] "New Enterprise Data Center", http://www-03.ibm.com/systems/nedc/
[4] IBM Tivoli Enterprise Console (TEC), www.ibm.com/software/tivoli/products/enterprise-console
[5] BMC Patrol, www.bmc.com/
[6] "IBM Tivoli Service Request Manager", http://www-01.ibm.com/software/tivoli/products/service-request-mgr/
[7] "IBM Virtualization Engine", http://publib.boulder.ibm.com/infocenter/eserver/v1r2/topic/esmcinfo/eicac.pdf
[8] K. Christiance, J. Lenchner et al., "A Service Delivery Platform for Server Management Services", to appear in IBM Journal of Research and Development, special issue on Service Delivery, 2008.
[9] R. Gupta, K. H. Prasad and M. Mohania, "Automating ITSM Incident Management Process", 5th IEEE International Conference on Autonomic Computing, 2008.
[10] IBM Tivoli Monitoring, http://www.ibm.com/software/tivoli/products/monitor/
[11] IBM Tivoli Netcool/OMNIbus, http://www-01.ibm.com/software/tivoli/products/netcool-omnibus/
[13] Chip Gliedman, "The Forrester Wave™: Service Desk Management Tools", ftp://ftp.software.ibm.com/software/tivoli/analystreports/Service_Desk_Wave_2008.pdf
[14] IBM, "Dynamic Infrastructure® Helping Build a Smarter Planet: Delivering Superior Business and IT Services with Agility and Speed", ftp://ftp.software.ibm.com/common/ssi/sa/wh/n/oiw03021usen/OIW03021USEN.PDF
[15] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2002.
[16] B. Tierney, W. Johnston, B. Crowley, G. Hoo, C. Brooks, and D. Gunter, "The NetLogger methodology for high performance distributed systems performance analysis", The Seventh International Symposium on High Performance Distributed Computing, 1998.