Creating an ITIL Inspired Incident Management Approach: Roots, Response, and Results

James J. Cusick & Gary Ma
{james.cusick; gary.ma}@wolterskluwer.com
Wolters Kluwer, New York, NY

Abstract—Incident Management is a key element of supporting any system. For Internet-based applications this function requires the integration of staff, process, and tools to manage responses to system events including carrier network outages or performance impacts, hosted server and infrastructure failures, application errors, and misconfigurations. A detailed description of the origin of the Incident Management approach at Wolters Kluwer is provided, including a summary of the ITIL (Information Technology Infrastructure Library) guidance on which it is based, the custom process and tools developed to realize an improved approach over time, and the results achieved.

Index Terms—Incident Management, Computer Operations, Root Cause Analysis, ITIL, Software Engineering, Process Engineering.

I. INTRODUCTION

For Internet-based application service providers, system availability is crucial. Customers may not return to a web site if it is repeatedly unavailable. Impact on revenues is directly related to system availability. The required elements to provide acceptable availability include reliable infrastructure and network services, resilient application architectures, and management processes that ensure rapid incident response and resolution. This paper documents the evolution of the incident management practices of the Wolters Kluwer Corporate Legal Services production support team. The genesis, definition, deployment, and results of the practices are discussed, and future directions for the process area are presented.

Wolters Kluwer (WK) is a Netherlands-based international publisher and digital information services provider with operations around the world. Wolters Kluwer is organized into Business Units which in turn control operating companies. The experience documented here focuses on support teams managed from the New York-based Corporate Legal Services (CLS) Division, which manages six corporations. The systems supported include public-facing Web-based applications and internally used ERP (Enterprise Resource Planning) systems. Major vendors manage network services and hosting. The CLS support team is responsible for the availability of these systems from an end-customer standpoint. Each CLS business provides critical systems and application services and availability to support diverse product lines.

One of the necessary functions that must be provided is incident response. An incident in this context is any event that disrupts planned and committed system or application availability. Such incidents must be detected, tracked, diagnosed, repaired or mitigated, and finally reported on and handed off for preventative analysis. Carrying out this function requires organization, staff planning, methods, and tools. From a CLS perspective it is desirable that these functions are defined from a common pattern and that their execution is assured in a standard manner to achieve consistency of service and management visibility.

This paper first provides background on what Incident Management traditionally means, where the CLS team started in its incident management approach, what requirements were defined to evolve the incident management approach, and the tools used to support improvements and refinements. Results from this process and tool deployment and evolution are provided, as well as suggestions for further work.

II. BACKGROUND

The support functions of the CLS team were originally separately managed and run. Management decided to bring each function together under common leadership and provided clear goals for the team: meet availability targets, improve communications around support events, and strengthen controls around production environment stability. Taking the existing team and process as a starting point, small improvements were attempted. However, the overall situation was characterized by the lack of a defined process, ad hoc responses to system outages, difficulty gathering status on issues, and blurred lines of responsibility. The result was a somewhat hectic response pattern to emergency events, variability in the quality of production system releases, and a dissatisfied stakeholder group which witnessed numerous outages with uneven or poorly communicated responses.

III. ORIGINAL REQUIREMENTS

Industry approaches to such demands have been robust for some time. Among others, Kalmanek [1] discusses the core requirements of incident management and focuses on automating the entire life cycle of event detection: detect, localize, diagnose, fix, and verify. The concept of formal Incident Management is well understood in industry standards, including ITIL (Information Technology Infrastructure Library, http://www.itil-officialsite.com/home/home.asp). According to ITIL [2]:

An 'Incident' is any event which is not part of the standard operation of the service and which causes, or may cause, an interruption or a reduction of the quality of the service.

The objective of Incident Management is to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price. Activities of the Incident Management process include:

• Incident detection and recording
• Classification and initial support
• Investigation and diagnosis
• Resolution and recovery
• Incident closure
• Incident ownership, monitoring, and communication

The CLS team decided to implement the essence of the ITIL Incident Management model, which laid a path for the improvement of service outage management. An "Incident Response Team," referred to as the IRT, was created. The IRT covers nearly all aspects of Incident Management in a lightweight fashion.

A. Genesis

The existing Software Development Lifecycle process within CLS did not deal with operational support [3]. The CMMI-based process (Capability Maturity Model Integration, from the Software Engineering Institute, http://www.sei.cmu.edu/cmmi/) focused on requirements, implementation, and test for software applications. The systems and operations aspects of our work had been largely outsourced to vendors, leaving a thin layer of system and database specialists and a group of application experts. In early 2008, two major applications were released to production shortly after the consolidation of the support teams was accomplished. These releases proved problematic, and the strains in the support practices became clearly evident. As a result of a series of outages there was a clear realization of what requirements needed to be met in an improved support model.

After a series of significant outage events, the following core requirements were developed and an execution plan was created to realize the team structure, process, and tools needed to fulfill the essential ITIL elements. These requirements included the following:

• Create a single team with appropriate expertise, coverage, communications, and authority to respond to any incident.
• Provide a single point of contact for all external groups (representing a unified Tier 3; see Figure 1 for a model of all tiers).
• Provide consistent coverage from 8 AM to 11 PM, including weekend on-call (later extended to 24/7).
• Establish a team consisting of an application specialist, a database specialist, and a systems specialist, with one individual acting as lead.
• Define escalation procedures; the team can pull in any resource as required.
• Provide immediate updates to management upon incident identification and continue updating until the issue is closed.
• Utilize a standard reach number and email group, with pre-planned rotating shift staffing.
• Utilize a standard incident tracking and reporting mechanism.

These requirements were implemented and have largely remained constant. The implementation has expanded and been continuously improved, but the core requirements have held steady.

Fig. 1. CLS Support Reference Model.


IV. DESIGN AND LAUNCH

The Incident Response Team consists of three main components:

1. A team structure
2. A process
3. Supporting tools

The team structure provides a well-defined way to handle incidents during support hours and also defines rotational staffing, coverage patterns, skills mix, and call-out patterns. The process is a simple set of actions to be taken for each incident that alerts appropriate stakeholders of status, gets the required expertise onto the problem, and guarantees that the incident is worked to resolution. The supporting tool may consist of a simple incident logging mechanism, contact sheets, procedures and supporting documents, email alerting, and incident data history. The core requirements stated above drove the initial CLS IRT implementation, whose actual requirements evolved to consist of the following essential points:

1. Provide a single point of contact for all external groups (unified Tier 3) for any production-related incident. This includes any application, server, network, or database issue on all products.

2. Provide consistent coverage (e.g., 8 AM to 11 PM or defined support hours); during weekends the team should be on call.

3. The team should consist of an Application Support developer, a Systems Engineering team representative, and a DBA representative. One individual should act as the lead. The team and the lead should rotate every week. There should be a staffing plan to support the IRT.

4. Defined escalation procedures are provided. The IRT can pull in any resource as required to solve the problem and is fully empowered to make the decisions required to resolve incidents.

5. Provide immediate updates to management upon incident identification and continue to update until the incident is closed.

6. Establish a standard reach number allowing anyone to activate the team. Provide an email group for communication. Provide a standard conference bridge for troubleshooting.

7. The IRT should be supported by a standard incident reporting mechanism. All incidents must be logged as soon as they are reported and updated with incident details as appropriate. Minimal information should be captured, e.g., affected application, time of incident, description, and cross-reference data such as a vendor trouble ticket (a minimal sketch of such a record follows this list).
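The following is a minimal sketch, in Python, of the kind of incident record implied by requirement 7. It is an illustration only, not the actual IRT schema; the field names, the IncidentRecord class, and the sample values are assumptions based on the minimal data listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

# Illustrative sketch only: field names are assumptions based on the minimal
# data called for in requirement 7, not the actual IRT/SharePoint schema.
@dataclass
class IncidentRecord:
    affected_application: str          # which product or system is impacted
    description: str                   # initial description of the incident
    reported_at: datetime = field(default_factory=datetime.now)   # time of incident
    vendor_ticket: Optional[str] = None                           # cross-reference, e.g., vendor trouble ticket
    updates: List[str] = field(default_factory=list)              # running resolution history

    def add_update(self, note: str) -> None:
        """Append a timestamped note so a history of resolution steps is preserved."""
        self.updates.append(f"{datetime.now().isoformat()} - {note}")

# Example: log an incident as soon as it is reported, then update it as work proceeds.
if __name__ == "__main__":
    incident = IncidentRecord(
        affected_application="Public web application",
        description="Users report intermittent timeouts on login",
        vendor_ticket="VEND-12345",  # hypothetical cross-reference number
    )
    incident.add_update("Paged on-call DBA; database connections look saturated")
    incident.add_update("Connection pool restarted; monitoring for recurrence")
```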

V. A CUSTOM TOOL

To support the implementation of our Incident Management function, a custom tool was developed using an off-the-shelf toolkit. The supporting tool had the following requirements: a phone line for all parties to report incidents, a consistent incident logging mechanism, easy-to-find contact information, clear procedures to follow when incidents occur, supporting documents, email alerting, and a means to keep track of incident data history. SharePoint was used as the basis of the tool. This choice allowed for collaboration through shared workspaces, information stores, and documents, as well as built-in capabilities to send out email alerts with easy subscription options. The reason a packaged tool was not used came down to the lack of funding to purchase one. We utilized our corporate SharePoint server and developed the IRT SharePoint portal, which covered all of the above requirements. The only cost was to set up the phone line to take hotline calls. After the portal was set up, it required minimal support, and no integrations with other systems were required. Every week the rotation schedule is updated and displayed on the site. The portal initially supported only the Web Software Engineering department; later it was expanded to include the Enterprise (ERP) department as well. Along the way, the portal was refined as specific needs came up. The portal has been available 24/7 with few interruptions. Once an incident occurs, the on-call IRT Lead logs all information pertaining to the problem and follows the nine steps described on the portal site. The portal is highly independent; its only limitation was its dependence on our corporate SharePoint server and database. If corporate servers went down, so did the IRT site. To work around this dependence, we created a separate Web site which mirrors most of the information found on the SharePoint site. This alternate site is also available 24/7. Future expansion is likely to include a searchable database to match incidents against known incidents, further reducing repair times and improving system availability.
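The email alerting the portal relies on is SharePoint's built-in list alerting rather than custom code. As an illustration only, the subscribe-and-notify behavior it provides can be pictured roughly as in the sketch below; the class and function names are hypothetical and do not represent the SharePoint API.

```python
from typing import Callable, Dict, List

# Hypothetical sketch of the subscribe-and-notify pattern the portal provides
# out of the box: interested parties subscribe once, and every saved update to
# an incident triggers an email-style notification to all subscribers.
Notification = Callable[[str], None]

class IncidentList:
    def __init__(self) -> None:
        self.subscribers: List[Notification] = []
        self.incidents: Dict[str, List[str]] = {}  # incident id -> list of updates

    def subscribe(self, notify: Notification) -> None:
        """Register a party to be alerted on any new or changed incident."""
        self.subscribers.append(notify)

    def save(self, incident_id: str, update: str) -> None:
        """Save an update to an incident and alert every subscriber."""
        self.incidents.setdefault(incident_id, []).append(update)
        for notify in self.subscribers:
            notify(f"Incident {incident_id}: {update}")

# Usage: the support email group and management subscribe once; every save
# then reaches them without separate status requests.
if __name__ == "__main__":
    portal = IncidentList()
    portal.subscribe(lambda msg: print("EMAIL to IRT group:", msg))
    portal.subscribe(lambda msg: print("EMAIL to management:", msg))
    portal.save("IRT-001", "Login timeouts reported; investigation started")
    portal.save("IRT-001", "Resolved: connection pool restarted")
```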


VI. OPERATIONAL EXPERIENCE

The IRT function has been running since March 1, 2008. Thus, to date we have over 18 months of experience operating the process. Several observations can be made about the effectiveness of the approach. The core requirements above were implemented in the CLS Web group during early 2008. The team structure, process, and supporting tool were defined and deployed. The IRT ran internally for two weeks to debug the process and tool. Once the team was running smoothly, the hotline number and overall approach were communicated to the general help desk that handles Tier 1 support, to the Tier 2 support team, and to the business stakeholders and engineering. Since its inception the IRT has handled nearly 300 incidents in approximately 18 months of operations through Summer 2009. The IRT has allowed for highly consistent issue handling, standard communications around incidents, a clean interface to other support organizations, fast response times, and short repair intervals. The approach has also made support activities predictable for support engineers, provided leadership and visibility opportunities for each team member, and improved overall management control and awareness of incidents. There was some initial resistance to the introduction of the IRT by some support engineers, as it was a change to the status quo, but in general people now see the value of the approach and support it strongly. Also, in order to mesh with allied support groups such as the help desk and the vendor operations team, multiple discussions were required to outline the need for the IRT function and to explore the interlocking behavior of the new team and process with their operations.

The only significant challenge in deploying the IRT has been helping individuals understand the introduction of a new tracking vehicle for issues. People were very accustomed to our availability tracking and root cause analysis (or Problem Management) recording and tracking, as well as our software change records called work items (both defects and enhancements). The concept of a real-time, production-oriented event log was new and required extensive explanation and re-explanation. We held briefings, trainings, and one-on-one discussions, and wrote documentation to explain the concept and the practice. Eventually people understood what we were asking for and embraced the approach. There is still some variation in adherence from individual to individual, but not enough to invalidate the approach.

A. Detecting and Alerting

Incidents are detected both by automated and by manual means. System-level alerts create tickets automatically and also page on-call technicians automatically. Technicians working with the systems or monitoring them manually also report incidents. Finally, a large user base also calls in problems when they occur. Regardless of the source of the alert, each issue is logged and responded to in a standard manner. We have noticed that some issues are observed but not logged. This is typically the case when a non-critical or offline function encounters an issue and it is deemed not relevant for an IRT report. Over time we have tried to educate staff members to report any incident. We have tried to make it clear that an incident report is free and has no negative side effects, but new approaches can be slow in adoption.

B. Incident Logging

Once an event is detected, the immediate next step is to log it. The logging of an event requires the completion of 16 data fields, of which about one third are pre-populated (e.g., the date). The ticket is editable, so as resolution steps proceed, updates can be added by anyone on shift. A running history of the incident resolution steps is thus created and preserved. This represents a significant improvement over our prior approach, which was to try to assemble a history of the event from disparate email threads. The main difficulty with this approach has been the paucity of details entered by the engineers. They tend to enter the minimum amount of information, or in some cases nothing more than the initial description and the time of resolution, without adequate intermediate steps or a robust recap of the event.

C. Automated Notifications

A major benefit of our SharePoint-based tool is the built-in subscription and notification capability it provides. Anyone with permissions to the site can request notification via email whenever new events are created, changed, or both. This capability has been a boon to communications around production incidents and has reduced the frequency of management requests for alerts to production incidents and for ongoing status during an incident. This has freed the support team to focus on incident resolution; as long as they keep updating the ticket, they know that all interested parties will automatically be updated whenever they hit save. This feature alone has made our entire IRT process worth the effort of creating, deploying, and maintaining it.

D. Response Script

Currently we have one standard response script which contains nine basic steps required for any incident response. These basic steps include logging the incident, informing the support team and the on-call manager directly, kicking off the teleconference bridge line for coordinated debugging, starting the problem isolation steps, debugging, and the service restoration approach. Finally, the steps include the wrap-up procedures, including completing the incident description trace and communicating to affected user groups. In addition, there is a special data collection template for any performance-related incident. This guides the support engineer through the capture of specific parameters across multiple systems, including CPU, memory, storage space, and network conditions. Producing a record of real-time performance traces as seen at the time of the incident has proven useful for trend analysis (a rough sketch of such a capture follows at the end of this section). The simplicity of the response script has proven to be useful. It is simple enough that it can be explained quickly to any group and it can be followed by essentially all personnel without any difficulty. Naturally, the simplicity of the script offers nearly no assistance with the technical nature of the myriad problems that are encountered. The staff must rely on their own training and experience to isolate, debug, and resolve issues. They often call or email each other for advice, and of course the conference call helps too. Nevertheless, an ITIL best practice might be used here to develop solution scenarios for common problems. A library of problem types and resolution paths might be indexed and made available for quick reference. This could be a future improvement.

E. Diagnosis and Repair

During the 18 months of operations nearly 300 incidents have been encountered. The variety of incidents has been impressive. System, platform, network, procedural, software, integration, and operational issues have all been observed. Most of these incidents have been limited in impact but some have been major. Good examples include failures of disk arrays, DNS servers, and firewalls. Many of the minor incidents have been related to isolated software module faults or errors in migration and configuration. The IRT process has been very effective in dealing with these issues due to the cross-representation of the required domain specialties, from system to application to database, and the immediate callout support from relevant vendors. This coverage has accelerated resolution time and introduced notable efficiency in marshalling the needed specialties for each problem type. The rotating lead structure has also produced cross-training opportunities for the staff. Now each staff member experiences the challenge of triaging and troubleshooting problems from outside their normally assigned domain. While they may not be able to master that domain through this short-duration lead assignment, they do gain fluency in the architectures, operations, and lingo of the various platforms and applications.

F. Root Cause Analysis

The IRT process focuses on incident management, and our ATS (Availability Tracking System) focuses on problem management in the ITIL parlance. IRT events impacting availability or performance are recorded in the ATS system for root cause analysis and preventative follow-up. Additionally, regular analysis is done on the full set of IRT events to categorize them, monitor trends, and initiate corrective work where required. Some IRT events are one-time occurrences or might be due to a defect that was introduced. These are quickly repaired as their existence might impact the operational status of a system. Naturally, these types of issues find immediate solutions. Other types of issues are temporary or transitory anomalies and are repaired almost instantaneously or with minor intrusion. Some of these issues require investigation which might take longer to settle. In either case, each incident is pursued to find root cause and to introduce the steps required to prevent repeat incidents or the emergence of similar incidents. The response script requires the support engineer to document the "apparent" root cause. This means that whatever appears to be the root cause should be recorded, even though the "true" root cause will be determined via problem management.
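As a rough illustration of the kind of point-in-time snapshot the performance data collection template (Section VI.D) asks engineers to record, the sketch below gathers CPU, memory, storage, and network counters. It is an assumption-based example, not the actual template, and it relies on the third-party psutil package.

```python
from datetime import datetime

import psutil  # third-party package, assumed available: pip install psutil

def capture_performance_snapshot() -> dict:
    """Collect a point-in-time record of CPU, memory, storage, and network
    counters, roughly mirroring the parameters the template asks engineers
    to record during a performance-related incident."""
    net = psutil.net_io_counters()
    return {
        "timestamp": datetime.now().isoformat(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }

if __name__ == "__main__":
    # A snapshot taken at incident time can later be compared against trend data.
    print(capture_performance_snapshot())
```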

VII. INCIDENT MEASUREMENTS

Once the IRT was up and running, we were automatically provided with a set of measures on incidents. In all honesty, these measures were not planned for up front, as they were not specified in our requirements list outlined above. Instead, once the incident tickets began flowing in, we realized there were some KPIs (Key Performance Indicators) which could be quickly and easily derived from the raw data. For example, we began looking at incident rates over time (e.g., incidents per day, incidents per week). We also got a much better view into MTTR (Mean Time To Repair) than we previously had. Furthermore, with a simple categorization scheme we can begin looking at percentage occurrence rates of incidents by type. A typical breakdown has been: 40% infrastructure, 30% software, and 30% configuration, deployment, and process. We can also see at a glance which system domains the incidents are stemming from and the particular applications involved. Using our existing root cause analysis and proactive response process, we have been aggressively addressing the prevention of any recurring or one-time problems. The IRT data, however, has been extremely useful in understanding the failure patterns, durations, and impacts.
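As a simple illustration of how such KPIs can be derived from raw ticket data, the sketch below computes MTTR and a per-category percentage breakdown. The actual analysis was performed over the SharePoint records; the data structures and sample values here are made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

# Sample incident tuples: (opened, resolved, category). Values are illustrative only.
incidents = [
    (datetime(2009, 6, 1, 9, 0), datetime(2009, 6, 1, 10, 30), "infrastructure"),
    (datetime(2009, 6, 3, 14, 0), datetime(2009, 6, 3, 14, 45), "software"),
    (datetime(2009, 6, 7, 22, 0), datetime(2009, 6, 8, 1, 0), "configuration"),
]

def mean_time_to_repair(records) -> timedelta:
    """Mean Time To Repair: average of (resolved - opened) across incidents."""
    total = sum(((resolved - opened) for opened, resolved, _ in records), timedelta())
    return total / len(records)

def category_breakdown(records) -> dict:
    """Percentage of incidents per category, as used for the 40/30/30-style breakdown."""
    counts = Counter(cat for _, _, cat in records)
    return {cat: 100.0 * n / len(records) for cat, n in counts.items()}

if __name__ == "__main__":
    print("MTTR:", mean_time_to_repair(incidents))
    print("Breakdown (%):", category_breakdown(incidents))
```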

VIII. OFFSHORE EXPANSION

The initial IRT deployment covered 8 AM to 11 PM US Eastern time. A 24x7 support capability was requested and established for Web applications in CLS. This was done by establishing an overnight callout number routed to a pool of offshore support engineers. The offshore vendors staff the support function from existing personnel during the overnight US Eastern time period, seven days a week. Since the CLS Enterprise systems are not online overnight, support for those applications is provided only during standard business hours. The establishment of the offshore capability leveraged years of offshore development experience by the team [4]. Utilizing this experience in managing offshore capabilities allowed the expansion to around-the-clock coverage to be achieved quickly and at no additional cost. The experience so far has been positive, and there have been very few instances where incidents required independent offshore reaction. More often we schedule specific proactive duties for the offshore team to address on the overnight shift, which gets us one step ahead the next day. Importantly, we know we are covered should an event require handling.


IX. BROAD DEPLOYMENT

Once the IRT process was running effectively and had proven its worth, there was a request to broaden its deployment. Initially a survey of the several operating business groups within CLS was conducted to determine the nature of the support models in place, best practices, and areas for improvement. These findings were compared with the IRT model and operating principles, and specific plans were developed to deploy the IRT approach to these other groups. Immediately, an expansion of the IRT to the Enterprise group, which is closely aligned with the Web group, was conducted. Through management briefings and staff education, the IRT process successfully brought this second large group into scope. Currently an additional deployment is underway at a primary group in the division based in Houston and Los Angeles. The approach with this group is to target specific areas to improve on instead of deploying all aspects simultaneously. This deployment is currently ongoing. Specifically, the following approach has been taken to accomplish these deployments across other business units:

1. Review of the IRT elements and requirements documented above by the technical management of each new unit.
2. Modification and improvement of the IRT approach and requirements as needed to fit the specific needs of each unit as they are discovered.
3. Creation of a tactical adoption plan per unit in a defined priority order; pursue remaining units as opportunity arises.
4. Report on deployment status and adoption success per unit and assess any proactive or reactive changes as required.
5. Provide a cumulative impact assessment across units for the year.

X. LESSONS LEARNED

There are several primary learnings from our work in adopting ITIL's incident management process:

• Develop a vision for how the process should work. Know the core problem to be solved and articulate the end state.
• Start with core requirements suited to the organization. Do not try to solve world hunger on Day 1.
• Focus on the key problem areas; for example, incident logging may be lacking but resolution ability may be strong.
• Communication is vital; people have to understand what is being asked of them. Multiple conversations may be required to successfully get the message across.
• Expect some resistance, as deployment will change people's duties and even work schedules, if not overall expectations.
• Institutionalize at all management levels and operate the process the same way a production system would be handled. This process must operate around the clock with full support.
• Most people will see the benefits once the process is operational and will turn into cheerleaders for the approach, but they may protest while they are getting to that point. Keep encouraging them and reward them for the success.

XI. FUTURE DIRECTIONS

The primary goal now is the ongoing operation of the IRT, continual improvements as necessary, and the expansion and deployment of the IRT into other CLS business groups. Initial discussions with the higher-priority units are first confirming existing incident management practices and will then map those onto the defined requirements for the IRT approach listed above. Wherever there are gaps, the reference implementation from CLS Web should be applied or customized, or suitable new techniques should be created. The goal is that the requirements tracking table should be complete; that is, all required incident management practices should be defined and operationalized. As part of the deployment activities, status updates will be provided, and eventually a standard cross-unit reporting and analysis model can be evolved. As long as the supporting tools capture the same information, such cross-unit analysis can be conducted seamlessly. Eventually, optimizations of the process, team structures, and tools can be pursued as experience with the techniques grows. In addition to expanding usage of the approach, there are a couple of ITIL-recommended practices which we are not currently following that should be explored for adoption. We should consider better categorization and type recording of incidents, as well as severity. Severity is purposely not logged, to keep the resolving engineers focused on each incident regardless of severity; thus management sets severity subjectively. This could be improved or modified. Also, we do not currently match incidents against a repository of known issues as recommended in ITIL. We could build a list of common problems with debugging approaches and store them, or share them with the Level 1 help desk (a simple sketch of such matching follows).
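One possible direction, sketched here only as an assumption of how known-incident matching might look (ITIL's known-error matching could of course be realized in many ways), is a simple keyword lookup against a repository of common problems and their debugging approaches. The repository entries and function below are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical known-issue repository: (keywords, suggested debugging approach).
KNOWN_ISSUES: List[Tuple[List[str], str]] = [
    (["dns"], "Check DNS server health and recent zone changes."),
    (["disk", "array"], "Verify disk array status and fail over to standby storage."),
    (["login", "timeout"], "Inspect database connection pool and application server threads."),
]

def match_known_issue(description: str) -> Optional[str]:
    """Return the suggested approach for the first known issue whose keywords
    all appear in the incident description; None if nothing matches."""
    text = description.lower()
    for keywords, approach in KNOWN_ISSUES:
        if all(word in text for word in keywords):
            return approach
    return None

if __name__ == "__main__":
    # A Tier 1 help desk or the IRT lead could consult this before deep debugging.
    print(match_known_issue("Users see login timeout errors on the public site"))
```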

XII. CONCLUSIONS

Our journey to implement an ITIL-based Incident Management process began under the pressure of repeated production failures, with a disconnected set of teams working the issues. After realizing the core requirements of a better-functioning apparatus, we applied the structure of ITIL to our approach and over a series of months deployed and tuned the model to strong success. Resolution times dropped, communication around incidents was far improved, and platform stability began improving as well.


The benefits of the ITIL framework in supporting our efforts were an external reference of best practice, a confirmation of our approach, and an overall framework showing where our IRT actions fit within the full scope of activities we needed to manage and carry out. For others facing similar challenges we recommend the ITIL framework, especially some of the summary documents describing the essentials of the framework. We are currently reviewing all of our support practices and methods against the full spectrum of ITIL guidance. We expect that the ITIL framework will be of significant assistance in our future attempts to improve our services.

It cannot be stressed enough that our IRT process and team structure have completely changed, for the better, the way we manage our support function. It has also provided a higher quality of life for many of us, as crisis events are now routinely managed via a rational and defined approach.

XIII. ACKNOWLEDGEMENTS

The creation, deployment, and operation of the IRT were made possible by the support of management and the dedication and precision of our technical staff.

XIV. REFERENCES

[1] C. Kalmanek, "Unlocking Systems and Data: The Key to Network Management Innovation," 2006 IEEE/IFIP Network Operations and Management Symposium, Vancouver, Canada, April 2006.
[2] J. van Bon, Foundations of IT Service Management: Based on ITIL, 2nd ed., Van Haren Publishing, September 2005.
[3] R. Cyran and J. Cusick, "Achieving CMMI Level 2: Challenges, Missteps, and Successes," Proceedings of the 10th IASTED International Conference on Software Engineering Applications, Dallas, TX, November 13-15, 2006, ACTA Press, Anaheim.
[4] J. Cusick and A. Prasad, "A Practical Management & Engineering Approach to Offshore Collaboration," IEEE Software, vol. 23, no. 5, pp. 20-29, Sept./Oct. 2006.

XV. AUTHOR CONTACT

James Cusick, Director IT, Wolters Kluwer, New York, NY, [email protected]. Gary Ma, Manager IT, Wolters Kluwer, New York, NY, [email protected].
