Downtime in the Delta: What an Airline's Major IT Disaster Can Teach Us About IT Service Continuity Management

Posted by Mark Hillyard on Aug 9, 2016 9:54:00 AM

In ITSCM, Service Continuity Management, Service Design

Unless you are so wrapped up in the 2016 election cycle that you are not reading, hearing, or seeing any other news, it is likely you are aware of Delta Airlines' IT outage early Monday morning. In fact, considering the global nature of the crisis, it is likely you know someone directly affected. What was initially blamed on a power outage quickly morphed into one of the biggest system failures this year for the airline industry. At approximately 2:30 a.m. EDT, Delta experienced a complete failure of its global computer infrastructure. By 8:30, they had ended the grounding of all flights and were beginning recovery efforts. But in the airline industry, this kind of outage lingers for days, if not weeks. With planes out of place, passengers and their baggage waiting to move on with their trips, and flight crews on tight, highly regulated schedules, the disaster will likely cost Delta tens of millions of dollars. That is a difficult figure to fathom for roughly six hours (and likely far less actual downtime) of a system outage. The bigger question that will be asked in the days and weeks to come, however, is, "could it all have been avoided?"

The simple answer is, well, not so simple. It is entirely possible that Delta was doing everything correctly from a service continuity standpoint: regular risk assessments, disaster recovery plans, and mitigation and redundancy elements all in place. That is actually the best-case scenario under these conditions, and it is even reasonable to posit that they did everything right and that this was simply the level of risk they were willing to accept. The best we can do as bystanders is speculate. While I believe they are recovering as quickly as can reasonably be expected, I also believe there were steps that were either missed or ignored in the overall IT Service Continuity Management (ITSCM) process.

Most notably, there is a huge question mark over how a global, billion-dollar company does not (or did not) have practically bulletproof redundancy in place for all of its IT infrastructure. Even mid-sized corporations with data-driven infrastructures (which is essentially every company on the planet at this point) know the value of highly redundant, geographically diverse systems that are resilient to weather conditions, natural disasters, and all manner of physical threats.

It is likely that aging technology played a role in the fiasco, but there are still glaring questions: how was redundancy being tested, what mitigation and recovery plans were in place if the IT infrastructure was already a known risk, and, most importantly, how could so many systems be so tightly coupled that a single power failure could cripple a company of this size?
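
To make the redundancy point concrete, here is a minimal sketch of the kind of active/standby health check a geographically redundant service relies on. The endpoints, timeout, and failure handling are hypothetical placeholders for illustration only, not a description of Delta's actual systems.

```python
"""Illustrative only: a minimal active/standby health check with failover.
All endpoints and thresholds are hypothetical placeholders."""

import urllib.request
import urllib.error

# Hypothetical, geographically separate endpoints serving the same system.
PRIMARY = "https://us-east.example.com/health"
STANDBY = "https://us-west.example.com/health"
TIMEOUT_SECONDS = 5


def is_healthy(url: str) -> bool:
    """Return True if the endpoint answers its health check in time."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def choose_endpoint() -> str:
    """Prefer the primary site; fail over to the standby if it is down."""
    if is_healthy(PRIMARY):
        return PRIMARY
    if is_healthy(STANDBY):
        return STANDBY
    # Both sites failing at once is exactly the single point of failure
    # that ITSCM planning is supposed to design out.
    raise RuntimeError("No healthy site available -- invoke the DR plan")
```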

What is IT Service Continuity Management? 

Where does it fit into the overall ITSM lifecycle? And why are so many companies not particularly good at it?

For those ITIL nerds out there, ITSCM is located in Service Design. Its primary focus is exactly what one might guess: ensuring IT systems are resilient, redundant, and able to recover from outages, crises, and disasters within agreed-upon thresholds. There are quite a few inputs to this process, including the Configuration Management Database (CMDB), Business Impact Analysis (BIA) reports, business risk analysis, and Service Level Agreements (SLAs), as well as Operational Level Agreements (OLAs) and Underpinning Contracts (UCs). The process itself, however, is pretty straightforward. ITSCM picks up where Business Continuity Management (BCM) leaves off: while preparing for disasters like fires, floods, and earthquakes is generally the realm of BCM, ITSCM must also account for these risks and prepare accordingly.

As with any ITSM process, communication and agreement with the business are key. While IT could reasonably say that an outage of the magnitude Delta just faced could take 3-4 hours to mitigate, the business may find such a timeframe untenable (and the resultant loss of business, revenue, and reputation for Delta will likely bear this out in the coming months). The IT organization may respond to tighter service levels by requesting additional funds and/or resources to bridge the gap between its service level target (SLT) and the business' service level requirement (SLR). But there is likely to be some level of compromise from both sides when it comes to cost versus risk, and all of the inputs to the process must be taken into account when determining reasonable ITSCM activities.
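
As a purely illustrative example of that SLT-versus-SLR negotiation, the sketch below compares what IT can currently deliver against what the business requires and prices the gap using a BIA-style downtime cost. The service name, hours, and dollar figures are invented for the example.

```python
"""Illustrative sketch of an SLT-vs-SLR gap check; all figures are invented."""

from dataclasses import dataclass


@dataclass
class ContinuityRequirement:
    service: str
    slr_recovery_hours: float      # what the business requires (SLR)
    slt_recovery_hours: float      # what IT can currently deliver (SLT)
    downtime_cost_per_hour: float  # from the Business Impact Analysis (BIA)

    def gap_hours(self) -> float:
        """Hours by which current capability misses the requirement."""
        return max(0.0, self.slt_recovery_hours - self.slr_recovery_hours)

    def exposure(self) -> float:
        """Rough cost of the gap for a single incident, per the BIA figures."""
        return self.gap_hours() * self.downtime_cost_per_hour


# Hypothetical numbers: IT can restore in 4 hours, the business needs 1 hour.
reservations = ContinuityRequirement(
    service="reservations",
    slr_recovery_hours=1.0,
    slt_recovery_hours=4.0,
    downtime_cost_per_hour=2_000_000.0,
)

print(f"Gap: {reservations.gap_hours():.1f} h, "
      f"exposure per incident: ${reservations.exposure():,.0f}")
```

A gap priced this way gives both sides something concrete to negotiate over: either the business funds the improvement or it formally accepts the exposure.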

Unfortunately, this level of maturity is lacking in many organizations. More often than not, one side dictates service levels (and it is often IT), demanding resources just to meet its own estimated response and resolution times. The result on the business side is frequently a sense of cynicism and a suspicion that IT is simply begging for bleeding-edge technology that really isn't required. BIAs are not performed, risk analysis is left to the business without involving IT, and any useful data is discarded or ignored.

Perhaps more than any other area of ITSM, service continuity management requires a great deal of communication, constant vigilance by both the business and the IT organization, and ironclad agreements as to what sort of recovery is possible in the face of a crisis and/or disaster. I will not be surprised in the least if Delta's CIO (if not the CTO and a number of other IT leaders) loses his job over this failure. But it didn't have to be this way. If IT is truly dealing with infrastructure so outdated that this disaster was inevitable, then the IT organization has not communicated the need to update its systems (the value proposition versus the cost) effectively enough. Alternatively, if the business was told in no uncertain terms what was coming if resources were not allocated to modernize and safeguard the infrastructure, then it was unable or unwilling to properly assess the true risk it was facing. In either case, it is the lack of communication and trust between the business and IT that ultimately led to the state in which Delta now finds itself.

The bottom line is that, as an IT organization, we are bound to ensure the business can continue in the face of complete failure of our infrastructure, and that recovery can be attained in a timeframe that is acceptable based on the risk and cost of providing such service levels.  And the only way to truly accomplish this is through establishing a valuable, trusting relationship between our organization and the business.
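
To show how that risk-versus-cost conversation might be framed in practice, here is a back-of-the-envelope comparison of annualized outage exposure against the yearly cost of running redundant infrastructure. Every figure is invented for illustration; these are not Delta's numbers.

```python
"""A back-of-the-envelope cost-vs-risk comparison with invented figures,
illustrating the trade-off discussed above."""

# Hypothetical inputs the business and IT would agree on together.
outage_probability_per_year = 0.05      # chance of a major outage in a year
expected_outage_cost = 100_000_000.0    # BIA estimate of one major outage
redundancy_cost_per_year = 3_000_000.0  # running a second, independent site

annualized_risk = outage_probability_per_year * expected_outage_cost

# If the expected yearly loss exceeds the yearly cost of redundancy,
# the investment pays for itself on average.
if annualized_risk > redundancy_cost_per_year:
    print(f"Mitigate: expected loss ${annualized_risk:,.0f}/yr "
          f"> cost ${redundancy_cost_per_year:,.0f}/yr")
else:
    print("Accept the risk (or negotiate a cheaper mitigation)")
```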

Ready to take this a step further?
Learn how to design a killer service design package!
