Just after five in the morning on Monday, Delta sent out the alert every traveler dreads: “Delta has experienced a computer outage that has affected flights scheduled for this morning.”

Two hours later, Delta added discouraging details: The outage in Atlanta had crippled its mission control center—the NASA-inspired room that keeps Delta’s global fleet running. Soon, static check-in lanes clogged airports and gate agents started writing boarding passes by hand. Passengers slept on airport floors or sat in parked planes, even as departure boards and smartphone apps wrongly told them everything was running great. The airline canceled more than 650 flights and delayed many more in the US, Japan, Italy, and the UK.

No one seems to know what went wrong, exactly—Delta’s investigating—but this is hardly the first time a computer glitch has shackled an airline’s global operations to the tarmac. So how does this keep happening?

“The complexity of the system is a crucial factor in its failure,” says Ahmed Abdelghany, who studies aviation IT systems at Embry-Riddle Aeronautical University and worked in information services for United Airlines. Airlines typically use computer systems that are built in layers, pulling in subcomponents and data feeds from diverse sources. And while techs work to keep them up to date, some parts of those systems are 30 years old. They’re expensive to change, and they’re also highly proprietary—not the kind of thing you can update with an app store download.

This failure was especially damaging because Delta’s Operations and Customer Center at Atlanta’s Hartsfield-Jackson monitors not just Delta’s global fleet but also its crews and passengers. The staff there know if the weather’s bad, if meals have been loaded, if maintenance is required. This is the room that deals with problems. When it goes offline, it’s a crisis.

In this case it doesn’t appear that the outage knocked the entire facility into darkness, but rather that it knocked out key computers or servers. That’s why Delta’s contingency plans—a second command center nearby and uninterruptible power supply systems to overcome your standard blackout—didn’t help. The lights stayed on, but the information turned off.

“These systems are running every day 24/7, and they tend to be safe, reliable, and robust,” says Abdelghany. If they get into trouble, it’s often because something changed their steady state, like an update pushed through the system.

Georgia Power, which supplies electricity to Delta, says it’s working with the airline today to fix a failed switchgear, a heavy-duty version of the circuit breaker panel you’ve got in your basement. That would suggest that if an update or test is the problem, it involved hardware (perhaps, ironically, something like a new power supply) rather than software. Georgia Power says the outage affected nobody else.

Delta stresses the problem never put anyone in danger and that it maintained communications with its planes. In any case, standard air traffic control systems were hunky-dory and ready to handle jets already in the air. By mid-afternoon, the airline had operated 2,340 flights. That seems like a lot, until you hear Delta’s daily average is about 6,000.

If you’re starting to think this kind of thing happens a lot, you’re right. In July, the failure of a single data center router forced Southwest to cancel 2,300 flights across four days, costing the airline well over $10 million. CEO Gary Kelly told The Dallas Morning News the router only partially failed, so it didn’t trigger the backup systems. In May, JetBlue had to check in customers by hand when its computer system went down. American Airlines blamed connectivity issues when it had to suspend flights last September. A year ago, United blamed a glitch for 800 flight delays.
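The Southwest case shows why a partial failure can be worse than a total one. The details of that setup are not given here, so what follows is only a minimal sketch of the general idea, assuming a backup that kicks in solely when a device stops responding altogether. Every name in it (`RouterStatus`, `packet_loss`, the 5 percent threshold) is hypothetical and chosen for illustration, not taken from any airline's systems.

```python
from dataclasses import dataclass


@dataclass
class RouterStatus:
    """Hypothetical health snapshot; the fields are illustrative only."""
    responds_to_ping: bool  # is the device reachable at all?
    packet_loss: float      # fraction of traffic dropped, 0.0 to 1.0


def failover_on_liveness(status: RouterStatus) -> bool:
    """Switch to backup only when the device is completely unreachable."""
    return not status.responds_to_ping


def failover_on_degradation(status: RouterStatus, max_loss: float = 0.05) -> bool:
    """Switch to backup when unreachable OR when packet loss exceeds max_loss."""
    return (not status.responds_to_ping) or status.packet_loss > max_loss


# A partially failed router: still reachable, but dropping most of its traffic.
partial = RouterStatus(responds_to_ping=True, packet_loss=0.6)
# A completely dead router, for comparison.
dead = RouterStatus(responds_to_ping=False, packet_loss=1.0)

print(failover_on_liveness(partial))     # the liveness-only check sees nothing wrong
print(failover_on_degradation(partial))  # the degradation-aware check trips
```

In this sketch the dead router trips both checks, while the half-working one slips past the liveness-only check. That is the general shape of a failure that is present but never triggers the backup.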

And then there are the cases that defy contingency planning. In 1991, a farmer reportedly took 20 air traffic control centers offline when he inadvertently cut through an underground fiber optic cable while burying a cow. In 2014, an FAA contractor set fire to an air traffic control center in Chicago, disrupting travel for more than two weeks.

The challenge for any airline after a disruption of this scale is getting things back to normal. Planes, crew, and passengers are all in the wrong places, and it takes time to reset carefully orchestrated global operations. Delta’s 80,000 employees and 800 planes fly 180 million customers to more than 300 destinations every year. Making that possible, and profitable, requires keeping those planes in the air and on time as much as possible. Each delay ripples outward into missed connections, baggage left behind, and extra fuel burned as aircraft fight to catch up. Mass delays beget mayhem.

For now, Delta says to expect more delays and cancellations while operations ramp up. Things may be getting back to normal, but this is unlikely to be the last time we see a systems failure screw up travel plans.

Originally from: How a Computer Outage Can Take Down a Whole Airline