Cascading failures - a travel adventure

On a not-so-recent trip to India, my colleague Kevin and I were subjected to some pretty terrible airport travel challenges. I hate to say it, but Newark is the setting here. I will walk through 10+ things that happened over 4.5 hours that looked shockingly like an application system failure. Kevin and I got a lesson on complex systems and the impact of cascading failures, set against the drama of whether or not we would ever actually take off. Let me set the scene…

Ruh oh

After a “short” cross-country flight from SFO, we boarded the upcoming 14-hour flight with the usual level of chaos, but we got a hint of trouble when the pilot came on and said they had been working on a wheel that got tweaked during landing. We looked at each other, said “no big deal,” and carried on. Ten minutes later the trouble began: the airplane had been overfueled, so they decided to run the engines to burn off the excess.

At this point, Kevin and I started taking bets on when we might actually get off the ground and on our way to Mumbai, and when we would get to our hotel for the night. When the airport shut down the entire runway due to an unrelated emergency landing, the “over” bet on the delay went into full effect. We then had seven more travel mishaps in a true series of unfortunate events. Rather than make you live through every dirty detail, here is a quick overview of the sequence. After that, we will walk through the three different ways our experience mirrored an application system failure and what you might have to think through to resolve it.

Our travel fiasco included:

  1. Wheel tweaked on landing
  2. Aircraft overfueled -> initiated burning off fuel to lighten the load
  3. An unrelated plane’s emergency landing occurred on the main runway, which impacted flight paths for both arriving and departing aircraft
  4. Our pilots “timed out” on “hours allowed to work” per FAA regulations (in place to prevent flight fatigue) since we had been sitting at the gate so long
  5. Gate agents allowed people to deplane our flight in order to comply with rules on how long passengers can be kept on a plane without a break (2 hours in this case)
  6. New pilots from a Tel Aviv flight were swapped with our pilots since the Tel Aviv flight was shorter. (The swap let our original crew fly a shorter route within their allowed hours, and the Tel Aviv crew still had enough working hours left to pilot our 14-hour flight.)
  7. Fuel-burn strategy unsuccessful; the excess now needed to be pumped out directly from the tank
  8. Flight attendants timed out due to union rules
  9. A third plane parked behind us, blocking our ability to get out of the gate
  10. We nearly ran out of time on the new pilots, just like #4, due to the plane blocking us
  11. Wheels up for 14.5 hours, for a total of 19 hours on the plane!

Speaking from the perspective of an engineering leader, our travel “adventure” was a reminder that things will go wrong, but how you prepare for and react to a problem will likely dictate how quickly you get through it, or whether you face further exposure and risk of degradation.

In the outline above, the wheel damage, the emergency landing, and the blocking plane were all events external to our system. I will label the over-fueling, the deplaning of passengers, and the initial strategy of burning off fuel to lighten the load as mistakes on the part of various decision-makers. Lastly, things were complicated by protocol constraints, some of which seemed like overkill.

Let me start with kudos for the monitoring that found the wheel issue and ensured the system was “safe.” In the application world, I would consider this a pre-launch test or a unit test run after the landing event. The teams on the ground appeared to know exactly what to do, and they had the spare parts on hand to make the needed repairs. If the wheel had been the only issue we faced, we probably would have departed before the emergency landing closed the runway and avoided the sequence of misfortunes that followed. As you can see, one tiny mistake, the over-fueling, created the opportunity and exposure for a cascade.
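To make that analogy a bit more concrete, here is a minimal sketch of a post-event smoke test. Everything in it is hypothetical: `service_status()` stands in for whatever health probe or inspection your own system exposes.

```python
import unittest

def service_status(component: str) -> str:
    """Hypothetical health probe; imagine a real inspection or API call here."""
    return "ok"

class PostIncidentSmokeTest(unittest.TestCase):
    """Run immediately after a disruptive event (the 'landing'), before relaunching."""

    def test_critical_components_are_healthy(self):
        for component in ("landing-gear", "brakes", "fuel-system"):
            self.assertEqual(
                service_status(component), "ok",
                f"{component} needs repair before departure",
            )

if __name__ == "__main__":
    unittest.main()
```

The value is less in any single check than in running the whole suite every time, so a “tweaked wheel” gets caught before you commit to the long haul.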

If I were leading a retro on our experience, I would suggest a repeatable (scalable) mechanism to prevent over-fueling in general, even if the only other reason I could identify was cost savings; I am sure you want to err on the side of a little too much fuel versus too little. I also want to note that the length of this flight may have made fueling more sensitive than it would be on shorter flights, but fueling could be monitored electronically. I imagine a process that looks a bit like this: test how much fuel is already on board, set the amount to load automatically, and then execute the fueling. In the existing process, human error was allowed into the equation; with better tooling and monitoring, automation could significantly decrease the error rate. For applications, this is akin to a repeatable deploy pipeline that could be run by hand but is superior when automated. We have been working on exactly this at Numerator throughout the last year for some of our older applications that were not built using modern DevOps methodologies. Through these efforts we have already seen the increases in speed and the drop in defects you would expect. Another benefit of this sort of work, which is not often advertised, is the opportunity it gives teams on our legacy applications to learn new skills.
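As a sketch of that “measure, set, execute” loop, purely hypothetically (the function names, burn rate, and reserve figures below are mine, not real aviation or Numerator numbers), the same shape applies to a deploy pipeline: read the current state, compute the desired state, and refuse to proceed if the numbers do not reconcile.

```python
# Hypothetical "measure -> compute target -> validate -> execute" loop.
# All numbers and names here are illustrative placeholders.

def read_current_fuel_kg() -> float:
    """Stand-in for an electronic fuel gauge reading."""
    return 42_000.0

def required_fuel_kg(route_km: float, burn_kg_per_km: float, reserve_kg: float) -> float:
    """Compute what the trip actually needs, plus a safety reserve."""
    return route_km * burn_kg_per_km + reserve_kg

def plan_fuel_load(route_km: float, burn_kg_per_km: float = 7.0,
                   reserve_kg: float = 5_000.0) -> float:
    current = read_current_fuel_kg()
    target = required_fuel_kg(route_km, burn_kg_per_km, reserve_kg)
    to_load = target - current
    if to_load < 0:
        # Surface the over-target condition before it becomes a cascade.
        raise ValueError(f"Already {-to_load:.0f} kg over target; plan removal, not loading")
    return to_load

if __name__ == "__main__":
    print(f"Load {plan_fuel_load(route_km=12_500):.0f} kg")
```

The point is not the arithmetic; it is that the validation step runs every time, by machine, instead of relying on someone remembering to check.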

Moving down the list to the events created by regulatory concerns: it is hard to argue with rules and regulations in place to ensure pilot safety, given the rigor of the job. However, there is an argument for allowing the senior flight crew to exercise sound judgment on both the hours a flight attendant is allowed to work and the amount of time passengers may stay on the plane prior to takeoff. Another possibility would be to let the passengers vote to stay on the plane. Obviously this is an unconventional idea, but if 90% of the passengers voted on the United mobile app to stay on board to increase the chance of leaving that night, it would be a worthwhile adjustment. Black-and-white rules are useful for truly critical items, but a check and balance is useful for less strict ones. An example would be a release that is limited to internal users and very much desired by them, but crashes once in a while. If the users accept the risk because the overall effectiveness of the application has improved, they should have the agency to get access to the upgrade even if it isn’t 100% stable.
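Here is a hedged sketch of what that check and balance might look like in code; the flag names, the 90% threshold, and the vote tallies are all hypothetical and do not refer to any real feature-flag product.

```python
# Hypothetical release gate: hard rules stay hard for critical items,
# while an internal, non-critical release can ship on a strong opt-in vote.

VOTE_THRESHOLD = 0.9  # e.g. 90% of affected users vote to accept the risk

def can_ship(is_critical: bool, is_stable: bool, internal_only: bool,
             votes_for: int, votes_total: int) -> bool:
    if is_critical:
        return is_stable                      # black-and-white rule, no override
    if not internal_only or votes_total == 0:
        return is_stable                      # external users get the strict rule too
    opted_in = votes_for / votes_total >= VOTE_THRESHOLD
    return is_stable or opted_in              # check and balance for everything else

# Internal users accept an occasional crash in exchange for the upgrade.
print(can_ship(is_critical=False, is_stable=False, internal_only=True,
               votes_for=185, votes_total=200))   # True
print(can_ship(is_critical=True, is_stable=False, internal_only=True,
               votes_for=185, votes_total=200))   # False
```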

As I continue to reflect on our travel fiasco, the biggest errors I can cite were the decisions made around fuel. The initial error of over-fueling led to a second error: trying to burn off the fuel when the situation really required pumping it out. I am guessing this is partly because it is harder to bring a tank over and pump fuel out than it is to burn it off. That said, knowing how much fuel had been added and how long it would take to burn off would have pointed the crew to the proper decision. It appears nobody checked the math on how much extra there was and how fast it could be burned off. In an application failure, this would be similar to doing a row-by-row db update instead of writing a script that solves the error once and completely. Again, in the case of the fuel, it feels like there is a system monitoring or automation opportunity.
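To make the database analogy concrete, here is a minimal sketch, using sqlite3 and an invented `orders` table (entirely hypothetical), of fixing every affected row in one scripted, transactional pass, after first checking the math on how many rows are actually involved.

```python
import sqlite3

# Hypothetical data: a batch of orders written with the wrong status during an incident.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders (status, created_at) VALUES (?, ?)",
    [("errored", "2020-01-15")] * 3 + [("shipped", "2020-01-15")],
)

# Check the math first: how big is the problem?
affected = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'errored' AND created_at = '2020-01-15'"
).fetchone()[0]
print(f"Rows needing the fix: {affected}")

# The row-by-row alternative would be one hand-typed UPDATE per bad id.
# The scripted fix: one transaction, every affected row corrected together or not at all.
with conn:
    conn.execute(
        "UPDATE orders SET status = 'pending_retry' "
        "WHERE status = 'errored' AND created_at = '2020-01-15'"
    )

print(conn.execute("SELECT status, COUNT(*) FROM orders GROUP BY status").fetchall())
```

Sizing the problem before acting is the step the crew seemed to skip, and it is the step a scripted fix forces you to take.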

In the case of the pilot swap, I credit an effective operations team with finding a creative solution that helped overall system efficiency. I am sure that with the chaos created by an emergency landing, the team in the tower had a lot of different things to account for. Despite that, their management of the situation was a solid win for our flight with no negative impact on the other flight. An example in an SDLC system would be rotating an underutilized QA team onto a project that is ready for release but blocked by competing priorities within the team that normally manages it. Implicit in this is that the incoming team can read the documentation and train up with minimal overhead, similar to the pilots being able to fly either Tel Aviv or Mumbai because they had all the training needed to swap. The same applied to the flight attendants, since there was a team ready and waiting to take over our flight. So while the changeover in our flight crew did happen, it wasn’t as dire as having no pilots available at all.

At the end of the fueling and crew-swapping processes, a plane parked behind ours was blocking our exit. This, coupled with a strict countdown on our new pilots’ allowed working hours, was purely comical given the previous four hours. All’s well that ends well: the blocking plane was moved, we backed up 100 feet, and we got to the runway. To be honest, if moving the other plane had failed, it would have been the ultimate insult to end the process on, no different than a war lost for want of the nail holding a horse’s shoe in battle. It is a reminder that the smallest thing can sometimes have the biggest impact. So remember: run your table-top exercises, automate what you can, and have good training in place for when things go wrong. It is only a matter of time before something does go wrong, and how you respond will dictate your success.

With that, be well and stay safe out there, all you travelers and system owners.

<This was drafted a while back, but it has been a little bit busy, so publishing was delayed. Blame management, aka me. - Joshua Greenough, CTO, Numerator>