How to apply Risk Management

In 1986 the space shuttle Challenger exploded, killing all seven people aboard, including a high-school teacher who had trained with the astronauts. In this article, I travel back in time to reanalyze the Space Shuttle program using modern methods for Risk Management.

The explosion originated near an O-ring gasket that sealed rocket fuel. The gasket protected a known design flaw that could expose fuel to fire.
Here’s the sequence of events that led to the explosion:

  1. NASA’s purchasing department had the O-ring designed and purchased from a third party. It was designed to work above 40 degrees Fahrenheit, which was probably justified given the warm weather of NASA’s launch site in Florida.
  2. In 1985, launches were delayed until January 1986.
  3. A rare cold day lowered temperatures below freezing (32 degrees Fahrenheit), which is colder than the O-ring design temperature, reducing its ability to seal the design flaw.
  4. The incomplete seal led to leaking fuel, which led to an explosion at takeoff.
  5. The explosion is shown in this video. An aspect of Risk Management is alluded to at 1:30; NASA had been under pressure to launch on that day. In other words, few people were thinking clearly or beyond their job description.

I simplified the situation to focus on what society learned about Risk Management. We learned the importance of effective communication across different disciplines in an organization, and that Risk Management should be documented at the highest level of an organization with clearly defined assumptions and sequences of events that could lead to hazardous situations. Those lessons have been captured in the International Organization for Standardization (ISO) guidelines for Risk Management; in this article I’ll use methods from ISO 14971, Risk Management for Medical Devices.

Risk Management

The international standard for risk management, ISO 14971, can be simplified into these steps:

  1. Create a thorough plan that predetermines acceptable levels of risk, and identifies all people, departments, and policies that affect your product throughout its lifetime.
  2. Identify sequences of events that would lead to hazardous situations.
  3. Document hazardous situations that could lead to harm.
  4. Document harm that could happen from hazardous situations.
  5. Document the severity of that harm, ranging from inconvenience up to death.
  6. Estimate the probability of hazardous situations and of harm.
  7. Apply a number to the level of risk, defined as the severity of harm times the probability of that harm happening.
  8. If the level of Risk is greater than what was predetermined in your plan, apply risk control to either lower severity or probability.
  9. Continuously monitor the real-world use of your product, and continuously improve your plan based on new information and improved assumptions (4.4).

ISO 14971 and the space shuttle

Though ISO 14971 seems obvious in hindsight, the methods weren’t clearly laid out as logical steps until relatively recently, and none of these steps were applied to the space shuttle program. But, we’ve learned and improved as a society. Here’s what modern Risk Management would have looked like for the Space Shuttle Challenger, citing sections of the standard in parentheses:

  1. Plan (3.4): Acceptable risk levels would have been determined before there was pressure to launch. This plan would have been created by a diverse team and documented at the highest level of policies across all departments, including the team that purchased the O-ring (3.5, 4.2). The team would have identified high-level harms, such as “the space shuttle could explode,” and high-level hazardous situations, such as “fuel could leak and become exposed to fire” (4.3).
  2. Sequences of Events: Teams would brainstorm possible sequences of events that could lead to hazardous situations (4.4). It was likely that the launch date would be postponed from a warm month to a cold month, and it was likely that colder weather than usual could occur. Combined, that sequence of events is easy to understand, and it would be obvious that it creates a hazardous situation. Identifying a sequence of events is a team-driven exercise, using techniques of group brainstorming that allow all ideas to be written down before judging them. No one knows all possible situations, but a group will get much closer than any one person if that group’s culture allows everyone to feel valued.
  3. Hazardous situations. Each sequence of events leads to situations that are hazardous, which should be documented.
  4. Harm: A hazardous situation can lead to harm. The delayed launch and cold weather were not harmful, but led to an explosion that was. The probability of harm occurring after a hazardous situation has already occurred is P2.
  5. Severity: Harm can range from something inconvenient to something catastrophic. A risk management plan would apply numbers to that range, from 1 to some practical number. Let’s say 1 is inconvenient, 5 is catastrophic. The shuttle exploding would be a 5, so S = 5.
  6. Probability: Estimate the likelihood of hazardous situations and of harm to the best of your team’s knowledge. This will be P1 and P2. For example, say that the probability of a sequence of events leading to launch in a cold month is 3 cold months out of 12, or 3/12 = 1/4, therefore P1 = 0.25. Let’s say that we assume that there’s a 50% chance of rocket fuel igniting from a leak in the O-ring in a hazardous situation, therefore P2 = 0.50. Probability will always be based on assumptions, so the team’s logic would be clearly stated in the plan so that probability can be adjusted as you learn more information.
  7. Risk: Calculate risk (2.12) for each hazardous situation (5). Risk is defined as the severity of harm times the probability of that harm happening, or Risk = S x (P1 x P2). In this case, Risk = 5 x (0.25 x 0.50) = 0.625, or about 0.6. That number by itself is meaningless; it must be compared to predetermined levels of acceptable risk in your plan. The concept is that high severity with low probability still warrants attention, and so does low severity with high probability.
  8. Apply Risk Control (6.2). If the level of Risk is greater than what was predetermined in your plan, reduce either the severity or the probability until Risk levels are below predetermined acceptance levels (3.4). In the case of the space shuttle, this would mean changing the design of the rocket, changing the design of the O-ring, changing launch procedures to not launch in cold months, or a combination of these changes that would reduce the probability or severity of harm. Risk control must be prioritized, giving higher priority to improving designs and lower priority to adding written instructions. Anyone who’s assembled furniture at home understands why we prioritize designs over written warnings; humans rarely read instructions. The risk control used for each hazardous situation must be documented (6.3), and reevaluated to ensure that residual risks are still acceptable according to the plan (6.4, 6.6).
  9. Continuously monitor and improve your plan (4.4, 6.1, 6.3). Use new information and real-world data to continuously improve your assumptions, updating your plan with improved logic. In the case of the space shuttle, they would have monitored launch conditions and realized that they had launched in hazardous situations several times. In other words, monitoring their assumptions would have led to knowing that there’s a higher probability of hazardous situations than originally assumed, and their P1 would have been adjusted.
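
The arithmetic in steps 5 through 8 can be sketched in a few lines of Python. This is only an illustration, not something taken from ISO 14971: the function name `risk_level`, the acceptance level of 0.5, and the post-control value of P1 are assumptions invented for the example. Only S = 5, P1 = 0.25, and P2 = 0.50 come from the steps above.

```python
def risk_level(severity, p1, p2):
    """Risk = S x (P1 x P2), as defined in step 7."""
    if not 1 <= severity <= 5:
        raise ValueError("severity must be on the 1-5 scale from step 5")
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return severity * (p1 * p2)


# Hypothetical acceptance level; a real plan would fix this number
# before there is any pressure to launch.
ACCEPTABLE_RISK = 0.5

# Values from the article: S = 5, P1 = 0.25, P2 = 0.50.
before = risk_level(severity=5, p1=0.25, p2=0.50)

# Hypothetical risk control: a redesigned O-ring rated for colder weather,
# assumed here (for illustration only) to cut P1 from 0.25 to 0.02.
after = risk_level(severity=5, p1=0.02, p2=0.50)

for label, risk in (("before control", before), ("after control", after)):
    verdict = "acceptable" if risk <= ACCEPTABLE_RISK else "needs risk control"
    print(f"{label}: risk = {risk:.3f} ({verdict})")
```

The comparison against `ACCEPTABLE_RISK` is the important part: the risk number only means something relative to a level the team agreed on in advance.
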
In hindsight, ISO 14971 methods could have prevented the space shuttle explosion in many ways. First, the design flaw was known, and ISO 14971 prioritizes improving the design rather than adding more procedures. If the team made an informed, transparent decision to accept that level of risk, the design could have had safeguards added. This means that the design was flawed, but things like extra O-rings, warnings, etc. could have been put in place. It turns out that’s what they did: engineers added two O-rings as a safeguard. But, that logic was not well documented or known across a shared risk management plan, which is why ISO 14971 requires high-level risk management planning with documented assumptions.
A risk management plan with a sequence of events would allow anyone in the organization to take ownership of saving lives: someone could have been empowered to stop the sequence of events before a hazardous situation occurred, such as postponing the launch when a cold week led to temperatures below the 40-degree O-ring design temperature. Also, records would be kept of how often that sequence of events happened, allowing assumptions in the risk management plan to be updated with real-world information.
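
Keeping records of how often a hazardous situation occurs can be sketched as a simple frequency estimate. This is an illustration under stated assumptions: the launch-day temperatures below are invented placeholder values, not NASA data, and `estimate_p1` is a hypothetical helper. Only the 40-degree design temperature and the originally assumed P1 of 0.25 come from this article.

```python
O_RING_DESIGN_TEMP_F = 40  # design temperature stated in the article
P1_ASSUMED = 0.25          # original planning assumption from the article


def estimate_p1(launch_temps_f, design_temp_f=O_RING_DESIGN_TEMP_F):
    """Fraction of recorded launches that happened below the design temperature."""
    if not launch_temps_f:
        raise ValueError("need at least one recorded launch")
    hazardous = sum(1 for temp in launch_temps_f if temp < design_temp_f)
    return hazardous / len(launch_temps_f)


# Placeholder launch-day temperatures in degrees Fahrenheit,
# invented for illustration only.
recorded_temps = [72, 38, 65, 35, 80, 58, 39, 70]

p1_observed = estimate_p1(recorded_temps)
needs_review = p1_observed > P1_ASSUMED

print(f"assumed P1 = {P1_ASSUMED:.3f}, observed P1 = {p1_observed:.3f}")
if needs_review:
    print("Observed rate exceeds the assumption: update the plan and recalculate risk.")
```

With these made-up records, three of eight launches fall below the design temperature, so the observed rate is higher than the planning assumption and the plan would be flagged for review.
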

In the language of quality control systems, ISO 14971 is the overarching policy in a series of linked processes that continuously improve to reduce risk. For the space shuttle, risk was to astronauts and the space program. For medical devices, risk is defined in terms of patients: the severity of harm to a patient times the probability of that harm occurring.

ISO 14971 Risk Management for Medical Devices

Lessons from the Challenger explosion are part of current, international standards for quality control and risk management. Almost all medical device standards and regulations require that quality systems function as a risk-driven process of continuous improvement.

The most common international standard for quality management is ISO 9001, and the standard specific to medical device quality systems is ISO 13485, which is the foundation of a new audit method, the Medical Device Single Audit Program (MDSAP). Risk is also emphasized by the FDA quality system requirements and the European Union medical device requirements (EU MDR). All use concepts described by ISO 14971 and the supplemental version for Europe, EN ISO 14971:2012.

Regulatory requirements are emphasized by an MDSAP diagram showing Risk Management as the highest level of guidance for companies.

View this in context of “a risk driven process of continuous improvement across all departments,” and notice that Purchasing is so important that it is over even management and design. In other words, we learned from the space shuttle that linked processes of risk management are only as effective as the weakest link, which is usually outside of an organization through purchased parts and outsourced design.

Continuous Learning

In the case of the Challenger explosion, it’s obvious in hindsight how a series of disconnected processes led to harm. What’s less obvious is how to apply risk-based decisions in your existing quality systems. I give examples of how to apply risk management to decisions throughout my blog, and recommend using training and consulting companies to help your organization continuously improve.
  • Oriel STAT-A-MATRIX, an international management consulting organization (I consult with them)
  • Qunique, a Swiss-based boutique consulting firm
  • Me (Jason) 🙂

Share this information

Risk affects all of society, and the more people who think in big-picture concepts the safer our world becomes for everyone. Please share your knowledge.

Parting Thoughts

The space shuttle Challenger explosion was a rare event that in hindsight had preventable sequences of events. History has many similar examples, such as the , the , and every . These events are often referred to as “” because of the book “,” which emphasizes that we can’t predict outlier events but we can build robust systems resistant to their impact and able to adapt to changes.

But, no amount of quality control and mathematical modeling can replace humans working together and communicating effectively. In the case of the space shuttle example, individuals struggled to have their voices heard. To improve your company, focus on culture, communication, and transparency. Imagine yourself as having silenced the space shuttle engineer who knew of the risks but didn’t have a way for his voice to be heard.

That last quote may not be understood by people who haven’t experienced small-team military communication techniques, which are different from televised generalizations about the military. In the real world of small-team operations, we relied on minimal communication and high-level goals; in other words, we had a high-level Risk Management plan that empowered every team member to act in everyone else’s best interest all of the time. A young soldier could tell a high-ranking commander that the team was deviating from a predetermined plan. Conversely, when teams only do “their job” we have what happened to the Space Shuttle Challenger: teams pressured to launch the space shuttle without access to higher-level risk management policies, information, and assumptions that could have been used by anyone to save lives.