In 1986 the space shuttle Challenger exploded, killing all seven crew members, including a high-school teacher who had trained with the astronauts. In this article, I travel back in time to reanalyze the Space Shuttle program using modern methods for Risk Management.
- NASA’s purchasing department had the O-ring designed and purchased from a third party. It was designed to work above 40 degrees Fahrenheit, which probably seemed justified given the warm weather at NASA’s launch site in Florida.
- Launches planned for 1985 were delayed until January 1986.
- A rare cold day lowered temperatures below freezing (32 degrees Fahrenheit), which is colder than the O-ring design temperature and reduced the O-ring’s ability to seal.
- The incomplete seal led to leaking fuel, which led to an explosion shortly after takeoff.
- The explosion is shown in this video. An aspect of Risk Management is alluded to at 1:30; NASA had been under pressure to launch on that day. In other words, few people were thinking clearly or beyond their job description.
- Create a thorough plan that predetermines acceptable levels of risk, and identifies all people, departments, and policies that affect your product throughout its lifetime.
- Identify sequences of events that would lead to hazardous situations.
- Document hazardous situations that could lead to harm.
- Document harm that could happen from hazardous situations.
- Document the severity of that harm, ranging from inconvenience up to death.
- Estimate the probability of hazardous situations and of harm.
- Assign a number to the level of risk, defined as the severity of harm times the probability of that harm happening.
- If the level of risk is greater than what was predetermined in your plan, apply risk control to lower either the severity or the probability.
- Continuously monitor the real-world use of your product, and continuously improve your plan based on new information and improved assumptions (4.4).
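The steps above can be sketched as a small risk register. This is a minimal illustration, not something taken from ISO 14971: the `RiskEntry` class, its field names, the `needs_risk_control` helper, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RiskEntry:
    """One row of a hypothetical risk register (names are illustrative)."""
    sequence_of_events: str
    hazardous_situation: str
    harm: str
    severity: int        # assumed scale: 1 = inconvenience ... 5 = death
    p_situation: float   # estimated probability of the hazardous situation
    p_harm: float        # estimated probability of harm once in that situation

    def risk_level(self) -> float:
        # Severity of harm times the probability of that harm happening.
        return self.severity * self.p_situation * self.p_harm


def needs_risk_control(entry: RiskEntry, acceptable_risk: float) -> bool:
    """True when the entry exceeds the level predetermined in the plan."""
    return entry.risk_level() > acceptable_risk


# Illustrative values only; a real plan documents how each number was estimated.
entry = RiskEntry(
    sequence_of_events="launch slips to a colder month; unusually cold day",
    hazardous_situation="fuel could leak and become exposed to fire",
    harm="the space shuttle could explode",
    severity=5,
    p_situation=0.1,
    p_harm=0.2,
)
print(f"risk level {entry.risk_level():.3f}, "
      f"control needed: {needs_risk_control(entry, acceptable_risk=0.05)}")
```

Keeping the sequence of events, hazardous situation, and harm in the same record as the numbers makes it easier to revisit the assumptions behind each estimate later.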
ISO 14971 and the space shuttle
Though ISO 14971 seems obvious in hindsight, the methods weren’t clearly laid out as logical steps until relatively recently, and none of these steps were applied to the space shuttle program. But we’ve learned and improved as a society. Here’s what modern Risk Management would have looked like for the Space Shuttle Challenger, with the relevant paragraphs of the standard cited in parentheses:
- Plan (3.4): Acceptable risk levels would have been determined before there was pressure to launch. This plan would have been created by a diverse team and documented at the highest level of policies across all departments, including the team that purchased the O-ring (3.5, 4.2). The team would have identified high-level harms, such as “the space shuttle could explode,” and high-level hazardous situations, such as “fuel could leak and become exposed to fire” (4.3).
- Sequences of Events. Teams would brainstorm possible sequences of events that could lead to hazardous situations (4.4). It was likely that the launch date would be postponed from a warm month to a cold month, and it was likely that colder weather than usual could occur. Each event is easy to understand on its own, and once they are written down together it becomes obvious that the combination creates a hazardous situation. Identifying a sequence of events is a team-driven exercise, using group brainstorming techniques that allow all ideas to be written down before they are judged. No one knows all possible situations, but a group will get much closer than any one person if that group’s culture allows everyone to feel valued.
- Hazardous situations. Each sequence of events leads to situations that are hazardous, which should be documented.
- Harm: A hazardous situation can lead to harm. The delayed launch and cold weather were not harmful, but led to an explosion that was. The probability of harm occurring after a hazardous situation has already occurred is P2.
- Severity: Harm can range from something inconvenient to something catastrophic. A risk management plan would apply numbers to that range, from 1 to some practical number. Let’s say 1 is inconvenient, 5 is catastrophic. The shuttle exploding would be a 5, so S = 5.
- Probability: Estimate the likelihood of hazardous situations and of harm to the best of your team’s knowledge. These will be P1 and P2. For example, say that a sequence of events could lead to a launch in any of the 3 cold months out of the 12 months in a year, or 3/12 = 1/4, therefore P1 = 0.25. Let’s also assume that, once in that hazardous situation, there’s a 50% chance of rocket fuel igniting from a leak in the O-ring, therefore P2 = 0.50. Probability will always be based on assumptions, so the team’s logic should be clearly stated in the plan so that the probabilities can be adjusted as you learn more.
- Risk: Calculate risk (2.12) for each hazardous situation (5). Risk is defined as the severity of harm times the probability of that harm happening, or Risk = S x (P1 x P2). In this case, Risk = 5 x (0.25 x 0.50) = 0.625. That number is meaningless by itself; it must be compared to the predetermined levels of acceptable risk in your plan. The concept is that high severity with low probability still warrants attention, and so does low severity with high probability.
- Apply Risk Control (6.2). If the level of Risk is greater than what was predetermined in your plan, reduce either the severity or the probability until Risk levels are below predetermined acceptance levels (3.4). In the case of the space shuttle, this would mean changing the design of the rocket, changing the design of the O-ring, changing launch procedures to not launch in cold months, or any combination of these that would reduce the probability or severity of harm. Risk control must be prioritized, giving higher priority to improving designs and lower priority to adding written instructions. Anyone who’s assembled furniture at home understands why we prioritize designs over written warnings; humans rarely read instructions. The risk control used for each hazardous situation must be documented (6.3), and reevaluated to ensure that residual risks are still acceptable according to the plan (6.4, 6.6).
- Continuously monitor and improve your plan (4.4, 6.1, 6.3). Use new information and real-world data to continuously improve your assumptions, updating your plan with improved logic. In the case of the space shuttle, they would have monitored launch conditions and realized that they had launched in hazardous situations several times. In other words, monitoring their assumptions would have led to knowing that there’s a higher probability of hazardous situations than originally assumed, and their P1 would have been adjusted.
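The arithmetic in the Risk and Risk Control steps can be sketched in a few lines. S, P1, and P2 are the article's example values, which give a risk of 0.625 (about 0.6). Everything else is an assumption made only for illustration: the acceptance level of 0.1, the reduced probabilities attached to each control, and the function and variable names.

```python
def risk_level(severity: int, p1: float, p2: float) -> float:
    """Risk = S x (P1 x P2), as defined in the Risk step above."""
    return severity * (p1 * p2)


# Example values from the walkthrough above.
S, P1, P2 = 5, 0.25, 0.50

# Hypothetical acceptance level; a real plan sets this before any launch pressure.
ACCEPTABLE_RISK = 0.1

baseline = risk_level(S, P1, P2)
print(f"baseline risk: {baseline}")  # 5 x (0.25 x 0.50) = 0.625

# Hypothetical risk controls with assumed (not measured) reduced probabilities.
controls = {
    "redesigned O-ring seal (lowers P2)": (S, P1, 0.05),
    "no launches below 40 F (lowers P1)": (S, 0.01, P2),
}

residual_risks = {}
for name, (s, p1, p2) in controls.items():
    residual = risk_level(s, p1, p2)
    residual_risks[name] = residual
    verdict = "acceptable" if residual <= ACCEPTABLE_RISK else "needs more control"
    print(f"{name}: residual risk {residual:.4f} -> {verdict}")
```

The design change is listed first because the walkthrough gives design improvements priority over procedures and written instructions. The residual numbers mean nothing on their own and only matter relative to the acceptance level in the plan.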
A risk management plan with a sequence of events would allow anyone in the organization to take ownership of saving lives: someone could have been empowered to stop the sequence of events before a hazardous situation occurred, such as postponing the launch when a cold week led to temperatures below the 40-degree O-ring design temperature. Also, records would be kept of how often that sequence of events happened, allowing assumptions in the risk management plan to be updated with real-world information.
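As a rough sketch of that idea, the snippet below holds a launch when the forecast is below the O-ring design temperature and uses recorded launch temperatures to check the assumed P1. The temperature list is made up for illustration and is not historical launch data. The function names and the choice to treat exactly 40 degrees as acceptable are also assumptions.

```python
from typing import Sequence

O_RING_DESIGN_LIMIT_F = 40.0  # design temperature quoted in this article


def launch_permitted(forecast_temp_f: float) -> bool:
    """Hold the launch when the forecast is below the O-ring design limit.

    Treating exactly 40 F as acceptable is an assumption of this sketch.
    """
    return forecast_temp_f >= O_RING_DESIGN_LIMIT_F


def observed_p1(launch_temps_f: Sequence[float]) -> float:
    """Fraction of recorded launches that took place below the design limit."""
    if not launch_temps_f:
        raise ValueError("no launch records yet")
    cold = sum(1 for t in launch_temps_f if t < O_RING_DESIGN_LIMIT_F)
    return cold / len(launch_temps_f)


# Made-up launch-day temperatures in Fahrenheit, NOT real shuttle data.
recorded_temps_f = [75, 68, 53, 38, 70, 36, 61, 39, 66, 58]

assumed_p1 = 0.25  # the plan's original assumption
updated_p1 = observed_p1(recorded_temps_f)

print(f"assumed P1 = {assumed_p1}, observed P1 = {updated_p1}")
if updated_p1 > assumed_p1:
    print("Hazardous situations occur more often than assumed; update the plan.")

print("launch at 31 F permitted:", launch_permitted(31.0))
```

A simple observed fraction is only one way to revise an estimate. The point is that the plan's logic is written down clearly enough for anyone to rerun it when new records arrive.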
In the language of quality control systems, ISO 14971 is the overarching policy in a series of linked processes that continuously improve to reduce risk. For the space shuttle, the risk was to astronauts and the space program. For medical devices, risk is defined in terms of patients: the severity of harm to a patient times the probability of that harm occurring.
ISO 14971 Risk Management for Medical Devices
The most common international standard for quality management is ISO 9001, and the standard specific to medical device quality systems is ISO 13485, which is the foundation of a new audit method, the Medical Device Single Audit Program (MDSAP). Risk is also emphasized by the FDA quality system requirements and the European Union medical device requirements (EU MDR). All use concepts described by ISO 14971 and its supplemental version for Europe, EN ISO 14971:2012.
Regulatory requirements are emphasized by an MDSAP diagram showing Risk Management as the highest level of guidance for companies.
Share this information
But no amount of quality control and mathematical modeling can replace humans working together and communicating effectively. In the space shuttle example, individuals struggled to have their voices heard. To improve your company, focus on culture, communication, and transparency. Imagine yourself as having silenced the space shuttle engineer who knew of the risks but didn’t have a way for his voice to be heard.