
2.1.4 Failure Modes & Effects Analysis (FMEA)

Introduction to FMEA

Failure Modes & Effects Analysis (FMEA) is a structured, systematic method to identify:

- How a process, product, or service can fail.
- Why it might fail.
- What happens when it fails.
- What to do to prevent or control those failures.

FMEA is primarily a risk prioritization and prevention tool. It is used to:

- Anticipate failures before they occur.
- Rank risks using agreed criteria.
- Select and implement effective corrective and preventive actions.
- Re-assess risk after improvements.

FMEA supports data-driven decision making in improvement and design activities, especially within the Analyze, Improve, and Control phases of structured improvement projects.

---

Types of FMEA

Process FMEA (PFMEA)

PFMEA analyzes potential failures in a process that produces a product or service.

- Focus: Steps in a process flow.
- Typical inputs:
  - Process map or value stream map.
  - Historical defects, scrap, rework, complaints.
  - Measurement system capability information.
- Typical outputs:
  - Process controls and poka-yoke concepts.
  - Revised work standards and instructions.
  - Improvement priorities for process steps.

Design FMEA (DFMEA)

DFMEA analyzes potential failures in the design of a product, component, or system.

- Focus: Product functions and design features.
- Typical inputs:
  - Functional requirements and specifications.
  - Engineering drawings and bills of materials.
  - Customer, safety, and regulatory requirements.
- Typical outputs:
  - Design changes or redesign priorities.
  - Tolerance adjustments and material choices.
  - Design verification and validation plans.

Other Variants

Several specialized variants exist and follow the same basic logic:

- Service FMEA: Failure modes in service delivery.
- System FMEA: Higher-level systems and their interfaces.
- Software FMEA: Software modules and logic failures.

All variants maintain the same core structure: failure mode, effect, cause, controls, and risk prioritization.
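This shared core structure can be sketched as a simple record type. The following is a minimal sketch in Python; the class and field names are hypothetical illustrations, not taken from any FMEA standard, and the example values reuse the valve example from this section:

```python
from dataclasses import dataclass, field

@dataclass
class FmeaRow:
    """One row of an FMEA worksheet (hypothetical field names)."""
    item_function: str         # process step or design function
    failure_mode: str          # how it can fail to meet requirements
    effects: list[str]         # consequences at the next step / customer / system
    causes: list[str]          # underlying reasons the mode could occur
    prevention_controls: list[str] = field(default_factory=list)
    detection_controls: list[str] = field(default_factory=list)
    severity: int = 1          # S, ordinal 1-10 scale
    occurrence: int = 1        # O, ordinal 1-10 scale
    detection: int = 1         # D, ordinal 1-10 scale (lower = better detection)

    @property
    def rpn(self) -> int:
        """Traditional Risk Priority Number: RPN = S x O x D."""
        return self.severity * self.occurrence * self.detection

# Illustrative row using the valve example from this section.
row = FmeaRow(
    item_function="Valve: open on demand",
    failure_mode="Valve fails to open",
    effects=["Leak occurs"],
    causes=["Incorrect torque setting"],
    severity=9, occurrence=2, detection=3,
)
print(row.rpn)  # 9 * 2 * 3 = 54
```

Keeping effects and causes as lists reflects the point made below that one failure mode can have multiple effects and multiple possible causes.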
---

Core FMEA Concepts

Failure Mode, Effect, and Cause

- Failure mode: The specific way in which a function, step, or component can fail to meet requirements.
  - Example: “Valve fails to open,” “Part dimension out of tolerance.”
- Effect of failure: The consequence of the failure mode as perceived at the next step, by the customer, or by the system.
  - Example: “Leak occurs,” “Product does not assemble.”
- Cause of failure: The underlying reason the failure mode could occur.
  - Example: “Incorrect torque setting,” “Worn tooling,” “Incorrect specification.”

Clear differentiation:

- One failure mode can have multiple effects.
- One failure mode can have multiple possible causes.
- Each row of an FMEA typically links:
  - A function or process step.
  - A failure mode.
  - An effect.
  - One or more causes.
  - Existing controls and risk ratings.

Prevention vs. Detection

Controls are categorized into:

- Prevention controls: Reduce or eliminate the likelihood that the cause occurs.
  - Example: Error-proofing devices, standardized work parameters.
- Detection controls: Identify the failure mode or cause after it occurs but before it reaches the customer.
  - Example: In-process inspections, automated checks.

Prevention is prioritized over detection because it acts earlier in the chain and reduces dependence on inspection.

---

Structure of an FMEA Worksheet

Typical Columns

A typical FMEA table includes:

- Item / Function: What is intended to be done (process step or design function).
- Potential Failure Mode: How the item or function could fail.
- Potential Effect(s): Consequences for internal/external customers or subsequent steps.
- Severity (S): How serious the effect is.
- Potential Cause(s): Mechanisms or reasons for the failure mode.
- Occurrence (O): Likelihood that the cause will lead to failure.
- Current Controls: Existing prevention and detection measures.
- Detection (D): Likelihood that current controls will detect the failure or cause.
- Risk Priority Number (RPN): Calculated risk value (traditional).
- Recommended Actions: Actions to reduce risk.
- Responsibility and Target Date: Ownership and timing.
- Action Results: Updated S, O, D after implementing actions.

The sequence of columns supports a logical chain: function → failure mode → effect → severity → cause → occurrence → controls → detection → risk priority → actions.

---

FMEA Rating Scales

Severity (S)

Severity describes how serious the effect of a failure mode is if it occurs. A consistent ordinal scale (often 1–10) is used.

- High severity (e.g., 9–10):
  - Safety hazard or regulatory noncompliance.
  - Significant customer dissatisfaction or loss of function.
- Medium severity (e.g., 5–8):
  - Noticeable degradation of performance.
  - Rework, scrap, or moderate customer dissatisfaction.
- Low severity (e.g., 1–4):
  - Minor inconvenience, cosmetic issue, or easily corrected error.

Key points:

- Severity is independent of how frequently the failure happens.
- Severity is not reduced by better detection alone; it is reduced by preventing or mitigating the effect itself.

Occurrence (O)

Occurrence describes how likely the cause is to occur and lead to the failure mode.

- High occurrence (e.g., 8–10):
  - Failure cause is frequent or almost inevitable.
- Medium occurrence (e.g., 4–7):
  - Occasional failures, not rare but not constant.
- Low occurrence (e.g., 1–3):
  - Rare failures; strong process controls or historical evidence of low rates.

Key points:

- Occurrence ratings should be based on data when possible (defect rates, capability indices).
- Improvement efforts aimed at process control and capability primarily reduce occurrence.

Detection (D)

Detection describes the ability of current controls to discover the failure mode or cause before the effect reaches the customer.

- High detection rating (e.g., 8–10):
  - Very unlikely to detect; failures typically escape.
- Medium detection rating (e.g., 4–7):
  - Some chance of detection, but not highly reliable.
- Low detection rating (e.g., 1–3):
  - Very high likelihood of detection (robust automatic controls, highly reliable tests).

Important nuance:

- Detection measures capability, not frequency of failure.
- A low detection rating is good; it means strong detection capability.

---

Risk Priority Number (RPN) and Alternatives

Calculating RPN

The traditional quantitative risk metric in FMEA is the Risk Priority Number (RPN):

- RPN = S × O × D

Properties:

- RPN is an ordinal indicator, not a precise risk probability.
- RPN ranges depend on the scale (e.g., 1–10 scales yield 1–1000).
- Different combinations of S, O, and D can yield the same RPN.

Interpreting RPN

Key considerations when using RPN:

- Use RPN to rank-order risks, not to calculate absolute risk.
- Combine RPN with the individual S, O, D values:
  - High-severity items may require attention even with a modest RPN.
  - Equal RPNs do not imply equal risk profiles.

Common practices:

- Define priority rules, such as:
  - Address all items with severity above a certain threshold.
  - Within that group, prioritize by RPN.
- Re-calculate RPN after actions are implemented to verify risk reduction.

Beyond RPN

Limitations of RPN include:

- Non-unique representation (different risk profiles can share the same RPN).
- Its multiplicative nature exaggerates the influence of high mid-range ratings.

To address this, some organizations supplement or adjust RPN with:

- Severity-first logic:
  - Always address critical safety or regulatory risks regardless of RPN.
- Priority matrices:
  - Visualizing S vs. O or S vs. RPN for clearer decisions.

Even when alternative schemes are used, the underlying FMEA logic remains the same: severity, occurrence, and detection guide prioritization.

---

Building an Effective FMEA

Preparation

Before starting an FMEA:

- Define the scope:
  - Which process, product, or subsystem is covered.
  - Interfaces and boundaries.
- Assemble a cross-functional team with:
  - Process, design, quality, and operations expertise as relevant.
- Gather reference information:
  - Process maps, P&IDs, design drawings.
  - Historical defects, complaints, warranty data.
  - Requirements, specifications, standards.

Clear scope and data enable a focused, fact-based analysis.

Step-by-Step FMEA Development

A practical sequence for building an FMEA:

1. List items or functions
   - Process steps, product functions, or system elements.
   - Use clear, functional wording (what it must do).
2. Identify potential failure modes
   - For each function or step, ask: “In what ways can this fail to meet requirements?”
   - Capture distinct failure modes, not vague descriptions.
3. Describe effects
   - For each failure mode, state what happens at the next step and what happens to the end user.
   - Include internal and external effects if applicable.
4. Assign severity
   - Rate how serious the worst credible effect is.
   - Use agreed criteria and be consistent across the FMEA.
5. Identify causes
   - For each failure mode, list direct causes.
   - Use knowledge from root cause analysis, process knowledge, or data.
   - Keep causes specific and controllable.
6. Assign occurrence
   - Estimate how often each cause leads to the failure mode.
   - Use defect data or expert judgment aligned to the scale.
7. List current controls
   - Prevention controls for causes.
   - Detection controls for failure modes or effects.
   - Include test methods, inspections, alarms, procedures.
8. Assign detection rating
   - Evaluate how likely current controls are to detect the issue.
   - Use lower values for strong, automated, or validated detection.
9. Calculate RPN
   - Multiply S, O, and D for each row.
   - Create a prioritized list of higher-risk items.
10. Define recommended actions
    - For high-priority items, propose:
      - Prevention actions to reduce occurrence.
      - Detection enhancements if prevention is not feasible.
      - Design or process changes to reduce severity when possible.
11. Implement and document results
    - Assign ownership and due dates.
    - Update the FMEA with:
      - Implemented actions.
      - Revised S, O, D and the resulting RPN.
12. Review and maintain
    - Revisit the FMEA when:
      - Changes are made to the design or process.
      - New failure data emerges.
      - Control plans are updated.

---

Quality of an FMEA

Characteristics of a Strong FMEA

An effective FMEA exhibits:

- Clear logic: Functions, failure modes, effects, and causes are logically linked.
- Specific language: Descriptions are concrete and operationally meaningful.
- Consistent ratings: S, O, D reflect shared criteria and are not arbitrarily assigned.
- Action orientation: High-risk items have concrete, prioritized actions.
- Living-document behavior: Regular updates reflect new learning and changes.

Common Pitfalls and How to Avoid Them

Typical issues and countermeasures:

- Overly generic failure modes
  - Problem: Vague entries like “process failure.”
  - Countermeasure: Break them down into specific, observable failures.
- Ignoring high severity
  - Problem: High-severity, low-RPN items are neglected.
  - Countermeasure: Always review high-severity items independently of RPN.
- Ratings based only on opinion
  - Problem: S, O, D are assigned without data.
  - Countermeasure: Use available data and documented scales; refine ratings as data improves.
- FMEA not updated
  - Problem: The document becomes outdated and unused.
  - Countermeasure: Tie FMEA updates to change management and periodic reviews.
- Actions not closed
  - Problem: Recommended actions stay as notes and are never completed.
  - Countermeasure: Track actions with owners, deadlines, and confirmed effectiveness.

---

Integrating FMEA with Measurement and Analysis

Using Data to Inform Ratings

Reliable S, O, D ratings benefit from:

- Historical defect data: Parts per million, defect rates by failure mode and cause.
- Process capability information: Capability indices that relate to the likelihood of out-of-spec output.
- Control chart information: Stability and special-cause signals affecting occurrence.

Data-driven ratings:

- Reduce subjectivity.
- Enable more accurate prioritization.
- Support re-rating after improvements.

Linking FMEA to Root Cause Analysis

While FMEA identifies potential causes, additional analysis tools may be applied outside the FMEA to validate and deepen understanding.

Integration points:

- Use FMEA to flag critical failure modes and causes.
- For high-risk causes, apply detailed root cause analysis.
- Feed confirmed root causes and their mechanisms back into the FMEA.

This creates a loop between prediction (FMEA) and confirmation (analysis in practice).

---

Connecting FMEA to Control Activities

FMEA and Control Plans

Outputs from FMEA feed directly into control planning:

- Key process characteristics: Identified through severe or frequent failure modes.
- Control methods: Derived from recommended detection and prevention actions.
- Monitoring frequency and reaction plans: Informed by occurrence ratings and severity.

Alignment:

- Each high-risk failure mode should have corresponding controls and monitoring defined.
- Control plans should reference the FMEA so that changes in one are reflected in the other.

Sustaining Improvements

After improvements:

- Verify the reduction in occurrence or the improvement in detection through data.
- Update O and D to reflect actual performance.
- Monitor for unintended consequences or new failure modes.

Sustainment requires:

- Auditing adherence to new controls.
- Periodic validation of detection methods.
- Ongoing feedback from operations and customers.

---

Summary

Failure Modes & Effects Analysis (FMEA) is a structured approach to foresee and prevent failures in processes, products, and systems. It systematically links:

- Functions and steps.
- Failure modes and effects.
- Causes and existing controls.
- Severity, occurrence, and detection.
- Prioritized actions and verified results.
By using consistent rating scales, calculating and interpreting RPN (or related prioritization schemes), and transforming insights into concrete prevention and detection actions, FMEA guides focused risk reduction. When maintained as a living document and connected to measurement, analysis, and control activities, FMEA becomes a central tool for robust, data-driven failure prevention and continuous improvement.
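The priority rules described in this section (address items above a severity threshold first, then rank by RPN within each group) can be sketched in a few lines of Python. The severity threshold of 8 and the example rows below are illustrative assumptions, not values prescribed by any FMEA standard:

```python
# Sketch of RPN-based prioritization with a severity-first rule.
# The threshold (8) and the example rows are illustrative assumptions.

def rpn(s: int, o: int, d: int) -> int:
    """Traditional Risk Priority Number: RPN = S x O x D."""
    return s * o * d

def prioritize(rows, severity_threshold=8):
    """Rank FMEA rows: high-severity items first, then by descending RPN
    within each group."""
    return sorted(
        rows,
        key=lambda r: (r["S"] < severity_threshold,    # high-severity group sorts first
                       -rpn(r["S"], r["O"], r["D"])),  # then descending RPN
    )

rows = [
    {"name": "Mode A", "S": 9, "O": 2, "D": 3},  # RPN = 54, but severity 9
    {"name": "Mode B", "S": 6, "O": 6, "D": 7},  # RPN = 252
    {"name": "Mode C", "S": 4, "O": 3, "D": 2},  # RPN = 24
]

for r in prioritize(rows):
    print(r["name"], rpn(r["S"], r["O"], r["D"]))
# Mode A ranks first despite its lower RPN because its severity meets the threshold.
```

Pure RPN ranking would put Mode B first (252 > 54); the severity-first rule overrides that, which is exactly the "address high severity independently of RPN" practice recommended above.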

Practical Case: Failure Modes & Effects Analysis (FMEA)

A midsize hospital’s lab was missing its 2‑hour turnaround-time target for emergency blood tests. Delays were inconsistent and difficult to trace. Complaints from the ER were increasing, and a recent near-miss for a critical patient triggered a Lean Six Sigma project.

The team (lab manager, two techs, ER nurse, quality engineer) mapped the blood test process from sample collection to result release. They agreed to apply FMEA to the pre-analytic phase (from physician order to sample receipt in the lab), where most delays seemed to start.

They listed the key process steps: order entry, sample collection, labeling, transport to lab, and sample receipt. For each step, they identified potential failure modes and their effects on turnaround time and patient safety. For example:

- Order not entered correctly in the system.
- Tube mislabeled or unlabeled.
- Sample left in a unit fridge instead of being sent.
- Sample received but not prioritized as “STAT.”

The team estimated severity, occurrence, and detection ratings for each failure mode and calculated risk priority numbers. Two top risks emerged: mislabeled/unlabeled tubes and STAT samples not being flagged or routed separately. They agreed on targeted actions:

- Introduce barcode-based patient ID and label printing at the bedside.
- Add a mandatory STAT flag in the order entry screen.
- Implement a red STAT transport bag system from ER to lab.
- Train staff on the new standard work and visually post the FMEA “top risks and controls” near collection points.

One month after implementation, the lab’s STAT turnaround-time compliance improved from inconsistent levels to reliably meeting the 2‑hour target. Mislabeled samples dropped sharply, and no new delay-related near-misses were reported in the ER.

Practice question: Failure Modes & Effects Analysis (FMEA)

A cross-functional team is updating a Design FMEA for a new valve. A failure mode is identified as “spring fatigue leading to loss of sealing force.” Which of the following best describes what should be recorded under “cause” for this failure mode?

A. Customer complaint about leakage
B. Excessive cyclic loading beyond material endurance limit
C. Valve fails to seal during field operation
D. Operator does not perform preventive maintenance

Answer: B

Reason: In FMEA, “cause” is the specific, actionable mechanism or source that leads directly to the failure mode. Excessive cyclic loading beyond the material endurance limit is a technical root cause of spring fatigue. The other options are not causes of the failure mode itself: A and C are effects/symptoms, and D is a potential contributing factor but not the technical cause of spring fatigue.

---

A Process FMEA team is debating risk prioritization. Two failure modes have the following ratings:

- Failure Mode 1: Severity = 9, Occurrence = 2, Detection = 3
- Failure Mode 2: Severity = 6, Occurrence = 6, Detection = 7

Based solely on the traditional Risk Priority Number (RPN) approach, which failure mode should be prioritized higher?

A. Failure Mode 1 because of its higher severity
B. Failure Mode 2 because it has the higher RPN
C. Failure Mode 1 because it has the lower detection rating
D. Both are equal priority because the sum of S+O+D is the same

Answer: B

Reason: RPN = S × O × D. Failure Mode 1: 9 × 2 × 3 = 54. Failure Mode 2: 6 × 6 × 7 = 252. Based on RPN alone, Failure Mode 2 is prioritized higher due to its significantly higher RPN.
The other options misapply the traditional RPN decision rule or incorrectly assume that total scores or single factors (severity or detection alone) dominate when the question specifies “based solely on the traditional RPN approach.”

---

During a PFMEA for an assembly line, a team identifies the failure mode “wrong component assembled.” Current ratings are S = 8, O = 5, D = 7. The team considers three possible actions:

1. Implement a robust poka-yoke that blocks assembly if the wrong component is present.
2. Add an end-of-line visual inspection.
3. Add a monthly operator training refresher.

Which action is most directly expected to reduce the Detection rating?

A. 1 only
B. 2 only
C. 3 only
D. 1 and 3

Answer: B

Reason: The Detection rating reflects the likelihood that existing controls will detect the failure mode before it reaches the customer. Adding an end-of-line inspection specifically improves the ability to detect wrong components, directly affecting D. Action 1 primarily reduces occurrence because the error is physically prevented; action 3 influences human behavior and may lower occurrence. Neither is a direct detection improvement.

---

In a Design FMEA, a Black Belt wants to use historical field data to improve the Occurrence ratings. Which is the most appropriate way to align FMEA Occurrence scores with quantitative data?

A. Convert field failure rates into DPMO and assign Occurrence based on sigma level only
B. Develop a mapping table that links measured failure rates (e.g., ppm) to Occurrence ranks
C. Replace Occurrence rankings with MTBF values and discontinue qualitative scales
D. Ignore historical data because Occurrence is inherently subjective

Answer: B

Reason: Best practice is to define a standard mapping between actual failure rates (ppm, % defective, etc.) and the qualitative Occurrence rankings used in the FMEA. This aligns ratings with data while preserving the standard scale.
Option A oversimplifies by using sigma level only; C removes the standard FMEA structure; D contradicts data-driven FMEA practice.

---

A supplier quality team uses a PFMEA to evaluate a critical machining process. A key failure mode currently has S = 9, O = 4, D = 8. After implementing an in-process automated gage with 100% inspection and automatic stop, the team must update the FMEA. Assuming no change in process capability, which adjustment is most appropriate?

A. Reduce Severity, keep Occurrence and Detection the same
B. Reduce Occurrence, keep Severity and Detection the same
C. Reduce Detection, keep Severity and Occurrence the same
D. Reduce both Occurrence and Detection

Answer: C

Reason: Introducing a reliable 100% automated in-process check with automatic stop increases the probability of detecting the nonconformance before it escapes, so the Detection rating should be lowered (a lower rating means better detection). Severity (the effect if the failure escapes) and Occurrence (how often the defect is generated) are not inherently changed by detection controls alone. The other options incorrectly assume that detection controls change the inherent occurrence rate or the severity of the failure’s effect.
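The mapping table recommended in the Occurrence-alignment question above can be sketched as a small lookup. The ppm breakpoints below are illustrative assumptions, not values from any published FMEA scale; each organization defines its own mapping:

```python
# Sketch of a mapping table linking measured failure rates (ppm) to
# Occurrence ranks. The breakpoints are illustrative assumptions only.
import bisect

# Upper ppm bound for each Occurrence rank 1..9; anything above the last
# breakpoint maps to rank 10.
PPM_BREAKPOINTS = [1, 10, 100, 500, 2_000, 10_000, 50_000, 100_000, 250_000]

def occurrence_rank(ppm: float) -> int:
    """Translate a measured failure rate in ppm into an Occurrence rank (1-10)."""
    return bisect.bisect_left(PPM_BREAKPOINTS, ppm) + 1

print(occurrence_rank(0.5))     # very rare failure -> rank 1
print(occurrence_rank(50))      # 50 ppm -> rank 3
print(occurrence_rank(300_000)) # beyond the last breakpoint -> rank 10
```

Defining the table once and applying it mechanically keeps Occurrence ratings consistent across FMEAs and reviewers, which is the point of answer B: data-aligned ratings on the standard qualitative scale.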
