
3.3.1 General Concepts & Goals of Hypothesis Testing

Introduction

Hypothesis testing is a structured method for using sample data to make decisions about populations. It provides a formal way to decide whether an observed effect is likely to be real or could reasonably be explained by random variation.

In process improvement and data-driven problem solving, hypothesis testing is used to:

- Compare before/after performance
- Compare performance between groups or conditions
- Validate assumed relationships between variables
- Support decisions with quantified risk of being wrong

This article explains the core concepts and goals of hypothesis testing, focusing on what must be understood to correctly design, interpret, and explain statistical tests.

---

Purpose and Logic of Hypothesis Testing

Decision-Making Under Uncertainty

Data almost always contain random variation. Even if nothing has changed in the underlying process, sample statistics (such as means or proportions) will fluctuate from sample to sample. Hypothesis testing addresses a core question:

- Is the observed difference or relationship likely due to random variation alone, or does it provide evidence of a real effect?

The answer is never certain; hypothesis testing quantifies the risk of making a wrong decision based on sample data.

Competing Claims: Null and Alternative

Hypothesis testing is built on comparing two competing statements about the population:

- Null hypothesis (H₀): the baseline claim, usually representing no difference, no effect, no change, or no relationship
- Alternative hypothesis (H₁ or Hₐ): the claim that contradicts H₀, usually representing a difference, an effect, a change, or a relationship

Examples:

- Mean comparison: H₀: μ₁ = μ₂ vs H₁: μ₁ ≠ μ₂
- Improvement claim: H₀: μ_after ≥ μ_before vs H₁: μ_after < μ_before

The test does not attempt to prove H₀; instead it evaluates how compatible the observed data are with H₀. If compatibility is very low, H₀ is rejected in favor of H₁.
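The point that sample statistics fluctuate even when the underlying process has not changed can be illustrated with a short simulation. This is a minimal sketch: the process mean (50), standard deviation (2), and sample sizes are invented for illustration.

```python
import random
import statistics

# Draw repeated samples from the SAME process (mean 50, sd 2):
# the sample means differ from sample to sample purely by chance.
random.seed(42)  # fixed seed so the illustration is reproducible

sample_means = []
for _ in range(5):
    sample = [random.gauss(50, 2) for _ in range(30)]
    sample_means.append(round(statistics.mean(sample), 2))

print(sample_means)  # five different means, though the process never changed
```

Every mean lands near 50, but none are identical; hypothesis testing exists to judge whether an observed difference exceeds this kind of background fluctuation.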
---

Structure of a Hypothesis Test

Essential Components

Every hypothesis test includes:

- Population parameter being tested: mean, proportion, variance, difference in means, correlation, etc.
- Null hypothesis (H₀): specifies the assumed value or relationship for the parameter
- Alternative hypothesis (H₁): specifies the kind of difference or relationship of interest
- Test statistic: a standardized measure calculated from sample data that reflects how far the sample result is from what H₀ predicts
- Sampling distribution under H₀: describes how the test statistic behaves if H₀ is true (e.g., Z, t, F, chi-square)
- p-value or critical region: the evidence measure used to decide whether to reject H₀
- Decision rule: compare the p-value to α (the significance level) or compare the test statistic to a critical value

One-Tailed vs Two-Tailed Tests

The form of H₁ determines whether the test is one-tailed or two-tailed.

- Two-tailed test: H₀: parameter = value vs H₁: parameter ≠ value; looks for a difference in either direction
- One-tailed test: H₀: parameter ≥ value vs H₁: parameter < value, or H₀: parameter ≤ value vs H₁: parameter > value; looks for a difference in a specified direction only

The choice must be made before examining the data, based on the question of interest and the practical meaning of each direction of change.

---

Significance Level, p-Value, and Decision Rules

Significance Level (α)

The significance level, denoted α, is the maximum tolerable probability of rejecting a true H₀ (Type I error). It is chosen in advance, commonly:

- α = 0.05
- α = 0.01
- α = 0.10 (sometimes in early exploratory work)

Interpretation: if α = 0.05, then when H₀ is true, the procedure will (on average) incorrectly reject H₀ in 5% of repeated tests.

p-Value: Evidence Against H₀

The p-value is the probability, assuming H₀ is true, of obtaining a test statistic at least as extreme as the one actually observed (in the direction(s) specified by H₁).
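These components can be made concrete with a one-sample z-test, where the sampling distribution under H₀ is the standard normal. This is a minimal sketch assuming a known population σ; the data values and target are invented, and only the Python standard library is used.

```python
import math
import statistics

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def one_sample_z_test(data, mu0, sigma, tail="two"):
    """Return (z, p) for a test of H0: mu = mu0, with known population sigma."""
    n = len(data)
    z = (statistics.mean(data) - mu0) / (sigma / math.sqrt(n))
    if tail == "two":          # H1: mu != mu0 (two-tailed)
        p = 2 * (1 - normal_cdf(abs(z)))
    elif tail == "greater":    # H1: mu > mu0 (one-tailed)
        p = 1 - normal_cdf(z)
    else:                      # H1: mu < mu0 (one-tailed)
        p = normal_cdf(z)
    return z, p

# Invented measurements: does the mean differ from a target of 25?
data = [25.8, 26.1, 24.9, 26.4, 25.5, 26.0, 25.2, 26.3]
z, p = one_sample_z_test(data, mu0=25.0, sigma=1.0, tail="two")
print(f"z = {z:.2f}, two-tailed p = {p:.4f}")
```

Note how the `tail` argument mirrors the one-tailed vs two-tailed distinction: the same test statistic is referred to a different region of the sampling distribution depending on the form of H₁.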
Key points:

- Small p-value → data are unlikely under H₀ → stronger evidence against H₀
- Large p-value → data are reasonably compatible with H₀ → insufficient evidence to reject H₀

Common decision rule:

- If p-value ≤ α → reject H₀
- If p-value > α → fail to reject H₀

Note: failing to reject H₀ does not prove H₀ is true; it simply means the data do not provide strong enough evidence against it.

Critical Values and Rejection Regions

An equivalent way to decide is to use critical values:

- Determine the critical value(s) of the test statistic that mark the boundary of the α region under the sampling distribution (assuming H₀).
- If the observed test statistic falls in the rejection region (beyond the critical value(s)), H₀ is rejected.

Both approaches (p-value and critical value) lead to the same conclusion when used consistently.

---

Types of Errors and Power

Type I and Type II Errors

Hypothesis testing decisions can be right or wrong. Two types of errors are possible:

- Type I error (α): rejecting H₀ when H₀ is actually true; its probability is controlled by the chosen significance level α
- Type II error (β): failing to reject H₀ when H₀ is actually false; its probability depends on the true effect size, the sample size, the variability in the data, the chosen α, and the test type and assumptions

There is a trade-off: reducing α (being more conservative about rejecting H₀) typically increases β (more likely to miss real effects), unless sample size is increased.

Statistical Power

Statistical power is 1 − β: the probability of correctly rejecting H₀ when H₀ is false (i.e., detecting a real effect). High power means a good chance of detecting meaningful differences.
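The meanings of α and power can be checked empirically with a small Monte Carlo simulation. This is a rough sketch: the process values (mean 50, σ = 2, n = 20) are invented, and a two-tailed z-test with known σ at α = 0.05 is assumed.

```python
import math
import random

def estimate_rejection_rate(true_mean, mu0=50.0, sigma=2.0, n=20, trials=2000):
    """Fraction of simulated samples whose two-tailed z-test rejects H0: mu = mu0
    at alpha = 0.05 (critical value 1.96), given the true process mean."""
    rejections = 0
    for _ in range(trials):
        sample_mean = sum(random.gauss(true_mean, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / (sigma / math.sqrt(n))
        if abs(z) > 1.96:  # rejection region for alpha = 0.05, two-tailed
            rejections += 1
    return rejections / trials

random.seed(1)  # reproducible illustration
alpha_rate = estimate_rejection_rate(50.0)   # H0 true: rate should be near alpha
power = estimate_rejection_rate(51.5)        # real effect: rate estimates power
print(f"Type I rate ~ {alpha_rate:.3f}, estimated power ~ {power:.3f}")
```

With H₀ true, the rejection rate settles near the chosen α; with a real shift of 1.5 units, the rejection rate estimates power. Rerunning with a smaller n or a larger σ shows how power drops.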
Important factors affecting power:

- Effect size: larger true differences or stronger relationships are easier to detect
- Sample size: larger samples decrease the variability of estimates and increase power
- Variability: lower process or measurement variation increases power
- Significance level α: higher α increases power (but also increases the risk of Type I error)

Planning for adequate power (commonly 0.8 or higher) is fundamental when designing tests, especially when costly decisions depend on the results.

---

Practical vs Statistical Significance

Distinguishing the Two

A test result can be:

- Statistically significant: p-value ≤ α; the effect is unlikely to be due to chance alone, given H₀
- Practically significant: the size of the effect is large enough to matter operationally, financially, or in terms of quality

Key points:

- With very large samples, even tiny, unimportant differences can become statistically significant.
- With small samples, practically important differences may fail to reach statistical significance (low power).

Effective interpretation requires both:

- Assessing statistical evidence (p-value, confidence intervals)
- Assessing practical impact (effect size, cost, risk, customer impact)

Effect Size and Confidence Intervals

To support practical interpretation, it is useful to consider:

- Effect size: the magnitude of the difference or relationship (e.g., difference in means, standardized effect size, difference in proportions)
- Confidence intervals: interval estimates around the parameter that reflect precision; narrow intervals indicate precise estimates, wide intervals indicate uncertainty. If a confidence interval for a difference excludes zero, it aligns with rejecting H₀ at the corresponding α.

Together, p-values, effect sizes, and confidence intervals provide a fuller understanding of results.
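The link between a confidence interval, statistical significance, and practical significance can be sketched as follows. The before/after data are invented, and a large-sample normal approximation (z = 1.96) is used for simplicity; a t-based interval would be more exact for samples this small.

```python
import math
import statistics

def diff_means_ci(a, b, z=1.96):
    """Approximate 95% CI for mean(a) - mean(b), using a normal
    (large-sample) approximation with unpooled variances."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    return diff, (diff - z * se, diff + z * se)

# Invented before/after cycle-time samples (minutes)
before = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7, 12.4, 12.0]
after  = [11.2, 11.5, 11.0, 11.4, 11.3, 11.6, 11.1, 11.2, 11.5, 11.3]

diff, (lo, hi) = diff_means_ci(before, after)
print(f"Observed reduction: {diff:.2f} min, 95% CI: ({lo:.2f}, {hi:.2f})")
# The interval excludes zero (consistent with rejecting H0 of no difference),
# and its magnitude shows whether the change matters in practice.
```

The interval answers two questions at once: zero lying outside it corresponds to statistical significance, while the size of the estimated reduction is what must be judged against operational or financial thresholds.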
---

Assumptions and Validity of Tests

Common Statistical Assumptions

Most parametric hypothesis tests rely on assumptions such as:

- Independence of observations
- Random sampling or random assignment
- Normally distributed data or, for large samples, approximate normality of the sampling distribution
- Equal variances across groups (for certain tests, such as some forms of the t-test and ANOVA)
- Correct data type for the chosen test (e.g., continuous vs categorical)

Violating key assumptions can:

- Distort p-values and confidence intervals
- Inflate Type I or Type II error rates
- Undermine conclusions

Assessing and Addressing Assumptions

Before trusting test results, it is important to understand which assumptions apply to the chosen test and to use:

- Graphical checks (histograms, boxplots, residual plots)
- Summary statistics (e.g., tests for equal variances)
- Knowledge of the process and data collection method

If assumptions are not met, options include:

- Transforming the data (e.g., a log transformation)
- Choosing alternative tests that better match the data characteristics
- Improving data collection methods to respect independence and randomness

---

General Goals of Hypothesis Testing in Improvement Work

Verifying Claims and Changes

Hypothesis testing supports key questions such as:

- Has performance improved after implementing a change?
- Do two processes or suppliers truly differ in performance?
- Is a suspected factor associated with defects or variation?
- Is a new method, product, or setting better or worse than the current one?
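As one simple numeric screen for the equal-variance assumption, the sample variances of two groups can be compared directly. This is a crude rule-of-thumb sketch, not a formal test (an F-test or Levene's test would be the formal route); the machine data are invented.

```python
import statistics

def variance_ratio(a, b):
    """Ratio of the larger to the smaller sample variance: a rough screen
    before tests that assume equal variances (not a formal test)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return max(va, vb) / min(va, vb)

# Invented measurements from two machines producing the same part
machine_a = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 5.1, 5.2]
machine_b = [5.0, 5.6, 4.5, 5.8, 4.4, 5.7, 4.6, 5.5]

ratio = variance_ratio(machine_a, machine_b)
print(f"Variance ratio: {ratio:.1f}")
# A ratio well above roughly 2-3 suggests the equal-variance
# assumption is doubtful and an alternative test may be needed.
```

A screen like this complements, rather than replaces, graphical checks and knowledge of how the data were collected.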
Primary goals:

- Disciplined decision-making: replace intuition-only decisions with quantified evidence
- Risk management: explicitly control and understand the risks of wrong decisions (Type I and Type II errors)
- Objective comparison: use consistent, transparent rules for declaring differences or relationships

Linking to Root Cause and Solution Validation

Within an improvement project, hypothesis testing is used to:

- Screen and confirm potential causes: test whether a suspected factor is statistically associated with the outcome
- Evaluate alternative solutions or settings: compare performance across settings, methods, or designs
- Validate sustained improvement: demonstrate that a measured improvement is unlikely to be due to random fluctuation alone

Hypothesis testing thus connects measurement and analysis to reliable decisions about causes, changes, and controls.

---

Common Misinterpretations to Avoid

Understanding what hypothesis tests do not say is as important as understanding what they do say.

- The p-value is not the probability that H₀ is true; it is the probability of the observed (or more extreme) data, assuming H₀ is true.
- Failing to reject H₀ is not proof that H₀ is true; it may reflect insufficient sample size, high variability, or a real effect that is too small to detect with the current data.
- Statistical significance does not imply a large effect, practical importance, or causation (especially in observational data).

Avoiding these misunderstandings is essential for sound conclusions and credible communication.

---

Choosing an Appropriate Test: Conceptual View

Although implementation details vary, the conceptual choice of a test is guided by:

- Question type: difference in means? difference in proportions? relationship between variables? comparison of variation?
- Data type: continuous, ordinal, or categorical
- Number of groups or conditions: one sample, two samples, multiple samples
- Paired vs independent data: same units measured twice (paired) vs different units (independent)

The core logic remains identical across tests:

- Define H₀ and H₁ about population parameters.
- Compute a test statistic from sample data.
- Evaluate how extreme it is under H₀ (p-value or critical value).
- Make a decision with controlled risk and interpret it in practical context.

---

Summary

Hypothesis testing provides a structured, quantitative approach for making decisions from data in the presence of random variation. Its central elements are:

- Formulating null (H₀) and alternative (H₁) hypotheses about population parameters
- Selecting an appropriate test statistic and understanding its distribution under H₀
- Using the significance level (α) and p-values or critical values to decide whether to reject H₀
- Recognizing Type I and Type II errors and the importance of statistical power
- Distinguishing statistical significance from practical significance, using effect sizes and confidence intervals
- Respecting the assumptions of the tests to ensure valid conclusions
- Applying tests to verify causes, compare alternatives, and confirm improvements with quantified risk

Mastering these general concepts and goals enables rigorous design, execution, and interpretation of hypothesis tests as part of data-driven improvement and decision-making.

Practical Case: General Concepts & Goals of Hypothesis Testing

A regional lab network wants to reduce blood test turnaround time (TAT). Leadership has set a target: “The new barcode system should not increase average TAT compared to the current system.”

Context

A Lean Six Sigma team pilots a new barcode labeler on one analyzer line. Staff complain it is “slowing everything down.” Leadership hesitates to roll it out network-wide without evidence.

Problem

The team needs to decide whether the observed TAT differences with the new system are just normal variation or reflect a real, meaningful change that could hurt service levels.

Applying General Concepts & Goals of Hypothesis Testing

The team:

- Frames a decision question: “Is the new barcode system performing at least as well as the current system in average TAT?”
- Defines two competing explanations: either any TAT difference is due to random day‑to‑day variation in the lab, or the barcode system truly changes average TAT.
- Collects a manageable sample of TAT data before and after the change instead of measuring every single test.
- Uses a statistical test to compare the two explanations and quantify how compatible the observed data are with the “no real change” explanation.
- Predefines how much evidence is needed to consider the change harmful, to avoid a decision based on anecdotes or single “bad days.”

Result

The analysis shows the observed TAT increase is small and statistically indistinguishable from typical daily variation. The team concludes there is no evidence of a real deterioration in performance and recommends full deployment, with a plan to monitor TAT over time.
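The team's before/after comparison could be framed along these lines. This is a hedged sketch: the daily TAT figures are invented, and a normal approximation with unpooled variances is used for simplicity where a two-sample t-test would be more typical for samples this small.

```python
import math
import statistics

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def two_sample_z(a, b):
    """Approximate two-tailed test for a difference in means
    (normal approximation with unpooled variances)."""
    diff = statistics.mean(b) - statistics.mean(a)
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    z = diff / se
    p = 2 * (1 - normal_cdf(abs(z)))
    return diff, p

# Invented daily mean TAT (minutes) before and after the barcode pilot
before = [42, 45, 41, 44, 43, 46, 42, 44, 43, 45, 41, 44]
after  = [44, 43, 45, 42, 46, 44, 43, 45, 44, 46, 42, 45]

diff, p = two_sample_z(before, after)
alpha = 0.05  # evidence threshold chosen before looking at the data
decision = "reject H0" if p <= alpha else "fail to reject H0"
print(f"Observed TAT change: {diff:+.2f} min, p = {p:.3f} -> {decision}")
```

With data like these, the small observed increase is well within typical day-to-day variation, matching the case's conclusion: no evidence of real deterioration, so the decision rests on the predefined α rather than on anecdotes.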

Practice question: General Concepts & Goals of Hypothesis Testing

A Black Belt wants to determine whether a new plating process has changed the mean coating thickness from the historical target of 25 μm. The correct formulation of the hypotheses for a two-sided test is:

A. H0: μ = 25; H1: μ ≠ 25
B. H0: μ ≠ 25; H1: μ = 25
C. H0: x̄ = 25; H1: x̄ ≠ 25
D. H0: μ > 25; H1: μ ≤ 25

Answer: A

Reason: The null hypothesis specifies the status-quo value (μ = 25), and a two-sided alternative tests for any difference (μ ≠ 25). The hypotheses are stated in terms of the population mean μ, not the sample mean. The other options invert H0/H1, use sample statistics instead of population parameters, or specify one-sided tests, which do not match the stated goal.

---

A team conducts a hypothesis test at α = 0.05 and obtains a p-value of 0.18. What is the most appropriate interpretation for decision-making?

A. Fail to reject H0; there is not sufficient evidence to support H1 at 5% significance.
B. Reject H0; there is a statistically significant effect at 5% significance.
C. Accept H0; it is proven to be true with 95% confidence.
D. Increase α to 0.20 so H0 can be rejected.

Answer: A

Reason: Since the p-value (0.18) > α (0.05), we fail to reject H0 and conclude that the sample does not provide sufficient evidence of an effect at the chosen significance level. The other options incorrectly claim proof of H0, incorrectly reject H0, or suggest manipulating α post hoc, which is not statistically valid.

---

In designing a hypothesis test for a critical safety characteristic, a Black Belt wants to minimize the risk of releasing nonconforming product (consumer’s risk). Which statement best describes the type of error that should be minimized?

A. Minimize Type I error, which is the risk of rejecting a true H0.
B. Minimize Type II error, which is the risk of failing to reject a false H0.
C. Minimize both Type I and Type II errors to zero by increasing sample size.
D. Minimize the p-value by reducing the significance level α.

Answer: B

Reason: Consumer’s risk (β) corresponds to Type II error: failing to reject a false null hypothesis, which here means passing product that is actually nonconforming. The other options confuse the error types, propose an impossible elimination of both errors, or misinterpret the effect of changing α on p-values.

---

A Black Belt is selecting an appropriate hypothesis test. The goal is to compare the population mean cycle time of a single process against a known specification target, with unknown population standard deviation and a small sample (n = 12). Which is the best choice?

A. 1-sample t-test for the mean
B. 1-sample z-test for the mean
C. 2-sample t-test for means
D. 1-sample proportion test

Answer: A

Reason: For a single mean compared to a known target with unknown σ and small n, the 1-sample t-test is appropriate. The other options either require known σ, compare two samples, or apply to proportions, none of which fit the stated objective.

---

In the context of Lean Six Sigma project decision-making, what is the primary goal of conducting a formal hypothesis test on process performance data?

A. To confirm that process improvements are practically important regardless of sample size.
B. To use sample data to make an objective, probabilistic decision about the population parameter.
C. To calculate exact probabilities of defects at the individual unit level.
D. To prove with certainty that the improvement has eliminated all variation.

Answer: B

Reason: Hypothesis testing uses sample evidence to make objective, probability-based inferences about population parameters and to guide decisions (e.g., sustain, adjust, or abandon a change). The other options confuse practical vs. statistical significance, focus on unit-level predictions, or incorrectly claim certainty and the elimination of all variation.
