## 3.3 Hypothesis Testing
### Introduction to Hypothesis Testing

Hypothesis testing is a structured method for using sample data to make decisions about a population. It answers questions such as:

- Is the process mean different from the target?
- Has a change in the process improved performance?
- Are two processes producing different results?
- Are two variables statistically associated?

At its core, hypothesis testing compares what you observe in data with what you would expect to see if there were no real effect or difference.

---

### Core Concepts and Logic

#### Hypotheses and the Null Model

A hypothesis test always starts with two competing statements about a population parameter (such as a mean, proportion, or variance):

- Null hypothesis (H₀): the status quo or "no effect" statement.
- Alternative hypothesis (H₁ or Hₐ): the statement that there is a difference, effect, or relationship.

Examples:

- Mean: H₀: μ = 10 vs. H₁: μ ≠ 10 (two-sided)
- Proportion: H₀: p = 0.95 vs. H₁: p < 0.95 (one-sided, lower-tail)
- Difference in means: H₀: μ₁ − μ₂ = 0 vs. H₁: μ₁ − μ₂ > 0 (one-sided, upper-tail)

The null hypothesis defines a model of "no true change or difference." The test evaluates how consistent the sample data are with this null model.

#### Test Statistic and Sampling Distribution

A test statistic converts the observed sample data into a single standardized value that can be compared to a known reference distribution.

- Test statistic: a function of the data and the null hypothesis.
- Sampling distribution: the probability distribution of the test statistic when H₀ is true.

Examples:

- z statistic for means when σ is known and the sample is large
- t statistic for means when σ is unknown and the sample is small
- χ² statistic for variances or associations in contingency tables
- F statistic for comparing two variances

The more extreme the test statistic (in the direction of H₁), the stronger the evidence against H₀.
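To make the idea of a test statistic concrete, here is a minimal sketch of a one-sample z statistic computed by hand, using only the Python standard library. All of the numbers are hypothetical and chosen purely for illustration.

```python
from statistics import NormalDist

# Hypothetical numbers: testing H0: mu = 10 against H1: mu != 10
x_bar = 10.4   # observed sample mean
mu_0 = 10.0    # hypothesized mean under H0
sigma = 1.5    # population standard deviation, assumed known
n = 36         # sample size

# Standardize: how many standard errors is x_bar from mu_0?
z = (x_bar - mu_0) / (sigma / n ** 0.5)

# Two-sided p-value: probability of a |Z| at least this extreme under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2), round(p_value, 4))  # z = 1.6, p ≈ 0.1096
```

At α = 0.05 this p-value would lead to failing to reject H₀: a sample mean of 10.4 is not unusual if the true mean is 10.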
#### p-value and Decision Rule

The p-value is the probability, assuming H₀ is true, of observing a test statistic as extreme as or more extreme than the one obtained from the data.

- Small p-value: data are unlikely under H₀, providing evidence against H₀.
- Large p-value: data are consistent with H₀; there is not enough evidence to reject H₀.

General decision rule at significance level α:

- If p ≤ α: reject H₀ (evidence supports H₁).
- If p > α: fail to reject H₀ (insufficient evidence to support H₁).

Failing to reject H₀ is not the same as proving H₀ true; it means the data do not contradict H₀ strongly enough.

#### Significance Level and Confidence Level

The significance level α is the maximum tolerable probability of rejecting a true H₀ (Type I error). Common choices:

- α = 0.05 (corresponds roughly to 95% confidence)
- α = 0.01 (more stringent)
- α = 0.10 (less stringent, used in some exploratory settings)

Connection to confidence intervals:

- A two-sided hypothesis test at level α corresponds to a (1 − α) confidence interval.
- If the hypothesized parameter value lies outside the (1 − α) confidence interval, H₀ is rejected at level α.

---

### Errors in Hypothesis Testing

#### Type I and Type II Errors

In any test, two types of decision errors are possible:

- Type I error (α): rejecting H₀ when H₀ is actually true. Its probability is controlled directly by setting α.
- Type II error (β): failing to reject H₀ when H₁ is actually true. It depends on sample size, effect size, α, and data variability.

There is a trade-off between α and β for a fixed sample size: lowering α usually increases β unless the sample size is increased.

#### Power of a Test

Power is the probability of correctly rejecting H₀ when H₁ is true:

- Power = 1 − β

High power means the test is sensitive enough to detect meaningful differences. Power increases when:

- The true difference (effect size) is larger.
- The sample size is larger.
- The data variability is smaller.
- The significance level α is higher (less stringent).
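The effect of sample size on power can be illustrated with a small Monte Carlo sketch; all parameter values here are hypothetical. It estimates how often a two-sided one-sample z test at α = 0.05 rejects H₀ when the true mean really has shifted.

```python
import random
from statistics import NormalDist

random.seed(42)

def estimated_power(n, true_mu, mu_0=10.0, sigma=1.5, alpha=0.05, trials=2000):
    """Monte Carlo estimate of power for a two-sided one-sample z test."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # rejection cutoff, ~1.96
    rejections = 0
    for _ in range(trials):
        sample = [random.gauss(true_mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        z = (x_bar - mu_0) / (sigma / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

# Same true shift (effect size), larger sample -> higher power
low = estimated_power(n=10, true_mu=10.8)
high = estimated_power(n=40, true_mu=10.8)
print(low, high)
```

With the same true shift of 0.8 units, the simulated rejection rate rises sharply as n increases from 10 to 40, which is exactly the sample-size effect listed above.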
Power analysis is used to:

- Determine the required sample size before data collection.
- Evaluate whether a non-significant result might be due to low power.

---

### Assumptions and Data Conditions

#### Common Statistical Assumptions

Many parametric hypothesis tests rely on assumptions:

- Independence: observations are independent of each other.
- Normality: data, or sample means, are approximately normally distributed.
- Equal variances (homoscedasticity): required for certain tests comparing two or more groups.
- Measurement scale: means require interval or ratio data; proportions require binary (yes/no) data; some nonparametric tests use ordinal or ranked data.

When assumptions are violated, results can be misleading. In such cases, consider:

- Transformations (for example, a log transformation) to approximate normality.
- Nonparametric tests that rely less on distributional assumptions.
- Adjusted tests that do not assume equal variances.

#### Checking Normality and Stability

To justify many tests, it is important to know whether the data meet approximate normality and stability requirements. Typical techniques:

- Normal probability plots (Q-Q plots): visual check for approximately straight-line behavior.
- Histograms: shape inspection for symmetry and unimodality.
- Descriptive statistics: skewness, kurtosis, and outliers.

For subgroup means or large samples, the central limit theorem often supports approximate normality of the sampling distribution even if the raw data are not perfectly normal.

---

### Parametric Tests for Means and Variances

#### One-Sample z and t Tests for Means

Use these tests to compare a sample mean to a known or target value.

- One-sample z test: population standard deviation (σ) known, or very large sample size.
  - Test statistic: z = (x̄ − μ₀) / (σ / √n)
- One-sample t test: σ unknown; the standard deviation is estimated from the sample (s).
  - Test statistic: t = (x̄ − μ₀) / (s / √n)

Hypotheses:

- H₀: μ = μ₀
- H₁: μ ≠ μ₀, μ > μ₀, or μ < μ₀

Select z or t based on knowledge of σ and the sample size; use the corresponding normal or t distribution to compute the p-value.

#### Two-Sample Tests for Means (Independent Samples)

Used to compare means from two independent groups or processes.

Assumptions:

- Observations in each group are independent.
- Within-group data are approximately normal.
- For pooled tests, equal variances; otherwise, use the unequal-variance (Welch) approach.

Typical test: the two-sample t test.

- H₀: μ₁ − μ₂ = 0
- H₁: μ₁ − μ₂ ≠ 0, or one-sided variants
- Use the pooled variance if variances are assumed equal; otherwise, use separate variances.

The choice of a one-sided vs. two-sided test is based on the improvement objective and risk tolerance; two-sided tests are more conservative.

#### Paired t Test

Used when measurements are naturally paired, such as before–after measurements on the same unit or matched pairs.

Key idea:

- Transform the paired data (Xᵢ, Yᵢ) to differences Dᵢ = Xᵢ − Yᵢ.
- Perform a one-sample t test on D:
  - H₀: μ_D = 0
  - H₁: μ_D ≠ 0, μ_D > 0, or μ_D < 0

Assumptions:

- Differences are approximately normally distributed.
- Pairs are independent of each other.

The paired t test removes between-unit variation and focuses directly on the change within each pair.

#### Tests for Variances: χ² and F

Variance tests are used to check process stability and to compare variability between processes.

- One-sample χ² test for variance:
  - H₀: σ² = σ₀²
  - Test statistic: χ² = (n − 1)s² / σ₀², compared to the chi-square distribution with n − 1 degrees of freedom.
- Two-sample F test for variances:
  - H₀: σ₁² = σ₂²
  - Test statistic: F = s₁² / s₂², compared to the F distribution with (n₁ − 1, n₂ − 1) degrees of freedom.

Variance tests are sensitive to non-normality; verify approximate normality before relying on them.
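In practice, these tests are rarely computed by hand. The sketch below runs them with SciPy (assuming `numpy` and `scipy` are available); the data are synthetic and the machine names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic cycle-time data (seconds) from two hypothetical machines
machine_a = rng.normal(loc=12.0, scale=1.0, size=30)
machine_b = rng.normal(loc=11.4, scale=1.3, size=30)

# One-sample t test: is machine A's mean different from a 12.0 s target?
t_one, p_one = stats.ttest_1samp(machine_a, popmean=12.0)

# Two-sample Welch t test: equal variances NOT assumed
t_two, p_two = stats.ttest_ind(machine_a, machine_b, equal_var=False)

# Paired t test: before/after measurements on the same 30 units,
# simulated so the change removes roughly 0.5 s per unit
before = machine_a
after = machine_a - rng.normal(loc=0.5, scale=0.3, size=30)
t_pair, p_pair = stats.ttest_rel(before, after)

# Two-sample F test for variances: F = s1^2 / s2^2, two-sided p-value
f_stat = machine_a.var(ddof=1) / machine_b.var(ddof=1)
df1, df2 = len(machine_a) - 1, len(machine_b) - 1
p_f = 2 * min(stats.f.cdf(f_stat, df1, df2), stats.f.sf(f_stat, df1, df2))
```

Note that `equal_var=False` selects the Welch approach described above; the pooled-variance version is obtained with `equal_var=True`.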
---

### Tests for Proportions and Categorical Data

#### One-Sample Proportion Test

Used when the data represent counts of successes and failures (for example, defective vs. non-defective units).

- H₀: p = p₀
- H₁: p ≠ p₀, p > p₀, or p < p₀

Test statistic (valid when np and n(1 − p) are both sufficiently large):

- z = (p̂ − p₀) / √( p₀(1 − p₀) / n )

where:

- p̂ = observed sample proportion
- n = sample size

For small samples, exact binomial tests may be needed instead of the normal approximation.

#### Two-Sample Proportion Test

Used to compare two independent proportions, such as defect rates from two lines.

- H₀: p₁ − p₂ = 0
- H₁: p₁ − p₂ ≠ 0, or one-sided variants

For large samples, use a z test based on the pooled proportion under H₀:

- p̂_pooled = (x₁ + x₂) / (n₁ + n₂)
- z = (p̂₁ − p̂₂) / √( p̂_pooled (1 − p̂_pooled) (1/n₁ + 1/n₂) )

where x₁, x₂ are the numbers of successes and n₁, n₂ are the sample sizes.

#### Chi-Square Test for Independence (Contingency Tables)

Used when both variables are categorical and the question is whether they are associated.

- H₀: the variables are independent.
- H₁: the variables are not independent.

Procedure:

- Construct a contingency table of counts.
- Compute the expected counts under independence: Eᵢⱼ = (row total × column total) / grand total.
- Compute the test statistic: χ² = Σ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ.
- Compare to the chi-square distribution with (r − 1)(c − 1) degrees of freedom, where r and c are the numbers of rows and columns.

Check that the expected counts are not too small; if they are, combine categories or use exact tests.

---

### Nonparametric Tests for Non-Normal Data

#### When to Use Nonparametric Tests

Nonparametric tests are useful when:

- Data are highly skewed or contain extreme outliers.
- The measurement scale is ordinal (ranks) rather than interval or ratio.
- Sample sizes are small and the assumption of normality cannot be justified.

Nonparametric tests typically:

- Use ranks instead of raw values.
- Test hypotheses about medians or distributions rather than means.
- Are less powerful than parametric tests when parametric assumptions hold, but more robust when they do not.

#### Common Nonparametric Tests

Essential nonparametric counterparts include:

- One-sample Wilcoxon signed-rank test: alternative to the one-sample t test when data are not normal; tests the median against a hypothesized value.
- Wilcoxon signed-rank test (paired): alternative to the paired t test; tests whether the median of the paired differences is zero.
- Mann–Whitney test (Wilcoxon rank-sum): alternative to the two-sample t test for independent groups; tests whether two distributions differ in location (median) without assuming normality.

For each test, the general logic mirrors parametric tests:

- Define H₀ and H₁.
- Compute a test statistic based on ranks.
- Obtain the p-value from the relevant sampling distribution (exact or approximate).

---

### Multiple Comparisons and Familywise Error

#### Concept of Multiple Testing

When performing many hypothesis tests, the probability of making at least one Type I error across all tests (the familywise error rate) increases.

Example: if each test uses α = 0.05 and the tests are independent, performing many tests makes at least one false positive almost certain over time.

This is relevant when comparing multiple groups, many factors, or many pairwise differences.

#### Adjustments for Multiple Comparisons

To control the familywise error rate, significance levels can be adjusted. A common method is the Bonferroni adjustment:

- Divide the overall α by the number of comparisons k: α_individual = α_overall / k.
- Each test then uses α_individual instead of α.

More sophisticated methods exist, but the key idea is to recognize that multiple tests require more conservative decision rules to maintain overall error control.

---

### Practical Interpretation of Results

#### Statistical vs. Practical Significance

A statistically significant result (p ≤ α) does not automatically imply that the effect is practically meaningful. Consider:

- Effect size:
  - Difference between means (absolute or standardized).
  - Ratio of variances.
  - Difference between proportions.
- Practical implications:
  - Cost impact.
  - Customer satisfaction.
  - Process capability improvements.

Large data sets can detect very small differences as statistically significant. Always interpret p-values together with effect size and context.

#### Confidence Intervals and Effect Size

Confidence intervals provide:

- A range of plausible values for the parameter.
- Direct insight into both magnitude and uncertainty.

Use them to:

- Assess whether the effect is large enough to matter.
- Communicate results more meaningfully than p-values alone.

For example, if the 95% confidence interval for a mean difference is (0.1, 0.3) units, and a 0.05-unit change is considered meaningful, the effect is both statistically and practically important.

---

### Hypothesis Testing in Process Improvement

#### Typical Questions Addressed

Hypothesis tests are central when evaluating changes and making data-based decisions, such as:

- Is the new method reducing the mean cycle time?
- Has the defect rate changed relative to the baseline?
- Are different machine settings producing different variability?
- Is there an association between a categorical factor and defect occurrence?

Each question can be translated into:

- A clear null hypothesis representing no improvement or change.
- An alternative hypothesis representing the desired effect or relationship.

#### Structuring a Test in Practice

To use hypothesis testing effectively:

1. Define the objective: specify the parameter of interest (mean, proportion, variance) and state H₀ and H₁ clearly, including directionality.
2. Select the test: based on the data type (continuous vs. discrete), the number of groups (one, two, or more), and the assumptions (normality, equal variances, pairing).
3. Plan the sample: determine the sample size needed to achieve adequate power, and ensure sampling is representative and unbiased.
4. Perform the analysis: check assumptions with exploratory plots and basic statistics; compute the test statistic, p-value, and confidence interval.
5. Draw conclusions: decide to reject or fail to reject H₀ at the chosen α, interpret the effect size and confidence interval, and translate statistical conclusions into process or business implications.

---

### Common Pitfalls and Good Practices

#### Typical Misinterpretations

Avoid these common errors:

- Treating the p-value as the probability that H₀ is true.
- Believing that failure to reject H₀ proves there is no difference.
- Ignoring assumptions and applying tests mechanically.
- Focusing only on p-values and ignoring effect sizes.
- Conducting many tests and selectively reporting only the significant ones.

#### Robust Practices

Strengthen hypothesis testing by:

- Clarifying the question before looking at the data.
- Pre-selecting α, the test type, and the decision rules.
- Checking data quality and assumptions first.
- Reporting the test used and its assumptions, the p-value and α, and confidence intervals and effect sizes.
- Discussing both statistical and practical implications.

---

### Summary

Hypothesis testing provides a disciplined way to decide, from data, whether observed differences or relationships are likely to be real or just due to random variation. It relies on:

- Clear statements of H₀ and H₁.
- Selection of appropriate tests for means, proportions, variances, and categorical data.
- Understanding of Type I and Type II errors, power, and the role of α.
- Attention to assumptions, data conditions, and sample size.
- Correct use of p-values, confidence intervals, and effect sizes.
- Awareness of multiple testing issues and practical significance.

Mastering these elements allows you to convert raw data into reliable, actionable conclusions about process performance and changes.
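To connect the proportion and contingency-table tests described earlier to code, here is a minimal SciPy sketch (the defect counts are hypothetical). It also illustrates that, on a 2×2 table, the pooled two-sample proportion z test and the chi-square test of independence are equivalent.

```python
import numpy as np
from scipy import stats

# Hypothetical defect counts from two production lines
x1, n1 = 18, 400   # line 1: 18 defective units out of 400
x2, n2 = 35, 420   # line 2: 35 defective units out of 420

# Two-sample proportion z test with the pooled proportion under H0
p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)
se = (p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2)) ** 0.5
z = (p1 - p2) / se
p_value_z = 2 * stats.norm.sf(abs(z))  # two-sided p-value

# Chi-square test of independence on the equivalent 2x2 table
table = np.array([[x1, n1 - x1],
                  [x2, n2 - x2]])
chi2, p_value_chi2, dof, expected = stats.chi2_contingency(table,
                                                           correction=False)

# Without the continuity correction, chi2 equals z**2 on a 2x2 table,
# so both tests give the same two-sided p-value
```

The equivalence is a useful sanity check: whichever form you report, the statistical conclusion about the two defect rates is the same.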
### Practical Case: Hypothesis Testing

A regional lab network was under pressure because doctors complained that patient test reports arrived late. Leadership believed a new barcode intake system had reduced average turnaround time, but staff were unconvinced and resisted a full rollout.

The Black Belt collected turnaround-time data for a sample of orders processed before and after the barcode system went live in one pilot lab. The question was: "Has the new process actually reduced average turnaround time, or are we seeing normal variation?"

Using hypothesis testing, the Black Belt:

- Treated "no change in average turnaround time" as the baseline assumption.
- Compared the two data sets (before vs. after implementation) with an appropriate statistical test, checking that the observed difference in averages was not likely due to random variation alone.
- Controlled the risk of wrongly claiming improvement by using a predefined significance level.

The analysis showed a statistically significant reduction in average turnaround time, with no meaningful increase in error rates. Based on this result, leadership approved full deployment of the barcode system, and frontline staff accepted the change because the decision was grounded in objective evidence rather than opinion.
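The before/after comparison in this case can be sketched as follows; the turnaround times below are invented for illustration, not the lab's actual data, and a one-sided Welch t test stands in for whatever "appropriate statistical test" the Black Belt chose.

```python
from scipy import stats

# Invented turnaround times in hours (not the lab's real data)
before = [27.1, 24.8, 30.2, 26.5, 23.9, 28.4, 25.7, 29.0, 26.2, 27.8]
after = [23.4, 22.1, 25.0, 21.8, 24.2, 22.9, 23.7, 20.5, 24.8, 22.3]

alpha = 0.05  # significance level fixed before looking at the data

# One-sided Welch t test:
# H0: no change in mean turnaround time; H1: the mean was reduced
t_stat, p_value = stats.ttest_ind(before, after, equal_var=False,
                                  alternative='greater')
improved = p_value <= alpha
```

Because α is chosen before the data are examined, the risk of wrongly declaring an improvement is capped at 5%, which mirrors the discipline described in the case.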
### Practice Questions: Hypothesis Testing

**Question 1.** A Black Belt wants to test whether a new filling process has reduced the mean fill time from the current standard of 12.0 seconds. A random sample of 40 observations from the new process yields a sample mean of 11.6 seconds, and the population standard deviation is known to be 1.2 seconds. Which is the most appropriate hypothesis test?

A. One-sample t-test, two-sided
B. One-sample z-test, one-sided (lower tail)
C. One-sample z-test, two-sided
D. Paired t-test, one-sided (lower tail)

Answer: B. The population standard deviation is known and n > 30, so a one-sample z-test is appropriate; the objective is to show that the mean has been reduced (μ < 12), so a one-sided lower-tail test is required. The other options either use the wrong distribution (t instead of z) or the wrong tail or structure for the stated objective (two-sided, paired).

---

**Question 2.** A team performs a two-sample t-test (equal variances not assumed) to compare mean cycle times between two machines. The p-value of the test is 0.18 at α = 0.05. Which conclusion is most appropriate?

A. Fail to reject H₀; there is insufficient evidence that the mean cycle times differ.
B. Reject H₀; there is sufficient evidence that the mean cycle times differ.
C. Fail to reject H₀; the two machines have exactly the same mean cycle time.
D. Reject H₀; Machine 1 has a significantly lower mean cycle time.

Answer: A. Since the p-value (0.18) exceeds α (0.05), we fail to reject the null hypothesis and cannot conclude a statistically significant difference in mean cycle times. The other options either interpret p > α as evidence of a difference or make a stronger claim of equality or direction than the test supports.

---

**Question 3.** A Black Belt tests whether the defect rate of a new process is lower than the historical defect rate of 4%. In a sample of 300 units, 6 units are defective. Which is the most appropriate hypothesis formulation?

A. H₀: p = 0.02, H₁: p ≠ 0.02
B. H₀: p ≥ 0.04, H₁: p < 0.04
C. H₀: p ≤ 0.04, H₁: p > 0.04
D. H₀: p = 0.04, H₁: p < 0.02

Answer: B. The historical benchmark is 4%, and the objective is to show improvement (a defect rate lower than 4%), so a one-sided lower-tail test with null p ≥ 0.04 and alternative p < 0.04 is appropriate. The other options use the wrong benchmark, the wrong direction (greater than), or an incorrect alternative value.

---

**Question 4.** A Black Belt performs a paired t-test on before/after measurements of processing time for 25 orders after a software update. Normality of the differences is questionable. Which is the best next step to ensure valid hypothesis testing?

A. Use an F-test for equality of variances instead.
B. Increase α to 0.10 to compensate for non-normality.
C. Use a nonparametric alternative such as the Wilcoxon signed-rank test.
D. Transform the raw before and after data to categorical values and run a chi-square test.

Answer: C. When the normality of paired differences is doubtful, a nonparametric paired test (the Wilcoxon signed-rank test) is the standard alternative to the paired t-test. The other options change α inappropriately, test a different hypothesis structure (F-test, chi-square), or fail to address the distributional assumption.

---

**Question 5.** In a two-sample t-test for means (independent samples), the team incorrectly assumes equal variances when in fact the group variances are substantially different. What is the most likely impact on the hypothesis test?

A. Type I error risk may be misstated; p-values may be inaccurate.
B. No impact; t-tests are completely robust to variance inequality.
C. Only the confidence interval width changes; the hypothesis test remains exact.
D. Only the sign of the test statistic changes, not its magnitude.

Answer: A. Using the pooled-variance t-test when variances are unequal can distort the test statistic and degrees of freedom, leading to inaccurate p-values and Type I error rates. The other options claim no impact, or limited impacts that do not reflect the test's sensitivity to the equal-variance assumption.
