3.2 Inferential Statistics
Inferential Statistics Introduction

Inferential statistics is the set of methods used to draw conclusions about a population based on data from a sample. It answers questions such as:

- Is there evidence of a real effect or difference?
- How large is that effect?
- How certain are we about our estimates?

This article develops the core ideas, assumptions, and tools needed to correctly apply inferential statistics in improvement projects and data-based decision making.

---

Populations, Samples, and Parameters

Concepts and Notation

Inferential statistics always distinguishes between:

- Population: Entire group of interest.
- Sample: Subset of the population actually measured.
- Parameter: Fixed (usually unknown) numerical feature of the population.
- Statistic: Numerical summary calculated from sample data, used to estimate parameters.

Common notation:

- Population mean: μ
- Sample mean: x̄
- Population standard deviation: σ
- Sample standard deviation: s
- Population proportion: p
- Sample proportion: p̂

Inferential methods use sample statistics (x̄, s, p̂) to make statements about parameters (μ, σ, p).

Types of Data

Correct choice of inferential method depends on data type:

- Continuous data: Numeric, can take any value in an interval (e.g., time, weight).
- Discrete counts: Integers, counts of events or items.
- Attribute data (binary): Pass/fail, yes/no, defect/ok.
- Ordinal: Ordered categories (e.g., rating scales).

Most methods discussed here focus on continuous and binary/attribute data, as they are central to inferential analysis in process improvement.

---

Sampling and Sampling Distributions

Sampling and Randomness

Inferential validity requires appropriate sampling:

- Random sampling: Every population unit has a known, non-zero chance of selection.
- Independence: Each observation is not influenced by others.
- Representative sample: The sample reflects key characteristics of the population.
Violations (e.g., strong autocorrelation, systematic selection bias) can invalidate inferences.

Central Limit Theorem (CLT)

The Central Limit Theorem underpins many inferential methods: for sufficiently large n, the sampling distribution of the sample mean x̄ is approximately normal, regardless of the population distribution, provided:

- Observations are independent and identically distributed.
- The population variance is finite.

Key implications:

- For large n, x̄ is approximately normal with mean μ and standard deviation σ/√n.
- This justifies using normal-based confidence intervals and tests even when data are not perfectly normal, especially for larger sample sizes.

Standard Error

The standard error (SE) is the standard deviation of a sampling distribution. It measures how much a statistic varies from sample to sample. Common forms:

- Mean (σ known): SE(x̄) = σ / √n
- Mean (σ unknown): SE(x̄) ≈ s / √n
- Proportion: SE(p̂) = √[ p̂(1 − p̂) / n ]
- Difference in means (independent samples, σ unknown): SE(x̄₁ − x̄₂) = √( s₁²/n₁ + s₂²/n₂ )

The SE decreases with larger sample sizes, giving more precise estimates.

---

Confidence Intervals

Concept of Confidence

A confidence interval (CI) provides a plausible range of values for an unknown parameter.

- The confidence level (e.g., 95%) is the long-run proportion of intervals that would contain the true parameter if we repeated the sampling process many times.
- A 95% CI does not guarantee that the probability the parameter lies in this particular interval is 95%; instead, the method has 95% long-run coverage.

General Structure of Confidence Intervals

For many parameters, the CI takes the form:

Estimate ± (critical value) × SE

Where:

- Estimate: Sample statistic (x̄, p̂, x̄₁ − x̄₂, etc.)
- Critical value: z* from the standard normal distribution, t* from the t distribution, or a value from a χ² or F distribution.
- SE: Standard error of the estimator.

Confidence Interval for a Mean (σ Known)

Assuming:

- The population is normal, or n is large.
- The population standard deviation σ is known.

Then the 100(1 − α)% CI for μ is:

μ ∈ x̄ ± z_{α/2} × (σ / √n)

Where z_{α/2} is the critical z-value (e.g., 1.96 for 95% confidence).

Confidence Interval for a Mean (σ Unknown, t Distribution)

More commonly in practice, σ is unknown. Assuming:

- Data are approximately normal or n is sufficiently large.
- s is the sample standard deviation.

Then the 100(1 − α)% CI for μ is:

μ ∈ x̄ ± t_{α/2, df} × (s / √n)

Where t_{α/2, df} is the critical t-value with df = n − 1. Use the t-based CI whenever σ is not known.

Confidence Interval for a Proportion

For sufficiently large n with np̂ ≥ 5 and n(1 − p̂) ≥ 5, the 100(1 − α)% CI for p is:

p ∈ p̂ ± z_{α/2} × √[ p̂(1 − p̂) / n ]

If these conditions fail, exact or adjusted methods should be used.

Confidence Interval for Difference in Means

Assuming independent samples from two populations:

- Sample means: x̄₁, x̄₂
- Sample sizes: n₁, n₂
- Sample standard deviations: s₁, s₂

The point estimate for μ₁ − μ₂ is x̄₁ − x̄₂. For the general case of unequal variances (Welch's t):

- SE = √( s₁²/n₁ + s₂²/n₂ )
- CI: (μ₁ − μ₂) ∈ (x̄₁ − x̄₂) ± t_{α/2, df} × SE

Where df is approximated using the Welch–Satterthwaite formula.

Confidence Interval for Difference in Proportions

For independent samples:

- p̂₁ = x₁/n₁; p̂₂ = x₂/n₂
- Point estimate: p̂₁ − p̂₂
- SE = √[ p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂ ]

Then the 100(1 − α)% CI is:

(p₁ − p₂) ∈ (p̂₁ − p̂₂) ± z_{α/2} × SE

---

Hypothesis Testing Fundamentals

Basic Concepts

Hypothesis testing evaluates competing claims about a population parameter:

- Null hypothesis (H₀): Default assumption, usually "no effect," "no difference," or a specific target value.
- Alternative hypothesis (H₁ or Hₐ): Statement representing a meaningful effect, difference, or deviation from the null.

Examples:

- H₀: μ = μ₀ vs H₁: μ ≠ μ₀ (two-sided)
- H₀: μ ≤ μ₀ vs H₁: μ > μ₀ (one-sided)
- H₀: p₁ = p₂ vs H₁: p₁ ≠ p₂

Type I and Type II Errors

- Type I error (α): Rejecting H₀ when H₀ is true.
- Type II error (β): Failing to reject H₀ when H₀ is false.

Related concepts:

- Significance level α: Chosen probability of Type I error (commonly 0.05 or 0.01).
- Power (1 − β): Probability of correctly rejecting a false H₀.

Balancing α and power is important when designing tests and choosing sample sizes.

P-Values and Decision Rules

The p-value is the probability, under H₀, of obtaining a test statistic as extreme as or more extreme than the one observed, in the direction of H₁.

Decision rule:

- If p-value ≤ α: Reject H₀ (statistically significant evidence).
- If p-value > α: Do not reject H₀ (insufficient evidence).

Cautions:

- A statistically significant result does not guarantee practical significance.
- Non-significant results do not prove H₀ is true; they indicate insufficient evidence.

---

z and t Tests for Means

One-Sample z Test (σ Known)

Used when:

- Testing the population mean μ.
- The population standard deviation σ is known.
- Data are normal or n is large.

Hypotheses example:

- H₀: μ = μ₀
- H₁: μ ≠ μ₀

Test statistic:

z = (x̄ − μ₀) / (σ / √n)

Compare z to critical values or compute the p-value.

One-Sample t Test (σ Unknown)

More common in practice:

- σ is unknown; use the sample standard deviation s.
- Data are approximately normal (especially important for small n).

Test statistic:

t = (x̄ − μ₀) / (s / √n), with df = n − 1

Use the t distribution to find p-values or critical values.

Two-Sample t Test for Means (Independent Samples)

Used to compare the means of two independent groups.

Hypotheses example:

- H₀: μ₁ = μ₂
- H₁: μ₁ ≠ μ₂

Assumptions:

- Independent random samples.
- Approximately normal populations (or large samples).
- For the general case, variances may be unequal.

Test statistic (Welch's t):

t = (x̄₁ − x̄₂) / √( s₁²/n₁ + s₂²/n₂ )

Degrees of freedom are approximated with the Welch–Satterthwaite formula (software usually handles this).

Variants: the pooled t test assumes equal variances; use it only if that assumption is justified.
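The Welch statistic and its approximate degrees of freedom can be computed directly from the two formulas above. A minimal sketch, using only the standard library; the two samples are illustrative values, not data from this article:

```python
import math
from statistics import mean, stdev

def welch_t(sample1, sample2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = mean(sample1), mean(sample2)
    v1, v2 = stdev(sample1) ** 2, stdev(sample2) ** 2   # sample variances s²
    se = math.sqrt(v1 / n1 + v2 / n2)                   # SE of the difference
    t = (m1 - m2) / se
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df

# Illustrative (made-up) samples
a = [10, 11, 12, 13, 14]
b = [14, 15, 16, 17, 18]
t, df = welch_t(a, b)
print(round(t, 2), round(df, 2))  # -4.0 8.0
```

With |t| = 4.0 and df ≈ 8, |t| exceeds the two-sided critical value t_{0.025, 8} = 2.306, so H₀: μ₁ = μ₂ would be rejected at α = 0.05 for these illustrative data.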
Paired t Test

Used when data are naturally paired:

- Before/after measurements on the same unit.
- Matched pairs (e.g., twins, matched machines).

Transform the paired data into differences dᵢ = afterᵢ − beforeᵢ.

Hypotheses example:

- H₀: μ_d = 0 (no mean difference)
- H₁: μ_d ≠ 0

Compute:

- d̄ = mean of differences
- s_d = standard deviation of differences
- n = number of pairs

Test statistic:

t = d̄ / (s_d / √n), with df = n − 1

Paired tests remove between-subject variability, often increasing power.

---

Tests for Proportions

One-Proportion z Test

Used to test whether the population proportion p equals a specified value p₀.

Hypotheses example:

- H₀: p = p₀
- H₁: p ≠ p₀

Conditions: np₀ ≥ 5 and n(1 − p₀) ≥ 5 (approximate normality).

Test statistic:

z = (p̂ − p₀) / √[ p₀(1 − p₀) / n ]

Two-Proportion z Test

Used to compare proportions from two independent groups.

Hypotheses example:

- H₀: p₁ = p₂
- H₁: p₁ ≠ p₂

Compute:

- p̂₁ = x₁/n₁
- p̂₂ = x₂/n₂
- Pooled proportion under H₀: p̂ = (x₁ + x₂) / (n₁ + n₂)

Standard error under H₀:

SE = √[ p̂(1 − p̂)(1/n₁ + 1/n₂) ]

Test statistic:

z = (p̂₁ − p̂₂) / SE

Conditions: n₁p̂, n₁(1 − p̂), n₂p̂, and n₂(1 − p̂) must all be suitably large.

---

Nonparametric Tests

Nonparametric tests do not assume normal distributions and are useful when:

- Data are skewed or contain outliers.
- The measurement scale is ordinal.
- Sample sizes are small and normality is doubtful.

Sign Test

Used for:

- Paired data or one-sample median tests.
- Testing whether the median equals a hypothesized value.

Procedure:

- For paired data, compute differences and record only the signs (+ or −).
- Ignore zeros.
- Under H₀ (no difference), plus and minus signs are equally likely.

Approximate or exact binomial methods are used to derive p-values.

Wilcoxon Signed-Rank Test

Used as a nonparametric alternative to the paired t test.

Assumptions:

- Symmetric distribution of differences around the median.
- At least an ordinal scale.

Procedure:

- Compute differences and discard zeros.
- Rank the absolute differences and assign signs.
- Sum the positive and negative ranks.
- The test statistic is based on these signed ranks.

Mann–Whitney (Wilcoxon Rank-Sum) Test

Used as a nonparametric alternative to the two-sample t test.

Assumptions:

- Independent samples.
- Similar shapes of distributions (shifts in central tendency).

Procedure:

- Combine the data from both groups and rank all values.
- Sum the ranks for one group.
- The test statistic evaluates whether one group tends to have larger (or smaller) values than the other.

---

Chi-Square Tests

Chi-Square Goodness-of-Fit Test

Used to test whether observed categorical frequencies match expected frequencies under a specific distribution or model.

Hypotheses example:

- H₀: The data follow a specified distribution.
- H₁: The data do not follow that distribution.

Test statistic:

χ² = Σ[(Oᵢ − Eᵢ)² / Eᵢ]

Where:

- Oᵢ = observed count in category i
- Eᵢ = expected count in category i

Assumption: expected counts Eᵢ are generally ≥ 5.

Degrees of freedom:

df = k − 1 − m

Where k = number of categories and m = number of parameters estimated from the data.

Chi-Square Test of Independence

Used for contingency tables (e.g., r × c) to test whether two categorical variables are associated.

Hypotheses:

- H₀: The variables are independent.
- H₁: The variables are not independent.

Expected counts:

Eᵢⱼ = (row totalᵢ × column totalⱼ) / grand total

Test statistic:

χ² = ΣΣ[(Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ]

Degrees of freedom:

df = (r − 1)(c − 1)

Conditions: expected counts should generally be ≥ 5 in most cells.

---

Analysis of Variance (ANOVA)

One-Way ANOVA

Used to compare the means of three or more independent groups.

Hypotheses:

- H₀: μ₁ = μ₂ = … = μₖ (all group means equal)
- H₁: At least one mean differs

Assumptions:

- Independent random samples.
- Approximately normal distributions within groups.
- Homogeneous variances across groups.

Core idea: compare the variability between group means to the variability within groups using an F statistic.
Test statistic:

F = MS_between / MS_within

Where:

- MS_between = variation due to group differences
- MS_within = variation within groups (error term)

Decision: a large F suggests at least one group mean differs; evaluate via the p-value.

ANOVA indicates that not all means are equal but does not identify which means differ. Post-hoc comparisons or planned contrasts are used for detailed follow-up.

---

Correlation and Simple Linear Regression

Correlation

Correlation measures the linear association between two continuous variables. The Pearson correlation coefficient r:

- Range: −1 to 1
- r > 0: positive linear association
- r < 0: negative linear association
- |r| close to 1: strong linear relationship
- |r| close to 0: weak or no linear relationship

Hypothesis test for correlation:

- H₀: ρ = 0 (no linear correlation in the population)
- H₁: ρ ≠ 0

Test statistic:

t = r√[(n − 2)/(1 − r²)], with df = n − 2

Assumptions:

- Approximately bivariate normal data.
- A linear relationship, if any relationship is present.

Correlation does not imply causation.

Simple Linear Regression

Simple linear regression models the relationship between one predictor (X) and one response (Y).

Model:

Y = β₀ + β₁X + ε

Where:

- β₀: intercept
- β₁: slope
- ε: random error term

Key goals:

- Estimate β₀ and β₁ from sample data.
- Test whether there is a statistically significant linear relationship (slope ≠ 0).
- Predict Y for given values of X (with associated uncertainty).

Inference on the slope:

- H₀: β₁ = 0
- H₁: β₁ ≠ 0

Test statistic:

t = (b₁ − 0) / SE(b₁), with df = n − 2

Assumptions:

- Linearity.
- Independence of errors.
- Constant variance of errors (homoscedasticity).
- Approximately normal errors.

---

Statistical Power and Sample Size

Power Concepts

Power analysis helps ensure tests are capable of detecting meaningful effects. Key components:

- Effect size: Magnitude of difference or change that is practically important.
- Significance level (α): Type I error rate.
- Power (1 − β): Desired probability of detecting the effect if it exists.
- Sample size (n): Number of observations.

Relationships:

- Larger n increases power for a fixed α and effect size.
- Smaller α (more conservative) lowers power, all else equal.
- A larger effect size is easier to detect, increasing power.

Sample Size Determination (Conceptual)

Typical objectives:

- For means: choose n to detect a specific difference Δ with given α and power.
- For proportions: choose n to detect a difference in proportions.

General idea: the required n grows as

- the desired effect size Δ shrinks,
- the variation (σ or p(1 − p)) increases,
- the desired power increases, and
- α decreases.

Exact formulas depend on the test type (means vs proportions, one-sample vs two-sample), but they all rely on normal approximations to test statistics.

---

Assumptions and Common Pitfalls

Key Assumptions to Check

Before relying on inferential results, verify:

- Independence of observations.
- Appropriate model choice for the data type (continuous vs attribute).
- Normality for parametric tests when sample sizes are small.
- Equal variances for tests that assume homogeneity (pooled t, ANOVA).
- Sufficient counts for approximate methods in proportions and chi-square tests.

Graphical tools (histograms, boxplots, residual plots, normal probability plots) and numeric diagnostics assist with these checks.

Pitfalls

Common issues that distort inference:

- Overreliance on p-values without considering effect size and confidence intervals.
- Multiple testing without adjustment, leading to inflated overall Type I error.
- Ignoring non-independence (e.g., time series, batch effects).
- Misinterpretation of non-significant results as proof of no effect.
- Using parametric tests on heavily skewed data with very small samples, without considering nonparametric alternatives.

Careful design, exploratory analysis, and validation of assumptions reduce these risks.
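The sample-size relationships described above can be made concrete with the standard normal-approximation formula for a two-sided one-sample test on a mean, n = ((z_{α/2} + z_β)·σ/Δ)². A minimal sketch; the numeric inputs (σ = 2, Δ = 1) are illustrative, and the z quantiles are taken from a standard normal table:

```python
import math

def sample_size_mean(sigma, delta, z_alpha_2=1.959964, z_beta=0.841621):
    """Approximate n to detect a mean shift of delta with a two-sided z test.

    Defaults correspond to alpha = 0.05 (two-sided) and power = 0.80;
    the quantiles are standard normal table values.
    """
    n = ((z_alpha_2 + z_beta) * sigma / delta) ** 2
    return math.ceil(n)  # round up so the achieved power is at least the target

# Illustrative: sigma = 2, detect a shift of 1 unit at alpha = 0.05, power = 0.80
print(sample_size_mean(sigma=2, delta=1))  # 32
```

Note how n grows quadratically: halving Δ (or doubling σ) quadruples the required sample size, which matches the qualitative relationships listed above.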
---

Summary

Inferential statistics provides the framework for learning about populations from samples through parameter estimation, confidence intervals, and hypothesis testing. Core tools include:

- Confidence intervals and hypothesis tests for means and proportions.
- z and t tests for one and two samples, with paired tests when data are matched.
- Nonparametric methods (sign, Wilcoxon, Mann–Whitney) when parametric assumptions are doubtful.
- Chi-square tests for categorical data and ANOVA for multiple mean comparisons.
- Correlation and simple regression for linear relationships.
- Power and sample size planning to ensure meaningful differences can be detected.

Mastering these concepts requires understanding assumptions, interpreting p-values and confidence intervals correctly, and focusing on both statistical and practical significance.
Practical Case: Inferential Statistics

A regional hospital wants to reduce emergency department (ED) wait times. The Lean Six Sigma team tests a new triage process on a pilot group of patients over two weeks, while the rest of the ED continues with the existing process.

The problem: leadership needs to know whether the observed reduction in average wait time in the pilot area is due to the new triage process and not just random variation in daily patient flow.

The team collects a sample of wait-time data:

- Pilot area (new process) over two weeks.
- Comparable area (old process) over the same period.

Using inferential statistics, the team:

- Compares the mean wait times between the two groups and calculates the probability that the observed difference could occur by chance if there were actually no real improvement.
- Uses the result to infer whether the new process truly reduces wait time for the broader ED population, not just the sampled patients.

Result: the analysis shows a very low probability that the improvement is due to chance, so the team infers the new process is genuinely better and recommends full ED rollout, supported by quantified evidence rather than anecdotal impressions.
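A comparison like the hospital's can be sketched as a two-sample z test on summary statistics, with a one-sided p-value from the normal approximation. All numbers below (means, standard deviations, sample sizes) are hypothetical, since the case gives no raw data:

```python
import math

def two_sample_z_pvalue(m1, s1, n1, m2, s2, n2):
    """One-sided p-value that group 1's mean exceeds group 2's (normal approx.)."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # SE of the difference in means
    z = (m1 - m2) / se
    # P(Z > z) via the complementary error function (no external libraries)
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p

# Hypothetical wait-time summaries in minutes: old process vs pilot (new) process
z, p = two_sample_z_pvalue(m1=42.0, s1=10.0, n1=100,   # old process area
                           m2=37.0, s2=10.0, n2=100)   # pilot area
print(round(z, 2), round(p, 4))  # 3.54 0.0002
```

For these illustrative numbers the one-sided p-value is far below 0.05, which is the kind of result that would support the rollout recommendation described in the case.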
Practice question: Inferential Statistics

A Black Belt is designing an experiment to estimate the mean cycle time of a machining process. The population standard deviation is unknown, and the sample size will be n = 16. The data are reasonably normal. Which distribution should be used to construct a 95% confidence interval for the mean?

A. Normal (Z) distribution
B. t distribution with 15 degrees of freedom
C. Chi-square distribution with 15 degrees of freedom
D. F distribution with (1, 15) degrees of freedom

Answer: B

Reason: With unknown σ and a small sample from a normal population, the correct distribution for the mean is the t distribution with n − 1 = 15 degrees of freedom. Z is for known σ or large n; chi-square is for variances; F is for variance ratios, not means.

---

A process engineer compares the average defect count per batch before and after a process change. The same 20 batches were measured before and after, and the differences appear approximately normal. Which hypothesis test is most appropriate?

A. 2-sample t test assuming equal variances
B. Paired t test
C. 1-sample Z test
D. Chi-square goodness-of-fit test

Answer: B

Reason: Measurements are taken on the same batches before and after, making the data paired; the paired t test uses the distribution of the differences to test the mean change. A 2-sample t test ignores the pairing, Z requires known σ, and chi-square goodness-of-fit is for categorical distributions, not paired mean comparisons.

---

A Black Belt tests whether a new setup reduces mean setup time. H₀: μ = 40 minutes, Hₐ: μ < 40 minutes. A sample of 36 setups yields a sample mean of 37 minutes and a known σ = 9 minutes. Using α = 0.05, which is the correct conclusion?

A. Fail to reject H₀; no statistically significant reduction
B. Reject H₀; statistically significant reduction
C. Fail to reject H₀; mean significantly greater than 40 minutes
D. Reject H₀; mean significantly different from 40 minutes

Answer: B

Reason: Z = (37 − 40)/(9/√36) = −3/1.5 = −2.0. For a one-sided lower-tail test at α = 0.05, the critical value is Z = −1.645. Since −2.0 < −1.645 (p ≈ 0.023), we reject H₀ and conclude there is a statistically significant reduction in mean setup time. A is the opposite decision; C and D do not match the one-sided lower-tail alternative.

---

A Black Belt wants to compare the mean tensile strength of three formulations of a material using independent samples of equal size, with normality reasonably satisfied but unknown and potentially unequal variances. Which is the most appropriate primary inferential tool?

A. One-way ANOVA assuming equal variances
B. Kruskal–Wallis test
C. 3-sample t tests with Bonferroni adjustment
D. 3 separate 2-sample Z tests

Answer: B

Reason: With potentially unequal variances, classical one-way ANOVA's homogeneity-of-variance assumption (A) is doubtful, and the Kruskal–Wallis test compares the central tendencies of more than two groups without requiring it. Multiple t or Z tests (C, D) inflate the overall Type I error and still rely on stronger parametric assumptions.

---

A Black Belt assessed the effect of an improvement on defect rate. Before the change: 300 units, 45 defective. After the change: 400 units, 32 defective. Which test is most appropriate to determine if the proportion of defectives has changed?

A. 1-sample proportion test on the after data only
B. 2-proportion z test
C. 2-sample t test on proportions
D. Chi-square goodness-of-fit test with 1 category

Answer: B

Reason: Two independent binomial samples (before vs after) are being compared for a difference in proportions; the 2-proportion z test is the standard inferential method. A ignores the before data, C is for continuous means, and D compares observed vs expected counts across categories, not two independent proportions.
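The numbers in the last question (45/300 defective before, 32/400 after) can be run through the two-proportion z test described earlier. A minimal sketch using the pooled-proportion formula from the article:

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Two-proportion z statistic with the pooled proportion under H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                       # pooled estimate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_prop_z(45, 300, 32, 400)   # before: 15% defective; after: 8%
print(round(z, 2))                 # 2.93
```

Since |z| ≈ 2.93 exceeds the two-sided critical value of 1.96 at α = 0.05, the change in defect rate is statistically significant, consistent with the improvement claimed in the question.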
