
3.4.1 1 & 2 sample t-tests

1 & 2 sample t-tests

Purpose of t-tests

A t-test is a statistical hypothesis test used to compare means when population standard deviation is unknown and sample sizes are relatively small. It is central to testing improvements and differences in process performance.

- 1-sample t-test: Compare a sample mean to a known or target value.
- 2-sample t-test: Compare means from two independent groups or conditions.

Both rely on the t-distribution and require careful attention to assumptions and correct interpretation of results.

---

Core concepts and notation

Parameters, statistics, and notation

In t-tests, it is important to distinguish between population values and sample estimates.

- Population mean: μ
- Sample mean: x̄
- Population standard deviation: σ (unknown in t-tests)
- Sample standard deviation: s
- Sample size: n
- Difference in means: μ₁ − μ₂ or x̄₁ − x̄₂
- Hypothesized mean: μ₀
- Hypothesized difference: Δ₀ (often 0)

The t-statistic has the general form:

- t = (estimate − hypothesized value) / standard error

The standard error (SE) reflects the variability of the estimate, not just raw data spread.

---

t-distribution and degrees of freedom

Why the t-distribution is used

When σ is unknown and estimated by s, the test statistic follows a t-distribution rather than a normal distribution, especially for smaller samples.

Key properties:

- Symmetric and bell-shaped.
- Centered at zero.
- Heavier tails than the normal distribution (more probability in extremes).
- Approaches normal as degrees of freedom increase.

Degrees of freedom (df)

Degrees of freedom quantify how much independent information is available to estimate variability.

- 1-sample t-test: df = n − 1
- 2-sample t-test (pooled): df = n₁ + n₂ − 2
- 2-sample t-test (unpooled): df approximated using a formula (Welch’s df, usually computed by software)

Larger df means the t-distribution becomes closer to the normal distribution.
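The heavier tails and the convergence toward the normal distribution described above can be seen numerically. The following is a minimal sketch using only the Python standard library; the function names `t_pdf` and `normal_pdf` and the evaluation point x = 3 are illustrative choices, not part of the original material.

```python
import math


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    # Work on the log scale: lgamma avoids overflow of gamma() for large df.
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def normal_pdf(x: float) -> float:
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


if __name__ == "__main__":
    x = 3.0  # a point out in the tail (illustrative choice)
    print(f"normal        : {normal_pdf(x):.5f}")
    for df in (2, 5, 30, 1000):
        # Tail density shrinks toward the normal value as df grows.
        print(f"t with df={df:<4}: {t_pdf(x, df):.5f}")
```

For small df the tail density at x = 3 is several times the normal density, which is why critical t-values are larger than the corresponding normal values for small samples.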
---

1-sample t-test

Purpose and hypotheses

The 1-sample t-test checks whether a sample mean differs significantly from a specified reference value.

Typical questions:

- Is the process mean equal to the target?
- Has the mean changed from a previous standard?

Hypotheses:

- Two-sided test
  - H₀: μ = μ₀
  - H₁: μ ≠ μ₀
- Upper-tailed test
  - H₀: μ ≤ μ₀
  - H₁: μ > μ₀
- Lower-tailed test
  - H₀: μ ≥ μ₀
  - H₁: μ < μ₀

Choice of one-sided vs two-sided must be specified before seeing the data.

Assumptions

- Data are a random sample from the population of interest.
- Observations are independent.
- Population distribution is approximately normal, or the sample is large enough for the t-test to be robust.
- Measurement scale is continuous (or at least interval).

When normality is questionable and sample size is small, interpret results with caution.

Formula and calculation

Given:

- Sample size: n
- Sample mean: x̄
- Sample standard deviation: s
- Hypothesized mean: μ₀

Standard error:

- SE(x̄) = s / √n

Test statistic:

- t = (x̄ − μ₀) / SE(x̄) = (x̄ − μ₀) / (s / √n)

Degrees of freedom:

- df = n − 1

Using t and df, obtain:

- p-value from the t-distribution.
- Critical t-values for a chosen significance level α.

Decision rules and interpretation

At significance level α (commonly 0.05):

- p-value ≤ α: Reject H₀. Evidence suggests μ differs from μ₀ in the direction of the alternative hypothesis.
- p-value > α: Do not reject H₀. Data do not provide strong enough evidence of a difference.

A statistically significant result does not automatically imply practical significance; the size of the mean difference must be considered.

Confidence interval for the mean

A (1 − α)100% confidence interval for μ:

- x̄ ± t_{α/2, df} × SE(x̄)
- t_{α/2, df} is the critical value from the t-distribution for the chosen α and df.

Interpretation:

- If this interval does not contain μ₀, then H₀: μ = μ₀ would be rejected at significance level α (two-sided test).
- The interval provides a range of plausible values for the true mean.

---

2-sample t-test (independent samples)

Purpose and hypotheses

The 2-sample t-test compares means from two independent groups or conditions.

Typical questions:

- Does mean performance differ between two processes?
- Is there a difference between current and new methods when samples are independent?

Hypotheses (for difference μ₁ − μ₂):

- Two-sided test
  - H₀: μ₁ − μ₂ = Δ₀ (often Δ₀ = 0)
  - H₁: μ₁ − μ₂ ≠ Δ₀
- Upper-tailed test
  - H₀: μ₁ − μ₂ ≤ Δ₀
  - H₁: μ₁ − μ₂ > Δ₀
- Lower-tailed test
  - H₀: μ₁ − μ₂ ≥ Δ₀
  - H₁: μ₁ − μ₂ < Δ₀

Usually Δ₀ = 0 to test for equality of means.

Assumptions

- Each sample is a random sample from its population.
- Observations are independent both within and between groups.
- Each population is approximately normal, or sample sizes are large enough.
- Measurement scale is continuous (or at least interval).
- Equal-variance or unequal-variance assumption chosen appropriately (see below).

Violation of independence or severe non-normality can seriously affect the validity of conclusions.

Equal-variance vs unequal-variance t-test

There are two versions:

- Pooled-variance t-test
  - Assumes population variances are equal: σ₁² = σ₂².
  - Uses a pooled estimate of variance.
- Welch’s t-test (unequal variances)
  - Does not assume equal variances.
  - Uses separate variance estimates and an adjusted df.

In practice:

- When in doubt, the unequal-variance version is safer.
- Equal-variance test is appropriate only when evidence supports similar variances (for example, variance ratio reasonably close to 1 and no strong theoretical reason to expect large differences).
Pooled-variance 2-sample t-test

Given:

- Sample sizes: n₁, n₂
- Sample means: x̄₁, x̄₂
- Sample standard deviations: s₁, s₂
- Hypothesized difference: Δ₀ (often 0)

Pooled variance:

- sₚ² = [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / (n₁ + n₂ − 2)

Standard error of the difference:

- SE(x̄₁ − x̄₂) = √[ sₚ²(1/n₁ + 1/n₂) ]

Test statistic:

- t = [ (x̄₁ − x̄₂) − Δ₀ ] / SE(x̄₁ − x̄₂)

Degrees of freedom:

- df = n₁ + n₂ − 2

Unequal-variance 2-sample t-test (Welch’s)

Standard error of the difference:

- SE(x̄₁ − x̄₂) = √[ s₁²/n₁ + s₂²/n₂ ]

Test statistic:

- t = [ (x̄₁ − x̄₂) − Δ₀ ] / SE(x̄₁ − x̄₂)

Degrees of freedom (Welch–Satterthwaite approximation):

- df ≈ [ (s₁²/n₁ + s₂²/n₂)² ] / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ]

This df is often non-integer; software uses it directly.

Decision rules and interpretation

At significance level α:

- p-value ≤ α: Reject H₀. Evidence suggests a difference in means in the direction of H₁.
- p-value > α: Do not reject H₀. Evidence is insufficient to claim a difference.

Always interpret:

- Direction of difference (which mean is larger).
- Magnitude of the difference.
- Practical relevance in the process context.

Confidence interval for difference in means

For (μ₁ − μ₂), a (1 − α)100% confidence interval:

- (x̄₁ − x̄₂) ± t_{α/2, df} × SE(x̄₁ − x̄₂)

Interpretation:

- If the interval for μ₁ − μ₂ includes 0, there is no statistically significant difference at level α (two-sided).
- If the interval is entirely above 0, μ₁ is likely larger than μ₂.
- If the interval is entirely below 0, μ₁ is likely smaller than μ₂.

The interval provides both direction and magnitude of the estimated difference.

---

One-sided vs two-sided tests

Choosing the test direction

- Two-sided: Use when any difference (higher or lower) is of interest or when direction is not pre-specified.
- One-sided: Use only when the question is truly directional and a difference in the opposite direction would not trigger the same action.

Once a test direction is chosen, it must not be changed after seeing results. Changing from two-sided to one-sided post hoc artificially inflates the chance of false positives.

Relationship to confidence intervals

- A two-sided test at significance level α corresponds to a (1 − α)100% confidence interval.
- For a one-sided test, confidence intervals can be one-sided (upper or lower bounds).

---

Effect size and practical significance

Effect size for t-tests

Statistical significance does not guarantee practical impact. Effect sizes quantify the magnitude of differences.

Common measures:

- Mean difference: x̄ − μ₀ (1-sample) or x̄₁ − x̄₂ (2-sample).
- Standardized effect size (Cohen’s d):
  - 1-sample (using s): d = (x̄ − μ₀) / s
  - 2-sample (pooled): d = (x̄₁ − x̄₂) / sₚ

Standardized effect sizes help compare across different scales and studies.

Practical vs statistical significance

Consider:

- Process requirements or specification limits.
- Cost, risk, and benefit of acting on a detected difference.
- Confidence interval bounds relative to what is practically important.

An observed difference can be statistically significant and still be too small to matter in practice.

---

Assumption checks and robustness

Normality considerations

The t-test is reasonably robust to moderate departures from normality, especially when:

- Sample sizes are moderate or large (for example, n ≥ 30 per group).
- Distributions are roughly symmetric without strong outliers.

When assumptions are questionable:

- Inspect data with plots (for example, histograms, boxplots, normal probability plots).
- Be cautious if sample sizes are small and distributions are heavily skewed or have outliers.

Equal variance assessment (for 2-sample tests)

To decide between pooled and unequal-variance tests:

- Compare sample standard deviations s₁ and s₂.
- Consider variance ratio: max(s₁², s₂²) / min(s₁², s₂²).
- Ratios near 1 support equal-variance assumption; very large ratios argue against it.
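The pooled and Welch formulas given earlier, together with the variance-ratio check above, can be sketched from summary statistics alone. This is an illustrative sketch, not a reference implementation: the function names, the `TwoSampleResult` container, and the summary statistics used in the example are all hypothetical.

```python
import math
from typing import NamedTuple


class TwoSampleResult(NamedTuple):
    t: float   # test statistic
    df: float  # degrees of freedom (may be non-integer for Welch)
    se: float  # standard error of the difference in means


def pooled_t(m1, s1, n1, m2, s2, n2, delta0=0.0) -> TwoSampleResult:
    """Equal-variance (pooled) 2-sample t statistic."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return TwoSampleResult((m1 - m2 - delta0) / se, n1 + n2 - 2, se)


def welch_t(m1, s1, n1, m2, s2, n2, delta0=0.0) -> TwoSampleResult:
    """Unequal-variance (Welch) t statistic with Welch-Satterthwaite df."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return TwoSampleResult((m1 - m2 - delta0) / se, df, se)


def variance_ratio(s1: float, s2: float) -> float:
    """Larger sample variance divided by the smaller one."""
    return max(s1**2, s2**2) / min(s1**2, s2**2)


if __name__ == "__main__":
    # Hypothetical summary statistics: (mean, sd, n) for each group.
    group1 = (42.0, 4.0, 20)
    group2 = (45.5, 5.0, 18)
    print("variance ratio:", variance_ratio(group1[1], group2[1]))
    print("pooled:", pooled_t(*group1, *group2))
    print("welch :", welch_t(*group1, *group2))
```

With these hypothetical inputs the two versions give similar t-statistics, and the Welch df is somewhat lower than n₁ + n₂ − 2. The gap between the two tests grows as the variance ratio moves away from 1 and the sample sizes become more unbalanced.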
When the equal-variance assumption is doubtful, the unequal-variance test is preferred.

---

Type I and Type II errors and power

Error types

- Type I error (α): Rejecting a true null hypothesis.
- Type II error (β): Failing to reject a false null hypothesis.

In t-tests, α is the pre-set significance level. Power = 1 − β, the probability of correctly detecting a true difference.

Factors affecting power

Power in 1- and 2-sample t-tests increases when:

- Sample size increases.
- True difference in means is larger.
- Data variability (s) is smaller.
- Significance level α is higher (for example, 0.10 vs 0.05).
- For 2-sample tests, more balanced sample sizes (n₁ ≈ n₂).

When planning data collection, power analysis can guide adequate sample sizes to detect meaningful differences, but the key concept is understanding how these factors influence the reliability of test outcomes.

---

Common pitfalls and best practices

Frequent mistakes

- Ignoring assumptions about independence and normality.
- Choosing equal-variance t-tests when variances clearly differ.
- Confusing statistical significance with practical importance.
- Running multiple t-tests without adjusting for increased risk of Type I error.
- Switching from two-sided to one-sided tests after seeing data.
- Over-interpreting non-significant results as proof of no difference.

Good practice guidelines

- Define hypotheses, test direction, and α before examining the data.
- Examine data visually to check for outliers and distribution shape.
- Use unequal-variance 2-sample t-tests when variance equality is uncertain.
- Always report:
  - Test type (1-sample or 2-sample, equal or unequal variance).
  - Hypotheses and α level.
  - t-statistic, df, and p-value.
  - Confidence intervals for means or mean differences.
  - Practical interpretation in process terms.

---

Summary

1- and 2-sample t-tests are methods for testing claims about means when population variance is unknown. They rely on the t-distribution and are structured around clear hypotheses:

- The 1-sample t-test compares a sample mean to a specified value, using t = (x̄ − μ₀) / (s/√n) with df = n − 1.
- The 2-sample t-test compares two independent means, with versions for equal or unequal variances, and uses the difference in sample means divided by an appropriate standard error.

Correct application requires:

- Understanding of assumptions (random sampling, independence, approximate normality, variance behavior).
- Careful choice between one-sided and two-sided tests.
- Use of confidence intervals to interpret magnitude and direction of effects.
- Distinguishing between statistical and practical significance.
- Awareness of Type I and Type II errors and how sample size and variability affect power.

Mastery of these concepts enables accurate, meaningful use of 1- and 2-sample t-tests to evaluate and compare process performance.
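The 1-sample formula restated in the summary, t = (x̄ − μ₀) / (s/√n) with df = n − 1, can be sketched from raw data as follows. The function name `one_sample_t`, the five data values, and the target of 15 are hypothetical. The critical value 2.776 is the standard t-table entry for a two-sided α = 0.05 with df = 4; it is passed in by hand here because the Python standard library has no t-quantile function.

```python
import math
import statistics


def one_sample_t(data, mu0, t_crit):
    """1-sample t statistic, df, two-sided CI, and Cohen's d.

    t_crit is the two-sided critical value t_{alpha/2, df}, looked up
    from a t-table or software by the caller (an assumption of this sketch).
    """
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)      # sample SD (n - 1 in the denominator)
    se = s / math.sqrt(n)           # standard error of the mean
    return {
        "t": (mean - mu0) / se,
        "df": n - 1,
        "ci": (mean - t_crit * se, mean + t_crit * se),
        "d": (mean - mu0) / s,      # standardized effect size
    }


if __name__ == "__main__":
    times = [14.0, 16.0, 15.0, 17.0, 18.0]  # hypothetical measurements
    result = one_sample_t(times, mu0=15.0, t_crit=2.776)
    for key, value in result.items():
        print(key, value)
```

In this hypothetical case the confidence interval contains the target value, which agrees with |t| being smaller than the critical value: the two-sided test and the interval lead to the same decision.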

Practical Case: 1 & 2 sample t-tests

A medical device plant is struggling with long assembly times for a new product. Management set a target: average assembly time must be 15 minutes or less.

1-Sample t-Test (Compare to a Target)

The industrial engineer times a random sample of 25 assemblies from the new line.

Goal: test if the average assembly time is greater than the 15-minute target.

They run a 1-sample t-test comparing the sample mean to 15 minutes.

Result: p-value < 0.05, mean > 15.

Conclusion: the process is statistically slower than the target; improvement is required.

The Lean Six Sigma team launches a SMED-style changeover reduction and standardized work project.

2-Sample t-Test (Compare Two Processes)

After improvements, the engineer collects another random sample of 25 assembly times from the improved line.

Goal: test if the improved line is faster than the original line.

They run a 2-sample t-test comparing:

- Sample 1: “before improvement” times
- Sample 2: “after improvement” times

Result: p-value < 0.05, mean(after) < mean(before).

Conclusion: the new process is statistically faster.

Management uses these results to:

- Approve the new standard work
- Update staffing and scheduling assumptions based on the improved, verified mean time

Practice question: 1 & 2 sample t-tests

A Black Belt is comparing the mean cycle time of a process before and after a Kaizen event using data from the same 15 work orders measured pre- and post-improvement. Which hypothesis test is most appropriate?

A. 1-sample t-test
B. 2-sample t-test (independent)
C. Paired t-test (2-sample matched pairs)
D. ANOVA F-test

Answer: C

Reason: Measurements are taken on the same items before and after; this creates paired data. The appropriate test is a paired t-test comparing the mean of within-pair differences to zero. A is incorrect because there are two related conditions, not one sample. B assumes independent samples, which is violated here. D is for comparing more than two means, not two paired conditions.

---

A process owner believes the mean fill weight of a product is different from the target of 500 g. A Black Belt collects a random sample of 30 units, finds a sample mean of 495 g, a sample standard deviation of 12 g, and uses a 1-sample t-test at α = 0.05. The calculated t-statistic is −2.28, and the two-sided p-value is 0.029. What is the correct conclusion?

A. Fail to reject H₀; there is no significant difference from 500 g
B. Reject H₀; the mean fill weight differs significantly from 500 g
C. Fail to reject H₀; the sample size is too small to make any inference
D. Reject H₀; the mean fill weight is significantly less than 500 g

Answer: B

Reason: H₀: μ = 500, H₁: μ ≠ 500. The two-sided p-value (0.029) is less than α (0.05), so H₀ is rejected and we conclude that the mean differs from 500 g. Because the test was two-sided, the formal conclusion is only that a difference exists; the direction is suggested by the sample mean, not established by the test. A and C ignore the p-value. D incorrectly states a one-sided conclusion when a two-sided test was specified.

---

A Black Belt compares the mean defect repair time between two independent teams (Team A and Team B). Normality and equal variance assumptions are satisfied. Which of the following is the correct setup of hypotheses for a 2-sample t-test if the objective is to determine whether Team B has a lower mean repair time than Team A?

A. H₀: μA = μB; H₁: μA ≠ μB
B. H₀: μB − μA ≥ 0; H₁: μB − μA < 0
C. H₀: μB − μA ≤ 0; H₁: μB − μA > 0
D. H₀: μA − μB ≠ 0; H₁: μA − μB = 0

Answer: B

Reason: To test if Team B is faster (lower mean) than Team A, use a one-sided alternative: H₁: μB − μA < 0 (mean of B less than A), with null H₀: μB − μA ≥ 0. A is two-sided and does not specifically test “less than.” C tests if B is slower than A. D reverses the logic of null and alternative.

---

A Black Belt performs a 2-sample t-test (independent) on cycle times for Method 1 (n₁ = 20) and Method 2 (n₂ = 18). At α = 0.05, the test output shows: difference in means (1 − 2) = −3.5 min, 95% CI for difference = (−6.2, −0.8), p-value = 0.011. What is the correct interpretation?

A. There is no statistically significant difference between the methods
B. Method 1 has a significantly lower mean cycle time than Method 2
C. Method 2 has a significantly lower mean cycle time than Method 1
D. The test is inconclusive because sample sizes are different

Answer: B

Reason: The estimated difference (1 − 2) is −3.5 min, so the mean for Method 1 is 3.5 min below the mean for Method 2. The 95% CI is entirely below zero and p = 0.011 < 0.05, indicating Method 1 has a significantly lower mean cycle time than Method 2. A conflicts with p-value and CI. C reverses the direction of the difference. D is incorrect: unequal sample sizes are acceptable in a 2-sample t-test.

---

A Black Belt wants to compare the mean tensile strength of a new material to an existing standard using a 1-sample t-test. Which of the following conditions is most critical to validate before relying on the t-test results?

A. The population standard deviation is known
B. The sample data are approximately normally distributed (or n is sufficiently large)
C. The population is finite and less than 10,000 units
D. The process has a Cpk greater than 1.33

Answer: B

Reason: The 1-sample t-test assumes that the sample comes from an approximately normal distribution, especially for small sample sizes; for large samples, the Central Limit Theorem helps. A is incorrect because the t-test is specifically used when σ is unknown. C is irrelevant to the t-test assumption. D concerns capability, not validity of the t-test.
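As a quick arithmetic check, the t-statistic quoted in the fill-weight question (n = 30, x̄ = 495 g, s = 12 g, μ₀ = 500 g) can be reproduced from the 1-sample formula. The sketch below also builds the matching 95% confidence interval; the critical value 2.045 is the standard t-table entry for df = 29 and two-sided α = 0.05, entered by hand as an assumption because the Python standard library has no t-quantile function.

```python
import math

# Summary statistics taken from the fill-weight practice question.
n, xbar, s, mu0 = 30, 495.0, 12.0, 500.0

se = s / math.sqrt(n)        # standard error of the mean
t = (xbar - mu0) / se        # 1-sample t statistic
df = n - 1

t_crit = 2.045               # t-table value for df = 29, two-sided alpha = 0.05
ci_low, ci_high = xbar - t_crit * se, xbar + t_crit * se

print(f"t = {t:.2f}, df = {df}")                 # t = -2.28, df = 29
print(f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval lies entirely below 500 g, which is consistent with rejecting H₀: μ = 500 in the two-sided test.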
