Including Tests of Equal Variance, Normality Testing and Sample Size calculation, performing tests and interpreting results.
Introduction

This article explains how to:
- Check normality of data
- Check equality of variances across groups
- Calculate sample size for hypothesis tests and confidence intervals
- Perform tests and interpret results correctly

All content is focused on the level of rigor needed to correctly select, run, and interpret these statistical tools in improvement projects.

---

Why Normality, Equal Variance, and Sample Size Matter

Before applying statistical tests, three questions must be answered:
- Is the data approximately normal? This affects whether to use parametric tests (t-test, ANOVA, regression) or nonparametric alternatives.
- Are variances approximately equal across groups? Many parametric tests assume equal variance; violating this can distort p-values and conclusions.
- Is the sample size adequate? Too small: high risk of not detecting real effects. Too large: detects trivial differences and wastes resources.

The goal is to choose appropriate tests, ensure assumptions are reasonable, and interpret p-values and confidence intervals in context.

---

Normality Testing

Understanding Normality in Practice

Normality refers to whether data follow a bell-shaped, symmetric distribution. Many parametric tests assume:
- The population of residuals (or errors) is normal
- Or, for simple tests, each group's data is approximately normal

In real applications:
- Moderate deviations from normality are often acceptable, especially with larger samples (Central Limit Theorem).
- Severe skewness, heavy tails, or outliers may invalidate parametric test results.

Normality matters most when:
- Sample sizes are small (commonly n < 30 per group)
- Data are highly skewed or zero-inflated
- Using interval estimates or hypothesis tests that rely on t-distributions or F-distributions

Graphical Assessment of Normality

Graphical checks are essential before formal tests. Key plots:
- Histogram
  - Look for approximate bell shape, symmetry, and a single peak.
  - Check for strong skew, multiple peaks, or extreme outliers.
- Boxplot
  - Visualize median, spread, skewness, and outliers.
  - Long whiskers on one side and many outliers indicate skewness.
- Normal probability plot (Q-Q plot)
  - If data are normal, points align roughly along a straight line.
  - Systematic curvature indicates non-normality:
    - S-shaped: heavy tails
    - Convex/concave: skewness

Interpretation principle:
- Use plots first to understand the nature and severity of departures from normality.
- Combine plots with formal tests, not replace them.

Formal Normality Tests

Common tests include:
- Anderson–Darling test
  - Sensitive to deviations in both center and tails.
  - Often used in statistical software for normality testing.
- Shapiro–Wilk test
  - Powerful for small to moderate sample sizes.
  - Frequently used in research applications.
- Kolmogorov–Smirnov test (with Lilliefors correction)
  - Compares the empirical distribution to the theoretical normal.
  - Less powerful than Anderson–Darling and Shapiro–Wilk in many practical cases.

Each test uses:
- Null hypothesis (H₀): Data follow a normal distribution.
- Alternative hypothesis (H₁): Data do not follow a normal distribution.

Decision rule:
- If p-value < α (often 0.05): Reject H₀, conclude data are not normal.
- If p-value ≥ α: Do not reject H₀; no evidence against normality.

Important cautions:
- With large samples, even tiny, unimportant deviations can yield significant p-values.
- With very small samples, tests may lack power; reliance on plots and process knowledge becomes more important.
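For readers who want to try these checks in software, here is a minimal sketch of the normality tests just described, assuming Python with SciPy and Matplotlib available. The data are simulated purely for illustration; the variable names and sample size are placeholders, not part of any real study.

```python
# Minimal sketch of normality checks; data are simulated for illustration only.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
cycle_times = rng.normal(loc=12.0, scale=1.5, size=40)   # hypothetical process data

# Shapiro–Wilk: H0 = data come from a normal distribution
w_stat, p_sw = stats.shapiro(cycle_times)
print(f"Shapiro–Wilk: W = {w_stat:.3f}, p = {p_sw:.3f}")

# Anderson–Darling: compare the statistic to the tabulated critical values
ad = stats.anderson(cycle_times, dist="norm")
print(f"Anderson–Darling statistic = {ad.statistic:.3f}")
for crit, sig in zip(ad.critical_values, ad.significance_level):
    print(f"  reject normality at the {sig}% level" if ad.statistic > crit
          else f"  no evidence against normality at the {sig}% level")

# Normal probability (Q-Q) plot: points near the line suggest approximate normality
stats.probplot(cycle_times, dist="norm", plot=plt)
plt.title("Normal probability plot of cycle times")
plt.show()
```

In the Shapiro–Wilk output a small p-value is evidence against normality, while the Anderson–Darling statistic is judged against its critical values, mirroring the decision rule above.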
Practical Interpretation and Response

When normality tests indicate non-normality:
- Review plots to understand the pattern:
  - Skewed right: long right tail (e.g., time to complete, defect counts).
  - Skewed left: long left tail.
  - Bimodal: possibly mixed populations or stratification issues.

Common responses:
- Use a transformation:
  - Log, square root, or Box–Cox transformations can stabilize variance and improve normality of residuals.
  - Interpret final results in original units when reporting.
- Use nonparametric tests:
  - For example, Mann–Whitney instead of a 2-sample t-test, Kruskal–Wallis instead of one-way ANOVA.
  - Especially useful when sample sizes are small and deviation from normality is severe.
- Increase sample size:
  - Reduces sensitivity to modest non-normality due to the Central Limit Theorem, especially for mean-based tests.
- Model at the right level:
  - For counts, consider Poisson/binomial models rather than forcing normality.

The key is to match the test to the data and clearly state any limitations in conclusions.

---

Tests of Equal Variance

Why Equality of Variance Matters

Many parametric tests assume homogeneity of variance (equal variances across groups), including:
- 2-sample t-test (pooled-variance version)
- One-way ANOVA
- Classical linear regression with constant error variance

If variances are not equal:
- Type I error (false positive) and power can be distorted.
- Tests may favor one group, especially when sample sizes are unequal.

Before applying these tests, check:
- Are group spreads similar?
- Are outliers or extreme values inflating variance?

Graphical Assessment of Variance

Use simple visuals first:
- Side-by-side boxplots
  - Compare IQR (box height), whisker length, and spread.
  - Large differences in box height suggest unequal variances.
- Residuals vs fitted values plots (for models)
  - Look for patterns in spread as the mean changes.
  - Funnel shapes indicate non-constant variance.

Interpretation:
- Approximate similarity is often sufficient.
- Major differences or clear patterns suggest using robust methods or adjustments.

Formal Tests for Equal Variance

Common tests include:
- F-test (two groups)
  - H₀: σ₁² = σ₂², H₁: σ₁² ≠ σ₂².
  - Highly sensitive to non-normality; best only when normality is reasonable.
- Levene's test
  - Based on absolute deviations from group medians or means.
  - Less sensitive to non-normality than the F-test.
- Bartlett's test
  - Sensitive to non-normality; best with normal distributions.
  - Used for more than two groups.
- Brown–Forsythe test
  - A robust variant of Levene's test using medians.
  - Handles skewed data better.

General decision rule:
- Null hypothesis (H₀): All group variances are equal.
- Alternative (H₁): At least one group variance differs.
- If p-value < α: Evidence of unequal variances.
- If p-value ≥ α: No evidence against equal variances.

Use both tests and graphics to judge whether the equal variance assumption is adequate.

What to Do When Variances Are Unequal

When equal variance is questionable:
- Use a test that does not assume equal variance:
  - For 2 groups: Welch's t-test instead of the pooled t-test.
  - For multiple groups: Welch's ANOVA.
- Transform the data:
  - Some transformations (e.g., log) can both stabilize variance and improve normality.
- Use nonparametric methods:
  - Mann–Whitney, Kruskal–Wallis, or rank-based regression when appropriate.
- Use robust standard errors (for regression):
  - Adjusts inference without changing the model form.

When reporting:
- State which equal-variance test was used.
- Indicate whether pooled or nonpooled (Welch) methods were applied.
- Clarify that conclusions are based on methods robust to unequal variance when needed.
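As a minimal sketch of the decision path above, the example below runs Levene's test with median centering (the Brown–Forsythe variant) on two simulated groups and then applies either a pooled or a Welch t-test. SciPy is assumed available; the group names, parameters, and the 0.05 threshold are illustrative assumptions, not prescriptions.

```python
# Sketch: equal-variance check followed by a pooled vs. Welch t-test.
# The two groups are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
process_a = rng.normal(10.0, 1.0, size=30)
process_b = rng.normal(10.6, 2.0, size=30)   # deliberately larger spread

# Levene's test with center="median" is the Brown–Forsythe variant,
# which tolerates skewed data better than the classical F-test.
lev_stat, lev_p = stats.levene(process_a, process_b, center="median")
print(f"Levene/Brown–Forsythe: p = {lev_p:.3f}")

if lev_p >= 0.05:
    # No evidence against equal variances: pooled two-sample t-test
    t_stat, p_val = stats.ttest_ind(process_a, process_b, equal_var=True)
    method = "pooled t-test"
else:
    # Evidence of unequal variances: Welch's t-test does not pool variances
    t_stat, p_val = stats.ttest_ind(process_a, process_b, equal_var=False)
    method = "Welch's t-test"

print(f"{method}: t = {t_stat:.2f}, p = {p_val:.4f}")
```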
---

Sample Size Calculation: Foundations

Key Concepts: Effect Size, Alpha, Power, and Variability

Sample size determination depends on four main components:
- Effect size (Δ): the smallest difference that is practically important to detect.
- Alpha (α): probability of a Type I error (rejecting a true null). Commonly α = 0.05.
- Power (1 − β): probability of detecting a true effect of size Δ. Common targets:
  - 0.80 (80%)
  - 0.90 (90%) when missing a difference is costly.
- Variability (σ or s): standard deviation from historical data, a pilot study, or prior experience.

Trade-offs:
- Smaller α or larger power → larger required sample.
- Smaller effect size to detect → larger required sample.
- Higher variability → larger required sample.

General Steps in Sample Size Planning

A systematic approach:
- Clarify the objective
  - Estimate a mean with a given margin of error.
  - Compare two means.
  - Estimate or compare proportions.
  - Detect an improvement of at least a specified amount.
- Specify statistical parameters
  - Choose α (often 0.05).
  - Choose desired power (e.g., 0.80 or 0.90).
  - Define the minimum practically important effect size Δ.
  - Estimate the standard deviation or baseline proportion.
- Choose the appropriate formula or software tool
  - Match the design: 1-sample, 2-sample, paired, ANOVA, regression, etc.
- Adjust for practical constraints
  - Integer sample sizes, balance across groups, resource limits.
- Plan for attrition or unusable data
  - Increase the required sample size to account for dropouts, measurement issues, or defects.

---

Sample Size for Means

One-Sample Mean (Confidence Interval or Test)

Objective examples:
- Estimate a mean within a margin of error E at confidence level (1 − α).
- Detect a difference from a target value μ₀ of at least Δ.

Key elements:
- Known or estimated standard deviation s.
- Z or t quantile depending on sample size and assumptions.

Interpretation:
- Larger confidence levels or smaller margins of error require larger n.
- For hypothesis tests, smaller detectable Δ requires larger n.

Two-Sample Means: Independent Samples

Common scenarios:
- Comparing process A vs B.
- Comparing before vs after when data are not naturally paired.

Inputs:
- Effect size (Δ): minimum meaningful difference between μ₁ and μ₂.
- Standard deviation in each group, often assumed equal initially.
- Allocation ratio (often 1:1, but may differ).

Considerations:
- If variances are likely unequal, use methods based on Welch's t-test for planning.
- Balanced designs (equal n per group) maximize power for a given total sample.

Power and sample size are highly sensitive to:
- The accuracy of the standard deviation estimate.
- The realism of the effect size Δ.
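To make the planning inputs above concrete, here is a small sketch that applies the common normal-approximation formula for comparing two independent means, n per group ≈ 2(Zα/2 + Zβ)²σ²/Δ². It is an approximation of what dedicated power software does; the σ, Δ, α, and power values are placeholders, and exact t-based calculations give slightly larger answers.

```python
# Sketch: approximate per-group sample size for a two-sample comparison of means,
# using the normal approximation n ≈ 2 * (z_{α/2} + z_β)² * σ² / Δ².
# σ, Δ, α and power below are illustrative placeholders.
import math
from scipy.stats import norm

def n_per_group_two_means(sigma, delta, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)

print(n_per_group_two_means(sigma=2.0, delta=1.0))              # about 63 per group at 80% power
print(n_per_group_two_means(sigma=2.0, delta=1.0, power=0.90))  # larger n for higher power
```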
---

Sample Size for Proportions

One-Sample Proportion

Example objectives:
- Estimate a defect rate within ±E with confidence (1 − α).
- Detect a change from baseline p₀ to p₁ = p₀ + Δ.

Inputs:
- Baseline proportion (p₀) or best estimate.
- Desired margin of error or minimum detectable change.

Observations:
- Proportions near 0.5 require the largest sample for a given margin of error.
- Rare events (very small p) may need large samples or alternative designs.

Two-Sample Proportions

Typical use:
- Compare defect rates of two processes or time periods.
- Compare conversion rates across two treatments.

Inputs:
- Baseline proportion p₀.
- Target proportion p₁ = p₀ + Δ or an absolute difference of interest.
- Alpha and power.

Planning decisions:
- Whether to use pooled or unpooled methods in calculations.
- Whether group sizes will be equal or different due to practical constraints.
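A similar sketch can be written for two proportions. The version below uses a standard pooled normal-approximation formula; the baseline and target proportions are placeholders, and exact or software-based methods may return somewhat different values, especially for rare events.

```python
# Sketch: approximate per-group sample size for comparing two proportions
# (pooled normal-approximation formula). p0 and p1 are illustrative placeholders.
import math
from scipy.stats import norm

def n_per_group_two_proportions(p0, p1, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p0 + p1) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return math.ceil(numerator / (p1 - p0) ** 2)

# Example: detect an improvement in defect rate from 10% to 5%
print(n_per_group_two_proportions(p0=0.10, p1=0.05))   # roughly 430–440 per group
```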
---

Linking Normality, Variance, and Sample Size to Test Selection

Choosing the Appropriate Test

After planning sample size and collecting data:
- Check normality
  - If acceptable and n is moderate to large: mean-based parametric tests (t-test, ANOVA, regression) are typically valid.
  - If non-normal with small n or severe skew: consider transformations or nonparametric tests.
- Check equality of variances
  - If equal variances are plausible: use pooled-variance methods (2-sample t-test with pooled variance, classical ANOVA).
  - If unequal: use Welch's methods or robust approaches.
- Match the test to the measurement level
  - Continuous data: tests on means (t-tests, ANOVA, regression).
  - Categorical data: tests on proportions or counts (z-tests, chi-square).

Interpreting p-Values and Confidence Intervals

For any test:
- p-value
  - Probability, under H₀, of observing a result at least as extreme as the one obtained.
  - Compare to α: p < α means evidence against H₀; p ≥ α means insufficient evidence to reject H₀.
- Confidence interval
  - Range of plausible values for a parameter (mean difference, proportion difference, etc.).
  - If the interval includes 0 (for a mean difference) or the null value: consistent with no effect.
  - If the interval excludes 0 or the null value: consistent with a significant effect.

Relationship:
- A two-sided test at α corresponds to a (1 − α) confidence interval: if the interval does not contain the null value, p < α.

Decision-making:
- Use both p-values and confidence intervals:
  - p-value: whether the evidence is statistically significant.
  - Confidence interval: how large the effect might be, in practical units.

---

Performing and Interpreting Normality and Equal Variance Tests

Stepwise Approach in Analysis

A practical workflow:
- Step 1: Plot the data
  - Histograms, boxplots, normal probability plots.
  - Residual vs fitted plots for model-based analyses.
- Step 2: Conduct normality tests
  - Use Anderson–Darling or Shapiro–Wilk.
  - Interpret p-values with sample size in mind.
  - Reconcile with visual evidence and process knowledge.
- Step 3: Conduct equal-variance tests
  - Levene's or Brown–Forsythe for multiple groups.
  - F-test for two groups only when normality is reasonable.
- Step 4: Choose the inference method
  - If normality and equal variance are acceptable: use standard parametric methods.
  - If issues arise: consider transformations, nonparametric tests, or robust methods.
- Step 5: Interpret results comprehensively
  - Combine assumption diagnostics, p-values, confidence intervals, and practical significance.

Common Pitfalls and How to Avoid Them

Typical issues:
- Over-reliance on p-values from diagnostic tests
  - Large samples produce significant p-values for trivial deviations.
  - Avoid automatic rejection of normality based solely on a small p-value.
- Ignoring variance inequality in unbalanced designs
  - When group sizes differ, unequal variances can seriously bias results.
  - Always check both assumptions when sample sizes are very different.
- Using parametric tests on data with extreme outliers without investigation
  - Outliers may indicate data issues or a different process regime.
  - Investigate causes; consider robust or transformed methods.
- Planning sample size without realistic variance or effect size estimates
  - Use pilot data, historical data, or subject-matter input.
  - Conduct sensitivity analyses (e.g., how required n changes with s or Δ).
- Interpreting non-significant results as evidence of no effect
  - A non-significant result may mean no effect, or it may reflect insufficient power or an inadequate sample size.
  - Examine the width of the confidence interval and the actual power.
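The sensitivity-analysis pitfall above is easy to address with a few lines of code. The sketch below recomputes the approximate per-group sample size for a two-sample comparison of means over a grid of assumed standard deviations and minimum important differences; all numbers are illustrative placeholders.

```python
# Sketch: simple sensitivity analysis of the required per-group sample size
# for a two-sample comparison of means, as the assumed σ and target Δ vary.
import math
from scipy.stats import norm

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * z ** 2 * sigma ** 2 / delta ** 2)

print("sigma  delta  n/group")
for sigma in (1.0, 1.5, 2.0):          # plausible range for the standard deviation
    for delta in (0.5, 1.0):           # smallest practically important differences
        print(f"{sigma:5.1f}  {delta:5.1f}  {n_per_group(sigma, delta):7d}")
```

A table like this makes it obvious how strongly the plan depends on the variance estimate and on the chosen effect size, which supports a more realistic data-collection commitment.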
---

Summary

Normality tests, tests for equal variance, and sample size calculations work together to ensure valid, interpretable statistical conclusions.
- Normality testing (via plots and tests like Anderson–Darling or Shapiro–Wilk) informs whether mean-based parametric methods are appropriate or whether transformations or nonparametric tests are needed.
- Tests of equal variance (such as Levene's or Brown–Forsythe) guide the choice between pooled and nonpooled methods for comparing groups and help detect when robust approaches are necessary.
- Sample size calculation balances effect size, alpha, power, and variability to ensure data are sufficient to detect practically important differences without wasting effort.
- Performing tests and interpreting results requires:
  - Checking assumptions with both diagnostics and graphics,
  - Choosing appropriate tests based on data characteristics,
  - Interpreting p-values and confidence intervals in the context of practical significance and study design.

When these elements are applied systematically, statistical conclusions become reliable, transparent, and aligned with the real objectives of process analysis and improvement.

Practical Case: Including Tests of Equal Variance, Normality Testing and Sample Size calculation, performing tests and interpreting results.

A medical device manufacturer wants to compare average assembly time for a catheter kit across three shifts (Day, Evening, Night). Management suspects the Night shift is slower and wants data before changing staffing. The Black Belt leads a short study.

Context and Problem

Operators log assembly time (minutes) for each kit. Initial 2-week data exist, but sample sizes per shift are small and uneven. Before running an ANOVA on mean assembly time, the team must:
- Check normality of assembly times for each shift.
- Check equality of variances among shifts.
- Decide how many additional observations are needed per shift.

Applying the Tests

The Black Belt:
1. Extracts current data for each shift and runs normality tests (e.g., Shapiro–Wilk) plus normal probability plots.
2. Runs a test for equal variances across shifts (e.g., Levene's test).
3. Uses the historical standard deviation from current data and a practically important time difference (e.g., 0.5 min between shifts) to calculate the required sample size per shift to achieve the desired power.

The tests show:
- Day and Evening data are approximately normal; the Night shift is slightly right-skewed but still acceptable for parametric analysis.
- The equal variance test p-value is > 0.05, so the assumption of equal variances is not rejected.
- The current sample size per shift is insufficient to detect the chosen time difference with 80% power; about twice as many observations per shift are needed.

The team collects the additional data over the next week, then reruns the normality and equal variance tests, confirming the assumptions still hold. An ANOVA is performed.

Interpreting Results

The ANOVA shows a statistically significant difference in mean assembly time between shifts, with the Night shift slower than Day by more than the predefined practical threshold. Because the normality tests did not indicate severe departures, the equal variance test supported using standard ANOVA, and the sample size was calculated to ensure adequate power, management trusts the result, reallocates experienced operators to the Night shift, and updates training. A follow-up check shows the Night shift mean time aligns with Day and Evening, with reduced overtime costs.
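To illustrate how the case-study workflow might look in software, the sketch below repeats the same sequence (Shapiro–Wilk per shift, Levene's test across shifts, one-way ANOVA) on simulated assembly-time data. The shift means, spreads, and sample sizes are hypothetical stand-ins for the manufacturer's actual measurements.

```python
# Illustrative sketch of the case-study workflow on simulated assembly-time data.
# Shift parameters and sample sizes are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
shifts = {
    "Day":     rng.normal(11.8, 1.1, size=25),
    "Evening": rng.normal(12.0, 1.1, size=22),
    "Night":   rng.normal(12.7, 1.2, size=20),   # simulated slower shift
}

# Step 1: normality check per shift (Shapiro–Wilk)
for name, times in shifts.items():
    _, p = stats.shapiro(times)
    print(f"{name:8s} Shapiro–Wilk p = {p:.3f}")

# Step 2: equal-variance check across shifts (Levene, median-centered)
_, p_lev = stats.levene(*shifts.values(), center="median")
print(f"Levene p = {p_lev:.3f}")

# Step 3: one-way ANOVA on mean assembly time
f_stat, p_anova = stats.f_oneway(*shifts.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")
```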
Practice question: Including Tests of Equal Variance, Normality Testing and Sample Size calculation, performing tests and interpreting results.

A Black Belt is comparing cycle time across three machines using one-way ANOVA. The residual plots show no strong patterns, but the p-value from Levene's test for equal variances is 0.01. What is the most appropriate action?
A. Proceed with standard one-way ANOVA and ignore the equal variance result
B. Use a nonparametric alternative such as the Kruskal–Wallis test
C. Transform the response (e.g., log) and re-check equal variance and normality
D. Drop the machine with the highest variance and re-run ANOVA
Answer: C
Reason: A p-value of 0.01 in Levene's test indicates a violation of homogeneity of variances; a common Black Belt approach is to apply a variance-stabilizing transformation (e.g., log, square root), then re-test assumptions before proceeding with parametric ANOVA. The other options either ignore a violated assumption (A), unnecessarily downgrade to a nonparametric test without first attempting a transformation (B), or inappropriately discard data (D).

---

A process capability study requires that the underlying data be approximately normal. A Black Belt runs an Anderson–Darling normality test and obtains AD = 0.45, p-value = 0.20 (α = 0.05). The histogram is mildly skewed but without extreme outliers. Which is the most appropriate interpretation?
A. The data are non-normal and cannot be used for capability analysis
B. The data do not provide evidence to reject normality and normal-based capability is acceptable
C. The data must be Box–Cox transformed because the AD statistic is > 0.30
D. The data should be assumed non-normal because the histogram is skewed
Answer: B
Reason: With p = 0.20 > 0.05, there is insufficient evidence to reject normality, so a normal-based capability analysis is typically acceptable at the Black Belt level, especially if no strong outliers or severe skew are present. The other options misinterpret the p-value (A, D) or impose a rigid rule based only on the AD statistic without regard to the p-value and practical context (C).

---

A Black Belt is planning a two-sample t-test to detect a mean difference in defect density of 0.5 defects/unit between two suppliers. Historical data suggest a common standard deviation of 1.2 defects/unit. The team wants 90% power at α = 0.05 (two-sided). What is the approximate required sample size per group (n) using the standard normal approximation?
A. 16
B. 31
C. 45
D. 62
Answer: B
Reason: For a two-sample t-test (normal approximation), n per group ≈ 2 × (Zα/2 + Zβ)² × σ² / Δ². Zα/2 (α = 0.05) ≈ 1.96, Zβ (power 0.90) ≈ 1.28, σ = 1.2, Δ = 0.5. (Zα/2 + Zβ)² ≈ (1.96 + 1.28)² ≈ (3.24)² ≈ 10.50. n ≈ 2 × 10.50 × (1.2)² / (0.5)² = 2 × 10.50 × 5.76 ≈ 120.96 total, about 60–61, or ≈ 30–31 per group. The other options under- or over-estimate relative to this calculation; 31 per group (B) is closest to the computed requirement.

---

A Black Belt conducting a one-way ANOVA across four filling lines tests the equal variance assumption using Bartlett's test (α = 0.05) and obtains p = 0.30. A normality test on residuals gives p = 0.001. Which is the best next step?
A. Proceed with ANOVA since equal variances are satisfied
B. Investigate non-normality (e.g., outliers, transformation) before relying on ANOVA results
C. Switch immediately to Kruskal–Wallis, ignoring Bartlett's test
D. Increase α to 0.10 to improve the normality test result
Answer: B
Reason: Equal variance appears acceptable (p = 0.30), but the residuals are significantly non-normal (p = 0.001); a Black Belt should first investigate and address the non-normality (e.g., remove special-cause outliers or apply a transformation) before trusting ANOVA conclusions. The other options either ignore a violated assumption (A), jump to a nonparametric test without diagnosing causes (C), or misuse the significance level to "force" an assumption to appear satisfied (D).

---

A Black Belt is designing an improvement study to compare the mean lead time before and after a process change, using paired observations on the same orders. She expects a standard deviation of the paired differences of 3 days and wants to detect a mean reduction of 1 day with 80% power at α = 0.05 (two-sided). What is the approximate required number of pairs?
A. 18
B. 35
C. 54
D. 72
Answer: D
Reason: For a two-sided paired t-test using the normal approximation, the sample size is based on the standard deviation of the differences: n ≈ (Zα/2 + Zβ)² × σd² / Δ². With α = 0.05, Zα/2 ≈ 1.96; power = 0.80, so Zβ ≈ 0.84; σd = 3; Δ = 1. (Zα/2 + Zβ)² ≈ (1.96 + 0.84)² = (2.80)² ≈ 7.84, so n ≈ 7.84 × 3² / 1² = 7.84 × 9 ≈ 71 pairs. The closest option is 72 (D), which is slightly conservative and therefore appropriate for planning. The other options (18, 35, 54) are too small given the variability, effect size, and desired power.
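For completeness, the last calculation can be reproduced numerically. The short sketch below evaluates the paired-test approximation with SciPy and confirms that roughly 71 pairs are required, making 72 the closest listed option.

```python
# Verify the paired-test sample-size approximation used above.
import math
from scipy.stats import norm

z_alpha = norm.ppf(0.975)   # two-sided α = 0.05
z_beta = norm.ppf(0.80)     # 80% power
sigma_d, delta = 3.0, 1.0

n_pairs = (z_alpha + z_beta) ** 2 * sigma_d ** 2 / delta ** 2
print(round(n_pairs, 1), math.ceil(n_pairs))   # ≈ 70.6 → 71 pairs; 72 is the closest option
```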
