
2.2.1 Basic Statistics

Introduction

Basic statistics provides the language and tools to describe data, quantify variation, and make decisions under uncertainty. This article focuses on the core statistical concepts needed to understand, summarize, and analyze data rigorously.

---

Types of Data and Scales of Measurement

Data Types

Understanding data types guides the correct choice of graphs, summary measures, and statistical tests.

- Qualitative (categorical)
  - Represent categories or labels
  - Examples: defect type, machine ID, region
  - Typical analyses: proportions, frequency tables, chi-square tests
- Quantitative (numerical)
  - Represent counts or measurements
  - Examples: cycle time, weight, temperature
  - Typical analyses: means, standard deviations, correlations, regression

Measurement Scales

- Nominal
  - Categories with no inherent order
  - Example: product color (red, blue, green)
  - Allowed operations: counts, mode
- Ordinal
  - Categories with a meaningful order, but unequal or unknown spacing
  - Example: satisfaction rating (low, medium, high)
  - Allowed operations: median, percentiles, nonparametric tests
- Interval
  - Numerical scale with equal intervals, no true zero
  - Example: temperature in °C or °F
  - Allowed operations: differences are meaningful, ratios are not
- Ratio
  - Numerical scale with equal intervals and a true zero
  - Example: time, length, weight, defect count
  - Allowed operations: differences and ratios are meaningful

---

Descriptive Statistics

Measures of Central Tendency

Central tendency describes where data values tend to cluster.
- Mean (arithmetic average)
  - Sum of values divided by the number of observations
  - Sensitive to extreme values (outliers)
  - Appropriate for symmetric, quantitative data
- Median (50th percentile)
  - Middle value when data are ordered
  - Robust to outliers and skewed distributions
  - Appropriate when data are skewed or have extreme values
- Mode
  - Most frequent value or category
  - Useful for categorical data or multi-modal distributions

Measures of Dispersion

Dispersion quantifies how spread out the data are.

- Range
  - Maximum − minimum
  - Very sensitive to outliers
  - Simple indicator of spread
- Variance
  - Average squared deviation from the mean
  - Population variance: σ²
  - Sample variance: s²
  - Units are squared; mainly used as an intermediate measure
- Standard deviation (SD)
  - Square root of the variance
  - Same units as the original data
  - Larger SD means more variability
- Interquartile range (IQR)
  - Q3 − Q1 (75th percentile − 25th percentile)
  - Focuses on the middle 50% of the data
  - Robust to outliers

Shape of Distributions

Distribution shape affects the choice of statistics and methods.
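As a minimal sketch, the central-tendency and dispersion measures above can be computed with Python's standard library; the cycle-time values below are invented for illustration:

```python
import statistics

# Hypothetical cycle-time sample (minutes); values invented for illustration
data = [11.2, 12.5, 12.8, 13.0, 13.1, 13.4, 14.0, 14.2, 15.1, 21.7]

mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.stdev(data)            # sample standard deviation (n - 1 divisor)
rng = max(data) - min(data)            # range = maximum - minimum

# Quartiles via statistics.quantiles (n=4 returns Q1, Q2, Q3)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"mean={mean:.2f} median={median:.2f} sd={sd:.2f} "
      f"range={rng:.2f} IQR={iqr:.2f}")

# The single large value (21.7) pulls the mean above the median,
# a typical signal of right skew
assert mean > median
```

Note how the mean (≈14.1) exceeds the median (13.25) because of the one extreme value, while the IQR ignores it entirely.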
- Symmetric
  - Left and right sides mirror each other
  - Mean ≈ median
- Skewed right (positively skewed)
  - Long tail to the right
  - Mean > median
- Skewed left (negatively skewed)
  - Long tail to the left
  - Mean < median
- Multi-modal
  - More than one peak
  - May indicate different sub-populations or process states

---

Data Visualization

Frequency Tables and Histograms

- Frequency table
  - Lists categories or intervals with counts or percentages
  - Helps show the distribution and sample size
- Histogram
  - Bars show the frequency of numerical data in intervals (bins)
  - Used to visualize distribution shape, center, and spread
  - Bar widths are equal; bars touch (continuous data)

Boxplots

- Boxplot components
  - Box: from Q1 to Q3
  - Line in box: median
  - Whiskers: extend to the smallest/largest non-outlier points
  - Points beyond the whiskers: potential outliers
- Uses
  - Compare distributions across groups
  - Assess symmetry, spread, and the presence of outliers

Scatterplots

- Scatterplot
  - Plots pairs of numerical variables (x, y)
  - Reveals patterns: linear, nonlinear, clusters, outliers
  - Basis for correlation and regression analysis

---

Basic Probability Concepts

Fundamental Ideas

- Experiment: process that leads to an outcome (e.g., observe a part’s status)
- Outcome: single possible result (e.g., defective)
- Event: set of outcomes (e.g., defective or rework)
- Sample space: set of all possible outcomes
- Probability of an event A
  - P(A) is between 0 and 1
  - 0 = impossible, 1 = certain

Complement, Union, and Intersection

- Complement
  - Event “not A”
  - P(not A) = 1 − P(A)
- Union (A or B)
  - Event that A occurs, or B occurs, or both
  - P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
- Intersection (A and B)
  - Event that both A and B occur
  - For independent events: P(A ∩ B) = P(A) × P(B)

Conditional Probability and Independence

- Conditional probability
  - P(A | B) = probability of A given that B occurred
- Multiplication rule
  - P(A ∩ B) = P(B) × P(A | B)
- Independence
  - Events A and B are independent if P(A | B) = P(A)
  - Then P(A ∩ B) = P(A) × P(B)

---

Discrete and Continuous Distributions

Discrete Distributions

Discrete distributions describe counts.

- Binomial distribution
  - Conditions:
    - Fixed number of trials (n)
    - Each trial has two outcomes (success/failure)
    - Constant probability of success (p)
    - Trials are independent
  - Parameters: n, p
  - Mean = n·p
  - Variance = n·p·(1 − p)
  - Applications: number of defective items in a sample
- Poisson distribution
  - Models the count of events in a fixed interval (time, area, volume)
  - Events occur independently at a constant average rate
  - Parameter: λ (average count per interval)
  - Mean = λ, Variance = λ
  - Applications: defects per unit, calls per hour

Continuous Distributions

Continuous distributions describe measurements.

- Normal distribution
  - Bell-shaped, symmetric
  - Defined by mean μ and standard deviation σ
  - Many natural and process variables approximate normality
  - Empirical rule:
    - About 68% of values within ±1σ
    - About 95% within ±2σ
    - About 99.7% within ±3σ
- Standard normal distribution
  - Mean 0, standard deviation 1
  - Use Z-scores: Z = (X − μ) / σ
- Other important continuous distributions
  - t distribution: used when estimating means from small samples with unknown σ
  - Chi-square distribution: used for variances and certain tests (e.g., goodness of fit)
  - F distribution: used to compare variances (e.g., ANOVA)

---

Sampling and Sampling Distributions

Populations and Samples

- Population
  - Entire set of units or observations of interest
- Sample
  - Subset selected from the population
  - Used to estimate population characteristics
- Parameter
  - Numerical summary of a population (μ, σ, p)
- Statistic
  - Numerical summary of a sample (x̄, s, p̂)

Sampling Methods

- Simple random sampling
  - Every unit has an equal chance of selection
  - Reduces selection bias
- Stratified sampling
  - Population divided into subgroups (strata), then sampled from each
  - Improves representation of key subgroups
- Systematic sampling
  - Select every k-th unit after a random start
  - Simple to implement when a list or sequence is available

Sampling Distributions and the Central Limit Theorem

- Sampling distribution
  - Distribution of a statistic (e.g., the sample mean) over repeated samples
  - Has its own mean and standard deviation (the standard error)
- Central Limit Theorem (CLT)
  - For sufficiently large sample size n, the distribution of the sample mean x̄ is approximately normal, regardless of the original population distribution, provided the variance is finite
  - Mean of x̄ = μ
  - Standard error of x̄ = σ / √n
- Implications
  - Normal-based methods for means are often valid when n is large
  - Precision improves as n increases (the standard error decreases)

---

Estimation and Confidence Intervals

Point and Interval Estimates

- Point estimate
  - Single best estimate of a population parameter
  - Examples: x̄ for μ, p̂ for p
- Interval estimate
  - Range of values likely to contain the true parameter
  - Expressed with a confidence level (e.g., a 95% confidence interval)

Confidence Intervals for Means

- Known population standard deviation (σ)
  - Use a Z-based interval (rare in practice)
- Unknown population standard deviation
  - Use the sample standard deviation (s) and the t distribution
- Structure:
  - Estimate ± (critical value) × (standard error)
- Interpretation
  - A 95% confidence interval is constructed so that, in repeated sampling, 95% of such intervals would contain the true parameter
  - It does not mean there is a 95% probability that a specific computed interval contains the parameter

Confidence Intervals for Proportions

- Sample proportion (p̂)
  - p̂ = (number of successes) / (sample size)
- Approximate confidence interval
  - p̂ ± Z* × √[p̂(1 − p̂) / n], when n is sufficiently large
  - Used to estimate the population proportion p

---

Fundamentals of Hypothesis Testing

Basic Concepts

Hypothesis testing formalizes decisions using sample data.
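The empirical-rule percentages and the Z-based interval structure above can be reproduced directly from the standard normal CDF, which needs only `math.erf` from the standard library; this is a sketch with invented sample values, not a production statistics routine:

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Empirical rule: P(-k < Z < k) for k = 1, 2, 3
for k in (1, 2, 3):
    print(f"within ±{k}σ: {phi(k) - phi(-k):.4f}")
# ≈ 0.6827, 0.9545, 0.9973 — the 68 / 95 / 99.7 rule

# Z-based 95% confidence interval for a mean:
#   x̄ ± 1.96 × σ/√n   (σ treated as known, n large)
xbar, sigma, n = 50.0, 5.0, 100        # hypothetical values
se = sigma / math.sqrt(n)              # standard error = 0.5
half_width = 1.96 * se
print(f"95% CI: {xbar - half_width:.2f} to {xbar + half_width:.2f}")
```

The same structure, estimate ± critical value × standard error, carries over to t-based intervals; only the critical value changes.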
- Null hypothesis (H₀)
  - Baseline statement, often representing no effect, no difference, or the current state
- Alternative hypothesis (H₁ or Ha)
  - Statement we consider if the evidence contradicts H₀
- Test statistic
  - Computed from sample data
  - Compared to a reference distribution (Z, t, chi-square, F)
- p-value
  - Probability of observing a test statistic as extreme as, or more extreme than, the one obtained, assuming H₀ is true
  - A small p-value suggests the data are inconsistent with H₀
- Significance level (α)
  - Threshold for deciding whether to reject H₀ (commonly 0.05)

Types of Errors

- Type I error (α)
  - Rejecting a true H₀ (false positive)
- Type II error (β)
  - Failing to reject a false H₀ (false negative)
- Power (1 − β)
  - Probability of correctly rejecting a false H₀
  - Increases with:
    - Larger sample size
    - Larger true effect size
    - Lower variability
    - Higher significance level (α)

One-tailed and Two-tailed Tests

- Two-tailed test
  - Alternative: the parameter is not equal to a value
  - Detects differences in both directions
- One-tailed test
  - Alternative: the parameter is greater than or less than a value
  - More power in one direction but ignores the other direction

---

Correlation and Simple Linear Regression

Correlation

Correlation measures the strength and direction of the linear association between two quantitative variables.

- Pearson correlation coefficient (r)
  - Range: −1 to +1
  - r > 0: positive linear relationship
  - r < 0: negative linear relationship
  - |r| close to 1: strong linear association
  - |r| close to 0: weak or no linear relationship
- Important points
  - Correlation does not imply causation
  - Strong nonlinear relationships may have low |r|
  - Outliers can strongly influence r

Simple Linear Regression

Regression models the relationship between a response (Y) and a single predictor (X).
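Before turning to the regression details, the hypothesis-testing workflow above can be sketched as a one-sample Z-test; the hypothesized mean, σ, and sample values below are invented for illustration:

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical one-sample Z-test (σ treated as known; values invented):
#   H0: μ = 12.0   versus   H1: μ ≠ 12.0   (two-tailed)
mu0, sigma, n, xbar = 12.0, 2.0, 36, 12.8

z = (xbar - mu0) / (sigma / math.sqrt(n))   # test statistic
p_value = 2.0 * (1.0 - phi(abs(z)))         # two-tailed p-value

alpha = 0.05
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With σ unknown and a small sample, the same logic would use the t distribution and `s` in place of σ.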
- Model form
  - Y = β₀ + β₁X + ε
  - β₀: intercept
  - β₁: slope (change in mean Y per unit change in X)
  - ε: random error term
- Estimated regression line
  - Ŷ = b₀ + b₁X
  - b₀ and b₁ are estimated from the data by least squares (minimizing the sum of squared residuals)
- Residuals
  - e = Y − Ŷ
  - Analyze residuals to assess:
    - Linearity
    - Constant variance
    - Independence
    - Approximate normality
- Coefficient of determination (R²)
  - Proportion of the variability in Y explained by X
  - Range: 0 to 1
  - Higher R² indicates a better fit (within the same context and assumptions)
- Inference on regression
  - Test whether the slope differs from zero
  - Confidence intervals for the slope and for predictions

---

Basic Concepts of Analysis of Variance (ANOVA)

Purpose of ANOVA

ANOVA compares means across more than two groups or conditions.

- Null hypothesis (H₀)
  - All group means are equal
- Alternative hypothesis (H₁)
  - At least one group mean differs

Variability Decomposition

ANOVA partitions total variability into components.

- Between-group variability
  - Differences among group means
- Within-group variability
  - Variation inside each group (error variation)
- F statistic
  - F = (between-group mean square) / (within-group mean square)
  - A large F suggests group means differ more than expected by chance
- Assumptions
  - Independent observations
  - Approximately normal distributions within groups
  - Homogeneity of variances across groups

---

Nonparametric Methods (Overview)

Nonparametric methods are used when data do not meet the assumptions required for parametric tests (e.g., normality).

- Typical uses
  - Ordinal data or ranks
  - Skewed distributions or heavy tails
  - Presence of outliers that cannot be removed
- Examples
  - Tests comparing medians or distributions using ranks
  - Rank-based correlation for ordinal or non-normal data

While the details of specific tests extend beyond basic statistics, the key is recognizing when nonparametric approaches are more appropriate than parametric methods.
---

Summary

Basic statistics provides essential tools to describe data, understand variation, and make inferences about populations from samples. Core ideas include distinguishing data types and measurement scales, summarizing data with measures of center and spread, and visualizing distributions and relationships. Probability concepts and distributions underpin the logic of sampling, estimation, and hypothesis testing. Correlation and simple regression model linear relationships, while ANOVA compares multiple group means. Together, these concepts form a coherent foundation for rigorous, data-based decision making.

Practical Case: Basic Statistics

A mid‑size electronics plant receives customer complaints about late shipments. The operations manager suspects packing-time variability in one product line. They collect packing time data for 30 consecutive orders from that line and 30 from a similar, well‑performing line. For each order, they record the total minutes from “ready to pack” to “packed.”

Using only basic statistics, they:

- Calculate the average packing time for each line to compare central tendency.
- Calculate the standard deviation and range to compare variability.
- Review a simple time‑series plot for visible shifts or spikes in the problem line’s process.
- Check the proportion of orders exceeding the internal packing-time target on both lines.

Findings:

- The problematic line has a similar average packing time but almost double the standard deviation.
- A much higher proportion of its orders exceed the target time.
- Spikes in the time‑series plot align with one specific shift.

Actions:

- Temporarily standardize work on that shift using best practice from the stable line.
- Provide brief retraining for that team and remove a non‑value‑added inspection step found only on that shift.

Within two weeks, a follow‑up data sample shows:

- A lower, more stable standard deviation in packing time.
- A reduced proportion of late packs.
- A corresponding drop in customer late‑shipment complaints.
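The comparison steps in this case can be sketched with Python's standard library; the packing-time samples and target below are invented stand-ins, not the plant's actual data:

```python
import statistics

# Invented packing times (minutes), for illustration only
stable_line = [8.1, 8.4, 8.0, 8.6, 8.3, 8.2, 8.5, 8.4, 8.1, 8.3]
problem_line = [7.2, 9.8, 8.0, 10.5, 7.5, 8.9, 6.9, 10.1, 8.2, 9.6]
target = 9.0  # hypothetical internal packing-time target (minutes)

for name, times in [("stable", stable_line), ("problem", problem_line)]:
    mean = statistics.mean(times)          # central tendency
    sd = statistics.stdev(times)           # variability
    over = sum(t > target for t in times) / len(times)  # proportion over target
    print(f"{name}: mean={mean:.2f} sd={sd:.2f} over-target={over:.0%}")
```

In this synthetic example the two lines have similar means but very different standard deviations and over-target proportions, mirroring the pattern the manager found.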

Practice question: Basic Statistics

A Black Belt is evaluating process capability and wants a single index that reflects both the process spread relative to specifications and the centering of the process mean. Which statistic is most appropriate?

A. Cp
B. Cpk
C. Pp
D. Ppk

Answer: B

Reason: Cpk accounts for both process variation and the distance between the process mean and the nearest specification limit, making it suitable when both spread and centering matter.

Other options: Cp and Pp ignore centering; Ppk reflects overall long-term performance, including special causes, not short-term capability.

---

A data set of cycle times is approximately normally distributed with mean 12 minutes and standard deviation 2 minutes. What proportion of observations is expected to fall between 10 and 16 minutes?

A. About 68%
B. About 81.5%
C. About 84%
D. About 95%

Answer: B

Reason: Convert to z-scores: (10 − 12)/2 = −1 and (16 − 12)/2 = 2. From the standard normal table, P(−1 < Z < 2) ≈ 0.8185 (≈ 81.9%).

Other options: 68% is ±1σ, 84% approximates the area below z = 1, and 95% is roughly ±2σ; none match the −1 to 2 range.

---

A Black Belt compares the variability of two machining processes using sample standard deviations s1 and s2 computed from independent samples. To test whether the variances are equal, which statistical test is most appropriate?

A. Two-sample t-test
B. F-test
C. Chi-square goodness-of-fit test
D. ANOVA F-test on means

Answer: B

Reason: The classical test for equality of two variances from normal populations uses the F statistic (the ratio of the sample variances).

Other options: The two-sample t-test and the ANOVA F-test compare means, not variances; the chi-square goodness-of-fit test assesses distributional fit, not equality of two variances.

---

A process produces measurements with sample mean 50 and sample standard deviation 5 from n = 100 observations. Assuming normality, what is the 95% confidence interval for the true process mean?

A. 50 ± 0.98
B. 50 ± 1.00
C. 50 ± 1.96
D. 50 ± 5.00

Answer: A

Reason: Standard error = 5/√100 = 0.5. For large n, the 95% CI ≈ 50 ± 1.96 × 0.5 = 50 ± 0.98.

Other options: ±1.00 is a rougher approximation; ±1.96 uses the critical value without multiplying by the standard error; ±5 is ±1s, not a confidence interval on the mean.

---

A Black Belt summarizes defect counts per unit using their mean and standard deviation. Which measure is most appropriate to quantify relative variability when comparing this data to a different metric with a much larger mean?

A. Range
B. Standard deviation
C. Coefficient of variation
D. Interquartile range

Answer: C

Reason: The coefficient of variation (CV = σ/μ) is scale-free and allows comparison of relative variability across metrics with different means.

Other options: The range, standard deviation, and interquartile range are absolute measures that depend on the scale of measurement, making cross-metric comparison less meaningful.
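The z-score and confidence-interval arithmetic in the questions above can be checked numerically with `math.erf`:

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Cycle-time question: P(10 < X < 16) with μ = 12, σ = 2
z_low = (10 - 12) / 2     # -1
z_high = (16 - 12) / 2    #  2
prob = phi(z_high) - phi(z_low)
print(f"P(10 < X < 16) = {prob:.4f}")        # ≈ 0.8186

# Confidence-interval question: x̄ = 50, s = 5, n = 100
se = 5 / math.sqrt(100)                      # standard error = 0.5
half_width = 1.96 * se
print(f"95% CI half-width = {half_width:.2f}")   # 0.98
```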
