2.2.2 Descriptive Statistics

Introduction

Descriptive statistics summarize and organize data so patterns, central tendencies, and variation become clear. They provide the foundation for understanding a process or system before moving into more advanced statistical methods.

Descriptive statistics focus on:

- What the data look like
- Where the data are centered
- How spread out the data are
- How the data are shaped
- How data sets compare to each other

This article covers the concepts and calculations needed to confidently use descriptive statistics in data-driven problem solving.

---

Types of Data

Qualitative vs Quantitative Data

Understanding the type of data is the first step before selecting descriptive measures.

- Qualitative (categorical)
  - Describe categories or labels
  - Examples: defect type, machine ID, region, color
  - Summarized by: counts, proportions, mode, frequency tables, bar charts, pie charts
- Quantitative (numeric)
  - Represent measured or counted values
  - Examples: cycle time, length, weight, cost, number of defects
  - Summarized by: mean, median, standard deviation, histograms, boxplots, run charts

Discrete vs Continuous Data

Quantitative data divide further into discrete and continuous.

- Discrete data
  - Counted values, often integers
  - Examples: number of errors per unit, number of calls, number of defects
  - Usually come from counting events or items
- Continuous data
  - Measured on a scale with potentially infinite gradations
  - Examples: time, temperature, length, pressure, weight
  - Usually come from measurement instruments

This classification influences which graphs and statistics are most informative, though many descriptive tools apply to both types.

---

Frequency Distributions and Tables

Frequency and Relative Frequency

A frequency distribution shows how often each value or category occurs.

- Frequency: the count of occurrences in each category or class
- Relative frequency: frequency divided by total number of observations
- Percent frequency: relative frequency × 100

For categorical data:

- Create a frequency table listing:
  - Category
  - Frequency
  - Relative frequency (or percent)

For numeric data:

- Group values into class intervals (bins)
- Count how many observations fall into each interval

Choosing Class Intervals

For continuous data, class intervals should:

- Cover the entire range of data
- Be mutually exclusive (no overlap)
- Be of equal width (for standard histograms)

Practical guidelines:

- Use 5 to 15 classes for most data sets
- Class width ≈ (max − min) / number of classes
- Round class limits to convenient values

---

Graphical Descriptions of Data

Histograms

A histogram displays the frequency distribution of quantitative data.

Key points:

- Horizontal axis: class intervals (ranges of values)
- Vertical axis: frequency or relative frequency
- Bars touch, indicating a continuous scale

Histograms help identify:

- Shape (symmetry or skewness)
- Central region
- Spread
- Potential outliers or gaps
- Multiple peaks (modes)

Bar Charts and Pie Charts

For categorical data:

- Bar chart
  - Horizontal axis: categories
  - Vertical axis: frequency or percent
  - Bars separated by space (categories are discrete)
  - Useful for comparing categories
- Pie chart
  - Circle divided into slices proportional to percent frequency
  - Emphasizes contribution of each category to the whole
  - Less precise for detailed comparison than bar charts

Stem-and-Leaf Plots

Stem-and-leaf plots show the distribution of small to moderate data sets while preserving the actual data values.

- Split each observation into:
  - Stem: leading digit(s)
  - Leaf: last digit
- Example: for 43, 47, 52, 58
  - Stems: 4, 5
  - Leaves on 4-stem: 3, 7
  - Leaves on 5-stem: 2, 8

Benefits:

- Show shape of distribution
- Maintain individual data values
- Quickly identify center and spread

Boxplots

Boxplots (box-and-whisker plots) summarize distribution using five key numbers.

- Elements:
  - Minimum (not including outliers)
  - First quartile (Q1)
  - Median (Q2)
  - Third quartile (Q3)
  - Maximum (not including outliers)
- Box: from Q1 to Q3 (interquartile range, IQR)
- Line inside box: median
- Whiskers: extend from box to min and max within a limit (commonly 1.5 × IQR from the quartiles)
- Points beyond whiskers: potential outliers

Boxplots are powerful for:

- Comparing distributions across groups
- Visualizing center, spread, skewness, and outliers

Run Charts and Time Plots

When data are collected over time, plotting in time order reveals patterns.

- Run chart
  - Horizontal axis: time (or sequence)
  - Vertical axis: measured value
  - Often includes a reference line (e.g., mean or median)

Run charts reveal:

- Trends (systematic increases or decreases)
- Shifts (sudden changes in level)
- Cycles or seasonal patterns
- Short-term fluctuations

Although often used for process monitoring, run charts are also descriptive tools to understand data behavior across time.

---

Measures of Central Tendency

Mean

The mean is the arithmetic average and is the most widely used measure of center.

For a sample of size n:

- Sample mean: \[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

For a population:

- Population mean: \[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]

Key characteristics:

- Uses all data values
- Sensitive to extreme values (outliers)
- Frequently used in further calculations (e.g., variance, standard deviation)

Median

The median is the middle value when data are ordered.

- If n is odd: median is the middle observation
- If n is even: median is the average of the two middle observations

Properties:

- Resistant to extreme values
- Useful when data are skewed or contain outliers
- Splits ordered data into two halves of equal count

Mode

The mode is the most frequently occurring value or category.

- For numeric data:
  - May be unimodal (one clear peak)
  - May be bimodal or multimodal (two or more peaks)
- For categorical data:
  - Category with the highest frequency

The mode is useful for:

- Identifying typical categories
- Detecting multiple clusters in numeric data

Comparing Mean, Median, and Mode

Relationship to distribution shape (typical pattern for unimodal data):

- Symmetric distribution:
  - Mean ≈ median ≈ mode
- Right-skewed distribution:
  - Mean > median > mode
- Left-skewed distribution:
  - Mean < median < mode

Choice of measure:

- Use mean when data are symmetric without extreme outliers
- Use median when data are skewed or contain outliers
- Use mode for categorical data or when identifying peaks in distribution

---

Measures of Variation

Range

The range is the simplest measure of spread.

- Range = maximum − minimum

Characteristics:

- Easy to compute
- Uses only two data points
- Highly sensitive to outliers
- Gives a rough sense of total spread, but not internal variation

Variance

Variance measures the average squared deviation from the mean.

For a sample of size n:

- Sample variance: \[ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \]

For a population:

- Population variance: \[ \sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \]

Key features:

- Always nonnegative
- In squared units of the original data
- Forms the basis for standard deviation and many inferential methods

Standard Deviation

Standard deviation is the square root of variance and is the most commonly used measure of spread.

For a sample:

- Sample standard deviation: \[ s = \sqrt{s^2} \]

For a population:

- Population standard deviation: \[ \sigma = \sqrt{\sigma^2} \]

Interpretation:

- Measures typical distance of data points from the mean
- Expressed in the same units as the original data
- Larger standard deviation means greater variability

Standard deviation is central for:

- Describing dispersion
- Comparing variability between groups
- Relating data to the normal distribution
- Calculating many process-related metrics

Coefficient of Variation

The coefficient of variation (CV) expresses variation relative to the mean.

For sample data:

- \[ CV = \frac{s}{\bar{x}} \times 100\% \]

Uses:

- Comparing variability across different units or scales
  - Example: compare variability of time in minutes vs cost in dollars
- Comparing variability between groups with different means

Interpretation:

- Higher CV indicates more variability relative to the mean
- Only meaningful when data are on a ratio scale and mean > 0

---

Percentiles and Quartiles

Percentiles

A percentile indicates the value at or below which a given percentage of observations fall.

- p-th percentile (Pₚ):
  - Value where p% of data are at or below that value
- Examples:
  - 25th percentile: 25% of values are at or below this point
  - 90th percentile: 90% of values are at or below this point

Percentiles are useful for:

- Setting performance benchmarks
- Describing relative standing (e.g., “in the 95th percentile”)

Quartiles

Quartiles divide ordered data into four equal parts.

- First quartile (Q1): 25th percentile
- Second quartile (Q2): 50th percentile (median)
- Third quartile (Q3): 75th percentile

Quartiles are key to:

- Constructing boxplots
- Measuring middle spread (via interquartile range)

Interquartile Range (IQR)

The interquartile range measures the spread of the middle 50% of the data.

- IQR = Q3 − Q1

Characteristics:

- Resistant to extreme values
- Useful for:
  - Summarizing middle variability
  - Identifying potential outliers

Common outlier rule using IQR:

- Lower fence = Q1 − 1.5 × IQR
- Upper fence = Q3 + 1.5 × IQR
- Values outside these fences are often flagged as outliers

---

Shape of Distributions

Symmetry and Skewness

Understanding shape helps interpret which measures of center and spread are most meaningful.

- Symmetric distribution
  - Left and right sides mirror each other
  - Mean and median are close
  - Example: ideal normal distribution
- Right-skewed (positively skewed)
  - Tail extends to the right (toward larger values)
  - Mean > median
  - Often arises from:
    - Time to complete a task
    - Cost or income distributions
    - Data bounded below by zero
- Left-skewed (negatively skewed)
  - Tail extends to the left (toward smaller values)
  - Mean < median
  - Can arise from upper-bounded data (e.g., near 100% yields)

Modality

Modality indicates how many peaks a distribution has.

- Unimodal: one main peak
- Bimodal: two main peaks
- Multimodal: more than two peaks

Multiple peaks often indicate:

- Mixtures of different subgroups
- Changes in process conditions
- Need to separate data by relevant factors for clearer analysis

Kurtosis (Conceptual)

Kurtosis describes how concentrated data are around the mean and in the tails.

- High kurtosis (leptokurtic):
  - More data near mean and heavy tails
  - More extreme values than a normal distribution
- Low kurtosis (platykurtic):
  - Less data near mean and lighter tails
  - Fewer extreme values than normal

In descriptive work, kurtosis is mostly used conceptually to:

- Compare tail behavior to the normal distribution
- Indicate potential issues with extreme values

---

Descriptive Statistics for Categorical Data

Frequency Tables and Proportions

For qualitative data, description focuses on counts and proportions.

Key measures:

- Frequency of each category
- Relative frequency (proportion) of each category
- Percent frequency of each category

These measures reveal:

- Most common categories
- Rare categories
- Overall pattern of occurrence

Cross-Tabulations (Contingency Tables)

When analyzing two categorical variables together, cross-tabulations show joint frequencies.

- Structure:
  - Rows: categories of one variable
  - Columns: categories of the other variable
  - Cells: counts (or percentages) for each combination

Uses:

- Describing relationships between categorical variables
- Comparing category distributions across groups

Common summaries:

- Row percentages
- Column percentages
- Overall cell percentages

---

Descriptive Statistics for Paired and Multiple Variables

Covariance and Correlation (Descriptive View)

When examining the relationship between two quantitative variables, descriptive statistics can summarize how they move together.

- Covariance (conceptual description):
  - Positive: variables tend to increase together
  - Negative: when one increases, the other tends to decrease
  - Magnitude depends on units, so it is not standardized
- Correlation coefficient (r):
  - Standardized measure of linear association, ranging from −1 to +1
  - Values:
    - Near +1: strong positive linear relationship
    - Near 0: little or no linear relationship
    - Near −1: strong negative linear relationship
  - Unitless, so comparisons across different variables are possible

At the descriptive level, correlation is used to:

- Summarize strength and direction of linear relationships
- Complement scatterplots

Scatterplots

Scatterplots are primary descriptive tools for two quantitative variables.

- Horizontal axis: predictor or input variable
- Vertical axis: response or output variable
- Each point: one observation’s pair of values

Scatterplots help identify:

- Direction of relationship (positive, negative, none)
- Strength and form (linear, nonlinear)
- Outliers or unusual patterns
- Clusters or subgroups

---

Outliers and Data Integrity

Identifying Outliers

Outliers are observations that are distant from the rest of the data.

Common descriptive indicators:

- Points far from others in histograms or scatterplots
- Points beyond boxplot whiskers (using IQR fences)
- Values more than a certain number of standard deviations from the mean (e.g., beyond ±3s in approximately normal data)

Outliers may arise from:

- Measurement or recording errors
- Changes in process conditions
- Rare but genuine extreme events

Handling Outliers

Before deciding what to do with outliers:

- Check for data errors and correct them if found
- Investigate context (time, conditions, source)

Possible actions:

- Keep them if they represent real, meaningful variation
- Analyze with and without outliers to assess impact
- Document any decision to remove or adjust them

Outliers strongly influence:

- Mean and standard deviation
- Range
- Correlation

They affect both descriptive summaries and later inferential analyses, so careful consideration is necessary.
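
The IQR fence rule and the "analyze with and without outliers" step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the cycle-time values and the helper names `quartiles` and `iqr_fences` are hypothetical, and quartiles are taken as the medians of the lower and upper halves (excluding the overall median when n is odd), which is only one of several common conventions.

```python
from statistics import mean, median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    The overall median is excluded from both halves when n is odd.
    Other quartile conventions can give slightly different results.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for the k x IQR rule."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical cycle times in minutes; 48 is a deliberately extreme value.
cycle_times = [12, 14, 15, 15, 16, 17, 18, 19, 48]

low, high = iqr_fences(cycle_times)
flagged = [x for x in cycle_times if x < low or x > high]
kept = [x for x in cycle_times if low <= x <= high]

print("fences:", low, high)
print("flagged:", flagged)
print("mean with / without:", round(mean(cycle_times), 2), round(mean(kept), 2))
print("median with / without:", median(cycle_times), median(kept))
```

With these made-up values the single flagged point moves the mean by several minutes while the median shifts by only half a minute, which is the resistance property described for the median and IQR.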

---

Using Descriptive Statistics Effectively

Stepwise Descriptive Approach

A systematic descriptive approach typically follows this order:

- Understand the data type:
  - Qualitative vs quantitative
  - Discrete vs continuous
- Visualize the data:
  - Histograms, boxplots, stem-and-leaf for numeric data
  - Bar charts, pie charts for categorical data
  - Run charts or time plots if data are time-ordered
  - Scatterplots for pairs of quantitative variables
- Summarize numerically:
  - Center: mean, median, mode
  - Spread: range, variance, standard deviation, IQR
  - Position: percentiles, quartiles
  - Relationships: correlation (for paired quantitative variables)
- Interpret in context:
  - Compare groups or time periods
  - Relate shape, center, and spread to process behavior
  - Watch for skewness, multimodality, and outliers

Comparing Groups

Descriptive statistics are essential for comparing different groups or conditions.

For each group:

- Visualize:
  - Overlaid histograms
  - Side-by-side boxplots
  - Grouped bar charts (for categorical outcomes)
- Summarize:
  - Means and medians
  - Standard deviations and IQRs
  - Sample sizes (for understanding stability of estimates)

Look for:

- Differences in location (center)
- Differences in variability
- Differences in distribution shape
- Presence or absence of outliers in each group

---

Summary

Descriptive statistics organize and summarize data to reveal key characteristics before deeper analysis.

Core elements include:

- Understanding data types (qualitative vs quantitative; discrete vs continuous)
- Displaying data through histograms, bar charts, boxplots, run charts, stem-and-leaf plots, and scatterplots
- Measuring center with mean, median, and mode
- Measuring spread with range, variance, standard deviation, interquartile range, and coefficient of variation
- Describing position using percentiles and quartiles
- Interpreting distribution shape, including symmetry, skewness, modality, and kurtosis
- Summarizing relationships between variables with cross-tabulations, covariance, correlation, and scatterplots
- Identifying and assessing outliers and their impact

Mastering these descriptive tools allows for clear understanding and communication of how data behave, forming a solid foundation for any subsequent statistical or process analysis.
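
As a compact illustration of the measures of center and spread summarized above, the sketch below computes them with Python's standard `statistics` module. The eight data values are hypothetical and chosen only so the arithmetic is easy to follow; `variance` and `stdev` use the n − 1 (sample) divisor, while `pvariance` uses the N (population) divisor.

```python
from statistics import mean, median, mode, pvariance, stdev, variance

# Hypothetical sample of eight measurements (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "n": len(data),
    "mean": mean(data),
    "median": median(data),
    "mode": mode(data),
    "range": max(data) - min(data),
    "sample_variance": variance(data),       # divides by n - 1
    "population_variance": pvariance(data),  # divides by N
    "sample_std_dev": stdev(data),           # square root of sample variance
}

# Coefficient of variation in percent; meaningful only for ratio-scale
# data with a positive mean.
summary["cv_percent"] = 100 * summary["sample_std_dev"] / summary["mean"]

for name, value in summary.items():
    print(f"{name}: {value}")
```

For this sample the mean is 5, the median 4.5, the mode 4, and the sample standard deviation about 2.14, giving a CV near 43%. The same sum of squared deviations (32) yields a sample variance of 32/7 but a population variance of 4, which shows the effect of the two divisors.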

Practical Case: Descriptive Statistics

A regional lab-testing company receives frequent complaints about late COVID test results. Management suspects turnaround time (TAT) varies by clinic and by day of week but has no clear picture.

The Black Belt leads a 2-week data pull of all completed tests: location, day of week, and TAT in hours.

Using only descriptive statistics, the team:

- Calculates mean, median, minimum, maximum, and standard deviation of TAT overall and by clinic.
- Builds simple tables of TAT by day of week and by shift (day vs night).
- Plots TAT distributions per clinic (boxplots) to compare spread and outliers.
- Identifies the proportion of tests meeting the 24-hour target for each clinic.

Findings show:

- Two clinics with similar average TAT but much higher spread and many extreme outliers.
- Mondays have the highest median TAT and lowest on-time percentage.
- Night shift has slightly lower average TAT but far fewer samples, indicating underutilized capacity.

Based on these descriptive insights, the company:

- Rebalances sample loading from Monday to Tuesday/Wednesday.
- Moves some daytime processing to night shift for the two problem clinics.
- Sets a simple KPI: weekly median TAT and percent within 24 hours per clinic.

Within one month, the median TAT drops by several hours, and the proportion of on-time results increases substantially, with management now reviewing these descriptive statistics in a weekly dashboard.
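
The per-clinic KPI from this case (median TAT and percent within 24 hours) could be computed along the following lines. This is a sketch under stated assumptions: the clinic names and TAT values are invented for illustration and are not the company's data, and "on time" is taken to mean a TAT of 24 hours or less.

```python
from collections import defaultdict
from statistics import mean, median, stdev

TARGET_HOURS = 24  # on-time target named in the case

# Hypothetical (clinic, TAT in hours) records, invented for illustration.
records = [
    ("Clinic A", 18), ("Clinic A", 22), ("Clinic A", 20), ("Clinic A", 26),
    ("Clinic B", 12), ("Clinic B", 20), ("Clinic B", 41), ("Clinic B", 9),
]

# Group turnaround times by clinic.
tat_by_clinic = defaultdict(list)
for clinic, hours in records:
    tat_by_clinic[clinic].append(hours)

# Descriptive summary per clinic.
kpis = {}
for clinic, tats in tat_by_clinic.items():
    on_time = sum(1 for t in tats if t <= TARGET_HOURS)
    kpis[clinic] = {
        "n": len(tats),
        "mean_tat": mean(tats),
        "median_tat": median(tats),
        "std_dev_tat": stdev(tats),
        "pct_within_target": 100 * on_time / len(tats),
    }

for clinic, k in kpis.items():
    print(clinic, k)
```

With these made-up records the two clinics have similar means (21.5 and 20.5 hours), but Clinic B's standard deviation is roughly four times larger, the kind of pattern the team's boxplots would expose.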

Practice question: Descriptive Statistics

A Black Belt is analyzing cycle time data that are highly right-skewed due to a few very large outliers. Which measure of central tendency is most appropriate to summarize the “typical” cycle time for management reporting?

- A. Mean
- B. Median
- C. Mode
- D. Midrange

Answer: B

Reason: For skewed distributions with outliers, the median is the most robust and representative measure of central tendency because it is not unduly influenced by extreme values. The mean and midrange are highly affected by outliers; the mode may not reflect the central location of continuous data.

---

A Black Belt has the following ordered defect count data from 9 samples: 1, 2, 2, 3, 4, 4, 4, 6, 9. What is the interquartile range (IQR) of this data set?

- A. 2
- B. 3
- C. 4
- D. 5

Answer: B

Reason: For 9 points, Q2 (median) is the 5th value (= 4). Q1 is the median of the first 4 values: (2 + 2)/2 = 2. Q3 is the median of the last 4 values: (4 + 6)/2 = 5. Therefore, IQR = Q3 − Q1 = 5 − 2 = 3. The other options miscalculate the quartiles or apply a different convention; for example, including the median in both halves gives Q3 = 4 and an IQR of 2.

---

A Black Belt reports that the process mean is 50 units and the standard deviation is 5 units based on a large sample. Assuming normality, approximately what percentage of observations is expected to lie between 45 and 55 units?

- A. 50%
- B. 68%
- C. 95%
- D. 99.7%

Answer: B

Reason: 45 to 55 corresponds to ±1 standard deviation from the mean (50 ± 5). For a normal distribution, about 68% of observations lie within ±1σ. 95% and 99.7% correspond to ±2σ and ±3σ, and 50% represents only half of the distribution.

---

A Black Belt is comparing the variability of two key CTQs that have different units and different means. Which descriptive statistic is most appropriate to compare their relative variability?

- A. Standard deviation
- B. Range
- C. Coefficient of variation
- D. Interquartile range

Answer: C

Reason: The coefficient of variation (CV = standard deviation / mean) allows comparison of relative variability across variables with different scales and means. Standard deviation, range, and IQR measure absolute spread and are not directly comparable across different units or magnitudes.

---

A Black Belt computed the sample variance of setup time from 25 observations as 16 min². What is the estimated standard deviation, and which statement is most accurate?

- A. s = 4 minutes; this is the sample standard deviation.
- B. s = 4 minutes; this is the unbiased estimate of population variance.
- C. s = 8 minutes; this is the sample standard deviation.
- D. s = 8 minutes; this is the population standard deviation.

Answer: A

Reason: Standard deviation is the square root of variance, so √16 = 4 minutes. Since the variance given is a sample variance, its square root is the sample standard deviation. The other options either double the value or confuse variance with standard deviation and sample with population parameters.
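
The numeric answers above can be checked with a few lines of Python. This is only a verification sketch: it uses the same median-of-halves quartile convention as the stated reasoning, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, median

# IQR question: quartiles as medians of the lower and upper halves,
# excluding the overall median because n = 9 is odd.
defects = [1, 2, 2, 3, 4, 4, 4, 6, 9]
half = len(defects) // 2
q1 = median(defects[:half])       # first four values
q3 = median(defects[half + 1:])   # last four values
iqr = q3 - q1

# Normal-distribution question: share of values within mean +/- 1 sd.
process = NormalDist(mu=50, sigma=5)
share_within_one_sd = process.cdf(55) - process.cdf(45)

# Setup-time question: standard deviation is the square root of variance.
sample_std_dev = sqrt(16)

print("Q1, Q3, IQR:", q1, q3, iqr)
print("share between 45 and 55:", round(share_within_one_sd, 4))
print("sample standard deviation (min):", sample_std_dev)
```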
