Quick Summary
This guide will equip you to fully describe the distribution of a single quantitative variable and to compare multiple distributions. You will master the four key characteristics—Shape, Outliers, Center, and Spread (SOCS)—and learn to communicate your analysis effectively using precise statistical language and context, a critical skill for the AP exam. By the end of this lesson, you will be able to look at a graph or set of data, calculate key statistics, and write a complete, college-level description and comparison of distributions.
Key Concepts
Describing a distribution is one of the most fundamental skills in statistics. We use a consistent four-part framework, often remembered by the acronym SOCS. When asked to describe or compare distributions on the AP exam, you must address all four points.
The SOCS Framework
Shape: How are the data values distributed?
Outliers: Are there any unusual values that stand apart from the rest?
Center: Where is the "middle" or typical value of the distribution?
Spread: How much variability is there in the distribution?
1. Shape
The shape of a distribution can be described by its modality (number of peaks) and its skewness (symmetry).
Modality:
Unimodal: The distribution has one main peak. This is the most common shape.
Bimodal: The distribution has two distinct peaks. This often suggests there are two different subgroups in the data.
Multimodal: The distribution has more than two peaks.
Uniform: The distribution is roughly flat, with no clear peaks. Frequencies are evenly spread across the values.
[Image: A set of four simple histograms side-by-side, labeled Unimodal, Bimodal, Multimodal, and Uniform.]
Skewness (for unimodal distributions):
Symmetric: The left and right sides of the distribution are roughly mirror images of each other. The mean and median are approximately equal.
- Mean ≈ Median
Skewed Right (Positively Skewed): The "tail" of the distribution extends to the right. Most of the data is clustered on the left, with a few high values pulling the tail rightward. The mean is pulled in the direction of the tail, so it is greater than the median.
- Mean > Median
Skewed Left (Negatively Skewed): The "tail" of the distribution extends to the left. Most of the data is clustered on the right, with a few low values pulling the tail leftward. The mean is pulled to the left, so it is less than the median.
- Mean < Median
[Image: A set of three histograms side-by-side, labeled Symmetric, Skewed Right, and Skewed Left. Arrows should indicate the direction of the "tail" for the skewed graphs, and the relative positions of the mean and median should be marked.]
2. Outliers
Outliers are data points that fall far from the main cluster of data. On a graph, they appear as isolated points. For now, you can identify potential outliers by visual inspection. A formal rule, the 1.5 × IQR rule, will be covered later; for describing a distribution, simply noting "potential outliers" is sufficient.
- Effect of Outliers: Outliers can have a dramatic effect on certain statistical measures. This leads to the concept of resistance. A measure is resistant if it is not strongly influenced by extreme values (outliers).
3. Center
The center describes a "typical" value in the distribution. We have two primary measures of center.
Mean (x̄ for a sample, μ for a population):
Calculation: The sum of all data values divided by the number of values (the average).
Resistance: The mean is not resistant. A single high or low outlier can pull the mean significantly in its direction.
When to use: Best for symmetric distributions with no outliers.
Median (Med or M):
Calculation: The middle value when the data are sorted (the average of the two middle values when the number of observations is even). It is the 50th percentile.
Resistance: The median is resistant. Outliers have very little effect on the median because it only depends on the middle position, not the actual values of the extremes.
When to use: Best for skewed distributions or distributions with strong outliers.
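To see resistance in action, here is a minimal Python sketch. The data values are made up for illustration (they do not come from a real study), and the sketch uses only the standard-library `statistics` module. It replaces the largest value with an extreme one and reports how each measure of center responds.

```python
import statistics

# Illustrative data (hypothetical values chosen for this sketch).
data = [10, 12, 15, 15, 17, 20, 22, 30]

# Same data, but the largest value is replaced by an extreme outlier.
with_outlier = data[:-1] + [300]

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print(f"Mean:   {mean_before} -> {mean_after}")
print(f"Median: {median_before} -> {median_after}")
```

With these values the mean jumps from 17.625 to 51.375, while the median stays at 16, which is why the median is preferred when outliers are present.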
4. Spread (Variability)
Spread describes how spread out or clustered together the data are. We have three primary measures of spread.
Range:
Calculation: Maximum value - Minimum value.
Resistance: The range is not resistant. It is determined entirely by the two most extreme values. It is the simplest but least useful measure of spread.
Standard Deviation (s for a sample, σ for a population):
Concept: Measures the typical or average distance of a data point from the mean. A small standard deviation means data points are clustered tightly around the mean. A large standard deviation means data points are widely spread out.
Resistance: The standard deviation is not resistant. Because it uses the mean in its calculation, it is heavily affected by outliers.
When to use: Use in combination with the mean for symmetric distributions.
Interquartile Range (IQR):
Calculation: Q3 (75th percentile) - Q1 (25th percentile). It represents the range of the middle 50% of the data.
Resistance: The IQR is resistant. It ignores the lowest 25% and highest 25% of the data, so extreme values have little or no effect on it.
When to use: Use in combination with the median for skewed distributions.
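As a reference point for the standard deviation described above, the usual textbook formula for the sample standard deviation is shown below, followed by a tiny worked example. The three data values in the example are made up purely to keep the arithmetic visible.

$$
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$

For the values 2, 4, 6: the mean is $\bar{x} = 4$, the squared deviations are 4, 0, and 4, and so $s = \sqrt{8/2} = 2$. Because every deviation is measured from the mean and then squared, one extreme value inflates the sum quickly, which is why the standard deviation is not resistant.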
Summary of Choosing Measures:
| Distribution Shape | Best Measure of Center | Best Measure of Spread |
|---|---|---|
| Roughly Symmetric | Mean | Standard Deviation |
| Skewed (Left or Right) | Median | Interquartile Range (IQR) |
| Has Strong Outliers | Median | Interquartile Range (IQR) |
Comparing Distributions
When asked to compare two or more distributions, you must still address SOCS, but your answer must use explicit comparative language. Do not just list the characteristics for each group separately.
* **Good Comparison:** "The median salary for Group A ($55,000) is **higher than** the median salary for Group B ($50,000)."
* **Bad Description (not a comparison):** "The median salary for Group A is $55,000. The median salary for Group B is $50,000."
You must use words like *greater than, less than, similar to, more variable than, less skewed than* for **each** element of SOCS.

## Key Vocabulary
- **Distribution**: An arrangement of the values a variable takes on in a sample or population, showing their frequency of occurrence.
- **Symmetric Distribution**: A distribution in which the right and left sides are approximate mirror images of each other around the center.
- **Skewed Distribution**: An asymmetric distribution where a long "tail" of values extends to one side. It is skewed in the direction of the tail (e.g., a tail to the right is skewed right).
- **Mean**: The arithmetic average of a dataset. It is sensitive to outliers.
- **Median**: The midpoint of a distribution; the 50th percentile. It is resistant to outliers.
- **Standard Deviation**: A measure of the typical amount that a value in a dataset deviates from the mean. It is sensitive to outliers.
- **Interquartile Range (IQR)**: The range of the middle 50% of the data, calculated as Q3 - Q1. It is resistant to outliers.
- **Outlier**: An observation that falls abnormally far from the other values in a dataset.

## Calculator Tech (TI-84)
To describe a distribution, you first need to calculate its summary statistics. The TI-84 can do this quickly.
**Goal:** Calculate summary statistics (mean, median, standard deviation, etc.) for a list of quantitative data.
**Example Data:** {10, 12, 15, 15, 17, 20, 22, 30}
1. **Enter Data into a List:**
   * Press `STAT`.
   * Select `1:Edit...`.
   * Enter your data into a list, for example, `L1`. If `L1` has old data, use the arrow keys to highlight `L1` at the top, press `CLEAR`, then `ENTER`.
   * Type each number and press `ENTER`.
2. **Calculate One-Variable Statistics:**
   * Press `STAT` again.
   * Use the right arrow to go to the `CALC` menu.
   * Select `1:1-Var Stats`.
   * The `1-Var Stats` menu will appear.
   * **List:** Make sure it says `L1` (or whichever list you used). To get `L1`, press `2nd` -> `1`.
   * **FreqList:** Leave this blank unless you have a frequency table.
   * **Calculate:** Highlight and press `ENTER`.
3. **Read the Output:**
   * `x̄`: The mean of the sample. (For our example: 17.625)
   * `Sx`: The sample standard deviation. This is the one you will almost always use in AP Statistics. (For our example: 6.35)
   * `σx`: The population standard deviation. (Ignore this for now.)
   * `n`: The sample size. (For our example: 8)
   * Use the down arrow to see more:
   * `minX`: The minimum value. (10)
   * `Q1`: The first quartile (25th percentile). (13.5)
   * `Med`: The median (50th percentile). (16)
   * `Q3`: The third quartile (75th percentile). (21)
   * `maxX`: The maximum value. (30)
From this single screen, you get the mean, median, standard deviation, and the values needed to calculate the IQR (Q3 - Q1 = 21 - 13.5 = 7.5) and range (30 - 10 = 20).
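If you want to double-check a calculator screen away from the calculator, the following Python sketch recomputes the same quantities for the example data. The function name `one_var_stats` is hypothetical, and the quartile rule (median of the lower half and median of the upper half) is an assumption intended to mirror the calculator's convention; other software may use a different rule and give slightly different quartiles.

```python
import statistics


def one_var_stats(values):
    """Return summary statistics similar to a 1-Var Stats screen.

    Quartiles are the medians of the lower and upper halves of the sorted
    data; the middle value is left out of both halves when n is odd.
    (Assumed convention for this sketch.)
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    return {
        "mean": statistics.mean(data),
        "Sx": statistics.stdev(data),        # sample SD (divides by n - 1)
        "sigma_x": statistics.pstdev(data),  # population SD (divides by n)
        "n": n,
        "min": data[0],
        "Q1": statistics.median(data[:half]),
        "median": statistics.median(data),
        "Q3": statistics.median(data[n - half:]),
        "max": data[-1],
    }


stats = one_var_stats([10, 12, 15, 15, 17, 20, 22, 30])
iqr = stats["Q3"] - stats["Q1"]
data_range = stats["max"] - stats["min"]

for name, value in stats.items():
    print(f"{name}: {value}")
print(f"IQR: {iqr}")
print(f"Range: {data_range}")
```

The sketch keeps `Sx` and `sigma_x` separate on purpose: the only difference between them is dividing by n - 1 versus n.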
How to Show Work on the FRQ
For descriptive statistics questions, your "work" is a well-written paragraph that clearly communicates your analysis. Always remember the three C's: Context, Comparison, and Clarity.
Template for Describing a Single Distribution
When describing the distribution of a single quantitative variable, structure your response around SOCS.
"The distribution of [variable in context with units] is [shape: skewed right/left or roughly symmetric] and [modality: unimodal/bimodal]. The center is best described by the [median/mean] because the distribution is [skewed/symmetric]. The median is [value with units] and the mean is [value with units]. The spread is best described by the [IQR/standard deviation]. The IQR is [value with units] and the standard deviation is [value with units]. There [are/are not] any apparent outliers at approximately [value(s) with units]."
Template for Comparing Two Distributions
When comparing two distributions, use the same SOCS framework but with comparative language for every component.
"The distribution of [variable, group 1] is [shape], while the distribution of [variable, group 2] is [shape]. The center of the distribution for group 1 (median/mean = [value]) is [higher than/lower than/similar to] the center for group 2 (median/mean = [value]). The distribution for group 1 is [more variable/less variable/similarly variable] than for group 2, as shown by its [larger/smaller] [IQR/standard deviation] of [value] compared to group 2's [value]. Both distributions appear to have [outliers/no outliers], with potential outliers for group 1 at [value(s)] and for group 2 at [value(s)]."
Practice Problems
Problem 1:
The following stem-and-leaf plot shows the number of points scored by a high school basketball team in each of its 20 games last season.
2 | 8 9
3 | 2 5 6 8 8
4 | 0 1 1 3 4 5 5 7
5 | 0 1 2
6 | 1 9
Key: 2|8 = 28 points
Describe the distribution of points scored by the team.
Solution:
First, we enter the 20 data points into L1 on the calculator and run 1-Var Stats to get the summary statistics: Mean (x̄) = 43.25, StDev (Sx) ≈ 10.10, Min = 28, Q1 = 37, Median = 42, Q3 = 48.5, Max = 69.
Now, we apply the FRQ template for a single distribution.
The distribution of points scored by the basketball team is skewed to the right and appears to be unimodal. Because the distribution is skewed, the median is the more appropriate measure of center. The median number of points scored was 42 points. The spread is best described by the IQR, which is Q3 - Q1 = 48.5 - 37 = 11.5 points. There is a potential outlier at 69 points, as it is separated from the rest of the data by a gap.
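Typing a stemplot into a list by hand is an easy place to slip, so it can help to recompute the summary statistics independently. The sketch below expands the stem-and-leaf plot from this problem into individual scores and summarizes them. The variable names are illustrative, and the quartiles are taken as the medians of the lower and upper halves, which is assumed to match the calculator's rule.

```python
import statistics

# Stem-and-leaf plot from Problem 1, keyed as 2|8 = 28 points.
stemplot = {
    2: [8, 9],
    3: [2, 5, 6, 8, 8],
    4: [0, 1, 1, 3, 4, 5, 5, 7],
    5: [0, 1, 2],
    6: [1, 9],
}

# Expand each stem/leaf pair into a score: stem * 10 + leaf.
scores = sorted(
    stem * 10 + leaf for stem, leaves in stemplot.items() for leaf in leaves
)

n = len(scores)
half = n // 2
q1 = statistics.median(scores[:half])      # median of the lower half
q3 = statistics.median(scores[n - half:])  # median of the upper half
median = statistics.median(scores)

print("n =", n)
print("mean =", statistics.mean(scores))
print("Sx =", round(statistics.stdev(scores), 2))
print("five-number summary:", scores[0], q1, median, q3, scores[-1])
print("IQR =", q3 - q1)
```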
Problem 2:
As part of a study on physical fitness, a researcher collected data on the number of push-ups completed in one minute by a group of 30 male students and a separate group of 30 female students. The parallel boxplots below display the results. Compare the distributions of the number of push-ups for males and females.
[Image: Two parallel boxplots. The "Males" boxplot is positioned higher on the y-axis than the "Females" boxplot.
Male Boxplot: Min=15, Q1=25, Median=30, Q3=35, Max=50.
Female Boxplot: Min=5, Q1=12, Median=18, Q3=22, Max=35. There is one outlier marked with an asterisk at 35 for the females.]
Solution:
We apply the FRQ template for comparing two distributions, using comparative language for each part of SOCS.
Shape: The distribution of the number of push-ups for males appears to be roughly symmetric, while the distribution for females appears to be skewed to the right.
Outliers: Unlike the distribution for males, which has no apparent outliers, the distribution for females has one potential high outlier at approximately 35 push-ups.
Center: The center of the distribution of push-ups for males is substantially higher than for females. The median for males (30 push-ups) is much greater than the median for females (18 push-ups).
Spread: The middle 50% of the two distributions is similarly variable: the IQR for males (35 - 25 = 10 push-ups) is equal to the IQR for females (22 - 12 = 10 push-ups). However, the overall range for males (50 - 15 = 35 push-ups) is larger than the range for females (35 - 5 = 30 push-ups), so the male counts are somewhat more spread out overall. (Note: Comparing either IQR or range with justification is acceptable.)
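The arithmetic behind a boxplot comparison is easy to check mechanically. The sketch below takes the five-number summaries read from the parallel boxplots in this problem and computes the IQR and range for each group, then picks a comparative phrase for each pair of numbers. The helper names `spread` and `compare` are hypothetical and exist only for this illustration.

```python
# Five-number summaries as read from the parallel boxplots in Problem 2:
# (min, Q1, median, Q3, max), in push-ups per minute.
summaries = {
    "males": (15, 25, 30, 35, 50),
    "females": (5, 12, 18, 22, 35),
}


def spread(five_number_summary):
    """Return (IQR, range) for a five-number summary."""
    minimum, q1, _median, q3, maximum = five_number_summary
    return q3 - q1, maximum - minimum


def compare(a, b):
    """Return a comparative phrase describing a relative to b."""
    if a > b:
        return "greater than"
    if a < b:
        return "less than"
    return "equal to"


male_median = summaries["males"][2]
female_median = summaries["females"][2]
male_iqr, male_range = spread(summaries["males"])
female_iqr, female_range = spread(summaries["females"])

print(f"Male median ({male_median}) is {compare(male_median, female_median)} "
      f"female median ({female_median})")
print(f"Male IQR ({male_iqr}) is {compare(male_iqr, female_iqr)} "
      f"female IQR ({female_iqr})")
print(f"Male range ({male_range}) is {compare(male_range, female_range)} "
      f"female range ({female_range})")
```

A script like this only handles the numbers; the written answer still needs context (push-ups, males, females) and a statement about shape and outliers.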
Common Mistakes to Avoid
Forgetting Context: This is the most common mistake. Always relate your numbers back to the variable being measured. Don't just say "The median is 42." Say "The median number of points scored was 42 points." Context is required for full credit.
Listing vs. Describing: Do not just list the 5-number summary (Min, Q1, Med, Q3, Max) or the mean and standard deviation. You must use these numbers to describe the shape, center, and spread in complete sentences. A list of numbers is not a description.
Using Inappropriate Statistics for the Shape: A classic error is describing a heavily skewed distribution using the mean and standard deviation. If you identify a distribution as skewed, you must use the median and IQR as your primary measures of center and spread, as they are resistant to the skew and any outliers.
Failing to Use Comparative Language: When asked to "compare," you must use words like "greater than," "less than," or "similar to." Describing two distributions in separate paragraphs without explicitly comparing them will not earn full credit.
Confusing Skew Direction: A simple way to remember the direction of skew is that the skew follows the tail. If the long tail is on the right (higher values), it is skewed right. If the long tail is on the left (lower values), it is skewed left.