Biased and Unbiased | AP Stats Unit 5 Study Guide

Quick Summary

This guide will equip you to evaluate the quality of a statistical estimate. You will learn to distinguish between a population parameter and a sample statistic, and understand that the value of a statistic varies from sample to sample. By analyzing the sampling distribution of a statistic, you will be able to determine if an estimator is biased or unbiased and assess its variability, ultimately allowing you to choose the best possible statistic for estimating an unknown population parameter.

Key Concepts

Parameter vs. Statistic: This is the foundational concept of statistical inference.
- A parameter is a numerical value that describes a characteristic of an entire population. Parameters are often denoted by Greek letters (e.g., $μ$ for population mean, $σ$ for population standard deviation), although the population proportion is conventionally written $p$. Parameters are typically unknown and are what we want to estimate.
- A statistic is a numerical value that describes a characteristic of a sample. We use Roman letters to denote statistics (e.g., $\bar{x}$ for sample mean, $\hat{p}$ for sample proportion, $s$ for sample standard deviation). We calculate statistics from our sample data to estimate unknown parameters.
Sampling Variability: If you take many different random samples from the same population, you will almost always get a different value for your statistic each time. For example, the sample mean height of 30 students in one sample will likely be different from the sample mean height of a different 30 students. This natural, expected, sample-to-sample fluctuation is called sampling variability. Our goal is not to eliminate it, but to understand and quantify it.
The Sampling Distribution: To understand an estimator, we imagine taking all possible samples of a specific size $n$ from a population. If we calculated our statistic (e.g., the sample mean $\bar{x}$) for every single one of these samples and made a graph of all the results, that graph would be the sampling distribution of the statistic. This distribution is the key to evaluating any estimator.
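
Listing every possible sample is rarely practical, so a sampling distribution is usually approximated by simulation. The sketch below is only an illustration: the population of heights is made up, and the helper name `simulate_sampling_distribution` is hypothetical. It draws many random samples of size 30 and records the sample mean of each.

```python
import random
import statistics

def simulate_sampling_distribution(population, n, statistic, reps=2000, seed=0):
    """Approximate the sampling distribution of `statistic` for samples of size n."""
    rng = random.Random(seed)
    # Each repetition draws one simple random sample (without replacement).
    return [statistic(rng.sample(population, n)) for _ in range(reps)]

# Hypothetical population: 5,000 student heights in cm (made-up values).
pop_rng = random.Random(42)
population = [pop_rng.gauss(170, 10) for _ in range(5000)]

sample_means = simulate_sampling_distribution(
    population, n=30, statistic=statistics.mean
)

mu = statistics.mean(population)  # the parameter, known here only because we built the population
print(f"population mean (parameter):   {mu:.2f}")
print(f"mean of simulated sample means: {statistics.mean(sample_means):.2f}")
print(f"SD of simulated sample means:   {statistics.stdev(sample_means):.2f}")
```

A dotplot or histogram of `sample_means` is an approximation of the sampling distribution of $\bar{x}$ for $n = 30$; its center and spread are the two features examined in the rest of this guide.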
Evaluating Estimators: Bias and Variability: We judge a statistic based on two criteria related to its sampling distribution:
1. Bias: This refers to the accuracy of an estimator. It compares the center of the sampling distribution to the true parameter value.
  - An unbiased estimator is a statistic whose sampling distribution has a mean that is exactly equal to the true value of the parameter it is trying to estimate. In the long run, an unbiased estimator does not systematically over- or underestimate the parameter. When samples are selected at random, the sample mean ($\bar{x}$) and sample proportion ($\hat{p}$) are unbiased estimators of the population mean $μ$ and the population proportion $p$, respectively.
  - A biased estimator is a statistic whose sampling distribution has a mean that is not equal to the true parameter value. It systematically overestimates or underestimates the parameter. For example, the sample range is a biased estimator of the population range: a sample's range can never exceed the population's range and is usually smaller, so on average it underestimates the true value.
2. Variability: This refers to the precision of an estimator. It is described by the spread of its sampling distribution, usually measured by the standard deviation.
  - An estimator with low variability is preferred. This means that if we were to take many samples, the values of our statistic would be tightly clustered and consistent, giving us more confidence in any single estimate.
  - An estimator with high variability means the values of our statistic would be very spread out over many samples. Any single estimate is less reliable because it could be far from the center by chance.
The Ideal Estimator: The goal is to use an estimator with low bias and low variability. This means our estimates are both accurate (centered on the true value) and precise (tightly clustered).

[Image: Four targets illustrating the concepts of bias and variability. Target 1 (top-left): High Bias, High Variability (shots are scattered and not near the bullseye). Target 2 (top-right): Low Bias, High Variability (shots are centered around the bullseye but widely scattered). Target 3 (bottom-left): High Bias, Low Variability (shots are tightly clustered but far from the bullseye). Target 4 (bottom-right): Low Bias, Low Variability (shots are tightly clustered around the bullseye - the ideal scenario).]
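
The contrast between an unbiased and a biased statistic can be checked by simulation. The sketch below is an illustration under made-up assumptions: the population is simply the integers 0 to 100, and each sample has size 10. It estimates the bias of the sample mean and of the sample range by comparing the average of many simulated values with the true parameter.

```python
import random
import statistics

def sample_range(values):
    """Range of a data set: maximum minus minimum."""
    return max(values) - min(values)

rng = random.Random(1)

# Hypothetical population: the integers 0..100, so the population range is 100.
population = list(range(101))
true_range = sample_range(population)
true_mean = statistics.mean(population)

ranges, means = [], []
for _ in range(5000):
    sample = rng.sample(population, 10)  # simple random sample of size 10
    ranges.append(sample_range(sample))
    means.append(statistics.mean(sample))

# Estimated bias = (center of simulated sampling distribution) - (true parameter).
range_bias = statistics.mean(ranges) - true_range
mean_bias = statistics.mean(means) - true_mean

print(f"estimated bias of sample mean:  {mean_bias:+.2f}")
print(f"estimated bias of sample range: {range_bias:+.2f}")
print(f"largest sample range observed:  {max(ranges)} (population range {true_range})")
```

The estimated bias of the sample mean should land near zero, while the estimated bias of the sample range should be clearly negative, because no sample can be wider than the population it came from.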

The Effect of Sample Size ( $n$ ):
- Bias: Increasing the sample size does not fix bias. If your sampling method is flawed or your statistic is inherently biased, taking a larger sample just means you will get a more precise wrong answer.
- Variability: Increasing the sample size reduces the variability of the sampling distribution for most common statistics (like $\bar{x}$ and $\hat{p}$). A larger sample provides more information, leading to more consistent and precise estimates. This is a major reason why we prefer larger random samples.
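
Both bullets can be seen in one simulation. The sketch below rests on made-up assumptions: a hypothetical population of 10,000 scores, and a flawed sampling frame in which only the higher-scoring half of the population can ever be selected. For each sample size it reports the center and spread of the simulated sample means under random sampling and under the flawed method.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population of 10,000 scores (made-up values).
population = [rng.gauss(50, 12) for _ in range(10_000)]
mu = statistics.fmean(population)

# Flawed sampling frame (an assumption for illustration): only the
# higher-scoring half of the population can ever be selected.
flawed_frame = sorted(population)[5_000:]

def center_and_spread(frame, n, reps=1000):
    """Mean and SD of `reps` simulated sample means, each from n draws with replacement."""
    means = [statistics.fmean(rng.choices(frame, k=n)) for _ in range(reps)]
    return statistics.fmean(means), statistics.stdev(means)

results = {}
for n in (10, 50, 200):
    results[n] = {
        "random": center_and_spread(population, n),
        "flawed": center_and_spread(flawed_frame, n),
    }
    (rc, rs), (fc, fs) = results[n]["random"], results[n]["flawed"]
    print(f"n={n:3d}  random: center={rc:.2f} SD={rs:.2f}   "
          f"flawed: center={fc:.2f} SD={fs:.2f}   (true mean {mu:.2f})")
```

In a run of this sketch, the spread should shrink as $n$ grows under both methods, but the center under the flawed method should stay well above the true mean at every sample size: a more precise wrong answer.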

Key Vocabulary

Parameter: A numerical characteristic of a population (e.g., $μ$, $p$). It is a fixed value, but usually unknown.
Statistic: A numerical characteristic of a sample (e.g., $\bar{x}$, $\hat{p}$). Its value is known for a given sample, but it varies from sample to sample.
Sampling Distribution: The probability distribution of a statistic, formed by considering all possible random samples of a fixed size $n$ from a population.
Unbiased Estimator: A statistic whose sampling distribution is centered exactly at the true value of the parameter it estimates. The mean of the sampling distribution equals the parameter.
Bias: A measure of the accuracy of an estimator. It is the difference between the mean of the sampling distribution and the true value of the parameter.
Variability (of a statistic): A measure of the precision of an estimator. It is the spread of its sampling distribution, typically quantified by the standard deviation of the statistic.

Calculator Tech (TI-84)

No major calculator functions are required for this topic. The concepts of bias and variability are theoretical and are typically assessed by interpreting graphs and scenarios, not by performing calculations.

How to Show Work on the FRQ

Free-Response Questions on this topic will almost never involve calculations. Instead, they will present you with graphical representations of sampling distributions (usually dotplots) for two or more different estimators and ask you to compare them and choose the best one. Your response must be a well-written comparison using precise statistical language.

Template for Comparing Two Estimators (e.g., Statistic A vs. Statistic B)

When asked to compare two estimators and choose the better one, structure your response in three parts:

Compare Bias:
- "To evaluate bias, I will compare the center (mean) of each sampling distribution to the true population parameter of [state the true parameter value, e.g., $μ = 10$ ]."
- "The sampling distribution for Statistic A appears to be centered at approximately [estimate the center from the graph]. Since this is [equal to / not equal to] the true parameter, Statistic A is an [unbiased / biased] estimator."
- "The sampling distribution for Statistic B appears to be centered at approximately [estimate the center from the graph]. Since this is [equal to / not equal to] the true parameter, Statistic B is an [unbiased / biased] estimator."
Compare Variability:
- "To evaluate variability, I will compare the spread of the two sampling distributions."
- "The values for Statistic A range from approximately [min value] to [max value], showing [lesser/greater] spread. The values for Statistic B range from approximately [min value] to [max value], showing [lesser/greater] spread."
- "Because the distribution for Statistic [A or B] is more tightly clustered, it has lower variability."
Conclusion and Choice:
- "An ideal estimator has both low bias and low variability."
- "In this case, Statistic [A or B] is the better estimator because it is [unbiased or has lower bias] and it has lower variability than Statistic [the other one]."
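
On the exam you read the center and spread off the dotplots, but the same comparison can be rehearsed numerically. The sketch below is an illustration under made-up assumptions: a Normal population with mean 10 and standard deviation 4, samples of size 40, and the sample mean and sample median standing in for Statistic A and Statistic B. The helper name `summarize` is hypothetical.

```python
import random
import statistics

def summarize(values, true_value):
    """Return (estimated bias, spread) for simulated values of one statistic."""
    bias = statistics.mean(values) - true_value  # center minus parameter
    spread = statistics.stdev(values)            # SD of the simulated values
    return bias, spread

rng = random.Random(3)
TRUE_MU = 10  # same example parameter value as in the template

stat_a, stat_b = [], []
for _ in range(2000):
    # Hypothetical population model: Normal(10, 4); each sample has size 40.
    sample = [rng.gauss(TRUE_MU, 4) for _ in range(40)]
    stat_a.append(statistics.mean(sample))    # Statistic A: sample mean
    stat_b.append(statistics.median(sample))  # Statistic B: sample median

bias_a, sd_a = summarize(stat_a, TRUE_MU)
bias_b, sd_b = summarize(stat_b, TRUE_MU)

print(f"Statistic A: estimated bias {bias_a:+.3f}, SD {sd_a:.3f}")
print(f"Statistic B: estimated bias {bias_b:+.3f}, SD {sd_b:.3f}")
```

For this symmetric population both statistics should be centered near 10, so the choice comes down to variability, and the statistic with the smaller SD is the one the template's conclusion would name.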

Practice Problems

Problem 1:

A researcher wants to estimate the true median household income in a town, which is known to be \$50,000. They are considering two different statistics to estimate this median, Statistic X and Statistic Y. They generate 100 random samples of size 40, calculate both statistics for each sample, and create the dotplots of the resulting 100 values for each statistic below.

[Image: Two dotplots side-by-side. The x-axis for both is "Household Income (\$)". A vertical line is drawn at [...]

[...] ($μ$) of its employees, rated on a scale of 1 to 10. The true mean score for all employees is $μ = 7.2$. Three different managers take random samples of employees and calculate the sample mean satisfaction score, $\bar{x}$. Manager A uses a sample of size $n = 10$, Manager B uses $n = 50$, and Manager C uses $n = 200$. The dotplots below show the simulated sampling distributions for these three sample sizes.

[Image: Three dotplots stacked vertically, labeled Plot 1, Plot 2, and Plot 3. All have an x-axis from 5 to 9. A vertical line is at 7.2 on all plots, labeled "True Mean".

Plot 1: Dots are extremely spread out, from roughly 5.5 to 8.5. The center is at 7.2.
Plot 2: Dots are moderately spread out, from roughly 6.5 to 7.9. The center is at 7.2.
Plot 3: Dots are very tightly clustered, from roughly 7.0 to 7.4. The center is at 7.2.]

Identify which plot (1, 2, or 3) corresponds to which sample size ($n = 10$, $n = 50$, $n = 200$). Justify your answer in the context of bias and variability.

Solution:

To match each plot to its sample size, I will analyze the bias and variability of the sampling distributions shown. The sample mean ($\bar{x}$) is an unbiased estimator of the population mean ($μ$). As expected, all three plots are centered at the true population mean of $μ = 7.2$, so none of the three shows bias. The key difference between the plots is their variability.

Variability of a sampling distribution decreases as the sample size $n$ increases. Therefore, the plot with the largest spread must correspond to the smallest sample size, and the plot with the smallest spread must correspond to the largest sample size.

Plot 1 shows the greatest variability, with values for $\bar{x}$ ranging widely from about 5.5 to 8.5. This corresponds to the smallest sample size, $n = 10$.

Plot 2 shows moderate variability, with values more clustered than in Plot 1. This corresponds to the medium sample size, $n = 50$ .

Plot 3 shows the least variability, with values tightly packed around the true mean of 7.2. This corresponds to the largest sample size, $n = 200$ .

In summary, increasing the sample size reduces the variability of the sampling distribution of the sample mean without affecting its bias, leading to more precise estimates.
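
The ordering can also be made quantitative with the standard formula for the standard deviation of the sample mean. This assumes the employees in each sample are selected independently, and it writes $σ$ for the population standard deviation of the satisfaction scores, a value the problem does not give:

$$σ_{\bar{x}} = \frac{σ}{\sqrt{n}}$$

Under that assumption, the standard deviations for the three sample sizes are in the ratio

$$\frac{σ}{\sqrt{10}} : \frac{σ}{\sqrt{50}} : \frac{σ}{\sqrt{200}} \approx 0.316σ : 0.141σ : 0.071σ$$

so the sampling distribution for $n = 10$ is about $\sqrt{20} \approx 4.5$ times as spread out as the one for $n = 200$, whatever the value of $σ$ is.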

Common Mistakes to Avoid

Confusing Bias and Variability: Do not use these terms interchangeably. Bias is about the center (accuracy) of the sampling distribution. Variability is about the spread (precision). An estimator can have low bias but high variability, or vice-versa. Always address both properties separately.
Thinking a Larger Sample Size Fixes Bias: This is a critical misunderstanding. Taking a larger sample will not correct for a biased sampling method (like a convenience sample) or a biased statistic. A large sample from a biased method simply gives you a very precise, but very wrong, answer. Sample size reduces variability, not bias.
Confusing a Single Sample with the Sampling Distribution: Bias is a property of the estimator in the long run (across all possible samples), not a property of a single sample. You cannot look at one sample mean that is far from the population mean and conclude that the sample mean is a "biased estimator." That single result is due to sampling variability.
Mixing up Population, Sample, and Sampling Distributions: Be precise with your language. Data is collected from a sample. We use that data to make an inference about the population. The sampling distribution is a theoretical concept that tells us how a statistic would behave if we took many, many samples. Never say "the sample is normal"; you should say "the sampling distribution is approximately normal."
