Residuals | AP Stats Unit 2 Study Guide

Quick Summary

This guide covers residuals in linear regression. You will learn how to calculate a residual, which measures the signed vertical distance from the value predicted by the regression line to the actual data point. By the end of this lesson, you should be able to interpret any residual in context: its sign gives the direction of the prediction error, and its magnitude gives the size.

Key Concepts

A residual is the cornerstone of evaluating how well a linear model fits our data. It is a measure of error for a single data point.

Definition: A residual is the difference between the observed (actual) y-value of a data point and the predicted y-value (written as ŷ, read "y-hat") for that same point using the least-squares regression line (LSRL). In simple terms, it's the leftover error in our prediction.
The Formula: The calculation is straightforward and essential to memorize.
- Formula: $\text{residual} = \text{actual } y - \text{predicted } y$
- In symbols: $\text{residual} = y - \hat{y}$
[Image: A scatterplot with a least-squares regression line drawn. One data point (x, y) is highlighted above the line. A vertical line segment is drawn from the point down to the regression line at (x, ŷ). This vertical segment is labeled "Residual = y - ŷ (Positive)".]
Interpreting the Sign of a Residual:
- Positive Residual (residual > 0): This occurs when $y > \hat{y}$. The actual data point is above the regression line. This means the LSRL underestimated the actual value.
- Negative Residual (residual < 0): This occurs when $y < \hat{y}$. The actual data point is below the regression line. This means the LSRL overestimated the actual value.
- Zero Residual (residual = 0): This occurs when $y = \hat{y}$. The actual data point is exactly on the regression line. This means the LSRL made a perfect prediction for this specific point.
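The formula and the three sign cases above can be sketched in a few lines of Python. This is only an illustration: the function names `residual` and `describe_sign` and the sample values are made up for this example and are not part of the AP curriculum.

```python
def residual(actual_y, predicted_y):
    """Residual = actual - predicted, i.e. y - y_hat."""
    return actual_y - predicted_y


def describe_sign(res):
    """Translate the sign of a residual into the wording used in this guide."""
    if res > 0:
        return "positive: point above the line, model underestimated"
    if res < 0:
        return "negative: point below the line, model overestimated"
    return "zero: point on the line, prediction was exact"


# Hypothetical (actual, predicted) pairs, chosen only to hit all three cases.
for y, y_hat in [(12, 10), (7, 10), (10, 10)]:
    r = residual(y, y_hat)
    print(r, "->", describe_sign(r))
```

Note that the subtraction order is fixed inside `residual`, so the sign always follows the actual-minus-predicted convention.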
Interpreting the Magnitude of a Residual:
- The absolute value of a residual tells you the size of the prediction error. A small residual (e.g., -0.5) indicates a more accurate prediction than a large residual (e.g., 15.2).
- The data point with the largest absolute value residual is the point for which the linear model made the worst prediction.
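To see why magnitude is judged by absolute value rather than by sign, here is a small Python sketch. The data pairs are hypothetical and exist only for illustration.

```python
# Hypothetical (actual y, predicted y-hat) pairs, invented for illustration.
points = [(20.0, 20.5), (30.0, 45.5), (18.0, 14.5), (50.0, 42.0)]

# Residual = actual - predicted for each point.
residuals = [y - y_hat for y, y_hat in points]

# Rank by absolute value: the sign gives direction, not size.
worst = max(residuals, key=abs)
best = min(residuals, key=abs)

print("residuals:", residuals)
print("worst prediction has residual", worst)
print("best prediction has residual", best)
```

In this made-up data the worst prediction has a negative residual, which shows that the most negative or most positive residual can be the largest error.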
A Fundamental Property of the LSRL:
- The least-squares regression line is the unique line that minimizes the sum of the squared residuals.
- A direct consequence of this is that the sum of all residuals for an LSRL is always zero. This is a mathematical guarantee and a useful check.
- $\sum (y - \hat{y}) = 0$
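Both properties can be checked numerically. The sketch below fits a least-squares line to a small hypothetical data set using the standard slope and intercept formulas, then confirms that the residuals sum to zero (up to rounding) and that nudging the line increases the sum of squared residuals. The helper names `fit_lsrl` and `sum_sq` are assumptions made for this example.

```python
def fit_lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]


def sum_sq(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x on this data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


intercept, slope = fit_lsrl(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
sse = sum_sq(intercept, slope)

print("sum of residuals:", sum(residuals))   # zero up to floating-point rounding
print("sum of squared residuals:", sse)
print("shifted line does worse:", sum_sq(intercept + 0.1, slope) > sse)
```

A sum of residuals that is clearly not zero is a sign that either the line is not the LSRL or an arithmetic slip has occurred.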

Key Vocabulary

Residual: The difference between an actual observed y-value and the y-value predicted by the regression line (ŷ).

Set Type: to the first option (scatterplot).
Set Xlist: to L1 (your explanatory variable).
Set Ylist: to RESID (using the steps above).
Press ZOOM -> 9:ZoomStat to see the residual plot.
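For readers without a calculator at hand, the sketch below builds the same pairs a residual plot displays: each x-value matched with its residual. The data are hypothetical, and the line ŷ = 1.8 + 0.8x is the least-squares line for that particular made-up data set; the name `predict` is an assumption for this example.

```python
# Hypothetical data; x plays the role of L1 on the calculator.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]


def predict(x):
    """Least-squares line for this hypothetical data: y-hat = 1.8 + 0.8x."""
    return 1.8 + 0.8 * x


# These values play the role of the RESID list.
resid = [y - predict(x) for x, y in zip(xs, ys)]

# A residual plot is a scatterplot of (x, residual) pairs.
plot_points = list(zip(xs, resid))
for x, r in plot_points:
    print(f"x = {x}, residual = {r:+.1f}")
```

Plotting these pairs with the explanatory variable on the horizontal axis gives the same picture as the calculator steps above.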

How to Show Work on the FRQ

Interpreting a residual is a common FRQ task. To get full credit, you must provide a clear, contextual interpretation. Use the following three-part template.

FRQ Template: Interpreting a Residual

Identify: State the actual (y) and predicted (ŷ) values in context.
Compare: State whether the actual value was higher or lower than the predicted value, and by how much (the value of the residual).
Conclude: State whether the model resulted in an overestimate or an underestimate.
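One way to see how the three parts fit together is a small Python function that assembles them into a single sentence. This is a study aid under stated assumptions, not an official scoring rubric: the function name `interpret_residual` and its parameters are invented here, and the sample call uses the home-price numbers from Problem 1 in this guide.

```python
def interpret_residual(x_context, response, actual, predicted, units):
    """Fill in the Identify / Compare / Conclude template for one data point."""
    res = actual - predicted  # residual = actual - predicted
    if res == 0:
        return (
            f"The residual of 0 {units} for a {x_context} means that the actual "
            f"{response} of {actual:,} {units} matched the value predicted by "
            f"the linear model exactly."
        )
    direction = "higher" if res > 0 else "lower"
    verdict = "underestimated" if res > 0 else "overestimated"
    return (
        f"The residual of {res:,} {units} for a {x_context} means that the actual "
        f"{response} of {actual:,} {units} was {abs(res):,} {units} {direction} than "
        f"the value of {predicted:,} {units} predicted by the linear model. "
        f"Therefore, the model {verdict} the {response}."
    )


sentence = interpret_residual(
    "home with 2,000 square feet", "selling price", 335_000, 350_000, "dollars"
)
print(sentence)
```

The function states the actual and predicted values (Identify), the direction and size of the gap (Compare), and the over- or underestimate (Conclude), all in context.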

Example Sentence Structure:

"The residual of [residual value with units] for a [context of x-value] means that the actual [response variable in context] of [y-value] was [|residual value|] [units] [higher/lower] than the value of [ŷ-value] predicted by the linear model. Therefore, the model [overestimated/underestimated] the [response variable in context]."

Practice Problems

Problem 1:

A real estate agent uses a linear model to predict the selling price of homes based on their square footage. The least-squares regression line is:

$\text{predicted price} = 50{,}000 + 150 \times (\text{square footage})$

A specific home with 2,000 square feet of space sold for an actual price of \$335,000.

(a) Calculate the residual for this home.
(b) Interpret the residual in the context of the problem.

Solution:

(a) Calculation:

- Step 1: Find the predicted price ($\hat{y}$).
  $\hat{y} = 50{,}000 + 150 \times 2{,}000$
  $\hat{y} = 50{,}000 + 300{,}000$
  $\hat{y} = \$350{,}000$
- Step 2: Use the residual formula.
  $\text{residual} = \text{actual price } (y) - \text{predicted price } (\hat{y})$
  $\text{residual} = \$335{,}000 - \$350{,}000$
  $\text{residual} = -\$15{,}000$

The residual for this home is -\$15,000.

(b) Interpretation:

Using the FRQ template: The residual of -\$15,000 for a home with 2,000 square feet means that the actual selling price of \$335,000 was \$15,000 lower than the price of \$350,000 predicted by the linear model. Therefore, the model overestimated the selling price of this home.

Problem 2:

The scatterplot below shows the relationship between the number of hours a student studied for a final exam and their score on the exam. The least-squares regression line is also shown. Point A represents a student named Maria.

[Image: A scatterplot with "Hours Studied" on the x-axis and "Exam Score" on the y-axis. The points show a positive, linear association. An LSRL is drawn through the data. Point A is clearly located below the regression line.]

(a) Is the residual for Maria positive, negative, or zero? Explain your reasoning.
(b) What does the residual for Maria tell us about the model's prediction for her exam score?

Solution:

(a) Sign of the Residual: The residual for Maria is negative.

Reasoning: Point A, which represents Maria's data, is located vertically below the least-squares regression line. This means her actual exam score (y) was lower than the exam score predicted by the line (ŷ) for the number of hours she studied. Since the residual is calculated as actual - predicted, a lower actual value results in a negative residual.

(b) Interpretation: A negative residual for Maria means that the linear model overestimated her exam score based on the number of hours she studied. Her actual score on the exam was lower than the score the regression line would have predicted.

Common Mistakes to Avoid

1. Incorrect Formula Order: A very common mistake is to calculate the residual as predicted - actual ($\hat{y} - y$). This gives you the correct number but the wrong sign, leading to a completely incorrect interpretation. Always remember: Residual = Actual - Predicted (AP).
2. Misinterpreting the Sign: Confusing what positive and negative residuals mean. A simple way to remember: a Positive residual means the point is "aPart" from the line (above it), and the prediction was too low (an underestimate). A Negative residual means the point is "uNder" the line, and the prediction was too high (an overestimate).
3. Forgetting Context: On an FRQ, simply stating "the model overestimated" is not enough. You must specify what was overestimated or underestimated in the context of the problem (e.g., "the model overestimated the selling price of the home").
4. Confusing the Sum of Residuals with the Sum of Squared Residuals: The sum of the residuals ($\sum (y - \hat{y})$) for an LSRL is always zero. The sum of the squared residuals ($\sum (y - \hat{y})^2$) is what the LSRL minimizes, and this sum will be a positive value (unless all points fall perfectly on the line).

5. Stating a Residual is "Good" or "Bad" Without Scale: A residual of -50 might seem large, but if the y-values are in the millions, it's a tiny error. The "size" of a residual is relative to the scale of the response variable. Avoid subjective labels unless you are comparing it to other residuals in the same dataset.
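The sketch below reworks Problem 1 in Python and then shows two of the mistakes above in action: reversing the subtraction order flips the sign, and the same residual looks different once it is compared with the scale of the response variable. The ratio `relative_size` is just one informal way to express scale, chosen for this example; it is not an AP formula.

```python
def predicted_price(square_feet):
    """LSRL from Problem 1: predicted price = 50,000 + 150 * (square footage)."""
    return 50_000 + 150 * square_feet


actual = 335_000                 # actual selling price from Problem 1
y_hat = predicted_price(2_000)   # predicted price for a 2,000 sq ft home

res = actual - y_hat             # correct order: actual - predicted
wrong_order = y_hat - actual     # mistake 1: same size, opposite sign

# Informal scale check (an assumption of this sketch): error as a share of price.
relative_size = abs(res) / actual

print("predicted:", y_hat)
print("residual:", res)
print("wrong-order result:", wrong_order)
print(f"error as a share of the actual price: {relative_size:.1%}")
```

The wrong-order result would lead to the conclusion that the model underestimated the price, which is the opposite of what happened.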
