## Quick Summary
This guide will equip you to master the concept of residuals in linear regression. You will learn how to calculate a residual, which measures the vertical distance between an actual data point and the predicted value on a regression line. By the end of this lesson, you will be able to interpret the meaning of any residual in context, understanding that its sign and magnitude reveal the direction and size of the prediction error.
## Key Concepts
A residual is the cornerstone of evaluating how well a linear model fits our data. It is a measure of error for a single data point.
Definition: A residual is the difference between the observed (actual) y-value of a data point and the predicted y-value (written as ŷ, read "y-hat") for that same point using the least-squares regression line (LSRL). In simple terms, it's the leftover error in our prediction.
The Formula: The calculation is straightforward and essential to memorize.
Formula: `residual = actual y-value - predicted y-value`
In symbols: `residual = y - ŷ`
[Image: A scatterplot with a least-squares regression line drawn. One data point (x, y) is highlighted above the line. A vertical line segment is drawn from the point down to the regression line at (x, ŷ). This vertical segment is labeled "Residual = y - ŷ (Positive)".]
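The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lesson: the function names `predict` and `residual`, the line `ŷ = 2 + 3x`, and the data point are all hypothetical.

```python
def predict(intercept, slope, x):
    """Return y-hat for a line of the form y-hat = intercept + slope * x."""
    return intercept + slope * x


def residual(actual_y, predicted_y):
    """Return the residual: actual value minus predicted value."""
    return actual_y - predicted_y


# Hypothetical line and data point, chosen only for illustration.
y_hat = predict(intercept=2.0, slope=3.0, x=4.0)   # 2 + 3 * 4 = 14
res = residual(actual_y=16.5, predicted_y=y_hat)   # 16.5 - 14 = 2.5
print(f"predicted = {y_hat}, residual = {res}")
```

Because the actual value (16.5) is above the predicted value (14), the residual is positive, matching the situation pictured in the scatterplot.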
Interpreting the Sign of a Residual:
- Positive Residual (residual > 0): This occurs when `y > ŷ`. The actual data point is above the regression line. This means the LSRL underestimated the actual value.
- Negative Residual (residual < 0): This occurs when `y < ŷ`. The actual data point is below the regression line. This means the LSRL overestimated the actual value.
- Zero Residual (residual = 0): This occurs when `y = ŷ`. The actual data point is exactly on the regression line. This means the LSRL made a perfect prediction for this specific point.
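The three cases above can be expressed as a small Python helper. This is a sketch for illustration only; the function name `describe_residual` and its return format are assumptions, not part of the lesson.

```python
def describe_residual(actual_y, predicted_y):
    """Return (residual, position of the point, what the model did)."""
    res = actual_y - predicted_y
    if res > 0:
        return res, "above the line", "underestimated"
    if res < 0:
        return res, "below the line", "overestimated"
    return res, "on the line", "predicted exactly"


# Hypothetical values, for illustration only.
print(describe_residual(10, 8))   # actual is higher than predicted
print(describe_residual(5, 9))    # actual is lower than predicted
print(describe_residual(7, 7))    # actual equals predicted
```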
Interpreting the Magnitude of a Residual:
The absolute value of a residual tells you the size of the prediction error. A residual close to zero (e.g., -0.5) indicates a more accurate prediction than a residual far from zero (e.g., 15.2); the sign tells you only the direction of the error, not its size.
The data point with the largest absolute residual is the point for which the linear model made the worst prediction.
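Finding the worst prediction comes down to comparing absolute values, which can be sketched as follows. The function name `worst_prediction` and the data lists are hypothetical and serve only to illustrate the idea.

```python
def worst_prediction(actual, predicted):
    """Return (index, residual) for the point with the largest |residual|."""
    residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]
    worst = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
    return worst, residuals[worst]


# Hypothetical actual and predicted values, for illustration only.
actual = [12.0, 30.0, 21.5, 40.0]
predicted = [12.5, 14.75, 22.0, 41.0]

index, res = worst_prediction(actual, predicted)
print(f"worst prediction: point {index}, residual = {res}")
```

The comparison uses `abs(...)`, so a large negative residual counts as a worse prediction than a small positive one.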
A Fundamental Property of the LSRL:
The least-squares regression line is the unique line that minimizes the sum of the squared residuals.
A direct consequence of this is that the residuals of an LSRL always sum to zero (apart from rounding error in hand or calculator work). This is a mathematical guarantee and a useful check on your calculations.
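Both properties can be checked numerically. The sketch below assumes the standard slope and intercept formulas for a least-squares line (slope = Sxy / Sxx, intercept = ȳ - slope · x̄), which the lesson does not state explicitly; the helper names and the data are hypothetical.

```python
def fit_lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals_for(xs, ys, intercept, slope):
    """Return the list of residuals y - y-hat for the given line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sse(residuals):
    """Return the sum of the squared residuals."""
    return sum(r ** 2 for r in residuals)


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_lsrl(xs, ys)
lsrl_res = residuals_for(xs, ys, a, b)

# Check 1: the residuals sum to zero, up to floating-point rounding.
print("sum of residuals is ~0:", abs(sum(lsrl_res)) < 1e-9)

# Check 2: nudging the slope away from the LSRL makes the SSE larger.
nudged_res = residuals_for(xs, ys, a, b + 0.1)
print("LSRL has the smaller SSE:", sse(lsrl_res) < sse(nudged_res))
```

Any other line, whether the slope or the intercept is changed, should give a larger sum of squared residuals than the LSRL.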
## Key Vocabulary
- **Residual**: The difference between an actual observed y-value and the y-value predicted by the regression line (`y - ŷ`). It measures the vertical distance from a point to the line.
- **Predicted Value (ŷ)**: The value of the response variable predicted by the least-squares regression line for a given explanatory variable value (x).
- **Actual Value (y)**: The observed, real-world value of the response variable for a given data point.
- **Least-Squares Regression Line (LSRL)**: The line that makes the sum of the squared residuals as small as possible, providing the "best fit" for a linear relationship in the data.
- **Underestimate**: A prediction (ŷ) that is smaller than the actual value (y), resulting in a positive residual.
- **Overestimate**: A prediction (ŷ) that is larger than the actual value (y), resulting in a negative residual.

## Calculator Tech (TI-84)
After performing a linear regression on your TI-84, the calculator automatically computes and stores the residuals for every data point. Here's how to access them.

**Prerequisite:** You must first run a linear regression.
1. Enter your explanatory variable data into list `L1` and your response variable data into list `L2`. (`STAT -> 1:Edit...`)
2. Calculate the LSRL. (`STAT -> CALC -> 8:LinReg(a+bx)`). Make sure your inputs are `Xlist: L1`, `Ylist: L2`. You can optionally store the regression equation into Y1 by adding `Store RegEQ: Y1`. Press `Calculate`.

**Accessing the List of Residuals:**
The calculator automatically creates a list named `RESID` containing the residuals in the same order as your data points.
1. Press `2nd -> STAT [LIST]`.
2. The first option in the `NAMES` menu is `1: L1`, `2: L2`, etc. Scroll down until you find `RESID`.
3. Press `ENTER` to select it.

You can now use this list like any other. For example, to view the residuals, go to your home screen, select `RESID` from the list menu, and press `ENTER`.
To store them in `L3` for easier viewing, go to the list editor (`STAT -> 1:Edit...`), highlight the header for `L3`, select `RESID` from the list menu, and press `ENTER`.

**Creating a Residual Plot (A key skill for Unit 2.8):**
A residual plot graphs the residuals against the explanatory variable (x-values).
1. Press `2nd -> Y= [STAT PLOT]`.
2. Select `1: Plot1...` and turn it `On`.
3. Set `Type:` to the first option (scatterplot).
4. Set `Xlist:` to `L1` (your explanatory variable).
5. Set `Ylist:` to `RESID` (using the steps above).
6. Press `ZOOM -> 9:ZoomStat` to see the residual plot.
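For checking calculator output by hand, the same residual list and the points of a residual plot can be reproduced in Python. This is only a sketch: it assumes the standard least-squares slope and intercept formulas, the lists `l1` and `l2` stand in for the calculator's `L1` and `L2`, and the data values are hypothetical.

```python
def lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residual_plot_points(xs, ys):
    """Return (x, residual) pairs: the points a residual plot displays."""
    a, b = lsrl(xs, ys)
    return [(x, y - (a + b * x)) for x, y in zip(xs, ys)]


# Hypothetical stand-ins for the calculator lists L1 and L2.
l1 = [1.0, 2.0, 3.0, 4.0]
l2 = [3.0, 5.0, 4.0, 8.0]

# The second value in each pair plays the role of an entry in RESID.
for x, r in residual_plot_points(l1, l2):
    print(f"x = {x:>4}  residual = {r:+.2f}")
```

The pairs come out in the same order as the data, just as the `RESID` list does on the calculator.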
## How to Show Work on the FRQ
Interpreting a residual is a common FRQ task. To get full credit, you must provide a clear, contextual interpretation. Use the following three-part template.
**FRQ Template: Interpreting a Residual**
1. **Identify:** State the actual (y) and predicted (ŷ) values in context.
2. **Compare:** State whether the actual value was higher or lower than the predicted value, and by how much (the value of the residual).
3. **Conclude:** State whether the model resulted in an overestimate or an underestimate.
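As a study aid, the three steps can be sketched as a small Python helper that assembles an interpretation sentence from the actual and predicted values. The function name `interpret_residual`, its parameters, and the exam-score numbers are hypothetical; the handling of a zero residual is an assumption, since the template does not cover that case.

```python
def interpret_residual(x_context, response, actual, predicted, units):
    """Build an interpretation sentence: identify, compare, conclude."""
    res = actual - predicted
    if res == 0:
        # Assumption: the template has no wording for a zero residual.
        return (
            f"The residual of 0 {units} for {x_context} means that the actual "
            f"{response} of {actual:g} {units} matched the value predicted by "
            f"the linear model exactly."
        )
    direction = "higher" if res > 0 else "lower"
    verdict = "underestimated" if res > 0 else "overestimated"
    return (
        f"The residual of {res:g} {units} for {x_context} means that the actual "
        f"{response} of {actual:g} {units} was {abs(res):g} {units} {direction} "
        f"than the value of {predicted:g} {units} predicted by the linear model. "
        f"Therefore, the model {verdict} the {response}."
    )


# Hypothetical example: actual score 82, predicted score 88.
sentence = interpret_residual(
    "a student who studied 5 hours", "exam score", 82, 88, "points"
)
print(sentence)
```

On an actual FRQ you write the sentence yourself; the helper only shows how each blank depends on the actual value, the predicted value, and the sign of their difference.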
**Example Sentence Structure:**
"The residual of [residual value with units] for a [context of x-value] means that the actual [response variable in context] of [y-value] was [|residual value|] [units] [higher/lower] than the value of [ŷ-value] predicted by the linear model. Therefore, the model [overestimated/underestimated] the [response variable in context]."
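## Practice Problems
**Problem 1:**
A real estate agent uses a linear model to predict the selling price of homes based on their square footage. The least-squares regression line is:
`ŷ = 50,000 + 150 * (square footage)`
where ŷ is the predicted selling price in dollars.
A specific home with 2,000 square feet of space sold for an actual price of $335,000.
(a) Calculate the residual for this home.
(b) Interpret the residual in context.
**Solution:**
**(a) Calculation:**
* **Step 1: Calculate the predicted price (ŷ).**
`ŷ = 50,000 + 150 * (2000)`
`ŷ = 50,000 + 300,000`
`ŷ = $350,000`
* **Step 2: Use the residual formula.**
`residual = actual price (y) - predicted price (ŷ)`
`residual = $335,000 - $350,000`
`residual = -$15,000`
The residual for this home is -$15,000.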
**(b) Interpretation:**
*Using the FRQ template:*
The residual of -$15,000 for a home with 2,000 square feet means that the actual selling price of $335,000 was $15,000 lower than the price of $350,000 predicted by the linear model. Therefore, the model **overestimated** the selling price of this home.
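The arithmetic in Problem 1 can be double-checked with a short Python sketch. The variable names are illustrative; the numbers come from the problem.

```python
INTERCEPT = 50_000   # dollars, from y-hat = 50,000 + 150 * (square footage)
SLOPE = 150          # dollars per square foot

square_feet = 2_000
actual_price = 335_000

predicted_price = INTERCEPT + SLOPE * square_feet   # 50,000 + 300,000
residual = actual_price - predicted_price           # actual - predicted

print(f"predicted price = ${predicted_price:,}")    # predicted price = $350,000
print(f"residual = {residual:,} dollars")           # residual = -15,000 dollars

# A negative residual means the model's prediction was too high.
verdict = "overestimated" if residual < 0 else "did not overestimate"
print(f"The model {verdict} the selling price.")
```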
---
**Problem 2:**
The scatterplot below shows the relationship between the number of hours a student studied for a final exam and their score on the exam. The least-squares regression line is also shown. Point A represents a student named Maria.
[Image: A scatterplot with "Hours Studied" on the x-axis and "Exam Score" on the y-axis. The points show a positive, linear association. An LSRL is drawn through the data. Point A is clearly located below the regression line.]
(a) Is the residual for Maria positive, negative, or zero? Explain your reasoning.
(b) What does the residual for Maria tell us about the model's prediction for her exam score?
**Solution:**
**(a) Sign of the Residual:**
The residual for Maria is **negative**.
*Reasoning:* Point A, which represents Maria's data, is located vertically below the least-squares regression line. This means her actual exam score (y) was lower than the exam score predicted by the line (ŷ) for the number of hours she studied. Since the residual is calculated as `actual - predicted`, a lower actual value results in a negative residual.
**(b) Interpretation:**
A negative residual for Maria means that the linear model **overestimated** her exam score based on the number of hours she studied. Her actual score on the exam was lower than the score the regression line would have predicted.
## Common Mistakes to Avoid
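1. **Incorrect Formula Order:** A very common mistake is to calculate the residual as `predicted - actual`. The residual is always `actual - predicted` (`y - ŷ`). Reversing the order flips the sign, so an underestimate looks like an overestimate.
2. **Confusing the Sum of Residuals with the Sum of Squared Residuals:** For an LSRL, the sum of the residuals (`Σ(y - ŷ)`) is always zero. The sum of the squared residuals (`Σ(y - ŷ)^2`) is what the LSRL *minimizes*, and this sum will be a positive value (unless all points fall perfectly on the line).
3. **Stating a Residual is "Good" or "Bad" Without Scale:** A residual of -50 might seem large, but if the y-values are in the millions, it's a tiny error. The "size" of a residual is relative to the scale of the response variable. Avoid subjective labels unless you are comparing it to other residuals in the same dataset.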
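One way to see the point about scale is to express a residual as a fraction of the actual y-value. This is an illustrative assumption rather than a method from the lesson, and the function name `relative_error` and the y-values are hypothetical.

```python
def relative_error(residual, actual_y):
    """Return the size of the residual as a fraction of the actual y-value."""
    return abs(residual) / abs(actual_y)


# The same residual of -50 against two hypothetical y-values.
print(f"{relative_error(-50, 2_000_000):.4%}")   # 0.0025%
print(f"{relative_error(-50, 120):.1%}")         # 41.7%
```

The residual is identical in both cases, but it is negligible when y is in the millions and substantial when y is near 120.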