## Quick Summary
This guide will enable you to master confidence intervals for the slope of a regression model. You will learn to identify the correct inference procedure, verify the necessary conditions for inference, and calculate a confidence interval for the true slope of a population regression line. Most importantly, you will be able to interpret this interval in context, making a statistically sound conclusion about the relationship between two quantitative variables.
## Key Concepts
When we perform a least-squares regression on a sample of data, we get a sample slope, b. This is our best estimate for the true population slope, β (beta). However, a different sample would likely yield a different sample slope. A confidence interval provides a range of plausible values for the unknown population slope, β, based on our single sample.
### The t-Interval for the Slope
The appropriate procedure is a t-interval for the slope of a regression line. We use a t-distribution because the true standard deviation of the sampling distribution of the slope is unknown and must be estimated from the data.
The Formula:
The confidence interval is calculated as:
Statistic ± (Critical Value) × (Standard Error of Statistic)
b ± t*(SE_b)
b: The sample slope, calculated from the data. This is our point estimate for β.
SE_b: The standard error of the slope. This value estimates the standard deviation of the sampling distribution of the slope. It tells us roughly how far the sample slope b typically falls from the true slope β.
t*: The critical value from a t-distribution with n - 2 degrees of freedom (df) for a given confidence level. We use n - 2 degrees of freedom because we estimate two parameters from the data: the slope (β) and the y-intercept (α).
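The formula above can be sketched in a few lines of Python. The function name `slope_t_interval` is a hypothetical helper, not part of any library, and the numbers are the ones this guide uses in its sample output and worked example (with t* = 2.110 read from a t-table).

```python
def slope_t_interval(b, se_b, t_star):
    """Return (lower, upper) for the interval b ± t* · SE_b."""
    margin = t_star * se_b          # margin of error
    return b - margin, b + margin


# b and SE_b from the sample output; t* taken from a t-table
lower, upper = slope_t_interval(b=0.550, se_b=0.170, t_star=2.110)
print(f"({lower:.3f}, {upper:.3f})")   # (0.191, 0.909)
```

Keeping t* as an input mirrors the exam workflow, where the critical value comes from a table or calculator rather than from the regression output.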
Finding Values from Computer Output:
In nearly all AP exam questions, you will be given computer regression output, not raw data. You must be able to locate the necessary values.
Sample Computer Regression Output:
The sample slope (b) is the "Coef" (coefficient) for the explanatory variable ("X-Variable"). Here, b = 0.550.
The standard error of the slope (SE_b) is the "SE Coef" for the explanatory variable. Here, SE_b = 0.170.
The standard deviation of the residuals (s) is given separately. Here, s = 1.945. Do not confuse this with SE_b!
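To see why s and SE_b are different quantities, it can help to compute both from raw data. The sketch below uses a small hypothetical data set (invented for illustration, not taken from this guide) and the standard relationship SE_b = s / sqrt(Σ(x - x̄)²), where s is computed with n - 2 in the denominator.

```python
import math
from statistics import mean

# Hypothetical (x, y) data, invented purely for illustration
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                  # sample slope: "Coef" for the x-variable
a = y_bar - b * x_bar          # sample intercept: "Coef" for Constant
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# s: standard deviation of the residuals (the "S" on the output)
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
# SE_b: standard error of the slope (the "SE Coef" for the x-variable)
se_b = s / math.sqrt(sxx)

print(f"b = {b:.3f}, s = {s:.3f}, SE_b = {se_b:.3f}")
```

The two values answer different questions: s describes the typical size of a prediction error, while SE_b describes how much the slope itself would vary from sample to sample.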
### Conditions for Inference (LINER)
To ensure our calculations and conclusions are valid, we must check five conditions. Use the mnemonic LINER.
1. **L - Linear:** The true relationship between the explanatory variable (x) and the response variable (y) is linear.
   - **How to Check:** Examine the scatterplot of the original data to see if the form is roughly linear. More importantly, examine the residual plot. The residual plot should show no obvious leftover pattern (e.g., no curves, no fanning out).
   - [Image: Two residual plots side-by-side. Left plot is labeled "Good: No Pattern" and shows random scatter around y=0. Right plot is labeled "Bad: Pattern (Curved)" and shows a clear U-shape.]
2. **I - Independent:** Individual observations are independent of each other.
   - **How to Check:** Check for random sampling or random assignment. If sampling without replacement, verify the 10% condition: the sample size must be no more than 10% of the population size (i.e., n ≤ 0.10N).
3. **N - Normal:** For any given value of x, the responses (y-values) are Normally distributed around the true regression line.
   - **How to Check:** We cannot check the distribution of y for every x, so we check the distribution of the **residuals**. Examine a histogram, boxplot, or Normal probability plot of the residuals. The histogram should be roughly unimodal and symmetric, and the Normal probability plot should be roughly linear.
   - [Image: A nearly linear Normal probability plot of residuals, labeled "Good: Residuals are approximately Normal."]
4. **E - Equal Variance (or Equal Standard Deviation):** The standard deviation of the responses (y-values) is the same for all values of x. This property is called **homoscedasticity**.
   - **How to Check:** Examine the **residual plot**. The amount of vertical scatter of the residuals should be roughly the same for all x-values. There should be no "fanning" or "cone" shape where the residuals become more (or less) spread out as x increases.
5. **R - Random:** The data come from a well-designed random sample or randomized experiment. (This is often checked along with the Independent condition.)

## Key Vocabulary

- **Population Slope (β):** The true, unknown rate of change in the mean response y for each one-unit increase in the explanatory variable x for the entire population.
- **Sample Slope (b):** The slope of the least-squares regression line calculated from sample data. It serves as the point estimate for the population slope, β.
- **Standard Error of the Slope (SE_b):** An estimate of the standard deviation of the sampling distribution of the slope. It quantifies the typical amount of error in the sample slope b as an estimate of the population slope β.
- **Residual:** The difference between an observed y-value and the y-value predicted by the regression line (residual = y - ŷ). We analyze residuals to check the conditions for inference.
- **Degrees of Freedom (df):** For regression inference, the degrees of freedom are n - 2, where n is the sample size. This is because we use the sample data to estimate two parameters: the slope and the y-intercept.

## Calculator Tech (TI-84)

If you are given raw data in two lists (e.g., L1 and L2), you can calculate the confidence interval directly.

**Function:** `LinRegTInt` (Linear Regression T-Interval)

**Keystrokes:**

1. Enter your x-values in L1 and y-values in L2 (`STAT -> 1:Edit...`).
2. Press `STAT`, scroll right to `TESTS`. Scroll down to `LinRegTInt` and press `ENTER`.
3. Inputs:
   - `Xlist`: L1 (or whichever list has your explanatory variable)
   - `Ylist`: L2 (or whichever list has your response variable)
   - `Freq`: 1
   - `C-Level`: Enter your desired confidence level as a decimal (e.g., 0.95 for 95%).
   - `RegEQ`: (Optional) You can store the regression equation in a Y-variable like Y1 by pressing `VARS -> Y-VARS -> 1:Function... -> 1:Y1`.
The output screen will give you the confidence interval, the sample slope b, the degrees of freedom df, the standard deviation of the residuals s, and the correlation coefficient r.

## How to Show Work on the FRQ

Use the four-step **State-Plan-Do-Conclude** process to earn full credit on inference questions.

### State

- **Parameter:** Define the parameter of interest in context.
  - *Template:* "We want to estimate **β**, the true slope of the population least-squares regression line relating [response variable y in context] to [explanatory variable x in context] at a [C]% confidence level."

### Plan

- **Procedure:** Name the inference method.
  - *Template:* "The appropriate procedure is a **t-interval for the slope of a regression line**."
- **Check Conditions:** Check the LINER conditions, using the context of the problem.
  - *Template:*
    - **Linear:** "The scatterplot of [y-variable] vs. [x-variable] is roughly linear. The residual plot shows no leftover pattern, so we assume the true relationship is linear."
    - **Independent:** "The data were randomly sampled. Assuming the population of [context] is at least [10 × n], the 10% condition is met. Individual observations are independent."
    - **Normal:** "The Normal probability plot of the residuals is roughly linear (or the histogram of residuals is roughly symmetric), so we can assume the residuals are approximately Normally distributed."
    - **Equal Variance:** "The residual plot shows a similar amount of scatter for all x-values, so we assume the standard deviation of the response is constant for all x."
    - **Random:** "The data were collected from a random sample."

### Do

- **Formula:** Write the general formula for the confidence interval.
  - *Template:* `b ± t*(SE_b)`
- **Calculations:**
  1. Identify b and SE_b from the computer output.
  2. Find the degrees of freedom: df = n - 2.
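  3. Find the critical value t* using a t-table or your calculator's `invT` function (`2nd -> VARS [DISTR] -> 4:invT(`). Because `invT` takes the area to the left, entering area (1-C)/2 with df n - 2 returns -t*; use its absolute value, or enter (1+C)/2 as the area to get t* directly.
  4. Substitute the values into the formula and calculate the interval.
  - *Example:* `0.550 ± 2.110(0.170) = 0.550 ± 0.359 = (0.191, 0.909)`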
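For readers who want to see what the critical-value lookup is doing, here is a minimal sketch that approximates t* using only the Python standard library. It integrates the t density numerically and searches for the cutoff by bisection. The function names are hypothetical, and this is an illustration rather than how a calculator implements `invT`. If SciPy is installed, `scipy.stats.t.ppf` is the usual way to get the same value.

```python
import math


def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    coef /= math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(conf_level, df):
    """t* such that the central area between -t* and t* equals conf_level."""
    target = (1 + conf_level) / 2      # area to the LEFT of +t*
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:      # widen the bracket until it contains t*
        hi *= 2
    for _ in range(60):                # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# 95% confidence with n = 18 observations, so df = 16
print(f"{t_critical(0.95, df=16):.3f}")   # 2.120
```

Note that the target area is (1 + C)/2, the area to the left of the positive critical value, which is why the result comes out positive.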
### Conclude

- **Interpretation:** Interpret the interval in the context of the problem.
  - *Template:* "We are [C]% confident that the interval from [lower bound] to [upper bound] captures the true slope of the population least-squares regression line relating [y-variable in context] to [x-variable in context]."
- **Linkage (if asked):** If the question asks whether there is a convincing relationship, check if 0 is in the interval.
  - **If 0 is NOT in the interval:** "Because 0 is not in our confidence interval, we have convincing evidence of a linear relationship between [x-variable] and [y-variable]."
  - **If 0 IS in the interval:** "Because 0 is in our confidence interval, we do not have convincing evidence of a linear relationship between [x-variable] and [y-variable]. A slope of 0 is a plausible value."
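The linkage rule is simple enough to express as a one-line check. The helper name below is hypothetical. The two intervals are the ones that appear in this guide's worked example and in Practice Problem 2.

```python
def has_convincing_evidence(lower, upper):
    """True when 0 lies outside the interval, so a slope of 0 is not plausible."""
    return not (lower <= 0 <= upper)


print(has_convincing_evidence(0.191, 0.909))   # True: 0 is not in the interval
print(has_convincing_evidence(-0.21, 1.45))    # False: 0 is a plausible slope
```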
## Practice Problems
**Problem 1:**
A real estate agent wants to understand the relationship between the size of a house (in square feet) and its selling price (in thousands of dollars). She collects data from a random sample of 18 recently sold houses in a large suburb and produces the following computer output. Assume the conditions for inference have been met.
Regression Analysis: Price ($1000s) versus Size (sq. ft.)

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 90.25 | 10.51 | 8.59 | 0.000 |
| Size (sq. ft) | 0.125 | 0.024 | 5.21 | 0.000 |

S = 15.8, R-Sq = 62.9%
(a) Construct and interpret a 95% confidence interval for the slope of the population regression line.
(b) Based on your interval, is there convincing evidence of a linear relationship between house size and selling price? Explain.
**Solution:**

**(a)**

**State:** We want to estimate β, the true slope of the population least-squares regression line relating selling price (in thousands of dollars) to house size (in square feet) with 95% confidence.

**Plan:** The appropriate procedure is a t-interval for the slope of a regression line. The problem states that the conditions for inference have been met.

**Do:**

- From the output: b = 0.125 and SE_b = 0.024.
- The sample size is n = 18, so degrees of freedom are df = 18 - 2 = 16.
- For 95% confidence with df = 16, the critical value is t* = 2.120 (from a t-table or `invT` with area 0.975 and df 16).
- The formula is `b ± t*(SE_b)`.
- Calculation: `0.125 ± 2.120(0.024) = 0.125 ± 0.05088`
- Interval: **(0.074, 0.176)**

**Conclude:** We are 95% confident that the interval from 0.074 to 0.176 captures the true slope of the population least-squares regression line relating selling price to house size. This means that for each additional square foot of size, the true mean selling price is estimated to increase by between $74 and $176.

**(b)** Yes, there is convincing evidence of a linear relationship between house size and selling price. The 95% confidence interval, (0.074, 0.176), does not contain 0. This suggests that a slope of 0 is not a plausible value, so we can conclude the true slope is not zero.

---

**Problem 2:**

A biologist studies the relationship between the number of trees in a forest plot and the number of bird species found in that plot. After collecting data from 30 plots, she calculates a 90% confidence interval for the slope of the regression line to be **(-0.21, 1.45)**. What conclusion should the biologist make about the relationship between the number of trees and the number of bird species?

**Solution:**

The 90% confidence interval for the true slope is (-0.21, 1.45). This interval gives a range of plausible values for the true slope of the population regression line relating the number of bird species to the number of trees. Because the value **0 is included in this interval**, it is a plausible value for the true slope β. A slope of 0 would mean there is no linear relationship between the number of trees and the number of bird species. Therefore, based on this interval, the biologist **does not have convincing evidence** of a linear relationship between the number of trees in a plot and the number of bird species found there.

## Common Mistakes to Avoid

- **Misinterpreting the Interval:** Do not say, "There is a 95% probability that the true slope β is in the interval (0.074, 0.176)." The true slope β is a fixed (but unknown) number.
  It is either in the interval or it is not. The 95% confidence is in the *method* used to construct the interval, not in any single interval.
- **Using Incorrect Degrees of Freedom:** For regression inference, always use **df = n - 2**. Using n or n - 1 is a common error that will lead to an incorrect t* critical value. Remember, we estimate two parameters (slope and intercept), so we lose two degrees of freedom.
- **Poorly Checking Conditions:** Do not just list "LINER." You must explicitly check each condition in the context of the problem, referencing the specific graphs provided (scatterplot, residual plot, histogram of residuals). A very common mistake is checking the Normal condition on the original Y-data instead of on the **residuals**.
- **Confusing s and SE_b:** On computer output, s is the standard deviation of the residuals, which measures the typical prediction error. SE_b is the standard error of the slope, which is used in the confidence interval formula. Be sure to pull the correct value (`SE Coef`) from the output.
- **Concluding Causation:** A confidence interval for the slope, even if it doesn't contain zero, only provides evidence of a linear association. It does not imply that changes in the x-variable cause changes in the y-variable. Causation can only be established from a well-designed, randomized experiment.
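The point about confidence being in the *method* can be illustrated with a small simulation. The sketch below is built on assumptions: it invents a population with a known slope (loosely modeled on Practice Problem 1, but the x-values, intercept, and noise level are all hypothetical), draws many samples of size 18, and counts how often the 95% interval captures the true slope. The critical value 2.120 for df = 16 is the one used earlier in this guide.

```python
import math
import random


def slope_interval(x, y, t_star):
    """Least-squares slope interval b ± t* · SE_b computed from raw data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    return b - t_star * se_b, b + t_star * se_b


def coverage(true_slope=0.125, reps=2000, seed=1):
    """Fraction of simulated 95% intervals that contain the true slope."""
    rng = random.Random(seed)
    x = [1000 + 100 * i for i in range(18)]   # 18 hypothetical house sizes
    t_star = 2.120                            # df = 16, 95% confidence
    hits = 0
    for _ in range(reps):
        # Hypothetical population: intercept 90, Normal noise with SD 15
        y = [90 + true_slope * xi + rng.gauss(0, 15) for xi in x]
        lo, hi = slope_interval(x, y, t_star)
        hits += lo <= true_slope <= hi
    return hits / reps


print(coverage())   # should land close to 0.95
```

Each individual interval either contains 0.125 or it does not. The long-run proportion that do is what the 95% refers to.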