Linear Regression

Predict one variable from another using a line of best fit

You probably make predictions based on patterns all the time without realizing it. If your friend always takes about 15 minutes longer than they say they will, you adjust your expectations. If you notice that studying an extra hour tends to boost your test score by a few points, you plan your study time accordingly. If traffic is heavy, you leave earlier because you know traffic and travel time are related.

Linear regression is the mathematical version of this intuition. It takes two related variables and finds the best straight line to describe their relationship. Once you have that line, you can use it to make predictions: given a value of one variable, what is your best guess for the other? This chapter shows you how to find that line, interpret what it tells you, use it to make predictions, and recognize when it might lead you astray.

Core Concepts

From Correlation to Regression

In the previous chapter on correlation, you learned how to measure the strength and direction of a linear relationship using the correlation coefficient $r$. Correlation answers the question: “How strongly are these two variables related?”

Regression goes further. It asks: “If I know the value of one variable, what is my best prediction for the other?”

Think of it this way:

  • Correlation tells you that hours studied and exam scores are strongly related ($r = 0.85$)
  • Regression tells you that if you study 4 hours, your predicted score is 78 points

Regression gives you a formula—an equation for a line—that lets you plug in a value and get a prediction. That is why regression is such a workhorse in science, economics, medicine, and any field that wants to predict one thing from another.

The Regression Line: Finding the Best Fit

When you look at a scatter plot with a clear linear pattern, you could draw many different lines through the data. Some would fit better than others. But which line is the best fit?

The least-squares regression line is the line that minimizes the sum of the squared vertical distances from each point to the line. In other words, it is the line that, overall, comes as close as possible to all the points—where “close” is measured vertically.

Why vertical distances? Because we are trying to predict the $y$-variable (response) from the $x$-variable (explanatory). The vertical distance from a point to the line represents our prediction error for that point.

Why squared distances? Squaring does two things: it makes all errors positive (so errors above and below the line do not cancel out), and it penalizes large errors more heavily than small ones. A line that is wildly off for one point gets heavily penalized, which encourages the line to fit all points reasonably well rather than fitting some perfectly while ignoring others.
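To make "least squares" concrete, here is a minimal sketch on invented data: it computes the sum of squared residuals for the fitted line and for two nearby alternatives, and the fitted line always comes out smallest.

```python
# A minimal sketch (hypothetical data) showing that the least-squares line
# achieves a smaller sum of squared residuals than nearby alternative lines.
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])          # hypothetical study hours
score = np.array([58, 62, 70, 69, 77, 80, 83, 90])  # hypothetical exam scores

def sse(a, b):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    predicted = a + b * hours
    return np.sum((score - predicted) ** 2)

# np.polyfit returns coefficients highest power first: (slope, intercept)
b_ls, a_ls = np.polyfit(hours, score, 1)

print(f"least-squares line: yhat = {a_ls:.2f} + {b_ls:.2f}x, SSE = {sse(a_ls, b_ls):.1f}")
print(f"steeper line:  SSE = {sse(a_ls, b_ls + 0.5):.1f}")   # always larger
print(f"shifted line:  SSE = {sse(a_ls + 2, b_ls):.1f}")     # always larger
```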

The Regression Equation

The least-squares regression line has the form:

$$\hat{y} = a + bx$$

Where:

  • $\hat{y}$ (read “y-hat”) is the predicted value of $y$ for a given $x$
  • $a$ is the y-intercept—the predicted value of $y$ when $x = 0$
  • $b$ is the slope—how much $\hat{y}$ changes for each one-unit increase in $x$
  • $x$ is the explanatory (independent) variable

The “hat” symbol on $\hat{y}$ is important. It reminds you that this is a prediction, not necessarily the actual value. Real data points have $y$ values; your regression equation gives you $\hat{y}$ values.

Calculating the Slope and Intercept

If you have data on two variables $x$ and $y$, here are the formulas for the slope and intercept:

Slope: $$b = r \cdot \frac{s_y}{s_x}$$

Where:

  • $r$ is the correlation coefficient
  • $s_y$ is the standard deviation of the $y$-values
  • $s_x$ is the standard deviation of the $x$-values

This formula reveals something beautiful: the slope depends on both how strongly the variables are related ($r$) and how spread out each variable is ($s_y$ and $s_x$). If the relationship is weak ($r$ near 0), the slope will be near 0 too.

Intercept: $$a = \bar{y} - b\bar{x}$$

Where:

  • $\bar{y}$ is the mean of the $y$-values
  • $\bar{x}$ is the mean of the $x$-values
  • $b$ is the slope you just calculated

This formula guarantees something important: the regression line always passes through the point $(\bar{x}, \bar{y})$—the center of your data.
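Both formulas are easy to verify numerically. The sketch below (hypothetical data again) builds the slope from $r$ and the sample standard deviations, gets the intercept from the means, and checks the result against NumPy's least-squares fit.

```python
# A sketch of the textbook formulas b = r * sy/sx and a = ybar - b*xbar,
# checked against NumPy's own least-squares fit. Data are hypothetical.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([58, 62, 70, 69, 77, 80, 83, 90])

r = np.corrcoef(x, y)[0, 1]             # correlation coefficient
b = r * y.std(ddof=1) / x.std(ddof=1)   # slope from r and the sample SDs
a = y.mean() - b * x.mean()             # intercept: line passes through (xbar, ybar)

print(f"formulas:   yhat = {a:.2f} + {b:.2f}x")
print("np.polyfit:", np.polyfit(x, y, 1))  # (slope, intercept) -- should match
```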

Interpreting the Slope

The slope $b$ tells you the predicted change in $y$ for each one-unit increase in $x$. This interpretation is crucial and comes up constantly in applications.

Example: If a regression equation predicting exam score from hours studied is $\hat{y} = 55 + 6x$, the slope is 6. This means: “For each additional hour studied, the predicted exam score increases by 6 points.”

Notice the careful language:

  • “Predicted” score (not actual score)
  • “Increases by” for positive slopes, “decreases by” for negative slopes
  • We do not say the extra hour causes a 6-point increase—regression shows association, not causation

The slope is also related to the correlation. If $r > 0$, the slope $b > 0$ (positive relationship). If $r < 0$, the slope $b < 0$ (negative relationship).

Interpreting the Intercept

The intercept $a$ is the predicted value of $y$ when $x = 0$. Sometimes this makes sense; sometimes it does not.

When the intercept makes sense: If you are predicting total cost from number of items purchased, the intercept might represent a base shipping cost—the cost when you buy zero items.

When the intercept does not make sense: If you are predicting weight from height, the intercept gives the “predicted weight of a person with height zero”—which is meaningless. In such cases, the intercept is just a mathematical necessity for the line to fit the data. Do not try to interpret it as something real.

Always ask yourself: “Is $x = 0$ within the range of my data, and does it represent something meaningful?” If not, do not over-interpret the intercept.

Making Predictions

Once you have your regression equation, making predictions is straightforward: plug in a value of $x$ and calculate $\hat{y}$.

Example: With $\hat{y} = 55 + 6x$, predict the exam score for someone who studies 5 hours.

$\hat{y} = 55 + 6(5) = 55 + 30 = 85$

The predicted exam score is 85 points.

But there is a critical question: should you trust this prediction?

Interpolation vs. Extrapolation

Interpolation means making predictions within the range of your original data. If your study hours data ranged from 1 to 8 hours, predicting for 5 hours is interpolation.

Extrapolation means making predictions outside the range of your original data. Predicting for 15 hours of studying when your data only went up to 8 hours is extrapolation.

Extrapolation is risky. The linear relationship that holds within your data might not continue beyond it. Perhaps studying beyond 8 hours has diminishing returns as fatigue sets in. Perhaps it even hurts performance. Your regression line cannot know this—it just extends the pattern it sees.

A famous example: if you used a regression line based on children’s growth data to predict the height of a 50-year-old, you would get an absurd result. Linear growth in childhood does not continue through adulthood.

Rule of thumb: Be cautious with interpolation, and be very skeptical of extrapolation. The farther you extrapolate, the less reliable your prediction.
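In code, one simple safeguard is to store the range of $x$-values the line was fit on and flag any prediction outside it. A sketch using the running study-hours example ($\hat{y} = 55 + 6x$, data range 1 to 8 hours); the function name and range parameters are hypothetical:

```python
# A sketch of a prediction helper that flags extrapolation. The equation
# yhat = 55 + 6x and the 1-to-8-hour range come from the running example.
def predict_score(hours, x_min=1.0, x_max=8.0):
    y_hat = 55 + 6 * hours
    if not (x_min <= hours <= x_max):
        print(f"warning: {hours} hours is outside the observed range "
              f"[{x_min}, {x_max}]; this is extrapolation")
    return y_hat

print(predict_score(5))   # interpolation: 85
print(predict_score(15))  # extrapolation: prints a warning, returns 145
```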

Residuals: Measuring Prediction Error

A residual is the difference between an actual observed value and the value predicted by the regression line:

$$\text{Residual} = y - \hat{y} = \text{Actual} - \text{Predicted}$$

Residuals tell you how far off your predictions are for each data point:

  • Positive residual: The actual value is higher than predicted (the point is above the line)
  • Negative residual: The actual value is lower than predicted (the point is below the line)
  • Zero residual: The actual value exactly equals the prediction (the point is on the line)

Example: If you predicted a score of 85 but the student actually scored 82, the residual is: $$\text{Residual} = 82 - 85 = -3$$

The negative residual tells you the prediction was 3 points too high.

Properties of Residuals

The residuals from a least-squares regression line have special properties:

  1. They sum to zero: $\sum(y_i - \hat{y}_i) = 0$. The positive and negative residuals balance out perfectly. This is a mathematical consequence of how the line is calculated.

  2. The mean of residuals is zero: Since they sum to zero and there are $n$ of them, the average residual is zero.

  3. The sum of squared residuals is minimized: This is what “least squares” means—no other line would give a smaller sum of squared residuals.
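These properties are easy to confirm numerically; a quick check on hypothetical data:

```python
# Verifying properties 1 and 2: least-squares residuals sum (and average)
# to zero, up to floating-point rounding. Data are hypothetical.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([58, 62, 70, 69, 77, 80, 83, 90])

b, a = np.polyfit(x, y, 1)        # least-squares slope and intercept
residuals = y - (a + b * x)

print(f"sum of residuals:  {residuals.sum():.10f}")   # ~0
print(f"mean of residuals: {residuals.mean():.10f}")  # ~0
```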

Residual Plots: Checking Your Model

A residual plot graphs the residuals (on the y-axis) against the explanatory variable $x$ (on the x-axis). Residual plots are your primary tool for assessing whether a linear model is appropriate.

What a good residual plot looks like:

  • Points scattered randomly above and below the horizontal line at zero
  • No clear pattern or curve
  • Roughly constant spread (no fanning out or funneling in)

What bad residual plots reveal:

Curved pattern: If the residuals show a U-shape or inverted U-shape, the relationship is not linear. A straight line is not a good model—you might need a curve.

Fanning out (or in): If residuals are small for low x-values but large for high x-values (or vice versa), the spread is not constant. This is called heteroscedasticity, and it means predictions are more reliable for some x-values than others.

Clear clusters or patterns: Unusual structures in residuals suggest problems with the data or model.

The goal is residuals that look like random scatter with no pattern. A pattern in the residuals signals information that the linear model failed to capture.
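For reference, here is a minimal residual-plot sketch with matplotlib. The data are simulated to be genuinely linear plus noise, so the plot should show the healthy random scatter described above:

```python
# A minimal residual-plot sketch using simulated (hypothetical) data.
# A healthy plot shows random scatter around the zero line.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 40)
y = 5 + 2 * x + rng.normal(0, 1.5, size=x.size)  # truly linear + noise

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

plt.scatter(x, residuals)
plt.axhline(0, color="gray", linestyle="--")  # reference line at zero
plt.xlabel("x (explanatory variable)")
plt.ylabel("residual (y - yhat)")
plt.title("Residual plot: look for random scatter, no pattern")
plt.show()
```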

The Coefficient of Determination: $r^2$

The coefficient of determination, denoted $r^2$, tells you what proportion of the variation in $y$ is explained by the linear relationship with $x$.

If $r = 0.8$, then $r^2 = 0.64$. This means 64% of the variation in $y$-values can be explained by the linear relationship with $x$. The remaining 36% is due to other factors or random variation.

Interpretation template: “$r^2 \times 100$% of the variation in [response variable] can be explained by the linear relationship with [explanatory variable].”

Example: If the regression of exam scores on study hours has $r^2 = 0.72$, you would say: “72% of the variation in exam scores can be explained by the linear relationship with study hours.”

What $r^2$ values mean:

  • $r^2 = 1$ (or 100%): Perfect prediction—all points fall exactly on the line
  • $r^2 = 0.9$ (90%): Very strong—the line captures almost all the variation
  • $r^2 = 0.5$ (50%): Moderate—the line captures half the variation
  • $r^2 = 0.1$ (10%): Weak—the line captures little of the variation
  • $r^2 = 0$ (0%): The linear model explains nothing—predictions are no better than just guessing the mean

Note that $r^2$ is the square of $r$, so it is never negative; and since $|r| \le 1$, it always falls between 0 and 1 (or 0% and 100%).
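For simple linear regression, squaring the correlation and computing $1 - SS_{\text{res}}/SS_{\text{tot}}$ (unexplained variation over total variation) give exactly the same $r^2$, which is where the "proportion of variation explained" reading comes from. A quick numerical check on hypothetical data:

```python
# Two equivalent routes to r^2 for a simple linear fit (hypothetical data):
# squaring the correlation, and computing 1 - SS_residual / SS_total.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([58, 62, 70, 69, 77, 80, 83, 90])

r = np.corrcoef(x, y)[0, 1]
b, a = np.polyfit(x, y, 1)
ss_res = np.sum((y - (a + b * x)) ** 2)   # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation in y

print(f"r^2 from correlation: {r**2:.4f}")
print(f"1 - SS_res/SS_tot:    {1 - ss_res/ss_tot:.4f}")  # identical
```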

Limitations of Regression

Linear regression is powerful, but it has important limitations you must keep in mind:

1. Correlation is not causation (again): The regression equation shows an association. It does not prove that changing $x$ will cause $y$ to change. Do not confuse prediction with causation.

2. Linear models only: Regression assumes a straight-line relationship. If the true relationship is curved, linear regression will give poor predictions and misleading interpretations.

3. Sensitive to outliers: A single unusual point can dramatically change the slope and intercept, especially if it has an extreme $x$-value (high leverage).

4. Only as good as the data: Regression on a small or biased sample may not generalize to the broader population. Garbage in, garbage out.

5. Extrapolation danger: Predictions outside the range of your data are unreliable and potentially nonsensical.

6. Cannot capture complex relationships: Real-world relationships often involve multiple variables, interactions, and nonlinear effects that simple linear regression cannot model.

Regression to the Mean

Regression to the mean is a subtle but important phenomenon that often gets misunderstood.

Here is the idea: in repeated measurements, extreme values tend to be followed by values closer to the average. If a student scores extremely high on one test (perhaps partly due to luck), their next score is likely to be lower—closer to their true average. If a baseball player has an unusually poor month, the next month is likely to be better.

This is not because something causes the return to average. It is a statistical phenomenon. Extreme values are partly skill/ability and partly luck. When luck does not repeat, values move back toward the mean.

Why this matters for regression: In standardized units (z-scores), the slope of the regression line equals $r$, which is always between $-1$ and $1$. This means that predicted values are always closer to the mean (in standard-deviation terms) than the observed explanatory value is. If you select students who scored extremely high on a pretest, your regression model will predict somewhat lower scores on a posttest—not because they got worse, but because of regression to the mean.

Example: If you pick the students with the highest SAT scores and check their college GPAs, the GPAs will likely be above average but not as extremely above average as their SAT scores were. This is regression to the mean, not evidence that high SAT scores do not help predict college success.

This phenomenon was actually how regression got its name. Francis Galton noticed that very tall parents tended to have children who were still tall but somewhat closer to average height—they “regressed” toward the mean.
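A small simulation (all parameters invented) makes the phenomenon concrete: each score is stable ability plus luck, and the group that was extreme on test 1 lands noticeably closer to the overall mean on test 2.

```python
# A small simulation of regression to the mean: scores are true ability plus
# luck; the top 10% on test 1 average closer to the overall mean on test 2.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
ability = rng.normal(75, 8, n)          # stable "true" ability
test1 = ability + rng.normal(0, 6, n)   # ability + luck on test 1
test2 = ability + rng.normal(0, 6, n)   # same ability, fresh luck on test 2

top = test1 >= np.quantile(test1, 0.90)  # students extreme on test 1
print(f"overall mean:        {test1.mean():.1f}")
print(f"top group on test 1: {test1[top].mean():.1f}")
print(f"same group, test 2:  {test2[top].mean():.1f}  # closer to the mean")
```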

Notation and Terminology

| Term | Meaning | Example |
|------|---------|---------|
| Regression line | Line that best fits the data | $\hat{y} = a + bx$ |
| Slope ($b$) | Change in $\hat{y}$ per unit change in $x$ | $b = 6$ means $\hat{y}$ increases by 6 for each unit increase in $x$ |
| Intercept ($a$) | Predicted $y$ when $x = 0$ | May or may not be meaningful in context |
| $\hat{y}$ | Predicted value of $y$ | Read "y-hat" |
| Residual | $y - \hat{y}$ | Actual minus predicted |
| $r^2$ | Proportion of variance explained | $r^2 = 0.64$ means 64% of variation is explained |
| Extrapolation | Predicting beyond data range | Often unreliable |
| Interpolation | Predicting within data range | Generally more reliable |
| Least squares | Method minimizing sum of squared residuals | Standard method for finding the regression line |
| Explanatory variable | The predictor variable ($x$) | Also called independent variable |
| Response variable | The variable being predicted ($y$) | Also called dependent variable |

Examples

Example 1: Using a Regression Equation to Make Predictions

A researcher finds that the regression equation relating daily caffeine intake (in mg) to hours of sleep is:

$$\hat{y} = 8.5 - 0.01x$$

where $x$ is caffeine intake and $\hat{y}$ is predicted hours of sleep.

a) Predict the hours of sleep for someone who consumes 200 mg of caffeine daily.

b) Predict the hours of sleep for someone who consumes no caffeine.

Solution:

a) For 200 mg of caffeine:

Substitute $x = 200$ into the equation: $$\hat{y} = 8.5 - 0.01(200) = 8.5 - 2 = 6.5$$

The predicted hours of sleep is 6.5 hours.

b) For no caffeine ($x = 0$):

Substitute $x = 0$ into the equation: $$\hat{y} = 8.5 - 0.01(0) = 8.5$$

The predicted hours of sleep is 8.5 hours.

Notice that when $x = 0$, the prediction equals the intercept. This makes sense—the intercept is defined as the predicted value when $x = 0$.

Example 2: Interpreting Slope in Context

A study of new employees at a company finds the following regression equation relating months of training ($x$) to job performance score ($y$, on a 100-point scale):

$$\hat{y} = 62 + 4.5x$$

Interpret the slope in the context of this problem.

Solution:

The slope is $b = 4.5$.

Interpretation: For each additional month of training, the predicted job performance score increases by 4.5 points.

Important notes on the interpretation:

  • We say “predicted” score, not “actual” score
  • We say “increases by,” not “causes an increase of” (regression shows association, not causation)
  • The units matter: 4.5 points per month of training

This does not mean that training causes better performance—employees who train longer might also be more motivated or have more aptitude. But the regression equation lets us predict that employees with more training months tend to have higher performance scores.

Example 3: Calculating and Interpreting Residuals

Using the training-performance regression equation $\hat{y} = 62 + 4.5x$, suppose you have data on three employees:

| Employee | Months of Training ($x$) | Actual Performance ($y$) |
|----------|--------------------------|--------------------------|
| Alice | 3 | 78 |
| Bob | 5 | 80 |
| Carol | 5 | 88 |

Calculate the residual for each employee and interpret what the residuals tell you.

Solution:

Step 1: Calculate predicted values.

For Alice ($x = 3$): $\hat{y} = 62 + 4.5(3) = 62 + 13.5 = 75.5$

For Bob ($x = 5$): $\hat{y} = 62 + 4.5(5) = 62 + 22.5 = 84.5$

For Carol ($x = 5$): $\hat{y} = 62 + 4.5(5) = 62 + 22.5 = 84.5$

Step 2: Calculate residuals (Actual - Predicted).

For Alice: Residual $= 78 - 75.5 = +2.5$

For Bob: Residual $= 80 - 84.5 = -4.5$

For Carol: Residual $= 88 - 84.5 = +3.5$

Step 3: Interpret.

| Employee | Residual | Interpretation |
|----------|----------|----------------|
| Alice | $+2.5$ | Scored 2.5 points higher than predicted |
| Bob | $-4.5$ | Scored 4.5 points lower than predicted |
| Carol | $+3.5$ | Scored 3.5 points higher than predicted |

Alice and Carol performed better than the model predicted for their training levels—they are above the regression line. Bob performed worse than predicted—he is below the line.

The residuals also show that even employees with the same training (Bob and Carol both have 5 months) can have very different actual performances. The regression line predicts the average outcome, but individual results vary.
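For readers who like to check by machine, a few lines reproduce the table above from the example's equation $\hat{y} = 62 + 4.5x$ (the `employees` dictionary is just a convenient container):

```python
# Recomputing the residuals in Example 3 from the equation yhat = 62 + 4.5x.
employees = {"Alice": (3, 78), "Bob": (5, 80), "Carol": (5, 88)}

for name, (x, y) in employees.items():
    y_hat = 62 + 4.5 * x                      # predicted performance
    print(f"{name}: predicted {y_hat:.1f}, actual {y}, residual {y - y_hat:+.1f}")
```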

Example 4: Interpreting $r^2$ in Context

A regression analysis examining the relationship between house size (in square feet) and selling price (in thousands of dollars) produces the following results:

  • Regression equation: $\hat{y} = 45 + 0.12x$
  • Correlation: $r = 0.87$

a) Calculate and interpret $r^2$.

b) What does this tell you about using house size to predict selling price?

Solution:

a) Calculate $r^2$:

$r^2 = (0.87)^2 = 0.7569 \approx 0.76$

Interpretation: Approximately 76% of the variation in house selling prices can be explained by the linear relationship with house size (square footage).

b) What this means:

The $r^2$ value of 0.76 tells us that house size is a fairly strong predictor of selling price. About three-quarters of the differences in house prices can be accounted for by differences in size.

However, the remaining 24% of the variation in prices is due to other factors: location, number of bedrooms, age of the house, condition, local market conditions, and so on. If two houses are the same size, they might still have very different prices.

Practical implication: Size alone gives a reasonably good prediction of price, but for a more accurate estimate, you would want to consider other factors as well.

Example 5: Analyzing a Residual Plot and Assessing Model Appropriateness

A researcher uses linear regression to model the relationship between age (in years, from 15 to 65) and reaction time (in milliseconds). The residual plot shows the following pattern:

  • For ages 15-30, residuals are mostly positive (actual times slower than predicted)
  • For ages 30-50, residuals are mostly negative (actual times faster than predicted)
  • For ages 50-65, residuals are mostly positive (actual times slower than predicted)
  • The overall pattern is a U-shape

What does this residual plot tell you about the appropriateness of the linear model?

Solution:

Analysis of the residual plot:

The U-shaped (or curved) pattern in the residuals is a clear warning sign that a linear model is not appropriate for this data.

What the pattern means:

A good residual plot should show random scatter with no discernible pattern. The U-shaped pattern indicates that:

  1. The true relationship is not linear. A straight line systematically underpredicts for young and old ages, and overpredicts (or predicts accurately) for middle ages.

  2. The relationship is likely curved. Reaction time might decrease (improve) from teens to young adulthood, stay relatively stable through middle age, and then increase (worsen) with older age. This would create a U-shaped or J-shaped curve, not a straight line.

  3. The linear model misses important information. The pattern in the residuals represents systematic prediction errors that a better model could capture.

Recommendations:

  1. Consider a nonlinear model. A quadratic (parabolic) model like $\hat{y} = a + bx + cx^2$ might fit better, allowing for the curved relationship.

  2. Do not use the linear model for predictions. Predictions from this linear model will be systematically wrong for young and old ages.

  3. Report the limitation. If you must use a linear model for simplicity, acknowledge that it does not fit well at the extremes of the age range.

Key lesson: Always examine residual plots before trusting a regression model. A high $r^2$ or a seemingly reasonable equation does not guarantee the model is appropriate—only the residual plot can reveal whether the linear assumption is valid.
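As a sketch of recommendation 1, the snippet below simulates U-shaped reaction-time data (all numbers invented), fits both a line and a quadratic with NumPy, and compares the leftover squared error:

```python
# Simulated U-shaped age/reaction-time data: a quadratic fit leaves far less
# unexplained error than a straight line. All parameters are invented.
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(15, 65, 200)                                     # ages 15-65
reaction = 300 + 0.12 * (age - 35) ** 2 + rng.normal(0, 10, 200)   # curved truth

linear = np.polyfit(age, reaction, 1)      # straight-line fit
quadratic = np.polyfit(age, reaction, 2)   # allows curvature

sse_lin = np.sum((reaction - np.polyval(linear, age)) ** 2)
sse_quad = np.sum((reaction - np.polyval(quadratic, age)) ** 2)
print(f"SSE, linear fit:    {sse_lin:.0f}")
print(f"SSE, quadratic fit: {sse_quad:.0f}  # far smaller; U-shape gone from residuals")
```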

Example 6: Complete Regression Analysis

A teacher collects data on hours spent on homework per week ($x$) and final exam scores ($y$) for 6 students:

| Hours ($x$) | 2 | 4 | 5 | 6 | 8 | 11 |
|-------------|---|---|---|---|---|----|
| Score ($y$) | 58 | 70 | 68 | 78 | 82 | 94 |

Given: $\bar{x} = 6$, $\bar{y} = 75$, $s_x = 3.16$, $s_y = 12.51$, $r = 0.98$

a) Find the regression equation.

b) Predict the score for a student who does 7 hours of homework.

c) Calculate the residual for the student who did 5 hours of homework.

d) Interpret $r^2$ in context.

e) Should you use this equation to predict the score for a student who does 20 hours of homework per week?

Solution:

a) Find the regression equation.

Slope: $$b = r \cdot \frac{s_y}{s_x} = 0.98 \times \frac{12.51}{3.16} = 0.98 \times 3.959 \approx 3.88$$

Intercept: $$a = \bar{y} - b\bar{x} = 75 - 3.88(6) = 75 - 23.28 = 51.72$$

Regression equation: $$\hat{y} = 51.72 + 3.88x$$

b) Predict the score for 7 hours of homework.

$$\hat{y} = 51.72 + 3.88(7) = 51.72 + 27.16 = 78.88$$

The predicted exam score is about 79 points.

c) Calculate the residual for 5 hours.

First, find the predicted score for $x = 5$: $$\hat{y} = 51.72 + 3.88(5) = 51.72 + 19.40 = 71.12$$

The actual score for the student with 5 hours was 68.

Residual $= y - \hat{y} = 68 - 71.12 = -3.12$

The residual is about $-3$, meaning this student scored roughly 3 points lower than the model predicted.

d) Interpret $r^2$.

$r^2 = (0.98)^2 = 0.9604 \approx 0.96$

Interpretation: Approximately 96% of the variation in final exam scores can be explained by the linear relationship with hours of homework per week.

This is a very strong relationship—homework hours explain almost all the variation in scores.

e) Should you predict for 20 hours?

No, you should not. This would be extrapolation—predicting far beyond the range of the original data (which went from 2 to 11 hours).

The prediction would be: $\hat{y} = 51.72 + 3.88(20) = 51.72 + 77.60 = 129.32$ points.

This is problematic for two reasons:

  1. The exam is likely out of 100 points, so a score of about 129 is impossible.
  2. The linear relationship might not continue. At some point, additional homework hours might have diminishing returns, or exhaustion might even hurt performance.

Extrapolation can lead to nonsensical or unreliable predictions. Stick to the range of your data.
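The whole example can be checked directly from the raw data; the sketch below recomputes the fit with NumPy (tiny differences from the hand calculation are rounding):

```python
# Recomputing Example 6 from the raw data; np.polyfit does the least-squares
# fit, and the other quantities follow the chapter's formulas.
import numpy as np

hours = np.array([2, 4, 5, 6, 8, 11])
score = np.array([58, 70, 68, 78, 82, 94])

b, a = np.polyfit(hours, score, 1)        # slope, intercept
r = np.corrcoef(hours, score)[0, 1]

print(f"a) equation: yhat = {a:.2f} + {b:.2f}x")         # yhat = 51.72 + 3.88x
print(f"b) prediction at 7 hours: {a + b * 7:.1f}")      # ~78.9
print(f"c) residual at 5 hours:   {68 - (a + b * 5):+.1f}")  # ~-3.1
print(f"d) r^2 = {r ** 2:.2f}")                          # ~0.96
```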

Key Properties and Rules

Formulas for the Least-Squares Regression Line

Regression equation: $$\hat{y} = a + bx$$

Slope: $$b = r \cdot \frac{s_y}{s_x}$$

Intercept: $$a = \bar{y} - b\bar{x}$$

Properties of the Least-Squares Line

  1. The line always passes through the point $(\bar{x}, \bar{y})$
  2. The sum of residuals equals zero: $\sum(y_i - \hat{y}_i) = 0$
  3. The sum of squared residuals is minimized (smaller than for any other line)
  4. The slope has the same sign as the correlation coefficient $r$

Interpreting Regression Output

| Quantity | Interpretation |
|----------|----------------|
| Slope $b$ | For each one-unit increase in $x$, $\hat{y}$ changes by $b$ units |
| Intercept $a$ | Predicted value of $y$ when $x = 0$ (if meaningful) |
| $r^2$ | Proportion of variation in $y$ explained by the linear relationship with $x$ |
| Residual | How far above ($+$) or below ($-$) the line an actual point falls |

Residual Analysis Guidelines

Good residual plot (linear model is appropriate):

  • Random scatter around zero
  • No curves or patterns
  • Roughly constant spread across all $x$-values

Warning signs (linear model may not be appropriate):

  • Curved pattern (relationship is nonlinear)
  • Fanning out/in (spread changes with $x$)
  • Clusters or gaps (unusual data structure)

Prediction Guidelines

| Type | Description | Reliability |
|------|-------------|-------------|
| Interpolation | Predicting within the range of original $x$-values | Generally reliable |
| Extrapolation | Predicting outside the range of original $x$-values | Often unreliable |

Rule: Never extrapolate far beyond your data. The relationship might change in ways your model cannot anticipate.

Coefficient of Determination ($r^2$)

$$r^2 = (\text{correlation})^2$$

Interpretation: $r^2$ is the proportion (or percentage) of variance in $y$ that is explained by the linear relationship with $x$.

Properties:

  • Always between 0 and 1 (or 0% and 100%)
  • $r^2 = 0$: Linear model explains nothing
  • $r^2 = 1$: Linear model explains everything (all points on the line)
  • Higher $r^2$ means better predictions (less unexplained variation)

Real-World Applications

Predicting Test Scores from Study Time

Educational researchers use regression to understand how study habits relate to academic performance. A regression of exam scores on study hours might show that each additional hour of study predicts a 3-point increase in scores. Teachers can use this information when advising students about time management—while noting that the relationship is predictive, not necessarily causal (more motivated students might both study more and score higher).

Economic Forecasting

Economists use regression to predict economic indicators. For example:

  • Predicting consumer spending from income levels
  • Forecasting GDP growth from leading indicators
  • Estimating inflation from money supply changes

These models help businesses plan and help policymakers make decisions. However, economists know that predictions are uncertain, especially when extrapolating into unusual conditions (like economic crises).

Scientific Modeling

Scientists across fields use regression to quantify relationships:

  • Biologists might model how drug dosage relates to response
  • Physicists might calibrate instruments by regressing measurements on known standards
  • Environmental scientists might predict species population from habitat size

In many cases, the regression equation becomes a working model that summarizes the relationship discovered in experiments.

Sports Analytics

Sports teams use regression to predict performance:

  • A baseball team might predict a player’s future performance from past statistics
  • A basketball team might model how rest days affect shooting percentage
  • A football team might predict game outcomes from offensive and defensive statistics

“Moneyball” approaches to sports rely heavily on regression to find undervalued players—those whose predicted performance exceeds their cost.

Medical Research

Doctors use regression to predict health outcomes:

  • Predicting blood pressure from age, weight, and lifestyle factors
  • Estimating disease risk from genetic markers
  • Forecasting recovery time from treatment variables

These predictions help with diagnosis and treatment planning. A patient whose actual outcome differs greatly from the prediction (large residual) might warrant special attention.

Business and Finance

Businesses use regression constantly:

  • Retailers predict sales from advertising spending
  • Real estate agents estimate home prices from square footage and features
  • Insurance companies set premiums based on predicted claims
  • HR departments model salary from years of experience

In all cases, the key is not just making predictions, but understanding the limitations: how much unexplained variation exists ($r^2$), and whether predictions outside the data range are trustworthy.

Self-Test Problems

Problem 1: A regression equation for predicting monthly electric bills ($y$, in dollars) from average monthly temperature ($x$, in degrees F) is $\hat{y} = 120 - 1.5x$.

a) Predict the electric bill for a month with an average temperature of 70 degrees.

b) Interpret the slope in context.

Answer:

a) Prediction for 70 degrees: $$\hat{y} = 120 - 1.5(70) = 120 - 105 = 15$$ The predicted electric bill is $15.

b) Slope interpretation: The slope of $-1.5$ means that for each one-degree increase in average monthly temperature, the predicted electric bill decreases by $1.50.

(This makes sense in a heating climate—warmer months require less heating, so bills are lower.)

Problem 2: Using the same regression equation $\hat{y} = 120 - 1.5x$, suppose the actual electric bill in a month with an average temperature of 60 degrees was $50. Calculate and interpret the residual.

Show Answer

Step 1: Calculate the predicted value. $$\hat{y} = 120 - 1.5(60) = 120 - 90 = 30$$

Step 2: Calculate the residual. $$\text{Residual} = y - \hat{y} = 50 - 30 = +20$$

Interpretation: The actual bill was $20 higher than the model predicted. This household used more electricity than expected for a month with that temperature—perhaps they had guests, used more appliances, or had an unusually cold spell within that month.

Problem 3: A regression analysis produces $r = 0.80$. Calculate and interpret $r^2$.

Answer:

Calculation: $$r^2 = (0.80)^2 = 0.64$$

Interpretation: 64% of the variation in the response variable can be explained by the linear relationship with the explanatory variable.

This means the model captures a substantial portion of the variation, but 36% of the variation is due to other factors not included in the model.

Problem 4: Given the following statistics for a dataset relating years of experience ($x$) to annual salary in thousands ($y$):

  • $\bar{x} = 8$ years, $\bar{y} = 65$ thousand dollars
  • $s_x = 4$, $s_y = 15$
  • $r = 0.85$

Find the equation of the least-squares regression line.

Answer:

Step 1: Calculate the slope. $$b = r \cdot \frac{s_y}{s_x} = 0.85 \times \frac{15}{4} = 0.85 \times 3.75 = 3.1875 \approx 3.19$$

Step 2: Calculate the intercept. $$a = \bar{y} - b\bar{x} = 65 - 3.19(8) = 65 - 25.52 = 39.48$$

Regression equation: $$\hat{y} = 39.48 + 3.19x$$

Or approximately: $\hat{y} = 39.5 + 3.2x$

Interpretation: For each additional year of experience, the predicted salary increases by about $3,190 (since $y$ is in thousands).

Problem 5: A residual plot from a regression shows a clear U-shaped pattern, with residuals negative in the middle of the x-range and positive at both ends. What does this indicate, and what should you do?

Answer:

What it indicates: A U-shaped pattern in residuals indicates that the true relationship between the variables is nonlinear, not linear. The straight line is systematically missing the pattern in the data:

  • At low and high x-values, actual y-values are higher than predicted (positive residuals)
  • At middle x-values, actual y-values are lower than predicted (negative residuals)

What to do:

  1. Consider a nonlinear model. A quadratic model ($\hat{y} = a + bx + cx^2$) might fit better.
  2. Do not trust predictions from the linear model. They will be systematically wrong.
  3. Look at the scatter plot to see the actual curved pattern.
  4. If you must use a linear model, acknowledge its limitations and restrict predictions to a narrow range where the line fits reasonably well.

Problem 6: A college admissions office develops a regression model predicting first-year GPA from SAT scores using data from students with SAT scores between 1100 and 1500. An applicant has an SAT score of 1650. Should the admissions office use the regression model to predict this student’s GPA? Why or why not?

Answer:

No, the admissions office should not rely on this prediction.

Reason: This is extrapolation—predicting for an SAT score (1650) that is outside the range of the original data (1100-1500).

Why extrapolation is risky here:

  1. The linear relationship between SAT and GPA might not continue at very high scores
  2. Students with extremely high SAT scores might:
    • Be overconfident and study less
    • Take on too many challenging courses
    • Have unusual circumstances that made them outliers
  3. The prediction is being made 150 points beyond the highest value in the data

What to do instead:

  • Acknowledge high uncertainty in this prediction
  • Consider other factors for this applicant
  • Perhaps look for data from students with similarly high scores, if available
  • Treat any prediction as very approximate at best

Problem 7: Two regression models are fit to the same data:

  • Model A: $r^2 = 0.92$
  • Model B: $r^2 = 0.45$

Which model explains more variation in the response variable? Does the model with higher $r^2$ always make better predictions for new data?

Answer:

Which explains more variation: Model A, with $r^2 = 0.92$, explains more variation. It accounts for 92% of the variation in the response variable, compared to only 45% for Model B.

Does higher $r^2$ always mean better predictions for new data?

Not necessarily. A higher $r^2$ on the original data does not guarantee better predictions on new data because:

  1. Overfitting: A model might fit the original data very well by capturing noise rather than true patterns. Such a model would have high $r^2$ on the original data but poor predictions on new data.

  2. Different populations: If the new data comes from a different population or context, the relationship might be different.

  3. Extrapolation: If new data falls outside the range of the original data, even a model with high $r^2$ might fail.

  4. Model appropriateness: A linear model with high $r^2$ might still be inappropriate if the relationship is truly nonlinear (this can sometimes happen with limited data).

Best practice: Validate your model on data that was not used to build it, and always check residual plots regardless of $r^2$ value.

Summary

  • Regression uses data on two variables to create an equation for predicting one variable from another. It goes beyond correlation by providing a formula for making predictions.

  • The least-squares regression line $\hat{y} = a + bx$ is the line that minimizes the sum of squared vertical distances from points to the line. It always passes through the point $(\bar{x}, \bar{y})$.

  • The slope $b = r \cdot \frac{s_y}{s_x}$ tells you how much the predicted $y$ changes for each one-unit increase in $x$. Always interpret it in context with proper units.

  • The intercept $a = \bar{y} - b\bar{x}$ is the predicted value when $x = 0$. Only interpret it if $x = 0$ is meaningful and within your data range.

  • Residuals are the differences between actual and predicted values: Residual $= y - \hat{y}$. Positive residuals indicate points above the line; negative residuals indicate points below it.

  • Residual plots help assess whether a linear model is appropriate. Look for random scatter with no pattern. Curved patterns indicate the relationship is nonlinear.

  • The coefficient of determination $r^2$ tells you what proportion of the variation in $y$ is explained by the linear relationship with $x$. Higher $r^2$ means better predictions with less unexplained variation.

  • Interpolation (predicting within your data range) is generally reliable. Extrapolation (predicting outside your data range) is risky and should be avoided or approached with great caution.

  • Regression shows association, not causation. A regression equation lets you make predictions, but it does not prove that changing $x$ causes $y$ to change.

  • Regression to the mean is the phenomenon where extreme values tend to be followed by values closer to the average. This is a statistical reality, not a causal effect.

  • Always examine your data with scatter plots and residual plots before trusting a regression model. Summary statistics alone can hide important patterns and problems.