Correlation and Scatter Plots

Explore relationships between two numerical variables

You have probably heard claims like “People who exercise more live longer” or “Students who study more get better grades.” These statements are about relationships between two things. But how do we actually see these relationships in data? How do we measure them? And most importantly, how do we know whether one thing actually causes another, or whether they just happen to move together?

This chapter gives you the tools to visualize relationships between two numerical variables using scatter plots, describe what you see, and quantify the strength of linear relationships with the correlation coefficient. Just as importantly, you will learn to recognize when correlation can be misleading—and why “correlation does not imply causation” is one of the most important phrases in all of statistics.

Core Concepts

Bivariate Data: Studying Two Variables at Once

So far, we have focused mostly on one variable at a time—heights of students, test scores, temperatures. But often the most interesting questions involve the relationship between two variables. This is called bivariate data: paired observations where each individual or case has values for two numerical variables.

For example:

  • For each student, you might record both hours studied and exam score
  • For each city, you might record both average temperature and ice cream sales
  • For each car model, you might record both weight and fuel efficiency

The key is that the two measurements are linked—they come from the same individual, city, or object. You are not just looking at hours studied and exam scores separately; you are asking whether they are related to each other.

Scatter Plots: Visualizing Relationships

A scatter plot is the go-to graph for displaying bivariate data. Each paired observation becomes a single point on the graph:

  • The horizontal axis (x-axis) shows values of one variable, often called the explanatory or independent variable
  • The vertical axis (y-axis) shows values of the other variable, often called the response or dependent variable

If you suspect that one variable might influence or predict the other, put the explanatory variable on the x-axis and the response on the y-axis. For instance, if you think study time might affect exam scores, put hours studied on the x-axis and exam scores on the y-axis.

How to construct a scatter plot:

  1. Draw and label your axes with appropriate scales
  2. For each pair of values $(x, y)$, place a dot at that location
  3. Give your scatter plot a descriptive title

Once you plot the points, patterns often emerge. Points might cluster along an upward-sloping line, a downward-sloping line, a curve, or show no pattern at all. Reading these patterns is the art of interpreting scatter plots.
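If you are plotting with software rather than by hand, the same three steps translate directly into a few lines of code. Here is a minimal sketch using Python's matplotlib; the study-hours and exam-score values are invented purely for illustration:

```python
# Minimal scatter-plot sketch; the data values are made up for illustration.
import matplotlib.pyplot as plt

hours_studied = [1, 2, 3, 4, 5, 6, 7]          # explanatory variable (x-axis)
exam_scores   = [62, 68, 70, 75, 80, 83, 88]   # response variable (y-axis)

plt.scatter(hours_studied, exam_scores)         # step 2: one dot per (x, y) pair
plt.xlabel("Hours studied")                     # step 1: label the axes
plt.ylabel("Exam score")
plt.title("Exam Score vs. Hours Studied")       # step 3: descriptive title
plt.show()
```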

Describing Associations: Direction, Form, and Strength

When you look at a scatter plot, there are three key features to describe:

1. Direction (Positive, Negative, or None)

Positive association: As the x-variable increases, the y-variable tends to increase. The points slope upward from left to right. Example: Taller people tend to weigh more—height and weight have a positive association.

Negative association: As the x-variable increases, the y-variable tends to decrease. The points slope downward from left to right. Example: As the price of a product increases, the quantity sold tends to decrease—price and sales have a negative association.

No association: There is no consistent pattern. Knowing the x-value does not help you predict the y-value. The points are scattered randomly with no clear direction.

2. Form (Linear or Nonlinear)

Linear: The points cluster around a straight line. The relationship can be described well by a line.

Nonlinear: The points follow a curved pattern. Examples include:

  • Quadratic (U-shaped or inverted U-shaped)
  • Exponential (rapid increase or decrease)
  • Logarithmic (rapid change that levels off)

Many real-world relationships start as approximately linear but become curved at extreme values.

3. Strength (Strong, Moderate, or Weak)

Strong association: The points cluster tightly around the underlying pattern (line or curve). If you know the x-value, you can predict the y-value quite well.

Weak association: The points are loosely scattered around the pattern. There is a general trend, but individual points vary considerably from it.

Moderate association: Somewhere in between.

Think of strength as how much “scatter” there is around the pattern. Less scatter means stronger association; more scatter means weaker association.

The Correlation Coefficient: Measuring Linear Relationships

While describing scatter plots with words like “moderate positive association” is useful, we often want a precise number. The correlation coefficient, denoted $r$, does exactly this—it quantifies the strength and direction of a linear relationship between two variables.

The formula for the sample correlation coefficient:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

Where:

  • $n$ is the number of data pairs
  • $\bar{x}$ and $\bar{y}$ are the means of the x and y variables
  • $s_x$ and $s_y$ are the standard deviations of the x and y variables
  • Each factor of the form $\frac{x_i - \bar{x}}{s_x}$ is a z-score—it measures how many standard deviations that value lies from its mean

In practice, you will usually compute $r$ using technology (a calculator or software), but understanding the formula helps you see what $r$ captures: it is essentially an average of the products of standardized values.
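To make that last point concrete, here is a short Python sketch that follows the formula literally: standardize each value, multiply the paired z-scores, and average using $n - 1$. It is meant as a transparent illustration of the definition, not as production code:

```python
# Correlation computed exactly as the formula reads: average of z-score products.
import math

def correlation(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # sample standard deviations (divide by n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    # sum the products of paired z-scores, then divide by n - 1
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)
```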

Properties of the Correlation Coefficient

The correlation coefficient $r$ has several important properties that make it especially useful:

1. Range: $-1 \leq r \leq 1$

The correlation is always between $-1$ and $1$. It cannot be $2$ or $-1.5$.

2. Sign indicates direction:

  • $r > 0$ means positive association (as x increases, y tends to increase)
  • $r < 0$ means negative association (as x increases, y tends to decrease)
  • $r = 0$ means no linear association

3. Magnitude indicates strength:

  • $|r|$ close to 1 means strong linear relationship
  • $|r|$ close to 0 means weak or no linear relationship
  • Rough guidelines: $|r| > 0.8$ is strong, $0.5 < |r| \leq 0.8$ is moderate, and $|r| \leq 0.5$ is weak

4. Perfect correlations:

  • $r = 1$ means perfect positive linear relationship—all points lie exactly on an upward-sloping line
  • $r = -1$ means perfect negative linear relationship—all points lie exactly on a downward-sloping line

5. No units: Because $r$ is calculated from z-scores, it has no units. The correlation between height in inches and weight in pounds is the same as between height in centimeters and weight in kilograms.

6. Symmetric: The correlation between x and y is the same as between y and x.

7. Only measures linear relationships: A correlation of $r = 0$ does not mean there is no relationship—it means there is no linear relationship. A perfect U-shaped curve would have $r = 0$, as the sketch below illustrates.
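Two of these properties, the unit-free property (5) and the linear-only property (7), are easy to check numerically. The sketch below uses NumPy with small made-up datasets: the first part rescales units and gets the same $r$; the second computes $r$ for a perfect U-shaped (quadratic) pattern and gets essentially zero:

```python
# Numerical check of properties 5 and 7; the data values are illustrative only.
import numpy as np

# Property 5 (no units): converting inches/pounds to cm/kg leaves r unchanged.
height_in = np.array([62, 65, 67, 70, 72, 75])
weight_lb = np.array([120, 140, 150, 165, 180, 200])
r_imperial = np.corrcoef(height_in, weight_lb)[0, 1]
r_metric   = np.corrcoef(height_in * 2.54, weight_lb * 0.4536)[0, 1]
print(r_imperial, r_metric)          # identical up to floating-point rounding

# Property 7 (linear only): a perfect U-shape has a linear correlation of zero.
x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2                           # exact quadratic relationship, yet...
print(np.corrcoef(x, y)[0, 1])       # ...r is essentially 0
```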

Correlation Does Not Imply Causation

This is perhaps the most important concept in this entire chapter. Just because two variables are correlated does not mean one causes the other. This principle is so important that it bears repeating: correlation does not imply causation.

Here is why this matters: when you find a strong correlation between variables A and B, there are at least four possibilities:

  1. A causes B: Changes in A directly cause changes in B
  2. B causes A: Changes in B directly cause changes in A
  3. Lurking variable C causes both: Some hidden third variable affects both A and B, making them appear related
  4. Coincidence: The correlation is due to random chance or a flawed study

The classic example: Ice cream sales and drowning deaths are positively correlated. Does ice cream cause drowning? Of course not. Both are caused by a lurking variable: hot weather. When it is hot, people buy more ice cream and more people go swimming, which leads to more drownings.

Another example: There is a strong positive correlation between the number of firefighters at a fire and the amount of damage caused. Does sending firefighters cause more damage? No—bigger fires require more firefighters and cause more damage. The size of the fire is the lurking variable.

To establish causation, you typically need a controlled experiment where you randomly assign subjects to treatment and control groups. Observational studies—where you just observe existing data—can show correlation but cannot prove causation, no matter how strong the correlation is.

Lurking Variables and Confounding

A lurking variable (also called a confounding variable) is a variable that:

  • Is not included in your analysis
  • Affects both variables you are studying
  • Creates a misleading association between them

Lurking variables are sneaky because they hide in the background. You might not even know they exist. This is why you should always think critically when someone claims that one thing causes another based on correlational data.

Common types of lurking variables include:

  • Time: Many things increase or decrease over time, creating spurious correlations
  • Age: Health variables often correlate because they are all related to age
  • Socioeconomic status: Income, education, and wealth affect many outcomes
  • Geographic location: Regional differences can create apparent correlations

Outliers and Influential Points

In bivariate data, some points can have an outsized effect on your analysis.

Outliers in bivariate data are points that fall far from the overall pattern. They might have an unusual x-value, an unusual y-value, or an unusual combination of both.

Influential points are observations that, if removed, would substantially change the correlation or the line of best fit. A point can be influential because:

  • It has an extreme x-value (called a high-leverage point)
  • It does not follow the pattern of the other points

Not all outliers are influential, and not all influential points are outliers. The most influential points are typically those with extreme x-values that also deviate from the overall trend.

What to do with influential points:

  1. First, verify the data is correct (no typos or measurement errors)
  2. Try to understand why this point is unusual
  3. Calculate the correlation both with and without the point
  4. Report both results and explain the situation

Never automatically delete points just because they are inconvenient. But also recognize that a single unusual point can dramatically change your conclusions.

When Correlation Is Misleading

The correlation coefficient $r$ can give a misleading picture in several situations:

1. Nonlinear relationships: If the relationship is curved, $r$ will underestimate the strength of the relationship. Always look at the scatter plot first.

2. Outliers: A single outlier can dramatically increase or decrease $r$.

3. Groups within data: If your data contains distinct subgroups, the overall correlation might be very different from the correlation within each group.

4. Restricted range: If you only observe a narrow range of x-values, the correlation might appear weaker than it really is.

5. Extrapolation: A relationship that holds within your data might not extend beyond it.

The lesson: always create a scatter plot before calculating correlation. Numbers alone can deceive; pictures show what is really happening.
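Situation 4, the restricted range, is worth seeing with numbers. The following sketch simulates a genuine linear trend (an assumed model, not real data) and then recomputes $r$ using only a narrow slice of x-values; the restricted correlation comes out noticeably weaker:

```python
# Restricted-range demonstration on simulated data (assumed linear trend plus noise).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=500)
y = x + rng.normal(0, 15, size=500)        # true linear relationship with scatter

r_full = np.corrcoef(x, y)[0, 1]

mask = (x > 40) & (x < 60)                 # keep only a narrow band of x-values
r_restricted = np.corrcoef(x[mask], y[mask])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))   # restricted r is much closer to 0
```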

Notation and Terminology

| Term | Meaning | Example |
| --- | --- | --- |
| Scatter plot | Graph of paired numerical data | $(x, y)$ points plotted on coordinate axes |
| Bivariate data | Data where each observation has two values | (height, weight) for each person |
| Association | Relationship between variables | How one variable changes as another changes |
| Positive association | As $x$ increases, $y$ tends to increase | Height and weight |
| Negative association | As $x$ increases, $y$ tends to decrease | Price and quantity sold |
| Linear relationship | Points cluster around a straight line | |
| Nonlinear relationship | Points follow a curved pattern | |
| Correlation ($r$) | Strength and direction of linear relationship | $-1 \leq r \leq 1$ |
| $r = 1$ | Perfect positive linear relationship | All points on an upward line |
| $r = -1$ | Perfect negative linear relationship | All points on a downward line |
| $r = 0$ | No linear relationship | May still have a nonlinear relationship |
| Lurking variable | Hidden variable affecting both observed variables | Hot weather affects both ice cream sales and drowning |
| Explanatory variable | Variable that may explain or predict the response | Also called independent variable; plotted on x-axis |
| Response variable | Variable that may be affected by the explanatory variable | Also called dependent variable; plotted on y-axis |
| Outlier | Point far from the overall pattern | |
| Influential point | Point that strongly affects correlation or best-fit line | |

Examples

Example 1: Describing a Scatter Plot

A researcher collects data on the age of used cars (in years) and their selling prices (in thousands of dollars). The scatter plot shows points that start high on the left and gradually decrease to the right, forming a fairly tight cluster along a diagonal line.

Describe the association between car age and selling price.

Solution:

We need to describe three things: direction, form, and strength.

Direction: As car age increases (moving right), selling price decreases (points are lower). This is a negative association. Older cars tend to sell for less.

Form: The points cluster around what appears to be a straight line, not a curve. The relationship is linear.

Strength: The points form a fairly tight cluster around the linear pattern with relatively little scatter. The association is moderately strong to strong.

Complete description: There is a moderately strong, negative, linear association between car age and selling price. As cars get older, their selling prices tend to decrease.

This makes intuitive sense—older cars have more wear and tear, more miles, and are further from their original condition, so they sell for less.

Example 2: Estimating Correlation from a Scatter Plot

Match each description with the most likely correlation coefficient: $r = -0.9$, $r = -0.4$, $r = 0$, $r = 0.6$, $r = 0.95$.

A) Points scattered randomly with no pattern
B) Points tightly clustered along an upward-sloping line
C) Points loosely scattered with a slight downward trend
D) Points forming a moderately clear upward pattern
E) Points tightly clustered along a downward-sloping line

Solution:

A) Points scattered randomly with no pattern: No direction or pattern indicates no linear relationship. $r = 0$

B) Points tightly clustered along an upward-sloping line: Strong positive linear relationship. “Tightly clustered” means $|r|$ is close to 1, and the upward slope makes $r$ positive. $r = 0.95$

C) Points loosely scattered with a slight downward trend: Weak negative association. “Loosely scattered” and “slight” suggest $r$ close to zero but negative. $r = -0.4$

D) Points forming a moderately clear upward pattern: Moderate positive relationship. $r = 0.6$

E) Points tightly clustered along a downward-sloping line: Strong negative linear relationship. $r = -0.9$

Summary:

| Description | $r$ |
| --- | --- |
| A | $0$ |
| B | $0.95$ |
| C | $-0.4$ |
| D | $0.6$ |
| E | $-0.9$ |

Example 3: Calculating the Correlation Coefficient

Five students reported their hours of sleep the night before an exam and their exam scores:

| Student | Sleep (hours) | Score |
| --- | --- | --- |
| 1 | 5 | 72 |
| 2 | 6 | 78 |
| 3 | 7 | 82 |
| 4 | 8 | 88 |
| 5 | 9 | 85 |

Calculate the correlation coefficient $r$.

Solution:

Step 1: Calculate the means.

$\bar{x} = \frac{5 + 6 + 7 + 8 + 9}{5} = \frac{35}{5} = 7$ hours

$\bar{y} = \frac{72 + 78 + 82 + 88 + 85}{5} = \frac{405}{5} = 81$ points

Step 2: Calculate the standard deviations.

For sleep (x): $s_x = \sqrt{\frac{(5-7)^2 + (6-7)^2 + (7-7)^2 + (8-7)^2 + (9-7)^2}{4}} = \sqrt{\frac{4 + 1 + 0 + 1 + 4}{4}} = \sqrt{2.5} \approx 1.58$

For scores (y): $s_y = \sqrt{\frac{(72-81)^2 + (78-81)^2 + (82-81)^2 + (88-81)^2 + (85-81)^2}{4}} = \sqrt{\frac{81 + 9 + 1 + 49 + 16}{4}} = \sqrt{39} \approx 6.24$

Step 3: Calculate the z-scores and their products.

| Student | $x_i$ | $\frac{x_i - \bar{x}}{s_x}$ | $y_i$ | $\frac{y_i - \bar{y}}{s_y}$ | Product |
| --- | --- | --- | --- | --- | --- |
| 1 | 5 | $\frac{-2}{1.58} = -1.26$ | 72 | $\frac{-9}{6.24} = -1.44$ | $1.82$ |
| 2 | 6 | $\frac{-1}{1.58} = -0.63$ | 78 | $\frac{-3}{6.24} = -0.48$ | $0.30$ |
| 3 | 7 | $\frac{0}{1.58} = 0$ | 82 | $\frac{1}{6.24} = 0.16$ | $0$ |
| 4 | 8 | $\frac{1}{1.58} = 0.63$ | 88 | $\frac{7}{6.24} = 1.12$ | $0.71$ |
| 5 | 9 | $\frac{2}{1.58} = 1.26$ | 85 | $\frac{4}{6.24} = 0.64$ | $0.81$ |

Step 4: Sum the products and divide by $n-1$.

Sum of products $= 1.82 + 0.30 + 0 + 0.71 + 0.81 = 3.64$

$r = \frac{3.64}{5-1} = \frac{3.64}{4} = 0.91$

Interpretation: The correlation coefficient $r \approx 0.91$ indicates a strong positive linear relationship between hours of sleep and exam scores. Students who slept more tended to score higher on the exam. However, remember: this does not prove that more sleep caused better scores.
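As a quick check on the arithmetic, the same value comes out of a library call. A short sketch using NumPy with the five data pairs from this example:

```python
# Verify Example 3 with NumPy (data pairs from the table above).
import numpy as np

sleep = [5, 6, 7, 8, 9]
score = [72, 78, 82, 88, 85]

r = np.corrcoef(sleep, score)[0, 1]
print(round(r, 2))   # 0.91, matching the hand calculation
```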

Example 4: Identifying When Correlation Is Misleading

A researcher plots data on employees’ years of experience and their salaries. The scatter plot shows two distinct clusters: one cluster of points in the lower-left (less experience, lower salary) and another cluster in the upper-right (more experience, higher salary). The overall correlation is $r = 0.85$.

However, when the researcher separates the data by job type, they find:

  • For administrative staff: $r = 0.15$
  • For engineers: $r = 0.20$

Explain why the overall correlation is so different from the within-group correlations.

Solution:

This is closely related to Simpson’s Paradox—a trend that appears in aggregated data can weaken, disappear, or even reverse when the data are separated into subgroups.

What is happening:

The overall correlation of $r = 0.85$ is driven by the difference between groups, not by the relationship within groups:

  1. Engineers (as a group) have more experience on average and higher salaries on average
  2. Administrative staff (as a group) have less experience on average and lower salaries on average
  3. Within each group, there is little relationship between experience and salary (low within-group correlations)

The lurking variable: Job type is the lurking variable. It affects both experience (engineers may stay longer) and salary (engineers are paid more), creating an apparent correlation between experience and salary that largely disappears when you account for job type.

Why this matters:

If someone used the overall $r = 0.85$ to argue “more experience leads to higher salary,” they would be misleading you. The data actually shows that job type matters far more than experience for predicting salary.

Lesson: Always consider whether your data contains subgroups that should be analyzed separately. A single correlation coefficient can hide important patterns.
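The pattern in this example is easy to reproduce with simulated data. The sketch below is an illustration built on assumed numbers (not the researcher's data): it creates two job-type clusters with almost no within-group relationship, then shows that pooling them still produces a large overall correlation:

```python
# Simulated illustration of a between-group effect masquerading as correlation.
import numpy as np

rng = np.random.default_rng(1)

# Administrative staff: less experience, lower salaries, weak within-group link.
admin_exp = rng.normal(4, 1.5, size=100)
admin_sal = 40 + 0.3 * admin_exp + rng.normal(0, 4, size=100)    # salary in $1000s

# Engineers: more experience, higher salaries, weak within-group link.
eng_exp = rng.normal(12, 1.5, size=100)
eng_sal = 90 + 0.3 * eng_exp + rng.normal(0, 4, size=100)

r_admin   = np.corrcoef(admin_exp, admin_sal)[0, 1]
r_eng     = np.corrcoef(eng_exp, eng_sal)[0, 1]
r_overall = np.corrcoef(np.concatenate([admin_exp, eng_exp]),
                        np.concatenate([admin_sal, eng_sal]))[0, 1]

# Within-group correlations are small; the pooled correlation is large,
# driven almost entirely by the difference between the two groups.
print(round(r_admin, 2), round(r_eng, 2), round(r_overall, 2))
```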

Example 5: Analyzing Causation vs. Correlation

A large study finds a strong positive correlation ($r = 0.72$) between the number of hours students spend on social media per day and their levels of anxiety. A news headline proclaims: “Social Media Causes Teen Anxiety!”

Analyze this claim. What would you need to conclude that social media actually causes anxiety?

Solution:

The correlation tells us:

  • There is a strong positive relationship between social media use and anxiety
  • Students who use more social media tend to have higher anxiety levels (and vice versa)

The correlation does NOT tell us:

  • Whether social media causes anxiety
  • Whether anxiety causes increased social media use
  • Whether some third factor causes both

Alternative explanations for this correlation:

1. Reverse causation (anxiety causes social media use): Students who are already anxious might use social media more as a coping mechanism or because anxiety makes it harder to engage in face-to-face social activities.

2. Lurking variables:

  • Underlying depression or loneliness: These could cause both increased social media use and increased anxiety
  • Sleep deprivation: Students who sleep less might use more social media and experience more anxiety
  • Social comparison tendencies: Students prone to comparing themselves to others might both use social media more and feel more anxious
  • Family environment: A difficult home life could lead to both social media as an escape and anxiety symptoms

3. Bidirectional causation: The relationship might be a feedback loop—social media increases anxiety, which leads to more social media use, which increases anxiety further.

What would be needed to establish causation:

1. Randomized controlled experiment:

  • Randomly assign students to different levels of social media use (e.g., 0 hours, 1 hour, 3 hours per day)
  • Measure anxiety levels after a set period
  • Compare anxiety levels across groups

If the high-use group shows significantly more anxiety than the low-use group, that would provide evidence of causation.

2. Longitudinal studies with temporal precedence: Follow students over time, measuring both variables repeatedly. If social media use at time 1 predicts changes in anxiety at time 2 (controlling for anxiety at time 1), this provides stronger evidence than a simple correlation.

3. Rule out confounding variables: Carefully measure and statistically control for potential lurking variables like sleep, existing mental health conditions, personality traits, and family environment.

Conclusion: The headline “Social Media Causes Teen Anxiety” goes far beyond what a correlation can support. A responsible statement would be: “Social media use is associated with higher anxiety levels, but more research is needed to determine whether the relationship is causal and, if so, in which direction.”

Example 6: The Effect of an Influential Point

A dataset of 10 cities shows the relationship between average temperature (in degrees F) and average monthly electricity usage (in kWh). Nine of the cities have temperatures between 55 and 75 degrees and show a moderate negative correlation—as temperature increases, electricity usage decreases (less heating needed).

Then a 10th city is added: Phoenix, Arizona, with an average temperature of 95 degrees and very high electricity usage (due to air conditioning).

Without Phoenix, $r = -0.65$. With Phoenix, $r = 0.15$.

Explain how one city could change the correlation so dramatically.

Solution:

Why Phoenix is influential:

  1. Extreme x-value: Phoenix has a temperature (95 degrees) far beyond the range of the other cities (55-75 degrees). Points with extreme x-values have high leverage—they can pull the correlation toward them.

  2. Deviates from the pattern: For the original 9 cities, higher temperature meant lower electricity usage (heating decreases). Phoenix breaks this pattern completely—it has the highest temperature and high electricity usage (air conditioning).

  3. Single point changes the story: The original negative correlation reflected the relationship “warmer = less heating = less electricity.” Phoenix introduces a new relationship: “very hot = lots of cooling = lots of electricity.”

What is really happening:

The relationship between temperature and electricity usage is nonlinear—it is U-shaped:

  • At cold temperatures, electricity usage is high (heating)
  • At moderate temperatures, electricity usage is lower (neither heating nor cooling)
  • At hot temperatures, electricity usage is high again (air conditioning)

By including only cities in the 55-75 degree range, we saw only the left half of the U (negative relationship). Adding Phoenix shows us the right half, and the linear correlation coefficient—which assumes a straight-line relationship—becomes meaningless.

Lessons:

  1. Check for nonlinearity: A dramatic change in $r$ when adding a point suggests the relationship might not be linear.

  2. Consider the range: The original 9 cities covered only a narrow temperature range. The relationship within that range might not extend beyond it.

  3. Report both: Present the correlation with and without the influential point, explain the difference, and note that a linear model might not be appropriate for the full temperature range.

  4. Do not blame Phoenix: The city is not “wrong”—it reveals a limitation of using linear correlation for a nonlinear relationship.

Better approach: For data spanning a wide temperature range, either:

  • Use a nonlinear model (like a quadratic/parabola)
  • Separate the data into “heating-dominated” and “cooling-dominated” regions and analyze each separately
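The reversal in this example can be reproduced with invented numbers. The sketch below makes up data for nine mild-climate cities with a negative trend, then adds one very hot, high-usage city; the exact correlations will not match the values quoted above, but the sign flip illustrates the same phenomenon:

```python
# Illustrative sketch of an influential, high-leverage point (city data invented).
import numpy as np

temps_9 = np.array([55, 57, 60, 62, 65, 67, 70, 72, 75])           # degrees F
usage_9 = np.array([950, 930, 900, 880, 840, 860, 820, 830, 800])  # kWh per month

r_without = np.corrcoef(temps_9, usage_9)[0, 1]     # clearly negative

# Add a Phoenix-like city: far hotter than the rest, with very high usage.
temps_10 = np.append(temps_9, 95)
usage_10 = np.append(usage_9, 1400)

r_with = np.corrcoef(temps_10, usage_10)[0, 1]      # sign flips to positive

print(round(r_without, 2), round(r_with, 2))
```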

Key Properties and Rules

Properties of the Correlation Coefficient $r$

Range: $$-1 \leq r \leq 1$$

Interpretation guidelines:

| Correlation | Interpretation |
| --- | --- |
| $r = 1$ | Perfect positive linear relationship |
| $0.8 < r < 1$ | Strong positive relationship |
| $0.5 < r \leq 0.8$ | Moderate positive relationship |
| $0 < r \leq 0.5$ | Weak positive relationship |
| $r = 0$ | No linear relationship |
| $-0.5 \leq r < 0$ | Weak negative relationship |
| $-0.8 \leq r < -0.5$ | Moderate negative relationship |
| $-1 < r < -0.8$ | Strong negative relationship |
| $r = -1$ | Perfect negative linear relationship |

Key properties:

  1. Unitless: Correlation has no units; it is a pure number
  2. Symmetric: Correlation of x with y equals correlation of y with x
  3. Linear only: Only measures strength of linear relationships
  4. Sensitive to outliers: A single unusual point can dramatically affect $r$
  5. Unaffected by changes of units: Adding a constant to a variable or multiplying it by a positive constant (as when converting units) does not change $r$

Formula for Correlation

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

Alternative computational formula:

$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}}$$
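The computational form is the one most software effectively uses. A brief sketch implementing it in Python and checking it against NumPy's built-in routine, using the sleep and score data from Example 3:

```python
# Computational formula for r, checked against numpy.corrcoef.
import numpy as np

def correlation(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

sleep = [5, 6, 7, 8, 9]
score = [72, 78, 82, 88, 85]
print(round(correlation(sleep, score), 4))               # computational formula
print(round(float(np.corrcoef(sleep, score)[0, 1]), 4))  # same value from NumPy
```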

Describing Scatter Plots: A Checklist

When describing a scatter plot, always address:

  1. Direction: Positive, negative, or none
  2. Form: Linear or nonlinear (if nonlinear, what shape?)
  3. Strength: Strong, moderate, or weak
  4. Unusual features: Outliers, clusters, gaps

Correlation vs. Causation: Key Points

| Correlation | Causation |
| --- | --- |
| Measures association | Establishes cause-and-effect |
| Can be found in observational data | Requires controlled experiments |
| Two variables move together | One variable directly affects another |
| May be explained by lurking variables | Rules out alternative explanations |

Remember: Correlation is necessary for causation, but not sufficient. If A causes B, we expect them to be correlated. But correlation alone does not prove causation.

When to Be Skeptical of Correlation

Be cautious when:

  • The scatter plot shows a nonlinear pattern
  • There are obvious outliers or clusters
  • The data spans a restricted range of x-values
  • The context suggests lurking variables
  • Someone claims causation based only on correlation

Real-World Applications

Health Studies: Diet and Disease

Medical researchers constantly study correlations between lifestyle factors and health outcomes. For example, studies show correlations between:

  • Processed meat consumption and colon cancer risk
  • Exercise and cardiovascular health
  • Sleep duration and obesity

But establishing that diet causes health outcomes is extremely difficult. Randomized experiments are often impractical or unethical (you cannot randomly assign people to smoke for 20 years). Researchers must carefully consider lurking variables—people who eat healthier might also exercise more, have higher income, and have better access to healthcare.

The correlation between coffee consumption and longevity has flip-flopped over the decades as researchers discovered and controlled for lurking variables like smoking (coffee drinkers used to smoke more) and exercise habits.

Economics: Income and Education

There is a strong positive correlation between years of education and lifetime income. Does education cause higher income? Partially, yes—education provides skills, credentials, and opportunities. But lurking variables complicate the picture:

  • Family background affects both education and income
  • Natural ability might lead to both more education and higher earning potential
  • Social connections made in college might matter as much as what is learned

Policy debates about education funding hinge on understanding exactly how much of the correlation is causal.

Sports: Practice Time and Performance

Athletes and coaches observe correlations between practice habits and performance. More hours practicing free throws correlates with better free throw percentage. This seems obviously causal—practice makes perfect. But even here, there are nuances:

  • Natural talent affects both how much someone practices (enjoyment) and how well they perform
  • The best athletes might get more coaching and practice opportunities
  • The relationship between practice and performance is probably nonlinear—additional practice helps beginners a lot but helps experts less

Marketing: Ad Spending and Sales

Companies track the correlation between advertising spending and sales revenue. If the correlation is positive, does that prove advertising works? Not necessarily:

  • Companies spend more on advertising when they expect high sales (for example, before the holiday season)
  • Companies with more money to spend on ads also have more money for product development
  • Brand awareness built over years might matter more than this quarter’s ad spending

Many companies now run controlled experiments (A/B tests) where some regions see ads and others do not, to measure the true causal effect of advertising.

Social Science: Screen Time and Well-Being

Recent research has examined correlations between smartphone/social media use and mental health, especially in teens. Early headlines screamed about causation, but researchers have found:

  • The correlations are often weaker than initially reported
  • Reverse causation is plausible (unhappy teens might use phones more)
  • Lurking variables like sleep deprivation, family conflict, and economic stress could explain both
  • The relationship varies by age, gender, and type of use

This area shows how important it is for the public to understand that correlation is not causation—premature conclusions can lead to misguided policy.

Self-Test Problems

Problem 1: A scatter plot shows the relationship between a city’s distance from the equator (x) and its average January temperature (y). The points form a tight cluster that slopes downward from left to right. Describe the association.

Show Answer

Direction: Negative—as distance from the equator increases, average January temperature tends to decrease.

Form: Linear—the points cluster around a straight line.

Strength: Strong—the points form a “tight cluster.”

Complete description: There is a strong, negative, linear association between distance from the equator and January temperature. Cities farther from the equator tend to have colder January temperatures.

Problem 2: Match each correlation coefficient to the most likely scenario:

  • $r = 0.92$
  • $r = -0.78$
  • $r = 0.03$

A) Shoe size and math test score for adults
B) Hours spent studying and exam grade
C) Years of smoking and lung capacity

Show Answer

A) Shoe size and math test score: No logical connection between these variables. $r = 0.03$ (essentially no correlation)

B) Hours studying and exam grade: More studying generally leads to higher grades. $r = 0.92$ (strong positive correlation)

C) Years of smoking and lung capacity: More smoking generally decreases lung capacity. $r = -0.78$ (strong negative correlation)

Problem 3: A dataset shows a correlation of $r = 0.05$ between x and y. A student concludes there is no relationship between the variables. What important caveat should the student consider?

Show Answer

The student should consider that $r$ only measures linear relationships. A correlation near zero means there is no linear relationship, but there could still be a strong nonlinear relationship.

For example, the relationship between anxiety level and performance often follows an inverted U-shape (moderate anxiety helps, but too little or too much anxiety hurts). This would have $r \approx 0$ despite a clear pattern.

Lesson: Always look at the scatter plot. A correlation of 0 does not mean “no relationship”—it means “no linear relationship.”

Problem 4: A study finds a strong positive correlation between the number of Nobel Prize winners a country has produced and its chocolate consumption per capita. A journalist writes, “Chocolate Makes You Smarter!” Identify at least two lurking variables that could explain this correlation.

Show Answer

Several lurking variables could explain the correlation:

  1. Economic development/GDP: Wealthier countries can afford more chocolate and invest more in education and research institutions that produce Nobel laureates.

  2. Climate/geography: Many high-chocolate-consuming countries are in Northern Europe (colder climates where hot cocoa is popular). These same countries have historically had strong educational traditions and research universities.

  3. Educational investment: Countries that invest heavily in education tend to produce more Nobel Prize winners and have higher standards of living that allow for chocolate purchases.

  4. European cultural/historical factors: Both chocolate consumption patterns and the Nobel Prize itself have European origins, which could create a spurious correlation.

The headline confuses correlation with causation. It is extremely unlikely that eating chocolate directly causes Nobel Prize-worthy achievements.

Problem 5: Calculate the correlation coefficient for the following data:

| $x$ | $y$ |
| --- | --- |
| 2 | 15 |
| 4 | 11 |
| 6 | 9 |
| 8 | 5 |
| 10 | 5 |

Show Answer

Step 1: Calculate means.

$\bar{x} = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6$

$\bar{y} = \frac{15 + 11 + 9 + 5 + 5}{5} = \frac{45}{5} = 9$

Step 2: Calculate deviations and products.

| $x_i$ | $x_i - \bar{x}$ | $y_i$ | $y_i - \bar{y}$ | $(x_i - \bar{x})(y_i - \bar{y})$ | $(x_i - \bar{x})^2$ | $(y_i - \bar{y})^2$ |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | $-4$ | 15 | $6$ | $-24$ | $16$ | $36$ |
| 4 | $-2$ | 11 | $2$ | $-4$ | $4$ | $4$ |
| 6 | $0$ | 9 | $0$ | $0$ | $0$ | $0$ |
| 8 | $2$ | 5 | $-4$ | $-8$ | $4$ | $16$ |
| 10 | $4$ | 5 | $-4$ | $-16$ | $16$ | $16$ |

Step 3: Sum the columns.

$\sum(x_i - \bar{x})(y_i - \bar{y}) = -24 + (-4) + 0 + (-8) + (-16) = -52$

$\sum(x_i - \bar{x})^2 = 16 + 4 + 0 + 4 + 16 = 40$

$\sum(y_i - \bar{y})^2 = 36 + 4 + 0 + 16 + 16 = 72$

Step 4: Apply the formula.

$r = \frac{-52}{\sqrt{40 \times 72}} = \frac{-52}{\sqrt{2880}} = \frac{-52}{53.67} \approx -0.97$

Interpretation: $r \approx -0.97$ indicates a strong negative linear relationship. As x increases, y tends to decrease in a nearly linear pattern.
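A short check of the arithmetic with the computational formula (or equivalently numpy.corrcoef) gives the same answer:

```python
# Check Problem 5 with NumPy (data from the table above).
import numpy as np

x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([15, 11, 9, 5, 5], dtype=float)

dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
print(round(r, 2))   # -0.97, matching the hand calculation
```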

Problem 6: For the data in Problem 5, suppose an additional point (12, 18) is added. Without calculating, predict what will happen to the correlation coefficient and explain why.

Show Answer

The correlation will move closer to zero (become less negative) or potentially even become positive.

Why:

  1. The original data showed a strong negative trend—as x increased, y decreased.

  2. The new point (12, 18) goes against this pattern: it has the highest x-value and the highest y-value.

  3. This point has high leverage (extreme x-value) and deviates from the pattern, making it very influential.

  4. The correlation coefficient will be pulled toward this outlier, weakening the negative relationship we saw before.

This demonstrates how a single influential point can dramatically change the correlation, which is why we always recommend plotting data before calculating $r$.
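You can confirm the prediction by recomputing $r$ with the extra point included. A quick sketch:

```python
# Recompute r for Problem 6 after adding the influential point (12, 18).
import numpy as np

x = [2, 4, 6, 8, 10]
y = [15, 11, 9, 5, 5]

r_before = np.corrcoef(x, y)[0, 1]
r_after  = np.corrcoef(x + [12], y + [18])[0, 1]

print(round(r_before, 2), round(r_after, 2))   # roughly -0.97 and -0.07
```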

Problem 7: A researcher finds that cities with more libraries have lower crime rates ($r = -0.65$). Can we conclude that building more libraries reduces crime? What additional information would strengthen or weaken a causal claim?

Show Answer

No, we cannot conclude causation from this correlation alone.

Possible explanations for the correlation:

  1. Libraries cause lower crime: Libraries provide education, job training, safe spaces, and community programs that reduce crime.

  2. Reverse causation: Safer neighborhoods might attract more investment, including libraries.

  3. Lurking variables:

    • Wealth/property values: Wealthier areas can afford more libraries and tend to have lower crime
    • Education levels: More educated populations support libraries and have lower crime rates
    • Social cohesion: Close-knit communities might both use libraries more and have less crime
    • Government investment: Cities that invest in libraries also invest in policing, social services, and infrastructure

Information that would strengthen a causal claim:

  1. Natural experiments: Comparing crime rates before and after a library opens, controlling for other changes

  2. Controlling for confounders: Showing the relationship persists after adjusting for income, education, and policing levels

  3. Mechanism: Evidence for how libraries might reduce crime (job programs, after-school activities, etc.)

  4. Dose-response: Showing that more library usage (not just presence) correlates with larger crime reduction

Information that would weaken a causal claim:

  1. Finding that the correlation disappears when controlling for neighborhood income

  2. Evidence that library locations are chosen in already-safe areas

  3. No change in crime rates after new libraries open

Summary

  • Bivariate data consists of paired observations where each individual has values for two numerical variables. We study bivariate data to understand relationships between variables.

  • Scatter plots display bivariate data by plotting each pair $(x, y)$ as a point. The explanatory variable typically goes on the x-axis, and the response variable on the y-axis.

  • Describing associations requires addressing three things: direction (positive, negative, or none), form (linear or nonlinear), and strength (strong, moderate, or weak).

  • Positive association means that as x increases, y tends to increase. Negative association means that as x increases, y tends to decrease.

  • The correlation coefficient $r$ quantifies the strength and direction of a linear relationship. It ranges from $-1$ to $+1$, with values near $\pm 1$ indicating strong linear relationships and values near $0$ indicating weak or no linear relationship.

  • Properties of $r$: It is unitless, symmetric, and unaffected by changes of units, but sensitive to outliers. It only measures linear relationships—a correlation of zero does not rule out nonlinear relationships.

  • Correlation does not imply causation. When two variables are correlated, the relationship could be because A causes B, B causes A, a lurking variable causes both, or the correlation is coincidental.

  • Lurking variables (confounding variables) are hidden factors that affect both observed variables, creating misleading apparent associations. Always consider what lurking variables might explain a correlation.

  • Outliers and influential points can dramatically affect the correlation coefficient. Points with extreme x-values (high leverage) that deviate from the pattern are especially influential.

  • Correlation can be misleading when the relationship is nonlinear, when there are outliers, when data contains subgroups, or when the range of x-values is restricted. Always plot your data before calculating $r$.

  • Establishing causation typically requires controlled experiments where subjects are randomly assigned to conditions. Observational studies can reveal correlations but cannot prove causation.

  • When analyzing claims: Ask what lurking variables could explain the relationship, whether the causal direction is clear, and what kind of evidence would be needed to establish causation.