Data Analysis and Probability
Make sense of data and predict the likelihood of events
Numbers are everywhere. Your phone tracks your screen time. Sports apps show player statistics. Weather forecasts tell you the chance of rain. If all these numbers and percentages feel overwhelming, you are not alone. Many people glaze over when they see tables of data or hear phrases like “statistically speaking.”
But here is the good news: you already analyze data and think about probability every single day. When you check reviews before buying something online, you are analyzing data. When you decide whether to bring an umbrella based on a “40% chance of rain,” you are thinking about probability. When you figure out your grade average, you are calculating a statistic. This chapter just gives you the vocabulary and techniques to do what you already do - but with more precision and confidence.
Core Concepts
Measures of Center: Finding the “Typical” Value
When you have a collection of numbers (called a data set), one of the first questions to ask is: “What is a typical value?” There are three common ways to answer this question, and each tells you something slightly different.
Mean (Average)
The mean is what most people think of when they hear “average.” You calculate it by adding up all the values and dividing by how many values you have.
$$\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}$$
Think of the mean as the “balance point” of your data. If your data were weights on a seesaw, the mean is where you would put the fulcrum to balance them.
When to use it: The mean works well when your data is fairly evenly spread without extreme outliers.
Median (Middle Value)
The median is the middle value when you arrange all your numbers in order from smallest to largest. If you have an odd number of values, the median is the one right in the center. If you have an even number of values, the median is the average of the two middle numbers.
Think of the median as the value that splits your data in half: about half of the values are at or below it, and about half are at or above it.
When to use it: The median is great when you have outliers (extreme values) that would skew the mean. For example, when discussing home prices or incomes, the median often gives a better picture of “typical” than the mean.
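To see how one outlier pulls the mean while barely moving the median, here is a minimal Python sketch using the standard `statistics` module. The home prices are made-up values chosen for illustration, not real data.

```python
from statistics import mean, median

# Hypothetical home prices in thousands of dollars (illustrative only).
prices = [200, 210, 220, 230, 240]

# The same neighborhood after one mansion (an outlier) is added.
prices_with_mansion = prices + [2000]

print("Without outlier:", mean(prices), median(prices))
print("With outlier:   ", mean(prices_with_mansion), median(prices_with_mansion))
```

In this made-up data, the single mansion more than doubles the mean, while the median only moves from 220 to 225.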
Mode (Most Frequent)
The mode is simply the value that appears most often. A data set can have:
- One mode (unimodal): One value appears most frequently
- Multiple modes (bimodal, multimodal): Two or more values tie for most frequent
- No mode: Every value appears the same number of times
When to use it: The mode is especially useful for non-numerical data (like “What is the most popular pizza topping?”) or when you want to know what is most common.
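The three cases above (one mode, several modes, no mode) can be sketched in Python. This is a rough illustration, not the only way to do it: `modes` is a hypothetical helper name, and the sample data is invented. Note that `statistics.multimode` returns every value when all values tie, so the helper converts that case to an empty list to match the "no mode" convention used here.

```python
from statistics import multimode

def modes(values):
    """Return a list of modes, or [] when every value is equally frequent."""
    most_frequent = multimode(values)
    distinct_count = len(set(values))
    # If all distinct values tie (and there is more than one), report "no mode".
    if distinct_count > 1 and len(most_frequent) == distinct_count:
        return []
    return most_frequent

# Invented sample data for illustration.
toppings = ["pepperoni", "cheese", "pepperoni", "mushroom", "cheese", "pepperoni"]
tied_scores = [2, 5, 5, 8, 8]
all_different = [1, 2, 3]

print(modes(toppings))       # one mode, and the data is not numerical
print(modes(tied_scores))    # two modes (bimodal)
print(modes(all_different))  # no mode
```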
Measures of Spread: How Scattered Is the Data?
Knowing the center is not the whole story. Two data sets can have the same mean but look completely different. That is where spread (or variability) comes in.
Range
The range is the simplest measure of spread. It is the difference between the largest and smallest values.
$$\text{Range} = \text{Maximum value} - \text{Minimum value}$$
A large range means your data is spread out. A small range means values are clustered together.
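As a quick illustration of two data sets with the same mean but very different spread, consider the sketch below. The two lists are invented for this example, and `value_range` is a hypothetical helper name.

```python
from statistics import mean

def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

# Two invented data sets with the same mean.
clustered = [48, 49, 50, 51, 52]
spread_out = [10, 30, 50, 70, 90]

print(mean(clustered), value_range(clustered))
print(mean(spread_out), value_range(spread_out))
```

Both lists have a mean of 50, but the ranges (4 versus 80) show how differently the values are scattered.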
Reading Graphs: Data Visualization
Graphs turn numbers into pictures, making patterns easier to spot. Here are the three types you will encounter most often.
Bar Graphs
Bar graphs use rectangular bars to compare quantities across different categories. The height (or length) of each bar shows the value for that category.
Best for: Comparing quantities across distinct categories (favorite colors, sales by month, votes by candidate).
How to read: Look at the labels on the axes. One axis shows the categories, the other shows the values. Compare bar heights to compare values.
Line Graphs
Line graphs connect data points with lines, showing how values change over time or across a sequence.
Best for: Showing trends and changes over time (temperature throughout a day, stock prices over a year, your grades over a semester).
How to read: Look for upward trends (increasing), downward trends (decreasing), or flat sections (stable). The steeper the line, the faster the change.
Circle Graphs (Pie Charts)
Circle graphs show how a whole is divided into parts. The entire circle represents 100%, and each “slice” represents a portion of that whole.
Best for: Showing parts of a whole (budget breakdown, survey results, time allocation).
How to read: Larger slices mean larger portions. The percentages should add up to 100%. Compare slice sizes to compare portions.
Probability: The Mathematics of Chance
Probability measures how likely something is to happen. It is always a number between 0 and 1 (or 0% and 100%).
- Probability = 0: The event is impossible (rolling a 7 on a standard die)
- Probability = 1: The event is certain (getting heads or tails on a coin flip)
- Probability between 0 and 1: The event might happen
The basic probability formula is:
$$P(\text{event}) = \frac{\text{Number of favorable outcomes}}{\text{Total number of possible outcomes}}$$
We write $P(\text{event})$ to mean “the probability of the event.”
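The formula can be sketched in a few lines of Python, assuming every outcome in the sample space is equally likely. The function name `probability` is a hypothetical choice for this example; `Fraction` from the standard library keeps the answers as exact fractions.

```python
from fractions import Fraction

def probability(favorable, sample_space):
    """P(event) = favorable outcomes / total outcomes (equally likely outcomes assumed)."""
    return Fraction(len(favorable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}  # sample space for a standard die

p_six = probability({6}, die)                             # one favorable outcome
p_seven = probability({x for x in die if x == 7}, die)    # impossible event
p_any_face = probability(die, die)                        # certain event

print(p_six, p_seven, p_any_face)
```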
Theoretical vs. Experimental Probability
Theoretical probability is what should happen based on math. If you flip a fair coin, the theoretical probability of heads is $\frac{1}{2}$ because one out of two equally likely outcomes is heads.
Experimental probability is what actually happens when you run an experiment. If you flip a coin 100 times and get heads 47 times, the experimental probability of heads is $\frac{47}{100} = 0.47$.
The more times you repeat an experiment, the closer the experimental probability usually gets to the theoretical probability. This is called the Law of Large Numbers.
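One way to watch this happen is to simulate coin flips. The sketch below is a simulation under stated assumptions: a fair coin is modeled with Python's pseudo-random generator, and `experimental_heads` is a hypothetical helper name. The seed is fixed only so that repeated runs print the same numbers.

```python
import random

def experimental_heads(num_flips, rng):
    """Flip a simulated fair coin num_flips times; return the fraction of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

rng = random.Random(42)  # fixed seed for repeatable output

# Compare the experimental probability with the theoretical value of 0.5.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, experimental_heads(n, rng))
```

Small runs can land far from 0.5, while the larger runs tend to settle close to it.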
Notation and Terminology
| Term | Meaning | Example |
|---|---|---|
| Data set | A collection of values | {5, 7, 3, 9, 7, 2} |
| Mean | Sum of values divided by count; the “average” | Mean of {2, 4, 6} is 4 |
| Median | The middle value when data is ordered | Median of {1, 3, 7} is 3 |
| Mode | The most frequently occurring value | Mode of {2, 5, 5, 8} is 5 |
| Range | Maximum minus minimum | Range of {3, 7, 12} is 9 |
| Outlier | A value much higher or lower than others | 100 in {2, 3, 4, 5, 100} |
| $P(\text{event})$ | Probability of an event | $P(\text{heads}) = 0.5$ |
| Favorable outcomes | Outcomes that match what you are looking for | Rolling a 6 on a die: 1 favorable outcome |
| Sample space | All possible outcomes | Coin flip: {heads, tails} |
| Theoretical probability | Probability based on mathematical reasoning | Fair coin: $P(\text{heads}) = \frac{1}{2}$ |
| Experimental probability | Probability based on actual trials | 47 heads in 100 flips: $P(\text{heads}) = 0.47$ |
Examples
A student received the following quiz scores: 85, 90, 78, 90, 92. Find the mean, median, mode, and range.
Solution:
Mean: Add all scores and divide by the number of scores. $$\text{Mean} = \frac{85 + 90 + 78 + 90 + 92}{5} = \frac{435}{5} = 87$$
Median: First, arrange the scores in order: 78, 85, 90, 90, 92. Since there are 5 scores (an odd number), the median is the middle value, which is the 3rd score. $$\text{Median} = 90$$
Mode: Look for the most frequent value. The score 90 appears twice, while all others appear once. $$\text{Mode} = 90$$
Range: Subtract the minimum from the maximum. $$\text{Range} = 92 - 78 = 14$$
The student’s typical score is around 87-90, and their scores span a range of 14 points.
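If you want to check this kind of work with a computer, Python's standard `statistics` module has a function for each measure of center. A minimal check of the quiz scores above:

```python
from statistics import mean, median, mode

scores = [85, 90, 78, 90, 92]

score_mean = mean(scores)
score_median = median(scores)
score_mode = mode(scores)
score_range = max(scores) - min(scores)

print(score_mean, score_median, score_mode, score_range)
```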
The prices (in dollars) of six different items are: 12, 8, 15, 20, 10, 14. Find the median price.
Solution:
Step 1: Arrange the prices in order from least to greatest. $$8, 10, 12, 14, 15, 20$$
Step 2: Since there are 6 values (an even number), there is no single middle value. The median is the average of the two middle values (the 3rd and 4th values).
The two middle values are 12 and 14.
$$\text{Median} = \frac{12 + 14}{2} = \frac{26}{2} = 13$$
The median price is \$13, even though no item actually costs \$13. This tells you that half the items cost less than \$13 and half cost more.
A survey asked students about their favorite subjects. The bar graph shows:
- Math: 45 students
- Science: 38 students
- English: 52 students
- History: 30 students
- Art: 35 students
Answer these questions: a) What is the most popular subject? b) How many more students prefer English over History? c) What is the total number of students surveyed?
Solution:
a) Most popular subject: Look for the tallest bar. English has the highest value at 52 students. $$\text{Most popular subject: English}$$
b) Difference between English and History: $$52 - 30 = 22 \text{ more students}$$
c) Total students surveyed: Add all the values. $$45 + 38 + 52 + 30 + 35 = 200 \text{ students}$$
Notice that you could also calculate what percentage preferred each subject. For example, the percentage who prefer Math is: $$\frac{45}{200} = 0.225 = 22.5\%$$
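The same questions can be answered in code once the bar heights are stored as data. In this sketch the dictionary simply holds the survey counts from the example; the variable names are illustrative.

```python
# Survey counts read off the bar graph in this example.
survey = {"Math": 45, "Science": 38, "English": 52, "History": 30, "Art": 35}

most_popular = max(survey, key=survey.get)          # tallest bar
difference = survey["English"] - survey["History"]  # English vs. History
total = sum(survey.values())                        # everyone surveyed

# Percentage of all students who chose each subject.
percentages = {subject: 100 * count / total for subject, count in survey.items()}

print(most_popular, difference, total)
print(percentages)
```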
A bag contains 5 red marbles, 3 blue marbles, and 2 green marbles. If you pick one marble without looking, what is the probability of: a) Picking a red marble? b) Picking a blue or green marble? c) Picking a yellow marble?
Solution:
First, find the total number of marbles: $5 + 3 + 2 = 10$ marbles.
a) Probability of red: There are 5 red marbles out of 10 total. $$P(\text{red}) = \frac{5}{10} = \frac{1}{2} = 0.5 = 50\%$$
b) Probability of blue or green: There are $3 + 2 = 5$ marbles that are blue or green. $$P(\text{blue or green}) = \frac{5}{10} = \frac{1}{2} = 0.5 = 50\%$$
Another way to think about this: since red is 50%, everything that is “not red” is also 50%.
c) Probability of yellow: There are 0 yellow marbles in the bag. $$P(\text{yellow}) = \frac{0}{10} = 0 = 0\%$$
Picking yellow is impossible because there are no yellow marbles.
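Here is a short sketch of the marble bag in Python, assuming each marble is equally likely to be picked. The helper `p` is a hypothetical name; it counts the marbles of the requested colors and divides by the total.

```python
from fractions import Fraction

bag = {"red": 5, "blue": 3, "green": 2}  # marble counts from the example
total = sum(bag.values())

def p(*colors):
    """Probability of picking a marble in any of the given colors."""
    favorable = sum(bag.get(color, 0) for color in colors)  # missing colors count as 0
    return Fraction(favorable, total)

print(p("red"))            # part a
print(p("blue", "green"))  # part b
print(p("yellow"))         # part c: no yellow marbles in the bag
print(1 - p("red"))        # "not red" matches part b
```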
You roll a standard six-sided die 60 times and record the results:
| Number | Times Rolled |
|---|---|
| 1 | 8 |
| 2 | 12 |
| 3 | 9 |
| 4 | 11 |
| 5 | 10 |
| 6 | 10 |
a) What is the theoretical probability of rolling a 2? b) What is the experimental probability of rolling a 2? c) What is the experimental probability of rolling an even number (2, 4, or 6)? d) Why might the experimental probability differ from the theoretical probability?
Solution:
a) Theoretical probability of rolling a 2: A standard die has 6 faces, each equally likely. Only one face shows 2. $$P(\text{rolling 2}) = \frac{1}{6} \approx 0.167 = 16.7\%$$
b) Experimental probability of rolling a 2: You rolled 2 exactly 12 times out of 60 total rolls. $$P(\text{rolling 2}) = \frac{12}{60} = \frac{1}{5} = 0.2 = 20\%$$
c) Experimental probability of rolling an even number: Even numbers (2, 4, 6) were rolled: $12 + 11 + 10 = 33$ times. $$P(\text{even}) = \frac{33}{60} = \frac{11}{20} = 0.55 = 55\%$$
The theoretical probability of rolling an even number is $\frac{3}{6} = \frac{1}{2} = 50\%$ (since 3 out of 6 faces are even).
d) Why the difference? Random events do not always match their theoretical probabilities perfectly, especially with a limited number of trials. This natural variation is expected. If you rolled the die 6,000 times instead of 60, the experimental probabilities would likely be much closer to the theoretical ones. This illustrates the Law of Large Numbers.
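The comparison between experimental and theoretical values can be laid out side by side in code. This sketch uses the roll counts from the table above; the variable names are illustrative.

```python
from fractions import Fraction

# Roll counts from the table: face -> times rolled.
rolls = {1: 8, 2: 12, 3: 9, 4: 11, 5: 10, 6: 10}
total_rolls = sum(rolls.values())

# Experimental probabilities come from the recorded counts.
experimental_two = Fraction(rolls[2], total_rolls)
even_count = sum(count for face, count in rolls.items() if face % 2 == 0)
experimental_even = Fraction(even_count, total_rolls)

# Theoretical probabilities come from the six equally likely faces.
theoretical_two = Fraction(1, 6)
theoretical_even = Fraction(3, 6)

print(experimental_two, theoretical_two, experimental_two - theoretical_two)
print(experimental_even, theoretical_even, experimental_even - theoretical_even)
```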
Key Properties and Rules
Calculating Mean, Median, Mode, and Range
Mean Formula: $$\text{Mean} = \frac{\sum \text{(all values)}}{\text{number of values}}$$
Median Rules:
- Odd number of values: Median is the middle value
- Even number of values: Median is the average of the two middle values
- Always arrange data in order first!
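The median rules above translate directly into a short function. This is a sketch written by hand for illustration (`median_by_hand` is a hypothetical name); in practice `statistics.median` from the standard library does the same job.

```python
def median_by_hand(values):
    """Apply the median rules: sort, then take the middle (or average the two middles)."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)  # always arrange the data in order first
    count = len(ordered)
    middle = count // 2
    if count % 2 == 1:
        return ordered[middle]  # odd count: the single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2  # even count: average the two middle values

print(median_by_hand([85, 90, 78, 90, 92]))      # quiz scores (5 values)
print(median_by_hand([12, 8, 15, 20, 10, 14]))   # item prices (6 values)
```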
Mode:
- Find the value(s) that appear most often
- Can have multiple modes or no mode
Range: $$\text{Range} = \text{Maximum} - \text{Minimum}$$
Probability Rules
Basic Probability: $$P(\text{event}) = \frac{\text{favorable outcomes}}{\text{total possible outcomes}}$$
Probability Boundaries: $$0 \leq P(\text{event}) \leq 1$$
Complementary Events: If $P(\text{event})$ is the probability something happens, then: $$P(\text{event does NOT happen}) = 1 - P(\text{event})$$
For example, if the probability of rain is 30%, then the probability of no rain is 70%.
Probability of “Or” (for events that cannot both happen): $$P(A \text{ or } B) = P(A) + P(B)$$
For example, when rolling a die: $P(\text{1 or 6}) = \frac{1}{6} + \frac{1}{6} = \frac{2}{6} = \frac{1}{3}$
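The sketch below checks the complement rule and the "or" rule on a standard die, and then shows why the "cannot both happen" condition matters. The helper `p` is a hypothetical name, and the overlapping events in the last part are an extra illustration chosen for this sketch.

```python
from fractions import Fraction

die = range(1, 7)  # faces 1 through 6, all equally likely

def p(event):
    """Probability that a die roll satisfies the given condition."""
    return Fraction(sum(1 for face in die if event(face)), len(die))

# Complement rule: 30% chance of rain leaves a 70% chance of no rain.
p_rain = Fraction(30, 100)
p_no_rain = 1 - p_rain

# "Or" rule for events that cannot both happen: rolling a 1 or a 6.
p_1_or_6 = p(lambda face: face == 1) + p(lambda face: face == 6)

# Events that CAN both happen (6 is even and greater than 4):
# simply adding the probabilities counts the overlap twice.
p_even = p(lambda face: face % 2 == 0)
p_gt4 = p(lambda face: face > 4)
p_even_or_gt4 = p(lambda face: face % 2 == 0 or face > 4)

print(p_no_rain, p_1_or_6)
print(p_even + p_gt4, p_even_or_gt4)  # the sum overshoots the true value
```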
Graph Reading Tips
- Always check the axes: What is being measured? What are the units?
- Look at the scale: Does it start at zero? A graph that does not start at zero can make differences look bigger than they are.
- Identify trends: In line graphs, is the overall direction up, down, or stable?
- Check the totals: In pie charts, percentages should add to 100%.
Real-World Applications
School and Grades
Your GPA (Grade Point Average) is a weighted mean: each course grade counts in proportion to its credit hours. When you ask “What do I need on the final to get a B?” you are using algebra and statistics together.
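As a rough sketch of how credit hours act as weights, the example below assumes a 4.0 grade-point scale and three invented courses. Actual GPA rules vary by school, so treat the numbers as hypothetical.

```python
# Invented courses: (grade points on an assumed 4.0 scale, credit hours).
courses = [(4.0, 3), (3.0, 4), (2.0, 1)]

# Weighted mean: multiply each grade by its credit hours, then divide by total credits.
total_points = sum(grade * credits for grade, credits in courses)
total_credits = sum(credits for _, credits in courses)
gpa = total_points / total_credits

# Plain mean for comparison: every course counts equally.
plain_mean = sum(grade for grade, _ in courses) / len(courses)

print(gpa, plain_mean)
```

Here the weighted result (3.25) is higher than the plain mean (3.0) because the low grade came from a course with only one credit hour.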
Teachers often report the class median on tests because it is not thrown off by a few very high or very low scores. If the median is 75%, that means half the class scored above 75% and half scored below.
Sports Statistics
Sports are packed with data analysis. A baseball player’s batting average is a mean (hits divided by at-bats). The “over/under” in sports betting is related to expected values. When commentators say a basketball player “shoots 40% from three,” that is experimental probability based on their actual performance.
Weather Forecasts
When the forecast says “60% chance of rain,” that is probability. It does not mean it will rain for 60% of the day. Roughly speaking, it means that in conditions like today’s, it has rained about 60% of the time. In that sense it works like experimental probability based on past data, although exact definitions vary between forecasters.
Money and Finance
Average income, median home prices, and interest rates all involve these concepts. Financial analysts use measures of center and spread constantly. The median home price in an area is often more useful than the mean because a few mansions would pull the mean up unrealistically.
Health and Medicine
Medical studies report probabilities all the time. “This medication is effective in 85% of patients” means the experimental probability of effectiveness is 0.85. Doctors use statistics to make treatment decisions, and researchers use probability to determine whether a new drug actually works or whether the results were just due to chance.
Games and Decision Making
Understanding probability makes you a smarter game player. Whether it is calculating odds in card games, deciding whether to take a chance in a board game, or understanding why the house always wins in casinos, probability helps you make better decisions.
Self-Test Problems
Problem 1: Find the mean, median, and mode of this data set: 15, 20, 15, 25, 30, 15, 20.
Answer:
Mean: $$\frac{15 + 20 + 15 + 25 + 30 + 15 + 20}{7} = \frac{140}{7} = 20$$
Median: First, arrange in order: 15, 15, 15, 20, 20, 25, 30. With 7 values, the median is the 4th value. $$\text{Median} = 20$$
Mode: 15 appears three times (more than any other value). $$\text{Mode} = 15$$
Problem 2: The ages of 8 employees are: 22, 25, 31, 28, 45, 33, 29, 27. Find the median age and the range.
Answer:
Median: Arrange in order: 22, 25, 27, 28, 29, 31, 33, 45. With 8 values (even), the median is the average of the 4th and 5th values. $$\text{Median} = \frac{28 + 29}{2} = \frac{57}{2} = 28.5 \text{ years}$$
Range: $$\text{Range} = 45 - 22 = 23 \text{ years}$$
Problem 3: A spinner is divided into 8 equal sections numbered 1 through 8. What is the probability of spinning: a) The number 5? b) An odd number? c) A number greater than 6?
Answer:
a) Probability of 5: Only one section shows 5, out of 8 total. $$P(5) = \frac{1}{8} = 0.125 = 12.5\%$$
b) Probability of odd number: Odd numbers are 1, 3, 5, 7 (four outcomes). $$P(\text{odd}) = \frac{4}{8} = \frac{1}{2} = 0.5 = 50\%$$
c) Probability of greater than 6: Numbers greater than 6 are 7 and 8 (two outcomes). $$P(> 6) = \frac{2}{8} = \frac{1}{4} = 0.25 = 25\%$$
Problem 4: A survey of 200 students found that 80 prefer pizza, 60 prefer tacos, 40 prefer burgers, and 20 prefer salad. If this data were shown in a pie chart: a) What percentage of the pie would the “pizza” slice be? b) What percentage would “tacos” and “burgers” combined be?
Answer:
a) Pizza percentage: $$\frac{80}{200} = 0.40 = 40\%$$
b) Tacos and burgers combined: $$\frac{60 + 40}{200} = \frac{100}{200} = 0.50 = 50\%$$
Problem 5: You flip a coin 50 times and get heads 28 times. a) What is the theoretical probability of getting heads? b) What is the experimental probability of getting heads based on your flips? c) If you flipped the coin 500 more times, would you expect the experimental probability to get closer to or farther from the theoretical probability? Why?
Answer:
a) Theoretical probability of heads: A fair coin has 2 equally likely outcomes, one of which is heads. $$P(\text{heads}) = \frac{1}{2} = 0.5 = 50\%$$
b) Experimental probability of heads: $$P(\text{heads}) = \frac{28}{50} = 0.56 = 56\%$$
c) Prediction for more flips: The experimental probability would likely get closer to the theoretical probability of 50%. This is because of the Law of Large Numbers: as you perform more trials, random variations tend to even out, and the experimental probability approaches the theoretical probability.
Summary
- Mean (average) is the sum of all values divided by the number of values. It represents the “balance point” of data.
- Median is the middle value when data is arranged in order. For an even number of values, average the two middle values. It is far less sensitive to outliers than the mean.
- Mode is the most frequently occurring value. A data set can have multiple modes or no mode.
- Range measures spread: subtract the minimum from the maximum.
- Bar graphs compare quantities across categories; line graphs show changes over time; pie charts show parts of a whole.
- Probability measures how likely an event is, from 0 (impossible) to 1 (certain): $P(\text{event}) = \frac{\text{favorable outcomes}}{\text{total outcomes}}$.
- Theoretical probability is what should happen based on math. Experimental probability is what actually happens in trials.
- The Law of Large Numbers says that experimental probability tends to get closer to theoretical probability as you run more trials.
- Choose your measure of center wisely: mean for symmetric data, median when outliers exist, mode for categorical data or finding the most common value.