Introduction to Statistics and Data

Learn what statistics is and why data matters

You already make decisions using data every day. When you read product reviews before buying something, you are looking at data. When you hear that “9 out of 10 dentists recommend” a toothpaste, you are encountering a statistic. When you see a poll showing which candidate is ahead in an election, someone has collected and analyzed data to produce that number.

Statistics is simply the science of learning from data. That might sound abstract, but it really comes down to asking questions and making decisions based on evidence rather than guesses. How effective is a new medication? What do voters actually think? Is this manufacturing process working correctly? These are all questions that statistics helps answer.

If numbers and data feel intimidating, you are not alone. But here is the reassuring truth: the core ideas of statistics are built on common sense. You have been thinking statistically your whole life without realizing it. This chapter gives you the vocabulary and framework to do it more precisely.

Core Concepts

What Is Statistics?

Statistics is the science of collecting, organizing, analyzing, and interpreting data to answer questions and make decisions. Notice that statistics is not just about calculating numbers (though that is part of it). It is fundamentally about asking good questions and drawing reasonable conclusions.

Think of statistics as a conversation with data:

  1. Ask a question (“Do people prefer our new product?”)
  2. Collect data (survey customers)
  3. Analyze the data (calculate percentages, look for patterns)
  4. Draw conclusions (“65% of customers prefer the new version”)
  5. Make decisions (continue with the new product)

Every step matters. Bad questions lead to useless answers. Poorly collected data leads to wrong conclusions. And even good analysis can be misinterpreted. Statistics teaches you how to do all of this well.
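The five steps above can be sketched in a few lines of Python. This is only an illustration: the survey responses are made up, and the 50% decision threshold is an assumption chosen for the example, not a rule from statistics.

```python
from collections import Counter

# Step 1: the question is "Do people prefer our new product?"

# Step 2: collect data (hypothetical survey responses, made up for illustration)
responses = ["new"] * 13 + ["old"] * 7

# Step 3: analyze the data by counting answers and converting to a percentage
counts = Counter(responses)
percent_new = 100 * counts["new"] / len(responses)

# Step 4: draw a conclusion about the surveyed customers
print(f"{percent_new:.0f}% of surveyed customers prefer the new version")

# Step 5: make a decision using a threshold fixed in advance (an assumption)
DECISION_THRESHOLD = 50
if percent_new > DECISION_THRESHOLD:
    decision = "continue with the new product"
else:
    decision = "keep the old product"
print(decision)
```

With these 20 made-up responses, 13 of 20 prefer the new version, which is the 65% figure used in step 4.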

Population vs. Sample

Here is a fundamental distinction that shapes everything in statistics.

A population is the entire group you want to learn about. This could be all high school students in the United States, all smartphones manufactured by a company, or all possible outcomes of flipping a coin. The population is what you care about, what you want to draw conclusions about.

A sample is a subset of the population that you actually study. Since you usually cannot examine every single member of a population (imagine surveying every high school student in the country), you study a smaller group and use what you learn to make inferences about the larger population.

This is the heart of statistical thinking: learning about a whole by studying a part.

Why does this distinction matter? Because every conclusion you draw from a sample comes with uncertainty. Your sample might not perfectly represent the population. A well-designed study minimizes this problem, but it never eliminates it entirely. Understanding this limitation makes you a smarter consumer of statistical claims.

Parameters vs. Statistics

These two terms sound similar but mean different things, and keeping them straight is important.

A parameter is a number that describes the population. For example, the true average height of all high school students in the US is a parameter. It is a fixed, definite value (even if we do not know exactly what it is).

A statistic is a number calculated from a sample. If you measure the heights of 500 students and calculate their average, that average is a statistic. It is an estimate of the parameter.

Here is the key insight: we use statistics (numbers from samples) to estimate parameters (numbers describing populations). When a news report says “the average American spends 4 hours a day on their phone,” that number almost certainly came from a sample, not from measuring every American. It is a statistic being used to estimate a parameter.
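A small simulation can make the distinction concrete. The sketch below invents a population of student heights (the mean of 168 cm and spread of 9 cm are assumptions, not real data), computes the parameter directly, and then estimates it from a random sample of 500. In real studies the parameter is usually unknown; here it is available only because the population is simulated.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: simulated heights (cm) of 100,000 students
population = [random.gauss(168, 9) for _ in range(100_000)]

# Parameter: a number describing the whole population
parameter = statistics.mean(population)

# Statistic: the same calculation on a random sample of 500 students
sample = random.sample(population, 500)
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.2f} cm")
print(f"statistic (sample mean):     {statistic:.2f} cm")
```

The two numbers should be close but not identical. That gap is the uncertainty that comes with using a sample.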

Types of Data: Categorical vs. Numerical

Not all data looks the same, and different types of data require different analysis methods.

Categorical data (also called qualitative data) places observations into groups or categories. Examples include:

  • Eye color (blue, brown, green, hazel)
  • Favorite music genre (pop, rock, classical, hip-hop)
  • Yes/no responses on a survey
  • Blood type (A, B, AB, O)

With categorical data, you can count how many observations fall into each category, but mathematical operations like averaging do not make sense. What would it mean to find the “average” eye color?

Numerical data (also called quantitative data) consists of numbers that have mathematical meaning. Examples include:

  • Height in centimeters
  • Temperature in degrees
  • Number of siblings
  • Test scores
  • Annual income

With numerical data, you can perform calculations like finding averages, ranges, and other measures.
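The difference shows up in which calculations are available. In this sketch (the observations are made up), the categorical variable is summarized by counts and its most common value, while the numerical variable supports an average and a range.

```python
from collections import Counter
import statistics

# Made-up observations for illustration
eye_colors = ["brown", "blue", "brown", "green", "brown", "hazel", "blue"]
heights_cm = [162.0, 175.5, 168.2, 180.1, 171.3, 158.9, 166.0]

# Categorical data: count each category and find the most common one (the mode)
color_counts = Counter(eye_colors)
most_common_color, frequency = color_counts.most_common(1)[0]
print(color_counts)
print(f"most common: {most_common_color} ({frequency} observations)")

# Numerical data: arithmetic such as averaging is meaningful
mean_height = statistics.mean(heights_cm)
height_range = max(heights_cm) - min(heights_cm)
print(f"mean height: {mean_height:.1f} cm, range: {height_range:.1f} cm")
```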

Discrete vs. Continuous Data

Numerical data breaks down further into two types.

Discrete data takes on countable values, usually whole numbers. There are gaps between possible values. Examples include:

  • Number of siblings (you can have 0, 1, 2, 3… but not 2.7 siblings)
  • Number of cars in a parking lot
  • Points scored in a game
  • Number of customers served

Continuous data can take any value within a range, including decimals and fractions. The possible values form a continuous spectrum. Examples include:

  • Height (you could be 170.4 cm or 170.45 cm or any value in between)
  • Weight
  • Time to complete a task
  • Temperature

A helpful test: if a fraction of a unit still makes sense, the variable is probably continuous. You cannot have half a sibling, but you can have half a kilogram.

Levels of Measurement

Data can also be classified by how much information it carries. These levels build on each other, with each level having all the properties of the levels below it plus additional properties.

Nominal (naming): Data is placed in categories with no inherent order. You can say whether two observations are the same or different, but you cannot rank them.

  • Examples: Eye color, country of birth, phone brand, blood type
  • Meaningful operations: Counting, mode

Ordinal (ordering): Data has categories with a meaningful order, but the differences between categories are not necessarily equal.

  • Examples: Survey ratings (poor, fair, good, excellent), class rank, education level (high school, bachelor’s, master’s, doctorate)
  • Meaningful operations: Counting, mode, median, comparisons (greater than, less than)

Interval (equal intervals): Data has meaningful order AND equal differences between values, but there is no true zero point.

  • Examples: Temperature in Celsius or Fahrenheit (0 degrees does not mean “no temperature”), calendar years, IQ scores
  • Meaningful operations: All of the above plus mean, standard deviation

Ratio (true zero): Data has all the properties of interval data plus a meaningful zero point that represents “none” of the quantity.

  • Examples: Height, weight, age, income, distance, temperature in Kelvin
  • Meaningful operations: All of the above plus ratios (twice as tall, half as expensive)

Why does this matter? Because certain calculations only make sense at certain levels. You cannot meaningfully say that 20 degrees Celsius is “twice as hot” as 10 degrees Celsius (that is a ratio comparison on interval data), but you can say that 20 kilograms is twice as heavy as 10 kilograms.
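The temperature example can be checked with a little arithmetic. The sketch below converts the two Celsius readings to Kelvin, a scale with a true zero, and compares the ratios. The helper name celsius_to_kelvin is just a label chosen for this illustration.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15

cool, warm = 10.0, 20.0

# Ratio computed on the interval scale: suggests "twice as hot", which is misleading
naive_ratio = warm / cool

# Ratio computed on a scale with a true zero
kelvin_ratio = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)

# Differences are meaningful on both scales, and they agree
celsius_diff = warm - cool
kelvin_diff = celsius_to_kelvin(warm) - celsius_to_kelvin(cool)

print(f"Celsius ratio: {naive_ratio}")
print(f"Kelvin ratio:  {kelvin_ratio:.3f}")
print(f"Difference in Celsius: {celsius_diff:.1f}, in Kelvin: {kelvin_diff:.1f}")
```

On the Kelvin scale, 20 degrees Celsius is only about 3.5% warmer than 10 degrees Celsius, not 100% warmer, while the 10-degree difference is the same on both scales.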

Variables: Explanatory vs. Response

When you are studying relationships between variables, it helps to identify their roles.

The explanatory variable (also called the independent variable) is the variable you think might influence or explain changes in another variable. It is often what you control or manipulate in an experiment.

The response variable (also called the dependent variable) is the variable you think might be affected. It is what you measure to see if there is an effect.

For example, if you are studying whether studying more hours leads to better test scores:

  • Explanatory variable: Hours of studying
  • Response variable: Test score

The names help you think about causation, but be careful: just because you label something as “explanatory” does not mean it actually causes changes in the response variable. That is a conclusion you need to support with evidence.

Observational Studies vs. Experiments

How you collect data determines what conclusions you can draw.

In an observational study, you observe and measure variables without trying to influence them. You record things as they naturally occur.

  • Example: Surveying people about their coffee consumption and sleep quality
  • Strength: You see real-world behavior
  • Limitation: You cannot establish cause and effect (maybe people who drink more coffee also have more stressful jobs, and stress affects sleep)

In an experiment, you actively impose treatments on subjects to see what effect the treatments have.

  • Example: Randomly assigning people to drink coffee or decaf, then measuring their sleep
  • Strength: You can establish cause and effect (because you controlled what changed)
  • Limitation: Artificial settings may not reflect real-world behavior

The gold standard for establishing that one thing causes another is a randomized controlled experiment. Random assignment helps ensure that any differences between groups are due to the treatment, not some other factor.
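Random assignment is simple to carry out in practice. The following is a minimal sketch, assuming 20 hypothetical participant IDs and two groups (coffee and decaf). The function name randomly_assign is invented for this example.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle subjects and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy so the original order is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs
participants = [f"P{i:03d}" for i in range(1, 21)]

# The seed is fixed only so the sketch gives the same split every time
coffee_group, decaf_group = randomly_assign(participants, seed=7)

print("coffee:", coffee_group)
print("decaf: ", decaf_group)
```

Because chance alone decides who lands in each group, neither the researchers nor the participants can steer particular kinds of people toward one treatment.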

Why Sampling Matters: The Danger of Biased Data

Here is perhaps the most important practical lesson in statistics: your conclusions are only as good as your sample.

A biased sample is one that systematically differs from the population in some way. Conclusions from biased samples can be wildly wrong, no matter how sophisticated your analysis.

Common sources of bias:

Convenience sampling: Studying whoever is easiest to reach. If you survey people at a gym about exercise habits, you are missing people who never go to gyms.

Voluntary response bias: When people choose whether to participate, those with strong opinions (especially negative ones) are more likely to respond. Online reviews often suffer from this.

Undercoverage: When some groups in the population are less likely to be included. Phone surveys miss people without phones; online surveys miss people without internet access.

Nonresponse bias: When people who do not respond differ systematically from those who do. Busy people may not have time to complete surveys.

Question wording bias: How you ask a question can influence the answer. “Do you support protecting the environment?” gets different responses than “Do you support regulations that might cost jobs?”

The solution is random sampling, where every member of the population has a known, nonzero chance of being selected. Random sampling does not eliminate uncertainty, but it removes the systematic bias that comes from how the sample is chosen and allows you to quantify how uncertain your conclusions are. It does not, by itself, fix nonresponse or poorly worded questions.
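To see how much damage a convenience sample can do, consider a simulated version of the gym example. Every number here is an assumption made for illustration: 20% of adults are gym members, and members average about 6 hours of exercise per week versus about 2 hours for everyone else.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 adults as (is_gym_member, weekly_hours) pairs
population = []
for _ in range(10_000):
    gym_member = random.random() < 0.20
    typical_hours = 6.0 if gym_member else 2.0
    hours = max(0.0, random.gauss(typical_hours, 1.5))  # no negative hours
    population.append((gym_member, hours))

true_mean = statistics.mean(h for _, h in population)

# Convenience sample: 200 people surveyed at the gym
gym_goers = [person for person in population if person[0]]
convenience_sample = random.sample(gym_goers, 200)
convenience_mean = statistics.mean(h for _, h in convenience_sample)

# Simple random sample: 200 people drawn from the whole population
random_sample = random.sample(population, 200)
random_mean = statistics.mean(h for _, h in random_sample)

print(f"population mean:         {true_mean:.2f} hours/week")
print(f"gym (convenience) mean:  {convenience_mean:.2f} hours/week")
print(f"random sample mean:      {random_mean:.2f} hours/week")
```

Both samples have the same size. The gym sample is far off because of who was asked, not because of how many were asked.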

Notation and Terminology

| Term | Meaning | Example |
| --- | --- | --- |
| Population | The entire group you want to learn about | All high school students in the US |
| Sample | A subset of the population you actually study | 500 surveyed students |
| Parameter | A number describing the population | True average height of all students |
| Statistic | A number calculated from a sample | Average height of 500 surveyed students |
| Categorical data | Data in categories or groups | Eye color, favorite genre |
| Numerical data | Data that are numbers with meaning | Height, temperature, age |
| Discrete | Countable values | Number of siblings |
| Continuous | Measurable values on a scale | Weight, time |
| Explanatory variable | Variable that might explain or cause changes | Hours of studying |
| Response variable | Variable that might be affected | Test score |
| Observational study | Study where you observe without intervening | Survey about habits |
| Experiment | Study where you impose treatments | Clinical trial |
| Bias | Systematic error that skews results | Surveying only gym members about exercise |

Examples

Example 1: Identifying Population and Sample

A news headline reads: “Survey of 1,200 voters shows 52% support new education policy.”

Identify the population and sample in this study.

Solution:

Population: All voters (in whatever region the poll covers, likely the country or state). This is the group the pollsters want to learn about.

Sample: The 1,200 voters who were actually surveyed. This is the group the pollsters actually studied.

Notice that 52% is a statistic (calculated from the sample), and it is being used to estimate the parameter (the true percentage of all voters who support the policy).

Example 2: Classifying Variables as Categorical or Numerical

Classify each variable as categorical or numerical. If numerical, also classify as discrete or continuous.

  a) A student’s letter grade (A, B, C, D, F)
  b) The number of text messages sent per day
  c) A person’s height in inches
  d) Zip code
  e) Number of pets owned

Solution:

a) Letter grade: Categorical. Letter grades do have a natural order (which makes them ordinal), but they are still categories, not quantities. You cannot average letter grades directly (though you can convert them to numbers, as GPA does).

b) Text messages per day: Numerical, Discrete. These are counts. You can send 50 or 51 messages, but not 50.7 messages.

c) Height in inches: Numerical, Continuous. Height can be any value within a range. You could be 67.3 inches or 67.34 inches.

d) Zip code: Categorical. This one is tricky! Zip codes are numbers, but they are really just labels for geographic areas. It makes no sense to calculate the “average zip code” or say that zip code 20000 is “twice” zip code 10000.

e) Number of pets: Numerical, Discrete. You can have 0, 1, 2, 3 pets, but not 2.5 pets.

Example 3: Observational Study or Experiment?

For each scenario, determine whether it is an observational study or an experiment. Then explain whether you can draw conclusions about cause and effect.

a) Researchers track 5,000 adults over 10 years, recording their exercise habits and heart disease rates.

b) Researchers randomly assign 200 patients to receive either a new medication or a placebo, then compare their recovery rates.

c) A company compares sales in stores that chose to offer free samples versus stores that did not.

Solution:

a) Observational study. The researchers are not telling people how much to exercise; they are just observing what people naturally do. While they might find that more exercise is associated with lower heart disease rates, they cannot conclude that exercise causes lower rates. Maybe healthier people are both more likely to exercise and less likely to get heart disease.

b) Experiment. Researchers are actively assigning treatments (medication vs. placebo) and the assignment is random. This is a randomized controlled experiment. If the medication group has better recovery rates, researchers can conclude the medication caused the improvement, because random assignment makes the groups comparable, on average, in all other ways.

c) Observational study. Even though it involves comparing groups, the researchers did not assign which stores offered samples. Stores chose for themselves. Stores that chose to offer samples might differ in other ways (maybe they are in busier locations or have more engaged managers). You cannot conclude that free samples caused any difference in sales.

Example 4: Levels of Measurement

Identify the level of measurement (nominal, ordinal, interval, or ratio) for each variable.

  a) Temperature in Fahrenheit
  b) Customer satisfaction rating (1 = very dissatisfied, 2 = dissatisfied, 3 = neutral, 4 = satisfied, 5 = very satisfied)
  c) A runner’s finish time in a race
  d) Jersey numbers of basketball players
  e) Military rank (Private, Corporal, Sergeant, Lieutenant, Captain, etc.)

Solution:

a) Temperature in Fahrenheit: Interval. The differences between values are meaningful (the difference between 70 and 80 degrees is the same as between 80 and 90 degrees). However, 0 degrees Fahrenheit does not mean “no temperature,” so ratios are not meaningful. 80 degrees is not “twice as hot” as 40 degrees.

b) Satisfaction rating: Ordinal. There is a clear order (5 is better than 4, which is better than 3, etc.), but we cannot be sure the gaps are equal. Is the difference between “very dissatisfied” and “dissatisfied” the same as the difference between “satisfied” and “very satisfied”? Probably not.

c) Finish time: Ratio. A time of 0 means “no time passed” (a true zero). A time of 20 minutes is genuinely twice as long as 10 minutes. You can meaningfully compare ratios.

d) Jersey numbers: Nominal. These are just labels. Player 23 is not “more” than player 11 in any meaningful sense. You cannot average jersey numbers and get anything useful.

e) Military rank: Ordinal. There is a definite order (Captain outranks Lieutenant outranks Sergeant, etc.), but the “gaps” between ranks are not equal or quantifiable.

Example 5: Critiquing a Sampling Method

A university wants to know how satisfied students are with campus dining. They decide to survey students by standing outside the main dining hall on a Thursday at noon and asking students leaving the hall to complete a brief survey. They collect 150 responses.

Identify at least three potential sources of bias in this sampling method and explain how each could affect the results.

Solution:

This sampling method has several problems:

1. Convenience sampling bias: By surveying only students who happen to pass by at that specific time and place, they are missing students who eat at other times, at other dining locations, or who do not use campus dining at all. The sample is not representative of all students.

2. Undercoverage: Students who never eat at the dining hall (perhaps because they are dissatisfied with it!) will never have a chance to be surveyed. This likely skews results toward more satisfied students since dissatisfied students may have already stopped using campus dining.

3. Timing bias: Thursday at noon captures a specific subset of students. Students with classes at that time, students who work during lunch, or students who prefer to eat at other times are all excluded. Different types of students might have systematically different opinions.

4. Voluntary response bias: Students choose whether to participate. Those who feel strongly (either very happy or very unhappy) might be more likely to stop and respond. Those with neutral opinions might just walk past.

5. Location bias: Only students at the main dining hall are surveyed. Students who prefer satellite cafes, food trucks, or off-campus options are missed entirely.

A better approach: Randomly select students from the university’s enrollment records and email them a survey, or randomly select student ID numbers and intercept those specific students in various locations at various times. This gives every student a chance to be included, not just those at one place at one time.
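Drawing the random sample itself takes only a few lines. This sketch assumes a hypothetical enrollment list of 12,000 student IDs (the ID format and the enrollment size are invented) and selects 150 students, matching the number of responses in the original survey.

```python
import random

# Hypothetical enrollment records: one ID per enrolled student
enrollment_ids = [f"S{n:05d}" for n in range(1, 12_001)]

rng = random.Random(2024)  # seeded only to make the sketch reproducible

# Draw 150 distinct students; every ID is equally likely to be chosen
survey_ids = rng.sample(enrollment_ids, k=150)

print(survey_ids[:5])
```

Students who never visit the dining hall now have the same chance of being asked as those who eat there daily. Whether the selected students actually respond is a separate problem (nonresponse) that random selection does not solve.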

Example 6: Putting It All Together

A pharmaceutical company wants to test whether a new pain medication is more effective than the current standard medication. They recruit 400 patients with chronic back pain from several hospitals.

  a) What is the population? What is the sample?
  b) What would be the explanatory variable? The response variable?
  c) Design this as an experiment. What makes it an experiment rather than an observational study?
  d) Why is random assignment important in this study?
  e) What is a parameter the company wants to estimate? What statistic would they calculate?

Solution:

a) Population and sample:

  • Population: All people with chronic back pain (or, more specifically, all people with chronic back pain who could potentially use this medication)
  • Sample: The 400 patients recruited from the hospitals

b) Variables:

  • Explanatory variable: Which medication the patient receives (new medication vs. standard medication)
  • Response variable: Pain level or pain relief (however they choose to measure it)

c) Experimental design: To make this an experiment, researchers would:

  • Randomly assign patients to two groups
  • One group receives the new medication
  • The other group receives the standard medication
  • Compare pain outcomes between groups

This is an experiment because researchers actively impose the treatment (which medication each patient takes). In an observational study, patients would choose their own medications.

d) Why random assignment matters: Random assignment makes the two groups similar, on average, in every characteristic except the medication. Without it, sicker patients might choose one medication over the other, or doctors might assign the new drug to patients they think will respond better. Because chance alone decides who gets which drug, such confounding factors tend to balance out across the groups. If outcomes then differ by more than chance would explain, we can attribute the difference to the medication.

e) Parameter and statistic:

  • Parameter: The true average pain relief for all chronic back pain patients taking the new medication (and similarly for the standard medication)
  • Statistic: The average pain relief observed in the patients assigned to the new medication (200 patients, if the 400 are split evenly), and similarly for those assigned to the standard medication

The company will use the statistics from their sample to estimate the parameters for the larger population.

Key Properties and Rules

Population and Sample Relationships

  • The population is what you want to learn about; the sample is what you actually study
  • Parameters describe populations; statistics describe samples
  • Statistics are used to estimate parameters
  • Larger samples generally give more reliable estimates
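The last point can be checked by simulation. The sketch below builds a hypothetical population of 50,000 test scores (mean 70, standard deviation 12, both assumed), repeatedly draws random samples of several sizes, and measures how much the sample mean varies from one sample to the next.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical population of 50,000 test scores
population = [random.gauss(70, 12) for _ in range(50_000)]

def spread_of_sample_means(n, repeats=300):
    """Standard deviation of the sample mean across repeated random samples of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

# Compare how much the estimate bounces around at different sample sizes
spreads = {n: spread_of_sample_means(n) for n in (25, 100, 400)}
for n, spread in spreads.items():
    print(f"n = {n:3d}: sample means vary by about {spread:.2f} points")
```

The spread shrinks as the sample grows, but with diminishing returns: quadrupling the sample size roughly halves the variability. Note that this holds for random samples. A larger biased sample is still biased.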

Data Classification Guidelines

Categorical vs. Numerical:

  • If values are categories or labels: Categorical
  • If values are numbers with mathematical meaning: Numerical
  • Beware of numbers that are really labels (zip codes, phone numbers, jersey numbers)

Discrete vs. Continuous:

  • If you count it: Discrete (usually whole numbers)
  • If you measure it: Usually continuous (can take any value in a range)

Levels of Measurement Summary

| Level | Can you… | Examples |
| --- | --- | --- |
| Nominal | …say if two values are equal or different? | Eye color, country |
| Ordinal | …also rank values in order? | Rankings, ratings |
| Interval | …also measure differences between values? | Temperature (C/F), dates |
| Ratio | …also compare ratios and have a true zero? | Height, weight, age |

Study Design Principles

For observational studies:

  • You can identify associations and correlations
  • You cannot establish cause and effect
  • Be cautious about confounding variables

For experiments:

  • You can establish cause and effect (if well-designed)
  • Random assignment is crucial
  • Control groups provide comparison

Avoiding Bias

  • Random sampling gives every population member a chance to be selected
  • Convenience samples are almost always biased
  • Consider who might be missing from your sample
  • Consider how your methods might attract certain types of respondents
  • Question wording matters: neutral language gives more accurate results

Real-World Applications

Political Polling and Election Predictions

When you see headlines like “Candidate A leads by 3 points,” that number comes from a sample of voters, not from asking every voter in the country. Pollsters carefully design samples to represent the voting population, accounting for demographics, geography, and likelihood of actually voting.

The “margin of error” you often see reported (for example: plus or minus 2 percentage points) reflects the uncertainty inherent in using a sample to estimate population opinions. Understanding that polls are estimates with uncertainty, not exact measurements, makes you a more sophisticated consumer of political news.
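For a rough sense of where such a number comes from, here is a sketch using a common textbook approximation for a sample proportion. It assumes a simple random sample and a 95% confidence level (the multiplier 1.96). Real polls use more elaborate designs and weighting, so their published margins may differ. The inputs are the figures from Example 1: 1,200 voters, 52% support.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample; z = 1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

moe = margin_of_error(p_hat=0.52, n=1200)
print(f"52% plus or minus {100 * moe:.1f} percentage points")
```

Under these assumptions the margin is about 2.8 percentage points, so the poll is consistent with true support anywhere from roughly 49% to 55%.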

Medical Research and Clinical Trials

Before a new medication reaches the pharmacy, it goes through rigorous testing. Clinical trials are experiments with random assignment: some patients receive the new treatment, others receive a placebo or existing treatment. This experimental design allows researchers to conclude whether the treatment actually works, not just whether patients who chose to take it happened to get better.

Understanding the difference between observational studies (“People who eat more vegetables have lower cancer rates”) and experiments (“This drug reduced tumor size compared to placebo”) helps you evaluate health claims you encounter in the news.

Market Research and Consumer Surveys

Companies invest heavily in understanding what customers want. They conduct surveys, focus groups, and experiments (like A/B testing on websites) to learn preferences and predict behavior. When a company claims “8 out of 10 customers recommend our product,” you can now ask: Who were those customers? How were they selected? Was it a random sample or just their most loyal fans?

Quality Control in Manufacturing

Factories cannot test every single product coming off the assembly line. Instead, they use sampling to monitor quality. By randomly selecting and testing a sample of products, they can estimate defect rates for the entire production run. If the sample shows too many defects, they investigate and fix the problem. Statistical quality control makes mass production reliable and safe.

Sports Analytics

Modern sports teams employ statisticians to gain competitive advantages. They analyze player performance data, game situations, and opponent tendencies. Baseball’s “Moneyball” revolution showed how statistical analysis could identify undervalued players. The “analytics movement” in basketball has changed how teams value three-point shooting versus mid-range shots. These decisions rest on statistical analysis of discrete events (shots, hits, plays) across many games.

Self-Test Problems

Problem 1: A researcher wants to study the eating habits of college students in California. She surveys 300 students from her own university.

  a) What is the population?
  b) What is the sample?
  c) Identify one potential problem with this sampling method.

Answer:

a) Population: All college students in California (this is the group she wants to learn about)

b) Sample: The 300 students from her university who were surveyed

c) Potential problem: This is a convenience sample that may not represent all California college students. Students at her university might differ from students at other universities in important ways (size of school, public vs. private, location, demographics, etc.). Conclusions drawn from this sample might not apply to the broader population.

Problem 2: Classify each variable as categorical or numerical. For numerical variables, also identify as discrete or continuous.

  a) A person’s blood type
  b) Daily high temperature
  c) Number of apps on a phone
  d) Star rating of a movie (1 to 5 stars)
  e) Distance traveled to work

Answer:

a) Blood type: Categorical. Blood types (A, B, AB, O, with positive/negative) are categories, not numbers.

b) Daily high temperature: Numerical, Continuous. Temperature can take any value within a range and is measured on a scale.

c) Number of apps: Numerical, Discrete. You can have 47 or 48 apps, but not 47.5 apps. These are counts.

d) Star rating: Categorical (or Ordinal). While often represented with numbers, star ratings are really categories with an order. The difference between 2 and 3 stars is not necessarily the same as between 4 and 5 stars. (Note: some statisticians treat these as numerical for convenience, but strictly speaking they are ordinal categorical data.)

e) Distance to work: Numerical, Continuous. Distance can be any positive value and is measured rather than counted.

Problem 3: For each scenario, identify whether it is an observational study or an experiment:

  a) A company randomly assigns customers to see different versions of their website and tracks which version leads to more purchases.
  b) Researchers compare lung cancer rates between people who smoke and people who do not smoke.
  c) A school assigns students to morning or afternoon classes based on a coin flip, then compares test scores between the two groups.

Answer:

a) Experiment. The company is actively assigning customers to different conditions (website versions). Random assignment makes this an experiment.

b) Observational study. Researchers are not telling people to smoke or not smoke; they are observing people’s existing behaviors. From a study like this alone, we can only say that smoking is “associated with” lung cancer. The conclusion that smoking causes cancer rests on extensive additional evidence.

c) Experiment. Random assignment (coin flip) to conditions (morning vs. afternoon) with measurement of outcomes (test scores) is the definition of an experiment. If differences in scores are found, you can conclude that class timing caused the difference.

Problem 4: Identify the level of measurement for each variable:

  a) The year a building was constructed
  b) A person’s annual income
  c) Types of cuisine at restaurants (Italian, Mexican, Chinese, etc.)
  d) Finishing position in a race (1st, 2nd, 3rd, etc.)
  e) Length in meters

Answer:

a) Year: Interval. Years have equal intervals (the difference between 2020 and 2010 is the same as between 2010 and 2000). However, year 0 does not mean “no time,” so ratios are not meaningful. The year 2000 is not “twice” the year 1000 in any meaningful sense.

b) Income: Ratio. An income of $0 means “no income” (true zero). An income of $100,000 is genuinely twice an income of $50,000. Ratios are meaningful.

c) Cuisine type: Nominal. These are categories with no inherent order. You cannot rank or measure differences between cuisine types.

d) Finishing position: Ordinal. There is a clear order (1st is better than 2nd, which is better than 3rd), but the differences are not equal. The gap between 1st and 2nd might be tiny (photo finish) while the gap between 2nd and 3rd might be huge.

e) Length in meters: Ratio. Zero meters means “no length” (true zero). Two meters is twice one meter. All mathematical operations are meaningful.

Problem 5: A television station wants to know viewers’ opinions about their new evening news format. They announce on air: “Go to our website and tell us what you think!” and 500 viewers respond. What type of bias is most likely present, and how might it affect the results?

Answer:

Voluntary response bias is the primary problem.

When viewers choose whether to respond, those with strong opinions (especially strong negative opinions) are more likely to take the time to do so. People who think the new format is “okay” or “fine” probably will not bother going to a website to say so.

This means the results will likely overrepresent extreme opinions and underrepresent moderate ones. If 70% of respondents say they dislike the new format, that does not mean 70% of all viewers dislike it. It means 70% of people who cared enough to respond dislike it.

Additional issues:

  • Undercoverage: Only viewers who heard the announcement, have internet access, and know how to navigate to the website can respond
  • Convenience/self-selection: The sample consists only of people motivated to seek out and complete the survey

A better approach would be to randomly select phone numbers or addresses and contact those specific households to ask their opinions.
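A little arithmetic shows how strongly voluntary response can distort a result. All of the numbers below are hypothetical: they assume 30% of viewers dislike the format, and that viewers who dislike it are far more likely to respond than neutral viewers.

```python
# Hypothetical share of ALL viewers holding each opinion (assumed for illustration)
viewer_share = {"dislike": 0.30, "neutral": 0.50, "like": 0.20}

# Hypothetical probability that a viewer with each opinion bothers to respond
response_rate = {"dislike": 0.30, "neutral": 0.02, "like": 0.10}

# Expected fraction of all viewers who respond, broken down by opinion
responding = {opinion: viewer_share[opinion] * response_rate[opinion]
              for opinion in viewer_share}
total_responding = sum(responding.values())

# Opinion mix among respondents only
respondent_share = {opinion: fraction / total_responding
                    for opinion, fraction in responding.items()}

for opinion in viewer_share:
    print(f"{opinion:8s} all viewers: {viewer_share[opinion]:.0%}   "
          f"respondents: {respondent_share[opinion]:.0%}")
```

Under these assumed numbers, 75% of respondents dislike the format even though only 30% of all viewers do, and neutral viewers, half the audience, make up less than a tenth of the responses.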

Problem 6: A study finds that students who sit in the front of the classroom tend to have higher grades than students who sit in the back.

  a) Is this an observational study or an experiment?
  b) Can we conclude that sitting in front causes better grades? Why or why not?
  c) What might be a confounding variable?
  d) How could you design an experiment to test whether seating position actually affects grades?

Answer:

a) Observational study. Students are choosing where to sit; no one is assigning them to seats.

b) No, we cannot conclude causation. This study only shows an association (correlation) between seating position and grades. We cannot conclude that sitting in front causes better grades because students were not randomly assigned to seats.

c) Confounding variables (any of these would be good answers):

  • Motivation: Students who are more motivated might both choose to sit in front AND study harder, which leads to better grades
  • Engagement: Students who are naturally more engaged in the class might choose front seats and also participate more, leading to better learning
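  • Vision or hearing: Students who can see or hear well might sit in back, while those with difficulties sit in front and may struggle academically (this one would work against the observed pattern, hiding part of any real benefit of sitting in front)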
  • Arrival time: Responsible students might arrive early (getting front seats) and also be more responsible about homework

d) Experimental design: Randomly assign students to seats at the beginning of the semester. One group is assigned to the front third of the room, another to the back third. Keep seat assignments fixed for the semester. Compare final grades between groups.

With random assignment, we would expect the groups to be similar in motivation, ability, and other characteristics. Any difference in grades could then be attributed to seating position.

Summary

  • Statistics is the science of collecting, organizing, analyzing, and interpreting data to answer questions and make decisions.

  • The population is the entire group you want to learn about. The sample is the subset you actually study. We use statistics (numbers from samples) to estimate parameters (numbers describing populations).

  • Categorical data places observations into groups (eye color, yes/no responses). Numerical data consists of meaningful numbers (height, temperature).

  • Numerical data is discrete if it takes countable values (number of siblings) and continuous if it can take any value in a range (weight, time).

  • Levels of measurement from least to most information: nominal (categories), ordinal (ordered categories), interval (equal differences), ratio (true zero).

  • The explanatory variable might explain changes; the response variable might be affected.

  • In observational studies, you observe without intervening. You can find associations but not establish cause and effect. In experiments, you impose treatments. With random assignment, you can establish causation.

  • Biased samples produce misleading conclusions. Common sources include convenience sampling, voluntary response, undercoverage, and nonresponse. Random sampling gives every population member a chance to be selected and is the foundation of reliable statistical inference.

  • Statistics underlies polling, medical research, market research, quality control, and sports analytics. Understanding these concepts makes you a more critical consumer of data-based claims.