Introduction to Hypothesis Testing

Make decisions based on evidence and probability

Every day, you make decisions based on evidence. When you see dark clouds forming, you grab an umbrella because the evidence suggests rain. When a restaurant has dozens of glowing reviews, you trust it is probably good. When your car makes a strange noise it has never made before, you suspect something is wrong.

These everyday decisions follow a pattern: you start with an assumption about how things normally are, you observe evidence, and then you decide whether that evidence is unusual enough to change your mind. This is exactly the logic behind hypothesis testing, one of the most powerful tools in statistics.

Hypothesis testing gives you a formal framework for making decisions when you are uncertain. Should a pharmaceutical company approve a new drug? Is a marketing campaign actually increasing sales? Does a teaching method really improve test scores? In each case, you have data, but the data is subject to random variation. Hypothesis testing helps you distinguish between genuine effects and mere coincidence.

If you have ever served on a jury or watched a courtroom drama, you have seen hypothesis testing in action. The defendant is presumed innocent until proven guilty. The prosecution must present enough evidence to convince the jury beyond a reasonable doubt. The jury does not prove the defendant is guilty with certainty; they simply decide whether the evidence is strong enough to reject the presumption of innocence.

Statistical hypothesis testing works the same way. You start with a default assumption, collect evidence, and then ask: “Is this evidence surprising enough to make me doubt my assumption?” The mathematics tells you exactly how surprising your evidence is, and you use that to make your decision.

Core Concepts

The Logic of Hypothesis Testing

Imagine you have a coin and you want to know if it is fair. A fair coin should come up heads about 50% of the time. You flip it 100 times and get 62 heads. Is the coin unfair, or did you just get a slightly unusual result by chance?

This is the fundamental question of hypothesis testing. You cannot prove the coin is unfair just because you got more heads than expected. Even a perfectly fair coin will sometimes give you 60 or 62 or even 70 heads in 100 flips, just by random chance. The question is: how unlikely is your result if the coin really is fair?

Hypothesis testing formalizes this reasoning into a step-by-step procedure:

  1. State your assumptions. Start by assuming the coin is fair (this is your null hypothesis).

  2. Collect data. Flip the coin and record the results.

  3. Calculate how surprising your data would be. If the coin were truly fair, what is the probability of getting a result as extreme as what you observed?

  4. Make a decision. If the probability is very low, you have evidence against your assumption. If the probability is not particularly low, your data is consistent with your assumption.

The key insight is this: you never prove your assumption is true. You either find enough evidence to reject it, or you do not. This asymmetry is fundamental to hypothesis testing.
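
To make this concrete, here is a minimal simulation sketch (Python with NumPy, both assumed available) that estimates how often a truly fair coin produces a result at least as lopsided as 62 heads in 100 flips:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Simulate 100,000 experiments of 100 fair-coin flips each
heads = rng.binomial(n=100, p=0.5, size=100_000)

# How often is the result at least as extreme as 62 heads, in either direction?
extreme = (heads >= 62) | (heads <= 38)
print(f"Fraction of fair-coin runs at least as extreme as 62 heads: {extreme.mean():.3f}")
# Typically prints about 0.02: rare, but far from impossible
```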

Null and Alternative Hypotheses

Every hypothesis test involves two competing claims about a population parameter.

The null hypothesis (denoted $H_0$) is the default assumption, the claim of “no effect” or “no difference” or “nothing special going on.” It represents the status quo. The null hypothesis always includes an equality.

The alternative hypothesis (denoted $H_a$ or $H_1$) is what you are trying to find evidence for. It represents a change from the status quo, an effect, a difference, or something noteworthy. The alternative hypothesis is what the researcher typically hopes or suspects is true.

Examples:

  • Testing if a coin is fair:

    • $H_0: p = 0.5$ (the coin is fair)
    • $H_a: p \neq 0.5$ (the coin is not fair)
  • Testing if a new drug lowers blood pressure:

    • $H_0: \mu = 0$ (the average change in blood pressure is zero, i.e., no effect)
    • $H_a: \mu < 0$ (the average change is negative, i.e., blood pressure decreases)
  • Testing if a tutoring program improves test scores:

    • $H_0: \mu = 500$ (the mean score equals the national average)
    • $H_a: \mu > 500$ (the mean score is higher than the national average)

The null hypothesis is not necessarily what you believe is true. It is the claim you are testing against. Think of it as the claim that needs to be disproven. In a courtroom, “innocent until proven guilty” does not mean everyone believes the defendant is innocent. It means the prosecution must provide sufficient evidence to reject that assumption.

Test Statistics

A test statistic is a single number calculated from your sample data that measures how far your observations are from what the null hypothesis predicts. The test statistic standardizes your evidence, making it possible to determine how unusual your results are.

For many hypothesis tests, the test statistic is a Z-score or something similar. Recall that a Z-score tells you how many standard deviations a value is from the mean.

For testing a population proportion, the test statistic is:

$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$

where:

  • $\hat{p}$ is the sample proportion
  • $p_0$ is the proportion claimed by the null hypothesis
  • $n$ is the sample size

For testing a population mean (when $\sigma$ is known), the test statistic is:

$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

where:

  • $\bar{x}$ is the sample mean
  • $\mu_0$ is the mean claimed by the null hypothesis
  • $\sigma$ is the population standard deviation
  • $n$ is the sample size

The test statistic answers the question: “How many standard errors is my sample result from what the null hypothesis predicts?” A test statistic of 0 means your sample result is exactly what the null hypothesis expected. A test statistic of 2 or 3 means your result is quite far from what the null hypothesis predicts, which would be surprising if the null hypothesis were true.
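
As a minimal sketch of these two formulas (plain Python; the function names are just for illustration):

```python
from math import sqrt

def z_for_proportion(p_hat: float, p0: float, n: int) -> float:
    """Z statistic for testing H0: p = p0 given a sample proportion p_hat."""
    se = sqrt(p0 * (1 - p0) / n)  # standard error computed under H0
    return (p_hat - p0) / se

def z_for_mean(x_bar: float, mu0: float, sigma: float, n: int) -> float:
    """Z statistic for testing H0: mu = mu0 when sigma is known."""
    se = sigma / sqrt(n)
    return (x_bar - mu0) / se

# The coin example: 62 heads in 100 flips, testing H0: p = 0.5
print(z_for_proportion(0.62, 0.5, 100))  # 2.4 standard errors above what H0 predicts
```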

P-Values: The Probability of Your Evidence

The P-value is the probability of obtaining a test statistic as extreme as (or more extreme than) the one you actually observed, assuming the null hypothesis is true.

Read that definition carefully. The P-value is NOT the probability that the null hypothesis is true. It is the probability of getting your data (or more extreme data) if the null hypothesis were true.

A small P-value means: “If the null hypothesis were true, it would be very unlikely to get results like these by chance alone.”

A large P-value means: “If the null hypothesis were true, results like these would not be unusual.”

How to interpret P-values:

  • A P-value of 0.03 means: “If the null hypothesis is true, there is only a 3% chance of getting results this extreme or more extreme.”
  • A P-value of 0.45 means: “If the null hypothesis is true, there is a 45% chance of getting results this extreme or more extreme.” (This is not unusual at all.)

Small P-values provide evidence against the null hypothesis because they indicate your observed results would be rare if the null hypothesis were true. If something happens that should be rare under your assumption, maybe your assumption is wrong.
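
In code, a P-value for a Z statistic is just a tail area of the standard Normal distribution. A minimal sketch, assuming SciPy is available:

```python
from scipy.stats import norm

z = 2.4  # test statistic from the coin example above

# Two-tailed P-value: probability of a result at least this extreme in either tail
p_value = 2 * norm.sf(abs(z))  # norm.sf(z) = P(Z > z) = 1 - norm.cdf(z)
print(f"P-value: {p_value:.4f}")  # about 0.0164
```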

Significance Level ($\alpha$)

Before conducting a hypothesis test, you choose a significance level (denoted $\alpha$), which is the threshold for deciding whether your P-value is “small enough” to reject the null hypothesis.

Common choices for $\alpha$ are:

  • $\alpha = 0.05$ (5%) - the most common choice in many fields
  • $\alpha = 0.01$ (1%) - more stringent, used when you want stronger evidence
  • $\alpha = 0.10$ (10%) - less stringent, sometimes used in exploratory research

The significance level represents the maximum probability of making a particular kind of mistake (rejecting the null hypothesis when it is actually true). By choosing $\alpha = 0.05$, you are saying: “I am willing to accept a 5% chance of incorrectly rejecting a true null hypothesis.”

The decision rule is straightforward:

  • If P-value $\leq \alpha$: Reject $H_0$. The result is statistically significant.
  • If P-value $> \alpha$: Fail to reject $H_0$. The result is not statistically significant.

Making Conclusions: Reject or Fail to Reject

The language of hypothesis testing is deliberately careful. You either reject the null hypothesis or fail to reject it. You never “accept” the null hypothesis.

Why such careful wording? Because hypothesis testing can only provide evidence against the null hypothesis. The absence of such evidence does not prove the null hypothesis is true. It just means you do not have enough evidence to conclude it is false.

This is like a court verdict. A jury can find a defendant “guilty” or “not guilty.” There is no verdict of “innocent.” A “not guilty” verdict does not mean the jury believes the defendant is innocent. It means the prosecution did not present enough evidence to convict.

Writing conclusions:

When you reject $H_0$:

  • “There is sufficient evidence to conclude that…” [state the alternative hypothesis in context]
  • “The data provide significant evidence that…”

When you fail to reject $H_0$:

  • “There is insufficient evidence to conclude that…” [state the alternative hypothesis in context]
  • “The data do not provide significant evidence that…”

Always state your conclusion in the context of the problem, not in terms of abstract parameters.

Type I and Type II Errors

Because hypothesis testing relies on probability, you can make mistakes. There are exactly two types of errors you can make.

Type I Error (False Positive): Rejecting $H_0$ when it is actually true.

You conclude there is an effect when there really is not one. For example, convicting an innocent person or approving a drug that does not actually work.

The probability of a Type I error equals $\alpha$, the significance level. If you use $\alpha = 0.05$, you have a 5% chance of making a Type I error when $H_0$ is true.

Type II Error (False Negative): Failing to reject $H_0$ when it is actually false.

You fail to detect an effect that really exists. For example, acquitting a guilty person or failing to approve a drug that actually works.

The probability of a Type II error is denoted $\beta$. Unlike $\alpha$, you do not directly choose $\beta$. It depends on factors like sample size, the true value of the parameter, and your chosen $\alpha$.

The power of a test is $1 - \beta$, the probability of correctly rejecting a false null hypothesis. Higher power means you are more likely to detect a real effect.
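
For simple tests, power can be computed directly from the rejection rule. A minimal sketch for a right-tailed proportion test, assuming SciPy is available (the function name and the example numbers are illustrative):

```python
from math import sqrt
from scipy.stats import norm

def power_right_tailed_proportion(p0, p_true, n, alpha=0.05):
    """Approximate power of the right-tailed Z-test of H0: p = p0
    when the true proportion is p_true."""
    # Smallest sample proportion that leads to rejecting H0
    p_hat_crit = p0 + norm.ppf(1 - alpha) * sqrt(p0 * (1 - p0) / n)
    # Probability of landing above that threshold when p = p_true
    se_true = sqrt(p_true * (1 - p_true) / n)
    return norm.sf((p_hat_crit - p_true) / se_true)

print(power_right_tailed_proportion(p0=0.50, p_true=0.55, n=400))  # about 0.64
```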

The error trade-off:

There is always a trade-off between the two types of errors. If you make it harder to reject $H_0$ (by lowering $\alpha$), you reduce Type I errors but increase Type II errors. If you make it easier to reject $H_0$ (by raising $\alpha$), you reduce Type II errors but increase Type I errors.

| Decision | $H_0$ True | $H_0$ False |
|---|---|---|
| Reject $H_0$ | Type I Error | Correct |
| Fail to reject $H_0$ | Correct | Type II Error |

One-Tailed vs. Two-Tailed Tests

The alternative hypothesis determines whether you use a one-tailed or two-tailed test.

Two-tailed test (non-directional):

$$H_a: \text{parameter} \neq \text{hypothesized value}$$

Use this when you want to detect a difference in either direction. The P-value is the probability of getting a test statistic as extreme as yours in either tail of the distribution.

Example: Testing whether a coin is unfair (could be biased toward heads OR toward tails).

One-tailed test (directional):

$$H_a: \text{parameter} > \text{hypothesized value} \quad \text{(right-tailed)}$$ $$H_a: \text{parameter} < \text{hypothesized value} \quad \text{(left-tailed)}$$

Use this when you only care about detecting a difference in a specific direction. The P-value is the probability of getting a test statistic as extreme as yours in that one direction.

Example: Testing whether a new drug lowers blood pressure (you only care if it lowers, not raises).

Choosing between one-tailed and two-tailed:

The choice should be made before collecting data, based on the research question. One-tailed tests have more power to detect effects in the specified direction, but they cannot detect effects in the opposite direction.

General guideline: If you would take action only if the effect is in a particular direction, a one-tailed test may be appropriate. If an effect in either direction would be important, use a two-tailed test. When in doubt, use two-tailed.

Statistical Significance vs. Practical Significance

A result can be statistically significant without being practically significant.

Statistical significance just means the result is unlikely to have occurred by chance alone. It says nothing about whether the effect is large enough to matter in the real world.

With a very large sample, even tiny effects become statistically significant. If you test 100,000 people, you might find that a new teaching method increases test scores by 0.2 points (on a 100-point scale) with a P-value of 0.001. This is highly statistically significant, but is a 0.2-point improvement worth the cost and effort of implementing a new teaching method? Probably not.

Conversely, a practically important effect might not reach statistical significance if the sample is too small. A pilot study of 20 patients might show that a drug reduces pain by 30%, but with a P-value of 0.12 due to the small sample. The effect looks substantial, but the evidence is not statistically convincing yet.

Best practice: Always consider both statistical significance (Is the evidence strong?) and practical significance (Is the effect large enough to matter?). Report effect sizes and confidence intervals alongside P-values.
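
A back-of-the-envelope sketch of why sample size drives this, reusing the 0.2-point effect from above (the standard deviation of 10 and the sample sizes are assumed values for illustration; SciPy assumed):

```python
from math import sqrt
from scipy.stats import norm

effect, sigma = 0.2, 10  # 0.2-point gain; sigma = 10 is an assumed value
for n in (100, 10_000, 100_000):
    z = effect / (sigma / sqrt(n))
    print(f"n = {n:>7}: Z = {z:5.2f}, one-tailed P-value = {norm.sf(z):.2e}")

# The same tiny effect goes from nowhere near significant (P = 0.42 at n = 100)
# to overwhelmingly significant (P ~ 1e-10 at n = 100,000) purely because n grows.
```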

Hypothesis Test for a Proportion

Let us put all these concepts together with a specific procedure: testing a claim about a population proportion.

Conditions:

  1. The data come from a random sample
  2. The sample size is large enough: $np_0 \geq 10$ and $n(1-p_0) \geq 10$, where $p_0$ is the hypothesized proportion

The procedure (a code sketch follows the list):

  1. State the hypotheses.

    • $H_0: p = p_0$
    • $H_a: p \neq p_0$ (or $p > p_0$ or $p < p_0$)
  2. Check conditions for the test to be valid.

  3. Calculate the test statistic. $$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$

  4. Find the P-value using the standard Normal distribution.

    • Two-tailed: P-value $= 2 \times P(Z > |z|)$
    • Right-tailed: P-value $= P(Z > z)$
    • Left-tailed: P-value $= P(Z < z)$
  5. Make a decision. Compare the P-value to $\alpha$.

    • If P-value $\leq \alpha$: Reject $H_0$
    • If P-value $> \alpha$: Fail to reject $H_0$
  6. State your conclusion in context.
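
Here is that procedure collected into one small function: a sketch assuming SciPy is available (the function name and return format are illustrative, not a standard library API):

```python
from math import sqrt
from scipy.stats import norm

def z_test_proportion(successes, n, p0, tail="two", alpha=0.05):
    """Z-test for H0: p = p0. tail is 'two', 'right', or 'left'."""
    # Condition check: at least 10 expected successes and failures under H0
    assert n * p0 >= 10 and n * (1 - p0) >= 10, "sample too small for a Z-test"
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    if tail == "two":
        p_value = 2 * norm.sf(abs(z))
    elif tail == "right":
        p_value = norm.sf(z)
    else:  # left-tailed
        p_value = norm.cdf(z)
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    return z, p_value, decision

# Coin example: 62 heads in 100 flips
print(z_test_proportion(62, 100, 0.5, tail="two"))  # (2.4, 0.0164, 'reject H0')
```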

Notation and Terminology

| Term | Meaning | Example |
|---|---|---|
| $H_0$ | Null hypothesis; the default claim | $H_0: \mu = 100$ |
| $H_a$ (or $H_1$) | Alternative hypothesis; what you are testing for | $H_a: \mu > 100$ |
| Test statistic | Standardized measure of evidence against $H_0$ | $Z = 2.35$ |
| P-value | Probability of observed result (or more extreme) if $H_0$ is true | $P = 0.019$ |
| $\alpha$ | Significance level; threshold for rejecting $H_0$ | $\alpha = 0.05$ |
| Type I error | Rejecting $H_0$ when it is true | False positive |
| Type II error | Failing to reject $H_0$ when it is false | False negative |
| $\beta$ | Probability of Type II error | $\beta = 0.20$ |
| Power | Probability of correctly rejecting a false $H_0$; equals $1 - \beta$ | Power $= 0.80$ |
| Two-tailed test | $H_a$ uses $\neq$; looks for a difference in either direction | $H_a: p \neq 0.5$ |
| One-tailed test | $H_a$ uses $<$ or $>$; looks for a difference in one direction | $H_a: p > 0.5$ |

Examples

Example 1: Stating Null and Alternative Hypotheses

For each situation, state the null and alternative hypotheses using appropriate notation.

a) A company claims that 80% of customers are satisfied. You suspect the true proportion is lower.

b) The national average score on a standardized test is 500. A school wants to test whether their students perform differently from the national average.

c) A manufacturer claims their batteries last 50 hours on average. Consumer Reports thinks they last longer.

Solution:

a) The claim is $p = 0.80$. You suspect it is lower, so you are testing for $p < 0.80$.

$$H_0: p = 0.80$$ $$H_a: p < 0.80$$

This is a one-tailed (left-tailed) test.

b) The claim is $\mu = 500$. “Differently” means either higher or lower.

$$H_0: \mu = 500$$ $$H_a: \mu \neq 500$$

This is a two-tailed test.

c) The claim is $\mu = 50$ hours. Consumer Reports thinks the batteries last longer.

$$H_0: \mu = 50$$ $$H_a: \mu > 50$$

This is a one-tailed (right-tailed) test.

Example 2: Interpreting a P-Value

A researcher tests whether a coin is fair. After flipping it 200 times, she calculates a test statistic and finds a P-value of 0.03.

a) What does this P-value mean?

b) Using $\alpha = 0.05$, what decision should she make?

c) Using $\alpha = 0.01$, what decision should she make?

Solution:

a) Meaning of the P-value:

The P-value of 0.03 means: “If the coin were truly fair, there is only a 3% probability of getting results as extreme as (or more extreme than) what was observed.”

In other words, if the null hypothesis ($p = 0.5$) is true, data this unusual would occur only about 3 times out of 100.

b) Decision at $\alpha = 0.05$:

Since P-value (0.03) $\leq$ $\alpha$ (0.05), we reject $H_0$.

Conclusion: There is sufficient evidence at the 5% significance level to conclude that the coin is not fair.

c) Decision at $\alpha = 0.01$:

Since P-value (0.03) $>$ $\alpha$ (0.01), we fail to reject $H_0$.

Conclusion: There is insufficient evidence at the 1% significance level to conclude that the coin is not fair.

Key insight: The same data can lead to different conclusions depending on the significance level chosen. This is why $\alpha$ should be chosen before the test is conducted, not after you see the P-value.

Example 3: Complete Hypothesis Test for a Proportion

A college claims that 70% of its graduates find employment in their field within six months of graduation. A skeptical journalist surveys 150 recent graduates and finds that 95 of them found employment in their field within six months. At the 5% significance level, is there evidence that the college’s claim is too high?

Solution:

Step 1: State the hypotheses.

The college claims $p = 0.70$. The journalist suspects this is too high.

$$H_0: p = 0.70$$ $$H_a: p < 0.70$$

This is a one-tailed (left-tailed) test.

Step 2: Check conditions.

  • The sample is a random sample of graduates. (Assumed)
  • $np_0 = 150 \times 0.70 = 105 \geq 10$ ✓
  • $n(1-p_0) = 150 \times 0.30 = 45 \geq 10$ ✓

The conditions are satisfied.

Step 3: Calculate the test statistic.

First, find the sample proportion: $$\hat{p} = \frac{95}{150} = 0.633$$

Then calculate the test statistic: $$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{0.633 - 0.70}{\sqrt{\frac{0.70 \times 0.30}{150}}}$$

$$Z = \frac{-0.067}{\sqrt{\frac{0.21}{150}}} = \frac{-0.067}{\sqrt{0.0014}} = \frac{-0.067}{0.0374} = -1.79$$

Step 4: Find the P-value.

Since this is a left-tailed test, we need $P(Z < -1.79)$.

Using a standard Normal table or calculator: $P(Z < -1.79) = 0.0367$

P-value $= 0.0367$

Step 5: Make a decision.

Since P-value (0.0367) $<$ $\alpha$ (0.05), we reject $H_0$.

Step 6: State the conclusion in context.

There is sufficient evidence at the 5% significance level to conclude that the proportion of graduates who find employment in their field within six months is less than 70%. The journalist’s skepticism appears warranted.
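
For readers following along in code, the numbers above can be checked in a few lines (SciPy assumed; the small differences come from rounding intermediate values in the worked steps):

```python
from math import sqrt
from scipy.stats import norm

p_hat = 95 / 150  # 0.6333...
z = (p_hat - 0.70) / sqrt(0.70 * 0.30 / 150)
print(f"Z = {z:.2f}, P-value = {norm.cdf(z):.4f}")
# Z = -1.78, P-value = 0.0374 (the worked solution rounds p-hat to 0.633,
# giving -1.79 and 0.0367; the conclusion is the same)
```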

Example 4: Identifying Type I and Type II Errors

A pharmaceutical company is testing a new medication. The null hypothesis is that the medication has no effect on reducing symptoms. The alternative hypothesis is that the medication does reduce symptoms.

a) Describe what a Type I error would mean in this context.

b) Describe what a Type II error would mean in this context.

c) Which error would be more serious: approving a medication that does not work, or failing to approve a medication that does work? Explain your reasoning.

d) If the company uses $\alpha = 0.01$ instead of $\alpha = 0.05$, how does this affect the probability of each type of error?

Solution:

a) Type I Error (False Positive):

A Type I error means rejecting $H_0$ when it is true. In this context:

The company concludes that the medication works and approves it for use, when in reality the medication has no actual effect on symptoms. Patients would take a useless medication, possibly experiencing side effects without any benefit.

b) Type II Error (False Negative):

A Type II error means failing to reject $H_0$ when it is false. In this context:

The company concludes there is insufficient evidence that the medication works, when in reality it does work. An effective treatment would not reach patients who could benefit from it.

c) Which error is more serious?

This is a judgment call that depends on the specifics of the situation.

Arguments that Type I is more serious:

  • Approving an ineffective drug wastes healthcare resources
  • Patients may experience side effects without benefits
  • An ineffective drug might replace treatments that actually work
  • Public trust in medications could be damaged

Arguments that Type II is more serious:

  • Patients who could be helped do not receive an effective treatment
  • For serious diseases, denying an effective treatment could mean suffering or death
  • The research investment in developing the drug is lost

In pharmaceutical testing, regulators typically prioritize avoiding Type I errors (hence the stringent requirements for drug approval). But the answer depends on factors like the severity of the disease, available alternatives, and the drug’s side effect profile.

d) Effect of lowering $\alpha$ from 0.05 to 0.01:

  • The probability of Type I error decreases from 5% to 1%. It becomes harder to reject $H_0$, so you are less likely to approve an ineffective drug.

  • The probability of Type II error increases. With a stricter standard for rejection, you are more likely to miss a drug that actually works, especially if the effect is modest.

Using $\alpha = 0.01$ means being more conservative: you require stronger evidence before concluding the drug works.

Example 5: Complete Hypothesis Test with Full Interpretation

A website runs an A/B test to see if a new checkout page design increases the conversion rate (the proportion of visitors who make a purchase). Historically, 12% of visitors make a purchase on the old design. After showing the new design to a random sample of 800 visitors, 112 of them made a purchase.

a) Conduct a hypothesis test at the $\alpha = 0.05$ significance level to determine if the new design has a higher conversion rate.

b) Calculate and interpret a 95% confidence interval for the new design’s conversion rate.

c) Discuss both the statistical significance and practical significance of the results.

Solution:

Part a: Hypothesis Test

Step 1: State the hypotheses.

We want to know if the new design increases conversions.

$$H_0: p = 0.12 \text{ (no improvement)}$$ $$H_a: p > 0.12 \text{ (improvement)}$$

This is a one-tailed (right-tailed) test.

Step 2: Check conditions.

  • Random sample: Visitors were randomly assigned to see the new design. ✓
  • $np_0 = 800 \times 0.12 = 96 \geq 10$ ✓
  • $n(1-p_0) = 800 \times 0.88 = 704 \geq 10$ ✓

Step 3: Calculate the test statistic.

Sample proportion: $$\hat{p} = \frac{112}{800} = 0.14$$

Test statistic: $$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{0.14 - 0.12}{\sqrt{\frac{0.12 \times 0.88}{800}}}$$

$$Z = \frac{0.02}{\sqrt{\frac{0.1056}{800}}} = \frac{0.02}{\sqrt{0.000132}} = \frac{0.02}{0.01149} = 1.74$$

Step 4: Find the P-value.

Since this is right-tailed: P-value $= P(Z > 1.74) = 0.0409$

Step 5: Make a decision.

Since P-value (0.0409) $<$ $\alpha$ (0.05), we reject $H_0$.

Step 6: Conclusion.

There is sufficient evidence at the 5% significance level to conclude that the new checkout design has a higher conversion rate than the old design.


Part b: Confidence Interval

Using $\hat{p} = 0.14$ and $n = 800$:

$$SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.14 \times 0.86}{800}} = \sqrt{0.0001505} = 0.01227$$

For 95% confidence, $z^* = 1.96$:

$$\text{Margin of Error} = 1.96 \times 0.01227 = 0.02405$$

$$\text{95% CI: } 0.14 \pm 0.024 = (0.116, 0.164)$$

Interpretation: We are 95% confident that the true conversion rate with the new design is between 11.6% and 16.4%.

Note that this interval is mostly above 12%, which is consistent with our hypothesis test result.


Part c: Statistical vs. Practical Significance

Statistical significance: Yes, the result is statistically significant at $\alpha = 0.05$. The P-value of 0.041 indicates that if the new design had no real effect, we would see an improvement this large only about 4% of the time by chance.

Practical significance: The observed improvement is from 12% to 14%, an increase of 2 percentage points (or about 17% relative increase).

Is this practically significant? It depends on the business context:

  • Volume matters: If the website has 100,000 visitors per month, a 2 percentage point increase means 2,000 additional conversions per month. If each conversion averages $50 in revenue, that is $100,000 per month in additional revenue.

  • Confidence interval: The true conversion rate could plausibly be anywhere from about 11.6% to 16.4%, i.e., an improvement of roughly -0.4 to +4.4 percentage points over the baseline 12%. The lower end of this interval suggests there might be no improvement at all, while the upper end suggests it could be substantial.

  • Cost considerations: If implementing the new design is inexpensive, even a small improvement is worthwhile. If it requires significant resources, the company might want stronger evidence of a larger effect.

Recommendation: The evidence suggests the new design likely improves conversions. However, the 95% two-sided confidence interval actually includes the baseline rate of 12% (a one-tailed test at $\alpha = 0.05$ can reject $H_0$ even when the two-sided interval contains $p_0$), so the company might consider collecting more data to get a more precise estimate before making a permanent change.
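
The test and the interval from parts (a) and (b) can be verified with a short script (SciPy assumed):

```python
from math import sqrt
from scipy.stats import norm

n, p0 = 800, 0.12
p_hat = 112 / n  # 0.14

# Right-tailed test: the standard error uses p0, the value claimed by H0
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
print(f"Z = {z:.2f}, P-value = {norm.sf(z):.4f}")  # Z = 1.74, P-value ~ 0.041

# 95% confidence interval: the standard error uses the observed p_hat
se = sqrt(p_hat * (1 - p_hat) / n)
print(f"95% CI: ({p_hat - 1.96 * se:.3f}, {p_hat + 1.96 * se:.3f})")  # (0.116, 0.164)
```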

Example 6: Two-Tailed Test for a Mean

A standardized test has historically had a mean score of 500 with a standard deviation of 100. A new curriculum is implemented, and a random sample of 64 students who studied under the new curriculum has a mean score of 518. Is there evidence that the new curriculum changed the average score? Use $\alpha = 0.05$.

Solution:

Step 1: State the hypotheses.

We want to detect any change (higher or lower).

$$H_0: \mu = 500$$ $$H_a: \mu \neq 500$$

This is a two-tailed test.

Step 2: Check conditions.

  • Random sample: Assumed. ✓
  • Sample size: $n = 64 \geq 30$, so the Central Limit Theorem applies even if the population is not Normal. ✓
  • We are told $\sigma = 100$ is known.

Step 3: Calculate the test statistic.

$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{518 - 500}{100 / \sqrt{64}} = \frac{18}{100/8} = \frac{18}{12.5} = 1.44$$

Step 4: Find the P-value.

For a two-tailed test:

$$\text{P-value} = 2 \times P(Z > |1.44|) = 2 \times P(Z > 1.44)$$

From the standard Normal table: $P(Z > 1.44) = 0.0749$

$$\text{P-value} = 2 \times 0.0749 = 0.1498$$

Step 5: Make a decision.

Since P-value (0.1498) $>$ $\alpha$ (0.05), we fail to reject $H_0$.

Step 6: Conclusion.

There is insufficient evidence at the 5% significance level to conclude that the new curriculum has changed the average test score.

Important note: This does NOT mean the curriculum has no effect. It means we do not have strong enough evidence to conclude there is an effect. The sample mean of 518 is above 500, suggesting a possible improvement, but with this sample size, an 18-point difference could reasonably occur by chance even if the true mean were still 500.

To detect a real improvement of this size with more confidence, the school would need a larger sample.
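
A quick check of this two-tailed calculation (SciPy assumed):

```python
from math import sqrt
from scipy.stats import norm

z = (518 - 500) / (100 / sqrt(64))  # 1.44
p_value = 2 * norm.sf(abs(z))       # two-tailed: double the upper-tail area
print(f"Z = {z:.2f}, P-value = {p_value:.4f}")  # Z = 1.44, P-value = 0.1499
```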

Key Properties and Rules

The Hypothesis Testing Procedure

  1. State hypotheses ($H_0$ and $H_a$) before collecting data
  2. Choose significance level $\alpha$
  3. Check conditions for the test to be valid
  4. Calculate the test statistic
  5. Find the P-value
  6. Compare P-value to $\alpha$ and make a decision
  7. State conclusion in context

Decision Rules

| P-value vs. $\alpha$ | Decision | Meaning |
|---|---|---|
| P-value $\leq \alpha$ | Reject $H_0$ | Result is statistically significant |
| P-value $> \alpha$ | Fail to reject $H_0$ | Result is not statistically significant |

Test Statistic Formulas

For a proportion: $$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$

For a mean (when $\sigma$ is known): $$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

P-Value Calculation

| Test Type | P-value |
|---|---|
| Right-tailed ($H_a: >$) | $P(Z > z)$ |
| Left-tailed ($H_a: <$) | $P(Z < z)$ |
| Two-tailed ($H_a: \neq$) | $2 \times P(Z > \lvert z \rvert)$ |

Error Types Summary

| Decision | $H_0$ is actually TRUE | $H_0$ is actually FALSE |
|---|---|---|
| Reject $H_0$ | Type I Error (prob = $\alpha$) | Correct Decision (prob = Power) |
| Fail to reject $H_0$ | Correct Decision | Type II Error (prob = $\beta$) |

Conditions for Z-Test for a Proportion

  • Random sample from the population
  • $np_0 \geq 10$ (expected successes under $H_0$)
  • $n(1-p_0) \geq 10$ (expected failures under $H_0$)

Factors Affecting Power

To increase the power of a test, making it more likely to detect a real effect (the code sketch after this list shows the effect of sample size):

  • Increase sample size $n$
  • Increase significance level $\alpha$ (trade-off: more Type I errors)
  • Larger true effect size (not under your control)
  • Decrease variability (not always under your control)
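
A minimal sketch, assuming SciPy, of how power grows with $n$ for a right-tailed proportion test (the true proportion of 0.55 is an assumed value for illustration):

```python
from math import sqrt
from scipy.stats import norm

p0, p_true, alpha = 0.50, 0.55, 0.05
for n in (100, 400, 1600):
    # Smallest sample proportion that triggers rejection of H0
    p_hat_crit = p0 + norm.ppf(1 - alpha) * sqrt(p0 * (1 - p0) / n)
    # Probability of exceeding it when the true proportion is p_true
    power = norm.sf((p_hat_crit - p_true) / sqrt(p_true * (1 - p_true) / n))
    print(f"n = {n:>4}: power = {power:.2f}")
# n =  100: power = 0.26;  n =  400: power = 0.64;  n = 1600: power = 0.99
```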

Real-World Applications

Drug Approval and Medical Research

Before a new drug can be sold, it must pass rigorous hypothesis tests conducted through clinical trials. The null hypothesis is typically that the drug has no effect compared to a placebo (or no better effect than existing treatments).

The FDA requires strong evidence (typically $\alpha = 0.05$ or stricter) before approving drugs. This conservative approach reflects the high cost of Type I errors: approving an ineffective drug could harm patients and waste resources.

Phase III clinical trials often involve thousands of patients and cost hundreds of millions of dollars. The hypothesis testing framework ensures that only drugs with genuine benefits reach the market.

A/B Testing in Technology and Marketing

Every time you see a slightly different version of a website or app, you might be part of an A/B test. Companies like Google, Amazon, and Netflix run thousands of hypothesis tests to optimize their products.

  • Does a green “Buy Now” button get more clicks than a red one?
  • Does showing related products increase purchase amounts?
  • Does a shorter signup form increase registrations?

Each test compares a new version (treatment) against the current version (control). If the P-value is small enough, the company implements the change. These tests directly impact billions of dollars in revenue.

The Legal System

The American legal standard of “beyond a reasonable doubt” is conceptually similar to hypothesis testing. The null hypothesis is that the defendant is innocent.

  • The prosecution presents evidence against $H_0$
  • The jury must decide if the evidence is strong enough to reject $H_0$
  • “Beyond a reasonable doubt” is like setting a very low significance level

Interestingly, civil cases use a lower standard: “preponderance of evidence” (more likely than not), which is like using a higher $\alpha$. The different standards reflect the different costs of Type I errors in criminal versus civil cases.

Scientific Research and Publication

Scientific journals typically require statistical significance ($\alpha = 0.05$) before publishing results claiming a new discovery or effect. This standard has been remarkably consistent across fields and decades.

However, there is growing concern about the “replication crisis” in science. Many published findings fail to replicate in follow-up studies. This has led to calls for:

  • Stricter significance levels (some propose $\alpha = 0.005$)
  • Greater emphasis on effect sizes and confidence intervals
  • Pre-registration of hypotheses before data collection
  • Larger sample sizes

Understanding hypothesis testing helps you read scientific papers critically and evaluate whether claimed findings are likely to be real.

Quality Control in Manufacturing

Factories use hypothesis testing to monitor product quality. For example, a manufacturer might regularly sample products and test whether the defect rate exceeds an acceptable threshold.

$$H_0: p \leq 0.02 \text{ (defect rate acceptable)}$$ $$H_a: p > 0.02 \text{ (defect rate too high)}$$

Rejecting $H_0$ triggers an investigation into the production process. The significance level is chosen to balance the cost of unnecessary investigations (Type I error) against the cost of shipping defective products (Type II error).

Self-Test Problems

Problem 1: A coffee shop claims that 60% of their customers order espresso drinks. You suspect the proportion is different. In a random sample of 200 customers, 108 ordered espresso drinks.

a) State the null and alternative hypotheses.

b) Is this a one-tailed or two-tailed test?

Answer:

a) The claim is $p = 0.60$. You suspect it is different (not specifically higher or lower).

$$H_0: p = 0.60$$ $$H_a: p \neq 0.60$$

b) This is a two-tailed test because the alternative hypothesis uses “$\neq$” (we are looking for any difference from 60%, whether higher or lower).

Problem 2: A hypothesis test yields a P-value of 0.08.

a) At $\alpha = 0.10$, would you reject the null hypothesis?

b) At $\alpha = 0.05$, would you reject the null hypothesis?

c) What can you conclude about the evidence against the null hypothesis?

Answer:

a) At $\alpha = 0.10$: Since P-value (0.08) $<$ $\alpha$ (0.10), we reject $H_0$.

b) At $\alpha = 0.05$: Since P-value (0.08) $>$ $\alpha$ (0.05), we fail to reject $H_0$.

c) There is some evidence against the null hypothesis, but it is not strong. Many researchers would call the result “marginally significant.” If the null hypothesis were true, there would be an 8% chance of seeing data this extreme, which is somewhat unlikely but not rare enough to meet the conventional 5% threshold.

Problem 3: In a criminal trial, the null hypothesis is that the defendant is innocent.

a) What does a Type I error represent in this context?

b) What does a Type II error represent in this context?

c) Why do you think the legal standard of “beyond a reasonable doubt” corresponds to a very small $\alpha$?

Answer:

a) Type I Error: Convicting an innocent person. The jury rejects $H_0$ (innocence) when it is actually true.

b) Type II Error: Acquitting a guilty person. The jury fails to reject $H_0$ (innocence) when it is actually false (the person is guilty).

c) The legal system considers convicting an innocent person to be a particularly serious error. By requiring evidence “beyond a reasonable doubt” (a very small $\alpha$), the system makes it very difficult to convict, ensuring that Type I errors (false convictions) are rare. This comes at the cost of more Type II errors (some guilty people go free), which the system accepts as preferable to wrongly imprisoning innocent people.

Problem 4: A manufacturer claims that at least 95% of their products meet quality standards. A quality control inspector tests a random sample of 400 products and finds that 372 meet standards.

a) Set up the hypotheses to test whether the manufacturer’s claim is false.

b) Calculate the sample proportion and the test statistic.

c) Find the P-value and make a conclusion at $\alpha = 0.05$.

Answer:

a) The manufacturer claims $p \geq 0.95$. We test if this claim is false (i.e., the proportion is less than 95%).

$$H_0: p = 0.95$$ $$H_a: p < 0.95$$

(This is a left-tailed test.)

b) Sample proportion: $$\hat{p} = \frac{372}{400} = 0.93$$

Test statistic: $$Z = \frac{0.93 - 0.95}{\sqrt{\frac{0.95 \times 0.05}{400}}} = \frac{-0.02}{\sqrt{\frac{0.0475}{400}}} = \frac{-0.02}{\sqrt{0.000119}} = \frac{-0.02}{0.0109} = -1.83$$

c) P-value: For a left-tailed test, P-value $= P(Z < -1.83) = 0.0336$

Decision: Since P-value (0.0336) $<$ $\alpha$ (0.05), we reject $H_0$.

Conclusion: There is sufficient evidence at the 5% significance level to conclude that less than 95% of products meet quality standards. The manufacturer’s claim appears to be false.

Problem 5: A researcher finds that a new teaching method increases test scores by an average of 2 points (out of 100), with a P-value of 0.001.

a) Is this result statistically significant at $\alpha = 0.05$?

b) Is this result practically significant? Explain your reasoning.

c) What additional information would help you assess practical significance?

Answer:

a) Yes, the result is statistically significant. Since P-value (0.001) $<$ $\alpha$ (0.05), we reject the null hypothesis of no improvement.

b) Probably not practically significant. A 2-point improvement on a 100-point scale is a very small effect. Whether students score 75 vs. 77 is unlikely to make a meaningful difference in their learning or outcomes. The tiny P-value likely resulted from a very large sample size, which can make even trivial effects statistically significant.

c) Additional helpful information:

  • Sample size: How many students were studied?
  • Cost of implementation: How expensive or difficult is it to implement the new teaching method?
  • Confidence interval: What is the range of plausible effect sizes?
  • Educational context: What is a meaningful improvement in this subject area?
  • Comparison to alternatives: How does this compare to other interventions?

Problem 6: A polling organization surveys 1,000 voters and finds that 53% support a ballot measure. They want to test whether a majority (more than 50%) support it, using $\alpha = 0.05$.

a) Set up and carry out the complete hypothesis test.

b) Explain why the conclusion might differ from what a 95% confidence interval would tell you.

Answer:

a) Hypothesis Test:

$H_0: p = 0.50$ (no majority support) $H_a: p > 0.50$ (majority support)

Check conditions:

  • $np_0 = 1000 \times 0.50 = 500 \geq 10$ ✓
  • $n(1-p_0) = 1000 \times 0.50 = 500 \geq 10$ ✓

Test statistic: $$Z = \frac{0.53 - 0.50}{\sqrt{\frac{0.50 \times 0.50}{1000}}} = \frac{0.03}{\sqrt{0.00025}} = \frac{0.03}{0.0158} = 1.90$$

P-value: For a right-tailed test, P-value $= P(Z > 1.90) = 0.0287$

Decision: Since P-value (0.0287) $<$ $\alpha$ (0.05), we reject $H_0$.

Conclusion: There is sufficient evidence at the 5% significance level to conclude that a majority of voters support the ballot measure.

b) Why CI might tell a different story:

A 95% confidence interval would be: $$0.53 \pm 1.96 \times \sqrt{\frac{0.53 \times 0.47}{1000}} = 0.53 \pm 0.031 = (0.499, 0.561)$$

This interval includes values just below 50%, even though we rejected $H_0: p = 0.50$.

The difference occurs because:

  • The hypothesis test uses $p_0 = 0.50$ to calculate the standard error
  • The confidence interval uses $\hat{p} = 0.53$ to calculate the standard error
  • A one-tailed test at $\alpha = 0.05$ corresponds to a 90% two-sided confidence interval (or a 95% one-sided bound), not a 95% two-sided CI

The slight discrepancy highlights that 53% is only marginally above 50%, and the evidence for majority support, while statistically significant, is not overwhelming.

Problem 7: A researcher conducts 20 independent hypothesis tests, each at $\alpha = 0.05$. All null hypotheses are actually true.

a) What is the probability of making at least one Type I error?

b) What does this suggest about interpreting multiple hypothesis tests?

Answer:

a) If each $H_0$ is true and $\alpha = 0.05$, the probability of a Type I error on any single test is 0.05, and the probability of not making a Type I error is 0.95.

For 20 independent tests: $$P(\text{no Type I errors}) = 0.95^{20} = 0.358$$

Therefore: $$P(\text{at least one Type I error}) = 1 - 0.358 = 0.642$$

There is about a 64% chance of making at least one Type I error.

b) Implications for multiple testing:

When conducting many hypothesis tests, the probability of at least one false positive becomes very high, even when all null hypotheses are true. This is called the “multiple comparisons problem” or “multiple testing problem.”

This has important implications:

  • If a researcher tests many hypotheses and reports only the significant ones, some are likely false positives
  • Methods like the Bonferroni correction adjust $\alpha$ to account for multiple tests
  • Scientific findings based on a single significant result among many tests should be viewed skeptically
  • Pre-registration of hypotheses (deciding what to test before seeing data) helps address this issue
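
A two-line check of the arithmetic above, plus the Bonferroni correction mentioned in the list (plain Python):

```python
alpha, k = 0.05, 20

# Probability of at least one false positive across k independent true nulls
print(1 - (1 - alpha) ** k)        # 0.6415...

# Bonferroni correction: test each hypothesis at alpha / k instead
alpha_bonf = alpha / k             # 0.0025
print(1 - (1 - alpha_bonf) ** k)   # about 0.049, back under the 5% target
```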

Summary

  • Hypothesis testing is a formal procedure for making decisions based on data. You assume a null hypothesis, collect data, and decide whether the evidence is strong enough to reject that assumption.

  • The null hypothesis ($H_0$) represents the status quo or “no effect.” The alternative hypothesis ($H_a$) represents what you are trying to find evidence for.

  • The test statistic measures how far your sample result is from what the null hypothesis predicts, in standardized units.

  • The P-value is the probability of observing results as extreme as yours (or more extreme) if the null hypothesis is true. Small P-values provide evidence against $H_0$.

  • The significance level ($\alpha$) is your threshold for “small enough.” If P-value $\leq \alpha$, reject $H_0$. If P-value $> \alpha$, fail to reject $H_0$.

  • Type I error is rejecting a true $H_0$ (false positive). Type II error is failing to reject a false $H_0$ (false negative).

  • One-tailed tests look for effects in a specific direction. Two-tailed tests look for effects in either direction.

  • Statistical significance means the result is unlikely due to chance. Practical significance means the effect is large enough to matter in the real world. Consider both.

  • For testing a proportion, use: $Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$

  • “Fail to reject” is not the same as “accept.” Absence of evidence is not evidence of absence.

  • Always state conclusions in context, interpreting what the results mean for the real-world question being studied.