Hypothesis Testing Example: Step-by-Step Statistical Guide

What Hypothesis Testing Actually Is

Hypothesis testing is a statistical method for making decisions about a population based on sample data. You start with an assumption, collect evidence, and then either reject or fail to reject that assumption based on probability rules.

That's it. No magic. No interpretation required beyond the numbers.

The assumption you start with is called the null hypothesis (H₀). The alternative you're testing against is the alternative hypothesis (H₁ or Ha). You assume H₀ is true until the data gives you strong enough evidence to reject it.

The Core Logic: Proof by Contradiction

Statistics works like a courtroom. You assume innocence until proven guilty. The null hypothesis is innocent. The data is your evidence. If the evidence is strong enough, you reject innocence.

If the evidence isn't strong enough, you fail to reject the null. You don't accept it as true. You just didn't have enough proof to throw it out.

People mess this up constantly. You never "accept" the null hypothesis. You either reject it or you don't.

Key Terms You Need to Know

P-value: The probability of getting results at least as extreme as yours, assuming the null hypothesis is actually true. Lower p-value = stronger evidence against H₀.
Alpha level (α): Your threshold for "rare enough to reject." Common choice is 0.05, meaning you reject H₀ when results at least as extreme as yours would occur less than 5% of the time if H₀ were true.
Type I Error: Rejecting H₀ when it's actually true. False positive. You convicted an innocent person.
Type II Error: Failing to reject H₀ when it's actually false. False negative. You let a guilty person walk.
Test statistic: A single number that summarizes how far your sample result is from what H₀ predicts.
Critical value: The cutoff point on your test statistic distribution. If your test statistic is more extreme than this (below it in a left-tailed test, above it in a right-tailed test), you reject H₀.
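
To make α and Type I error concrete, here is a minimal simulation sketch. It assumes NumPy and SciPy are installed, and the population values (mean 30, standard deviation 4, 40 observations per sample) are chosen purely for illustration. H₀ is true by construction, so every rejection is a false positive and the rejection rate should land near α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_experiments, n = 2000, 40

# Every simulated sample comes from a population where H0 (mu = 30) is true.
samples = rng.normal(loc=30.0, scale=4.0, size=(n_experiments, n))

# One two-sided one-sample t-test per row of the array.
result = stats.ttest_1samp(samples, popmean=30.0, axis=1)

# Share of experiments that rejected a true H0: the observed Type I error rate.
type_i_rate = float(np.mean(result.pvalue < alpha))
print(f"Observed Type I error rate: {type_i_rate:.3f}")
```

The observed rate will wobble around 0.05 from run to run. That wobble is sampling noise, not a flaw in the test.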

Step-by-Step Hypothesis Testing Example

The Scenario

A coffee shop claims their espresso shots average 30ml. You think they're short-changing customers. You measure 40 randomly selected shots and get an average of 28.5ml with a standard deviation of 4ml.

Is the coffee shop lying?

Step 1: State Your Hypotheses

H₀: μ = 30ml (the claim is correct)

H₁: μ < 30ml (the claim is wrong, they're giving less)

This is a one-tailed test because you're only testing if they're short, not if they're overfilling.

Step 2: Choose Your Significance Level

Use α = 0.05. This is standard practice unless you have a reason to be more or less strict.

Step 3: Calculate the Test Statistic

Since the population standard deviation is unknown and we only have the sample standard deviation, use a t-test.

t = (x̄ - μ) / (s / √n)

t = (28.5 - 30) / (4 / √40)

t = -1.5 / (4 / 6.32)

t = -1.5 / 0.633

t = -2.37
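
The arithmetic above can be checked in a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import math

x_bar = 28.5   # sample mean (ml)
mu_0 = 30.0    # mean claimed under H0 (ml)
s = 4.0        # sample standard deviation (ml)
n = 40         # sample size

# Standard error of the mean, then the t statistic.
standard_error = s / math.sqrt(n)
t_stat = (x_bar - mu_0) / standard_error
print(f"SE = {standard_error:.3f}, t = {t_stat:.2f}")
```

The t statistic rounds to -2.37, matching the hand calculation.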

Step 4: Find the Critical Value

Degrees of freedom = n - 1 = 39

For a one-tailed t-test at α = 0.05 with df = 39, the critical value is approximately -1.685
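
If you don't have a t-table handy, the critical value can be looked up in code. A minimal sketch, assuming SciPy is installed: for a left-tailed test the critical value is the α quantile of the t distribution, which `scipy.stats.t.ppf` returns.

```python
from scipy import stats

alpha = 0.05
df = 39

# Left-tailed test: the critical value is the alpha quantile of the t distribution.
t_critical = stats.t.ppf(alpha, df)
print(f"Critical t = {t_critical:.3f}")
```

For a right-tailed test you would use `stats.t.ppf(1 - alpha, df)` instead, which gives the same magnitude with a positive sign.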

Step 5: Make Your Decision

Your calculated t = -2.37

Critical t = -1.685

-2.37 < -1.685

Your test statistic falls in the rejection region. Reject the null hypothesis.

Step 6: State Your Conclusion

At the 0.05 significance level, there's sufficient statistical evidence to conclude the coffee shop is giving less than 30ml per shot.

The p-value for t = -2.37 with df = 39 is approximately 0.011. That's less than 0.05, confirming our decision.
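
The p-value can be reproduced the same way. A minimal sketch, assuming SciPy is installed: for a left-tailed test the p-value is the area under the t distribution at or below the observed statistic, which is what the cumulative distribution function gives.

```python
from scipy import stats

t_stat = -2.37
df = 39
alpha = 0.05

# Left-tailed p-value: probability of a t value at or below the observed one under H0.
p_value = stats.t.cdf(t_stat, df)
print(f"p = {p_value:.3f}, reject H0: {p_value < alpha}")
```

The critical-value method and the p-value method always agree, because they are two views of the same comparison.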

One-Tailed vs Two-Tailed Tests

This matters more than most people realize.

Test Type	When to Use	Rejection Region
Two-tailed	Testing if a parameter differs from value (direction unknown)	Both tails of distribution
Left-tailed	Testing if parameter is less than value	Left tail only
Right-tailed	Testing if parameter is greater than value	Right tail only

Using a two-tailed test when you should have used one-tailed is one of the most common hypothesis testing mistakes. It makes it harder to reject H₀, which might be fine if you're being conservative, but it's technically wrong if your hypothesis has a directional component.
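
To see how much the choice of tail changes the answer, here is a minimal sketch (assuming SciPy is installed) that computes all three p-values for the coffee example's test statistic.

```python
from scipy import stats

t_stat, df = -2.37, 39

p_left = stats.t.cdf(t_stat, df)          # H1: mu < 30 (left-tailed)
p_right = stats.t.sf(t_stat, df)          # H1: mu > 30 (right-tailed)
p_two = 2 * stats.t.sf(abs(t_stat), df)   # H1: mu != 30 (two-tailed)

print(f"left: {p_left:.3f}, right: {p_right:.3f}, two-tailed: {p_two:.3f}")
```

The two-tailed p-value is exactly double the left-tailed one here, which is why a two-tailed test needs stronger evidence to reject H₀. Choose the direction before you look at the data, not after.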

Common Types of Hypothesis Tests

The example above used a one-sample t-test. Here it is alongside the other tests you're most likely to need:

One-Sample t-test

Test a population mean against a known or hypothesized value. Use when you have one group and σ is unknown.

Two-Sample t-test (Independent)

Compare means of two independent groups. "Does Group A score higher than Group B?"

Paired t-test

Compare means from the same group at different times or under different conditions. "Did test scores improve after tutoring?"

Z-test

Like the t-test, but used when σ is known or your sample is large (typically n > 30). Most real-world scenarios don't give you σ, so t-tests are more common.

Chi-Square Test

Test categorical data. "Is there a relationship between gender and voting preference?"

ANOVA

Compare means across three or more groups. One-way ANOVA tests if at least one group mean differs from the others.

Test	Data Type	Groups	What It Tests
One-sample t	Continuous	1	Mean vs value
Two-sample t	Continuous	2	Mean difference
Paired t	Continuous	1 (repeated)	Before vs after
Chi-square	Categorical	Any	Independence/fit
ANOVA	Continuous	3+	Mean equality
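
For the two tests in the table that aren't t-tests, here is a minimal sketch of how they might be run with SciPy. All the numbers are hypothetical and invented for illustration. `chi2_contingency` tests independence in a contingency table, and `f_oneway` runs a one-way ANOVA.

```python
from scipy import stats

# Hypothetical 2x2 contingency table: rows are groups, columns are preferences.
observed = [[30, 20],
            [25, 35]]
chi2, chi2_p, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {chi2_p:.3f}")

# Hypothetical scores for three independent groups.
group_a = [82, 85, 88, 90, 79]
group_b = [75, 78, 80, 72, 77]
group_c = [88, 92, 95, 91, 89]
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {anova_p:.4f}")
```

A significant ANOVA only says that at least one group mean differs. It doesn't say which one, so a follow-up comparison is usually needed.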

Mistakes That Kill Your Analysis

These errors show up constantly in bad research:

Confusing statistical significance with practical significance. A sample of 10,000 people might detect a 0.01ml difference as significant. That doesn't mean anyone cares.
P-hacking. Running tests until something "significant" appears. Decide which tests you will run before you look at the data, and report all of them.
Ignoring assumptions. Most parametric tests assume normality, independence, and equal variances. Check them.
Using the wrong test. Comparing proportions? Use a z-test for proportions, not a t-test.
Forgetting about effect size. Statistical significance without effect size tells you almost nothing useful.
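
As a small illustration of the effect-size point, here is a sketch that computes Cohen's d for the coffee example. Cohen's d is one common effect-size measure, used here as an assumption; the article doesn't prescribe a specific one.

```python
x_bar = 28.5   # sample mean (ml)
mu_0 = 30.0    # hypothesized mean (ml)
s = 4.0        # sample standard deviation (ml)

# Cohen's d for a one-sample comparison: the mean difference in standard-deviation units.
cohens_d = (x_bar - mu_0) / s
print(f"Cohen's d = {cohens_d:.3f}")
```

A d of -0.375 sits between Cohen's rough benchmarks for small (0.2) and medium (0.5) effects. Whether 1.5ml per shot matters is a practical question the p-value can't answer.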

How to Run a Hypothesis Test in Practice

In Python (scipy.stats)

from scipy import stats

# One-sample t-test
t_stat, p_value = stats.ttest_1samp(data, population_mean)

# Two-sample t-test
t_stat, p_value = stats.ttest_ind(group1, group2)

# Paired t-test
t_stat, p_value = stats.ttest_rel(before, after)
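
The calls above take arrays that the snippet leaves undefined. Here is a minimal runnable sketch of the one-sample case. The shot volumes are hypothetical, and the `alternative` argument assumes SciPy 1.6 or later. Without it, `ttest_1samp` returns a two-tailed p-value.

```python
from scipy import stats

# Hypothetical shot volumes in ml, invented for illustration only.
shots = [28.1, 29.4, 27.8, 30.2, 28.9, 29.1, 27.5, 28.7, 30.0, 28.3]
claimed_mean = 30.0

# alternative="less" tests H1: mu < claimed_mean (left-tailed).
t_stat, p_value = stats.ttest_1samp(shots, claimed_mean, alternative="less")

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.4f}: {decision}")
```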

In R

# One-sample t-test
t.test(data, mu = population_mean)

# Two-sample t-test (group1 and group2 are numeric vectors)
t.test(group1, group2)

# Paired t-test
t.test(before, after, paired = TRUE)

In Excel

Use the Data Analysis ToolPak. Select "t-Test: Two-Sample Assuming Equal Variances" or similar options depending on your test type.

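Excel has no built-in one-sample t-test: T.TEST (TTEST in older versions) compares two arrays. Compute the statistic directly instead with =(AVERAGE(array)-mu0)/(STDEV.S(array)/SQRT(COUNT(array))), where array is your data range and mu0 is the hypothesized mean. Then get a left-tailed p-value with =T.DIST(t, COUNT(array)-1, TRUE), where t is the cell holding the statistic. Use T.DIST.RT for a right-tailed test or T.DIST.2T (with the absolute value of t) for a two-tailed test.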

What Alpha Level Should You Use?

0.05 is convention, not a law. Here's when to deviate:

α = 0.01: When Type I errors are expensive or dangerous. Medical trials, safety testing.
α = 0.10: Exploratory research where you're generating hypotheses, not confirming them.
α = 0.05: Most standard applications. Business decisions, academic research, quality control.

Set your alpha before you collect data. Don't change it after seeing results.

The Honest Truth About Hypothesis Testing

Hypothesis testing is a tool, not a conclusion. A significant result doesn't prove your hypothesis is true. It means data this extreme would be unlikely if the null were true. That's all.

Replicate your results. Check assumptions. Report effect sizes. Consider confidence intervals alongside p-values.
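
As one way to put a confidence interval next to the p-value, here is a minimal sketch for the coffee example, assuming SciPy is installed. It computes a two-sided 95% interval for the mean from the summary statistics. The test in the example was one-tailed, so the two aren't exact mirror images.

```python
import math
from scipy import stats

x_bar, s, n = 28.5, 4.0, 40
confidence = 0.95

standard_error = s / math.sqrt(n)

# Two-sided interval: split the leftover probability evenly between the tails.
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, n - 1)
ci_lower = x_bar - t_crit * standard_error
ci_upper = x_bar + t_crit * standard_error
print(f"95% CI for the mean: ({ci_lower:.2f}, {ci_upper:.2f}) ml")
```

The interval runs from roughly 27.2ml to 29.8ml. It excludes 30ml, and it also shows how large the shortfall plausibly is, which the p-value alone does not.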

The p-value tells you whether to be surprised. It doesn't tell you whether something matters.