Hypothesis Testing Example: Step-by-Step Statistical Guide
What Hypothesis Testing Actually Is
Hypothesis testing is a statistical method for making decisions about a population based on sample data. You start with an assumption, collect evidence, and then either reject or fail to reject that assumption based on probability rules.
That's it. No magic. No interpretation required beyond the numbers.
The assumption you start with is called the null hypothesis (H₀). The alternative you're testing against is the alternative hypothesis (H₁ or Ha). You assume H₀ is true until the data gives you strong enough evidence to reject it.
The Core Logic: Proof by Contradiction
Statistics works like a courtroom. You assume innocence until proven guilty. The null hypothesis is innocent. The data is your evidence. If the evidence is strong enough, you reject innocence.
If the evidence isn't strong enough, you fail to reject the null. You don't accept it as true. You just didn't have enough proof to throw it out.
People mess this up constantly. You never "accept" the null hypothesis. You either reject it or you don't.
Key Terms You Need to Know
- P-value: The probability of getting results at least as extreme as yours, assuming the null hypothesis is actually true. Lower p-value = stronger evidence against H₀.
- Alpha level (α): Your threshold for "rare enough to reject." Common choice is 0.05, meaning you'd reject H₀ if results at least as extreme as yours would show up less than 5% of the time when H₀ is true.
- Type I Error: Rejecting H₀ when it's actually true. False positive. You convicted an innocent person.
- Type II Error: Failing to reject H₀ when it's actually false. False negative. You let a guilty person walk.
- Test statistic: A single number that summarizes how far your sample result is from what H₀ predicts.
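- Critical value: The cutoff point on your test statistic distribution. If your test statistic falls beyond this cutoff in the direction of H₁ (below it for a left-tailed test, above it for a right-tailed test), you reject H₀.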
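The link between α and Type I error can be checked by simulation. The sketch below is illustrative only: it assumes a known population standard deviation so it can use a z-test from the standard library, and the constants (mean 30, SD 4, 40 observations, 2000 trials) are picked for the demo. H₀ is true in every trial, so every rejection is a false positive.

```python
import random
from statistics import NormalDist, fmean

random.seed(0)  # fixed seed so the run is repeatable

MU_0 = 30.0     # mean under H0
SIGMA = 4.0     # assumed known population standard deviation
N = 40          # observations per simulated experiment
ALPHA = 0.05
TRIALS = 2000

# Left-tailed critical value of the standard normal at ALPHA
z_crit = NormalDist().inv_cdf(ALPHA)

false_positives = 0
for _ in range(TRIALS):
    # H0 is true by construction: the data really come from mean MU_0
    sample = [random.gauss(MU_0, SIGMA) for _ in range(N)]
    z = (fmean(sample) - MU_0) / (SIGMA / N ** 0.5)
    if z < z_crit:
        false_positives += 1  # rejected a true H0: Type I error

type_i_rate = false_positives / TRIALS
print(f"Observed Type I error rate: {type_i_rate:.3f}")
```

The observed rate should land close to α. That is what α means: the long-run share of true nulls you are willing to reject.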
Step-by-Step Hypothesis Testing Example
The Scenario
A coffee shop claims their espresso shots average 30ml. You think they're short-changing customers. You measure 40 randomly selected shots and get an average of 28.5ml with a standard deviation of 4ml.
Is the coffee shop lying?
Step 1: State Your Hypotheses
H₀: μ = 30ml (the claim is correct)
H₁: μ < 30ml (the claim is wrong, they're giving less)
This is a one-tailed test because you're only testing if they're short, not if they're overfilling.
Step 2: Choose Your Significance Level
Use α = 0.05. This is standard practice unless you have a reason to be more or less strict.
Step 3: Calculate the Test Statistic
The population standard deviation is unknown and we only have the sample standard deviation, so use a t-test.
t = (x̄ - μ) / (s / √n)
t = (28.5 - 30) / (4 / √40)
t = -1.5 / (4 / 6.325)
t = -1.5 / 0.632
t = -2.37
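The arithmetic above can be reproduced in a few lines. This is a minimal sketch using only the summary numbers from the scenario; the variable names are illustrative.

```python
import math

sample_mean = 28.5   # x-bar from the 40 measured shots
claimed_mean = 30.0  # mu under H0
sample_sd = 4.0      # s
n = 40

# Standard error of the mean, then the t statistic
standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - claimed_mean) / standard_error

print(f"SE = {standard_error:.3f}, t = {t_stat:.2f}")
# prints: SE = 0.632, t = -2.37
```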
Step 4: Find the Critical Value
Degrees of freedom = n - 1 = 39
For a one-tailed t-test at α = 0.05 with df = 39, the critical value is approximately -1.685
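If you don't have a t-table handy, the critical value can be looked up in code. A minimal sketch, assuming scipy is installed; `stats.t.ppf` is the inverse CDF of the t distribution.

```python
from scipy import stats

alpha = 0.05
df = 40 - 1  # n - 1

# Left-tailed test: the cutoff is the alpha quantile of the t distribution
t_crit = stats.t.ppf(alpha, df)

print(f"Critical t (left tail, df={df}): {t_crit:.3f}")
# prints: Critical t (left tail, df=39): -1.685
```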
Step 5: Make Your Decision
Your calculated t = -2.37
Critical t = -1.685
-2.37 < -1.685
Your test statistic falls in the rejection region. Reject the null hypothesis.
Step 6: State Your Conclusion
At the 0.05 significance level, there's sufficient statistical evidence to conclude the coffee shop is giving less than 30ml per shot.
The p-value for t = -2.37 with df = 39 is approximately 0.011. That's less than 0.05, confirming our decision.
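The p-value route to the same decision can be sketched like this, assuming scipy is available. For a left-tailed test the p-value is the area under the t distribution to the left of the test statistic.

```python
import math
from scipy import stats

alpha = 0.05
n = 40
t_stat = (28.5 - 30.0) / (4.0 / math.sqrt(n))

# Left-tailed p-value: P(T <= t_stat) when H0 is true
p_value = stats.t.cdf(t_stat, df=n - 1)
reject_h0 = p_value < alpha

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

The critical-value method and the p-value method always agree. They are two views of the same comparison.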
One-Tailed vs Two-Tailed Tests
This matters more than most people realize.
| Test Type | When to Use | Rejection Region |
|---|---|---|
| Two-tailed | Testing if a parameter differs from value (direction unknown) | Both tails of distribution |
| Left-tailed | Testing if parameter is less than value | Left tail only |
| Right-tailed | Testing if parameter is greater than value | Right tail only |
Using a two-tailed test when your hypothesis is directional is one of the most common hypothesis testing mistakes. It makes it harder to reject H₀, which might be fine if you're being conservative, but it doesn't match the question you're asking. The reverse mistake is worse: picking a one-tailed test after seeing which way the data points. Choose the direction before you collect data.
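To see how much the choice of tail changes the answer, here is a small sketch that computes all three p-values for the coffee example's t statistic. It assumes scipy; `t.sf` is the upper-tail probability (1 minus the CDF).

```python
from scipy import stats

t_stat = -2.37
df = 39

p_left = stats.t.cdf(t_stat, df)                 # H1: mu < 30
p_right = stats.t.sf(t_stat, df)                 # H1: mu > 30
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)    # H1: mu != 30

print(f"left-tailed:  {p_left:.4f}")
print(f"right-tailed: {p_right:.4f}")
print(f"two-tailed:   {p_two_sided:.4f}")
```

The two-tailed p-value is exactly double the left-tailed one here, and the right-tailed p-value is close to 1 because the data point the opposite way.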
Common Types of Hypothesis Tests
The example above used a one-sample t-test. Here are the others you need:
One-Sample t-test
Test a population mean against a known or hypothesized value. Use when you have one group and σ is unknown.
Two-Sample t-test (Independent)
Compare means of two independent groups. "Does Group A score higher than Group B?"
Paired t-test
Compare means from the same group at different times or under different conditions. "Did test scores improve after tutoring?"
Z-test
Like the t-test, but used when σ is known. With a large sample (typically n > 30) the t and z distributions are close, so the z-test also works as an approximation. Most real-world scenarios don't give you σ, so t-tests are more common.
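For comparison, here is what the coffee example would look like as a z-test. This is a sketch under one loud assumption that is not in the original scenario: it treats 4ml as the known population σ rather than a sample estimate. It uses only the standard library.

```python
import math
from statistics import NormalDist

sample_mean = 28.5
claimed_mean = 30.0
sigma = 4.0  # ASSUMPTION: treated as the known population SD for this demo
n = 40

z_stat = (sample_mean - claimed_mean) / (sigma / math.sqrt(n))

# Left-tailed p-value from the standard normal distribution
p_value = NormalDist().cdf(z_stat)

print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```

The statistic is the same number as the t statistic. Only the reference distribution changes, which is why the z-test p-value comes out slightly smaller.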
Chi-Square Test
Test categorical data. "Is there a relationship between gender and voting preference?"
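A chi-square test of independence can be sketched with `scipy.stats.chi2_contingency`. The counts below are made up purely for illustration; rows are two groups, columns are two voting preferences.

```python
from scipy import stats

# Hypothetical 2x2 table of counts (invented for this example)
#            prefers A  prefers B
observed = [[30,        20],   # group 1
            [20,        30]]   # group 2

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
print("Expected counts if the variables were independent:")
print(expected)
```

For a 2x2 table scipy applies Yates' continuity correction by default. Pass `correction=False` if you want the uncorrected statistic.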
ANOVA
Compare means across three or more groups. One-way ANOVA tests if at least one group mean differs from the others.
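A one-way ANOVA can be sketched with `scipy.stats.f_oneway`. The three groups below are invented measurements, used only to show the call.

```python
from scipy import stats

# Hypothetical shot volumes (ml) from three baristas, invented for the demo
barista_a = [29.5, 30.1, 28.8, 30.4, 29.9]
barista_b = [27.9, 28.4, 28.1, 27.5, 28.8]
barista_c = [30.2, 29.7, 30.5, 29.9, 30.8]

f_stat, p_value = stats.f_oneway(barista_a, barista_b, barista_c)

print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A small p-value says at least one group mean differs. It does not say which one; that takes a follow-up (post hoc) comparison.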
| Test | Data Type | Groups | What It Tests |
|---|---|---|---|
| One-sample t | Continuous | 1 | Mean vs value |
| Two-sample t | Continuous | 2 | Mean difference |
| Paired t | Continuous | 1 (repeated) | Before vs after |
| Chi-square | Categorical | Any | Independence/fit |
| ANOVA | Continuous | 3+ | Mean equality |
Mistakes That Kill Your Analysis
These errors show up constantly in bad research:
- Confusing statistical significance with practical significance. A sample of 10,000 people might detect a 0.01ml difference as significant. That doesn't mean anyone cares.
- P-hacking. Running tests until something "significant" appears. Decide which tests you'll run before you look at the data, and correct for multiple comparisons if you run several.
- Ignoring assumptions. Most parametric tests assume normality, independence, and equal variances. Check them.
- Using the wrong test. Comparing proportions? Use a z-test for proportions, not a t-test.
- Forgetting about effect size. Statistical significance without effect size tells you almost nothing useful.
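To make the effect-size point concrete, here is a sketch that computes Cohen's d for the coffee example and shows how the t statistic would grow if the same mean and standard deviation were observed with more shots. The larger sample sizes are hypothetical.

```python
import math

sample_mean = 28.5
claimed_mean = 30.0
sample_sd = 4.0

# Cohen's d for a one-sample comparison: mean difference in SD units
cohens_d = (sample_mean - claimed_mean) / sample_sd

def t_for_n(n: int) -> float:
    """t statistic if the same mean and SD were observed with n shots."""
    return cohens_d * math.sqrt(n)

for n in (40, 400, 4000):
    print(f"n = {n:>4}: d = {cohens_d:.3f}, t = {t_for_n(n):.2f}")
```

The effect size stays at -0.375 no matter how many shots you measure. Only the test statistic changes, which is why significance alone says little about how big the difference is.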
How to Run a Hypothesis Test in Practice
In Python (scipy.stats)
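```python
from scipy import stats

# data, group1, group2, before, after are placeholders for your own samples
# One-sample t-test
t_stat, p_value = stats.ttest_1samp(data, population_mean)
# Two-sample t-test (assumes equal variances; pass equal_var=False for Welch's test)
t_stat, p_value = stats.ttest_ind(group1, group2)
# Paired t-test
t_stat, p_value = stats.ttest_rel(before, after)
```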
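Here is a runnable version of the one-sample call, as a sketch. The ten measurements are invented for illustration, and the `alternative="less"` argument assumes scipy 1.6 or newer. Without it, `ttest_1samp` returns a two-tailed p-value.

```python
from scipy import stats

# Hypothetical shot volumes in ml (made-up data for the demo)
shots = [28.1, 29.4, 27.8, 30.2, 28.9, 26.5, 29.0, 31.1, 27.3, 28.7]
claimed_mean = 30.0

# Left-tailed test: H1 is that the true mean is below the claim
t_stat, p_value = stats.ttest_1samp(shots, claimed_mean, alternative="less")

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0 at the 0.05 level")
else:
    print("Fail to reject H0 at the 0.05 level")
```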
In R
```r
# One-sample t-test
t.test(data, mu = population_mean)
# Two-sample t-test (formula form: numeric column ~ two-level grouping column)
t.test(value ~ group, data = dataframe)
# Paired t-test
t.test(before, after, paired = TRUE)
```
In Excel
Use the Data Analysis ToolPak. Select "t-Test: Two-Sample Assuming Equal Variances" or similar options depending on your test type.
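Excel has no built-in one-sample t-test. T.TEST (formerly TTEST) takes two arrays plus a tails argument and a type argument, so it only covers two-sample and paired tests. For one sample, compute the statistic yourself: =(AVERAGE(array)-mu0)/(STDEV.S(array)/SQRT(COUNT(array))) where array is your data and mu0 is the hypothesized mean. Then get the left-tailed p-value with =T.DIST(t, COUNT(array)-1, TRUE), or the two-tailed p-value with =T.DIST.2T(ABS(t), COUNT(array)-1).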
What Alpha Level Should You Use?
0.05 is convention, not a law. Here's when to deviate:
- α = 0.01: When Type I errors are expensive or dangerous. Medical trials, safety testing.
- α = 0.10: Exploratory research where you're generating hypotheses, not confirming them.
- α = 0.05: Most standard applications. Business decisions, academic research, quality control.
Set your alpha before you collect data. Don't change it after seeing results.
The Honest Truth About Hypothesis Testing
Hypothesis testing is a tool, not a conclusion. A significant result doesn't prove your hypothesis is true. It means the data was inconsistent with the null. That's all.
Replicate your results. Check assumptions. Report effect sizes. Consider confidence intervals alongside p-values.
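As one way to report a confidence interval next to the p-value, here is a sketch of a two-sided 95% interval for the mean shot volume in the coffee example. It assumes scipy and uses the summary numbers from the scenario.

```python
import math
from scipy import stats

sample_mean = 28.5
sample_sd = 4.0
n = 40
confidence = 0.95

standard_error = sample_sd / math.sqrt(n)

# Two-sided interval: put half of (1 - confidence) in each tail
t_mult = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
margin = t_mult * standard_error

ci_low = sample_mean - margin
ci_high = sample_mean + margin

print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f}) ml")
```

The interval sits entirely below 30ml, which agrees with the test. It also shows the plausible size of the shortfall, which the p-value alone does not.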
The p-value tells you whether to be surprised. It doesn't tell you whether something matters.