Types of Significance Tests Explained
What Significance Tests Actually Are
Significance tests answer one question: is this result real or just random noise? That's it. Nothing mystical about it.
You collect data, run a test, and get a p-value. That number is the probability of seeing results at least as extreme as yours if there's genuinely no effect. A low p-value means your data would be surprising under chance alone, so the result probably isn't a fluke.
Most researchers use p < 0.05 as their cutoff. This means: if there's no real effect, you'd see results at least this extreme less than 5% of the time by chance alone. That's the conventional standard, though it's arbitrary and often misunderstood. In particular, a p-value is not the probability that the null hypothesis is true.
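To make the definition concrete, here is a minimal sketch that computes an exact p-value by hand. The coin-flip scenario (60 heads in 100 flips) and the function name are illustrative assumptions, not part of the article; the null hypothesis is that the coin is fair.

```python
from math import comb

def binomial_two_sided_p(heads: int, flips: int) -> float:
    """Exact two-sided p-value under the null hypothesis of a fair coin."""
    expected = flips / 2
    observed_distance = abs(heads - expected)
    # Count every outcome at least as far from the expected count as the
    # observed one; under a fair coin each sequence has probability 1 / 2**flips.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    )
    return extreme / 2 ** flips

# Hypothetical data: 60 heads in 100 flips.
p_value = binomial_two_sided_p(heads=60, flips=100)
print(f"p = {p_value:.4f}")
```

With these numbers the p-value lands just above 0.05, so a result that looks lopsided still fails the conventional cutoff.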
The Main Types of Significance Tests
Different situations call for different tests. Using the wrong one is one of the most common research mistakes you'll see.
Z-Test
The z-test is for when you know the population standard deviation. That's rare in real research because you almost never know the true population parameters.
Use it when:
- Sample size is large (typically n > 30)
- You already know the population variance from prior research
- You're comparing a sample mean to a known population value
It's fast and simple, but unrealistic for most actual research scenarios.
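A one-sample z-test can be sketched with the Python standard library alone. The numbers below (population mean 100, known standard deviation 15, a sample of 50 with mean 103) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """Return (z, two-sided p) for a sample mean against a known population."""
    standard_error = pop_sd / sqrt(n)          # requires the TRUE population SD
    z = (sample_mean - pop_mean) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))     # both tails of the standard normal
    return z, p

# Hypothetical values, chosen only to show the arithmetic.
z_stat, p_value = one_sample_z_test(sample_mean=103, pop_mean=100, pop_sd=15, n=50)
print(f"z = {z_stat:.3f}, p = {p_value:.3f}")
```

The whole calculation hinges on `pop_sd` being known, which is exactly the assumption that rarely holds in practice.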
T-Test
The t-test is the workhorse of significance testing. You use it when you don't know the population standard deviation — which is basically always in practice.
One-sample t-test: Compare one group's mean to a known value
Two-sample t-test: Compare means between two independent groups
Paired t-test: Compare means from the same group at different times (before/after, matched subjects)
Example: Testing whether a new drug lowers blood pressure compared to a placebo. You'd likely use a two-sample t-test if comparing two independent groups.
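As a rough sketch of the drug-versus-placebo example, the code below computes a two-sample t statistic by hand. It uses Welch's variant, which does not assume equal variances; that choice and the blood-pressure changes are assumptions made for illustration. The standard library has no t distribution, so the p-value is left to a stats package such as `scipy.stats.ttest_ind(drug, placebo, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, degrees of freedom) for Welch's two-sample t-test."""
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df

# Hypothetical changes in systolic blood pressure (mmHg) per participant.
drug = [-12.0, -9.0, -15.0, -7.0, -11.0, -10.0]
placebo = [-3.0, -5.0, 0.0, -4.0, -2.0, -4.0]

t_stat, dof = welch_t(drug, placebo)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A large negative t here means the drug group's average drop is many standard errors below the placebo group's.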
Chi-Square Test
Chi-square tests work with categorical data, not continuous numbers. You count things and check if the observed frequencies match what you'd expect.
Use it for:
- Testing independence between two categorical variables
- Checking if observed frequencies differ from expected frequencies
- Goodness-of-fit tests
Example: Testing whether gender is independent of voting preference. You'd set up a contingency table and see if the relationship is statistically significant.
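The contingency-table calculation can be sketched as follows. The counts are invented, and the p-value helper only covers the 2x2 case (1 degree of freedom), where the chi-square tail probability reduces to a normal tail. No continuity correction is applied here, whereas `scipy.stats.chi2_contingency` applies Yates' correction to 2x2 tables by default, so its result would differ slightly.

```python
from math import erfc, sqrt

def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def p_value_df1(stat):
    """Upper-tail p-value for a chi-square statistic with exactly 1 df."""
    return erfc(sqrt(stat / 2))

# Hypothetical counts: rows = two groups, columns = two voting preferences.
observed = [[30, 20],
            [20, 30]]

chi2, dof = chi_square_independence(observed)
p_value = p_value_df1(chi2)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
```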
ANOVA (Analysis of Variance)
ANOVA compares means across three or more groups. A t-test handles only two groups at a time, and running a separate t-test for every pair of groups inflates your Type I error rate.
Types include:
One-way ANOVA: One independent variable with three or more levels
Two-way ANOVA: Two independent variables, can test for interaction effects
Repeated measures ANOVA: Same subjects measured multiple times
Example: Comparing test scores across four different teaching methods. ANOVA tells you if at least one method differs significantly, but won't tell you which ones without follow-up tests.
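The F statistic behind a one-way ANOVA is the ratio of between-group to within-group variance. Here is a minimal sketch with toy scores (three small groups, invented for illustration); turning F into a p-value needs the F distribution, which a package such as `scipy.stats.f_oneway` provides.

```python
from statistics import mean

def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores for three teaching methods.
method_a = [1.0, 2.0, 3.0]
method_b = [2.0, 3.0, 4.0]
method_c = [5.0, 6.0, 7.0]

f_stat, df_b, df_w = one_way_anova_f(method_a, method_b, method_c)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F says the group means differ more than the within-group noise would explain, but it still does not say which groups differ.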
Correlation Test
A correlation test checks if two continuous variables have a linear relationship. The most common is Pearson's correlation.
The output is a correlation coefficient (r) between -1 and +1, plus a p-value testing whether that relationship is statistically significant.
Remember: correlation doesn't prove causation. That's not just a statistician's warning: r measures association only, so a confounding variable or reverse causation can produce exactly the same value.
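Pearson's r and its test statistic can be computed directly from the definitions. The paired values below are invented; the significance test uses t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, and the p-value lookup is again left to a stats package such as `scipy.stats.pearsonr`.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_t_stat(r, n):
    """t statistic for H0: no linear relationship (df = n - 2)."""
    return r * sqrt((n - 2) / (1 - r ** 2))

# Hypothetical paired measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x, y)
t_stat = correlation_t_stat(r, len(x))
print(f"r = {r:.3f}, t = {t_stat:.3f}, df = {len(x) - 2}")
```

Note how a fairly strong r from only five points yields a modest t: with tiny samples, even large correlations can fail to reach significance.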
Choosing the Right Test: Quick Reference
Here's where people get lost. Match your test to your data structure:
| Your Data | Groups/Variables | Recommended Test |
|---|---|---|
| Continuous | 1 group vs. known value | One-sample t-test |
| Continuous | 2 independent groups | Two-sample t-test |
| Continuous | 2 paired/matched groups | Paired t-test |
| Continuous | 3+ groups | ANOVA |
| Categorical | 2 categorical variables | Chi-square test |
| Continuous | 2 continuous variables | Pearson correlation |
| Ordinal/ranked | Any | Non-parametric alternatives |
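If you want the table in a script, one possible encoding is a plain dictionary lookup. The key strings and the `recommend_test` helper are hypothetical names for this sketch; the mapping itself just mirrors the rows above.

```python
# Keys are (data type, design); values are the table's recommended test.
RECOMMENDED_TEST = {
    ("continuous", "1 group vs. known value"): "One-sample t-test",
    ("continuous", "2 independent groups"): "Two-sample t-test",
    ("continuous", "2 paired/matched groups"): "Paired t-test",
    ("continuous", "3+ groups"): "ANOVA",
    ("categorical", "2 categorical variables"): "Chi-square test",
    ("continuous", "2 continuous variables"): "Pearson correlation",
}

def recommend_test(data_type: str, design: str) -> str:
    """Look up the recommended test; ordinal data maps to non-parametric tests."""
    if data_type == "ordinal":
        return "Non-parametric alternatives"
    return RECOMMENDED_TEST.get((data_type, design), "No match in the table")

print(recommend_test("continuous", "2 independent groups"))
print(recommend_test("ordinal", "2 independent groups"))
```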
Parametric vs. Non-Parametric Tests
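Parametric tests (t-tests, ANOVA, Pearson correlation) assume your data, or more precisely the residuals or the sampling distribution of the test statistic, are approximately normal. Non-parametric tests drop that assumption, although they still require independent observations.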
Non-parametric alternatives:
- Mann-Whitney U test (replaces two-sample t-test)
- Wilcoxon signed-rank test (replaces paired t-test)
- Kruskal-Wallis test (replaces one-way ANOVA)
- Spearman's rho (replaces Pearson correlation)
Non-parametric tests have less statistical power when your data actually is normal. Use them when you have outliers, small samples, or obviously non-normal data.
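To see why rank-based tests shrug off outliers, here is a small sketch of the Mann-Whitney U statistic, computed by counting pairwise wins. The data are invented, and the p-value step is omitted; `scipy.stats.mannwhitneyu` would supply it.

```python
from statistics import mean

def mann_whitney_u(a, b):
    """Return (U_a, U_b): pairwise 'wins' for each sample, ties count as half."""
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

# Hypothetical samples; the first contains one extreme outlier.
with_outlier = [1.0, 2.0, 3.0, 4.0, 100.0]
without_outlier = [1.0, 2.0, 3.0, 4.0, 10.0]
other_group = [5.0, 6.0, 7.0, 8.0, 9.0]

u_outlier, _ = mann_whitney_u(with_outlier, other_group)
u_plain, _ = mann_whitney_u(without_outlier, other_group)
print(f"mean with outlier = {mean(with_outlier)}, U = {u_outlier}")
print(f"mean without outlier = {mean(without_outlier)}, U = {u_plain}")
```

The outlier drags the mean from 4 to 22, yet U is identical in both cases because only the ordering of values matters.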
Getting Started: How to Run a Significance Test
Here's the practical process:
- Define your hypothesis — What are you actually testing? State your null hypothesis (no effect) and alternative hypothesis (there is an effect) before collecting data.
- Check your assumptions — Normality, independence, equal variances. Different tests require different assumptions.
- Choose your significance level — Almost always α = 0.05. This is your threshold for rejecting the null hypothesis.
- Collect data properly — Sample size matters. Small samples lack power. There's no fixing bad data with statistics.
- Run the test — Use software like R, Python (scipy.stats), SPSS, or even Excel for simpler tests.
- Check the p-value — If p < 0.05, reject the null hypothesis. If p ≥ 0.05, you fail to reject it. That's all "statistically significant" means.
- Report effect size — A significant p-value doesn't mean the effect is meaningful. Always report effect size (Cohen's d, R², etc.) alongside p-values.
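The steps above can be sketched end to end. Because the Python standard library has no t distribution, this example swaps in an exact permutation test for the t-test: it asks how often a random relabeling of participants produces a mean difference at least as large as the observed one. The data and variable names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

ALPHA = 0.05  # significance level, fixed before looking at the data

# H0: drug and placebo produce the same mean change. H1: the means differ.
drug = [-12.0, -9.0, -15.0, -7.0, -11.0, -10.0]     # hypothetical data
placebo = [-3.0, -5.0, 0.0, -4.0, -2.0, -4.0]       # hypothetical data

def exact_permutation_p(a, b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = a + b
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    extreme = count = 0
    # Enumerate every way of assigning len(a) of the pooled values to group a.
    for idx in combinations(range(len(pooled)), len(a)):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / len(a) - (total - sum_a) / len(b))
        count += 1
        if diff >= observed - 1e-12:   # small tolerance for float rounding
            extreme += 1
    return extreme / count

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    pooled_var = (
        (len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)
    ) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

p_value = exact_permutation_p(drug, placebo)
effect = cohens_d(drug, placebo)
decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
print(f"p = {p_value:.4f} -> {decision}; Cohen's d = {effect:.2f}")
```

Exhaustive enumeration is only practical for small samples (924 splits here); for larger ones you would sample random relabelings or use a t-test from a stats package.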
Common Mistakes That Wreck Your Analysis
These errors show up constantly in published research:
Ignoring assumptions — Running a t-test on highly skewed data is questionable. Check normality first or switch to non-parametric tests.
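P-hacking: Running dozens of tests and only reporting the significant ones. This makes false positives almost inevitable, because at α = 0.05 about 1 in 20 tests of a true null hypothesis comes out significant by chance. Pre-register your analysis plan.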
Confusing statistical significance with practical importance — A tiny effect can be statistically significant with a large sample. Ask yourself: does this matter in the real world?
Multiple comparisons without correction — Running multiple t-tests inflates your Type I error rate. Use Bonferroni correction or Tukey's HSD for ANOVA follow-ups.
Reporting p-values without context — Always include confidence intervals and effect sizes. A p-value alone tells you almost nothing useful.
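The multiple-comparisons problem is easy to quantify. As a sketch, assuming the tests are independent, the chance of at least one false positive across k tests is 1 - (1 - α)^k, and the Bonferroni correction simply divides α by k. The four-group scenario is an illustrative assumption.

```python
from math import comb

def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests

# Hypothetical setup: every pairwise t-test among 4 groups.
n_groups = 4
n_pairs = comb(n_groups, 2)

fwer = family_wise_error_rate(0.05, n_pairs)
per_test_alpha = bonferroni_alpha(0.05, n_pairs)
print(f"{n_pairs} tests: family-wise error rate = {fwer:.3f}")
print(f"Bonferroni per-test alpha = {per_test_alpha:.4f}")
```

Six uncorrected tests push the chance of a spurious finding above 25%, which is why a single ANOVA followed by corrected follow-ups is the safer route.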
One-Tailed vs. Two-Tailed Tests
This trips people up constantly.
Two-tailed test: You're testing if there's any difference (greater OR less than). Use this unless you have a strong theoretical reason for a directional hypothesis.
One-tailed test: You're testing if the effect goes in a specific direction (greater OR less, but not both). One-tailed tests are more powerful but require justification. Most reviewers will question why you used one.
Unless your hypothesis is explicitly directional ("X will increase Y, not just change it"), stick with two-tailed tests.
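The difference between the two is just how much of the distribution's tail you count. In this sketch, a hypothetical z statistic of 1.8 (in the predicted direction) is evaluated both ways using the standard normal distribution.

```python
from statistics import NormalDist

def z_p_values(z: float):
    """Return (one-tailed, two-tailed) p-values for a z statistic.

    The one-tailed value assumes the hypothesis predicted a positive effect.
    """
    upper_tail = 1 - NormalDist().cdf(z)
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return upper_tail, two_tailed

# Hypothetical test statistic.
one_tailed, two_tailed = z_p_values(1.8)
print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
```

The same data clears p < 0.05 one-tailed but not two-tailed, which is exactly why reviewers ask whether the direction was chosen before the data came in.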
Sample Size and Power
You need to calculate your required sample size before collecting data. Running a test without enough participants is pointless — you'll either miss real effects or find spurious ones.
Power analysis tells you how many subjects you need to detect an effect of a given size with a given probability (the power) at your chosen significance level. Use G*Power or similar tools. It's not optional for rigorous research.
Common power thresholds:
- 80% power = minimum acceptable
- 90% power = preferred for confirmatory studies
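For a rough feel of the numbers, here is a sketch of a sample-size calculation for comparing two independent group means. It uses the normal approximation n = 2 * ((z_alpha/2 + z_power) / d)^2 per group, which slightly underestimates what a t-based tool such as G*Power reports. The effect size d = 0.5 is an assumed value.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group for a two-sided, two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the test
    z_power = NormalDist().inv_cdf(power)           # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Assumed medium effect (Cohen's d = 0.5).
n_80 = n_per_group(0.5, power=0.80)
n_90 = n_per_group(0.5, power=0.90)
print(f"80% power: {n_80} per group; 90% power: {n_90} per group")
```

Because the required n scales with 1 / d^2, halving the expected effect size roughly quadruples the sample you need.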
The Bottom Line
Significance tests aren't complicated. Pick the test that matches your data structure, check your assumptions, run the test, and report effect sizes alongside p-values.
The p-value is just one piece of information. It tells you how surprising your data would be if there were no effect, not how large or important that effect is. Researchers who understand this distinction produce better science than those who chase p < 0.05 like it's a trophy.