Types of Significance Tests Explained

What Significance Tests Actually Are

Significance tests answer one question: is this result real or just random noise? That's it. Nothing mystical about it.

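You collect data, run a test, and get a p-value. That number tells you the probability of seeing results at least as extreme as yours if there's genuinely no effect. A low p-value means your data would be surprising under pure chance, so the result probably isn't a fluke. It is not the probability that the null hypothesis is true.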

Most researchers use p < 0.05 as their cutoff. This means: if there's no real effect, you'd see results at least this extreme less than 5% of the time by chance alone. That's the conventional standard, though it's arbitrary and often misunderstood.
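One way to make the p-value concrete is to simulate "no effect" directly. The sketch below uses a permutation test, a technique this article does not otherwise cover, purely as an illustration: it shuffles the group labels many times and counts how often chance alone produces a difference at least as large as the observed one. The data and the function name permutation_p_value are hypothetical.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values mimics a world where labels don't matter.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical measurements, invented for illustration only.
treated = [5.1, 4.8, 6.0, 5.7, 5.9, 6.2, 5.5, 5.8]
control = [4.9, 4.6, 5.0, 4.7, 5.2, 4.8, 5.1, 4.5]

p_value = permutation_p_value(treated, control)
print(f"p = {p_value:.4f}")
```

The fraction of shuffles that beat the observed difference is the p-value: the smaller it is, the harder the result is to explain as noise.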

The Main Types of Significance Tests

Different situations call for different tests. Using the wrong one is one of the most common research mistakes you'll see.

Z-Test

The z-test is for when you know the population standard deviation. That's rare in real research because you almost never know the true population parameters.

Use it when:

Sample size is large (typically n > 30)
You already know the population variance from prior research
You're comparing a sample mean to a known population value

It's fast and simple, but unrealistic for most actual research scenarios.
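A one-sample z-test is simple enough to write by hand. The following is a minimal sketch using only the Python standard library; the scores and the "known" population mean of 100 and standard deviation of 15 are assumptions made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z_test(sample, population_mean, population_sd):
    """Return (z, two-sided p) for a sample mean against a known population."""
    standard_error = population_sd / sqrt(len(sample))
    z = (mean(sample) - population_mean) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical scores from 36 people (n > 30, as suggested above).
scores = [
    102, 108, 95, 111, 99, 104, 107, 113, 98, 101, 106, 109,
    97, 103, 112, 105, 100, 110, 96, 108, 104, 107, 102, 99,
    111, 106, 103, 109, 101, 105, 98, 112, 107, 104, 100, 106,
]

# Population mean and SD are assumed to be known from prior research.
z, p = one_sample_z_test(scores, population_mean=100, population_sd=15)
print(f"z = {z:.2f}, p = {p:.3f}")
```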

T-Test

The t-test is the workhorse of significance testing. You use it when you don't know the population standard deviation — which is basically always in practice.

One-sample t-test: Compare one group's mean to a known value

Two-sample t-test: Compare means between two independent groups

Paired t-test: Compare means from the same group at different times (before/after, matched subjects)

Example: Testing whether a new drug lowers blood pressure compared to a placebo. You'd likely use a two-sample t-test if comparing two independent groups.
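As a rough sketch, the three variants map onto three functions in scipy.stats. The blood pressure readings below are invented for illustration, and the reference value of 130 is an arbitrary assumption.

```python
from scipy import stats

# Hypothetical systolic blood pressure readings (mmHg), invented for illustration.
drug = [128, 131, 125, 122, 130, 127, 124, 129, 126, 123]
placebo = [135, 138, 132, 136, 140, 133, 137, 134, 139, 131]
before = [142, 138, 150, 145, 139, 147, 141, 144]
after = [136, 135, 143, 140, 137, 139, 138, 141]

# One-sample: does the drug group's mean differ from a reference value of 130?
one_sample = stats.ttest_1samp(drug, popmean=130)

# Two-sample: Welch's version (equal_var=False) does not assume equal variances.
two_sample = stats.ttest_ind(drug, placebo, equal_var=False)

# Paired: the same patients measured before and after treatment.
paired = stats.ttest_rel(before, after)

for name, result in [("one-sample", one_sample),
                     ("two-sample", two_sample),
                     ("paired", paired)]:
    print(f"{name}: t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

Welch's variant is used for the two-sample case here as a cautious default; passing equal_var=True gives the classic Student's t-test when equal variances are a safe assumption.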

Chi-Square Test

Chi-square tests work with categorical data, not continuous numbers. You count things and check if the observed frequencies match what you'd expect.

Use it for:

Testing independence between two categorical variables
Checking if observed frequencies differ from expected frequencies
Goodness-of-fit tests

Example: Testing whether gender is independent of voting preference. You'd set up a contingency table and see if the relationship is statistically significant.
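A test of independence like this one can be sketched with scipy.stats.chi2_contingency. The counts in the contingency table are hypothetical.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are two gender groups,
# columns are preferences for three candidates.
observed = [
    [45, 30, 25],
    [35, 40, 25],
]

chi2, p, dof, expected = chi2_contingency(observed)

print(f"chi-square = {chi2:.3f}, dof = {dof}, p = {p:.3f}")
print("Expected counts if the variables were independent:")
print(expected)
```

The expected array is worth inspecting: a common rule of thumb is that the chi-square approximation becomes unreliable when expected counts are very small.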

ANOVA (Analysis of Variance)

ANOVA compares means across three or more groups. A t-test compares only two groups at a time, and running a separate t-test for every pair of groups inflates your Type I error rate.

Types include:

One-way ANOVA: One independent variable with three or more levels

Two-way ANOVA: Two independent variables, can test for interaction effects

Repeated measures ANOVA: Same subjects measured multiple times

Example: Comparing test scores across four different teaching methods. ANOVA tells you if at least one method differs significantly, but won't tell you which ones without follow-up tests.
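A one-way ANOVA for the teaching-methods example might look like the sketch below, using scipy.stats.f_oneway. The scores are invented for illustration.

```python
from scipy.stats import f_oneway

# Hypothetical test scores for four teaching methods.
method_a = [78, 82, 75, 80, 79, 81]
method_b = [85, 88, 84, 90, 86, 87]
method_c = [77, 80, 76, 79, 78, 81]
method_d = [83, 79, 82, 85, 80, 84]

f_stat, p = f_oneway(method_a, method_b, method_c, method_d)
print(f"F = {f_stat:.2f}, p = {p:.5f}")

# A small p only says that at least one group mean differs.
# Identifying which pairs differ needs a follow-up (post hoc) test.
```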

Correlation Test

A correlation test checks if two continuous variables have a linear relationship. The most common is Pearson's correlation.

The output is a correlation coefficient (r) between -1 and +1, plus a p-value testing whether that relationship is statistically significant.

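Remember: correlation doesn't prove causation. That's not just a statistician's warning. The coefficient r is symmetric in its two variables, so it cannot say which one drives the other, or whether a third variable drives both.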
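Computing Pearson's r and its p-value takes one call to scipy.stats.pearsonr. The study-hours data below is hypothetical.

```python
from scipy.stats import pearsonr

# Hypothetical data: hours studied and exam score for ten students.
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
exam_score = [52, 55, 61, 60, 68, 70, 75, 74, 82, 85]

r, p = pearsonr(hours_studied, exam_score)
print(f"r = {r:.3f}, p = {p:.6f}")

# Swapping the arguments gives the same r, which is why the
# coefficient alone cannot indicate the direction of any effect.
r_swapped, _ = pearsonr(exam_score, hours_studied)
```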

Choosing the Right Test: Quick Reference

Here's where people get lost. Match your test to your data structure:

Your Data	Groups/Variables	Recommended Test
Continuous	1 group vs. known value	One-sample t-test
Continuous	2 independent groups	Two-sample t-test
Continuous	2 paired/matched groups	Paired t-test
Continuous	3+ groups	ANOVA
Categorical	2 categorical variables	Chi-square test
Continuous	2 continuous variables	Pearson correlation
Ordinal/ranked	Any	Non-parametric alternatives

Parametric vs. Non-Parametric Tests

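Parametric tests (t-tests, ANOVA, Pearson correlation) assume your data are approximately normally distributed (strictly, the residuals or the sampling distribution of the test statistic). Non-parametric tests drop that assumption, although they still require independent observations.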

Non-parametric alternatives:

Mann-Whitney U test (replaces two-sample t-test)
Wilcoxon signed-rank test (replaces paired t-test)
Kruskal-Wallis test (replaces one-way ANOVA)
Spearman's rho (replaces Pearson correlation)

Non-parametric tests have less statistical power when your data actually is normal. Use them when you have outliers, small samples, or obviously non-normal data.
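Each of the alternatives listed above has a counterpart in scipy.stats. The sketch below runs all four on hypothetical data that includes an outlier and a strongly non-linear relationship, the kind of situation where the rank-based versions are the safer choice.

```python
from scipy import stats

# Hypothetical data; group_a contains an obvious outlier (95).
group_a = [12, 15, 14, 13, 16, 95, 11, 17]
group_b = [22, 25, 21, 27, 24, 23, 26, 28]
group_c = [31, 35, 33, 30, 36, 32, 34, 37]
before = [20, 25, 31, 28, 35, 40, 22, 30]
after = [19, 23, 28, 24, 30, 34, 15, 22]
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 9, 16, 30, 60, 110, 250]  # monotonic but far from linear

# Mann-Whitney U: two independent groups.
u_result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Wilcoxon signed-rank: paired measurements.
w_result = stats.wilcoxon(before, after)

# Kruskal-Wallis: three or more independent groups.
h_result = stats.kruskal(group_a, group_b, group_c)

# Spearman's rho: rank correlation between two variables.
rho, rho_p = stats.spearmanr(x, y)

print(f"Mann-Whitney U: p = {u_result.pvalue:.4f}")
print(f"Wilcoxon:       p = {w_result.pvalue:.4f}")
print(f"Kruskal-Wallis: p = {h_result.pvalue:.4f}")
print(f"Spearman rho = {rho:.2f}, p = {rho_p:.4f}")
```

Because these tests work on ranks, the outlier in group_a counts only as "the largest value" rather than dragging a mean upward.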

Getting Started: How to Run a Significance Test

Here's the practical process:

Define your hypothesis — What are you actually testing? State your null hypothesis (no effect) and alternative hypothesis (there is an effect) before collecting data.
Check your assumptions — Normality, independence, equal variances. Different tests require different assumptions.
Choose your significance level — Almost always α = 0.05. This is your threshold for rejecting the null hypothesis.
Collect data properly — Sample size matters. Small samples lack power. There's no fixing bad data with statistics.
Run the test — Use software like R, Python (scipy.stats), SPSS, or even Excel for simpler tests.
Check the p-value — If p < 0.05, reject the null hypothesis. If p ≥ 0.05, you fail to reject it. That's all "statistically significant" means.
Report effect size — A significant p-value doesn't mean the effect is meaningful. Always report effect size (Cohen's d, R², etc.) alongside p-values.
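The last few steps (choose α, run the test, check the p-value, report effect size) can be sketched in a few lines. The data is hypothetical, and cohens_d is a small helper written for this example using the pooled standard deviation, not a SciPy function.

```python
from math import sqrt
from statistics import mean, variance

from scipy.stats import ttest_ind

ALPHA = 0.05  # significance level, fixed before looking at the data

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)

# Hypothetical scores under a new and an old method, invented for illustration.
new_method = [12.1, 11.8, 13.0, 12.6, 12.9, 13.4, 12.2, 12.7]
old_method = [11.5, 11.9, 11.2, 12.0, 11.6, 11.1, 11.8, 11.4]

# Run the test (Welch's t-test, which does not assume equal variances).
result = ttest_ind(new_method, old_method, equal_var=False)

# Check the p-value against the pre-chosen threshold.
decision = "reject H0" if result.pvalue < ALPHA else "fail to reject H0"

# Report effect size alongside the p-value.
effect = cohens_d(new_method, old_method)

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f} -> {decision}")
print(f"Cohen's d = {effect:.2f}")
```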

Common Mistakes That Wreck Your Analysis

These errors show up constantly in published research:

Ignoring assumptions — Running a t-test on highly skewed data is questionable. Check normality first or switch to non-parametric tests.

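P-hacking — Running dozens of tests and only reporting the significant ones. This all but guarantees false positives: with 20 independent tests at α = 0.05, the chance of at least one false positive is about 64%. Pre-register your analysis plan.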

Confusing statistical significance with practical importance — A tiny effect can be statistically significant with a large sample. Ask yourself: does this matter in the real world?

Multiple comparisons without correction — Running multiple t-tests inflates your Type I error rate. Use Bonferroni correction or Tukey's HSD for ANOVA follow-ups.

Reporting p-values without context — Always include confidence intervals and effect sizes. A p-value alone tells you almost nothing useful.
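The multiple-comparisons problem is easy to quantify. The sketch below, which assumes the tests are independent, computes the chance of at least one false positive across several tests and applies a Bonferroni correction to a set of hypothetical p-values. Both function names are made up for this example.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni-adjusted threshold."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# With 20 independent tests at alpha = 0.05, false positives are likely.
fwer = familywise_error_rate(0.05, 20)
print(f"{fwer:.2f}")  # 0.64

# Hypothetical raw p-values from five comparisons.
raw_p_values = [0.001, 0.012, 0.030, 0.047, 0.200]
flags = bonferroni(raw_p_values)
print(flags)  # [True, False, False, False, False]
```

Four of the five raw p-values fall below 0.05, but only one survives the adjusted threshold of 0.05 / 5 = 0.01. Bonferroni is conservative, which is the price of its simplicity.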

One-Tailed vs. Two-Tailed Tests

This trips people up constantly.

Two-tailed test: You're testing if there's any difference (greater OR less than). Use this unless you have a strong theoretical reason for a directional hypothesis.

One-tailed test: You're testing if the effect goes in a specific direction (greater OR less, but not both). One-tailed tests are more powerful but require justification. Most reviewers will question why you used one.

Unless your hypothesis is explicitly directional ("X will increase Y, not just change it"), stick with two-tailed tests.
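In SciPy the choice is a single argument. The sketch below assumes SciPy 1.6 or later, where ttest_ind accepts an alternative parameter, and uses hypothetical data.

```python
from scipy.stats import ttest_ind

# Hypothetical outcome scores, invented for illustration.
treatment = [23, 25, 28, 22, 27, 26, 24, 29]
control = [21, 24, 23, 20, 25, 22, 26, 23]

# Two-tailed: is there any difference between the group means?
two_tailed = ttest_ind(treatment, control, alternative="two-sided")

# One-tailed: is the treatment mean specifically greater than the control mean?
one_tailed = ttest_ind(treatment, control, alternative="greater")

print(f"two-tailed p = {two_tailed.pvalue:.4f}")
print(f"one-tailed p = {one_tailed.pvalue:.4f}")
```

When the observed effect lies in the predicted direction, the one-tailed p-value is exactly half the two-tailed one. That is where the extra power comes from, and also why choosing the direction after seeing the data is a form of p-hacking.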

Sample Size and Power

You need to calculate your required sample size before collecting data. Running a test without enough participants is pointless — you'll either miss real effects or find spurious ones.

Power analysis tells you how many subjects you need to detect an effect of a given size with a given probability (the power) at your chosen significance level. Use G*Power or similar tools. It's not optional for rigorous research.

Common power thresholds:

80% power = minimum acceptable
90% power = preferred for confirmatory studies
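For a rough sense of the numbers, the sketch below estimates the per-group sample size for a two-sided, two-sample comparison of means. It uses the normal approximation rather than the exact t-based calculation that dedicated tools perform, so it can come out a participant or two lower than what G*Power reports. The function name and the effect size of d = 0.5 are assumptions for the example.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample test of means.

    effect_size is Cohen's d. Uses the normal approximation:
    n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n_80 = sample_size_per_group(0.5, power=0.80)
n_90 = sample_size_per_group(0.5, power=0.90)
print(n_80)  # 63
print(n_90)  # 85
```

Halving the effect size roughly quadruples the required sample, which is why small effects are so expensive to detect.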

The Bottom Line

Significance tests aren't complicated. Pick the test that matches your data structure, check your assumptions, run the test, and report effect sizes alongside p-values.

The p-value is just one piece of information. It tells you how surprising your data would be if there were no effect, not how large or important that effect is. Researchers who understand this distinction produce better science than those who chase p < 0.05 like it's a trophy.