Identifying Outliers: Practice Problems and Methods
What Outliers Actually Are (And Why You Can't Ignore Them)
Outliers are data points that deviate significantly from your dataset's normal pattern. They're the values that make your mean jump or your model choke. Sometimes they're errors. Sometimes they're the most interesting thing in your data.
The problem is most people either delete them blindly or pretend they don't exist. Both approaches are wrong. 🔍
The Three Methods That Actually Work
1. The IQR Method – Your First Line of Defense
IQR stands for interquartile range: the distance between the 25th percentile (Q1) and the 75th percentile (Q3). Any point more than 1.5 × IQR below Q1 or above Q3 gets flagged.
Here's the formula:
Lower bound: Q1 - 1.5 × IQR
Upper bound: Q3 + 1.5 × IQR
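This method doesn't assume your data is normally distributed, and because quartiles barely move when a few extreme values show up, it holds up on messy real-world data. On heavily skewed data the symmetric fences can still over-flag the long tail, so treat those flags as candidates, not verdicts.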
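The bounds above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `iqr_bounds` and `iqr_outliers` and the sample data are made up for this example, and the quartiles use the standard library's "inclusive" (linear interpolation) convention, which is an assumption about which quartile definition you want.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between sorted data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

data = [12, 14, 13, 15, 14, 13, 40]  # made-up sample
print(iqr_bounds(data))    # (10.75, 16.75)
print(iqr_outliers(data))  # [40]
```

Passing `k=3` gives the stricter "extreme outliers only" fences.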
2. Z-Score Method – When Your Data Behaves
A Z-score measures how many standard deviations a point lies from the mean. The usual threshold is |z| > 3.
Formula:
z = (x - μ) / σ
This only works properly when your data is roughly normal. On skewed data it can flag too many points in the long tail, and because extreme values inflate both the mean and the standard deviation, it can also miss the very outliers you are looking for.
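A minimal Python sketch of the Z-score rule follows. The names `z_scores` and `zscore_outliers` and the sample readings are hypothetical, and the sketch assumes the sample standard deviation (n - 1 denominator); using the population standard deviation would shift the scores slightly.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return the z-score of each value, using the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample SD; a sigma of zero (all values equal) would divide by zero
    return [(v - mu) / sigma for v in values]

def zscore_outliers(values, threshold=3.0):
    """Return values whose |z| exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Made-up sample: 20 readings near 50 plus one reading at 95.
readings = [48, 49, 50, 51, 52] * 4 + [95]
print(zscore_outliers(readings))  # [95]
```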
3. Modified Z-Score – The Robust Alternative
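The modified Z-score replaces the mean with the median and the standard deviation with the median absolute deviation (MAD), which is the median of |xi - median|. The constant 0.6745 rescales the MAD so that, for normally distributed data, the score is comparable to a standard Z-score.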
Formula:
Mi = 0.6745 × (xi - median) / MAD
Points with |Mi| > 3.5 are flagged as outliers. This method handles skewed data better than standard Z-scores.
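The formula translates almost directly into Python. This is a sketch under stated assumptions: the function names and sample data are invented for illustration, and the code simply refuses to score data whose MAD is zero (more than half the values identical), which is one of several ways that case could be handled.

```python
from statistics import median

def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each value."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified Z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]

def modified_z_outliers(values, threshold=3.5):
    """Return values whose |Mi| exceeds the threshold."""
    scores = modified_z_scores(values)
    return [v for v, m in zip(values, scores) if abs(m) > threshold]

sample = [10, 11, 12, 11, 10, 12, 11, 60]  # made-up sample
print(modified_z_outliers(sample))  # [60]
```

Because the median and MAD barely react to a single extreme value, the score for 60 stays large even in this eight-point sample.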
Visual Methods That Catch What Formulas Miss
Sometimes you need to see your data to understand what's happening.
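- Box plots – The box spans Q1 to Q3, and the whiskers usually reach to the most extreme points inside the 1.5 × IQR fences. Points drawn beyond the whiskers are the flagged outliers. Simple and effective.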
- Scatter plots – Essential for multivariate outliers. A point that looks normal in each dimension individually might be an extreme combination.
- Histograms – Reveal the distribution shape. Gaps or isolated bars signal potential outliers.
- Density plots – Show where your data actually concentrates. Small isolated bumps far from the main mass point to potential outliers.
Comparing the Methods
| Method | Best For | Distribution Assumption | Sensitivity |
|---|---|---|---|
| IQR (1.5×) | Skewed data, general use | None | Moderate |
| IQR (3×) | Extreme outliers only | None | Low |
| Z-Score (±3) | Normal distributions | Normal | High |
| Modified Z-Score | Skewed data with extreme points | None | High |
| Isolation Forest | High-dimensional data | None | Very High |
| DBSCAN clustering | Finding outlier clusters | None | Variable |
Practice Problems
Problem 1: The Salary Dataset
Dataset: $45,000, $52,000, $48,000, $51,000, $47,000, $210,000, $49,000
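Using IQR (quartiles by linear interpolation, a common software default):
- Sorted: $45,000, $47,000, $48,000, $49,000, $51,000, $52,000, $210,000
- Q1 = $47,500, Q3 = $51,500
- IQR = $4,000
- Upper bound = $51,500 + (1.5 × $4,000) = $57,500
- Lower bound = $47,500 - (1.5 × $4,000) = $41,500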
Result: $210,000 is an outlier. Obviously.
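With only seven values, the quartiles depend on which convention you pick, so the exact fences can differ between tools. The sketch below, using a hypothetical helper named `fences`, runs the salary data through both conventions offered by Python's `statistics.quantiles` to check that the conclusion does not hinge on that choice.

```python
from statistics import quantiles

salaries = [45_000, 52_000, 48_000, 51_000, 47_000, 210_000, 49_000]

def fences(values, method, k=1.5):
    """Return (Q1, Q3, lower fence, upper fence) for a quartile convention."""
    q1, _, q3 = quantiles(values, n=4, method=method)
    iqr = q3 - q1
    return q1, q3, q1 - k * iqr, q3 + k * iqr

for method in ("inclusive", "exclusive"):
    q1, q3, lower, upper = fences(salaries, method)
    flagged = [s for s in salaries if s < lower or s > upper]
    print(method, q1, q3, lower, upper, flagged)
# inclusive: Q1 = 47500.0, Q3 = 51500.0, fences 41500.0 and 57500.0
# exclusive: Q1 = 47000.0, Q3 = 52000.0, fences 39500.0 and 59500.0
# Both conventions flag only 210000.
```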
Problem 2: The Temperature Readings
Dataset: 72°F, 74°F, 71°F, 73°F, 75°F, 69°F, 70°F, 150°F
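Using Z-score for 150°F:
- Mean of all eight readings = 81.75°F
- Sample standard deviation = ~27.65°F
- Z-score = (150 - 81.75) / 27.65 = ~2.47
That is below 3, because the bad reading drags the mean up and inflates the standard deviation. With only eight points, no value can score above (n - 1) / √n ≈ 2.475 anyway. Measured against the other seven readings instead:
- Mean = 72°F
- Sample standard deviation = ~2.16°F
- Z-score = (150 - 72) / 2.16 = ~36.1
Result: Way beyond |3| once the suspect reading stops contaminating its own baseline. This is a data entry error or sensor malfunction.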
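How large a Z-score comes out depends on whether the suspect point is allowed into the mean and standard deviation it is measured against. The sketch below computes both versions for the 150°F reading, assuming the sample standard deviation; the variable names are illustrative only.

```python
from statistics import mean, stdev

temps = [72, 74, 71, 73, 75, 69, 70, 150]
suspect = 150

# Z-score with the suspect reading included in the mean and SD.
z_all = (suspect - mean(temps)) / stdev(temps)

# Z-score measured against the remaining readings only.
rest = [t for t in temps if t != suspect]
z_rest = (suspect - mean(rest)) / stdev(rest)

# Largest |z| any single point can reach in a sample of this size.
n = len(temps)
z_cap = (n - 1) / n ** 0.5

print(round(z_all, 2))   # 2.47
print(round(z_rest, 2))  # 36.11
print(round(z_cap, 3))   # 2.475
```

In small samples a single extreme value can mask itself this way, which is one reason to cross-check with a median-based score.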
Problem 3: The Bimodal Distribution
Dataset with two clusters: [2, 3, 2.5, 3.5, 98, 99, 97.5, 98.5]
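Standard IQR and Z-score rules are blind here. The quartiles sit inside the two clusters, so the IQR is about 95 and the fences land far outside the data. The mean (50.5) sits in the empty gap, and the standard deviation (~51) spans both groups. Neither method flags anything, and neither would flag a genuinely odd value like 50.
Result: This is when visualization matters. A histogram or dot plot would immediately show two distinct clusters (a box plot would hide them). You need to handle these groups separately or use clustering-based outlier detection.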
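A minimal sketch of handling the two groups separately follows. It prints what the plain IQR and Z-score rules flag on the pooled data for comparison, then splits the values at the largest gap between sorted neighbours. That split is a deliberately crude stand-in for real clustering, chosen here only because it needs no extra libraries; it assumes exactly two well-separated groups.

```python
from statistics import mean, stdev, quantiles

data = [2, 3, 2.5, 3.5, 98, 99, 97.5, 98.5]

# Pooled IQR rule.
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
iqr_flags = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Pooled Z-score rule.
mu, sigma = mean(data), stdev(data)
z_flags = [x for x in data if abs(x - mu) / sigma > 3]

# Split at the largest gap between neighbouring sorted values.
ordered = sorted(data)
gaps = [b - a for a, b in zip(ordered, ordered[1:])]
cut = gaps.index(max(gaps)) + 1
low_group, high_group = ordered[:cut], ordered[cut:]

print(iqr_flags, z_flags)  # [] []
print(low_group)           # [2, 2.5, 3, 3.5]
print(high_group)          # [97.5, 98, 98.5, 99]
```

Each group can then be screened on its own with whichever rule suits it.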
Getting Started: Step-by-Step Process
Here's how to actually identify outliers in your data:
Step 1: Plot First
Don't run statistics blind. Create a box plot or histogram. See what your data looks like before applying any formulas.
Step 2: Choose Your Method
If your data is roughly normal, use Z-scores. If it's skewed or you're unsure, use IQR. For complex datasets, consider machine learning approaches like Isolation Forest.
Step 3: Apply the Thresholds
Flag points beyond your chosen bounds. Document which method you used and why.
Step 4: Investigate, Don't Just Remove
Every outlier has a story. Some are errors. Some are your most valuable data points. A hospital's $2 million medical claim isn't necessarily fraud. It may well be a real case that needs separate analysis.
Step 5: Test Your Results
Run your analysis with and without flagged outliers. See how much your results change. If nothing shifts, the outliers probably don't matter for your specific question.
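One way to run that comparison is to compute the same summary statistics twice. The helper below, `compare_with_without`, is a hypothetical name for illustration, and it assumes your "analysis" can be summarised by a mean and a median; swap in whatever statistic your question actually depends on.

```python
from statistics import mean, median

def compare_with_without(values, flagged):
    """Summarise the data with and without the flagged values."""
    flagged_set = set(flagged)
    kept = [v for v in values if v not in flagged_set]
    return {
        "mean_all": mean(values),
        "mean_kept": mean(kept),
        "median_all": median(values),
        "median_kept": median(kept),
    }

salaries = [45_000, 52_000, 48_000, 51_000, 47_000, 210_000, 49_000]
report = compare_with_without(salaries, flagged=[210_000])
for name, value in report.items():
    print(f"{name}: {value:,.0f}")
```

On this data the mean drops from about 71,714 to about 48,667 when the flagged salary is excluded, while the median only moves from 49,000 to 48,500, so a mean-based conclusion is sensitive to the outlier and a median-based one much less so.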
Common Mistakes That Sabotage Your Analysis
- Using Z-scores on skewed data. The mean and standard deviation are themselves distorted, so the scores mislead. Check normality first with a Q-Q plot or a Shapiro-Wilk test.
- Removing outliers automatically. This is lazy analysis. Investigate each one.
- Ignoring context. A value that's an outlier in one dataset might be completely normal in another.
- Using the wrong threshold. 1.5× IQR isn't magic. Some situations need 3× IQR. Others need 1×.
- Forgetting multivariate outliers. A data point can look perfectly normal individually but be extreme in combination with other variables.
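To make the multivariate point concrete, here is a small sketch using the Mahalanobis distance, a technique this article does not otherwise cover and which is swapped in here purely as an illustration. The function name `mahalanobis_2d` and the points are made up. The last point, (3, 7), has an x and a y that each fall inside the range of the others, yet it breaks the strong x-y relationship.

```python
from statistics import mean

def mahalanobis_2d(points):
    """Return the Mahalanobis distance of each (x, y) point from the sample mean."""
    xs, ys = zip(*points)
    n = len(points)
    mx, my = mean(xs), mean(ys)
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2 ** 0.5)
    return distances

points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 5.9), (7, 7.2), (8, 7.8), (3, 7)]
dists = mahalanobis_2d(points)
worst = dists.index(max(dists))
print(points[worst])  # (3, 7)
```

A per-variable IQR or Z-score check would pass (3, 7) on both axes, which is why single-variable screening alone can miss this kind of point.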
When to Use Advanced Methods
Standard statistical methods hit their limits fast with high-dimensional data. Here's when you need to level up:
- Many variables: Use Isolation Forest, LOF (Local Outlier Factor), or One-Class SVM
- Time series data: Use rolling statistics or anomaly detection designed for sequences
- Text or categorical data: Use distance-based methods or embeddings to detect anomalies
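For the time series case, "rolling statistics" can be as simple as scoring each point against a trailing window. The sketch below is one possible reading of that idea, not a standard algorithm: the function name `rolling_z_flags`, the window size of 5, and the sample series are all assumptions made for illustration.

```python
from statistics import mean, stdev

def rolling_z_flags(series, window=5, threshold=3.0):
    """Flag indices whose value is far from the mean of the preceding window."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]  # trailing window, excludes point i
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat window: the score is undefined, so skip it
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

series = [10, 11, 10, 12, 11, 10, 11, 30, 11, 10, 12, 11]  # made-up readings
print(rolling_z_flags(series))  # [7]
```

Note that once the spike enters the window it inflates the window's standard deviation, so the next few points are scored against a noisier baseline. Swapping the mean and standard deviation for the median and MAD is one way to reduce that effect.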
The Bottom Line
Outlier detection isn't a one-step process. It requires judgment, visualization, and domain knowledge. The method you choose depends on your data distribution, your question, and what you'll do with the results.
Start simple. Plot your data. Apply IQR or Z-scores. Investigate what you find. Only move to advanced methods when the basics aren't enough.