Identifying Outliers: Practice Problems and Methods

What Outliers Actually Are (And Why You Can't Ignore Them)

Outliers are data points that deviate significantly from your dataset's normal pattern. They're the values that make your mean jump or your model choke. Sometimes they're errors. Sometimes they're the most interesting thing in your data.

The problem is most people either delete them blindly or pretend they don't exist. Both approaches are wrong. 🔍

The Three Methods That Actually Work

1. The IQR Method – Your First Line of Defense

IQR stands for Interquartile Range. It's the distance between the 25th percentile (Q1) and 75th percentile (Q3). Any point more than 1.5 × IQR below Q1 or above Q3 gets flagged.

Here's the formula:

Lower bound: Q1 - 1.5 × IQR
Upper bound: Q3 + 1.5 × IQR

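This method copes with skewed distributions better than mean-based rules. It doesn't assume your data is normally distributed, which makes it more robust for messy real-world data. With heavy skew, expect the symmetric 1.5× fences to flag more points on the long tail.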
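
The bounds above can be sketched in a few lines of standard-library Python. The function names and the sample data are made up for illustration, and `method="inclusive"` (linear interpolation between data points) is only one of several quartile conventions, so other tools may give slightly different fences.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # "inclusive" interpolates linearly between sorted data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 14, 40]  # hypothetical sample
print(iqr_bounds(data))    # fences for k = 1.5
print(iqr_outliers(data))  # values outside the fences
```

Passing `k=3` gives the stricter fences used for extreme outliers only.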

2. Z-Score Method – When Your Data Behaves

Z-score measures how many standard deviations a point is from the mean. The standard threshold is |z| > 3.

Formula:

z = (x - μ) / σ

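This only works properly when your data is roughly normal. If your data is skewed, this method will miss outliers or flag too many points. The mean and standard deviation are also pulled by the outliers themselves, so in a small sample an extreme value can hide itself.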
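
A minimal sketch of the Z-score rule, assuming the population standard deviation (swap in `stdev` for the sample version). The function names and the data are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return (x - mean) / standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def z_outliers(values, threshold=3.0):
    """Return the values whose |z| exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical sample: twenty ordinary readings plus one extreme value.
data = [48, 49, 50, 51, 52] * 4 + [100]
print(z_outliers(data))
```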

3. Modified Z-Score – The Robust Alternative

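The modified Z-score replaces the mean with the median and the standard deviation with the median absolute deviation (MAD), which is the median of the absolute distances from the median. The constant 0.6745 rescales MAD so that, for normally distributed data, the score is comparable to a standard Z-score.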

Formula:

Mi = 0.6745 × (xi - median) / MAD

Points with |Mi| > 3.5 are flagged as outliers. This method handles skewed data better than standard Z-scores.
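
The formula can be sketched as follows. The function names and sample data are illustrative, and raising an error when MAD is zero is a choice made for this sketch (it happens when more than half the values are identical).

```python
from statistics import median

def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each value."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]

def modified_z_outliers(values, threshold=3.5):
    """Return the values whose |M| exceeds the threshold."""
    scores = modified_z_scores(values)
    return [v for v, m in zip(values, scores) if abs(m) > threshold]

data = [10, 12, 11, 13, 12, 11, 14, 40]  # hypothetical sample
print(modified_z_outliers(data))
```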

Visual Methods That Catch What Formulas Miss

Sometimes you need to see your data to understand what's happening.

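Box plots – The box spans Q1 to Q3, and by the usual convention the whiskers reach the most extreme points inside the 1.5 × IQR bounds. Points drawn beyond the whiskers are flagged as outliers. Simple and effective.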
Scatter plots – Essential for multivariate outliers. A point that looks normal in each dimension individually might be an extreme combination.
Histograms – Reveal the distribution shape. Gaps or isolated bars signal potential outliers.
Density plots – Show where your data actually concentrates. Small, isolated bumps far from the main mass suggest outliers.
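
When no plotting library is at hand, even a text histogram shows the gaps and isolated bars described above. This is a rough sketch with a hypothetical helper, `text_histogram`, and equal-width bins; a real plot is still the better tool.

```python
def text_histogram(values, bins=10, width=40):
    """Return (counts, rendered text) for an equal-width histogram."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1  # avoid a zero step when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / step), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        bar = "#" * round(width * count / peak)
        lines.append(f"{lo + i * step:8.1f} | {bar} {count}")
    return counts, "\n".join(lines)

temps = [72, 74, 71, 73, 75, 69, 70, 150]
counts, chart = text_histogram(temps, bins=9)
print(chart)  # one tall bar, a run of empty bins, then a lone bar
```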

Comparing the Methods

Method	Best For	Distribution Assumption	Sensitivity
IQR (1.5×)	Skewed data, general use	None	Moderate
IQR (3×)	Extreme outliers only	None	Low
Z-Score (±3)	Normal distributions	Normal	High
Modified Z-Score	Skewed data with extreme points	None	High
Isolation Forest	High-dimensional data	None	Very High
DBSCAN clustering	Finding outlier clusters	None	Variable

Practice Problems

Problem 1: The Salary Dataset

Dataset: $45,000, $52,000, $48,000, $51,000, $47,000, $210,000, $49,000

Using IQR:

Q1 = $47,500, Q3 = $51,500 (linearly interpolated quartiles; other conventions give slightly different values)
IQR = $4,000
Upper bound = $51,500 + (1.5 × $4,000) = $57,500
Lower bound = $47,500 - (1.5 × $4,000) = $41,500

Result: $210,000 is an outlier. Obviously.
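
Quartile conventions differ between tools, so the exact bounds for a seven-point sample depend on which one you use. As a sketch, Python's `statistics.quantiles` offers two conventions, and the hypothetical helper `fences` below compares them on this dataset. The bounds move, but the conclusion does not.

```python
from statistics import quantiles

salaries = [45_000, 52_000, 48_000, 51_000, 47_000, 210_000, 49_000]

def fences(values, method, k=1.5):
    """Return (lower, upper) bounds under the given quartile convention."""
    q1, _, q3 = quantiles(values, n=4, method=method)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

results = {}
for method in ("inclusive", "exclusive"):
    lower, upper = fences(salaries, method)
    flagged = [s for s in salaries if s < lower or s > upper]
    results[method] = (lower, upper, flagged)
    print(f"{method}: {lower:,.0f} to {upper:,.0f}, flagged {flagged}")
```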

Problem 2: The Temperature Readings

Dataset: 72°F, 74°F, 71°F, 73°F, 75°F, 69°F, 70°F, 150°F

Using Z-score for 150°F:

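Mean = 81.75°F (the 150°F reading pulls it up from 72°F)
Sample standard deviation ≈ 27.65°F
Z-score = (150 - 81.75) / 27.65 ≈ 2.47

Result: Below the |z| > 3 threshold, even though the reading is plainly wrong. The outlier inflates the very mean and standard deviation used to judge it, and with only 8 points no value can score above about 2.47. The modified Z-score catches it: median = 72.5°F, MAD = 2°F, so M = 0.6745 × (150 - 72.5) / 2 ≈ 26.1. This is a data entry error or sensor malfunction.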
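
As a cross-check, the sketch below recomputes this problem with Python's standard library. It assumes the mean and sample standard deviation are taken over all eight readings, the suspect one included, and adds the modified Z-score for comparison. In a sample this small the extreme reading inflates both the mean and the standard deviation, so the two scores can disagree sharply.

```python
from statistics import mean, median, stdev

readings = [72, 74, 71, 73, 75, 69, 70, 150]

# Standard Z-score: mean and sample standard deviation from all eight readings.
z = (150 - mean(readings)) / stdev(readings)

# Modified Z-score: median and median absolute deviation (MAD).
med = median(readings)
mad = median([abs(r - med) for r in readings])
m = 0.6745 * (150 - med) / mad

print(f"z = {z:.2f}, modified z = {m:.2f}")  # z = 2.47, modified z = 26.14
```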

Problem 3: The Bimodal Distribution

Dataset with two clusters: [2, 3, 2.5, 3.5, 98, 99, 97.5, 98.5]

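Standard IQR and Z-score rules are blind here. The two legitimate groups stretch the spread so far (IQR ≈ 95, standard deviation ≈ 51) that neither method flags anything, and a stray value such as 50 would pass unnoticed.

Result: This is when visualization matters. A histogram or dot plot would immediately show two distinct clusters; a box plot would hide them. You need to handle these groups separately or use clustering-based outlier detection.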
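
One way to handle the groups separately is to split the values wherever a large gap appears and then look at each group on its own. The sketch below is a crude one-dimensional stand-in for a real clustering algorithm such as DBSCAN; `split_on_gaps` and the `min_gap=10` setting are assumptions chosen for this dataset.

```python
def split_on_gaps(values, min_gap):
    """Sort the values and start a new group wherever a gap exceeds min_gap."""
    ordered = sorted(values)
    groups = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > min_gap:
            groups.append([cur])
        else:
            groups[-1].append(cur)
    return groups

data = [2, 3, 2.5, 3.5, 98, 99, 97.5, 98.5]
groups = split_on_gaps(data, min_gap=10)
print(groups)  # two groups of four

# A value stranded between the clusters ends up in a group of its own.
with_stray = split_on_gaps(data + [50], min_gap=10)
singletons = [g for g in with_stray if len(g) == 1]
print(singletons)
```

Once the groups are separated, any of the single-group methods above can be applied within each one.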

Getting Started: Step-by-Step Process

Here's how to actually identify outliers in your data:

Step 1: Plot First

Don't run statistics blind. Create a box plot or histogram. See what your data looks like before applying any formulas.

Step 2: Choose Your Method

If your data is roughly normal, use Z-scores. If it's skewed or you're unsure, use IQR. For complex datasets, consider machine learning approaches like Isolation Forest.
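
As a rough first-pass screen for this choice, you could compute the sample skewness. The helper names and the cutoff of 1.0 are assumptions for illustration, not a standard, and low skewness does not prove normality, so treat the result as a hint alongside the plot from Step 1.

```python
from statistics import mean

def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (0 for perfectly symmetric data)."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def suggest_method(values, cutoff=1.0):
    """Suggest 'z-score' for roughly symmetric data, 'iqr' otherwise."""
    # cutoff=1.0 is an assumed rule of thumb.
    return "z-score" if abs(sample_skewness(values)) < cutoff else "iqr"

salaries = [45_000, 52_000, 48_000, 51_000, 47_000, 210_000, 49_000]
print(suggest_method([1, 2, 3, 4, 5, 6, 7]))  # symmetric sample
print(suggest_method(salaries))               # strongly right-skewed sample
```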

Step 3: Apply the Thresholds

Flag points beyond your chosen bounds. Document which method you used and why.

Step 4: Investigate, Don't Just Remove

Every outlier has a story. Some are errors. Some are your most valuable data points. A hospital's $2 million medical claim isn't necessarily fraud; it may well be a real case that needs separate analysis.

Step 5: Test Your Results

Run your analysis with and without flagged outliers. See how much your results change. If nothing shifts, the outliers probably don't matter for your specific question.
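
A minimal sketch of that comparison, using the salary data from Problem 1. The helper name `with_and_without` is hypothetical, and the mean and median stand in for whatever statistic your analysis actually reports.

```python
from statistics import mean, median

def with_and_without(values, flagged):
    """Return each summary statistic as a (with outliers, without outliers) pair."""
    flagged = set(flagged)
    kept = [v for v in values if v not in flagged]
    return {
        "mean": (mean(values), mean(kept)),
        "median": (median(values), median(kept)),
    }

salaries = [45_000, 52_000, 48_000, 51_000, 47_000, 210_000, 49_000]
summary = with_and_without(salaries, flagged=[210_000])
for name, (before, after) in summary.items():
    print(f"{name}: {before:,.0f} with, {after:,.0f} without")
```

Here the mean drops by roughly a third while the median barely moves, so the outlier matters for questions about the mean and much less for questions about the median.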

Common Mistakes That Sabotage Your Analysis

Using Z-scores on skewed data. Your results will be garbage. Check normality first with a Shapiro-Wilk test.
Removing outliers automatically. This is lazy analysis. Investigate each one.
Ignoring context. A value that's an outlier in one dataset might be completely normal in another.
Using the wrong threshold. 1.5× IQR isn't magic. Some situations need 3× IQR. Others need 1×.
Forgetting multivariate outliers. A data point can look perfectly normal individually but be extreme in combination with other variables.
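
To illustrate the last point, one common way to score multivariate outliers (not named in the text above) is the Mahalanobis distance, which measures how far a point lies from the center once the correlation between variables is taken into account. The sketch below handles two variables by hand; the height and weight pairs are invented, and the last point is unremarkable on each axis but breaks the overall pattern.

```python
from statistics import mean

def mahalanobis_2d(points):
    """Return the Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form using the closed-form inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2 ** 0.5)
    return distances

# Hypothetical (height cm, weight kg) pairs.
points = [(150, 52), (155, 54), (160, 61), (165, 64),
          (170, 71), (175, 74), (180, 81), (158, 78)]
distances = mahalanobis_2d(points)
most_extreme = distances.index(max(distances))
print(points[most_extreme])
```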

When to Use Advanced Methods

Standard statistical methods hit their limits fast with high-dimensional data. Here's when you need to level up:

Many variables: Use Isolation Forest, LOF (Local Outlier Factor), or One-Class SVM
Time series data: Use rolling statistics or anomaly detection designed for sequences
Text or categorical data: Use distance-based methods or embeddings to detect anomalies
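
For the time series case, rolling statistics can be sketched as a trailing window: each point is scored against the median and MAD of the points just before it. The window of 5, the 3.5 threshold, and the choice to skip flat windows are assumptions for this sketch, and the series is made up.

```python
from statistics import median

def rolling_outliers(series, window=5, threshold=3.5):
    """Return indices whose value is far from the median of the previous `window` points."""
    flagged = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        med = median(recent)
        mad = median([abs(v - med) for v in recent])
        if mad == 0:
            continue  # flat window: the score is undefined, so skip it here
        score = 0.6745 * (series[i] - med) / mad
        if abs(score) > threshold:
            flagged.append(i)
    return flagged

# Hypothetical upward trend with one spike at index 7.
series = [10, 11, 12, 13, 14, 15, 16, 40, 18, 19, 20, 21]
print(rolling_outliers(series))
```

Because the window follows the trend, steadily rising values are not flagged, while the spike is.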

The Bottom Line

Outlier detection isn't a one-step process. It requires judgment, visualization, and domain knowledge. The method you choose depends on your data distribution, your question, and what you'll do with the results.

Start simple. Plot your data. Apply IQR or Z-scores. Investigate what you find. Only move to advanced methods when the basics aren't enough.