Identifying Outliers: Practice Problems and Methods

What Outliers Actually Are (And Why You Can't Ignore Them)

Outliers are data points that deviate significantly from your dataset's normal pattern. They're the values that make your mean jump or your model choke. Sometimes they're errors. Sometimes they're the most interesting thing in your data.

The problem is most people either delete them blindly or pretend they don't exist. Both approaches are wrong. 🔍

The Three Methods That Actually Work

1. The IQR Method – Your First Line of Defense

IQR stands for interquartile range. It's the distance between the 25th percentile (Q1) and the 75th percentile (Q3), so IQR = Q3 - Q1. Any point more than 1.5 × IQR below Q1 or above Q3 gets flagged.

Here's the formula:

Lower bound: Q1 - 1.5 × IQR
Upper bound: Q3 + 1.5 × IQR

This method copes reasonably well with skewed distributions, although on heavily skewed data it tends to flag a lot of points in the long tail. It doesn't assume your data is normally distributed, which makes it more robust for real-world messy data.
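The bounds above can be sketched in a few lines of standard-library Python. The function names and the sample values are made up for illustration, and the quartile convention (`method="inclusive"`, linear interpolation between sorted points) is an assumption; other conventions shift the bounds slightly.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    # "inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the bounds."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

# Illustrative, made-up data: one value sits far above the rest.
sample = [10, 12, 11, 13, 12, 14, 11, 40]
print(iqr_bounds(sample))    # (7.625, 16.625)
print(iqr_outliers(sample))  # [40]
```

Passing `k=3` instead of the default gives the stricter "extreme outliers only" variant.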

2. Z-Score Method – When Your Data Behaves

Z-score measures how many standard deviations a point is from the mean. The standard threshold is |z| > 3.

Formula:

z = (x - μ) / σ

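This only works properly when your data is roughly normal and the sample is reasonably large. If your data is skewed, this method will miss outliers or flag too many false positives. The mean and standard deviation are themselves pulled toward extreme values, so a big outlier can partly hide itself, especially in a small sample.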
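A minimal sketch of the formula, using the population standard deviation to match the σ in the formula above. The helper names and the readings are made up for illustration.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / sigma for each value, with the population sigma."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # Every value is identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

def z_outliers(values, threshold=3.0):
    """Return the values whose |z| exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Illustrative, made-up readings: eleven near 50 and one at 95.
readings = [50, 51, 49, 52, 48, 50, 51, 49, 50, 52, 48, 95]
print(z_outliers(readings))  # [95]
```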

3. Modified Z-Score – The Robust Alternative

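The modified Z-score replaces the mean with the median and the standard deviation with the median absolute deviation (MAD), which is the median of the absolute deviations from the median. The constant 0.6745 rescales the MAD so that, for normally distributed data, the score is comparable to an ordinary Z-score.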

Formula:

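M_i = 0.6745 × (x_i - median) / MAD

Points with |M_i| > 3.5 are flagged as outliers. Because the median and MAD barely move when a few extreme values are present, this method holds up better than standard Z-scores on skewed or contaminated data.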
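Here is one way the modified score might be computed with the standard library. The function names and sample data are illustrative. Raising an error when the MAD is zero is a design choice of this sketch, not part of the method.

```python
import statistics

def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each value."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        raise ValueError("MAD is zero; modified Z-score is undefined")
    return [0.6745 * (x - med) / mad for x in values]

def modified_z_outliers(values, threshold=3.5):
    """Return the values whose |M_i| exceeds the threshold."""
    scores = modified_z_scores(values)
    return [x for x, m in zip(values, scores) if abs(m) > threshold]

# Illustrative, made-up data: median 12, MAD 1.
sample = [10, 12, 11, 13, 12, 14, 11, 40]
print(modified_z_outliers(sample))  # [40]
```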

Visual Methods That Catch What Formulas Miss

Sometimes you need to see your data to understand what's happening.

Comparing the Methods

Method            | Best For                        | Distribution Assumption | Sensitivity
IQR (1.5×)        | Skewed data, general use        | None                    | Moderate
IQR (3×)          | Extreme outliers only           | None                    | Low
Z-Score (±3)      | Normal distributions            | Normal                  | High
Modified Z-Score  | Skewed data with extreme points | None                    | High
Isolation Forest  | High-dimensional data           | None                    | Very High
DBSCAN clustering | Finding outlier clusters        | None                    | Variable

Practice Problems

Problem 1: The Salary Dataset

Dataset: $45,000, $52,000, $48,000, $51,000, $47,000, $210,000, $49,000

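Using IQR:

Sorted: $45,000, $47,000, $48,000, $49,000, $51,000, $52,000, $210,000
Q1 = $47,500, Q3 = $51,500, IQR = $4,000 (quartiles by linear interpolation)
Lower bound: $47,500 - 1.5 × $4,000 = $41,500
Upper bound: $51,500 + 1.5 × $4,000 = $57,500

Result: $210,000 is far above the upper bound, so it is an outlier. Obviously.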
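Quartiles can be computed under more than one convention, so the exact bounds depend on the tool you use. The sketch below runs this dataset through the two conventions offered by Python's `statistics.quantiles`. The function name `salary_bounds` is made up for illustration.

```python
import statistics

salaries = [45_000, 52_000, 48_000, 51_000, 47_000, 210_000, 49_000]

def salary_bounds(method):
    """Return (q1, q3, lower, upper, flagged) for one quartile convention."""
    q1, _, q3 = statistics.quantiles(salaries, n=4, method=method)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = [s for s in salaries if s < lower or s > upper]
    return q1, q3, lower, upper, flagged

for method in ("inclusive", "exclusive"):
    q1, q3, lower, upper, flagged = salary_bounds(method)
    print(f"{method}: Q1={q1:,.0f} Q3={q3:,.0f} "
          f"bounds=({lower:,.0f}, {upper:,.0f}) flagged={flagged}")
# inclusive: Q1=47,500 Q3=51,500 bounds=(41,500, 57,500) flagged=[210000]
# exclusive: Q1=47,000 Q3=52,000 bounds=(39,500, 59,500) flagged=[210000]
```

The bounds move by a couple of thousand dollars between conventions, but $210,000 is flagged either way.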

Problem 2: The Temperature Readings

Dataset: 72°F, 74°F, 71°F, 73°F, 75°F, 69°F, 70°F, 150°F

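Using Z-score for 150°F:

Mean = 81.75°F, σ ≈ 25.86°F (population standard deviation)
z = (150 - 81.75) / 25.86 ≈ 2.64

Result: Below the |z| > 3 threshold, even though 150°F is obviously wrong. The outlier inflates both the mean and the standard deviation, which masks it; in a sample of eight, no single point can reach |z| = 3. The IQR method (upper bound 79.5°F) and the modified Z-score (about 26) both flag it. This is almost certainly a data entry error or sensor malfunction.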
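The arithmetic for this problem is worth checking by machine, because the small sample matters. With the population standard deviation, a single value among n points can never score above sqrt(n - 1), which is about 2.65 for eight readings. The sketch below (helper names are illustrative) computes both the plain and the modified score for the 150°F reading.

```python
import statistics

temps_f = [72, 74, 71, 73, 75, 69, 70, 150]

def z_score(x, values):
    """Plain z-score using the population standard deviation."""
    return (x - statistics.fmean(values)) / statistics.pstdev(values)

def modified_z_score(x, values):
    """Modified z-score based on the median and the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return 0.6745 * (x - med) / mad

print(round(z_score(150, temps_f), 2))           # 2.64
print(round(modified_z_score(150, temps_f), 2))  # 26.14
```

The plain score lands just under the cutoff of 3 because the bad reading drags the mean and standard deviation along with it. The median-based score is not fooled.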

Problem 3: The Bimodal Distribution

Dataset with two clusters: [2, 3, 2.5, 3.5, 98, 99, 97.5, 98.5]

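Standard IQR or Z-score checks are misleading here because you have two legitimate groups. The quartiles land inside the two clusters, so the IQR bounds run from -140 to 241 and every Z-score is close to ±1. Nothing gets flagged, and a stray reading of 50, which belongs to neither group, would pass unnoticed.

Result: This is when visualization matters. A histogram or dot plot would immediately show two distinct clusters (a box plot would hide them behind one wide box). You need to handle these groups separately or use clustering-based outlier detection.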
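To see what "handle these groups separately" can look like, here is a rough sketch. It splits the data at its widest gap, a crude stand-in for real clustering that only makes sense for two well-separated groups, and then runs the IQR check per group. The helper `split_at_largest_gap` and the stray value 50 are hypothetical additions for illustration.

```python
import statistics

data = [2, 3, 2.5, 3.5, 98, 99, 97.5, 98.5]

def iqr_outliers(values, k=1.5):
    """Return values outside Q1 - k*IQR .. Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

def split_at_largest_gap(values):
    """Hypothetical helper: split sorted values in two at the widest gap."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

# On the whole dataset the bounds are so wide that nothing is flagged,
# not even a made-up stray value sitting between the two clusters.
with_stray = data + [50]
print(iqr_outliers(data))        # []
print(iqr_outliers(with_stray))  # []

# Checked group by group, the stray value stands out.
low, high = split_at_largest_gap(with_stray)
print(iqr_outliers(low), iqr_outliers(high))  # [50] []
```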

Getting Started: Step-by-Step Process

Here's how to actually identify outliers in your data:

Step 1: Plot First

Don't run statistics blind. Create a box plot or histogram. See what your data looks like before applying any formulas.
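If no plotting library is at hand, even a crude text histogram shows the shape of the data. The following is a minimal standard-library sketch run on the temperature readings from Problem 2, not a substitute for a real plot. The function names, bin count and bar width are arbitrary choices.

```python
def bin_counts(values, bins=8):
    """Return (lowest value, bin width, counts) for equal-width bins."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum would index one past the end; clamp it into the last bin.
        idx = min(int((v - lo) / span * bins), bins - 1)
        counts[idx] += 1
    return lo, span / bins, counts

def text_histogram(values, bins=8, width=30):
    """Render one row per bin: left edge, a bar of '#', and the count."""
    lo, step, counts = bin_counts(values, bins)
    peak = max(counts)
    rows = []
    for i, count in enumerate(counts):
        bar = "#" * round(count / peak * width)
        rows.append(f"{lo + i * step:8.1f}+ | {bar} {count}")
    return "\n".join(rows)

temps_f = [72, 74, 71, 73, 75, 69, 70, 150]
print(text_histogram(temps_f))
```

Seven readings pile up in the first bin and a single one sits alone in the last, which is exactly the pattern to look for before choosing a method.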

Step 2: Choose Your Method

If your data is roughly normal and you have more than a handful of points, use Z-scores. If it's skewed, small, or you're unsure, use IQR. For complex datasets, consider machine learning approaches like Isolation Forest.

Step 3: Apply the Thresholds

Flag points beyond your chosen bounds. Document which method you used and why.

Step 4: Investigate, Don't Just Remove

Every outlier has a story. Some are errors. Some are your most valuable data points. A hospital's $2 million medical claim isn't necessarily fraud; it may well be a real case that needs separate analysis.

Step 5: Test Your Results

Run your analysis with and without flagged outliers. See how much your results change. If nothing shifts, the outliers probably don't matter for your specific question.
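A sensitivity check like this can be as simple as computing the same summary twice. The sketch below uses the salary figures from Problem 1. The choice of mean and median as the summary, and the helper name `summarize`, are assumptions for illustration; in practice you would rerun whatever analysis you actually care about.

```python
import statistics

salaries = [45_000, 52_000, 48_000, 51_000, 47_000, 210_000, 49_000]
flagged = {210_000}  # the value flagged by the IQR check in Problem 1

def summarize(values):
    """Return the statistics whose stability we want to check."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
    }

with_outliers = summarize(salaries)
without_outliers = summarize([s for s in salaries if s not in flagged])

for name in ("mean", "median"):
    print(f"{name}: {with_outliers[name]:,.0f} -> {without_outliers[name]:,.0f}")
# mean: 71,714 -> 48,667
# median: 49,000 -> 48,500
```

Here the mean moves by roughly a third while the median barely changes, so whether the outlier matters depends on which of the two your question relies on.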

Common Mistakes That Sabotage Your Analysis

When to Use Advanced Methods

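Standard statistical methods hit their limits fast with high-dimensional data. That is the point to level up to the machine learning methods from the comparison above: Isolation Forest for high-dimensional data, or DBSCAN clustering when the outliers form clusters of their own.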

The Bottom Line

Outlier detection isn't a one-step process. It requires judgment, visualization, and domain knowledge. The method you choose depends on your data distribution, your question, and what you'll do with the results.

Start simple. Plot your data. Apply IQR or Z-scores. Investigate what you find. Only move to advanced methods when the basics aren't enough.