Residual Plot: Statistical Analysis Guide
What the Hell Is a Residual Plot?
A residual plot is a scatter plot that shows your model's predicted values on one axis and the actual residuals (errors) on the other. That's it. Nothing fancy.
The residual for any data point is simply:
Residual = Actual Value - Predicted Value
If your model predicts a house costs $300,000 but it actually sold for $285,000, your residual is -$15,000. Plot all your residuals against predicted values and you've got a residual plot.
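As a quick sanity check on the arithmetic, here is a minimal sketch with NumPy. The three prices are made-up illustrative numbers (the first pair is the house from the example above), not real data.

```python
import numpy as np

# Hypothetical sale prices and model predictions (illustrative numbers only)
actual = np.array([285_000, 410_000, 199_500])
predicted = np.array([300_000, 395_000, 205_000])

# Residual = actual - predicted; a negative value means the model over-predicted
residuals = actual - predicted
print(residuals.tolist())  # [-15000, 15000, -5500]
```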
Why does this matter? Because your R-squared score can lie to you. A model can have a great R² and still be completely wrong. Residual plots expose those lies.
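To make that concrete, here is a small sketch on synthetic data (an assumption for illustration, not a real dataset). A straight line is fit to data that is actually quadratic. The R² comes out high, yet the residuals are positive at both ends and negative in the middle, which is exactly the kind of systematic miss R² hides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = x**2 + rng.normal(0, 2, size=x.size)  # the true relationship is quadratic

# Fit a straight line anyway
X = x.reshape(-1, 1)
pred = LinearRegression().fit(X, y).predict(X)
residuals = y - pred

r2 = r2_score(y, pred)

# x is sorted, so splitting the residuals in thirds gives low / middle / high x
low, mid, high = np.array_split(residuals, 3)
print(f"R^2 = {r2:.3f}")
print(f"mean residual: low={low.mean():.2f}, mid={mid.mean():.2f}, high={high.mean():.2f}")
```

If the model were adequate, all three mean residuals would sit near zero. Here they do not, even though R² looks great.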
Why You Should Care About Residual Analysis
Most people check their model fit once, see a decent R², and move on. That's lazy. Residual analysis tells you things R² cannot:
- Whether your model assumptions are actually valid
- Where your model systematically over-predicts or under-predicts
- Whether outliers are skewing your entire analysis
- Whether you need a different model type entirely
If you're building regression models and not checking residuals, you're essentially flying blind.
Reading a Residual Plot: The Basics
The Ideal: Random Scatter
A good residual plot looks like random noise scattered evenly around zero. No patterns, no funnels, no curves. Just chaos in the best possible way.
Think of it like this: your model should miss by random amounts, not systematic ones. If residuals show a pattern, your model is leaving money on the table, or making predictions you shouldn't trust.
What You're Actually Looking For
On the horizontal axis, you have predicted values. On the vertical axis, residuals. The horizontal line at zero is your reference point—residuals should hover around it with equal spread.
The goal: No discernible pattern. Points should look like they've been shotgun-blasted across the plot, not arranged in any geometric shape.
Common Residual Patterns and What They Mean
1. The Funnel / Cone Shape
Residuals spread out as predicted values increase. This is called heteroscedasticity, a fancy word for "the size of your model's errors changes depending on the prediction range."
Your model is great at predicting small values but falls apart for large ones. Or vice versa. Either way, the standard errors from ordinary least squares are unreliable, so your confidence intervals and p-values are garbage.
Fix it: Try transforming your target variable (log, square root), use weighted regression, or switch to a model that handles non-constant variance better.
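Here is a rough sketch of the log-transform fix on synthetic data with multiplicative noise (an assumption chosen so the funnel is obvious). The helper `spread_ratio` is a hypothetical name: it compares the residual spread in the upper half of fitted values to the lower half, so a value near 1 suggests roughly constant variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(1, 10, size=(500, 1))
# Multiplicative noise: the error grows with the size of y (a funnel)
y = np.exp(0.3 * X[:, 0] + rng.normal(0, 0.3, size=500))

def spread_ratio(X, target):
    """Residual std for large fitted values divided by std for small ones."""
    model = LinearRegression().fit(X, target)
    fitted = model.predict(X)
    residuals = target - fitted
    upper = fitted > np.median(fitted)
    return residuals[upper].std() / residuals[~upper].std()

raw_ratio = spread_ratio(X, y)          # well above 1: funnel
log_ratio = spread_ratio(X, np.log(y))  # close to 1: spread is roughly even
print(f"raw target: {raw_ratio:.2f}, log target: {log_ratio:.2f}")
```

A log transform only works for strictly positive targets, and predictions then live on the log scale, so remember to transform back before reporting them.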
2. The U-Curve or Parabola
Residuals are negative on both ends and positive in the middle—or the reverse. This screams non-linearity.
Your data has curves. Your linear model can't see them. It's trying to draw a straight line through curved data, which means it's systematically wrong at the extremes.
Fix it: Add polynomial terms, use spline regression, or switch to a non-linear model entirely.
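A minimal sketch of the polynomial fix, assuming synthetic quadratic data and scikit-learn's `PolynomialFeatures`. The helper `middle_vs_ends` is a made-up diagnostic for this example: it measures how far the middle residuals sit from the end residuals, so a value near zero means the curve is gone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 2 - 2 * X[:, 0] + rng.normal(0, 1, size=200)

def middle_vs_ends(model):
    """Mean residual in the middle third minus the average of the outer thirds."""
    residuals = y - model.fit(X, y).predict(X)
    low, mid, high = np.array_split(residuals, 3)  # X is sorted
    return mid.mean() - (low.mean() + high.mean()) / 2

linear_gap = middle_vs_ends(LinearRegression())
quad_gap = middle_vs_ends(
    make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
)
print(f"linear: {linear_gap:.2f}, quadratic: {quad_gap:.2f}")
```

The straight line leaves a large gap between the middle and the ends. Adding the squared term brings it close to zero.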
3. The Slanted Line or Trend
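Instead of random scatter around zero, you see residuals trending upward or downward. This indicates systematic bias: your model under-predicts at one end of the range and over-predicts at the other. Note that ordinary least squares with an intercept forces residuals to be uncorrelated with the fitted values on the training data, so you will usually see this on held-out data, with other model types, or when you plot residuals against time or a predictor you left out.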
Fix it: Your model specification is wrong. You might be missing a key predictor or your model form doesn't fit your data.
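One way to hunt for a missing predictor is to compare residuals against candidate variables that are not in the model. The sketch below uses synthetic data where `x2` is deliberately left out of the first model. The variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)  # the predictor we "forget" in the first model
y = 3 * x1 + 2 * x2 + rng.normal(0, 0.5, size=n)

# Model that omits x2
X_small = x1.reshape(-1, 1)
res_small = y - LinearRegression().fit(X_small, y).predict(X_small)

# Model that includes both predictors
X_full = np.column_stack([x1, x2])
res_full = y - LinearRegression().fit(X_full, y).predict(X_full)

# Correlation between the omitted variable and each model's residuals
corr_small = np.corrcoef(x2, res_small)[0, 1]
corr_full = np.corrcoef(x2, res_full)[0, 1]
print(f"x2 left out: {corr_small:.2f}, x2 included: {corr_full:.2f}")
```

A strong correlation between residuals and a variable you did not include is a good hint that it belongs in the model.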
4. The Outlier Cluster
One or two points way out in left field. These are data points your model completely whiffed on.
Before you delete them, figure out why they're different. Sometimes outliers contain your most valuable information. Sometimes they're data entry errors. Know which before you act.
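To find which rows deserve that investigation, a simple approach is to scale the residuals and flag anything beyond about three standard deviations. This is a rough sketch on synthetic data with one deliberately corrupted observation. The cutoff of 3 is a common rule of thumb, not a law, and this simple scaling ignores leverage.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 1, size=100)
y[10] += 15  # inject one corrupted observation for illustration

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Rough standardization: divide by the residual standard deviation
# (ddof=2 accounts for the two fitted parameters: slope and intercept)
z = residuals / residuals.std(ddof=2)

flagged = np.flatnonzero(np.abs(z) > 3)
print(flagged.tolist())  # row indices to go look at, not to delete blindly
```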
5. The Stacked Parallel Lines
This happens with discrete or rounded data. Residuals pile up along parallel bands instead of spreading continuously, because each distinct target value traces its own line (slanted when plotted against predicted values). Not a model failure, just a visualization quirk that makes interpretation harder.
How to Create a Residual Plot (Practical Guide)
In Python with matplotlib and scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Fit your model (assumes X_train and y_train are already defined)
model = LinearRegression()
model.fit(X_train, y_train)
# Get predictions
y_pred = model.predict(X_train)
# Calculate residuals
residuals = y_train - y_pred
# Plot
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
That's the bare minimum. For real analysis, you'll want to add:
- Standardized residuals (each residual divided by its estimated standard deviation)
- A LOWESS smoother line to detect subtle patterns
- Labels for potential outliers
In R
# Using base R
model <- lm(y ~ x1 + x2, data = mydata)
plot(model$fitted.values, model$residuals)
abline(h = 0)
# Using ggplot2 for better visuals
library(ggplot2)
diag_df <- data.frame(fitted = fitted(model), resid = resid(model))
ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(method = "loess", se = FALSE)
In Excel
Yes, it works, but it's painful:
- Run regression using Data Analysis ToolPak
- Save residuals to a column
- Create scatter plot with predicted values on x-axis, residuals on y-axis
- Add a horizontal line at zero manually
Excel gets the job done for simple checks. Don't try this for serious modeling work.
Tools for Residual Analysis
| Tool | Best For | Learning Curve | Cost |
|---|---|---|---|
| Python (matplotlib/seaborn) | Custom analysis, automation, production | Medium | Free |
| R | Statistical rigor, academic work | Medium | Free |
| JMP | Quick visual exploration, DOE | Low | Expensive |
| SPSS | Social science, standard regression | Low | Expensive |
| Excel | Quick checks, small datasets | Low | Included in Office |
| Tableau | Interactive dashboards, presentations | Low-Medium | Subscription |
Python or R will handle nearly everything you need. The others have their niches but aren't worth the investment unless you have specific reasons.
Formal Tests to Pair With Your Plot
Visual inspection is good. Numbers make it harder to fool yourself. Run these tests alongside your residual plot:
- Breusch-Pagan test: Tests for heteroscedasticity (the funnel problem)
- Shapiro-Wilk test: Tests if residuals are normally distributed
- Durbin-Watson test: Tests for autocorrelation (critical for time series)
- Jarque-Bera test: Another normality check, based on skewness and kurtosis. It relies on a large-sample approximation, so it suits bigger datasets better than small ones
No single test tells the whole story. Use the plot as your primary tool, tests as backup confirmation.
When Your Residual Plot Is Trying to Tell You Something
Here's the quick reference for what patterns mean:
- Random scatter around zero: Your model assumptions hold. You're good.
- Funnel shape: Non-constant variance. Fix with transformation or different model.
- Curved pattern: Missing non-linearity. Add polynomial terms or switch models.
- Trend in residuals: Model specification problem. Rethink your predictors.
- Outliers far from the rest: Investigate. Don't just delete.
- Alternating positive/negative blocks (when residuals are plotted in time or row order): Likely autocorrelation. You might have a time series issue or need to check your data ordering.
The Bottom Line
Residual plots are not optional. They're the difference between checking your work and assuming your work is correct.
Build the plot. Look for patterns. If you see them, your model isn't finished. If you see random scatter, you've still got work to do—checking those formal tests and making sure you're not missing edge cases.
No residual plot is perfect. The goal isn't perfection. The goal is catching the obvious failures before they bite you in production.