Linear Regression Summary: A Practical Example Walkthrough
What Linear Regression Actually Is
Linear regression is a statistical method for finding the straight line that best fits a set of data points. That's it. Nothing fancy.
You have some x values and some y values. Linear regression finds the line y = mx + b that minimizes the sum of squared vertical distances between the line and the points. That "best fit" line lets you predict y when you only know x.
It's one of the most used prediction methods because it works. It's interpretable, fast, and handles most basic forecasting problems without overcomplicating things.
The Two Types You Need to Know
Simple Linear Regression
One independent variable (x) predicts one dependent variable (y). Think: "predict house price based on square footage."
Multiple Linear Regression
Multiple independent variables predict one dependent variable. Think: "predict house price based on square footage, number of bedrooms, and neighborhood crime rate."
Most real problems use multiple regression. Single variable problems are mostly teaching tools.
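To make the multiple-regression case concrete, here is a minimal sketch using NumPy's least-squares solver. The house data is hypothetical: prices are generated from a made-up rule (50 + 0.1 × square feet + 10 × bedrooms, in $1000s) so the fit has known coefficients to recover. Crime rate is left out to keep the example short.

```python
import numpy as np

# Hypothetical data, invented for illustration only.
sqft = np.array([1000, 1500, 1200, 2000, 1800, 900], dtype=float)
bedrooms = np.array([2, 3, 2, 4, 3, 1], dtype=float)

# Prices (in $1000s) built from a made-up rule so the true coefficients are known.
price = 50 + 0.1 * sqft + 10 * bedrooms

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(sqft), sqft, bedrooms])

# Solve the least-squares problem: minimize ||X @ coef - price||^2.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, per_sqft, per_bedroom = coef

print(f"Intercept: {intercept:.2f}")
print(f"Per square foot: {per_sqft:.4f}")
print(f"Per bedroom: {per_bedroom:.2f}")
```

Each coefficient reads as the change in price for a one-unit change in that variable while the other is held fixed.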
Core Concepts You Must Understand
Before running linear regression, you need to know what you're looking at:
- Slope (m): how much y changes when x increases by 1 unit. A positive slope means y rises with x. A negative slope means y falls as x rises.
- Intercept (b): the predicted y value when x equals zero. Sometimes meaningful, sometimes garbage, depending on whether x = 0 falls inside the range of your data.
- R-squared (R²): the fraction of y's variance your model explains. An R² of 0.85 means your line captures 85% of the variation in y. Higher isn't always better (more on that later).
- Residuals: the differences between actual y values and predicted y values. You want these small and randomly scattered, with no visible pattern.
- P-values: tell you whether each coefficient is statistically distinguishable from zero. Below 0.05 is the common threshold. Above that? The data can't rule out that the variable's apparent effect is just noise.
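The first four quantities can be computed by hand, which is a useful way to see what they mean. The sketch below applies the textbook closed-form formulas for simple regression to a small hypothetical dataset (the numbers are invented for illustration). P-values need a t-distribution, so they are left to a statistics library.

```python
# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Slope: co-variation of x and y divided by the variation of x.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = s_xy / s_xx

# Intercept: forces the line through the point (x_mean, y_mean).
intercept = y_mean - slope * x_mean

# Residuals: actual y minus predicted y.
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# R-squared: 1 minus (unexplained variation / total variation).
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(f"Slope: {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
print(f"R-squared: {r_squared:.4f}")
print("Residuals:", [round(r, 2) for r in residuals])
```

On these numbers the slope comes out at about 1.97 and the intercept at about 1.09, with an R² close to 0.998. The residuals sum to zero, which is always the case when the model includes an intercept.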
A Practical Example With Real Numbers
Let's say you're analyzing advertising spend vs. sales. You have 5 data points:
| Advertising Spend ($1000s) | Sales ($1000s) |
|---|---|
| 1 | 3 |
| 2 | 5 |
| 3 | 7 |
| 4 | 9 |
| 5 | 11 |
Notice a pattern? Every time spend increases by 1, sales increase by 2. This is a perfect linear relationship.
The equation is Sales = 2 × Spend + 1.
The slope is 2. For every additional $1,000 you spend on ads, sales rise by $2,000. The intercept is 1. If you spent nothing, the model says you'd still make $1,000 (maybe from existing customers).
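A quick way to convince yourself of the equation is to run every row of the table through it. The sketch below does that and then predicts sales for a spend level inside the observed range. The function name `predict_sales` is made up for this example.

```python
# Data from the table above (both columns in $1000s).
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

def predict_sales(spend_k):
    """Apply Sales = 2 * Spend + 1 (all values in $1000s)."""
    return 2 * spend_k + 1

# Every row of the table sits exactly on the line.
for s, actual in zip(spend, sales):
    assert predict_sales(s) == actual

# Predict for a spend level inside the observed range.
print(predict_sales(3.5))  # 8.0, i.e. $8,000 in sales for $3,500 of spend
```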
What Happens With Messier Data
Real data never lines up perfectly. The points scatter around the line. Linear regression finds the line that minimizes the sum of squared residuals: each point's vertical distance from the line, squared so that negative and positive errors don't cancel out.
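In symbols, using the same m and b as in y = mx + b and writing the data as n pairs (x_i, y_i), the quantity being minimized and the textbook closed-form solution for the one-variable case are shown below. The notation (SSR, and bars for sample means) is an assumption added here for illustration.

```latex
$$
\mathrm{SSR}(m, b) = \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2
$$

$$
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m \, \bar{x}
$$
```

The second line says the slope is the co-variation of x and y divided by the variation of x, and the intercept is whatever makes the line pass through the point of means.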
Getting Started: How to Run Linear Regression
In Python with scikit-learn
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Your data (sklearn expects X as a 2D array: one row per sample, one column per feature)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 5, 7, 9, 11])

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Get results
slope = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)  # R² on the data the model was fit to

print(f"Slope: {slope}")
print(f"Intercept: {intercept}")
print(f"R-squared: {r_squared}")
```
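scikit-learn's LinearRegression does not report p-values. One option in Python, assuming SciPy is installed, is `scipy.stats.linregress`, which handles the single-predictor case. The sketch below uses slightly noisy hypothetical numbers instead of the perfect data above, because a perfect fit leaves no residual variation to test against.

```python
from scipy.stats import linregress

# Hypothetical noisy data, invented for illustration.
spend = [1, 2, 3, 4, 5]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

result = linregress(spend, sales)

print(f"Slope: {result.slope:.2f}")
print(f"Intercept: {result.intercept:.2f}")
print(f"R-squared: {result.rvalue ** 2:.4f}")
print(f"P-value for the slope: {result.pvalue:.2g}")
print(f"Standard error of the slope: {result.stderr:.3f}")
```

The p-value here tests whether the slope differs from zero. For multiple regression with per-coefficient p-values you would need a different tool.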
In R
```r
# Your data
spend <- c(1, 2, 3, 4, 5)
sales <- c(3, 5, 7, 9, 11)

# Run regression
model <- lm(sales ~ spend)

# Get summary with coefficients, R-squared, p-values
summary(model)
```
In Excel
Select your x and y data, insert a scatter plot, then add a trendline. Check "display equation on chart" and "display R-squared value." Excel gives you slope, intercept, and R² instantly.
Checking If Your Model Is Actually Good
Running linear regression is easy. Running one that's actually valid is harder. Here's what to check:
- Linearity: your data should actually form roughly a line. Scatter plot it first. Curved data needs polynomial terms or a transformed variable, not a plain straight-line fit.
- Homoscedasticity: residuals should have constant variance across all x values. If variance grows as x increases, your standard errors (and the p-values built on them) are unreliable.
- Normality of residuals: for small samples, residuals should be roughly normal for the p-values and confidence intervals to hold. For large samples, the central limit theorem saves you.
- No multicollinearity: in multiple regression, your independent variables shouldn't be highly correlated with each other. VIF scores above 5 are a common red flag.
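The VIF check in the last item can be sketched with NumPy alone. The idea is to regress each predictor on all the others and compute 1 / (1 - R²) for that regression. The helper name `vif` and the synthetic predictors below are made up for this example; it is a sketch under those assumptions, not a replacement for a statistics package.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on all the other columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1 - residuals.var() / target.var()
        factors.append(1 / (1 - r_squared))
    return np.array(factors)

# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

print(vif(np.column_stack([x1, x2, x3])).round(1))
```

The first two values should come out far above 5 because x1 and x2 carry almost the same information, while the third should sit close to 1.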
Common Mistakes That Ruin Your Model
These will destroy your analysis if you ignore them:
- Ignoring p-values: a variable with p=0.4 gives you no real evidence that it helps predict anything. Drop it or question why it's there.
- Trusting a high R²: a high R² doesn't mean a good model. You can get an R² of 0.99 by overfitting. Always check residuals visually.
- Confusing correlation with causation: if your model says ice cream sales predict drowning deaths, that's a confounding variable (temperature) at work. The model doesn't know or care.
- Extrapolating: if your model was trained on x values from 1 to 5, don't trust its predictions at x=1000.
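The R² trap is easy to reproduce. The sketch below fits a straight line to synthetic data that is exactly quadratic (y = x², chosen for illustration), then prints the R² and the residuals.

```python
import numpy as np

# Synthetic curved data: y is exactly x squared, so a straight line is the wrong model.
x = np.arange(1, 11, dtype=float)
y = x ** 2

# Fit a straight line anyway (degree-1 polynomial).
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept
residuals = y - predicted

r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(f"R-squared: {r_squared:.3f}")
print("Residuals:", residuals.round(1))
```

R² lands at roughly 0.95, which looks great on its own. The residuals tell a different story: positive at both ends and negative in the middle, a U shape that signals curvature the line is missing.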
When Linear Regression Is the Right Tool
Use it when:
- You want interpretable results (you can explain exactly what each coefficient means)
- Your relationship is actually linear
- You need fast predictions
- You need a baseline to beat with more complex models
Don't use it when:
- Your outcome is categorical (use logistic regression instead)
- Your data has complex non-linear patterns (try random forests or neural networks)
- You have high-dimensional data with sparse features (use regularized variants like Lasso or Ridge instead)
Quick Comparison: Linear Regression vs. Alternatives
| Method | Best For | Interpretability | Complexity |
|---|---|---|---|
| Linear Regression | Linear relationships, baselines | High | Low |
| Polynomial Regression | Curved but simple patterns | Medium | Low-Medium |
| Decision Trees | Non-linear, mixed data types | Medium | Medium-High |
| Random Forest | Complex patterns, less overfitting | Low | High |
| Neural Networks | Images, text, complex interactions | Very Low | Very High |
The Bottom Line
Linear regression is a workhorse. It's been around for over a century because it solves real problems without unnecessary complexity.
Plot your data first. Check assumptions. Read the coefficients. Validate on held-out data. That's the entire process.
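The hold-out step is the one people skip most often, so here is a minimal sketch of it. The data is synthetic (generated from a made-up rule, y = 2x + 1 plus noise), and the 80/20 split is an arbitrary choice for illustration.

```python
import numpy as np

# Synthetic data from a made-up rule: y = 2x + 1 plus random noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 1, size=100)

# Shuffle the row indices, then hold out the last 20% as a test set.
idx = rng.permutation(len(x))
train, test = idx[:80], idx[80:]

# Fit on the training rows only.
slope, intercept = np.polyfit(x[train], y[train], 1)

def r_squared(x_vals, y_vals):
    """R-squared of the fitted line on the given points."""
    residuals = y_vals - (slope * x_vals + intercept)
    return 1 - np.sum(residuals ** 2) / np.sum((y_vals - y_vals.mean()) ** 2)

print(f"Train R-squared: {r_squared(x[train], y[train]):.3f}")
print(f"Test R-squared:  {r_squared(x[test], y[test]):.3f}")
```

If the test score is far below the training score, the model is fitting noise in the training rows and won't predict new data as well as the training number suggests.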
Don't overthink it. Don't add layers you don't need. If linear regression fits your data well, use it. More complex models aren't automatically better — they're just harder to explain when someone asks why your prediction is what it is.