Linear Regression: A Complete Guide to Analysis
What Linear Regression Actually Is
Linear regression is a statistical method that finds the straight line best fitting a set of data points. That's it. Nothing fancy.
You have some input variable X and some output variable Y. You want to predict Y based on X. Linear regression draws the line through your data that minimizes the total squared vertical distance between the line and the points.
It's the simplest supervised learning algorithm. If you're starting with predictive modeling, this is where you begin.
Why Linear Regression Still Matters
New algorithms appear every month. Random forests, neural networks, gradient boosting—they get attention because they're complex and sound impressive.
Linear regression is still the workhorse of applied statistics for three reasons:
- It's interpretable. You can explain exactly why predictions are made.
- It works well when relationships are actually linear (which happens more than you'd think).
- It sets a baseline. If a complex model doesn't beat a simple linear regression, your feature engineering needs work.
The Two Types You Need to Know
Simple Linear Regression
One independent variable, one dependent variable. You find the line between two variables.
Example: predicting house prices from square footage alone.
The equation is y = mx + b where m is slope and b is intercept. You already know this from high school algebra.
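To make the slope and intercept concrete, here is a minimal sketch that computes them by hand. The square-footage and price values are made up for illustration, and the variable names are assumptions, not part of any real dataset.

```python
import numpy as np

# Hypothetical data: square footage (x) and price in thousands of dollars (y)
x = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([210.0, 265.0, 350.0, 400.0, 485.0])

# Slope m = sum of cross-deviations / sum of squared x-deviations
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Intercept b makes the line pass through the point (mean of x, mean of y)
b = y.mean() - m * x.mean()

print(f"slope m = {m:.4f}, intercept b = {b:.2f}")
```

The same two numbers should come back from `np.polyfit(x, y, 1)`, which is a quick way to check the arithmetic.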
Multiple Linear Regression
Multiple independent variables, one dependent variable. This is what you use in practice.
Example: predicting house prices from square footage, number of bedrooms, neighborhood crime rate, and age of house.
The equation becomes y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Each β represents how much y changes when that specific x increases by one unit, holding all other variables constant. That's the key to interpretation.
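A small sketch of that interpretation, using synthetic data with coefficients chosen for illustration (nothing here comes from a real housing dataset): raise one predictor by a single unit, keep the other fixed, and the prediction should move by exactly that predictor's coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two predictors, known coefficients
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Fit by least squares; the column of ones estimates the intercept
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Raise x1 by one unit while holding x2 fixed at 0.5
base = np.array([1.0, 0.5, 0.5])
bumped = np.array([1.0, 1.5, 0.5])
change = bumped @ beta - base @ beta

print(f"beta_1 = {beta[1]:.3f}, change in prediction = {change:.3f}")
```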
The Math Behind It
Linear regression finds the coefficients that minimize the sum of squared residuals. This method is called ordinary least squares (OLS).
For each data point, you calculate the difference between the predicted value and the actual value, then square it. Sum all those squared differences. The best model makes this sum as small as possible.
Squaring serves two purposes: it makes all differences positive (negative and positive errors don't cancel out) and it penalizes larger errors more heavily.
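The calculation above can be sketched in a few lines. The data points and the two candidate lines are hypothetical, picked only to show how the sum of squared residuals ranks one line against another.

```python
import numpy as np

# Hypothetical observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_squared_residuals(slope, intercept):
    """Sum of squared differences between actual and predicted y."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

# Two candidate lines: the one with the smaller sum fits better
ssr_good = sum_squared_residuals(2.0, 0.0)
ssr_poor = sum_squared_residuals(1.0, 2.0)
print(f"y = 2x: {ssr_good:.2f}   y = x + 2: {ssr_poor:.2f}")

# Squaring penalizes large errors more heavily: one error of 4 costs 16,
# while four errors of 1 cost only 4 in total
one_large = np.sum(np.array([4.0, 0.0, 0.0, 0.0]) ** 2)
four_small = np.sum(np.array([1.0, 1.0, 1.0, 1.0]) ** 2)
print(one_large, four_small)
```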
No iteration needed—OLS has a closed-form solution. You can calculate the exact coefficients with matrix algebra in one step.
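In matrix form the closed-form solution is β = (XᵀX)⁻¹Xᵀy, where X includes a column of ones for the intercept. A minimal sketch on synthetic data (the coefficients and noise level are assumptions chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative): 100 rows, 3 predictors
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([0.5, -2.0, 3.0]) + rng.normal(scale=0.2, size=100)

# Add a column of ones so the first coefficient is the intercept
A = np.column_stack([np.ones(len(X)), X])

# Normal equations: solve (AᵀA) β = Aᵀy rather than inverting AᵀA explicitly
beta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# np.linalg.lstsq gives the same answer using an SVD-based solver
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.round(beta_normal, 3))
```

In practice you would usually let a library solve this for you; solvers like `lstsq` tend to behave better than the raw normal equations when predictors are nearly collinear.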
Assumptions You Can't Ignore
Linear regression makes specific assumptions about your data. Violate them and your coefficients, standard errors, or predictions can be badly misleading.
1. Linearity
The relationship between X and Y must be linear. Check this with scatter plots before you run the model. If the relationship is curved, transform your variables or use polynomial regression.
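As a rough illustration of the fix, the sketch below fits a straight line and then a line plus a squared term to deliberately curved synthetic data. The helper `r_squared` is a hypothetical name written for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic curved relationship (an assumption for illustration)
x = np.linspace(-3, 3, 60)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=x.size)

def r_squared(design, target):
    """Fit by least squares and return R-squared on the same data."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ beta
    return 1 - np.sum(residuals**2) / np.sum((target - target.mean())**2)

ones = np.ones_like(x)
r2_line = r_squared(np.column_stack([ones, x]), y)        # straight line only
r2_quad = r_squared(np.column_stack([ones, x, x**2]), y)  # adds an x² term

print(f"straight line R² = {r2_line:.3f}, with x² term R² = {r2_quad:.3f}")
```

The model with the squared term is still linear regression: it is linear in the coefficients, even though it is curved in x.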
2. Independence
Each observation must be independent of others. Time series data often violates this—consecutive values are correlated. Use specialized methods for time-dependent data.
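One common way to check for this in ordered data is the Durbin-Watson statistic, which is not described above but is a standard diagnostic. Values near 2 suggest no first-order autocorrelation, and values well below 2 suggest positive autocorrelation. A sketch on simulated residuals:

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(3)

# Independent residuals: each value is unrelated to the previous one
independent = rng.normal(size=500)

# Autocorrelated residuals: each value carries over 90% of the previous one
noise = rng.normal(size=500)
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.9 * correlated[t - 1] + noise[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print(f"independent: {dw_independent:.2f}, autocorrelated: {dw_correlated:.2f}")
```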
3. Homoscedasticity
The variance of residuals must be constant across all values of X. If variance increases as X increases (heteroscedasticity), the coefficient estimates are still unbiased, but the usual standard errors are wrong, so your significance tests and confidence intervals can't be trusted.
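An informal way to spot this, short of a formal test such as Breusch-Pagan, is to compare the spread of residuals at low and high values of X. The data below is synthetic and built so that the noise grows with x.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data where the noise grows with x (heteroscedastic by construction)
x = np.linspace(1, 10, 300)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)

# Fit a straight line and compute residuals
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

# Compare residual spread in the lower and upper halves of x
half = x.size // 2
spread_low = residuals[:half].std()
spread_high = residuals[half:].std()
print(f"residual std, low x: {spread_low:.2f}, high x: {spread_high:.2f}")
```

A plot of residuals against fitted values shows the same thing visually: a fan shape instead of an even band.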
4. Normality
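Residuals should follow a normal distribution. This matters most for small samples and for confidence intervals. With larger samples (n > 30 is a common rule of thumb, not a guarantee), the central limit theorem makes the coefficient estimates approximately normal, so inference holds up reasonably well unless the residuals are heavily skewed or heavy-tailed.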
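A quick numeric check, alongside a histogram or Q-Q plot, is the skewness of the residuals. The sketch below uses simulated residuals, and `skewness` is a helper written for this example rather than a library function.

```python
import numpy as np

def skewness(values):
    """Sample skewness: roughly 0 for symmetric distributions such as the normal."""
    centered = values - values.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

rng = np.random.default_rng(5)

# Simulated residuals: one set normal, one set mean-zero but right-skewed
normal_residuals = rng.normal(size=2000)
skewed_residuals = rng.exponential(size=2000) - 1.0

skew_normal = skewness(normal_residuals)
skew_skewed = skewness(skewed_residuals)
print(f"normal: {skew_normal:.2f}, skewed: {skew_skewed:.2f}")
```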
5. No Perfect Multicollinearity
In multiple regression, none of your independent variables can be perfect linear combinations of others. If two variables move together perfectly, the model can't separate their individual effects.
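One way to detect this before fitting is to check the rank of the predictor matrix. In this sketch the third column is built as an exact combination of the first two, so the rank comes out lower than the number of columns. The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical predictors: x3 is an exact linear combination of x1 and x2
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = 2.0 * x1 + 3.0 * x2
X = np.column_stack([x1, x2, x3])

# Full column rank would be 3; a lower rank signals perfect multicollinearity
rank = np.linalg.matrix_rank(X)
print(f"columns: {X.shape[1]}, rank: {rank}")
```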
How to Build a Linear Regression Model
Step 1: Prepare Your Data
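Remove obvious errors. Handle missing values: either drop the rows or impute them. Encode categorical variables as dummy (one-hot) columns. A binary category needs one column, and a category with k levels needs k - 1, because including all k alongside an intercept creates perfect multicollinearity.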
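A minimal sketch of those preparation steps with pandas. The column names and values are hypothetical, and median imputation is just one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value and a categorical column
df = pd.DataFrame({
    "sqft": [1200.0, 1500.0, np.nan, 2000.0],
    "neighborhood": ["north", "south", "east", "north"],
    "price": [250.0, 300.0, 280.0, 400.0],
})

# Option A: drop rows with missing values
dropped = df.dropna()

# Option B: impute missing sqft with the column median
imputed = df.assign(sqft=df["sqft"].fillna(df["sqft"].median()))

# One-hot encode; drop_first=True omits one level so the dummy columns
# are not a perfect linear combination of the intercept
encoded = pd.get_dummies(imputed, columns=["neighborhood"], drop_first=True)
print(encoded.columns.tolist())
```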
Step 2: Check for Linearity
Plot each independent variable against the dependent variable. Look for patterns. Scatter plots reveal whether relationships are linear or require transformation.
Step 3: Fit the Model
In Python with scikit-learn:
```python
from sklearn.linear_model import LinearRegression

# X_train, y_train, and X_test come from your own train/test split
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
In R:
```r
model <- lm(y ~ x1 + x2 + x3, data = dataset)
summary(model)
```
Step 4: Evaluate the Model
Don't just look at R-squared. It's not enough.
Evaluation Metrics You Should Actually Use
| Metric | What It Measures | What Good Looks Like |
|---|---|---|
| R-squared | Proportion of variance explained | Depends on context. 0.7 is strong in social sciences, weak in physics. |
| Adjusted R-squared | R-squared adjusted for number of predictors | Use this instead of R-squared for multiple regression. |
| RMSE | Square root of the mean squared error, in the original units of Y | Lower is better. Compare across models on the same dataset. |
| MAE | Average absolute error | More robust to outliers than RMSE. |
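R-squared on its own doesn't tell you whether predictions are accurate enough to use. A model with R² = 0.9 can still make errors too large for your application, because R² is measured relative to the variance of Y, not to the precision you need. Always check RMSE or MAE in the original units as well.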
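All four metrics in the table can be computed directly from actual and predicted values. The numbers below are hypothetical, and `p = 2` assumes the predictions came from a model with two predictors.

```python
import numpy as np

# Hypothetical actual and predicted values from a model with 2 predictors
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5, 10.0, 13.5])
n, p = len(y_true), 2

errors = y_true - y_pred
ss_res = np.sum(errors**2)                       # unexplained variation
ss_tot = np.sum((y_true - y_true.mean())**2)     # total variation

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)    # penalizes extra predictors
rmse = np.sqrt(np.mean(errors**2))
mae = np.mean(np.abs(errors))

print(f"R²={r2:.3f}  adj R²={adj_r2:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest.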
Common Mistakes That Ruin Your Model
Ignoring multicollinearity. When predictors are correlated with each other, coefficient estimates become unstable and hard to interpret. Calculate Variance Inflation Factors (VIF) for each variable. A VIF above 5 is a common rule-of-thumb warning sign (some analysts use 10).
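The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. A sketch with a hand-written `vif` helper (a name chosen for this example) and synthetic predictors:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X: 1 / (1 - R²) from
    regressing that column on all the other columns plus an intercept."""
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ beta
        r2 = 1 - np.sum(residuals**2) / np.sum((target - target.mean())**2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.2, size=300)  # strongly correlated with x1
x3 = rng.normal(size=300)                  # unrelated to the others
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```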
Overfitting with too many variables. Adding variables never decreases R-squared on the training data, even if they're random noise. Use adjusted R-squared, AIC, or cross-validation to compare models with different numbers of predictors.
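To see the effect, the sketch below adds 30 columns of pure noise to a model with one real predictor and compares R-squared on training rows and held-out rows. Everything here is synthetic. Training R-squared cannot go down, while held-out R-squared typically gets worse.

```python
import numpy as np

rng = np.random.default_rng(8)

def fit(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r2(X, y, beta):
    A = np.column_stack([np.ones(len(X)), X])
    residuals = y - A @ beta
    return 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# One real predictor plus 30 columns of pure noise (synthetic, for illustration)
n_train, n_test = 60, 60
x_real = rng.normal(size=(n_train + n_test, 1))
noise_cols = rng.normal(size=(n_train + n_test, 30))
y = 2.0 * x_real[:, 0] + rng.normal(size=n_train + n_test)

X_base = x_real
X_noisy = np.column_stack([x_real, noise_cols])
tr, te = slice(0, n_train), slice(n_train, None)

beta_base = fit(X_base[tr], y[tr])
beta_noisy = fit(X_noisy[tr], y[tr])

train_base = r2(X_base[tr], y[tr], beta_base)
train_noisy = r2(X_noisy[tr], y[tr], beta_noisy)
test_base = r2(X_base[te], y[te], beta_base)
test_noisy = r2(X_noisy[te], y[te], beta_noisy)
print(f"train R²: {train_base:.3f} -> {train_noisy:.3f}")
print(f"test  R²: {test_base:.3f} -> {test_noisy:.3f}")
```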
Not standardizing before comparing coefficients. Raw coefficients are on different scales. A coefficient of 500 for "dollars" and 0.02 for "interest rate" doesn't tell you which matters more. Standardize your variables first.
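A sketch of the difference standardizing makes, using synthetic predictors on very different scales. The variable names, scales, and true coefficients are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(9)

# Synthetic predictors on very different scales (illustrative)
n = 500
income = rng.normal(loc=50_000, scale=15_000, size=n)  # dollars
rate = rng.normal(loc=0.05, scale=0.01, size=n)        # interest rate
y = 0.001 * income + 2000.0 * rate + rng.normal(scale=5.0, size=n)

def ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]  # drop the intercept

X = np.column_stack([income, rate])
raw = ols(X, y)

# Standardize each predictor to mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = ols(X_std, y)

print("raw coefficients:         ", raw)
print("standardized coefficients:", standardized)
```

The raw coefficients differ by several orders of magnitude, but after standardizing each one is the change in y per one standard deviation of its predictor, which puts them on a comparable footing.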
Extrapolating beyond your data. The model has no idea what happens outside the range of your training data. Predictions there are guesses, not estimates.
Ignoring outliers. OLS is sensitive to extreme values because of the squaring. Check for points with large residuals and high leverage. Sometimes they're data entry errors. Sometimes they're the most interesting part of your data.
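Leverage can be read off the diagonal of the hat matrix H = A(AᵀA)⁻¹Aᵀ, where A is the predictor matrix with an intercept column. In this sketch the data is hypothetical and the last point sits far from the rest in x.

```python
import numpy as np

# Hypothetical data: the last point sits far from the others in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 8.0])

A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

# Leverage is the diagonal of the hat matrix H = A (AᵀA)⁻¹ Aᵀ
H = A @ np.linalg.solve(A.T @ A, A.T)
leverage = np.diag(H)

for xi, ri, hi in zip(x, residuals, leverage):
    print(f"x={xi:5.1f}  residual={ri:6.2f}  leverage={hi:.2f}")
```

A high-leverage point pulls the line toward itself, so its residual can look small even when it is distorting the fit. That is why you check residuals and leverage together.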
When Linear Regression Is the Wrong Tool
Don't force linear regression onto every problem.
- Classification problems. Predicting categories (spam/not spam, default/no default) requires logistic regression or other classifiers.
- Nonlinear relationships. If your scatter plot shows a clear curve, transform variables or use polynomial regression, splines, or GAMs.
- Complex interactions. When the effect of one variable depends on another, you need interaction terms. At some point, tree-based methods handle this more easily.
- High-dimensional data. When you have more variables than observations, OLS has no unique solution. Use regularization (ridge, lasso) instead.
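For the high-dimensional case in the last bullet, here is a sketch of why ridge helps. The data is synthetic, the penalty `lam = 1.0` is arbitrary, and the intercept is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(10)

# Synthetic data with more predictors (50) than observations (20)
n, p = 20, 50
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

# X has rank at most n, so the p×p matrix XᵀX is singular and
# the OLS normal equations have no unique solution
rank = np.linalg.matrix_rank(X)

# Ridge adds lam * I, which makes the system solvable for any lam > 0
lam = 1.0  # penalty strength, chosen arbitrarily here
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(f"rank of X: {rank} (p = {p}), ridge coefficients: {beta_ridge.shape}")
```

In real work you would pick the penalty by cross-validation, for example with scikit-learn's `RidgeCV`, rather than fixing it by hand.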
Getting Started: Your First Linear Regression in Python
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load data
df = pd.read_csv('your_data.csv')

# Define features and target
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']

# Split data (random_state fixes the shuffle so results are reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")

# See coefficients
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.4f}")
```
Run this, check your residuals, verify your assumptions, and you'll have a working model.
The Bottom Line
Linear regression isn't sexy. The math is over two centuries old. Plenty of data scientists will tell you to learn more advanced methods.
They'd be wrong to dismiss it. Linear regression is interpretable, fast, and often performs well enough. Before you reach for complex algorithms, prove you can't beat a straight line.
If your data's relationship is genuinely linear, nothing beats it for clarity and reliability.