Least Squares Regression Line Equation: Complete Guide

What Is the Least Squares Regression Line?

The least squares regression line is the straight line that best fits a scatter plot of data points. It minimizes the sum of the squared vertical distances between the actual data points and the line itself.

That's it. No philosophy. No debate. It's a mathematical tool that finds the one line that comes closest to all your points at once.

You might hear it called the line of best fit, LSRL, or simply "the regression line." Same thing.

Why "Least Squares"?

The word "squares" tells you exactly how the line is chosen. For each data point, you calculate how far it sits from the line vertically. That vertical gap is called the residual. Then you square each residual (to remove negatives) and add them all up.

The regression line is the one that makes this sum as small as possible. Hence: least squares.
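To make the "sum of squares" concrete, here is a minimal Python sketch. The four data points and the two candidate lines are made up for illustration, and the helper name sum_squared_residuals is hypothetical. The second candidate happens to be the least squares line for this data, so no other straight line can score lower.

```python
def sum_squared_residuals(xs, ys, slope, intercept):
    """Add up the squared vertical gaps between each point and the line."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = slope * x + intercept
        residual = y - predicted   # vertical gap, may be negative
        total += residual ** 2     # squaring removes the sign
    return total

# Illustrative data (not from the rent example)
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

sse_a = sum_squared_residuals(xs, ys, slope=1.0, intercept=1.0)  # a plausible guess
sse_b = sum_squared_residuals(xs, ys, slope=1.9, intercept=0.0)  # the least squares line

print(f"candidate A: {sse_a:.2f}")  # candidate A: 11.00
print(f"candidate B: {sse_b:.2f}")  # candidate B: 0.70
```

Candidate A looks reasonable by eye, but its squared errors add up to far more than candidate B's.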

The Least Squares Regression Line Equation

The formula looks like this:

ŷ = bx + a

Where:

ŷ (y-hat) = predicted value of y
b = slope of the line
x = the input variable
a = y-intercept (where the line crosses the y-axis)

The Slope Formula (b)

b = Σ[(xi - x̄)(yi - ȳ)] / Σ[(xi - x̄)²]

This tells you how much the predicted y changes for each one-unit change in x. A slope of 2.5 means the predicted y goes up by 2.5 whenever x goes up by 1.

The Intercept Formula (a)

a = ȳ - b(x̄)

Simple. Take the mean of the y values and subtract the slope times the mean of the x values. This pins the line to the correct vertical position: it always passes through the point (x̄, ȳ).
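The two formulas translate almost line for line into code. The following is a sketch in plain Python, not a reference implementation; the function name fit_line and the test points are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Return (b, a) for y-hat = b*x + a using the deviation-from-mean formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Numerator and denominator of the slope formula
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")

    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return b, a

# Points that lie exactly on y = 2x + 1, so the fit should recover b = 2, a = 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

The guard on sxx covers the one case where the formula breaks down: if every x is the same, the denominator is zero and no slope exists.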

What Do the Slope and Intercept Actually Mean?

Slope (b): The predicted change in y for each one-unit increase in x. If b is positive, y tends to increase as x increases. If b is negative, y tends to decrease.

Intercept (a): The predicted value of y when x equals zero. Sometimes this is meaningful. Sometimes it's not. If x can never be zero in your context, don't read too much into the intercept.

How to Calculate the Least Squares Regression Line

Let's walk through a real example. You want to predict monthly rent (y) based on apartment square footage (x).

Apartment	Sq Ft (x)	Rent $ (y)
1	500	1200
2	750	1600
3	1000	2100
4	1250	2400
5	1500	2800

Step 1: Calculate the Means

x̄ = (500 + 750 + 1000 + 1250 + 1500) / 5 = 1000

ȳ = (1200 + 1600 + 2100 + 2400 + 2800) / 5 = 2020

Step 2: Calculate the Slope (b)

Build a table with (xi - x̄), (yi - ȳ), their product, and the squared x deviations:

xi	yi	(xi - x̄)	(yi - ȳ)	Product	(xi - x̄)²
500	1200	-500	-820	410,000	250,000
750	1600	-250	-420	105,000	62,500
1000	2100	0	80	0	0
1250	2400	250	380	95,000	62,500
1500	2800	500	780	390,000	250,000
Totals				1,000,000	625,000

b = 1,000,000 / 625,000 = 1.6

Step 3: Calculate the Intercept (a)

a = 2020 - 1.6(1000) = 2020 - 1600 = 420

Step 4: Write the Equation

ŷ = 1.6x + 420

Interpretation: For every additional square foot, predicted rent goes up by $1.60. A zero-square-foot apartment would theoretically rent for $420. That makes no real-world sense, because zero is far outside the 500 to 1500 sq ft range of the data, but that's the math.
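Steps 1 through 4 can be checked in a few lines of Python. This sketch uses only the five apartments from the table; the variable names are arbitrary.

```python
sq_ft = [500, 750, 1000, 1250, 1500]
rent = [1200, 1600, 2100, 2400, 2800]

# Step 1: means
x_bar = sum(sq_ft) / len(sq_ft)   # 1000.0
y_bar = sum(rent) / len(rent)     # 2020.0

# Step 2: the "Product" and squared-deviation columns, then the slope
products = [(x - x_bar) * (y - y_bar) for x, y in zip(sq_ft, rent)]
squares = [(x - x_bar) ** 2 for x in sq_ft]
b = sum(products) / sum(squares)  # 1,000,000 / 625,000

# Step 3: intercept
a = y_bar - b * x_bar

# Step 4: the equation
print(f"y-hat = {b:.1f}x + {a:.0f}")  # y-hat = 1.6x + 420
```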

How to Use It in Practice

Plug in any x value to get a predicted y:

800 sq ft → ŷ = 1.6(800) + 420 = $1,700
1100 sq ft → ŷ = 1.6(1100) + 420 = $2,180
2000 sq ft → ŷ = 1.6(2000) + 420 = $3,620 (outside the 500 to 1500 sq ft range of the data, so treat this one with caution)
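Prediction is a single multiply-and-add. The sketch below wraps it in a hypothetical helper, predict_rent, that also reports whether the input falls inside the 500 to 1500 sq ft range of the example data. The range check is an added convenience, not part of the regression itself.

```python
SLOPE = 1.6
INTERCEPT = 420
X_MIN, X_MAX = 500, 1500  # smallest and largest square footage in the data

def predict_rent(sq_ft):
    """Return (predicted rent, True if sq_ft lies inside the observed range)."""
    predicted = SLOPE * sq_ft + INTERCEPT
    in_range = X_MIN <= sq_ft <= X_MAX
    return predicted, in_range

for size in (800, 1100, 2000):
    predicted, in_range = predict_rent(size)
    note = "" if in_range else " (extrapolated)"
    print(f"{size} sq ft -> ${predicted:,.0f}{note}")
```

The 2000 sq ft prediction gets flagged because no apartment that large was in the data used to fit the line.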

What Makes a Good Regression Line?

Two key metrics tell you whether your line fits well:

R-squared (R²)

This tells you what percentage of the variation in y is explained by x. R² of 0.85 means 85% of y's movement is captured by the line. The rest is noise or other factors.

Range: 0 to 1. Higher is better, but not always. An R² of 0.9 in one context might be weak in another.

Standard Error of the Estimate (s)

This is the typical size of a prediction error, measured in the same units as y. If s = 200 in a rent model, predictions are typically off by about $200.

Lower is better. You want predictions to be close to actual values.
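Both metrics come straight from the residuals. Here is a rough sketch for the rent example, assuming the fitted line ŷ = 1.6x + 420. The standard error divides by n - 2, the usual convention for a line with two estimated coefficients; the article does not spell that out, so treat it as an assumption.

```python
import math

sq_ft = [500, 750, 1000, 1250, 1500]
rent = [1200, 1600, 2100, 2400, 2800]
b, a = 1.6, 420  # fitted slope and intercept from the worked example

predicted = [b * x + a for x in sq_ft]
residuals = [y - p for y, p in zip(rent, predicted)]

y_bar = sum(rent) / len(rent)
sse = sum(r ** 2 for r in residuals)       # variation the line fails to explain
sst = sum((y - y_bar) ** 2 for y in rent)  # total variation in y

r_squared = 1 - sse / sst
std_error = math.sqrt(sse / (len(rent) - 2))

print(f"R^2 = {r_squared:.3f}")  # R^2 = 0.995
print(f"s   = {std_error:.1f}")  # s   = 51.6
```

For this tiny data set the line explains about 99.5% of the variation in rent, and a typical miss is around $52.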

Common Mistakes to Avoid

Extrapolating beyond your data range. The line is only reliable within the x-values you trained it on. Predicting for x = 5000 when your data maxes out at 1500 is guessing, not statistics.
Ignoring outliers. One extreme point can dramatically tilt the line. Check your data.
Assuming correlation means causation. The regression line shows association, not that x causes y to change.
Using it for small samples. Five data points give you a rough line. Fifty gives you something more trustworthy.
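The outlier warning is easy to see numerically. In this sketch, the last apartment's rent is replaced with a made-up value of $5,000 (a hypothetical data-entry error, not part of the original data) and the line is refit.

```python
def slope_and_intercept(xs, ys):
    """Least squares slope and intercept from the deviation formulas."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return b, y_bar - b * x_bar

sq_ft = [500, 750, 1000, 1250, 1500]
rent = [1200, 1600, 2100, 2400, 2800]
rent_with_outlier = [1200, 1600, 2100, 2400, 5000]  # hypothetical bad value

b_clean, a_clean = slope_and_intercept(sq_ft, rent)
b_out, a_out = slope_and_intercept(sq_ft, rent_with_outlier)

print(f"clean data:   slope {b_clean:.2f}, intercept {a_clean:.0f}")  # slope 1.60, intercept 420
print(f"with outlier: slope {b_out:.2f}, intercept {a_out:.0f}")      # slope 3.36, intercept -900
```

One bad point out of five more than doubles the slope and pushes the intercept negative, which is why checking the data comes before trusting the line.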

When to Use Least Squares Regression

This method works when:

You have two continuous variables
You want to predict one variable from another
The relationship between x and y looks roughly linear
Errors (residuals) are randomly scattered, not patterned

It doesn't work well when the relationship is curved, when you have categorical variables, or when your data has heavy outliers.

Least Squares vs. Other Methods

Method	Best For	Drawback
Least Squares	Linear relationships, prediction	Sensitive to outliers
Median Regression	Data with extreme values	Harder to interpret
Polynomial Regression	Curved relationships	Can overfit easily
Robust Regression	Data with outliers	More complex calculations

Getting Started: Quick Checklist

Plot your data first. Scatter plot. Does a straight line look reasonable? If the pattern is curved, linear regression isn't your answer.
Calculate x̄ and ȳ. Your starting point for everything else.
Compute the slope (b). Use the formula or let software do it. Excel, Google Sheets, R, and Python all have built-in functions.
Find the intercept (a). One subtraction.
Write the equation. ŷ = bx + a.
Check R² and standard error. Does the line actually explain your data?
Validate. Hold out some data points. See how well your line predicts them.

The Bottom Line

The least squares regression line is a straightforward tool: find the straight line that comes closest to all your points. The math is simple, the interpretation is direct, and the method has been battle-tested for more than two centuries.

Use it when the relationship is linear. Check your R². Don't extrapolate beyond your data. That's all you need.