Introduction to linear regression
Simple regression
Goals: use data to
to infer relationships between a single response variable and one or more other predictor variables.
to improve predictions of our response variable by leveraging information contained in the predictors.
In regression analysis, we seek to predict values of the ...
...as a function of one or more...
When performing a regression, we say that we regress the values of the $y$ variable on the values of the $x$ variables.
Sir Francis Galton, 19th century
Studied the relationship between heights of fathers and their sons and noted that the
children "regressed" to the population mean
Developed further by Yule, Pearson, Fisher in the late 1800s/early 1900s
Explosion in applied use with advent of modern computing in the 2nd half of the 20th century.
Goal: to infer population-level relationships between a response variable (denoted by $Y$) and one or more other explanatory variables (denoted by $X$) from data.
Data: values $(x_1,y_1),\dots,(x_n,y_n)$ of $(X,Y)$ observed on each of $n$ units or cases.
| Son.Height | Father.Height |
|---|---|
| 59.7 | 55.1 |
| 58.2 | 56.5 |
| ... | ... |
| 60.6 | 56.0 |
| 60.7 | 56.8 |
| 61.8 | 56.0 |
| 55.5 | 57.9 |
| 55.4 | 57.1 |
| 56.8 | 57.6 |
| ... | ... |
Data used for regression analysis is stored in tabular form, with a column for the response variable and one or more columns for the predictors
Data set: fatherson.csv. Contains height data for fathers and their sons.
If we have $p$ predictor variables along with our response...
| $y$ | $x_{1}$ | $x_{2}$ | $\cdots$ | $x_p$ |
|---|---|---|---|---|
| $y_{1}$ | $x_{1,1}$ | $x_{1,2}$ | $\cdots$ | $x_{1,p}$ |
| $y_2$ | $x_{2,1}$ | $x_{2,2}$ | $\cdots$ | $x_{2,p}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| $y_n$ | $x_{n,1}$ | $x_{n,2}$ | $\cdots$ | $x_{n,p}$ |
We use scatterplots for visualizing between two numerical variables
plot(father, son,...)
Informally, when saying that a relationship is linear, we mean that "The best description of the relationship between the two variables is drawing a line through the scatterplot."
The points in the scatterplot appear to be clustered about a line
Stronger relationship $\Rightarrow$ tighter cluster about the line
Positive and Negative relationships possible
Talking about a linear relationship makes sense if the point cloud is shaped like an ellipse
There's a positive, linear relationship here.
Can we come up with a numerical measure describing the strength and direction of the linear relationship?
How do we exploit this relationship? We should be able to use a father's height to predict the height of his son by way of a linear equation.
It would be nice to have a numerical measure of the strength/direction of a linear relationship. How should it behave?
Positive relationship $\Rightarrow$ Positive number; Negative relationship $\Rightarrow$ Negative number
Stronger relationships stand out numerically
Shouldn't depend on units
For an observation $x_i$ of a variable $x$, the $z$-score of $x_i$, $z(x_i)$, answers the following question:
"How many standard deviations above (+) or below (-) the mean was the observed value $x_i$?"
Can write observation $x_i$ for the $i$-th individual as the mean, plus some number ($z(x_i)$) times the sd: $x_i= \bar{x} + z(x_i)\times \text{sd}(x)$. Solve for $z(x_i)$:
z-score
\[\textstyle z(x_i) \triangleq \frac{x_i - \bar{x}}{\text{sd}(x)}, \]
where
\[\textstyle \bar{x} = \frac{1}{n}\sum_{i=1}^nx_i;\quad\text{sd}(x) \triangleq(\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2)^{\frac12} \]If we form $z$-scores of any continuous variable, $x$, then we have:
Mean, Standard Deviation of $Z$-Scores
\[ \begin{aligned} \text{mean}(z(x))&= 0\\ \text{sd}(z(x))&=1 \end{aligned} \]Now, suppose I have a variable $x$, and I change units by $x^* = a + bx$:
Changing Units
If $b$ is positive, then $z(x^*) = z(x)$.
If $b$ is negative, then $z(x^*) = -z(x)$.
That is, the magnitude of $z$-scores are not affected by additive and multiplicative shifts (but the sign flips if $b$ negative)
Begin by taking $z$-scores of the $y$ and $x$ variables separately. Let $z(x_i) = \frac{x_i - \bar{x}}{\text{sd}(x)}$ denote the $z$-score of the $i$-th cases' $x$ variable value. Define $z(y_i)$ similarly.
Correlation Coefficient
\[ \begin{aligned} r(x,y) &\textstyle= \frac{1}{n-1}[z(x_1)z(y_1) + z(x_2)z(y_2) + \dots + z(x_n)z(y_n)]\\ &\textstyle= \frac{1}{n-1}\sum_{i=1}^nz(x_i)z(y_i) \end{aligned} \]Don't worry about the $n-1$...think of it instead as an average:
In English... Correlation is the mean of the products of the $z$-scores
$-1\leq r(x,y) \leq 1$
Positive number implies positive relationship, negative number implies negative relationship
Unit free - consequence of taking product of $z$-scores.
Correlation between $X$ and $Y$ is equal to the correlation between $Y$ and $X$. $r(x,y) = r(y,x)$.
Larger the value of $|r(x,y)|$, the stronger the linear relationship
$r(x,y) = 1$ implies points perfectly on a positive line, $r(x,y) = -1$ means points perfectly on a negative line
$r(x,y) = 0$ implies absence of a LINEAR association
Only tells us about overall linear relationship. Says nothing about nonlinear relationships
From the formula, we see it is dependent on both the mean and the standard deviation of the $x$ and $y$ variables. Hence, it can very easily by influenced by outliers
Correlation near zero doesn't mean there isn't a relationship, and furthermore it doesn't necessarily mean there isn't a linear relationship
Have to plot the data to check for outliers, and to check for a nonlinear relationship

All have the same correlation! 0.816. VISUALIZE YOUR DATA
In our father's and son's height example: Suppose I pick an individual at random from the data set...
First, I simply ask you to guess what the son's height is, without giving you further information. What would your best guess be?
Instead, suppose I tell you the individual I selected had a father who was 74 inches tall (6'2"). Should we still guess that this individual's height is equal to the average height?
As we'd expect, there's a positive linear association here. Tall fathers tend to have tall sons, short fathers tend to have short sons.
If I give you a father's height, what's your best prediction for his son's height?
What if we predict that the son's height will be as extreme as the father's height
Suppose we know a father's height is 75'' tall. The mean of father's height is 70.1'' with a sd of 2.5''
The sons have an average height of 70.9 inches, with a sd of 2.4. If we predict the son to be as extreme, he should also have a $z$-score of roughly 2
$2 = \frac{y - 70.9}{2.4}\Rightarrow$ we should predict the son to have a height of 75.7''
Let's actually look at the data
On the scatter diagram, the points which have an $x$-value in common (at least approximately) make up a vertical strip of the data
They are color coded
A good guess for the height of a son whose father is 75 inches is the average of the heights for all the sons whose fathers are (roughly) 75 inches. This is the AVERAGE inside a vertical strip
For this particular strip, the average son's height is 74 inches
The average of two of the vertical strips are marked: they roughly fall on a line! More on that later...
Earlier, I suggested that a good guess for the average height of the son would be 75.7'' (same $z$-score as father); now, I'm suggesting 74
Which is right? According to the actual data, 75.7 is usually way off
Again, sons overall have an average height of 70.9 in, with an SD of 2.4 inches.
We are predicting the sons to be taller than average, but not as tall as their fathers on a relative scale ($z$-score of 1.3 vs 2)
On average, a son will regress towards the mean
Sir Francis Galton discovered this fact by staring at the scatter diagram of father's heights and sons's heights.
Observed that fathers who were tall had tall sons, and fathers who were short had short sons. This is due to correlation
Observed that fathers who were unusually tall/short had sons who were taller/shorter than average, just not as unusually tall/short on a relative ($z$-score) scale
The predicted heights were moving towards the mean. They were regressing!
Regression Effect
Given two correlated variables, if one variable is observed to be extreme, then you can expect the other variable to also be extreme—just not as extreme