STATS 413: Lecture 8

Aanalysis of Variance (ANOVA)
The partial $\mathcal{F}$- test

How Much Better Are We Doing?

We now know how to, given a set of explanatory variables $x_1,\dots,x_p$, create a multiple regression model to try to predict the value of the response $y$.

Is there evidence that the model helped? That is, is there evidence that by exploiting information about our predictor variables, we can provide improved predictions in $y$?

Improved relative to what?
If I didn't have information on these explanatory variables, what would my best guess be for any individual? The overall mean
We'd certainly hope we're doing better!

How Much Better Are We Doing?

As we now show, the test will be based on the magnitude of $R$-squared: how much variation in $y$ could we explain from our regression?

Compare the variation in $y$ we can attribute to $X$ relative to the overall variation in $y$.

The Overall $\mathcal{F}$-Test

Under the assumptions of the multiple regression model, the null hypothesis for testing "was multiple regression worth it?" takes on the form:

\[\mathbf{H_o}: \beta_1 = \beta_2 = \dots = \beta_p = 0\]

The alternative is

\[\mathbf{H_a}: \text{at least one } \beta_j \neq 0, \quad j=1,\dots,p\]

The Overall $\mathcal{F}$-Test

Another way to think about this: under the null, the only thing that matters for predicting a response is the intercept, $\beta_0$.

If this were true, $\text{E}(y_i \mid x_i) = \beta_0$. That is, every point has the same mean regardless of the values of the explanatory variables
As we showed: in a regression that only contains an intercept, $\hat{\beta}_0$, the estimated intercept coefficient, equals $\bar{y}$

A Test Statistic

We'll form a test statistic whose value is large when $x_1,\dots,x_p$ are strong predictors of $y$ (large values $\Rightarrow$ strong evidence against the null). This corresponds to large values for $R$-squared!

The $F$-Statistic

Our test statistic takes on the form:

\[F_{\text{stat}} = \frac{R^2}{1 - R^2}\left(\frac{n - p - 1}{p}\right)\]

This is what R reports as the F-statistic in its output from multiple regression

If the Null Were True...

If, in reality, the null is true and none of the $p$ covariates had predictive power, would we observe $R$-squared exactly equal zero?

$R^2$ will be greater than zero due to chance correlations, even if the explanatory variables don't actually matter
Also: the more predictors you add to your regression model, the larger the $R^2$ value becomes, even if your predictor variables are unrelated to the response (see the Lecture 3 code for a demonstration).

If the Null Were True...

We need to account for this when assessing statistical significance
We need to find: what does an "extreme" value for the $F_{\text{stat}}$ look like.

Under the null hypothesis, what is the distribution of $F_{\text{stat}}$? Let's first simulate...

Visualizing the Distribution of $F_{\text{stat}}$

The $\mathcal{F}$ Distribution

Under the null hypothesis, the distribution of $F_{\text{stat}}$ is a member of a family of probability distributions known as $\mathcal{F}$-distributions.

These distributions are not symmetric
Just like with $t$-distributions, the $\mathcal{F}$ is parameterized by "degrees of freedom."
Unlike the $t$, the $\mathcal{F}$ has two separate degrees of freedom: $\mathcal{F}_{df_1, df_2}$

The $\mathcal{F}$ Distribution

Under the null hypothesis, $F_{\text{stat}}$ follows a $\mathcal{F}_{p,\, n-p-1}$ distribution. Once again:

$p$ is the number of predictor variables in our model
$n$ is the sample size

We calculate a $p$-value as $P(\mathcal{F}_{p,\, n-p-1} \geq F_{\text{stat}})$

Here, extremeness is always in the right tail. Do you see why?
In R: 1-pf(Fstat, df1=p, df2=n-p-1)

Back to the Output...

summary(lm.medicorp)
  ...
  Multiple R-squared:  0.8872,	Adjusted R-squared:  0.8774
  F-statistic: 90.47 on 2 and 23 DF,  p-value: 1.261e-11

$\mathtt{F\text{-}statistic} = 90.47$

$F_{\text{stat}} = 0.8872/(1-0.8872) \times (23/2) = 90.47$

on 2 and 23 DF

$F_{\text{stat}}$ follows a $\mathcal{F}_{2, 23}$ distribution

p-value: 1.261e-11

$P(\mathcal{F}_{2, 23} \geq 90.47) = \mathtt{1\text{-}pf(90.47, 2, 23)} \approx 0$

Back to the Output...

summary(lm.medicorp)
  ...
  Multiple R-squared:  0.8872,	Adjusted R-squared:  0.8774
  F-statistic: 90.47 on 2 and 23 DF,  p-value: 1.261e-11

Hence, we reject the null hypothesis that $\beta_{\text{advert}} = \beta_{\text{bonus}} = 0$ and suggest that our regression model provides a statistically significant improvement in predictions for sales.

An Equivalent Form for the Overall $\mathcal{F}$-statistic

We'll soon introduce other hypothesis tests which also make use of the $\mathcal{F}$-distribution to compute $p$-values

Will be called Partial $\mathcal{F}$-tests

More natural to view these tests as comparisons between the sum of squared errors in models with different predictor variables

Let's try to understand the overall $\mathcal{F}$ test when viewed through this lens...

Comparing $SSE$ to $SST$

The overall $\mathcal{F}$-test can be viewed as a comparison between:

The sum of squared errors in a model including all of the predictor variables;
The sum of squared errors in a model only including those variables whose coefficients don't equal zero under the null

Under the null hypothesis tested by the overall $\mathcal{F}$ test, $\beta_1 = \beta_2 = \dots = \beta_p = 0$

Compare a regression (a) including all predictor variables; to (b) a regression only including an intercept term.

Comparing $SSE$ to $SST$

Due to chance correlations, $SSE < SST$ even if the null hypothesis is true!

Objective: Quantify how large of a discrepancy between $SST$ and $SSE$ we should expect under the null.
If what we observe is much larger than expected, we reject the null.

The Overall $\mathcal{F}$-Test

The $F$-Statistic

Our test statistic may be expressed as:

\[F_{\text{stat}} = \frac{(SST - SSE)/p}{SSE/(n - p - 1)}\]

The Overall $\mathcal{F}$-Test

Numerator: difference between the sum of squared total and the sum of squared errors, divided by the differences in the degrees of freedom for the residual terms in those two regressions
- Regression with only an intercept: $e$ has $n-1$ degrees of freedom
- Regression with all variables: $e$ has $n-p-1$ degrees of freedom
- $(n-1)-(n-p-1) = p$
Denominator: sum of squared errors in the full model, divided by degrees of freedom for residual vector in the full model
- Note: $SSE/(n-p-1) = \widehat{\sigma}_\varepsilon^2$
- $SSE/(n-p-1)$ often called the mean squared error.

Categorical Variables

Definition

Categorical (also called nominal or qualitative) variables are ones in which the "values" are labels for group memberships. These values are often (but not always) coded as words.

The variable "colors" can take on values 'red', 'blue,'...
The variable "Nationality" can take on values 'American', 'Canadian,'...

Categorical Variables

Definition

Categorical (also called nominal or qualitative) variables are ones in which the "values" are labels for group memberships. These values are often (but not always) coded as words.

Categorical variables that can only take on two distinct values are called binary variables

The variable "College Graduate?" can take on values 'yes' or 'no'

Categorical Predictor Variables

So far we've talked about conducting multiple regression where our predictor and response variables are both continuous.

What does R do when we feed it a categorical predictor variable?

lm(Sales~Region)$coefficients

(Intercept) RegionSOUTH  RegionWEST
  1498.6355   -461.0997   -336.1042

Categorical Predictor Variables

If I have a categorical variable that takes on $G$ values, R creates $G-1$ binary variables (also called dummy variables) and includes them as regressors

RegionSOUTH $= \mathbb{1}\{\text{Region = South}\}$
RegionSOUTH: 1 if territory was southern, 0 otherwise.

Referred to as "one-hot encoding" of a categorical variable.

Regression with Categorical Predictors

Equation is then of the form

\[\widehat{\text{Score}} = \hat{\beta}_0 + \hat{\beta}_1 \mathbb{1}\{\text{Region = South}\} + \hat{\beta}_2 \mathbb{1}\{\text{Region = West}\}\]

My region-specific prediction from the regression line would then be

Midwest: $\widehat{\text{Sales}} = \hat{\beta}_0 = 1498.6$
South: $\widehat{\text{Sales}} = \hat{\beta}_0 + \hat{\beta}_1 = 1498.6 - 461.1 = 1037.5$
West: $\widehat{\text{Sales}} = \hat{\beta}_0 + \hat{\beta}_2 = 1498.6 - 336.1 = 1162.5$

Regression with Categorical Predictors

Compare to...

tapply(Sales, Region, mean)
 MIDWEST    SOUTH     WEST
1498.635 1037.536 1162.531

Interpreting the Slopes

The group that R doesn't include an indicator for (in this case, Midwest) is known as the reference category.

The reason: the interpretation of the slope coefficients is made relative to (in reference to) the excluded category

Interpreting the Slopes

Slopes of Categorical Predictors

Suppose my categorical predictor variable takes on $G$ values, $c_1,\dots,c_G$, and my equation is of the form

\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 \mathbb{1}\{x_i = c_1\} + \dots + \hat{\beta}_{G-1} \mathbb{1}\{x_i = c_{G-1}\},\]

such that $c_G$ is the reference category. My expected difference in values of $y$ for two individuals, one in category $c_j$ and one in $c_G$, is $\hat{\beta}_j$.

Why not $G$ Binary Variables?

Why can't we include $G$ dummy variables, one for each category? Recall the normal equations:

\[X^\top X \widehat{\beta} = X^\top y\]

To express our vector of slope coefficients as $\widehat{\beta} = (X^\top X)^{-1} X^\top y$, we needed to make sure $(X^\top X)^{-1}$ was invertible

Need $p + 1 < n$
Need linear independence between the columns of $X$ (avoidance of collinearity)

Why not $G$ Binary Variables?

If I include $G$ dummy variables, $d_1,\dots,d_G$, what would $d_1 + d_2 + \dots + d_G$ equal?

Implication: since first column of the design matrix $X$ is the intercept column, the resulting design matrix would not have linearly independent columns.

Testing Overall Equality of Means

> summary(lm(Sales~Region))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1498.64      45.22  33.141  < 2e-16 ***
RegionSOUTH  -461.10      72.51  -6.359 1.73e-06 ***
RegionWEST   -336.10      69.69  -4.823 7.24e-05 ***

Residual standard error: 150 on 23 degrees of freedom
Multiple R-squared:  0.669,	Adjusted R-squared:  0.6402
F-statistic: 23.24 on 2 and 23 DF,  p-value: 3.005e-06

What does the (stronger) regression model corresponding to the output above look like, specialized to the case of these predictors?
What would it look like under the null hypothesis tested by the overall $\mathcal{F}$-test?

The Regression Model with Categorical Variables

The population expectation under the multiple regression model would be

\[\text{E}(Y_i \mid x_i) = \beta_0 + \beta_1 \mathbb{1}\{x_i = c_1\} + \dots + \beta_{G-1} \mathbb{1}\{x_i = c_{G-1}\}\]

Under the null hypothesis tested by the overall $\mathcal{F}$,

\[\mathbf{H_o}: \beta_1 = \dots = \beta_{G-1} = 0,\]

the expectation function would instead be

\[\text{E}(Y_i \mid x_i) = \beta_0,\]

i.e. the expectation is the same for all categories.

The Regression Model with Categorical Variables

So, the overall $\mathcal{F}$-test from this regression can be used to assess the null $\mathbf{H_0:}$ $\mu_g = \beta_0$ for all $g$, where $\mu_g$ is the population mean in the $g$th category.
In our example: $p$-value was $3 \times 10^{-6}$, so reject the null.

Analysis of Variance

This regression-based approach for testing the equality of means through use of the overall $\mathcal{F}$ test has its own name in the literature: Analysis of Variance, or ANOVA for short.

Note that the test is assuming the truth of the (stronger) regression model. Most importantly here...

Homoskedasticity
Normality

Analysis of Variance

In this context, homoskedasticity means $\sigma_1 = \dots = \sigma_G$. Do you see why?

When you conducted two-sample $t$-tests in your introductory statistics class, did you assume the standard deviations were equal between groups?

Recap: Multiple Regression

             Estimate Std. Error t value Pr(>|t|)
(Intercept) -682.3699   179.3558  -3.805 0.000913 ***
Bonus          2.0970     0.6904   3.037 0.005855 **
Advert         2.6949     0.2563  10.513 2.95e-10 ***

F-statistic: 90.47 on 2 and 23 DF,  p-value: 1.261e-11

What lurking variables aren't accounted for in this regression?

Combining Continuous and Categorical

In the same way, our test for differences in means across regions did not account for differences on the basis of bonus and advertisement

Maybe there were more sales in the midwest because there was more money spent on advertisements there!

Can we try to control for all variables through a multiple regression?

lm(Sales~Region+Bonus+Advert)

(Intercept)  RegionSOUTH   RegionWEST        Bonus       Advert
   -151.586     -155.291     -150.234        1.534        2.132

Interpreting the New Categorical Slopes

Compare the slope on South before and after adding Bonus and Advert

Only Categoricals

(Intercept)  RegionSOUTH   RegionWEST
     1498.6       -461.1       -336.1

Comparing two territories, one in the midwest and one in the south, we'd expect the one in the south to have $461.1 thousand fewer sales.

Interpreting the New Categorical Slopes

Include Bonus and Advert

(Intercept)  RegionSOUTH   RegionWEST
   -151.586     -155.291     -150.234

Comparing two territories, one in the midwest and one in the south with the same bonus and advertisement spending we'd expect the one in the south to have $155.29 thousand fewer sales.

Motivating a Further Step

Once again, here's our regression output

Coefficients:
  (Intercept)  RegionSOUTH   RegionWEST        Bonus       Advert
    -151.586     -155.291     -150.234        1.534        2.132

In the south, for two territories with the same advertising spend but who differ in total bonus paid by 1 hundred dollars, what's my expected difference in sales (in thousands of dollars)?
In the midwest, for two territories with the same advertising spend but who differ in total bonus paid by 1 hundred dollars, what's my expected difference in sales (in thousands of dollars)?

Motivating a Further Step

Once again, here's our regression output

Coefficients:
  (Intercept)  RegionSOUTH   RegionWEST        Bonus       Advert
    -151.586     -155.291     -150.234        1.534        2.132

In general, how do the regression lines vary across groups? Different intercepts, but same slopes

Is it obvious that this should be the case? Might the impact depend on region? Might the two interact?

Interaction Terms

It may be that the slope of a predictor variable changes as a function of another categorical variable.

The continuous variable may impact the expectation in different groups differently

We can capture these differences through the incorporation of interaction terms.

Interaction Terms

Interaction Term
To form an interaction term, we simply take the product of the two covariates whose interaction we are investigating: $X_i\times X_j$

The interaction term between a continuous and categorical covariate takes on the following form:

\begin{align*} X_i \mathbf{1}\{X_j = c\} &= \begin{cases} 0 & X_j \neq c\\ X_i & X_j = c\end{cases} \end{align*}

Including Interactions

In R, including interactions is straightforward:

summary(lm(Sales~Region*Bonus + Region*Advert))
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        -555.5274   212.5735  -2.613  0.01817 *
RegionSOUTH         837.0791   292.4465   2.862  0.01079 *
RegionWEST          527.3892   393.8955   1.339  0.19823
Bonus                 1.8994     0.6424   2.957  0.00883 **
Advert                2.6585     0.2132  12.469 5.59e-10 ***
RegionSOUTH:Bonus    -0.8742     0.9571  -0.913  0.37382
RegionWEST:Bonus      0.7477     1.0145   0.737  0.47117
RegionSOUTH:Advert   -1.5546     0.4800  -3.239  0.00483 **
RegionWEST:Advert    -1.6964     0.5644  -3.006  0.00796 **

The last four coefficients are the interactions.

Understanding the Prediction Equations

At this extreme, each group has its own slopes and intercepts

Midwest: $-555.5 + 1.90(\texttt{Bonus}) + 2.66(\texttt{Advert})$
South: $(-555.5 + 837.1) + (1.90-0.87)\texttt{Bonus} + (2.66-1.55)\texttt{Advert}$
West: $(-555.5 + 527.4) + (1.90+0.75)\texttt{Bonus} + (2.66-1.70)\texttt{Advert}$

What do you think would happen if I fit three separate regressions, one each for Midwest, South, and West? How would the output compare to what I'm seeing above?

Was the Extra Work Necessary?

Through this process we've built up from a fairly simple model for sales to one that separately models sales according to region.

Did we need to do this to the full extent we've provided here?
For example: perhaps the effectiveness of advertising varies regionally, but not the impact of bonus...

How do we assess this?

Unfortunately, this won't generally correspond to a simple $t$-test...

Partial $\mathcal{F}$-test

The required test is called a partial $\mathcal{F}$-test.

Partial in the sense that it compares large (full) models to some partial subset of the large model

General use: suppose we fit a model $Y_i = \beta_0 + \beta_1x_{i1} + \dots+\beta_px_{ip} + \varepsilon_i$, and are interested in the null hypothesis that some subset of slopes $\mathcal{I}\subseteq \{1,\dots,p\}$ are zero.

For example, suppose we want to test if the slopes on the third and fifth predictors are both zero, versus the alternative that at least one is nonzero.

Partial $\mathcal{F}$-test

The test compares the predictive performance in the larger model to the predictive performance in the smaller model, which uses only a subset of the predictor variables.

Natural extension of the $t$-tests on individual coefficients to dealing with sets of coefficients.
In fact: if the subset is a single coefficient, the two will be identical

Full and Reduced Models

Suppose I am interested in the null hypothesis

\begin{align*} \mathbf{H_o}:& \;\beta_j = 0 \text{ for some subset of covariates } \mathcal{I}\\ \mathbf{H_a}:& \;\text{at least one slope in the subset $\mathcal{I}$ is nonzero} \end{align*}

A natural way to test this is to compare the $SSE$ from a regression including all variables to the $SSE$ from a regression excluding the variables whose slopes are zero under the null

Full Model: The model including the full set of covariates
Reduced Model: The model only including the covariates whose slopes aren't set to zero under the null

Full and Reduced Models

Ex: if we wanted to see if we needed interactions between region and bonus, we'd compare a model including them (the full model) to a model excluding them (the reduced model)

The Partial $\mathcal{F}$-Test

Full Model (includes all the predictor variables)

$SSE_{Full}$: the $SSE$ in the full model, including all candidate predictor variables
$df_{Full}$: the degrees of freedom for the residual vector under the full model
- If the full model has $p_{full}$ predictors and an intercept, this will be $n-p_{full}-1$

The Partial $\mathcal{F}$-Test

Reduced Model (includes subset where slopes aren't equal to zero under $H_0$)

$SSE_{Red}$: the $SSE$ in the reduced model, including only the predictor variables whose slope coefficients aren't equal to zero under the null
$df_{Red}$: the degrees of freedom for the residual vector under the reduced model.
- If the reduced model has $p_{red}$ predictors whose slopes don't equal zero under the null along with an intercept, this will be $n-p_{red}-1$.
- Note $p_{full} > p_{red}$

The Partial $\mathcal{F}$-Test

The $F$-Statistic
Our test statistic may be expressed as:

\begin{align*} F_{stat} &= \frac{(SSE_{red}-SSE_{full})/(df_{red} - df_{full})}{SSE_{full}/df_{full}} \end{align*}

Numerator: difference between the SSE in reduced and full models, divided by the differences in the degrees of freedom
- If both regressions have intercept terms, this will equal $p_{full} - p_{red}$.

The Partial $\mathcal{F}$-Test

The $F$-Statistic
Our test statistic may be expressed as:

\begin{align*} F_{stat} &= \frac{(SSE_{red}-SSE_{full})/(df_{red} - df_{full})}{SSE_{full}/df_{full}} \end{align*}

Denominator: SSE in the full model divided by degrees of freedom for residual vector in the full model
- With an intercept, $df_{full} = n-p_{full}-1$
- Denominator called the mean squared error in the full model.

Under the Null: $F_{stat}\sim \mathcal{F}_{(df_{red}-df_{full}),\; df_{full}}$

An Example: The Overall $\mathcal{F}$-Test

Under the assumptions of the multiple regression model, the null hypothesis for testing "was multiple regression worth it at all?" with our predictor variables $x_1,\dots,x_p$ takes on the form:

\begin{align*} \mathbf{H_o}: & \;\beta_1 =\beta_2=\dots=\beta_p = 0 \end{align*}

The alternative is

\begin{align*} \mathbf{H_a}: & \;\text{at least one } \beta_j\neq 0\;,\; j=1,\dots,p \end{align*}

An Example: The Overall $\mathcal{F}$-Test

Another way to think about this: under the null, the only thing that matters for predicting a response is the intercept, $\beta_0$.

If this were true, $\textbf{E}(Y_i|x_1,\dots,x_p) = \beta_0$. That is, every point has the same mean regardless of the values of the explanatory variables
One can show: in a regression that only contains an intercept, $\hat{\beta}_0$, the estimated intercept coefficient, equals $\bar{y}$

The Overall $\mathcal{F}$-Test

Full Model

Contains all $p_{full}$ of the predictor variables
$df_{full} = n-p_{full}-1$

Reduced Model

Only contains the intercept. $p=0$ for this regression
$df_{red} = n-1$
$SSE_{red} = SST$

The Overall $\mathcal{F}$-Test

\begin{align*} &\frac{(SSE_{red}-SSE_{full})/(df_{red} - df_{full})}{SSE_{full}/df_{full}} \\ &\quad=\frac{(SST-SSE_{full})/p_{full}}{SSE_{full}/(n-p_{full}-1)}\\ &\quad= \left(\frac{R^2}{1-R^2}\right)\left(\frac{n-p_{full}-1}{p_{full}}\right) \end{align*}

Back to the Output...

> summary(lm(Sales~Region*Bonus + Region*Advert))
...
Residual standard error: 48.57 on 17 degrees of freedom
Multiple R-squared:  0.9743,	Adjusted R-squared:  0.9623
F-statistic: 80.68 on 8 and 17 DF,  p-value: 5.371e-12

$p$-value for the Overall $\mathcal{F}$-test is given in the very last line: $5.371\times 10^{-12}$

Hence, at $\alpha=0.05$ we reject the null hypothesis and suggest that our regression model provides a statistically significant improvement in prediction in sales relative to predicting the overall mean for all individuals.

Acts as a "proof of concept." There's at least something going on that makes doing regression useful in the first place.

Interactions and Partial $\mathcal{F}$'s

Take as our full model the model including interactions between Bonus and Region and Advertisement and Region. Let's test the null hypothesis that there isn't an interaction between Bonus and Region.

Full Model: Interactions between Bonus and Region, and between Advertisement and Region
Reduced Model: Interaction between Advertisement and Region, but no interaction between Bonus and Region
- Implication: Slope coefficient on Bonus doesn't vary from one region to the next under the null hypothesis.

Conducting the Test

To conduct the test, we fit the full and reduced linear models, and feed the results into the anova command.

lm.full = lm(Sales~Bonus*Region + Advert*Region)
lm.reduced = lm(Sales~Bonus + Advert*Region)

anova(lm.reduced, lm.full)

Model 1: Sales ~ Bonus + Advert * Region
Model 2: Sales ~ Bonus * Region + Advert * Region
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1     19 45723
2     17 40109  2    5614.1 1.1898 0.3284

Conclusion: Fail to reject the null. It suggests that we don't need to include interactions between region and bonus in the model.

Continuous Interactions

Interaction effects are not solely limited to interactions between continuous variables and categoricals.

Let's go back to the mall and consider a regression of sales on income and competitors

(Intercept)       income  competitors
      91.422        7.495      -25.135

For two stores with median income equal to 47K, what's the expected difference in sales if one store has 1 more competitor? -25.135
For two stores with median income equal to 74K, what's the expected difference in sales if one store has 1 more competitor? -25.135

Continuous Interactions

Interaction effects are not solely limited to interactions between continuous variables and categoricals.

Let's go back to the mall and consider a regression of sales on income and competitors

(Intercept)       income  competitors
      91.422        7.495      -25.135

That is, the particular value of median income doesn't affect the expected difference in sales.

Is this obviously true? Might competitors and income not interact?

Continuous Interaction

The interaction term between two continuous variables takes on the form

\begin{align*} X_i:X_j &= X_i\times X_j \end{align*}

lm(formula = sales ~ income * competitors)
  (Intercept)  income  competitors  income:competitors
  205.0266     5.6312  -61.6539     0.5841

The variable income:competitors is found by, for each row, taking the product of the income of that store and the number of competitors within that store.

If income was 47K and there were two competitor stores, the interaction variable would take on the value $47\times 2 = 94$

Consequences

lm(formula = sales ~ income * competitors)
  (Intercept)  income  competitors  income:competitors
  205.0266     5.6312  -61.6539     0.5841

What's the predicted sales/sqft for a store with three competitors and 47K in income?

$205 + 5.63\times 47 + (-61.65)\times 3 + 0.584\times 47\times 3$

For two stores with median income equal to 47K, what's the expected difference in sales per square foot if one store has 1 more competitor?

$-61.65 + 0.585\times 47 = -34$

Consequences

lm(formula = sales ~ income * competitors)
  (Intercept)  income  competitors  income:competitors
  205.0266     5.6312  -61.6539     0.5841

For two stores with median income equal to 74K, what's the expected difference in sales if one store has 1 more competitor?

$-61.65 + 0.585\times 74 = -18$

Was the Complication Worth It?

                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        205.0266   128.7509   1.592   0.1165
income               5.6312     2.1195   2.657   0.0101 *
competitors        -61.6539    38.2112  -1.614   0.1119
income:competitors   0.5841     0.6026   0.969   0.3363

How can we use this output to see if including the interaction improved predictive performance relative to a model without an interaction? What's the conclusion at $\alpha=0.05$?

Relationship B/W $t_{df}$ and $\mathcal{F}_{1,df}$ Distributions

Suppose that $T \sim t_{df}$. Then,

\begin{align*} T^2 & \sim \mathcal{F}_{1, df} \end{align*}

In words, the square of a $t_{df}$-distributed random variable follows an $\mathcal{F}$ distribution, with one numerator degree of freedom and $df$ denominator degrees of freedom.

With this in mind, let's consider an interesting correspondence between

$t$-tests for individual slope parameters
Partial $\mathcal{F}$-tests where only one slope parameter equals zero under the null

An Illustration

Consider two approaches to testing whether or not we need the interaction between income and competitors:

Partial $\mathcal{F}$-test

Model 1: sales ~ income + competitors
  Model 2: sales ~ income * competitors
    Res.Df    RSS Df Sum of Sq      F Pr(>F)
  1     61 277232
  2     60 272958  1    4274.1 0.9395 0.3363

An Illustration

Consider two approaches to testing whether or not we need the interaction between income and competitors:

$t$-test

                   Estimate Std. Error t value Pr(>|t|)
  (Intercept)        205.0266   128.7509   1.592   0.1165
  income               5.6312     2.1195   2.657   0.0101 *
  competitors        -61.6539    38.2112  -1.614   0.1119
  income:competitors   0.5841     0.6026   0.969   0.3363

$0.97^2 = 0.94$. $p$-values are identical.

Another Illustration: Simple Regression

Consider a simple regression of Sales on Income

lm(formula = sales ~ income)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 120.4583    58.6327   2.054   0.0442 *
income        6.0807     0.9066   6.707 6.89e-09 ***
---
F-statistic: 44.99 on 1 and 62 DF,  p-value: 6.888e-09

Compare the $p$-values on the overall F test and for the slope on income. Why are they equal?

An Aside: What's the $t$ Distribution?

A random variable $T_\nu$ follows a $t_\nu$ distribution if it can be expressed as follows:

\begin{align*} T_\nu &= \frac{Z}{\sqrt{\sum_{i=1}^\nu Z_i^2/\nu}}, \end{align*}

where $Z,Z_1,\dots,Z_\nu\overset{iid}{\sim}N(0,1)$.

All of our $t$-statistics can be expressed in this form under the stronger linear model!

An Aside: What's the $\mathcal{F}$ distribution?

A random variable $F_{\nu_1, \nu_2}$ follows a $\mathcal{F}_{\nu_1,\nu_2}$ distribution if it can be expressed as follows:

\begin{align*} F_{\nu_1,\nu_2} &= \frac{\sum_{i=1}^{\nu_1} W_i^2/\nu_1}{\sum_{j=1}^{\nu_2} Z_j^2/{\nu_2}}, \end{align*}

where $W_1,\dots,W_{\nu_1}, Z_1,\dots,Z_{\nu_2}\overset{iid}{\sim}N(0,1)$.

All of our $F$-statistics can be expressed in this form under the stronger linear model!