STATS 413: Lecture 17

Bootstrapping regression models
Log transformations
Log-log transformations

Bootstrapping Regression Models

We will now apply the same principles to better approximate the distributions of

\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)} \;\; \text{or} \;\; \frac{\hat{\beta}_j - \beta_j}{\text{se}_{HC2}(\hat{\beta}_j)} \end{align*}

We will try to devise a scheme to approximate the distribution of $t_{stat}$. In analogy to what we saw when bootstrapping the mean, in the bootstrap world:

The true population slope coefficients will be $\hat{\beta}$.
Each bootstrap sample will generate a value $\hat{\beta}^*$, an estimate of $\hat{\beta}$
Each bootstrap sample will also generate a standard error estimator.

Finally, each bootstrap sample will generate a value for $t^*_{stat}$.
The distribution of $t^*_{stat}$ across simulations will serve as our estimate for the distribution of $t_{stat}$ (rather than using the $t_{n-p-1}$ distribution).

There are two prevalent bootstrap schemes, depending upon whether one is willing to assume homoskedasticity.

Residual Bootstrap (assumes homoskedasticity)
Pairs Bootstrap (valid even if heteroskedastic)

The residual bootstrap will use the test statistic:

\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)}, \end{align*}

while the pairs bootstrap will use the test statistic:

\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{se}_{HC2}(\hat{\beta}_j)} \end{align*}

Pairs Bootstrap

The pairs bootstrap resamples pairs of observations $(y_i, x_i)$ rather than residuals. Let $\hat{\beta}$ be the OLS coefficients from a regression of $y$ on $X$. For $b=1,\ldots,B$ bootstrap samples:

Draw $n$ pairs $\{(y_1^*, x_1^*),\ldots,(y_n^*, x_n^*)\}$ $iid$ from the empirical distribution of $\{(y_1,x_1),\ldots,(y_n,x_n)\}$ (that is, sample $n$ times with replacement from $\{(y_1,x_1),\ldots,(y_n,x_n)\}$)
Construct $y^* = (y_1^*,\ldots,y_n^*)^\top$ and $X^*$, the $n\times(p+1)$ matrix whose $i$th row contains $x_i^*$.

Run a regression of $y^*$ on $X^*$. Call this result $\mathtt{lm^*}$
Calculate $\hat{\beta}_j^*$ and $\text{se}_{HC2}(\hat{\beta}_j^*)$ using the regression $\mathtt{lm^*}$.

Form $t^*_{stat,b}$ as

\begin{align*} t^*_{stat,b} &= \frac{\hat{\beta}_j^* - \hat{\beta}_j}{\text{se}_{HC2}(\hat{\beta}_j^*)} \end{align*}

Use the quantiles of $t^*_{stat,b}$ to construct confidence intervals.
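
The pairs bootstrap steps above can be sketched in Python for simple regression (one predictor). This is only a minimal illustration on simulated heteroskedastic data: the helper names `ols_hc2` and `pairs_bootstrap_ci`, the simulated data, and $B = 1000$ are assumptions made for the example. The interval is formed in one common bootstrap-$t$ way, $[\hat{\beta}_j - q^*_{0.975}\,\text{se}_{HC2}(\hat{\beta}_j),\; \hat{\beta}_j - q^*_{0.025}\,\text{se}_{HC2}(\hat{\beta}_j)]$, where $q^*_p$ is the $p$th quantile of the $t^*_{stat,b}$ values.

```python
import random
from statistics import mean, quantiles

def ols_hc2(x, y):
    """Simple regression of y on x: returns (slope, HC2 standard error of the slope)."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    var = 0.0
    for xi, yi in zip(x, y):
        e = yi - intercept - slope * xi        # residual
        h = 1.0 / n + (xi - xbar) ** 2 / sxx   # leverage of observation i
        var += (xi - xbar) ** 2 * e ** 2 / (1.0 - h)
    return slope, (var / sxx ** 2) ** 0.5

def pairs_bootstrap_ci(x, y, B=1000, seed=0):
    """95% bootstrap-t interval for the slope, resampling (y_i, x_i) pairs."""
    rng = random.Random(seed)
    n = len(x)
    beta_hat, se_hat = ols_hc2(x, y)
    t_star = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        b_star, se_star = ols_hc2([x[i] for i in idx], [y[i] for i in idx])
        t_star.append((b_star - beta_hat) / se_star)
    cuts = quantiles(t_star, n=40)                  # cut points at 2.5%, 5%, ..., 97.5%
    q_lo, q_hi = cuts[0], cuts[-1]
    return beta_hat - q_hi * se_hat, beta_hat - q_lo * se_hat

# Simulated data (hypothetical): the noise standard deviation grows with x.
rng = random.Random(413)
x = [rng.uniform(1, 10) for _ in range(60)]
y = [2 + 0.5 * xi + rng.gauss(0, 0.3 * xi) for xi in x]

lo, hi = pairs_bootstrap_ci(x, y)
print(f"95% pairs bootstrap-t interval for the slope: [{lo:.3f}, {hi:.3f}]")
```

Note that the upper quantile of $t^*_{stat,b}$ sets the lower endpoint and vice versa, so the interval need not be symmetric around $\hat{\beta}_j$.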

Residual Bootstrap

Let $\hat{\beta}$ be the OLS coefficients from a regression of $y$ on $X$. As the name suggests, the residual bootstrap is based upon the residuals $e = y - X\hat{\beta}$. For $b=1,\ldots,B$ bootstrap samples:

Draw $n$ observations $\{e_1^*,\ldots,e_n^*\}$ $iid$ from the empirical distribution of $\{e_1,\ldots,e_n\}$ (that is, sample $n$ times with replacement from $\{e_1,\ldots,e_n\}$.)
Construct $y_i^* = x_i^\top\hat{\beta} + e_i^*$ for $i=1,\ldots,n$

Run a regression of $y^*$ on $X$. Call this result $\mathtt{lm^*}$
Calculate $\hat{\beta}_j^*$ and $\text{se}(\hat{\beta}_j^*)$ using the regression $\mathtt{lm^*}$.

Form $t^*_{stat,b}$ as

\begin{align*} t^*_{stat,b} &= \frac{\hat{\beta}_j^* - \hat{\beta}_j}{\text{se}(\hat{\beta}_j^*)} \end{align*}

Use the quantiles of $t^*_{stat,b}$ to construct confidence intervals.
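
A minimal Python sketch of the residual bootstrap steps above, again for simple regression and on simulated homoskedastic data. The helper names `ols` and `residual_bootstrap_ci`, the data, and $B = 1000$ are assumptions for illustration. The confidence interval uses one common bootstrap-$t$ form, $[\hat{\beta}_j - q^*_{0.975}\,\text{se}(\hat{\beta}_j),\; \hat{\beta}_j - q^*_{0.025}\,\text{se}(\hat{\beta}_j)]$.

```python
import random
from statistics import mean, quantiles

def ols(x, y):
    """Simple regression of y on x: returns (intercept, slope, classical se of the slope)."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5   # sigma-hat^2 / Sxx, with n - p - 1 = n - 2
    return intercept, slope, se

def residual_bootstrap_ci(x, y, B=1000, seed=0):
    """95% bootstrap-t interval for the slope, holding X fixed and resampling residuals."""
    rng = random.Random(seed)
    b0, b1, se = ols(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    t_star = []
    for _ in range(B):
        e_star = rng.choices(resid, k=len(resid))            # draw residuals with replacement
        y_star = [fi + ei for fi, ei in zip(fitted, e_star)]  # y* = X beta-hat + e*
        _, b1_star, se_star = ols(x, y_star)                  # regress y* on the original X
        t_star.append((b1_star - b1) / se_star)
    cuts = quantiles(t_star, n=40)                            # 2.5% and 97.5% are first and last
    q_lo, q_hi = cuts[0], cuts[-1]
    return b1 - q_hi * se, b1 - q_lo * se

# Simulated data (hypothetical): constant noise variance.
rng = random.Random(413)
x = [rng.uniform(1, 10) for _ in range(40)]
y = [2 + 0.5 * xi + rng.gauss(0, 1.0) for xi in x]

lo, hi = residual_bootstrap_ci(x, y)
print(f"95% residual bootstrap-t interval for the slope: [{lo:.3f}, {hi:.3f}]")
```

The only differences from the pairs bootstrap are what gets resampled (residuals, not rows) and which standard error appears in the denominator.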

Residual vs Pairs Bootstrap

If homoskedasticity appears reasonable, the residual bootstrap is generally preferred

Treats $X$ as fixed, induces randomness through resampling residuals
More aligned with the data generating mechanism we've considered
Performs better in small samples (more stable) than the pairs bootstrap

Under heteroskedasticity, the pairs bootstrap is your only option

Note that the residual bootstrap breaks the association between $e_i$ and $x_i$. This is reasonable under homoskedasticity, but not under heteroskedasticity.
For this reason, residual bootstrap doesn't provide valid inference under heteroskedasticity.
Pairs bootstrap maintains correspondence between pairs $(y_i, x_i)$, and hence preserves heteroskedasticity in the bootstrap world.

Addressing Departures from Assumptions

In the first part of this course, we

Derived certain consequences of the ordinary least squares slope coefficients under the assumptions of the weaker and stronger linear models
Discussed diagnostics to assess whether these assumptions seem reasonable

Now: What happens if certain assumptions seem unreasonable? Can we proceed? If so, how?

What if the true trend doesn't seem linear? (Today)
What happens if the data is actually heteroskedastic? (Lecture 14)
What if normality seems violated? (Lecture 15)

Nonlinearity

If we suspect nonlinearity is present when running a regression of $y$ on $X$, we are concerned that $E(y\mid \tilde{x})$ cannot be written as a linear function of the predictors. For some nonlinear function $g(\cdot)$:

\begin{align*} E(y\mid \tilde{x}) &= g(\tilde{x}) \end{align*}

If the conditional expectation is nonlinear in the predictors, but we nonetheless conduct an ordinary least squares regression of $y$ on $X$,

\begin{align*} E(\tilde{x}^\top\hat{\beta}) &\neq E(y\mid \tilde{x})\\ E(e_i\mid x_i) &\neq 0 \end{align*}

Transformation

If the true conditional expectation is nonlinear, must we abandon linear regression? No...

Consider transformations of the response variable
Consider transformations of the predictor variables

These fall under two main approaches

Transformations driven by intuition / understanding of the data generating process
- Here, we might still try to interpret slope coefficients after transformation
Transformations affording flexibility to approximate nonlinear truth
- Here, our main focus will be on the accuracy of predictions made using our model.

Log Transformations

Perhaps the most commonly applied transformation of predictors and/or response variables is the log transformation

Variable needs to take on strictly positive values
Commonly used with financial data (salaries, GDP, expenditures,...)
Also commonly used for volume measurements
In general, used for variables whose distributions tend to be right skewed (and positive)

Useful when changes in a particular variable occur on a multiplicative / percentage scale, rather than an additive scale

Natural to think of differences in salaries on a percentage basis (salary increases tend to be percentage based)

We'll focus on simple regression today, but everything described here extends naturally to multiple regression.

Might transform certain predictors, and leave others unchanged

Fun Facts about Logarithms

Properties

\begin{align*} \log(a\times b) &= \log(a) + \log(b)\\ \log(a^b) &= b\log(a)\\ \log(a_1\times a_2\times...\times a_n) &= \sum_{i=1}^n \log(a_i) = \log(a_1) + .... + \log(a_n)\\ \log(a + b) &\neq \log(a) + \log(b) \end{align*}

Statistical Consequences

It would be nice if we could relate the mean and standard deviation of a variable on the $\log$ scale to its mean and standard deviation on the original, untransformed scale.

No such relationship exists!

An Inconvenient Truth

Let $y = \{y_1,...,y_n\}$ be positive. Unfortunately, in terms of arithmetic relations,

\begin{align*} \text{mean}(\log(y)) &\neq \log(\text{mean}(y))\\ \text{sd}(\log(y)) &\neq \log(\text{sd}(y)) \end{align*}

In English, the mean of the logarithms of $y$ is not equal to the logarithm of the mean of $y$. Same for $sd$.

If $Z$ is a positive random variable, then in general $E(\log(Z)) \neq \log(E(Z))$.
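
A quick numerical check of the inequality for the mean, using three made-up values (the numbers are hypothetical and chosen only because the arithmetic is easy to follow):

```python
import math
from statistics import mean

y = [1.0, 10.0, 100.0]                      # hypothetical positive values

mean_of_logs = mean([math.log(v) for v in y])
log_of_mean = math.log(mean(y))

print(f"mean(log(y)) = {mean_of_logs:.4f}")  # this is log(10)
print(f"log(mean(y)) = {log_of_mean:.4f}")   # this is log(37)

# Exponentiating the mean of the logs gives the geometric mean, not the mean.
print(f"exp(mean(log(y))) = {math.exp(mean_of_logs):.4f}")
```

Here the mean of the logs is $\log(10)$, while the log of the mean is $\log(37)$. For positive values that are not all equal, the mean of the logs is always the smaller of the two.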

Logs and Percent Changes

Suppose that the difference in salaries between two individuals is 10%:

\begin{align*} x_1/x_2 &= 1.1 \end{align*}

Applying logarithms to both sides,

\begin{align*} \log(x_1/x_2) = \log(x_1)-\log(x_2) &= \log(1.1) \end{align*}

Differences that were multiplicative before become additive after taking a log transform.

Let $\ell = \log(x)$
$\ell_1 - \ell_2 = \log(1.1)$.

In linear regression, slope coefficients relate additive differences in predictor values to additive differences in responses.

Wine Sales and Display Space

Data set: wine_space.csv. For a given bottle of wine, shows how much display space (in feet) the bottle took up, and the corresponding volume of sales.

Is the relationship linear? Look at the behavior for smaller and larger spaces...
The relationship is positive, but it seems like there may be concave curvature

Diminishing Returns to Space

It makes sense intuitively that there are diminishing returns to display space for promoting sales

Clearly: the more space, the more sales
Comparing one foot to two feet: massive jump
Comparing five to six feet: not so much
Perhaps it's best to think of increases in display space on a percentage scale...

Log Transformation

The log transformation provides comparison on a percentage scale automatically: multiplicative differences on the original scale become additive on the log scale

Diminishing returns show up when comparing 1 vs 2 feet to 5 vs 6 feet of display space: both are one-foot increases.
On the log scale, these two differences aren't the same! $\log(2) - \log(1) = \log(2)$ versus $\log(6) - \log(5) = \log(6/5)$

On the $\log$ scale: distances between points on the $x$ axis represent factor changes

A note going forward: the above is true of any base logarithm. Yet for reasons to be discussed, we will only use base-$e$ logarithms when transforming variables for regression analysis

Logarithmic Growth

If percentage differences in $x$ relate to additive differences in $y$, we could consider the following generative model for logarithmic growth:

\begin{align*} y_i &= \beta_0 + \beta_1 \log_e(x_i) + \varepsilon_i, \end{align*}

If we define the transformed variable $\ell_i = \log_e(x_i)$, we have a linear model on the transformed scale, and we could proceed with a linear regression of $y$ on $\ell$.

Comparing the Fits

Let's now show a scatterplot of Sales against $\log_e$-display space

Looks linear!
The log transformation spread out the sales corresponding to smaller display sizes
Clumped the sales corresponding to larger display sizes closer together

The Log in Action

Now, we compare our fits with $x$ on the original scale

The linear approximation wasn't terrible, but the trend is much better approximated by the log transform
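
Since the wine data are not reproduced here, the comparison can be mimicked on simulated data. The sketch below generates responses from a logarithmic growth model with made-up coefficients (100 and 120) and noise level, then compares the residual sum of squares from regressing $y$ on $x$ with that from regressing $y$ on $\ell = \log_e(x)$. The helper `fit_rss` and all numbers are assumptions for illustration.

```python
import math
import random
from statistics import mean

def fit_rss(u, y):
    """Least squares fit of y on one predictor u: returns (intercept, slope, rss)."""
    ubar, ybar = mean(u), mean(y)
    suu = sum((ui - ubar) ** 2 for ui in u)
    slope = sum((ui - ubar) * (yi - ybar) for ui, yi in zip(u, y)) / suu
    intercept = ybar - slope * ubar
    rss = sum((yi - intercept - slope * ui) ** 2 for ui, yi in zip(u, y))
    return intercept, slope, rss

# Hypothetical data: y = 100 + 120 * log(x) + noise, with x between 1 and 8.
rng = random.Random(0)
x = [1 + 7 * i / 49 for i in range(50)]
y = [100 + 120 * math.log(xi) + rng.gauss(0, 5) for xi in x]

ell = [math.log(xi) for xi in x]          # transformed predictor

_, _, rss_linear = fit_rss(x, y)          # regression of y on x
_, slope_log, rss_log = fit_rss(ell, y)   # regression of y on log(x)

print(f"RSS, y on x:      {rss_linear:.1f}")
print(f"RSS, y on log(x): {rss_log:.1f}")
print(f"Slope on log(x):  {slope_log:.1f}")
```

When the truth is logarithmic, the fit on the log scale leaves much less unexplained variation, and its slope lands near the value used to simulate the data.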

The Slope with Log Transformations

Suppose we perform a regression with $\log_e(x)$ as the predictor variable and $y$ as the response. How do we interpret the slope coefficient on the variable $\log_e(x)$?

When the relationship is linear, we said that the slope coefficient, $\hat{\beta}_1$ is the estimated difference in $y$ for two individuals who are different in their predictor variables by one unit
Usual interpretation: comparing $y$ values for two individuals, $i$ and $j$, for whom $\log_e(x_i) - \log_e(x_j) = 1$.

This doesn't provide much intuition in terms of the predictor variable itself.
Would be nice to relate a comparison of $x$ itself (without transformation) to a comparison of predicted values for $y$

Interpretation as a Percentage Change

Our fitted equation is

\begin{align*} \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1\log_e(x_i) \end{align*}

Let's relate the slope coefficient to a multiplicative difference in $x$

Suppose we compared two individuals who differ in their $x$ values by 1 percent. What would the difference in their predicted $y$ values be?

\begin{align*} \hat{y}_1 - \hat{y}_2 &= \hat{\beta}_1(\log_e(1.01x) - \log_e(x))\\ &= \hat{\beta}_1(\log_e(1.01) + \log_e(x) - \log_e(x))\\ &= \hat{\beta}_1\log_e(1.01) \end{align*}

If the two units differed by $\Delta$%, the predicted difference in $y$ would be $\hat{\beta}_1\log_e(1+\Delta/100)$ (an additive difference, not a factor).

Approximate Interpretation

Note that for $|a|$ small (say, $|a| \leq 0.1$), $a \approx \log_e(1+a)$. Why? Taylor Expansion of $\log_e(\cdot)$! This holds for base $e$ only.
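
To spell out the Taylor expansion and why the base matters: expanding around $a = 0$, and using the change-of-base formula for a general base $b$,

\begin{align*} \log_e(1+a) &= a - \frac{a^2}{2} + \frac{a^3}{3} - \cdots \approx a\\ \log_b(1+a) &= \frac{\log_e(1+a)}{\log_e(b)} \approx \frac{a}{\log_e(b)} \end{align*}

For instance, $\log_e(1.01) \approx 0.00995$ and $\log_e(1.1) \approx 0.0953$, close to $0.01$ and $0.1$, whereas $\log_{10}(1.01) \approx 0.0043$. Only for $b = e$ is the constant $1/\log_e(b)$ equal to one.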

Suppose we compared two individuals who differ in their $x$ values by 1 percent. What would we predict the difference in their $y$ values to be?

\begin{align*} \hat{y}_1 - \hat{y}_2 &= \hat{\beta}_1(\log_e(1.01x) - \log_e(x))\\ &= \hat{\beta}_1(\log_e(1.01) + \log_e(x) - \log_e(x))\\ &\approx \hat{\beta}_1(0.01) \end{align*}

For a regression of the form $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1\log_e(x)$, we can roughly interpret the slope as follows:

Approximate Interpretation under Logarithmic Growth

Two observations who differ in $x$ by $\Delta$% are predicted to differ in $y$ by roughly $(\hat{\beta}_1/100)\Delta$ (provided $|\Delta| < 10\%$)

In Our Example

Call:
lm(formula = sales ~ ldisplay)

Coefficients:
(Intercept)     ldisplay
      107.5        124.1

Interpretation of the slope: two bottles of wine whose display footage differs by 1% are predicted to differ in number of sales by roughly 1.241. (Exact number: $124.1\times \log_e(1.01) = 1.235$.)
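
To see how the approximation behaves as the percentage difference grows, the short Python sketch below compares the exact difference $\hat{\beta}_1\log_e(1+\Delta/100)$ with the approximation $(\hat{\beta}_1/100)\Delta$ using the fitted slope of 124.1. The function name `log_x_difference` and the grid of $\Delta$ values are assumptions for illustration.

```python
import math

def log_x_difference(beta1, delta_pct):
    """Predicted difference in y when x differs by delta_pct percent (exact, approximate)."""
    exact = beta1 * math.log(1 + delta_pct / 100)
    approx = beta1 * delta_pct / 100
    return exact, approx

beta1 = 124.1  # slope on log(display) from the regression output above

for delta_pct in (1, 5, 10, 50):
    exact, approx = log_x_difference(beta1, delta_pct)
    print(f"{delta_pct:>3}% difference in x: exact {exact:.3f}, approximate {approx:.3f}")
```

The two agree closely at 1% and drift apart as $\Delta$ grows, which is why the approximate interpretation is restricted to small percentage differences.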

Exponential Growth

In a system undergoing exponential growth as a function of $x$, one may observe that additive differences in $x$ correspond to percentage (multiplicative) differences in $y$. This is consistent with the model

\begin{align*} y_i &= \exp\{\beta_0 + \beta_1 x_i + \varepsilon_i\} \end{align*}

If we take the log transformation of the response, we'd return to a linear model

\begin{align*} \log_e(y_i) &= \beta_0 + \beta_1x_i + \varepsilon_i, \end{align*}

from which we can run a linear regression of $\log_e(y_i)$ on $x_i$ to estimate $\beta$.

Slope Coefficients Under Exponential Growth

Linear regression produces the following prediction equation:

\begin{align*} \log_e(\hat{y}) &= \hat{\beta}_0 + \hat{\beta}_1 x_i \end{align*}

Under exponential growth, consider two units who differ in $x$ by one unit:

\begin{align*} \log_e(\hat{y}_1) - \log_e(\hat{y}_2) &= \hat{\beta}_1\\ \log_e(\hat{y}_1/\hat{y}_2) &= \hat{\beta}_1\\ \frac{\hat{y}_1}{\hat{y}_2} &= \exp\{\hat{\beta}_1\} \end{align*}

That is, two units who differ in $x$ by one unit differ in their predicted values for $y$ by a factor of $\exp\{\hat{\beta}_1\}$. If they differ by $\Delta$ units, their predictions would differ by a factor of $\exp\{\Delta \hat{\beta}_1\}$.

Approximate Interpretation: Exponential Growth

Now, note that by Taylor approximation we also have that $\exp\{a\} \approx 1 + a$, for $|a|$ small (say, $|a| \leq 0.1)$. Consider two units who differ in $x$ by $\Delta$ units under exponential growth:

\begin{align*} \log_e(\hat{y}_1) - \log_e(\hat{y}_2) &= \hat{\beta}_1\Delta\\ \log_e(\hat{y}_1/\hat{y}_2) &= \hat{\beta}_1\Delta\\ \frac{\hat{y}_1}{\hat{y}_2} &= \exp\{\hat{\beta}_1\Delta\}\\ &\approx (1+\hat{\beta}_1\Delta), \end{align*}

So the two predictions differ by approximately $(100\hat{\beta}_1)\Delta$%.

Approximate Interpretation under Exponential Growth

Two observations who differ in $x$ by $\Delta$ units are predicted to differ in $y$ by roughly $(100\hat{\beta}_1)\Delta$% (provided $|\hat{\beta}_1\Delta| < 0.1$)
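
The size of the error in this approximation can be read off the next term of the Taylor expansion. Writing $a = \hat{\beta}_1\Delta$,

\begin{align*} \exp\{a\} &= 1 + a + \frac{a^2}{2} + \frac{a^3}{6} + \cdots \end{align*}

so the approximation $1 + a$ is off by roughly $a^2/2$. At the boundary $a = 0.1$, the exact factor is $\exp\{0.1\} \approx 1.1052$ (a 10.52% difference) against an approximate 10%.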

Other Log Transformations

Example: suppose we were looking at increases in web servers by year during the late 1990s:

\begin{align*} \log_e(\widehat{webserver}) = 13.4 + 0.9\times Year \end{align*}

For two dates differing by a month (1/12 Year), we'd expect the number of web servers to differ by a factor of $\exp\{0.9/12\} = 1.078$.
Using the Taylor approximation, we'd estimate a difference of $100\times 0.9\times (1/12) = 7.5\%$
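
The arithmetic in this example can be checked with a few lines of Python. The function name `growth_factor` is an assumption for illustration. The one-year comparison is included to show what happens when $|\hat{\beta}_1\Delta|$ is well above 0.1.

```python
import math

def growth_factor(beta1, delta):
    """Factor by which predicted y differs when x differs by delta units (exact, approximate)."""
    exact = math.exp(beta1 * delta)
    approx = 1 + beta1 * delta
    return exact, approx

beta1 = 0.9  # slope on Year from the fitted equation above

for label, delta in (("one month", 1 / 12), ("one year", 1.0)):
    exact, approx = growth_factor(beta1, delta)
    print(f"{label}: exact factor {exact:.3f}, approximate factor {approx:.3f}")
```

For one month the exact factor is about 1.078 against an approximate 1.075. For a full year the exact factor is about 2.46 while the approximation gives 1.9, so the percentage shortcut should not be used there.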

Log-Log Transformations

In other applications, relationships may be best thought of as percentage (multiplicative) changes in $x$ being predictive of percentage (multiplicative) changes in $y$. This is the idea behind elasticity in economics. In this case, we'd anticipate the following functional form:

\begin{align*} y_i &= x_i^{\beta_1}\exp\{\beta_0 + \varepsilon_i\} \end{align*}

Taking logs of both sides,

\begin{align*} \log_e(y_i) &= \beta_0 + \beta_1\log_e(x_i) + \varepsilon_i, \end{align*}

so that after log transformation of $x$ and $y$ we have a linear model. We can now run a linear regression of $\log_e(y)$ on $\log_e(x)$.

Slope Coefficients with Log-Log Transformations

Our fitted equation is

\begin{align*} \log_e(\hat{y}) &= \hat{\beta}_0 + \hat{\beta}_1\log_e(x) \end{align*}

Suppose two units differ in $x$ by 1%

\begin{align*} \log_e(\hat{y}_1) - \log_e(\hat{y}_2) &= \hat{\beta}_1(\log_e(1.01x) - \log_e(x))\\ \log_e(\hat{y}_1/\hat{y}_2) &= \hat{\beta}_1\log_e(1.01)\\ \frac{\hat{y}_1}{\hat{y}_2} &= 1.01^{\hat{\beta}_1} \end{align*}

Two units that differ in $x$ by 1% are predicted to differ in $y$ by a factor of $1.01^{\hat{\beta}_1}$.
If they differ by $\Delta$% in $x$, we'd expect them to differ in $y$ by a factor of $(1 + \Delta/100)^{\hat{\beta}_1}$

Approximate Interpretation

Again using the Taylor approximation $\log_e(1+a)\approx a$, note that for $(\hat{y}_1-\hat{y}_2)/\hat{y}_2$ and $\Delta/100$ small, we have that if the units differ in $x$ by $\Delta$ percent:

\begin{align*} \log_e(\hat{y}_1/\hat{y}_2) &= \hat{\beta}_1\log_e(1 + \Delta/100)\\ \log_e\left(1+ \frac{\hat{y}_1-\hat{y}_2}{\hat{y}_2}\right) &= \hat{\beta}_1\log_e(1 + \Delta/100)\\ \frac{\hat{y}_1-\hat{y}_2}{\hat{y}_2} &\approx (\hat{\beta}_1\Delta)/100 \end{align*}

Approximate Interpretation as Percentage

Two observations who differ in $x$ by $\Delta$% are predicted to differ in $y$ by $\hat{\beta}_1 \Delta$% (provided $|\hat{\beta}_1\cdot \Delta| < 10\%$ and $|\Delta| < 10\%$).

Example of Log-Log Regression

Example: volume of demand for Fedex Courier Service as a function of its Price:

\begin{align*} \log_e(\widehat{Volume}) = 8.0 - 0.33\times \log_e(Price) \end{align*}

When comparing two prices for the courier service where one is 1% higher, we'd expect volume to differ by a factor of $1.01^{-0.33}$, or 0.997
Using the approximate interpretation from our Taylor expansion, we'd expect volume to differ by roughly $-0.33$%
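
A short Python check of these two numbers, with a 10% price difference added for comparison. The function name `loglog_factor` is an assumption for illustration.

```python
def loglog_factor(beta1, delta_pct):
    """Factor by which predicted y differs when x differs by delta_pct percent (exact, approximate)."""
    exact = (1 + delta_pct / 100) ** beta1
    approx = 1 + beta1 * delta_pct / 100
    return exact, approx

beta1 = -0.33  # slope on log(Price) from the fitted equation above

for delta_pct in (1, 10):
    exact, approx = loglog_factor(beta1, delta_pct)
    print(
        f"{delta_pct}% higher price: exact {100 * (exact - 1):.2f}%, "
        f"approximate {100 * (approx - 1):.2f}% difference in volume"
    )
```

At 1% the exact and approximate answers both round to $-0.33$%. At 10% the exact difference is about $-3.10$% against an approximate $-3.3$%.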

Prediction Intervals

Suppose that we've run a linear regression using $\log_e(y)$ as a response variable, and that we'd like to form a 95% prediction interval for $y$ at a given value for the predictors $\tilde{x}$.

We could certainly construct prediction intervals for $\log_e(y)$ if the stronger linear model held:

\begin{align*} \log_e(y) &= X\beta + \varepsilon;\;\; \varepsilon \sim MVN(0, \sigma^2_\varepsilon I) \end{align*}

Treating $\log_e(y)$ as the response, we'd assess:

Linearity (pattern in the residuals?)
Homoskedasticity (does variability in standardized residuals change with $x$?)
Normality (are the standardized residuals normally distributed?)

[Note that these residuals are of the form $e_i = \log_e(y_i) - x_i^\top\hat{\beta}$. They are based on the log-responses.]

Prediction Intervals by Exponentiating Endpoints

A $100(1-\alpha)\%$ prediction interval for $\log_e(y)$ under the stronger linear model takes the form:

\begin{align*} \tilde{x}^\top\hat{\beta} \pm t_{1-\alpha/2, n-p-1}\hat{\sigma}_\varepsilon \sqrt{1+\tilde{x}^\top(X^\top X)^{-1}\tilde{x}} \end{align*}

As we'll now illustrate, we can use this to directly get a prediction interval for $y$ by exponentiating the endpoints. That is, letting $lb$ and $ub$ be the lower and upper bounds for the $100(1-\alpha)\%$ PI for $\log_e(y)$, a $100(1-\alpha)\%$ PI for $y$ is

\begin{align*} [\exp\{lb\}, \exp\{ub\}] \end{align*}

Quantiles and Logarithms

Although $E(\log(Z)) \neq \log(E(Z))$ for a positive random variable $Z$, we do have a nice correspondence between quantiles.

Quantiles and Logarithms

Let $Z$ be any (positive!) variable, and let $q_{p}(\cdot)$ be the $p$th quantile/percentile of a distribution. Then,

\begin{align*} q_p(\log(Z)) &= \log(q_p(Z)) \end{align*}

For example:

\begin{align*} q_{0.25}(\log(Z)) &= \log(q_{0.25}(Z))\\ \text{Median}(\log(Z)) &= \log(\text{Median}(Z)) \end{align*}
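
The correspondence for the median, and its failure for the mean, can be checked on a simulated sample. In this sketch the lognormal sample and the sample size of 101 are assumptions for illustration. An odd sample size is used so that the sample median is one of the observed values.

```python
import math
import random
from statistics import mean, median

rng = random.Random(1)
z = [rng.lognormvariate(0.0, 1.0) for _ in range(101)]  # positive, right-skewed values
log_z = [math.log(v) for v in z]

median_of_logs = median(log_z)
log_of_median = math.log(median(z))

mean_of_logs = mean(log_z)
log_of_mean = math.log(mean(z))

print(f"median(log z) = {median_of_logs:.6f}, log(median z) = {log_of_median:.6f}")
print(f"mean(log z)   = {mean_of_logs:.6f}, log(mean z)   = {log_of_mean:.6f}")
```

The two median lines agree, while the two mean lines do not. With an even sample size, a sample median that averages the two middle values would match only approximately, since averaging is not preserved by the log.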

Why Does This Work?

The logarithm is a monotone increasing transformation. This means it preserves order!

The $x$ coordinates of the points shown have the sorting red < purple < blue. The corresponding $y$ coordinates have the same ordering!

Monotonicity

Because the log is a monotone transformation, we know that for $y>0$

\begin{align*} lb\leq \log(y)\leq ub \Leftrightarrow \exp\{lb\} \leq y \leq \exp\{ub\} \end{align*}

This doesn't hold for arbitrary functions $f(\cdot)$. For example, consider $x^2$ with $-3 \leq x \leq 2$: it certainly isn't true that $9\leq x^2\leq 4$.

To justify prediction intervals, note that if $1-\alpha = P(lb \leq \log_e(y)\leq ub)$, then it has to be the case that $1-\alpha = P(\exp\{lb\} \leq y \leq \exp\{ub\})$ by the equivalence shown above.
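
The equivalence can also be seen by simulation. The sketch below is a simplified illustration, not a full regression prediction interval: it assumes $\log_e(y)$ is normal with known, made-up mean and standard deviation (2.0 and 0.5), so it ignores the uncertainty from estimating $\beta$ and $\sigma_\varepsilon$. It checks that exactly the same draws fall inside $[lb, ub]$ on the log scale and inside $[\exp\{lb\}, \exp\{ub\}]$ on the original scale.

```python
import math
import random
from statistics import NormalDist

mu, sigma = 2.0, 0.5                      # hypothetical mean and sd of log(y)
dist = NormalDist(mu, sigma)
lb, ub = dist.inv_cdf(0.025), dist.inv_cdf(0.975)   # 95% interval for log(y)

rng = random.Random(0)
log_y = [rng.gauss(mu, sigma) for _ in range(10_000)]
y = [math.exp(v) for v in log_y]

covered_log = sum(lb <= v <= ub for v in log_y)
covered_y = sum(math.exp(lb) <= v <= math.exp(ub) for v in y)

print(f"Interval for log(y): [{lb:.3f}, {ub:.3f}], coverage {covered_log / len(log_y):.3f}")
print(f"Interval for y:      [{math.exp(lb):.3f}, {math.exp(ub):.3f}], coverage {covered_y / len(y):.3f}")
```

The interval for $y$ is not symmetric around $\exp\{\mu\}$: exponentiating stretches the upper half more than the lower half, but the coverage is unchanged.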