Bootstrapping regression models
Log transformations
Log-Log-transformations
We will now apply the same principles to better approximating the distributions of
\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)} \;\; \text{or} \;\; \frac{\hat{\beta}_j - \beta_j}{\text{se}_{HC2}(\hat{\beta}_j)} \end{align*}We will try to devise a schema to approximate the distribution of $t_{stat}$. In analogy to what we saw when bootstrapping the mean, in the bootstrap world:
The true population slope coefficients will be $\hat{\beta}$.
Each bootstrap sample will generate a value $\hat{\beta}^*$, an estimate of $\hat{\beta}$
Each bootstrap sample will also generate a standard error estimator.
Finally, each bootstrap sample will generate a value for $t^*_{stat}$.
The distribution of $t^*_{stat}$ across simulations will serve as our estimate for the distribution of $t_{stat}$ (rather than using the $t_{n-p-1}$ distribution).
There are two prevalent bootstrap schemes, depending upon whether one is willing to assume homoskedasticity.
Residual Bootstrap (assumes homoskedasticity)
Pairs Bootstrap (valid even if heteroskedastic)
The residual bootstrap will use the test statistic:
\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)}, \end{align*}while the pairs bootstrap will use the test statistic:
\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{se}_{HC2}(\hat{\beta}_j)} \end{align*}The pairs bootstrap resamples pairs of observations $(y_i, x_i)$ rather than residuals. Let $\hat{\beta}$ be the OLS coefficients from a regression of $y$ on $X$. For $b=1,\ldots,B$ bootstrap samples:
Draw $n$ pairs $\{(y_1^*, x_1^*),\ldots,(y_n^*, x_n^*)\}$ $iid$ from the empirical distribution of $\{(y_1,x_1),\ldots,(y_n,x_n)\}$ (that is, sample $n$ times with replacement from $\{(y_1,x_1),\ldots,(y_n,x_n)\}$)
Construct $y^* = (y_1^*,\ldots,y_n^*)^\top$ and $X^*$, the $n\times(p+1)$ matrix whose $i$th row contains $x_i^*$.
Run a regression of $y^*$ on $X^*$. Call this result $\mathtt{lm^*}$
Calculate $\hat{\beta}_j^*$ and $\text{se}_{HC2}(\hat{\beta}_j^*)$ using the regression $\mathtt{lm^*}$.
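Form $t^*_{stat,b}$ as
\begin{align*} t^*_{stat,b} &= \frac{\hat{\beta}_j^* - \hat{\beta}_j}{\text{se}_{HC2}(\hat{\beta}_j^*)} \end{align*}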
Use the quantiles of $t^*_{stat,b}$ to construct confidence intervals.
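The pairs bootstrap steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the data are simulated, the model is a simple regression (one predictor), the helper names `ols_hc2` and `quantile` are made up for this sketch, and the final interval uses the standard bootstrap-$t$ construction $[\hat{\beta}_j - q_{0.975}\,\text{se}_{HC2}(\hat{\beta}_j),\; \hat{\beta}_j - q_{0.025}\,\text{se}_{HC2}(\hat{\beta}_j)]$, where $q_p$ is the $p$th quantile of the $t^*_{stat,b}$ values.

```python
import math
import random

def ols_hc2(x, y):
    """Simple regression of y on x; returns (slope, HC2 standard error of the slope)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    var = 0.0
    for xi, yi in zip(x, y):
        e = yi - intercept - slope * xi           # residual e_i
        h = 1.0 / n + (xi - xbar) ** 2 / sxx      # leverage h_ii
        var += ((xi - xbar) / sxx) ** 2 * e ** 2 / (1.0 - h)
    return slope, math.sqrt(var)

def quantile(values, p):
    """Empirical quantile taken as an order statistic (no interpolation)."""
    s = sorted(values)
    return s[int(p * (len(s) - 1))]

rng = random.Random(0)
n, B = 60, 999

# Simulated heteroskedastic data (illustrative only): noise sd grows with x.
x = [rng.uniform(1.0, 10.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5 * xi) for xi in x]

beta_hat, se_hat = ols_hc2(x, y)

t_stars = []
for b in range(B):
    idx = [rng.randrange(n) for _ in range(n)]    # resample (y_i, x_i) pairs together
    xs = [x[i] for i in idx]
    ys = [y[i] for i in idx]
    beta_star, se_star = ols_hc2(xs, ys)
    # In the bootstrap world the "true" slope is beta_hat.
    t_stars.append((beta_star - beta_hat) / se_star)

q_lo, q_hi = quantile(t_stars, 0.025), quantile(t_stars, 0.975)
# The upper quantile of t* produces the lower endpoint, and vice versa.
ci = (beta_hat - q_hi * se_hat, beta_hat - q_lo * se_hat)
print(f"slope = {beta_hat:.3f}, 95% bootstrap-t CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Resampling indices, rather than $x$ and $y$ separately, is what keeps each $(y_i, x_i)$ pair intact.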
Let $\hat{\beta}$ be the OLS coefficients from a regression of $y$ on $X$. As the name suggests, the residual bootstrap is based upon the residuals $e = y - X\hat{\beta}$. For $b=1,\ldots,B$ bootstrap samples:
Draw $n$ observations $\{e_1^*,\ldots,e_n^*\}$ $iid$ from the empirical distribution of $\{e_1,\ldots,e_n\}$ (that is, sample $n$ times with replacement from $\{e_1,\ldots,e_n\}$.)
Construct $y_i^* = x_i^\top\hat{\beta} + e_i^*$ for $i=1,\ldots,n$
Run a regression of $y^*$ on $X$. Call this result $\mathtt{lm^*}$
Calculate $\hat{\beta}_j^*$ and $\text{se}(\hat{\beta}_j^*)$ using the regression $\mathtt{lm^*}$.
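Form $t^*_{stat,b}$ as
\begin{align*} t^*_{stat,b} &= \frac{\hat{\beta}_j^* - \hat{\beta}_j}{\text{se}(\hat{\beta}_j^*)} \end{align*}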
Use the quantiles of $t^*_{stat,b}$ to construct confidence intervals.
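A comparable sketch of the residual bootstrap follows. As before, the data are simulated, the model is a simple regression, and the helper names `ols` and `quantile` are hypothetical. The errors are drawn from a skewed (non-normal) but homoskedastic distribution, which is the setting where replacing the $t_{n-p-1}$ reference distribution with the bootstrap distribution is most relevant.

```python
import math
import random

def ols(x, y):
    """Simple regression of y on x; returns (intercept, slope, classical se of the slope)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)                        # n - p - 1 with p = 1
    return intercept, slope, math.sqrt(sigma2 / sxx)

def quantile(values, p):
    """Empirical quantile taken as an order statistic (no interpolation)."""
    s = sorted(values)
    return s[int(p * (len(s) - 1))]

rng = random.Random(1)
n, B = 30, 999

# Simulated data (illustrative only): skewed, mean-zero, homoskedastic errors.
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]

b0, b1, se1 = ols(x, y)
fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

t_stars = []
for b in range(B):
    e_star = [rng.choice(resid) for _ in range(n)]      # resample residuals with replacement
    y_star = [fi + ei for fi, ei in zip(fitted, e_star)]
    _, b1_star, se_star = ols(x, y_star)                # X is held fixed
    t_stars.append((b1_star - b1) / se_star)

q_lo, q_hi = quantile(t_stars, 0.025), quantile(t_stars, 0.975)
ci = (b1 - q_hi * se1, b1 - q_lo * se1)
print(f"slope = {b1:.3f}, 95% bootstrap-t CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Note that `rng.choice(resid)` attaches a residual to a fitted value without regard to which $x_i$ it originally came from, which is exactly the step that requires homoskedasticity.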
If homoskedasticity appears reasonable, the residual bootstrap is generally preferred
Treats $X$ as fixed, induces randomness through resampling residuals
More aligned with the data generating mechanism we've considered
Performs better in small samples (more stable) than the pairs bootstrap
Under heteroskedasticity, the pairs bootstrap is the only valid option of the two
Note that the residual bootstrap breaks the association between $e_i$ and $x_i$. This is reasonable under homoskedasticity, but not under heteroskedasticity.
For this reason, residual bootstrap doesn't provide valid inference under heteroskedasticity.
Pairs bootstrap maintains correspondence between pairs $(y_i, x_i)$, and hence preserves heteroskedasticity in the bootstrap world.
In the first part of this course, we
Derived certain consequences of the ordinary least squares slope coefficients under the assumptions of the weaker and stronger linear models
Discussed diagnostics to assess whether these assumptions seem reasonable
Now: What happens if certain assumptions seem unreasonable? Can we proceed? If so, how?
What if the true trend doesn't seem linear? (Today)
What happens if the data is actually heteroskedastic? (Lecture 14)
What if normality seems violated? (Lecture 15)
If we suspect nonlinearity is present when running a regression of $y$ on $X$, we are concerned that $E(y\mid \tilde{x})$ cannot be written as a linear function of the predictors. For some nonlinear function $g(\cdot)$:
\begin{align*} E(y\mid \tilde{x}) &= g(\tilde{x}) \end{align*}If the conditional expectation is nonlinear in the predictors, but we nonetheless conduct an ordinary least squares regression of $y$ on $X$,
\begin{align*} E(\tilde{x}^\top\hat{\beta}) &\neq E(y\mid \tilde{x})\\ E(e_i\mid x_i) &\neq 0 \end{align*}If the true conditional expectation is nonlinear, must we abandon linear regression? No...
Consider transformations of the response variable
Consider transformations of the predictor variables
These fall under two main approaches
Transformations driven by intuition / understanding of the data generating process
Transformations affording flexibility to approximate nonlinear truth
Perhaps the most commonly applied transformation of predictors and/or response variables is the log transformation
Variable needs to take on strictly positive values
Commonly used with financial data (salaries, GDP, expenditures,...)
Also commonly used for volume measurements
In general, used for variables whose distributions tend to be right skewed (and positive)
Useful when changes in a particular variable occur on a multiplicative / percentage scale, rather than an additive scale
Natural to think of differences in salaries on a percentage basis (salary increases tend to be percentage based)
We'll focus on simple regression today, but everything described here extends naturally to multiple regression.
Might transform certain predictors, and leave others unchanged
Properties
\begin{align*} \log(a\times b) &= \log(a) + \log(b)\\ \log(a^b) &= b\log(a)\\ \log(a_1\times a_2\times \cdots \times a_n) &= \sum_{i=1}^n \log(a_i) = \log(a_1) + \cdots + \log(a_n)\\ \log(a + b) &\neq \log(a) + \log(b) \end{align*}It would be nice if we could relate the mean and standard deviation of a variable on the $\log$ scale to its mean and standard deviation on the original, untransformed scale.
No such relationship exists!
An Inconvenient Truth
Let $y = \{y_1,...,y_n\}$ be positive. Unfortunately, in terms of arithmetic relations,
\begin{align*} \text{mean}(\log(y)) &\neq \log(\text{mean}(y))\\ \text{sd}(\log(y)) &\neq \log(\text{sd}(y)) \end{align*}In English, the mean of the logarithms of $y$ is not equal to the logarithm of the mean of $y$. Same for $sd$.
If $Z$ is a positive random variable, then in general $E(\log(Z)) \neq \log(E(Z))$. (By Jensen's inequality, $E(\log(Z)) \leq \log(E(Z))$.)
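A quick numerical check makes the point concrete. The five values below are made up for illustration (a right-skewed set of positive numbers); nothing here comes from a real data set.

```python
import math
import statistics

y = [20.0, 35.0, 50.0, 80.0, 400.0]      # made-up, right-skewed positive values
log_y = [math.log(v) for v in y]

mean_of_logs = statistics.mean(log_y)
log_of_mean = math.log(statistics.mean(y))
sd_of_logs = statistics.stdev(log_y)
log_of_sd = math.log(statistics.stdev(y))

print(f"mean(log(y)) = {mean_of_logs:.3f}   log(mean(y)) = {log_of_mean:.3f}")
print(f"sd(log(y))   = {sd_of_logs:.3f}   log(sd(y))   = {log_of_sd:.3f}")
```

The mean of the logs comes out below the log of the mean; exponentiating the mean of the logs gives the geometric mean, which never exceeds the arithmetic mean.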
Suppose that the difference in salaries between two individuals is 10%:
\begin{align*} x_1/x_2 &= 1.1 \end{align*}Applying logarithms to both sides,
\begin{align*} \log(x_1/x_2) = \log(x_1)-\log(x_2) &= \log(1.1) \end{align*}Differences that were multiplicative before become additive after taking a log transform.
Let $\ell = \log(x)$
$\ell_1 - \ell_2 = \log(1.1)$.
In linear regression, slope coefficients relate additive differences in predictor values to additive differences in responses.
Data set: wine_space.csv. For a given bottle of wine, shows how much display space (in feet) the bottle took up, and the corresponding volume of sales.
Is the relationship linear? Look at the behavior for smaller and larger spaces...
The relationship is positive, but it seems like there may be concave curvature
It makes sense intuitively that there are diminishing returns to display space for promoting sales
Clearly: the more space, the more sales
Comparing one foot to two feet: massive jump
Comparing five to six feet: not so much
Perhaps it's best to think of increases in display space on a percentage scale...
The log transformation provides comparison on a percentage scale automatically: multiplicative differences on the original scale become additive on the log scale
Diminishing returns to scale: compare going from 1 to 2 feet with going from 5 to 6 feet of display space.
Both are one-foot increases, but on a log scale the two differences aren't the same! $\log(2)$ versus $\log(6/5)$
On the $\log$ scale: distances between points on the $x$ axis represent factor changes
A note going forward: the above is true of any base logarithm. Yet for reasons to be discussed, we will only use base-$e$ logarithms when transforming variables for regression analysis
If percentage differences in $x$ relate to additive differences in $y$, we could consider the following generative model for logarithmic growth:
\begin{align*} y_i &= \beta_0 + \beta_1 \log_e(x_i) + \varepsilon_i, \end{align*}If we define the transformed variable $\ell_i = \log_e(x_i)$, we have a linear model on the transformed scale, and we could proceed with a linear regression of $y$ on $\ell$.
Let's now show a scatterplot of Sales against $\log_e$-display space
Looks linear!
The log transformation spread out the sales corresponding to smaller display sizes
Clumped the sales corresponding to larger display sizes closer together
Now, we compare our fits with $x$ on the original scale
The linear approximation wasn't terrible, but the trend is much better approximated by the log transform
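The comparison of the two fits can be mimicked on simulated data. The sketch below does not read wine_space.csv; it generates a stand-in data set from a logarithmic growth model with made-up coefficients, then compares $R^2$ for a regression on $x$ with $R^2$ for a regression on $\log_e(x)$. The helper name `fit_r2` is hypothetical.

```python
import math
import random

def fit_r2(x, y):
    """Simple regression of y on x; returns (intercept, slope, R^2)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - ybar) ** 2 for yi in y)
    return intercept, slope, 1.0 - rss / tss

rng = random.Random(2)

# Simulated stand-in for display space (feet) and sales; NOT the wine data.
display = [rng.uniform(1.0, 8.0) for _ in range(60)]
sales = [100.0 + 120.0 * math.log(d) + rng.gauss(0.0, 10.0) for d in display]

ldisplay = [math.log(d) for d in display]         # transformed predictor

_, slope_lin, r2_lin = fit_r2(display, sales)
_, slope_log, r2_log = fit_r2(ldisplay, sales)

print(f"sales ~ display:      slope = {slope_lin:.1f}, R^2 = {r2_lin:.3f}")
print(f"sales ~ log(display): slope = {slope_log:.1f}, R^2 = {r2_log:.3f}")
```

Because the simulated truth is logarithmic, the regression on the log scale should recover a slope near the generating value and fit better than the straight line in $x$.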
Suppose we perform a regression with $\log_e(x)$ as the predictor variable and $y$ as the response. How do we interpret the slope coefficient on the variable $\log_e(x)$?
When the relationship is linear, we said that the slope coefficient, $\hat{\beta}_1$, is the estimated difference in $y$ for two individuals who differ in their predictor variable by one unit
Usual interpretation: comparing $y$ values for two individuals, $i$ and $j$, for whom $\log_e(x_i) - \log_e(x_j) = 1$.
This doesn't provide much intuition in terms of the predictor variable itself.
Would be nice to relate a comparison of $x$ itself (without transformation) to a comparison of predicted values for $y$
Our fitted equation is
\begin{align*} \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1\log_e(x) \end{align*}Let's relate the slope coefficient to a multiplicative difference in $x$
Suppose we compared two individuals who differ in their $x$ values by 1 percent. What would we predict the difference in their $y$ values to be?
\begin{align*} \hat{y}_1 - \hat{y}_2 &= \hat{\beta}_1(\log_e(1.01x) - \log_e(x))\\ &= \hat{\beta}_1\log_e(1.01) \end{align*}
If the two units differed by $\Delta$%, the predicted difference in $y$ would be $\hat{\beta}_1\log_e(1+\Delta/100)$. This is an additive difference, not a factor.
Note that for $|a|$ small (say, $|a| \leq 0.1$), $a \approx \log_e(1+a)$. Why? Taylor Expansion of $\log_e(\cdot)$! This holds for base $e$ only.
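The quality of this approximation, and the fact that it is specific to base $e$, can be checked numerically. The grid of values for $a$ below is arbitrary.

```python
import math

# Compare a with log_e(1 + a) over a few values of a.
for a in (0.01, 0.05, 0.10, 0.50):
    exact = math.log1p(a)                         # log_e(1 + a)
    rel_err = (a - exact) / exact
    print(f"a = {a:.2f}   log_e(1+a) = {exact:.5f}   relative error = {rel_err:.1%}")

# The same shortcut fails in base 10: log_10(1.01) is not close to 0.01.
natural = math.log1p(0.01)
base10 = math.log10(1.01)
print(f"log_e(1.01) = {natural:.5f}   log_10(1.01) = {base10:.5f}")
```

The relative error grows with $a$, which is why the rule of thumb is restricted to $|a| \leq 0.1$.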
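Returning to two individuals who differ in their $x$ values by 1 percent: applying the approximation $\log_e(1+a)\approx a$ with $a = 0.01$, their predicted $y$ values differ by roughly
\begin{align*} \hat{y}_1 - \hat{y}_2 &= \hat{\beta}_1\log_e(1.01) \approx \hat{\beta}_1/100 \end{align*}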
For a regression of the form $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1\log_e(x)$, we can roughly interpret the slope as follows:
Approximate Interpretation under Logarithmic Growth
Two observations who differ in $x$ by $\Delta$% are predicted to differ in $y$ by roughly $(\hat{\beta}_1/100)\Delta$ (provided $|\Delta| < 10\%$)
Call:
lm(formula = sales ~ ldisplay)

Coefficients:
(Intercept)     ldisplay
      107.5        124.1
Interpretation of the slope: two bottles of wine whose display footage differs by 1% are predicted to differ in number of sales by roughly 1.241. (Exact number: $124.1\times \log_e(1.01) = 1.235$.)
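The exact and approximate interpretations can be compared for several percentage differences, using the fitted slope of 124.1 from the output above. The function names are hypothetical, and the percentages beyond 1% are chosen only to show where the approximation starts to drift.

```python
import math

beta1 = 124.1                                     # fitted slope on log_e(display)

def exact_diff(delta_pct):
    """Predicted difference in sales for a delta_pct% difference in display space."""
    return beta1 * math.log(1.0 + delta_pct / 100.0)

def approx_diff(delta_pct):
    """Taylor shortcut: (beta1 / 100) * delta_pct."""
    return beta1 / 100.0 * delta_pct

for delta_pct in (1, 5, 10, 50):
    print(f"{delta_pct:>3}% more space: exact = {exact_diff(delta_pct):7.3f}, "
          f"approx = {approx_diff(delta_pct):7.3f}")
```

For a 1% difference the two agree to about two decimal places; for a 50% difference the shortcut noticeably overstates the predicted difference.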
In a system undergoing exponential growth as a function of $x$, one may observe that additive differences in $x$ correspond to percentage (multiplicative) differences in $y$. This is consistent with the model
\begin{align*} y_i &= \exp\{\beta_0 + \beta_1 x_i + \varepsilon_i\} \end{align*}If we take the log transformation of the response, we'd return to a linear model
\begin{align*} \log_e(y_i) &= \beta_0 + \beta_1x_i + \varepsilon_i, \end{align*}from which we can run a linear regression of $\log_e(y_i)$ on $x_i$ to estimate $\beta$.
Linear regression produces the following prediction equation:
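\begin{align*} \log_e(\hat{y}) &= \hat{\beta}_0 + \hat{\beta}_1 x \end{align*}Under exponential growth, consider two units who differ in $x$ by one unit:
\begin{align*} \log_e(\hat{y}_1) - \log_e(\hat{y}_2) &= \hat{\beta}_1\\ \log_e(\hat{y}_1/\hat{y}_2) &= \hat{\beta}_1\\ \frac{\hat{y}_1}{\hat{y}_2} &= \exp\{\hat{\beta}_1\} \end{align*}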
That is, two units who differ in $x$ by one unit differ in their predicted values for $y$ by a factor of $\exp\{\hat{\beta}_1\}$. If they differ by $\Delta$ units, their predictions would differ by a factor of $\exp\{\Delta \hat{\beta}_1\}$.
Now, note that by Taylor approximation we also have that $\exp\{a\} \approx 1 + a$, for $|a|$ small (say, $|a| \leq 0.1)$. Consider two units who differ in $x$ by $\Delta$ units under exponential growth:
\begin{align*} \log_e(\hat{y}_1) - \log_e(\hat{y}_2) &= \hat{\beta}_1\Delta\\ \log_e(\hat{y}_1/\hat{y}_2) &= \hat{\beta}_1\Delta\\ \frac{\hat{y}_1}{\hat{y}_2} &= \exp\{\hat{\beta}_1\Delta\}\\ &\approx (1+\hat{\beta}_1\Delta), \end{align*}So the two differ by $(100\hat{\beta}_1)\Delta$%.
Approximate Interpretation under Exponential Growth
Two observations who differ in $x$ by $\Delta$ units are predicted to differ in $y$ by roughly $(100\hat{\beta}_1)\Delta$% (provided $|\hat{\beta}_1\Delta| < 0.1$)
Example: suppose we were looking at increases in web servers by year during the late 1990s:
\begin{align*} \log_e(\widehat{webserver}) = 13.4 + 0.9\times Year \end{align*}For two dates differing by a month (1/12 Year), we'd expect the number of web servers to differ by a factor of $\exp\{0.9/12\} = 1.078$.
Using the Taylor approximation, we'd estimate a difference of $100\times 0.9\times (1/12) = 7.5\%$
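The web server numbers can be reproduced directly, and extending the comparison to longer time gaps shows why the proviso $|\hat{\beta}_1\Delta| < 0.1$ matters. The slope 0.9 is from the fitted equation above; the six-month and one-year gaps are added here purely for illustration.

```python
import math

beta1 = 0.9                                       # fitted slope on Year

def exact_factor(delta_years):
    """Multiplicative difference in predicted web servers for a gap of delta_years."""
    return math.exp(beta1 * delta_years)

def approx_pct(delta_years):
    """Taylor shortcut: (100 * beta1) * delta percent."""
    return 100.0 * beta1 * delta_years

for label, delta in (("one month", 1 / 12), ("six months", 0.5), ("one year", 1.0)):
    exact_pct = 100.0 * (exact_factor(delta) - 1.0)
    print(f"{label:>10}: exact = {exact_pct:6.1f}%, approx = {approx_pct(delta):6.1f}%")
```

At one month, $\hat{\beta}_1\Delta = 0.075$ and the shortcut is close; at one year, $\hat{\beta}_1\Delta = 0.9$ and the shortcut badly understates the exact factor $\exp\{0.9\}$.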
In other applications, relationships may be best thought of as percentage (multiplicative) changes in $x$ being predictive of percentage (multiplicative) changes in $y$. This is the idea behind elasticity in economics. In this case, we'd anticipate the following functional form:
\begin{align*} y_i &= x_i^{\beta_1}\exp\{\beta_0 + \varepsilon_i\} \end{align*}Taking logs of both sides,
\begin{align*} \log_e(y_i) &= \beta_0 + \beta_1\log_e(x_i) + \varepsilon_i, \end{align*}so that after log transformation of $x$ and $y$ we have a linear model. We can now run a linear regression of $\log_e(y)$ on $\log_e(x)$.
Our fitted equation is
\begin{align*} \log_e(\hat{y}) &= \hat{\beta}_0 + \hat{\beta}_1\log_e(x) \end{align*}Suppose two units differ in $x$ by 1%
\begin{align*} \log_e(\hat{y}_1) - \log_e(\hat{y}_2) &= \hat{\beta}_1(\log_e(1.01x) - \log_e(x))\\ \log_e(\hat{y}_1/\hat{y}_2) &= \hat{\beta}_1\log_e(1.01)\\ \frac{\hat{y}_1}{\hat{y}_2} &= 1.01^{\hat{\beta}_1} \end{align*}Two units that differ in $x$ by 1% are predicted to differ in $y$ by a factor of $1.01^{\hat{\beta}_1}$.
If they differ by $\Delta$% in $x$, we'd expect them to differ in $y$ by a factor of $(1 + \Delta/100)^{\hat{\beta}_1}$
Again using the Taylor approximation $\log_e(1+a)\approx a$, note that for $(\hat{y}_1-\hat{y}_2)/\hat{y}_2$ and $\Delta/100$ small, we have that if the units differ in $x$ by $\Delta$ percent:
\begin{align*} \log_e(\hat{y}_1/\hat{y}_2) &= \hat{\beta}_1\log_e(1 + \Delta/100)\\ \log_e\left(1+ \frac{\hat{y}_1-\hat{y}_2}{\hat{y}_2}\right) &= \hat{\beta}_1\log_e(1 + \Delta/100)\\ \frac{\hat{y}_1-\hat{y}_2}{\hat{y}_2} &\approx (\hat{\beta}_1\Delta)/100 \end{align*}Approximate Interpretation as Percentage
Two observations who differ in $x$ by $\Delta$% are predicted to differ in $y$ by $\hat{\beta}_1 \Delta$% (provided $|\hat{\beta}_1\cdot \Delta| < 10\%$ and $|\Delta| < 10\%$).
Example: volume of demand for Fedex Courier Service as a function of its Price:
\begin{align*} \log_e(\widehat{Volume}) = 8.0 - 0.33\times \log_e(Price) \end{align*}When comparing two prices for the courier service where one is 1% higher, we'd expect volume to differ by a factor of $1.01^{-0.33}$, or 0.997
Using the approximate interpretation from our Taylor expansion, we'd expect a difference of about $-0.33$%
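The same check for the log-log example, using the fitted elasticity of $-0.33$. The function names are hypothetical, and the 10% price difference is added only to show the approximation at the edge of its stated range.

```python
beta1 = -0.33                                     # fitted slope on log_e(Price)

def exact_pct_change(delta_pct):
    """Exact predicted % difference in volume for a delta_pct% difference in price."""
    return 100.0 * ((1.0 + delta_pct / 100.0) ** beta1 - 1.0)

def approx_pct_change(delta_pct):
    """Taylor shortcut: beta1 * delta_pct percent."""
    return beta1 * delta_pct

factor_1pct = 1.01 ** beta1
print(f"factor for a 1% higher price: {factor_1pct:.4f}")
for delta_pct in (1, 10):
    print(f"{delta_pct:>2}% higher price: exact = {exact_pct_change(delta_pct):6.3f}%, "
          f"approx = {approx_pct_change(delta_pct):6.3f}%")
```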
Suppose that we've run a linear regression using $\log_e(y)$ as a response variable, and that we'd like to form a 95% prediction interval for $y$ at a given value for the predictors $\tilde{x}$.
We could certainly construct prediction intervals for $\log_e(y)$ if the stronger linear model held:
\begin{align*} \log_e(y) &= X\beta + \varepsilon;\;\; \varepsilon \sim MVN(0, \sigma^2_\varepsilon I) \end{align*}Treating $\log_e(y)$ as the response, we'd assess:
Linearity (pattern in the residuals?)
Homoskedasticity (does variability in standardized residuals change with $x$?)
Normality (are the standardized residuals normally distributed?)
[Note that these residuals are of the form $e_i = \log_e(y_i) - x_i^\top\hat{\beta}$. They are based on the log-responses.]
A $100(1-\alpha)\%$ prediction interval for $\log_e(y)$ under the stronger linear model takes the form:
\begin{align*} \tilde{x}^\top\hat{\beta} \pm t_{1-\alpha/2, n-p-1}\hat{\sigma}_\varepsilon \sqrt{1+\tilde{x}^\top(X^\top X)^{-1}\tilde{x}} \end{align*}As we'll now illustrate, we can use this to directly get a prediction interval for $y$ by exponentiating the endpoints. That is, letting $lb$ and $ub$ be the lower and upper bounds for the $100(1-\alpha)\%$ PI for $\log_e(y)$, a $100(1-\alpha)\%$ PI for $y$ is
\begin{align*} [\exp\{lb\}, \exp\{ub\}] \end{align*}Although $E(\log(Z)) \neq \log(E(Z))$ for a positive random variable $Z$, we do have a nice correspondence between quantiles.
Quantiles and Logarithms
Let $Z$ be any (positive!) variable, and let $q_{p}(\cdot)$ be the $p$th quantile/percentile of a distribution. Then,
\begin{align*} q_p(\log(Z)) &= \log(q_p(Z)) \end{align*}
For example:
\begin{align*} q_{0.25}(\log(Z)) &= \log(q_{0.25}(Z))\\ \text{Median}(\log(Z)) &= \log(\text{Median}(Z)) \end{align*}The logarithm is a monotone increasing transformation. This means it preserves order!
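A small simulation illustrates the contrast between quantiles and means. The sample is simulated from a lognormal distribution chosen only for illustration, and `order_stat_quantile` is a hypothetical helper that takes the quantile to be an order statistic. With interpolated sample quantiles the equality would hold only approximately, so the sketch uses a sample size for which the requested quantiles fall exactly on an observation.

```python
import math
import random
import statistics

def order_stat_quantile(values, p):
    """Sample quantile defined as the order statistic at position p * (n - 1)."""
    s = sorted(values)
    return s[round(p * (len(s) - 1))]

rng = random.Random(3)
z = [rng.lognormvariate(0.0, 1.0) for _ in range(1001)]   # positive, right-skewed
log_z = [math.log(v) for v in z]

for p in (0.25, 0.50, 0.75):
    lhs = order_stat_quantile(log_z, p)           # q_p(log(Z))
    rhs = math.log(order_stat_quantile(z, p))     # log(q_p(Z))
    print(f"p = {p:.2f}: q_p(log Z) = {lhs:.4f}, log(q_p(Z)) = {rhs:.4f}")

# Contrast with the mean, where the two sides disagree.
mean_log = statistics.mean(log_z)
log_mean = math.log(statistics.mean(z))
print(f"mean(log Z) = {mean_log:.4f}, log(mean(Z)) = {log_mean:.4f}")
```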
The $x$ coordinates of the points shown have the sorting red < purple < blue. The corresponding $y$ coordinates have the same ordering!
Because the log is a monotone transformation, we know that for $y>0$
\begin{align*} lb\leq \log(y)\leq ub \Leftrightarrow \exp\{lb\} \leq y \leq \exp\{ub\} \end{align*}This doesn't hold for arbitrary functions $f(\cdot)$. For example, consider $x^2$ with $-3 \leq x \leq 2$: it certainly isn't true that $9\leq x^2\leq 4$.
To justify prediction intervals, note that if $1-\alpha = P(lb \leq \log_e(y)\leq ub)$, then it has to be the case that $1-\alpha = P(\exp\{lb\} \leq y \leq \exp\{ub\})$ by the equivalence shown above.
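The equivalence of the two events can be seen by counting coverage on both scales in a simulation. This sketch makes a simplifying assumption that is not part of the notes: the mean and standard deviation of $\log_e(y)$ at $\tilde{x}$ are treated as known (hypothetical values `mu` and `sigma`), so the interval uses normal quantiles instead of the $t$-based formula above. The point is only that exponentiating the endpoints leaves the coverage unchanged.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(4)
mu, sigma = 2.0, 0.5          # hypothetical mean and sd of log_e(y) at a given x
alpha = 0.05

z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
lb, ub = mu - z * sigma, mu + z * sigma           # interval for log_e(y)

# Draw new responses on the original scale.
y_new = [math.exp(rng.gauss(mu, sigma)) for _ in range(10000)]

# Coverage counted on the log scale and on the original scale.
cover_log = sum(lb <= math.log(v) <= ub for v in y_new)
cover_orig = sum(math.exp(lb) <= v <= math.exp(ub) for v in y_new)

print(f"PI for log(y): ({lb:.3f}, {ub:.3f}), coverage = {cover_log / len(y_new):.3f}")
print(f"PI for y:      ({math.exp(lb):.3f}, {math.exp(ub):.3f}), "
      f"coverage = {cover_orig / len(y_new):.3f}")
```

Note that the interval for $y$ is not symmetric about $\exp\{\mu\}$, even though the interval for $\log_e(y)$ is symmetric about $\mu$.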