Chapter 5 Descriptive Modeling: Basic Regressions
The previous chapter presented several descriptive plots of raw data. In the next two chapters I would like to show some simple modeling methods of the same data. By no means is any of the modeling presented here considered breakthrough modeling. I call this descriptive modeling because all I do is describe the patterns using statistical methods. In a later chapter (Chapter 6) I’ll give examples of explanatory models that attempt to capture mechanism and underlying process.
The difference between descriptive models and explanatory models is that the former are summaries of the data and, while they can be used to forecast, they do not provide understanding of the underlying mechanisms. Descriptive models do not give insight into what will happen if a change is imposed on the system; they don’t tell us the effect social distancing may have, for example, on the future rate of covid-19. Descriptive models are fine for describing what has happened in the past but can be limited in helping us understand how the future will play out under various scenarios, including interventions. I sometimes refer to descriptive modeling as “models of data.” Explanatory, or process, models focus on modeling the underlying mechanism. Explanatory models also use data to estimate parameters and make predictions, but their focus is the underlying mechanism.
I’ll try to use the minimum R features necessary to make the points I want to highlight. There are more complicated approaches using time series packages that format the data differently (e.g., they use special time series objects in R). There are benefits of doing this because there are automatic plotting and estimation routines that can take time series objects and do relevant computations. But I believe that as you are learning this material it makes more sense to learn how to do some of the programming. This will put you in a better position to learn more advanced approaches like those for time series data, which use even more complex programming capabilities. It is best to learn how to program with simpler problems and then gradually add complexity rather than jump immediately to complex cases. So I’ll start with standard R objects and slowly introduce some elements unique to time series data.
For more general treatments of working with time series data see Hyndman & Athanasopoulos; Holmes, Scheuerell & Ward; and the classic Box & Jenkins, with additional authors in newer editions of the book.
5.1 Regression on US State Percentages: Linear Mixed Models on Logs
In the previous chapter Descriptive Plots I showed that on the log scale the curves do not follow a straight line by state for the count/population variable, at least not through the entire time series and likely not even in the early days of the pandemic. There is a belief that pandemics tend to follow an exponential process early on but not necessarily deeper into the pandemic. It isn’t clear that belief is consistent with the Covid-19 data.
I’ll illustrate one type of exponential growth with this simple example. Imagine each infected person infects 2 people and transmission occurs in the first day. The first person who contracts covid in the US infects 2 people, for a total of 3 infected people. The next day those two newly infected individuals each transmit to 2 people, so on that day there are 4 newly infected people plus the 3 infected people from the previous day. This process continues each day, so that each newly infected person transmits to 2 new individuals. What is surprising is that if this process continues unchecked, it will take merely 28 days for the entire US population, approximately 330 million, to be infected with covid. Clearly, this simple exponential process did not play out because 10 months into the pandemic, the US is at 14 million known positive cases. I think 28 days would qualify as falling into the “early days of the pandemic,” so other processes, constraints and factors must be limiting the spread of the disease. As we will see later in this chapter, exponential growth characterizes a family of growth rates, not just the specific one used in this example.
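The doubling process just described can be verified with a few lines of base R. This is a toy calculation under the stated assumptions (each infected person transmits to 2 new people per day), not a model of the actual pandemic:

```r
# Toy model from the text: total infected after n days is
# 1 + 2 + 4 + ... + 2^n = 2^(n+1) - 1
us_population <- 330e6
total_infected <- function(n) 2^(n + 1) - 1
days <- 0
while (total_infected(days) < us_population) days <- days + 1
days                 # 28: under a month to exceed the US population
total_infected(27)   # about 268 million, still short of 330 million
```

Day 27 falls just short of the US population and day 28 overshoots it, which is where the 28-day figure in the text comes from.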
Still, for the time being let’s continue applying the belief that pandemics follow exponential growth. An exponential growth curve can be transformed into a straight line (i.e., linear growth) by taking logs of the data (this will be described later in this chapter). It is much easier to fit linear functions than nonlinear functions such as exponential curves, so many researchers follow the heuristic of applying a log transformation when dealing with these types of nonlinear curves.
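The claim that a log transform turns exponential growth into a straight line is easy to check with made-up numbers (a minimal sketch; the values of the multiplier and growth base below are arbitrary):

```r
# Noise-free exponential data: y = a * b^x
x <- 0:10
y <- 2 * 3^x
# On the log scale: log10(y) = log10(a) + x * log10(b), a straight line
logy <- log10(y)
fit <- lm(logy ~ x)
coef(fit)   # intercept recovers log10(2), slope recovers log10(3)
```

With no noise the fit is exact: the intercept equals log10 of the multiplier and the slope equals log10 of the growth base, which is why fitting a line to logged data is a stand-in for fitting an exponential.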
Our ongoing example focuses on state-level effects, so the states are a natural candidate for a random effects linear regression. That is, each state will have its own intercept and slope on the log scale. The state’s intercept can be interpreted as the log percentage of the population infected with covid-19 on day 0; it can be tricky to interpret the intercept as I have formulated it so far, but I’ll present a different interpretation as we develop more machinery. The slope can be interpreted as the growth rate of the state.
We could use a standard regression model (ordinary least squares) to create a complicated interaction between day and state, so that we have a state factor with 50 levels (i.e., 49 effect or contrast codes), giving us 49 intercepts (main effect of the state factor) and 49 slopes (interaction of the state factor with time) along with a main effect of time, but that is overkill. The random effects approach is a more efficient implementation. Intuitively, each state is modeled as a random draw from a multivariate normal distribution: we draw 50 pairs of slopes and intercepts from that distribution, and from those 50 pairs we compute the predicted scores for each state. The algorithm finds the mean and covariance matrix of that multivariate distribution that minimize the discrepancy (i.e., squared residuals) between the observed and the predicted data. The multivariate normal distribution also allows a covariance between the intercept and slope to account for possible associations such as “states that start early have more rapid growth rates.” As shown in the previous chapter, the log of the percentage of the state population follows an approximate line, though there was some evidence of curvature in the descriptive plots, suggesting a deviation from a straight line and hence from a simple exponential growth curve.
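The “random draw from a multivariate normal” intuition can be sketched in a few lines of base R. The means, standard deviations and correlation below are hypothetical values chosen for illustration, not estimates from the covid data:

```r
# Draw 50 state-level (intercept, slope) pairs from a bivariate normal
# with a negative intercept-slope correlation (illustrative values)
set.seed(1)
n_states <- 50
mu  <- c(-1.3, 0.007)   # hypothetical mean intercept and slope
sds <- c(0.3, 0.001)    # hypothetical random-effect standard deviations
rho <- -0.8             # hypothetical intercept-slope correlation
# Build correlated standard normals, then scale and shift them
z1 <- rnorm(n_states)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n_states)
intercepts <- mu[1] + sds[1] * z1
slopes     <- mu[2] + sds[2] * z2
cor(intercepts, slopes)   # close to rho, up to sampling error
```

In the actual model the analogue of `mu`, `sds` and `rho` is estimated from the data rather than fixed in advance; each state's fitted line then comes from its own (intercept, slope) pair.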
In the following analyses, the variable day is entered into the model as a linear time variable (1, 2, 3, etc). In subsequent analyses we will consider more complicated treatments of the day variable to allow for additional nonlinearity over time.
I decided to drop rows from the data set that have counts of 0, in part because the log of 0 is undefined. There are other strategies, such as re-centering the time variable so that for each state day 1 corresponds to the first day the state showed a case rather than to calendar day. Under this strategy, day 1 would be assigned to March 18 for a state that showed its first case on that calendar day, whereas day 1 would be assigned to March 23 for a state that showed its first case on March 23. A drawback of this strategy is that it makes it a little trickier to model the date at which each state begins to show cases. I call this the “liftoff” of the curve because it is when cases move away from 0. One may want to include predictors of liftoff, or attempt to explain liftoff differences across states or units, and it is easier to model these effects if the day variable follows calendar time. There are more complex modeling approaches, such as a structural equation modeling approach, that allow for more flexibility.
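The re-centering strategy can be sketched on a toy data frame (the real analysis uses allstates.long; the two-state data and column names here are made up for illustration, and I use base R rather than dplyr to keep it self-contained):

```r
# Toy data: two states observed on calendar days 1-4
toy <- data.frame(
  state = rep(c("A", "B"), each = 4),
  day   = rep(1:4, 2),
  count = c(0, 0, 1, 3,   0, 1, 2, 5)
)
# Drop 0-count rows, then recenter so day 1 is each state's first case
toy <- subset(toy, count > 0)
toy$day.since.first <- toy$day - ave(toy$day, toy$state, FUN = min) + 1
toy   # state A's first case (calendar day 3) becomes day 1, etc.
```

After recentering, the states' curves are aligned at liftoff, but the calendar date of liftoff is no longer visible in the time variable, which is the drawback noted above.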
# convert all 0 counts in allstates.long to NA (log of 0 is
# undefined); rows with NA are dropped in subsequent model fits
allstates.long <- allstates.long %>% mutate(count = na_if(count, 0))
# save percent variable and log10 version
allstates.long$percent <- allstates.long$count / allstates.long$population *
    100
allstates.long$percent.log10 <- log10(allstates.long$percent)
5.1.1 Examine distributional assumptions
It is good practice to check the distribution of the log percent variable to make sure there aren’t major issues with symmetry. The log of the percent of the state population that tests positive will be the dependent variable in the regressions we will conduct.
# histogram of percent.log10; not a bad looking distribution
hist(allstates.long$percent.log10, xlab = "log percent", main = "Histogram")
There is evidence of skewness in this variable, which raises concern about the distributional assumptions of the subsequent regressions. We will address this issue in later analyses, but for now let’s plug along with the simpleminded idea that these data are an exponential function of day, so a log transformation will be used.
5.1.2 Linear mixed model
Here are the results of this simple linear mixed model on the log scale.
# run linear mixed model; default optimizer didn't converge so
# switched to Nelder_Mead
out.lmer <- lmer(percent.log10 ~ 1 + day.numeric + (1 + day.numeric | state.abb),
    allstates.long, control = lmerControl(optimizer = "Nelder_Mead"))
if (is_html_output()) tab_model(out.lmer, show.se = T, show.stat = T)
Dependent variable: percent log 10

| Predictors                   | Estimates     | std. Error | CI            | Statistic | p      |
|------------------------------|---------------|------------|---------------|-----------|--------|
| (Intercept)                  | -1.26         | 0.05       | -1.35 – -1.17 | -27.92    | <0.001 |
| day.numeric                  | 0.01          | 0.00       | 0.01 – 0.01   | 45.16     | <0.001 |
| Random Effects               |               |            |               |           |        |
| σ²                           | 0.20          |            |               |           |        |
| τ₀₀ state.abb                | 0.10          |            |               |           |        |
| τ₁₁ state.abb.day.numeric    | 0.00          |            |               |           |        |
| ρ₀₁ state.abb                | -0.82         |            |               |           |        |
| ICC                          | 0.20          |            |               |           |        |
| N state.abb                  | 50            |            |               |           |        |
| Observations                 | 20648         |            |               |           |        |
| Marginal R² / Conditional R² | 0.709 / 0.767 |            |               |           |        |
The estimates, standard errors, CIs, t statistics and p values for the fixed effects are printed first; the parameters of the random effects part are printed in the second part of the table. The random effect parameters, in order, are the variance of the residuals, the variance of the random effect intercept, the variance of the random effect slope, the correlation between the random intercept and the random slope, the adjusted ICC, the number of states, the number of observations (recall 0s have been removed from this data file, so the N refers to the number of days by state for which there was a nonzero number of positive covid-19 cases), and the R\(^2\) values. For more documentation on the computation of these terms see the help page for the function tab_model().
5.1.3 Diagnostics
Now let’s examine plots of residuals to diagnose potential issues with this model. The plot of the residuals against the fitted values should look like a random point cloud in the ideal case and the QQ plot should show a relatively straight line. For a discussion of using residuals to diagnose model assumptions, see my lecture notes.
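The plotting code isn't shown here; for the fitted model above the two diagnostics would be plot(fitted(out.lmer), resid(out.lmer)) and qqnorm(resid(out.lmer)). As a self-contained illustration, here are the same two plots for a simple simulated linear regression where the assumptions hold by construction, so both plots show their ideal appearance:

```r
# Simulated data where a linear model with normal errors is correct
set.seed(2)
x <- 1:100
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
# Residuals vs fitted: ideally a random point cloud around 0
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# QQ plot: ideally points fall on a straight line
qqnorm(resid(fit))
qqline(resid(fit))
```

Departures from these ideal shapes, such as the curvature and long tails described next for the covid model, are what flag assumption violations.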
The residuals against the fitted values suggest a nonlinear pattern remains in the data, meaning we may not have completely linearized the data with the log transformation. Recall the logic rule modus tollens: if we accept “if p, then q” and not-q holds, then we can infer not-p. Applied to this situation: if a functional relation is exponential (p), then a log transformation will linearize the relation (q). We observe from the residuals that the log transform did not linearize the relation (not-q), so we conclude that the original functional relation was not exponential (not-p). While it may be disappointing that the residuals suggest the assumptions of regression are violated, we can draw a relatively strong conclusion: the original relation, whatever it is, cannot be purely exponential.
The qqplot also suggests a deviation from a normal distribution; the residuals appear to have longer tails than expected under normality. Both of these plots suggest there are issues that need to be addressed. We probably don’t want to jump right into another transformation to deal with the skewness, as that may create new issues with interpretation. We selected the log transformation because we are checking for exponential growth, and on the log scale exponential growth should become linear. So the exponential growth model may not be appropriate here, or there may be other issues with the data that are contaminating the exponential growth pattern. We cannot adjudicate between these possibilities from this regression alone.
Even though the residual plots raise concern about this model, let’s continue with this example and examine the state-level slopes and intercepts more directly, with a table listing the slope and intercept for each state and with some plots. In the following plot, the light gray points have confidence intervals that overlap 0, so those parameter estimates are not significantly different from zero. We did not perform a multiple test correction such as Bonferroni or Scheffé to address the 100 tests we just did (50 states each for slope and intercept). This would require only a small modification to the code to adjust the level of the CI to correspond, say, to the Bonferroni corrected levels. There has been extensive debate in the literature as to whether multilevel models require additional corrections for multiple tests.
## (Intercept) day.numeric
## Alabama -1.193 0.00675
## Alaska -1.732 0.00791
## Arizona -1.341 0.00734
## Arkansas -1.314 0.00712
## California -1.312 0.00679
## Colorado -1.154 0.00602
## Connecticut -0.818 0.00505
## Delaware -0.927 0.00561
## Florida -1.221 0.00675
## Georgia -1.102 0.00635
## Hawaii -1.847 0.00667
## Idaho -1.378 0.00721
## Illinois -0.978 0.00596
## Indiana -1.248 0.00675
## Iowa -1.207 0.00693
## Kansas -1.444 0.00746
## Kentucky -1.560 0.00754
## Louisiana -0.661 0.00496
## Maine -1.557 0.00591
## Maryland -1.013 0.00560
## Massachusetts -0.616 0.00447
## Michigan -1.000 0.00547
## Minnesota -1.445 0.00734
## Mississippi -1.044 0.00627
## Missouri -1.479 0.00746
## Montana -1.851 0.00845
## Nebraska -1.234 0.00694
## Nevada -1.229 0.00675
## New Hampshire -1.406 0.00616
## New Jersey -0.572 0.00453
## New Mexico -1.322 0.00680
## New York -0.404 0.00384
## North Carolina -1.404 0.00705
## North Dakota -1.382 0.00769
## Ohio -1.438 0.00703
## Oklahoma -1.598 0.00797
## Oregon -1.633 0.00663
## Pennsylvania -1.150 0.00596
## Rhode Island -0.828 0.00567
## South Carolina -1.333 0.00711
## South Dakota -1.189 0.00700
## Tennessee -1.257 0.00706
## Texas -1.439 0.00740
## Utah -1.331 0.00726
## Vermont -1.504 0.00530
## Virginia -1.282 0.00639
## Washington -1.076 0.00513
## West Virginia -1.701 0.00749
## Wisconsin -1.371 0.00731
## Wyoming -1.577 0.00756
# include 95% CI; could instead do +/- 1 standard error, which gives
# roughly 68% CIs. Light grey points are not statistically different
# from 0
raeff <- REsim(out.lmer)
plotREsim(raeff, labs = T, stat = "mean", level = 0.95)
It turns out that computing standard errors for the prediction is not straightforward for linear mixed models and quite complicated for nonlinear mixed models that we will perform later. You can read about the complexity here.
Here is one method for computing the predicted value and a prediction interval that relies on sampling. To maintain clarity I’ll focus just on two states, but one could do these plots for all 50 states. I’ll print the data for the first 30 days in the state of New York. The table includes the day, the fit from the regression equation, the confidence interval around the fit, the fit placed back on the original scale (i.e., \(10^{\mbox{fit}}\)) and the original data on the percentage scale.
# First New York
nydf <- allstates.long %>% filter(state.abb == "New York")
pred <- data.frame(day.numeric = nydf$day.numeric,
    predictInterval(out.lmer, nydf, which = "full", level = 0.95),
    fit.per = 10^predictInterval(out.lmer, nydf)$fit,
    percap100 = nydf$percap100)
head(round(pred, 3), 30)
## day.numeric fit upr lwr fit.per percap100
## 1 0 -0.410 0.454 -1.30 0.376 0.001
## 2 1 -0.391 0.493 -1.30 0.408 0.001
## 3 2 -0.441 0.518 -1.31 0.376 0.001
## 4 3 -0.397 0.500 -1.33 0.395 0.002
## 5 4 -0.335 0.482 -1.23 0.409 0.002
## 6 5 -0.394 0.518 -1.25 0.435 0.003
## 7 6 -0.398 0.551 -1.31 0.430 0.003
## 8 7 -0.335 0.489 -1.29 0.430 0.005
## 9 8 -0.404 0.523 -1.31 0.418 0.007
## 10 9 -0.368 0.497 -1.28 0.422 0.013
## 11 10 -0.374 0.554 -1.25 0.445 0.022
## 12 11 -0.375 0.507 -1.31 0.417 0.039
## 13 12 -0.361 0.543 -1.25 0.435 0.054
## 14 13 -0.352 0.559 -1.27 0.456 0.078
## 15 14 -0.333 0.580 -1.15 0.457 0.107
## 16 15 -0.342 0.517 -1.23 0.451 0.132
## 17 16 -0.333 0.505 -1.22 0.443 0.159
## 18 17 -0.328 0.600 -1.19 0.429 0.192
## 19 18 -0.312 0.565 -1.21 0.476 0.230
## 20 19 -0.326 0.522 -1.16 0.429 0.269
## 21 20 -0.291 0.600 -1.22 0.479 0.307
## 22 21 -0.346 0.508 -1.19 0.516 0.343
## 23 22 -0.319 0.563 -1.17 0.494 0.390
## 24 23 -0.317 0.570 -1.12 0.479 0.465
## 25 24 -0.274 0.617 -1.18 0.521 0.518
## 26 25 -0.302 0.604 -1.17 0.498 0.573
## 27 26 -0.316 0.645 -1.18 0.490 0.626
## 28 27 -0.313 0.578 -1.22 0.472 0.682
## 29 28 -0.314 0.583 -1.17 0.515 0.733
## 30 29 -0.285 0.633 -1.14 0.541 0.771
ggplot(pred) + geom_line(aes(day.numeric, fit)) +
    geom_ribbon(aes(day.numeric, ymin = lwr, ymax = upr), alpha = 0.2) +
    geom_point(aes(day.numeric, y = log10(percap100))) +
    ggtitle("New York")
Note that the linear pattern (line and prediction band) doesn’t do a good job of capturing the actual data points. Some points are overpredicted in the early days, then underpredicted, then overpredicted again. This produces the pattern of negative, then positive, then negative residuals we saw in one of the earlier residual plots. The data looked approximately linear, but on close inspection a model that is linear in the log scale systematically misses the underlying pattern.
To check that this pattern is not just a fluke for New York, let’s look at another state. Michigan tells a similar story: over-, under-, then over-prediction. These plots suggest systematic deviations from linearity.
# similarly for Michigan
midf <- allstates.long %>% filter(state.abb == "Michigan")
pred <- data.frame(day.numeric = midf$day.numeric,
    predictInterval(out.lmer, midf),
    fit.per = 10^predictInterval(out.lmer, midf)$fit,
    percap100 = midf$percap100)
head(round(pred, 3), 30)
## day.numeric fit upr lwr fit.per percap100
## 1 0 -1.003 -0.488 -1.53 0.098 0.000
## 2 1 -1.009 -0.417 -1.56 0.099 0.000
## 3 2 -1.012 -0.428 -1.54 0.101 0.000
## 4 3 -0.957 -0.387 -1.53 0.097 0.000
## 5 4 -0.995 -0.379 -1.57 0.103 0.000
## 6 5 -0.960 -0.370 -1.56 0.110 0.000
## 7 6 -0.987 -0.370 -1.54 0.113 0.000
## 8 7 -0.934 -0.370 -1.55 0.106 0.001
## 9 8 -0.973 -0.392 -1.53 0.116 0.001
## 10 9 -0.953 -0.389 -1.54 0.116 0.002
## 11 10 -0.951 -0.379 -1.54 0.111 0.003
## 12 11 -0.951 -0.411 -1.53 0.111 0.005
## 13 12 -0.948 -0.346 -1.52 0.117 0.006
## 14 13 -0.921 -0.372 -1.50 0.114 0.011
## 15 14 -0.933 -0.366 -1.52 0.124 0.015
## 16 15 -0.950 -0.367 -1.56 0.117 0.019
## 17 16 -0.899 -0.322 -1.47 0.129 0.025
## 18 17 -0.905 -0.330 -1.50 0.121 0.030
## 19 18 -0.889 -0.269 -1.50 0.123 0.039
## 20 19 -0.884 -0.318 -1.45 0.124 0.049
## 21 20 -0.919 -0.312 -1.48 0.126 0.058
## 22 21 -0.880 -0.292 -1.49 0.131 0.068
## 23 22 -0.893 -0.305 -1.46 0.127 0.080
## 24 23 -0.886 -0.315 -1.42 0.134 0.097
## 25 24 -0.905 -0.333 -1.47 0.127 0.112
## 26 25 -0.887 -0.316 -1.42 0.134 0.132
## 27 26 -0.864 -0.285 -1.43 0.142 0.148
## 28 27 -0.861 -0.286 -1.46 0.141 0.163
## 29 28 -0.884 -0.308 -1.40 0.147 0.178
## 30 29 -0.861 -0.267 -1.41 0.143 0.197
ggplot(pred) + geom_line(aes(day.numeric, fit)) +
    geom_ribbon(aes(day.numeric, ymin = lwr, ymax = upr), alpha = 0.2) +
    geom_point(aes(day.numeric, y = log10(percap100))) +
    ggtitle("Michigan")
These plots suggest a major issue with this linear mixed model regression. The observation that the trajectories do not follow a straight line in the log scale suggests these data do not follow an exponential growth model; the process is more complicated than that. We see similar violations of straight lines in the death data (e.g., see the New York Times graphs on the number of deaths for a similar curvature away from a straight line). The violation of the exponential could be due to many factors: interventions such as social distancing; diseases tending to follow an exponential curve early but then switching to a different process (see below for logistic growth curves); additional sources of noise contaminating the data, such as reporting bias; etc.
Now that we understand the slopes and intercepts, let’s make some predictions. I’ll go 5 days out. The first plot will be on the log scale, but be mindful that these predictions are based on an exponential growth model that the residual analysis suggests may not hold for these data.
max.day.numeric <- max(allstates.long$day.numeric)
max.day <- max(allstates.long$day)
# predict days.ahead number of days
days.ahead <- 5
# be sure order of states is the same across all parts of this command
predmat <- data.frame(
    day.numeric = rep(c(0:(max.day.numeric + days.ahead)), 50),
    state.abb = rep(rownames(coef(out.lmer)$state.abb),
        each = max.day.numeric + days.ahead + 1),
    prettyval = rep(allstates.long$prettyval[order(match(allstates$`Province/State`,
        state.population$NAME))], each = max.day.numeric + days.ahead + 1),
    population = rep(allstates.long$population[order(match(allstates$`Province/State`,
        state.population$NAME))], each = max.day.numeric + days.ahead + 1))
# compute predictions
predmat$prediction <- predict(out.lmer, predmat)
# linear plot of predictions as estimated
predmat %>% mutate(label = if_else(day.numeric == max(day.numeric),
        as.character(prettyval), NA_character_)) %>%
    ggplot(aes(day.numeric, prediction, group = prettyval,
        color = prettyval)) +
    geom_line() +
    geom_label_repel(aes(label = label), nudge_x = 2, na.rm = T,
        segment.color = "grey50", label.size = 0.01, size = 2.5,
        show.legend = F) +
    coord_cartesian(clip = "off") +
    scale_x_continuous(limits = c(0, max.day.numeric + days.ahead + 3)) +
    theme(legend.position = "none",
        plot.margin = margin(0.1, 1, 0.1, 0.1, "cm")) +
    geom_vline(xintercept = max.day.numeric + 0.5, color = "red") +
    annotate(geom = "text", x = max.day.numeric - 30, y = -5,
        label = "Days with observations", color = "red") +
    annotate(geom = "text", x = max.day.numeric + 8, y = -5,
        label = "Forecast", color = "red")
Using the same data, I raise 10 to those predictions to see the percentages with their predicted curves. The Y axis is on the percentage scale (count/population * 100).
# plot of same predictions as 10^prediction to put back on original
# percentage scale
predmat %>% mutate(label = if_else(day.numeric == max(day.numeric),
        as.character(prettyval), NA_character_)) %>%
    ggplot(aes(day.numeric, 10^prediction, group = prettyval,
        color = prettyval)) +
    geom_line() +
    geom_label_repel(aes(label = label), nudge_x = 2, na.rm = T,
        segment.color = "grey50", label.size = 0.01, size = 2.5,
        show.legend = F) +
    coord_cartesian(clip = "off") +
    scale_x_continuous(limits = c(0, max.day.numeric + days.ahead + 3)) +
    theme(legend.position = "none",
        plot.margin = margin(0.1, 1, 0.1, 0.1, "cm")) +
    geom_vline(xintercept = max.day.numeric + 0.5, color = "red") +
    annotate(geom = "text", x = max.day.numeric - 30, y = 16,
        label = "Days with observations", color = "red") +
    annotate(geom = "text", x = max.day.numeric + 8, y = 16,
        label = "Forecast", color = "red")
Other approaches I could have taken include using the state’s population as a weight and computing weighted linear mixed models on the log of the count data, or, because I’m working in the log scale, using the state population as an offset in the linear mixed model (as in offset(log(population))) with log counts as the dependent variable. These are alternative ways of taking population size into account. Further checks include accounting for possibly correlated residuals. For example, it isn’t reasonable to assume that the residual at day t is independent of the residual at day t+1 for a given state, yet that is what the model above assumed. I’ll show how to include a correlation pattern on the residuals later in this chapter. There is clearly a lot more one would need to do to justify the analytic strategy in this section.
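The offset idea can be sketched with a plain lm() on simulated data; the mixed-model version would swap in lmer() and add the random effect terms. All variable names and parameter values below are made up for illustration. Because log10(count) = log10(percent) - 2 + log10(population), including offset(log10(population)) lets a model of log counts recover the same growth rate as a model of log percentages:

```r
# Simulated two-"state" data where log10(percent) grows linearly in day
set.seed(3)
toy <- data.frame(day = rep(1:20, 2),
                  population = rep(c(1e6, 5e6), each = 20))
toy$percent <- 10^(-3 + 0.01 * toy$day + rnorm(40, 0, 0.05))
toy$count <- toy$percent / 100 * toy$population
# Offset model: log10(population) enters with a fixed coefficient of 1
fit.offset <- lm(log10(count) ~ day + offset(log10(population)),
                 data = toy)
coef(fit.offset)   # slope recovers the growth rate, roughly 0.01
```

The offset term has its coefficient fixed at 1 rather than estimated, which is exactly what distinguishes it from simply adding log population as a predictor.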
Of course, once the basic structure of this linear mixed model on the log scale is set (e.g., the error structure on the residuals is determined, how best to use the state’s population in the model, etc.), one could bring in additional predictors. We could, for example, take some of the data sets we downloaded that record the dates when states took different measures and see whether any of those state mandates changed, say, the slopes for the states. At that point, once the error structure, linearity, and random effect structure are determined, the rest is routine regression, following the same criteria as ordinary regression. For example, one could include interactions between predictors or control for other predictors.
5.1.4 Comparison with polynomial regression
In the previous model I showed that the linear mixed model regression (allowing each state to have its own slope and intercept) on the log of the state-level percentages can be used to model these data. Here I want to go on a little tangent and illustrate how well a polynomial can mimic the exponential pattern we see in the data.
Let’s take the case that in a particular locale the rate of covid-19 follows this general curve. In this subsection I made up the data, so I know the curve follows exponential growth with an additive error term (the plus \(\epsilon\)): I added random normally distributed noise with mean 0 and standard deviation 1.5.
I’ll run several regressions in order and produce one plot at the end. The points on the graph are the observed data points, the colored curves emerge from different regression models. The first regression is a linear regression with just x (days 1-25) as a linear predictor. I added a violet straight line to indicate the best fitting straight line—not a good fit. The second regression adds a quadratic term and will appear as a red curve. This is an improvement, but it is nonmonotonic and shows a decrease in the first few days (which is not a pattern displayed in the raw data). The third regression adds a cubic term and fits the points relatively well.
Next, I estimate the exponential directly using nonlinear regression. Instead of the parameter being a multiplier of the predictor (like in linear regression), the parameter serves as the growth factor. The green curve follows the data closely. Here only one parameter was estimated, and it was .1500, very close to the actual .15 I used to generate the data. The polynomial did quite well but needed 4 parameters (an intercept and three slopes), and those parameters are not easy to interpret. The exponential yields a very good fit with only one interpretable parameter. This makes sense because these hypothetical data were generated from an exponential model. Note that here I am using the nls() command, which estimates a nonlinear regression.
Finally, for comparison, I run a linear regression on these data after log transforming the Y variable, then re-express the predicted scores back on the original scale, as I did in the predictions from the linear mixed model with the lmer() command above. I’ll draw that curve in black; the result is very similar to the nonlinear regression with the exponential and to the third order polynomial.
# set random seed so results appear the same each run
set.seed(101202019)
x <- 1:25
# here is the true model that generated the data
y <- 1 * exp(0.15 * x) + rnorm(length(x), 0, 1.5)
plot(x, y)
# linear regression with one predictor, draw violet curve
out.lm <- lm(y ~ x)
summary(out.lm)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.150 -3.676 -0.611 1.977 16.006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.13 2.23 -3.19 0.0041 **
## x 1.47 0.15 9.76 1.2e-09 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.42 on 23 degrees of freedom
## Multiple R-squared: 0.806, Adjusted R-squared: 0.797
## F-statistic: 95.3 on 1 and 23 DF, p-value: 1.2e-09
lines(x, predict(out.lm), col = "violet")
# add quadratic term, draw red curve
out.q <- lm(y ~ x + I(x^2))
summary(out.q)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.269 -2.061 0.498 1.971 7.055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2569 1.7863 2.38 0.0262 *
## x -1.0626 0.3166 -3.36 0.0029 **
## I(x^2) 0.0973 0.0118 8.23 3.7e-08 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.74 on 22 degrees of freedom
## Multiple R-squared: 0.952, Adjusted R-squared: 0.948
## F-statistic: 220 on 2 and 22 DF, p-value: 2.86e-15
lines(x, predict(out.q), col = "red")
# add cubic term, draw blue curve
out.c <- lm(y ~ x + I(x^2) + I(x^3))
summary(out.c)
##
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.326 -1.077 -0.044 0.807 4.874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.89469 1.85025 -1.02 0.31748
## x 1.52620 0.60425 2.53 0.01965 *
## I(x^2) -0.14682 0.05345 -2.75 0.01208 *
## I(x^3) 0.00626 0.00135 4.63 0.00015 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.98 on 21 degrees of freedom
## Multiple R-squared: 0.976, Adjusted R-squared: 0.973
## F-statistic: 290 on 3 and 21 DF, p-value: <2e-16
lines(x, predict(out.c), col = "blue")
# nonlinear exponential directly estimated, draw green
# curve
out.exp <- nls(y ~ exp(beta * x), start = list(beta = 0.5))
summary(out.exp)
##
## Formula: y ~ exp(beta * x)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## beta 0.150017 0.000941 159 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.76 on 24 degrees of freedom
##
## Number of iterations to convergence: 12
## Achieved convergence tolerance: 5.2e-06
lines(x, predict(out.exp), col = "green")
# log transform followed by linear regression
out.ln <- lm(log(y) ~ x)
summary(out.ln)
##
## Call:
## lm(formula = log(y) ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1274 -0.1167 -0.0457 0.2062 0.6039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.24687 0.14114 -1.75 0.094 .
## x 0.16446 0.00949 17.32 1.1e-14 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.342 on 23 degrees of freedom
## Multiple R-squared: 0.929, Adjusted R-squared: 0.926
## F-statistic: 300 on 1 and 23 DF, p-value: 1.08e-14
Polynomials can do a reasonable job of capturing functions, but they are not always easy to interpret without some additional work (e.g., Cudeck & du Toit, 2002). The third order polynomial got close, but good luck interpreting the four parameters (intercept and three \(\beta\)s). The quadratic, while close, predicts that the counts in the early days will drop (the red curve decreases from day 1 to day 4), which is not good because the data points don’t seem to show that pattern.
The nonlinear model (nls) estimated the parameters quite closely: the true growth parameter was .15 and the estimate was .1500, and the true residual standard deviation was 1.5 with an estimate of 1.76. Another detail to highlight is that the linear model on the log data (black curve) was almost indistinguishable from the true exponential pattern as estimated by the fits of the green line. The estimates from the linear model (lm) on the log transformed data are also close to the true values. The intercept in the log linear model should correspond to the log of the multiplier in the true model, which is log(1) = 0, and the estimated intercept (-.247) is in that neighborhood; the slope (.164) is close to the true growth parameter .15.
While polynomials may do a good job of fitting the data in hand, they may not be useful for prediction and forecasting, that is, extrapolating beyond the range of the data. Here I take the same plot just examined but extend the prediction 25 days beyond the data. The vertical black line separates the region of data (left) from the region of prediction (right). The green curve is the estimated exponential growth fit to the data and extended 25 days out. The blue curve is the cubic, which fits the data relatively well (left side) but, extrapolated 25 days beyond the data, may do a poor job of predicting the actual curvature of the exponential function. The gray bands are 95% prediction bands. Note that the bands for the exponential curve appear to be too narrow; this is a consequence of the fact that the theory for such intervals in the nonlinear case is not clear cut. A simulation-based approach to estimating the prediction interval may be better.
x <- 1:50
# pad y with missing data from day 26 to 50
y <- c(y, rep(NA, 25))
plot(x, y, ylim = c(0, 1800))
# lines(x,exp(predict(out.ln,newdata= data.frame(x=x))),
# col='black')
title("Failure of polynomial extrapolation")
abline(v = 25.5, col = "black")
text(20, 500, "Data")
text(30, 500, "Prediction")
text(45, 1800, "True Exponential", col = "green")
text(45, 100, "Estimated Polynomial", col = "blue")
pred.exp <- predFit(out.exp, newdata = data.frame(x = x[26:50]),
se.fit = T, interval = "prediction", level = 0.95)
polygon(c(x[26:50], rev(x[26:50])), c(pred.exp$fit[, "lwr"],
rev(pred.exp$fit[, "upr"])), col = "gray", border = NA)
pred.c <- predFit(out.c, newdata = data.frame(x = x[26:50]),
se.fit = T, interval = "prediction", level = 0.95)
polygon(c(x[26:50], rev(x[26:50])), c(pred.c$fit[, "lwr"],
rev(pred.c$fit[, "upr"])), col = "gray", border = NA)
lines(x, predict(out.exp, newdata = data.frame(x = x)),
col = "green")
lines(x, predict(out.c, newdata = data.frame(x = x)), col = "blue")
Polynomials are frequently used. I found a website that uses a 4th-order polynomial to model covid-19 data from Spain, Italy, and the US. That page also focuses on the daily incidence rate, a topic I mentioned earlier. Personally, I'm suspicious of using such high-order terms to model these curves, especially when they are used to forecast beyond the data (i.e., extrapolation). [As of June 17, the website no longer shows polynomials.]
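The failure of polynomial extrapolation is easy to reproduce in a self-contained toy example (this uses made-up noiseless data, not the chapter's covid-19 counts): a cubic fit to exponential data over days 1 to 25 looks excellent in sample, yet falls far short of the true curve by day 50.

```r
# toy example: fit a cubic to exponential data on days 1-25,
# then extrapolate to day 50
x <- 1:25
y <- exp(0.15 * x)                 # noiseless exponential growth
fit <- lm(y ~ poly(x, 3))          # cubic polynomial fit
# in-sample fit is excellent ...
insample.r2 <- summary(fit)$r.squared
# ... but the extrapolation falls far short of the true curve
pred50 <- predict(fit, newdata = data.frame(x = 50))
true50 <- exp(0.15 * 50)
```

With these numbers the cubic predicts only a fraction of the true day-50 value, even though its in-sample R-squared is essentially 1.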
The polynomial idea of approximating functions can be extended in more complex ways. Here is a tutorial using MARS (multivariate adaptive regression splines) in R; this is probably the direction I'd go if I were interested in pursuing this type of approximation approach. But I'd rather stick to more interpretable functional forms for most of the remaining sections.
5.1.5 Thoughts on the log transformation
This type of model, where we transform the dependent variable by taking logs, creates an additive error term on the log scale; that is, log(counts) = f(days) + \(\epsilon\). But maybe that isn't the right error structure. Maybe the error is added to the counts rather than to the log of the counts; that is, the data may follow counts = f(days) + \(\epsilon\), which is a different error structure than the one on the log scale. The point is that while one may think of a transformation merely as a way to linearize a model, the transformation also carries an implicit set of assumptions and has potential side effects on the modeling, such as introducing a particular error structure.
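The difference between the two error structures can be seen with a small deterministic sketch (the mean curve, the 0.2 shift, and the disturbance of 2 counts are illustrative values, not estimates from these data): an error that is additive on the log scale is a constant shift of the log curve, while an error that is additive on the counts produces a shift on the log scale that depends on the size of the count.

```r
# two error structures viewed on the log scale
x <- 1:25
mu <- exp(0.15 * x)                 # an underlying exponential mean curve
# additive on the log scale: log(counts) = log(mu) + 0.2
# this is a constant shift, so the log curve stays straight
shift.log <- (log(mu) + 0.2) - log(mu)
# additive on the counts: counts = mu + 2
# the implied shift on the log scale now depends on mu:
# large when counts are small, vanishing when counts are large
shift.count <- log(mu + 2) - log(mu)
```

So fitting a straight line to log(counts) implicitly assumes the first structure; if the truth is the second, the log-scale residuals will fan in as the counts grow.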
We now turn to modeling the curvature directly using nonlinear regression methods, which not only permit complicated functional forms such as exponentials but also permit modeling the error as an additive effect on the counts, as in counts = f(days) + \(\epsilon\).
5.2 Exponential Nonlinear Mixed Model Regression
In the previous section I illustrated how to run a nonlinear regression using the nls() command to estimate exponential growth.
5.2.1 Exponential Background
For a good explanation of exponential growth and logistic growth (which will be described in the next subsection) see this video.
You can think of this exponential form as being similar to a balance in a savings account that builds with compound interest. At each time step, a percentage is added to the current principal thus increasing the balance in the savings account. At the subsequent time step, the principal is now higher so the interest earned on that day will also be higher than the previous time step. In the case of the covid-19 positive test counts, if each day the total count grows by some percentage, analogous to the principal in the savings account, then the growth is exponential.
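The savings-account analogy can be written as a few lines of R (a minimal sketch with arbitrary numbers, not data from the chapter): iterating a percentage increase step by step is exactly the closed-form exponential.

```r
# compound interest as iterated percentage growth
balance <- 100           # starting principal (arbitrary)
rate <- 1.05             # 5% growth per step
steps <- 10
b <- balance
for (i in 1:steps) b <- b * rate   # add 5% of the current principal each step
# the closed form gives the same answer: balance * rate^steps
```

Replacing "principal" with "cumulative case count" gives the exponential growth model used below.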
allstates.long.new <- groupedData(count ~ day.numeric |
state.abb, data = allstates.long)
# this is a more general function that can estimate
# liftoff l (the point at which counts begin to be nonzero);
# for now set l = 0
richfu <- function(d, l, rate, y, b1) {
ifelse(d < l, 0, y * rate^(b1 * (d - l)))
}
# example exponential growth model
plot(0:max.day.numeric, richfu(0:max.day.numeric, 0, 1.15,
0.000555, 1) * 19500000/100, type = "l", xlab = "days",
ylab = "counts", main = "Hypothetical plot of counts following an exponential growth \n(tied to population of New York)")
I’m using one form of the exponential where the “rate” parameter is the base and it is raised to the power t. I did this to maintain a connection to the savings account analogy where the principal grows at rate x (where, say, a 5% interest rate has x = 1.05). The form is
\[ \begin{aligned} \mbox{count} &= y r^t \end{aligned} \] where t is the day, r is the rate (i.e., an “interest rate” of 5% has rate = 1.05), and y is a scaling parameter. The hypothetical plot above used a rate of 1.15 and a multiplier of .000555, which is in part based on the population of NY of 19.5M. I estimated these numbers from data collected in mid March, early into the covid-19 pandemic, and extrapolated out 60 days. Back in mid-March the idea of having 350,000 cases in New York within a couple of months seemed crazy. On May 7 New York reported 327,000 cases.
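A quick back-of-the-envelope check of this extrapolation, plugging the rate (1.15) and multiplier (.000555) quoted above into the same population scaling used in the earlier plotting code, shows how the counts reach the hundreds of thousands within two months:

```r
# sanity check of the exponential extrapolation for New York
# (parameters taken from the text; pop is the NY population of 19.5M)
rate <- 1.15
y <- 0.000555
pop <- 19500000
count <- function(t) y * rate^t * pop / 100   # counts on day t
count(0)    # about a hundred cases at day 0
count(60)   # hundreds of thousands of cases by day 60
```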
There are other forms of the exponential that are equivalent, just on a different scale. For example, on the natural log scale, the form I'm using, \(yr^t\), is equivalently expressed as \(y\exp(\ln(r)\,t)\), and in base 10 it is equivalently expressed as \(y \cdot 10^{\log_{10}(r)\,t}\). These formulations are equivalent; one just needs to keep track of the scale of the rate across the different choices of base.
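The equivalence across bases is easy to verify numerically (arbitrary values of y and r):

```r
# three ways of writing the same exponential
y <- 2
r <- 1.15
t <- 0:20
f1 <- y * r^t                    # base-r form
f2 <- y * exp(log(r) * t)        # natural-log form
f3 <- y * 10^(log10(r) * t)      # base-10 form
```

All three produce the identical curve; only the scale on which the rate is expressed changes.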
5.2.2 Exponential: The Liftoff Parameter
The exponential function can vary in terms of liftoff, where the curve breaks away from 0.
Liftoff is the parameter “l” I included in the richfu function in the previous code chunk.
Here I plot two hypothetical states with identical parameters except that one state (black) has a liftoff of 0 and the other state (blue) has a liftoff of 4. The counts in the state depicted by the black curve appear to shoot up and, by the end of the sequence, are almost double the counts of the state represented by the blue curve.
plot(0:max.day.numeric, richfu(0:max.day.numeric, 1, 1.23,
0.000555, 1) * 19500000/100, type = "l", xlab = "days",
ylab = "counts", main = "Hypothetical plot of counts for two states\n with different liftoffs")
lines(0:max.day.numeric, richfu(0:max.day.numeric, 4, 1.23,
0.000555, 1) * 19500000/100, col = "blue")
To illustrate how time locking can reveal structure, let me use the previous plot, where the two states appear to have different patterns. Time locking the two states to the first day on which each has nonzero cases shows that both states have identical trajectories. One state simply started earlier than the other, so the patterns in the previous plot look (dramatically) different, and from a delay of just 4 days! Much like singing “Row, row, row your boat” in rounds: the voices sing the same melody but start at different times, creating a harmonious whole.
# kludgy code to impose time locking
plot(c(0:max.day.numeric)[-1], richfu(c(0:max.day.numeric)[-1],
1, 1.23, 0.000555, 1) * 19500000/100, type = "l", xlab = "days",
ylab = "counts", main = "Hypothetical plot of counts for two states \n with different liftoffs but time locked")
lines(c(0:max.day.numeric)[-1 * c(1:6)] - 5, richfu(c(0:max.day.numeric)[-1 *
c(1:6)] - 2, 4, 1.23, 0.000555, 1) * 19500000/100, col = "blue")
There are better ways to time lock using tools written specifically for time series data, but those involve much more explanation than I want to go into just to illustrate the usefulness of time locking. For additional information, see R packages written for time series data, such as the functional data analysis packages.
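A cleaner, general-purpose version of the time-locking idea can be sketched in base R (toy data; the column names and the threshold of 10 cases are illustrative, not the chapter's variables):

```r
# time-lock each unit to its first day reaching 10 cases
df <- data.frame(state = rep(c("A", "B"), each = 6),
                 day = rep(0:5, 2),
                 count = c(0, 0, 12, 30, 75, 190,
                           0, 12, 30, 75, 190, 480))
# first day each state reaches 10 cases
onset <- tapply(df$day[df$count >= 10], df$state[df$count >= 10], min)
# shift each state's clock so its own onset becomes day 0
df$day.locked <- unname(df$day - onset[df$state])
```

After locking, the two states' counts coincide on their shared locked days even though state B started one calendar day earlier.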
A good example of controlling for different liftoff patterns is the work of John Burn-Murdoch of the Financial Times. These visualizations differ from others I have seen on the web in that the curves are plotted not in calendar time but in terms of when each country hit X cases (X is sometimes 3 or 10, depending on the graph). This normalizes the onset and allows us to compare across units that had different starting points. A common phrase for this is “time locking” to a particular event, in this case time locking to when a country hit, say, 10 cases.
There is a sense in which we didn’t have to worry about liftoff in the linear mixed model on the log scale we ran in the previous subsection because the intercept parameter essentially took care of that. We could complicate the exponential model by including a random effect intercept, but I’ll keep things simple in these notes.
5.2.3 Estimating the exponential with the nonlinear mixed model
Let’s estimate an exponential growth model using a nonlinear mixed model regression. This mimics the LMER model we ran earlier in these notes, but rather than using a log transformation to convert the data into straight lines, this time we model the curvature directly with an exponential function. To simplify things I constrain the liftoff parameter to 0 and the multiplier b1 to 1. To estimate all four parameters simultaneously we would need to move to a different R package, saemix, which uses a stochastic EM algorithm, because nlme cannot handle a model this complex without additional groundwork, such as setting up a self-starting function that carries information about the gradient. Some additional work would also be needed to check that all four parameters can be estimated: there may be a redundancy between the rate and b1 parameters (related to the choice of base for the exponent, described earlier in the Exponential Background subsection). Sometimes parameters cannot be separately identified.
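The rate/b1 redundancy is easy to verify numerically: because rate^(b1 t) = (rate^b1)^t, two different parameter pairs trace the identical curve, so the pair cannot be separately identified from the data. A quick check (richfu is redefined here so the chunk is self-contained):

```r
# same growth function as in the chapter
richfu <- function(d, l, rate, y, b1) {
    ifelse(d < l, 0, y * rate^(b1 * (d - l)))
}
d <- 0:30
f1 <- richfu(d, 0, 1.15, 0.5, 2)     # rate = 1.15,   b1 = 2
f2 <- richfu(d, 0, 1.15^2, 0.5, 1)   # rate = 1.3225, b1 = 1
# f1 and f2 are the same curve under different parameter values
```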
For this estimation of the exponential function I’m using percent of positive cases in the state (i.e., count/population) rather than the raw count.
currentversion <- system("git describe --tags", intern = T)
save.image(file = paste0("data-archive/", currentversion,
"-dataworkspace.RData"))
print(paste0("File name associated with this run is ", currentversion,
"-dataworkspace.RData"))
## [1] "File name associated with this run is v.95-123-g5a7aba6-dataworkspace.RData"
out <- nlme(percent ~ richfu(day.numeric, l = 0, rate, y,
    b1 = 1), fixed = rate + y ~ 1, random = rate + y ~ 1,
    data = na.omit(allstates.long.new), start = c(rate = 1.01,
        y = 0.35), verbose = T, control = nlmeControl(msMaxIter = 500,
        opt = "nlminb", pnlsTol = 0.01, msVerbose = T, minScale = 0.01))
# keep a named copy of the fit; later code refers to out.exponential
out.exponential <- out
| | Value | Std.Error | DF | t-value | p-value |
|---|---|---|---|---|---|
| rate | 1.007 | 9.558e-05 | 20597 | 10540 | 0 |
| y | 0.5937 | 0.03331 | 20597 | 17.82 | 1.559e-70 |

| Min | Q1 | Med | Q3 | Max |
|---|---|---|---|---|
| -4.284 | -0.803 | -0.3264 | 0.2745 | 4.104 |

| | Observations | Groups | Log-restricted-likelihood |
|---|---|---|---|
| state.abb | 20648 | 50 | -29663 |
if (is_latex_output()) stargazer(out.exponential, header = F,
type = "latex", report = "vcst", single.row = T, omit.stat = "n",
omit.table.layout = "n", digits = 2, digits.extra = 2)
# intervals(out.exponential, which='fixed')
knitr::kable(intervals(out.exponential)[[1]], caption = "Confidence Intervals for Exponential Model Estimates")
| | lower | est. | upper |
|---|---|---|---|
| rate | 1.007 | 1.007 | 1.008 |
| y | 0.528 | 0.594 | 0.659 |
# idea for bootstrapping to get a prediction interval,
# but be careful about the time aspect; it may be
# better to simulate data
# https://stackoverflow.com/questions/55552080/how-to-obtain-confidence-intervals-for-non-linear-mixed-model-logistic-growth-c
The fixed effects parameters refer to state-level parameters. You can think of these as the typical values for the states. The random effects part refers to the variability across states in those parameters. For example, while the fixed effects y parameter is estimated at 0.59, the states have a standard deviation of 0.233.
Let’s plot the fitted values of each state. These are the predicted exponential curves according to the model. Note that this plot represents the model estimates rather than the actual data.
predmatnonlin <- data.frame(day.numeric = rep(c(0:max.day.numeric),
50), state.abb = rep(rownames(coef(out.exponential)),
each = max.day.numeric + 1))
predmatnonlin$prediction <- predict(out.exponential, list(day.numeric = predmatnonlin$day.numeric,
state.abb = predmatnonlin$state.abb), level = 1)
predmatnonlin.exponential <- predmatnonlin
nonlinreg <- ggplot(predmatnonlin, aes(day.numeric, prediction,
group = state.abb, color = state.abb)) + geom_line()
nonlinreg
We can examine specific states. The best fitting exponential curve for New York is shown in the next figure (observed data points are in blue).
nydf <- predmatnonlin[predmatnonlin$state.abb == "New York",
]
nydf$percent <- allstates.long$percent[allstates.long$state.abb ==
"New York"]
nydf.exponential <- nydf
ggplot(nydf, aes(day.numeric, prediction, group = state.abb,
color = state.abb)) + geom_line() + geom_point(aes(x = day.numeric,
y = percent), col = "blue") + ggtitle("New York: Observed data in blue; exponential growth estimate in red")
Consistent with what we observed in the log-transformed data and the linear mixed model, the points do not appear to follow an exponential curve; New York is one example.
We now examine the residuals to evaluate whether there is any structure the model is failing to pick up. The overall plot of the residuals suggests some major issues. For example, I colored the residuals corresponding to New York in blue and connected them with line segments to help them stand out. There is additional structure in those residuals as well.
plot(fitted(out.exponential), resid(out.exponential), main = "New York in blue")
# suspect weird points are new york; so put NY in blue
# points and connected dots with a curve
points(fitted(out.exponential)[names(fitted(out.exponential)) ==
"New York"], resid(out.exponential)[names(resid(out.exponential)) ==
"New York"], col = "blue")
lines(fitted(out.exponential)[names(fitted(out.exponential)) ==
"New York"], resid(out.exponential)[names(resid(out.exponential)) ==
"New York"], col = "blue")
Here is a zoomed in version of the residual plot focusing in on the major “clump” of points.
# let's drop outliers and look at the resid plot (zoom
# in for fits < .08 and abs(resid) < .01) this helps us
# see the majority of the residual plot
plot(fitted(out.exponential)[fitted(out.exponential) < 0.08 &
abs(resid(out.exponential)) < 0.01], resid(out.exponential)[fitted(out.exponential) <
0.08 & abs(resid(out.exponential)) < 0.01], xlab = "fits < .08",
ylab = "abs(resid) < .01", main = "zoom in ")
The QQ plot raises further suspicion about this model due to its long tails.
Overall, the exponential curve does not appear to offer a solid representation of these data. This is consistent with the conclusion we reached from the linear mixed model on the log data. We will move to a different form that may be more appropriate for these data.
5.3 Logistic Nonlinear Mixed Regression
Let’s switch to a logistic growth model (which is not the same as logistic regression for binary data). The nlme package has some additional functions to simplify the fitting of logistic growth models (e.g., SSlogis), which make it easier to set up starting values. In other work I’ve seen more efficient estimation (and less fussiness around starting values) with the stochastic EM approach to fitting nonlinear regression models such as in the package saemix.
The nlme() command has a difficult time with the three parameters of the logistic growth model. After a little tinkering with starting values, I was able to get a reasonable fit estimating all three parameters of the logistic growth model with two of them as random effects (i.e., allowed to vary by state). A more careful analysis would be needed to make this publishable, including exploring other parameterizations of logistic growth that may behave better with nlme. I give hints in the commented-out sections of the R code.
The logistic growth model starts off being close to an exponential at first but eventually tapers off. Many biological systems follow a logistic growth function. Here is a link to a relatively simple explanation of logistic growth.
This is the form of logistic growth I’m using:
\[ \begin{aligned} count &= \frac{asymptote}{1 + ye^{\beta_1(day.mid - day)}} \end{aligned} \] The asymptote parameter controls the maximum value the function reaches; the day.mid parameter is the day at which the function is halfway to the asymptote, which is also the inflection point; and \(\beta_1\) is a rate parameter whose reciprocal \(1/\beta_1\) is the distance between the inflection point day.mid and the day at which the count is 73% of the asymptote (basically, the difference between day.mid and the day corresponding to the count asymptote\(\times\frac{1}{1+e^{-1}}\); this reciprocal is the scal parameter of SSlogis, see Pinheiro & Bates). It is possible to include an extra multiplier parameter y, but be careful that the model is identified, because \(ye^{x}\) can be reexpressed as \(e^{u + x}\) where \(u=\log{(y)}\), creating issues in separately estimating y from additive parameters inside the exponential. For most of the following I’ll constrain y to 1, thus setting the scale, and not estimate the y parameter.
The next three plots examine how each of the three parameters controls the shape of the logistic growth curve. I make use of the SSlogis() function to define these logistic growth curves. The SSlogis() function labels these parameters Asym, xmid, and scal, respectively.
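Before varying the parameters, their interpretations can be verified numerically (a quick check with arbitrary parameter values; SSlogis() is in base R's stats package): the curve is at half the asymptote at xmid, and one scal unit past xmid it reaches the 73% point, asym/(1 + exp(-1)).

```r
# numeric check of the SSlogis parameterization
Asym <- 100
xmid <- 40
scal <- 8
# value at the midpoint: exactly half the asymptote
half <- as.numeric(SSlogis(xmid, Asym, xmid, scal))
# value one scal unit past the midpoint: about 73% of the asymptote
sc <- as.numeric(SSlogis(xmid + scal, Asym, xmid, scal))
```

(as.numeric() strips the gradient attribute that the self-starting function attaches to its value.)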
# vary asymp
x <- 1:80
xmid <- 25
scal <- 10
ggplot(data.frame(day.numeric = x, y = SSlogis(x, 10000,
xmid, scal)), aes(x = x)) + stat_function(fun = SSlogis,
args = list(12000, xmid, scal), aes(colour = "12000")) +
stat_function(fun = SSlogis, args = list(10000, xmid,
scal), aes(colour = "10000")) + stat_function(fun = SSlogis,
args = list(8000, xmid, scal), aes(colour = " 8000")) +
scale_colour_manual(name = "Asymp", values = c("deeppink",
"dodgerblue3", "green"), labels = c(`12000` = "12000",
`10000` = "10000", `8000` = "8000")) + guides(color = guide_legend(reverse = T)) +
xlab("day.numeric") + ylab("Cumulative Count") + ggtitle("Vary the asymptote parameter (how high)")
# vary xmid
x <- 1:80
xmid <- c(25, 45, 65)
scal <- 10
ggplot(data.frame(day.numeric = x, y = SSlogis(x, 10000,
xmid, scal)), aes(x = x)) + stat_function(fun = SSlogis,
args = list(10000, xmid[1], scal), aes(colour = as.character(xmid[1]))) +
stat_function(fun = SSlogis, args = list(10000, xmid[2],
scal), aes(colour = as.character(xmid[2]))) + stat_function(fun = SSlogis,
args = list(10000, xmid[3], scal), aes(colour = as.character(xmid[3]))) +
scale_colour_manual(name = "xmid", values = c("deeppink",
"dodgerblue3", "green")) + guides(color = guide_legend(reverse = T)) +
xlab("day.numeric") + ylab("Cumulative Count") + ggtitle("Vary the xmid parameter (when liftoff happens)")
# vary scale
x <- 1:80
xmid <- 45
scal <- c(2, 10, 20)
ggplot(data.frame(day.numeric = x, y = SSlogis(x, 10000,
xmid, scal)), aes(x = x)) + stat_function(fun = SSlogis,
args = list(10000, xmid, scal[1]), aes(colour = " 2")) +
stat_function(fun = SSlogis, args = list(10000, xmid,
scal[2]), aes(colour = as.character(scal[2]))) +
stat_function(fun = SSlogis, args = list(10000, xmid,
scal[3]), aes(colour = as.character(scal[3]))) +
scale_colour_manual(name = "scal", values = c("deeppink",
"dodgerblue3", "green")) + guides(color = guide_legend(reverse = T)) +
xlab("day.numeric") + ylab("Cumulative Count") + ggtitle("Vary the scale parameter (once start, how steep)")
Let’s fit this logistic growth function to these data. I now use the per capita count of positive tests out of 10,000 to normalize by state population.
The following R code makes use of the self-starting function SSlogis(), which has some useful features for running nonlinear regressions. I could account for state differences by including a weights argument (weights = varIdent(form = ~1 | state.abb)), which allows each state to have its own residual variance, or adjust for population by changing the dependent variable from raw counts to, say, per capita counts. I'll leave it as an exercise for the reader to check how that changes the results.
# version was defined at the beginning of the index.Rmd
# file; get the current tag and use it as a filename
# prefix
currentversion <- system("git describe --tags", intern = T)
save.image(file = paste0("data-archive/", currentversion,
"-dataworkspace.RData"))
print(paste0("File name associated with this run is ", currentversion,
"-dataworkspace.RData"))
## [1] "File name associated with this run is v.95-123-g5a7aba6-dataworkspace.RData"
# per capita out of 10,000
allstates.long.new$percap10000 <- allstates.long.new$percap100 *
100
# starts <- getInitial(percap10000~ SSlogis(day.numeric,
# Asym=600, xmid=200, scal=50), data=allstates.long.new)
out.perc <- nlme(percap10000 ~ SSlogis(day.numeric, Asym,
xmid, scal), fixed = Asym + xmid + scal ~ 1, random = Asym +
xmid + scal ~ 1, data = na.omit(allstates.long.new),
start = c(Asym = 850, xmid = 275, scal = 70), verbose = T,
control = nlmeControl(msMaxIter = 500, opt = "nlminb",
pnlsTol = 1e-04, msVerbose = T, minScale = 1e-04,
maxIter = 400, eval.max = 400, tolerance = 1, msTol = 1e-04))
# count version: out.logistic <- out
# use the per capita version throughout
out.logistic <- out.perc
if (is_html_output()) pander(summary(out.logistic))
| | Value | Std.Error | DF | t-value | p-value |
|---|---|---|---|---|---|
| Asym | 1161 | 107.8 | 20596 | 10.77 | 5.77e-27 |
| xmid | 292.2 | 9.276 | 20596 | 31.5 | 1.014e-212 |
| scal | 55.25 | 3.066 | 20596 | 18.02 | 5.129e-72 |

| Min | Q1 | Med | Q3 | Max |
|---|---|---|---|---|
| -3.499 | -0.4451 | 0.1083 | 0.819 | 5.096 |

| | Observations | Groups | Log-restricted-likelihood |
|---|---|---|---|
| state.abb | 20648 | 50 | -103273 |
# if (is_latex_output()) texreg(out.logistic,
# booktabs=T, dcolumn = T, use.packages=F,
# float.pos='h', single.row=T)
if (is_latex_output()) stargazer(out.logistic, header = F,
type = "latex", report = "vcst", single.row = T, omit.stat = "n",
omit.table.layout = "n", digits = 2, digits.extra = 2,
align = T)
# confidence intervals of the parameters
knitr::kable(intervals(out.logistic, which = "fixed")[[1]],
caption = "95% confidence intervals: Logistic growth model")
| | lower | est. | upper |
|---|---|---|---|
| Asym | 949.9 | 1161.2 | 1372.6 |
| xmid | 274.0 | 292.2 | 310.4 |
| scal | 49.2 | 55.2 | 61.3 |
# Checking residuals for assumptions
plot(fitted(out.logistic), resid(out.logistic), main = "New York in blue")
# suspect weird points are new york; so put NY in blue
# points and connected dots with a curve
points(fitted(out.logistic)[names(fitted(out.logistic)) ==
"New York"], resid(out.logistic)[names(resid(out.logistic)) ==
"New York"], col = "blue")
lines(fitted(out.logistic)[names(fitted(out.logistic)) ==
"New York"], resid(out.logistic)[names(resid(out.logistic)) ==
"New York"], col = "blue")
# let's drop outliers and look at the resid plot (zoom
# in for fits < .08 and abs(resid) < .01) this helps us
# see the majority of the residual plot
plot(fitted(out.logistic)[fitted(out.logistic) < 50 & abs(resid(out.logistic)) <
7.5], resid(out.logistic)[fitted(out.logistic) < 50 &
abs(resid(out.logistic)) < 7.5], xlab = "fits < 50",
ylab = "abs(resid) < 7.5", main = "zoom in ")
## lag ACF
## 1 0 1.000
## 2 1 0.999
## 3 2 0.997
## 4 3 0.994
## 5 4 0.989
## 6 5 0.984
## 7 6 0.978
We examine the same three residual plots that we used with the exponential growth model and see a slight improvement in the zoomed-in version and slightly less extreme tails. While better, these residual plots suggest there is still some structure that is not adequately modeled. I also include the autocorrelation function, which suggests substantial autocorrelation in the residuals that is not being modeled.
Below is the plot of the predicted curves for this logistic growth model. Notice that the fits for some states begin to taper off according to the logistic growth model. One could examine these differences to see which model performs better in predictive accuracy (i.e., see whether the actual data for the state taper off or continue to grow exponentially). Such patterns are testable and provide the ideal way to compare models and decide which one better represents the data.
predmatnonlin <- data.frame(day.numeric = rep(c(0:max.day.numeric),
50), state.abb = rep(rownames(coef(out.logistic)), each = max.day.numeric +
1))
predmatnonlin$prediction <- predict(out.logistic, list(day.numeric = predmatnonlin$day.numeric,
state.abb = predmatnonlin$state.abb), level = 1)
predmatnonlin.logistic <- predmatnonlin
nonlinreg.logistic <- ggplot(predmatnonlin, aes(day.numeric,
    prediction, group = state.abb, color = state.abb)) +
    geom_line() + ggtitle("Logistic Growth Model: Predicted Curves by State")
nonlinreg.logistic
temp <- allstates.long.new$percap10000[order(predmatnonlin$state.abb,
predmatnonlin$day.numeric)]
predmatnonlin$percap10000 <- temp
ggplot(predmatnonlin, aes(day.numeric, prediction)) + geom_line(size = 1) +
geom_point(data = allstates.long.new, aes(x = day.numeric,
percap10000, group = state.abb), color = "red",
size = 1, alpha = 0.1) + facet_wrap_paginate(~state.abb,
nrow = 5, ncol = 5, page = 1)
ggplot(predmatnonlin, aes(day.numeric, prediction)) + geom_line(size = 1) +
geom_point(data = allstates.long.new, aes(x = day.numeric,
percap10000, group = state.abb), color = "red",
size = 1, alpha = 0.1) + facet_wrap_paginate(~state.abb,
nrow = 5, ncol = 5, page = 2)
Here is the predicted curve for New York with the actual data points in blue. We know from the analyses of the residuals that the model fit is still not ideal, but this plot suggests a logistic growth pattern is a reasonable fit to the data and better than the pure exponential growth model. Where the model fails to capture the data well is the liftoff and the sharp increase once the curve begins to rise. There is also a pattern in the data around day 75 that is worth examining (e.g., checking what happened around that time in New York, such as a change in counting policy or a change in diagnostic test), and after day 55 the actual data continue to increase more rapidly than the predicted logistic growth curve.
nydf <- predmatnonlin[predmatnonlin$state.abb == "New York",
]
nydf$percap10000 <- allstates.long.new$percap10000[allstates.long$state.abb ==
"New York"]
nydf.logistic <- nydf
ggplot(nydf, aes(day.numeric, prediction, group = state.abb,
    color = state.abb)) + geom_line() + geom_point(aes(x = day.numeric,
    y = percap10000), col = "blue") + ggtitle("New York (blue = actual per capita 10,000; red = logistic growth)")
Another way to compare models is to use an index such as the AIC, a measure based on information theory, akin to adjusted R-squared, that can be used to compare regression models that vary in the number of parameters. One needs to be careful when using the AIC across models whose dependent variables are on different scales, as in these notes. For example, the log transform of percentage cases (count/population) in the linear mixed model is on a different scale than the exponential nonlinear regression model computed above. R provides such information-based criteria through AIC() and an alternative, BIC(). Here I give the AIC and BIC for the logistic growth model. When comparing models with comparable dependent variables, the model with the lower AIC (or BIC) value is selected.
## [1] 206567
## [1] 206646
The AIC is a function of the likelihood function and the number of parameters; BIC is a function of those two values as well as the number of data points. The two indices are functionally related by BIC = AIC + k*(log(n)-2). The logistic growth model I estimated has 7 parameters (three fixed effects, two random effects variances, one correlation between the two random effects and the residual variance) with n=20648.
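The relationship BIC = AIC + k*(log(n) - 2) can be checked directly in R on any fitted model; here is a minimal sketch with toy data (not the chapter's model):

```r
# verify the AIC/BIC relationship on a simple linear model
set.seed(1)
x <- 1:50
y <- 2 + 0.3 * x + rnorm(50)
fit <- lm(y ~ x)
# parameters counted by the likelihood: 2 coefficients + residual variance
k <- attr(logLik(fit), "df")
n <- nobs(fit)
# BIC(fit) equals AIC(fit) + k * (log(n) - 2)
```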
Another consideration is that the residuals may not be independent given that these are time series data: yesterday’s residual may be correlated with today’s. Here I add a correlation structure to the nlme command using an autoregressive lag of 1 (AR p = 1) and a moving average lag of 1 (MA q = 1). The ARMA implementation in nlme, though, operates entirely on the residuals. The AR part refers to correlation with the previous residual, and the MA part refers to a set of independent and identically distributed noise terms apart from the residuals (see Pinheiro & Bates, 2000, as this is not standard). You can think of the MA noise terms as “factors” that are added to produce an additional source of correlation across residuals, much like a common factor in a factor analysis model, but separate from the serial correlation of residuals. One could also try a heterogeneous residual variance model (the weights argument described above) along with the correlated error. More diagnostics would be needed to settle on an appropriate correlational structure for the error term, plus the usual checks on the residuals to ensure we are properly meeting the statistical assumptions.
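In the nlme package the correlation structure is supplied through the correlation argument, e.g. something like correlation = corARMA(p = 1, q = 1, form = ~ day.numeric | state.abb) added to the nlme() call. As a runnable stand-in, here is the same idea on toy AR(1) data with gls(), where the payoff of modeling the serial correlation is easy to see:

```r
library(nlme)

# toy series with AR(1) errors (phi = 0.6), not the chapter's data
set.seed(3)
n <- 200
e <- as.numeric(arima.sim(n = n, list(ar = 0.6)))
x <- 1:n
y <- 1 + 0.05 * x + e
d <- data.frame(x = x, y = y)

# same mean model, with and without an AR(1) error structure
fit0 <- gls(y ~ x, data = d)
fit1 <- gls(y ~ x, data = d, correlation = corAR1(form = ~ x))
# the AR(1) model should fit markedly better (lower AIC)
```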
Here is a test of the two logistic growth models (with and without the correlated error structure). The model with the correlated error structure fits significantly better with only two additional parameters: one for the autoregressive portion and one for the moving average portion.
# test the reduced (no correlation) to the full (with
# ARMA(0,1) correlational structure)
anova(out.logistic, out.logistic2)
For illustration purposes, here is another way of fitting this model using a different package, lme4. While lme4 is in some ways easier to use, it is not as flexible as nlme, which allows, for example, a correlated error structure on the residuals and the ability to impose structure on the covariance matrix of the random effects. This model includes random effects for the asym, xmid, and scal parameters (an earlier version had convergence issues when all three parameters were allowed to be random effects, but the current run converges).
# starts <- getInitial(percap10000 ~
# SSlogis(day.numeric, asym,xmid,scal),
# data=na.omit(allstates.long.new)) starts
# second way to get starting values, just run a regular
# nonlinear regression without random effects to plug
# into nlmer or nlme temp <- nls(percap10000 ~
# SSlogis(day.numeric, asym, xmid,scal),
# data=na.omit(allstates.long.new)) coef(temp)
out.logistic.nlmer <- nlmer(percap10000 ~ SSlogis(day.numeric,
asym, xmid, scal) ~ asym + xmid + scal | state.abb,
data = na.omit(allstates.long.new), start = c(asym = 600,
xmid = 125, scal = 25), control = nlmerControl(tolPwrss = 0.1))
| | Estimate | Std. Error | t value |
|---|---|---|---|
| asym | 1140 | 1.369 | 833.2 |
| xmid | 298.4 | 0.2039 | 1463 |
| scal | 57.5 | 0.1404 | 409.6 |
if (is_latex_output()) stargazer(out.logistic.nlmer, header = F,
type = "latex", report = "vcst", single.row = T, omit.stat = "n",
omit.table.layout = "n", digits = 2, digits.extra = 2,
align = T)
5.3.1 Additional Predictors: Example with State-Level Poverty Rate
Now that we have a reasonable model, let’s see how we can add predictors to it. One way to include predictors is to examine whether the parameters of the logistic growth regression are related to the predictor, e.g., whether the state asymptote parameters vary according to the predictor. We can accomplish this by, say, having the asym parameter for each state be a linear function of the predictor, as in asym = \(\beta_0\) + \(\beta_2\) predictor + \(\epsilon\). Our current model is the special case where \(\beta_2\) is set to zero, so asym = \(\beta_0\) + \(\epsilon\), where the intercept is the fixed effect (the same value for each state) and the “\(\epsilon\)” is the random effect (a term associated with each state). By expanding the model to include a predictor associated with the asym parameter, we can examine whether the predictor is linearly related to the asym parameter.
For illustration, I examine whether the poverty rate in each state is associated with the “scal” random effect coefficient in the logistic growth model.
par(pty = "s")
plot(allstates$povertyperc, coef(out.logistic)[, 3], xlab = "State Poverty Percentage",
ylab = "random effect parameter scal in logistic growth")
abline(coef(lm(coef(out.logistic)[, 3] ~ allstates$povertyperc)))
It looks like a slight positive association is present, though the data are noisy. In principle, I could test this regression slope through a regular regression, but it is better to do it directly in the context of the nonlinear logistic growth regression model. The reason is that a linear regression that treats the coefficients from the logistic growth model as though they were an observed variable does not take into account the uncertainty of these parameters: the parameters aren’t observed data but estimates from another model. By including the poverty predictor simultaneously in the logistic growth model, we can model the uncertainty much better. It would be even better to run this as a Bayesian model, which I’ll add in a later version using the brms package.
# use earlier estimates as starting values
start.vals <- fixef(out.logistic)
# drop random 7/6/20 xmid
out <- nlme(percap10000 ~ SSlogis(day.numeric, Asym, xmid,
scal + b2 * povertyperc), fixed = Asym + xmid + scal +
b2 ~ 1, random = Asym + scal ~ 1, data = na.omit(allstates.long.new),
start = c(Asym = start.vals[1], xmid = start.vals[2],
scal = start.vals[3], b2 = -0.3), verbose = T, control = nlmeControl(msMaxIter = 1000,
opt = "nlminb", tolerance = 0.1, pnlsTol = 0.01,
msVerbose = T, minScale = 0.01, maxIter = 1000,
eval.max = 1000))
out.logistic.poverty <- out
if (is_html_output()) pander(summary(out.logistic.poverty))
| | Value | Std.Error | DF | t-value | p-value |
|---|---|---|---|---|---|
| Asym | 1034 | 42.45 | 20595 | 24.36 | 3.323e-129 |
| xmid | 274 | 0.2494 | 20595 | 1099 | 0 |
| scal | 47.82 | 7.383 | 20595 | 6.477 | 9.587e-11 |
| b2 | 0.2385 | 0.5625 | 20595 | 0.424 | 0.6716 |
| Min | Q1 | Med | Q3 | Max |
|---|---|---|---|---|
| -3.489 | -0.3714 | 0.08368 | 0.6568 | 5.946 |
| | Observations | Groups | Log-restricted-likelihood |
|---|---|---|---|
| state.abb | 20648 | 50 | -109411 |
if (is_latex_output()) stargazer(out.logistic.poverty, header = F,
type = "latex", report = "vcst", single.row = T, omit.stat = "n",
omit.table.layout = "n", digits = 2, digits.extra = 2,
align = T)
knitr::kable(intervals(out.logistic.poverty, which = "fixed")[[1]])
| | lower | est. | upper |
|---|---|---|---|
| Asym | 950.740 | 1033.937 | 1117.13 |
| xmid | 273.544 | 274.033 | 274.52 |
| scal | 33.346 | 47.815 | 62.28 |
| b2 | -0.864 | 0.238 | 1.34 |
The \(\beta\) associated with poverty is not statistically significant in this example (the p-value is large and the confidence interval includes zero), so these data do not provide evidence that the state variability in the scal parameter of this logistic growth model is associated with the state’s poverty level. Even if an association were present, we would need to be careful in interpreting it given that the dependent variable we are using could reflect covid incidence but could also reflect the number of tests a state has conducted.
This example illustrates how one goes about adding predictors to such a nonlinear model. One could use more complicated predictors, or even test differences across groups on parameter values. For example, the question of whether states with Democratic governors have different parameter values than states with Republican governors can be tested by creating an appropriate grouping code for the political party of the governor and including that code as a predictor, in the same way I included the poverty predictor.
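A minimal sketch of that group comparison, assuming a hypothetical 0/1 indicator dem.governor has been merged into the data (the variable name and starting values are illustrative, not from the original analysis):

```r
library(nlme)
# Hypothetical sketch: let a 0/1 governor-party code (dem.governor is an
# assumed variable) shift the scal parameter, exactly as poverty did.
out.party <- nlme(percap10000 ~ SSlogis(day.numeric, Asym, xmid,
        scal + b2 * dem.governor),
    fixed = Asym + xmid + scal + b2 ~ 1,
    random = Asym + scal ~ 1,
    data = na.omit(allstates.long.new),
    start = c(Asym = 1034, xmid = 274, scal = 48, b2 = 0))
```

The test of b2 then addresses whether the two groups of states differ on the scal parameter.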
While the presentation has focused on associations, the overall framework can be extended to strengthen inference using various methods, including instrumental variables (e.g., Carroll et al., 2004) and newer approaches (e.g., Runge et al., 2019), though much theoretical work remains to be done to move these nonlinear mixed models from mere description to stronger inference by allowing more direct tests of possible alternative explanations. For an introduction to some basic modern methods in causal inference see Bauer and Cohen.
5.3.2 Additional assumption checking: random effect parameters
I’ll revert to the simple logistic growth model from the start of this subsection. The residual plots already raised concern about the model fit. We can also examine additional assumptions of the model, such as the distribution of the random effects. The random effect parameters in the nonlinear mixed logistic growth model are assumed to follow a multivariate normal distribution, which seems reasonable this many days into the pandemic. Here are separate qqplots for the asymptote, xmid, and scal parameters (a 2D density plot of pairs of random effects would also be informative, though I know that is a lot to ask when there are relatively few pairs of points, one for each state). The qqplot for the asymptote parameter has a long tail, mostly due to New York and New Jersey; the ones for the xmid and scal parameters look ok.
# edit this function to be pairs of scatterplots for 3
# random effects ggplot(coef(out.logistic), aes(x=l,
# y=b1)) + stat_density_2d(aes(fill = ..level..), geom =
# 'polygon', colour='white') + xlab('l = asymptote')
par(mfcol = c(1, 3))
qqnorm(coef(out.logistic)[, 1], main = "qqplot for asymptote parameter")
qqline(coef(out.logistic)[, 1])
qqnorm(coef(out.logistic)[, 2], main = "qqplot for xmid")
qqline(coef(out.logistic)[, 2])
qqnorm(coef(out.logistic)[, 3], main = "qqplot for scal")
qqline(coef(out.logistic)[, 3])
5.4 Evaluating Exponential and Logistic Growth Models
The overall point is that testing whether the data fit an exponential growth model, a logistic growth model, or various other plausible growth trajectories is not merely a model fitting exercise. There is much diagnostic testing one needs to do in order to figure out in what ways a model is working well and where it is failing. The particular way a model goes awry may suggest directions for improvement and may offer clues about the underlying processes that generated the data, which is usually the important part of such an exercise.
The model selection game is more complex than merely selecting the model that has the best fit value on a standard metric such as AIC or BIC or predictive accuracy. To me, there is deeper insight that emerges from understanding the type of growth that is underlying the processes under investigation. We are in a better position to understand how to intervene if we understand the growth process that underlies a data set. Just because a model achieves relatively good performance on measures of fit does not imply that the model is following the same mechanisms of the processes that generated the actual data. This is where the diagnostic tests come in to help us diagnose such situations.
Let me illustrate what I mean by deeper implications. We know that if a growth pattern follows an exponential function, then the plot should be linear in the log scale. Think of the logical implication: “if p, then q.” Now, our data suggest that in the log scale we do not see linearity. What can we conclude? Through the logical contrapositive, observing not-q implies not-p. In other words, the data rule out the simple exponential. Concluding that we can rule out an entire functional form from consideration is a more powerful result than merely claiming one model fits better than another.
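The contrapositive check can be done visually with a one-line plot. Here is a small sketch using the New York rows of the allstates.long data frame from earlier; a straight-line pattern would be consistent with exponential growth.

```r
# Quick visual check of the implication: exponential growth would be
# linear on the log scale. Zero counts are dropped before taking logs.
ny <- subset(allstates.long, state.abb == "New York" & count > 0)
plot(ny$day.numeric, log(ny$count),
    xlab = "day", ylab = "log cumulative count")
```

Curvature in this plot is the "not q" observation that lets us rule out the simple exponential form.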
The Center for Quality, Safety & Value Analytics at Rush University released a calculator that lets the user interactively change the model on cumulative cases. For example, one can click on exponential, or logistic, or (gasp) quadratic and run a model fit along with a forecast. The user can also select the state.
To get deeper in such process models of growth, we need to develop some additional machinery including differential equations, which are equations that govern changes over time. I’ll briefly touch on these models in the next chapter.
check on glmer and gamma
# note: glmer does not support quasi families (it requires a true
# likelihood), so a Gamma family is sketched here instead
# out.gamma <- glmer(percap100 ~ day.numeric + (day.numeric | state.abb),
#     data = na.omit(allstates.long.new), family = Gamma(link = "identity"))
5.5 Splines, Smoothers and Time Series
I’ll briefly give an example of splines because it will be helpful in subsequent sections. I don’t recommend going the route of splines and smoothers in this context. While splines can provide a great fit to the data and can be used for interpolation (e.g., if we missed a day of data collection, we could interpolate), splines and smoothers are not helpful in informing one about the underlying process or in forecasting (extrapolating) from one’s data, which is what we want to be able to do in this case. For illustration, and to show that I too can get curves that approximate the data points closely, I ran a mixed model spline regression where the knots are treated as a random effect by state, and I show the plots just for New York and Michigan for brevity. One would need to do all the usual diagnostics for this mixed model as well.
# linear mixed model using splines on the count data
out.sp <- lmer(count ~ (bs(day.numeric, intercept = T, degree = 5) |
state.abb), data = na.omit(allstates.long))
# diagnostics omitted for brevity plot(out.sp)
# qqnorm(resid(out.sp)) qqline(resid(out.sp))
out.temp <- na.omit(allstates.long)
out.temp$fit <- fitted(out.sp)
subset(out.temp, state.abb %in% c("New York", "Michigan")) %>%
ggplot(aes(x = day.numeric, y = fit, group = state.abb,
color = state.abb)) + geom_line() + geom_point(aes(x = day.numeric,
y = count), color = "black") + ggtitle("Black points are actual data; curves are spline regression fits")
The splines approach can be extended in the context of time series to permit forecasts. Here I’ll show the use of the forecast package; see also the excellent online book by Hyndman and Athanasopoulos. I’ll forecast 10 days out from the current date using New York data and compute an error band. The package can also check residuals with the checkresiduals() command and fit hierarchical time series to model state trajectories within the US.
# better to redo plots with dates like 4/1/20 and weekly
# markers
nydays <- allstates.long$day[allstates.long$state.abb ==
"New York"]
# check
nycount.ts <- ts(nydf$percap10000, start = 1, end = max(nydf$day.numeric) +
1, frequency = 1)
autoplot(nycount.ts)
# splines, h= number days predicted
h <- 10
splne.forecast.ny <- splinef(nycount.ts, h = h)
autoplot(splne.forecast.ny)
# comparison (may drop) splne.forecast.ny <-
# splinef(nycount.ts,h=h, lambda='auto')
# autoplot(splne.forecast.ny)
# checkresiduals(splne.forecast.ny)
I tend to lump a different class of models into the bucket of models considered in this subsection. These are models that directly allow the coefficients to vary by time. Polynomials can be thought of as having time-varying coefficients. For example, the quadratic \(Y = \beta_0 + \beta_1 t + \beta_2 t^2\) can be rewritten as a linear regression with one predictor \(t\) that has intercept \(\beta_0\) and a time-varying slope term \((\beta_1 + \beta_2 t)\). Splines extend this logic by allowing the slope at a given \(t\) to vary continuously. There are models that take this idea and run with it. One such approach is a nonhomogeneous Poisson process, which uses the Poisson distribution to model counts and allows the rate parameter to vary by time (hence the term nonhomogeneous). There are several R packages that implement such models, including NHPoisson, and some extensions to spatial cases such as spatstat. In the modeling presented in this book, we mostly work with counts, so a Poisson would be a natural standard distribution to consider rather than blindly relying on normal distribution assumptions as I have done in several sections; I have already highlighted how the normal distribution assumptions have been suspect. A different approach for implementing time-varying coefficients is in the R package tvReg.
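As a small sketch of the time-varying-rate idea (a generalized linear model analogue, not the full NHPoisson machinery), a Poisson regression can let the log of the rate vary with day. This assumes a data frame like data.temp, used later in this chapter, with columns daily (nonnegative counts) and day.numeric.

```r
# Sketch: Poisson regression where the log rate is quadratic in day,
# a simple way to let the rate parameter vary with time.
# Assumes 'data.temp' has nonnegative integer 'daily' counts.
out.pois <- glm(daily ~ poly(day.numeric, 2),
    family = poisson, data = data.temp)
summary(out.pois)
```

The quadratic in day is just one choice; any smooth function of time could serve as the rate model.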
5.6 Bayesian Estimation of Nonlinear Regression: Adding value to prediction and uncertainty
I will illustrate the use of Bayesian nonlinear regression modeling to estimate prediction uncertainty. The Bayesian approach also allows for more general models than the standard regression approaches described earlier in this chapter, including relaxing some of the assumptions of the classical approach. For example, within a Bayesian approach one can include predictors of the variance terms, which is not easy to do in the classical setting. This could be useful, for example, if a predictor influences not only the value of a random effect parameter but also its variance. Or we may be able to do a better job of developing models that are more appropriate for the count data we are using throughout these notes. Another example of something that is difficult to do in the classical framework is dealing with distributions of random effects that do not follow a normal distribution. Earlier we saw that there may be two clusters of states and that the joint distribution of random effect parameters appears to be skewed. This makes inferences that assume a joint normal distribution for the random effects suspect. These nonstandard cases can be easily handled in the Bayesian framework.
I co-wrote an introductory chapter on Bayesian statistics with Fred Feinberg that highlights some of the advantages and disadvantages of the Bayesian framework. For a complete book on Bayesian statistics I very much like the comprehensive book by Gelman et al. Bayesian Data Analysis. A more readable textbook is Kruschke’s Doing Bayesian Data Analysis.
My Bayesian example here will take a problem that was extremely difficult in what we did earlier but is relatively easy to do in the Bayesian context: put intervals (or, as a Bayesian would say, credible intervals, because the interpretation is slightly different than a standard confidence interval) around predictions from a nonlinear random effects regression model. This is an important detail that the classical approach has a difficult time addressing, as I mentioned in earlier subsections of this chapter when I wasn’t able to establish the uncertainty around the predictions. While uncertainty intervals around the parameter estimates are possible in the classical approach, computing error bars around the prediction interval in a random effects nonlinear model is not straightforward.
I’ll follow the same modeling implementation as used earlier. The brm() function in the brms package provides a convenient wrapper to the Stan Bayesian modeling language using typical R commands. I’ll follow a similar set of assumptions as I did before (e.g., normally distributed random effects using relatively “flat” priors). The computation takes over four hours on my desktop computer, so for now I’ll set the iterations to 2000 with 4 chains, which are the default settings. When these notes are complete and I’m no longer running the code each evening, I will increase the number of iterations, say, to 10000 and vary some of the other computational details to produce more stable estimates. Experts in Bayesian analysis will see some issues with the diagnostic plots I show, but I’m sure many will agree with my decision to keep things relatively quick for now in this draft form.
NEED TO FIX BAYESIAN MODEL TO USE percap10000 dv
# once working correctly move these library calls to
# prelim file already there but leaving here for now
# during debugging
library(brms)
library(bayesplot)
library(tidybayes)
# continue to use save trick (maybe cache) to minimize
# computation time
fit_loss <- brm(bf(count ~ asymp/(1 + exp(b1 * (liftoff -
day.numeric))), asymp ~ 1 + (1 | state.abb), b1 ~ 1,
liftoff ~ 1 + (1 | state.abb), nl = TRUE), data = allstates.long,
family = gaussian(), prior = c(prior(normal(29000, 40000),
nlpar = "asymp"), prior(normal(70, 40), nlpar = "liftoff"),
prior(normal(0.2, 10), nlpar = "b1", lb = 0)), iter = 2000,
control = list(adapt_delta = 0.95))
saveRDS(fit_loss, file = "fitloss.RDS")
Here are the signature Bayesian plots of each parameter (density and trace plots) as well as the intercorrelations across the parameters.
# in interest of time just read in the saved object
# currently using data through 4/22/20 in what I saved
fit_loss <- readRDS("fitloss.RDS")
summary(fit_loss)
plot(fit_loss, N = 3, ask = FALSE)
pairs(fit_loss)
These are the prediction bands around the “fixed effect” part of the model and a plot for the state of Michigan.
int <- 1:100
out.predict <- predict(fit_loss, newdata = data.frame(state.abb = rep("Michigan",
max(int)), day.numeric = int), re_formula = NULL)
temp <- data.frame(day.numeric = int, out.predict)
ggplot(temp) + geom_line(aes(day.numeric, Estimate)) + geom_ribbon(aes(day.numeric,
ymin = Q2.5, ymax = Q97.5), alpha = 0.2) + geom_point(data = midf,
aes(x = day.numeric, y = count)) + ggtitle("Michigan")
5.6.1 Bayesian extensions
Pending: show a Bayesian approach to the generalized nonlinear mixed model
5.7 State-space modeling
The simple idea of a time series introduced earlier in Section 5.5 can be extended to a context involving a latent variable or, as it is called in this context, a state. The basic idea is that there are two equations governing different levels of analysis. One equation governs our observations and another equation governs a latent state.
One form of this approach has these equations. For the state-space portion, \[ S(t+1) = S(t) + \epsilon_s \] where S(t) is the state at time t and \(\epsilon_s\) is the noise in the state model. This equation governs the transition from one state to the next at each time step, possibly with error. The error is assumed to be normally distributed with mean 0 and some variance. If the variance is constrained to be 0, then there is no noise in the process modeling the transition across states. This model can be extended to include parameters, transition probabilities, covariates, structure on the error terms, and other bells and whistles usually associated with regression models.
The second equation governs what we observe \[ Y(t) = S(t) + \epsilon_m \] where Y(t) is our observation at time t, S(t) is the state at time t and the \(\epsilon_m\) is the noise associated with our observation. Typically, the two \(\epsilon\)s in these two equations are assumed to be independent. If they are both normally distributed, then this is typically known as a dynamic linear model.
The way to think about these two equations is that there is a process governing the way the system moves from time step to time step, and there is an additional process that can lead to noisy observations of those states. There are analogous models in other areas of statistics (e.g., factor analysis can be represented in a similar way, with a measurement model separate from a structural model).
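To make the two equations concrete, here is a minimal simulation of my own (an illustration, not the webpage code): a random-walk state observed with independent measurement noise, i.e., a simple dynamic linear model.

```r
# Simulate the two-equation model above:
#   S(t+1) = S(t) + eps_s   (state transition, a random walk)
#   Y(t)   = S(t) + eps_m   (noisy observation of the state)
set.seed(1)
n <- 100
S <- cumsum(rnorm(n, mean = 0, sd = 1))  # latent state path
Y <- S + rnorm(n, mean = 0, sd = 2)      # observations with measurement noise
plot(Y, type = "l", ylab = "value")       # observed series
lines(S, col = "blue")                    # underlying state
```

Setting the state noise standard deviation to 0 collapses S to a constant, illustrating the constrained case mentioned above.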
I adapted some of the code from this webpage.
R example pending.
I will return to state-space models in Chapter 6 when discussing process models.
5.8 Basic Machine Learning Examples
In my opinion, while impressive progress has been made in text analysis and image analysis, the machine learning literature has lagged behind where it needs to be on the development of methods that address multiple time series (e.g., time series by state, country, or county). Here I review one approach, an implementation of a recurrent neural network (RNN), that I used for the data from New York state. We need to use an RNN, a specialized version of a neural network, because we have time series data. Here is one description, another description and a third description by UM’s Ivo Dinov of RNNs. The R package I will illustrate is rnn, a basic implementation of a recurrent neural network. There are much better implementations, such as the keras package, and I’ll explore that later once I make more progress on other parts of these notes. For now, we’ll use the very simplified rnn package in R.
The common element underlying neural networks is that they take inputs, weight them, sum them, and add a “bias” term. This is similar to regression: weight the predictors by betas, sum the products, and add an intercept. The weighted sum is then converted into a number between 0 and 1 using the logistic form, much as logistic regression converts a model on the real number line into a probability between 0 and 1. This can be done repeatedly to create multiple units, or different sets of weighted sums, in a manner similar to principal components analysis (PCA), where the observed variables are weighted and summed to produce components. It is possible to create additional layers with new outputs based on these weighted sums (so, weighted sums of nonlinear transformations of weighted sums). The final layer of units is then mapped onto the outcome variable.
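As a tiny illustration with made-up numbers, a single unit computes a weighted sum of its inputs plus a bias and passes the result through the logistic function:

```r
# One unit of a neural network: weight the inputs, sum, add the bias,
# then squash through the logistic function to get a value in (0, 1).
logistic <- function(z) 1 / (1 + exp(-z))
x <- c(0.5, -1.2, 0.3, 0.8)    # four inputs (made-up values)
w <- c(0.2, 0.7, -0.4, 0.1)    # weights (made-up values)
b <- 0.05                      # bias term
h <- logistic(b + sum(w * x))  # the unit's output
h
```

A hidden layer is just several such units with different weight vectors, and the next layer repeats the same computation on their outputs.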
Here is a graphical representation of what I mean. This example has four inputs (i.e., four variables or features), one hidden layer with 8 units, and one output layer that is a single number prediction (like our counts of positive tests). The hidden layer is a set of 8 units. The two B terms are the “bias” terms that feed into the weighted sums. The thick black lines refer to larger positive weights and the thicker gray lines to larger negative weights. Example adapted from the tutorial in the NeuralNetTools R package.
library(nnet)
library(NeuralNetTools)
data(neuraldat)
neuraldat$X4 <- neuraldat$X1 * neuraldat$X2
mod <- nnet(Y1 ~ X1 + X2 + X3 + X4, data = neuraldat, size = 8)
plotnet(mod)
The RNN approach we will use is a simplified version. More appropriate approaches would be long short-term memory models.
It is common to model time series using the counts for each day (such as with the incidence plot we saw in Section 4.4) rather than the cumulative counts we have been using throughout most of these notes. I’ll follow that convention here.
The incidence plot for New York State.
data.temp <- allstates.long %>% filter(state.abb == "New York")
data.temp$daily <- c(data.temp$count[1], diff(data.temp$count))
ggplot(data.temp, aes(x = day, y = daily)) + geom_bar(stat = "identity") +
scale_x_date(date_breaks = "1 week", date_labels = "%d %b") +
labs(y = "Daily incremental incidence ", title = "Positive US Covid-19 Cases New York") +
theme(legend.position = "none", strip.text.y = element_text(size = 11))
5.8.1 Simple RNN example
The following involves a relatively large chunk of code. Other implementations of deep learning networks require even more code. The basic idea is that we put all our data into one data frame, then partition it into testing data and training data. We want to be sure we partition early so as not to contaminate the training data with any information that is in the testing data. For example, if we compute a mean we may inadvertently compute a mean for all the data (testing and training), and then if we use that mean in training we have contaminated the training data with information from the testing data.
The only information I have to work with so far is the number of days and the previously observed daily counts; that is, as of this point in these notes we don’t have access to other possible predictors of a state’s daily count. Here, “previously observed” means that if I’m at time t, then any data prior to t are “previous” data that I have, but any data after t have not yet occurred. In the case of the data I’m downloading, t may be, say, 4/10/20, so from that perspective any data after 4/10/20 cannot be used because as of time point t I don’t know of data after t. Part of the machine learning algorithm works by varying t across different days, repeatedly feeding those sequences of data to the model and trying to learn the patterns by figuring out what it is getting wrong, making adjustments to the model’s parameters (the weights and bias terms), making new predictions, and iterating that process.
The rest of the code is creating daily counts and several features to use as predictors. We don’t have a rich set of predictors set up yet so I created some from the day variable. I will use day, day\(^2\) and day\(^3\) (of course, mean centering day before computing the higher order terms). I created a predictor of how much daily change there was (e.g., the count at t minus the count at t - 1). This serves as a slope estimate of change, which is a first order difference because I’m working in discrete time (days); technically, it is a secant rather than the slope (which would correspond to the tangent). I also created a “difference of difference” (or second order difference analogous to acceleration, or how the change is changing) to measure how much the slope is changing from day to day. In other words how different is the change today from the change we saw yesterday. This process of creating new variables from existing variables is called “feature engineering.”
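The feature engineering just described can be sketched on a made-up vector of counts (the numbers here are illustrative, not the New York data):

```r
# Sketch of the feature engineering described above.
counts <- c(10, 14, 21, 30, 44, 60)              # made-up daily counts
d1 <- c(NA, diff(counts))                        # first-order difference (change)
d2 <- c(NA, NA, diff(counts, differences = 2))   # second-order difference
# trailing 3-day running mean (current day and the two previous days)
run3 <- stats::filter(counts, rep(1/3, 3), sides = 1)
```

The leading NAs mark days where the difference is undefined; in practice those rows are dropped before training.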
This model has 7 inputs: day, day\(^2\), day\(^3\), count, count difference, count second order difference, and a running mean of the counts from the previous 3 days. I currently have just a few units in a single hidden layer, and one unit in the output layer. Of course, to make this more interesting I could add many more predictors including state-level and daily information (e.g., measures of compliance to social distancing measures; weather, which could affect people’s behavior going outside; changes in a state’s social distancing policy).
So, overall, the long section of code is there because we have to do some work to put the data in the right format for machine learning to do its magic.
In the figure below the red points are the actual daily cases in New York. The green curve is what the machine learning model learned. I intentionally overtrained the model. The last day shows the prediction by the machine learning model in solid blue and the actual value in solid red.
FIX: computer memory problem starting 8/27 with R upgrade
# set.seed(03302020)
# rollify used below for the running mean
y = allstates.long.new[allstates.long.new$state.abb == "New York",
"count"]
actualny <- data.frame(y = y, x = 0:(length(y) - 1))
data <- data.frame(days = actualny$x[-1], counts = actualny$y[-1])
# partition train and test data; good to do that first
# so that test data doesn't creep into training this
# could happen accidentally say when computing a mean
# for this classroom example the test set is the last
# day, so I'm testing by a prediction one day out
data_train <- data[-nrow(data), ]
data_test <- data[nrow(data), ]
# diff computes first order differences so need to add
# day 1 at the beginning of the series
time.series.day.counts <- ts(c(173, diff(data_train$counts)))
time.series.day.counts
## Time Series:
## Start = 1
## End = 412
## Frequency = 1
## [1] 173 64 80 83 178 76 328 446
## [9] 1100 1714 3439 2750 4781 5693 4797 5160
## [17] 6556 7304 7699 7248 7015 9190 14524 10403
## [25] 10675 10250 10995 9861 7413 10211 12071 11473
## [33] 8533 8373 7853 7471 6658 7889 6986 6877
## [41] 6644 6386 4633 4259 5497 6668 9442 5677
## [49] 4313 2908 3136 4377 4490 4375 3511 3143
## [57] 2106 2123 3211 2899 2635 2400 1824 1330
## [65] 1518 2339 2248 2518 2023 1498 1386 1166
## [73] 1866 1900 2071 1808 1691 961 933 1156
## [81] 1663 1372 1209 1136 1069 789 1212 1123
## [89] 1055 857 736 598 461 810 816 760
## [97] 693 667 509 411 699 824 695 666
## [105] 642 501 473 814 829 734 611 534
## [113] 503 547 907 914 777 722 522 448
## [121] 568 765 920 878 747 635 873 702
## [129] 890 827 829 644 674 700 556 853
## [137] 750 760 628 635 505 572 797 690
## [145] 755 553 590 633 484 742 578 698
## [153] 574 521 582 464 642 650 692 610
## [161] 473 538 517 683 737 683 542 473
## [169] 555 482 798 669 694 713 685 603
## [177] 648 917 870 820 736 567 502 592
## [185] 662 1031 926 834 662 689 684 1025
## [193] 831 902 852 635 588 569 1101 1000
## [201] 1173 1032 969 979 984 1324 1759 1687
## [209] 1386 1147 1149 1218 2078 1609 1503 1274
## [217] 1254 1204 1098 1538 1747 1695 1512 1209
## [225] 1238 1716 1892 1834 2124 1920 1545 1781
## [233] 1711 2351 2328 2393 2122 1870 2069 2068
## [241] 3217 3211 3900 3579 3446 3957 4253 5225
## [249] 5666 5577 4192 4210 4612 4876 5824 5733
## [257] 6312 5680 6437 4882 6033 7312 8278 6861
## [265] 5478 7772 7087 8135 10148 11846 11054 10151
## [273] 8453 8435 9258 11124 11124 11000 10587 10138
## [281] 8964 9419 11723 12652 10067 9305 9805 9393
## [289] 10532 13024 13048 11129 8008 8268 10183 12917
## [297] 15393 17033 15849 12232 11242 13142 16016 17588
## [305] 19560 17839 16308 14179 14791 14704 14187 19469
## [313] 15945 13694 13410 12721 11119 13908 15684 13909
## [321] 12383 11972 11040 10365 13783 12951 12892 11209
## [329] 8932 7949 7154 6970 7897 11492 9923 8972
## [337] 8279 6584 10011 8715 8755 8330 6538 6709
## [345] 5925 9582 8573 8238 6237 6297 6923 6139
## [353] 8893 8074 8042 7827 6436 6381 6564 7468
## [361] 8151 8126 7658 5440 6833 6106 7169 4226
## [369] 10251 6687 7119 6711 6582 5515 5077 6667
## [377] 3735 2796 3309 27644 8551 8059 8694 8910
## [385] 8742 5875 7789 9170 8038 7755 7636 6659
## [393] 6041 6643 8727 8662 7495 6630 5631 5023
## [401] 6060 7029 6344 6590 5641 5026 4576 4292
## [409] 5238 4847 4138 4064
## (output of the first- and second-order difference series omitted)
## [232] 710 -663 88 -336 19 451 -200
## [239] 1150 -1155 695 -1010 188 644 -215
## [246] 676 -531 -530 -1296 1403 384 -138
## [253] 684 -1039 670 -1211 1389 -2312 2706
## [260] 128 -313 -2383 34 3677 -2979 1733
## [267] 965 -315 -2490 -111 -795 1680 841
## [274] 1043 -1866 -124 -289 -36 -725 1629
## [281] 1849 -1375 -3514 1823 1262 -912 1551
## [288] 1353 -2468 -1943 -1202 3381 1655 819
## [295] -258 -836 -2824 -2433 2627 2890 974
## [302] -1302 400 -3693 190 -598 2741 -699
## [309] -430 5799 -8806 1273 1967 -405 -913
## [316] 4391 -1013 -3551 249 1115 -521 257
## [323] 4093 -4250 773 -1624 -594 1294 188
## [330] 611 1111 2668 -5164 618 258 -1002
## [337] 5122 -4723 1336 -465 -1367 1963 -955
## [344] 4441 -4666 674 -1666 2061 566 -1410
## [351] 3538 -3573 787 -183 -1176 1336 238
## [358] 721 -221 -708 -443 -1750 3611 -2120
## [365] 1790 -4006 8968 -9589 3996 -840 279
## [372] -938 629 2028 -4522 1993 1452 23822
## [379] -43428 18601 1127 -419 -384 -2699 4781
## [386] -533 -2513 849 164 -858 359 1220
## [393] 1482 -2149 -1102 302 -134 391 1645
## [400] -68 -1654 931 -1195 334 165 166
## [407] 1230 -1337 -318 635
# put into data matrix and pad differences with NA
data.mat <- data.frame(days.numeric = data_train$days -
mean(data_train$days), first = c(NA, first.diff), second = c(NA,
NA, second.diff))
# create day sq and day cube with centering
data.mat$daysq <- (data.mat$days.numeric - mean(data.mat$days.numeric))^2
data.mat$daycube <- (data.mat$days.numeric - mean(data.mat$days.numeric))^3
data.mat$previous.day.count <- c(NA, time.series.day.counts[-length(time.series.day.counts)])
# running mean over the previous 5 days
running_mean <- rollify(mean, window = 5)
data.mat$runmean5 <- running_mean(data.mat$previous.day.count)
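The `rollify()` helper presumably comes from the tibbletime package (it is not loaded in this chunk). If that package is unavailable, the same trailing mean can be computed in base R; a minimal sketch, with `trailing_mean` a name introduced here just for illustration:

```r
# base-R stand-in for rollify(mean, window = 5): NA for the first
# window - 1 positions, then the mean of the current value and the
# window - 1 values before it
trailing_mean <- function(x, window = 5) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= window) out[i] <- mean(x[(i - window + 1):i])
  }
  out
}
trailing_mean(1:6, window = 5)  # NA NA NA NA 3 4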
# to make future predictions sensible, I can't compute the
# first and second order differences using the future count
# I'm predicting, so move the 1st and 2nd order differences
# down one row and pad with NAs; save the last row I'm
# dropping here because I do know it when making a
# prediction for the day after the training set
# (previous.day.count is already moved up)
save.last.row.data.mat <- data.mat[nrow(data.mat), ]
data.mat[, 2:3] <- cbind(c(NA, data.mat[1:(nrow(data.mat) -
1), 2]), c(NA, data.mat[1:(nrow(data.mat) - 1), 3]))
# drop first five rows to avoid missing data
data.mat <- data.mat[-c(1:5), ]
# define scaling function for scaling 0-1
sc <- function(x) {
(x - min(x, na.rm = T))/(max(x, na.rm = T) - min(x,
na.rm = T))
}
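Because the features and the outcome are scaled to 0-1 before training, predictions come out on the 0-1 scale and must be mapped back to counts, which is what the `(maxmat[7] - minmat[7]) * Yp + minmat[7]` line below does. A small round-trip sketch (`unsc` is a hypothetical helper name, and `sc` is repeated so the snippet is self-contained):

```r
sc <- function(x) (x - min(x, na.rm = TRUE))/(max(x, na.rm = TRUE) -
    min(x, na.rm = TRUE))
# inverse of the 0-1 scaling: x.scaled * (max - min) + min
unsc <- function(x.scaled, x.min, x.max) x.scaled * (x.max - x.min) + x.min
x <- c(10, 20, 40)
unsc(sc(x), min(x), max(x))  # recovers 10 20 40
```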
# to test, replace sc with the identity function to check the
# creation of the X and Y arrays: sc <- function(x) x
# save the original mean, sd, min and max of the data_train
# features for later use; cut the first 5 points of
# time.series.day.counts too
tempmat <- cbind(data.mat$days.numeric, data.mat$daysq,
data.mat$daycube, data.mat$first, data.mat$second, data.mat$previous.day.count,
data.mat$runmean5, counts = time.series.day.counts[-c(1:5)])
minmat <- apply(tempmat, 2, min, na.rm = T)
maxmat <- apply(tempmat, 2, max, na.rm = T)
meanmat <- apply(tempmat, 2, mean, na.rm = T)
sdmat <- apply(tempmat, 2, function(i) sqrt(var(i, na.rm = T)))
# rnn programs usually require 3 dimensional arrays for
# X and Y
X <- array(c(sc(data.mat$days.numeric), sc(data.mat$daysq),
sc(data.mat$daycube), sc(data.mat$first), sc(data.mat$second),
sc(data.mat$previous.day.count), sc(data.mat$runmean5)),
dim = c(1, nrow(data.mat), 7))
Y <- array(sc(time.series.day.counts[-c(1:5)]), dim = c(1,
nrow(data.mat), 1))
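The rnn package expects `X` and `Y` as 3-dimensional arrays with dimensions (samples, timesteps, features): here one sample (the single New York trajectory), one timestep per day, and 7 input features. A toy check of how `array()` fills this layout:

```r
# array() fills the first dimension fastest, so concatenated
# feature vectors land in separate slices of the third dimension
x1 <- c(0.1, 0.2, 0.3, 0.4)  # feature 1 over 4 days
x2 <- c(0.9, 0.8, 0.7, 0.6)  # feature 2 over 4 days
Xtoy <- array(c(x1, x2), dim = c(1, 4, 2))
Xtoy[1, , 1]   # feature 1: 0.1 0.2 0.3 0.4
Xtoy[1, 3, 2]  # feature 2 on day 3: 0.7
```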
# I tinkered with different settings for learningrate and momentum;
# batchsize should not be greater than the size of the training set
model <- trainr(Y = Y,
X = X,
learningrate = .2, momentum=.5, batchsize=5,
hidden_dim = c(3), use_bias=T,
numepochs = 1000, network_type="rnn")
# Predicted values
Yp <- predictr(model, X)
# Plot predicted vs actual. Training set + testing set
dv.Y <- (maxmat[7]-minmat[7])*Yp + minmat[7]
dv.Y
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 96.4 96.3 96.3 96.3 96.3 96.3 96.3 96.3 96.3
## [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
## [1,] 96.3 96.3 96.3 96.3 96.3 96.3 96.3 96.3
## ... (columns 18 through 405 omitted; every value is approximately 96.2) ...
## [,406] [,407]
## [1,] 96.3 96.4
# when squaring and cubing, center using the day mean from the
# training set; from save.last.row.data.mat use [2] or [3] for
# the first and second differences, respectively
Xnew <- array(c(
    (data_train[nrow(data_train), 1] - minmat[1])/(maxmat[1] - minmat[1]),
    (data_train[nrow(data_train), 1]^2 - minmat[2])/(maxmat[2] - minmat[2]),
    (data_train[nrow(data_train), 1]^3 - minmat[3])/(maxmat[3] - minmat[3]),
    unlist((save.last.row.data.mat[2] - minmat[4])/(maxmat[4] - minmat[4])),
    unlist((save.last.row.data.mat[3] - minmat[5])/(maxmat[5] - minmat[5])),
    # alternative: (time.series.day.counts[length(time.series.day.counts)] - minmat[6])/(maxmat[6] - minmat[6]),
    unlist((save.last.row.data.mat[6] - minmat[6])/(maxmat[6] - minmat[6])),
    unlist((save.last.row.data.mat[7] - minmat[7])/(maxmat[7] - minmat[7]))),
    dim = c(1, 1, 7))
Yp.new <- predictr(model, Xnew)
Yp.new <- (maxmat[7]-minmat[7])*Yp.new + minmat[7]
Yp.new
## [,1]
## [1,] 96.9
plot(data$days[-c(1:5, length(data$days))], time.series.day.counts[-c(1:5)],
    col = "red", type = "p",
    main = "RNN for New York State Daily Confirmed Cases\n actual counts (red & solid blue), training (green); predicted (solid blue)",
    ylab = "Daily Counts", xlab = "Day",
    xlim = c(0, length(data$days)),
    ylim = c(0, max(c(Yp.new, diff(data$counts)) + 10)), cex.main = .7)
points(data$days[length(data$days)], Yp.new, col = "blue", pch = 19)
points(data$days[length(data$days)], diff(data$counts)[nrow(data) - 1],
    col = "red", pch = 19)
lines(data$days[-c(1:5)], c(dv.Y, Yp.new), col = "green")
legend(c(60, 95), c(max(diff(data$counts)), max(diff(data$counts)) - 1500),
    legend = paste("Last Day: \n difference between predicted (blue) and actual (red):",
        round(Yp.new - diff(data$counts)[nrow(data) - 1], 0)), cex = .5)
abline(v = nrow(data) - .5)
text(15, 0, "training")
text(nrow(data) + .5, 0, "test")
I’ll be the first to acknowledge that this example is just for demonstration purposes. Our data set is far too small to have any chance of estimating meaningful machine learning models; I created a kludgy set of predictors (aka features) without much thought; I haven’t included additional predictors in the model; and I’ve focused only on the number of positive tests (a better demonstration of machine learning methods would be to use positive testing rates as predictors of outcomes such as subsequent death rates). The particular forms of the predictors I’m using (polynomial terms, first and second order differences) are known not to behave well when extrapolating beyond one’s data, as we are doing here when predicting the subsequent day. I also have not optimized the various tuning parameters, including the learning rate, so there is much one can do to improve the fit of this particular example.
There are more diagnostics to consider, such as error in prediction, and we can iterate over the tuning parameters to find optimal values that give good predictive validity without overfitting. Note that between day 25 and 40 the overall “trend” of the open red circles is relatively flat (as an average) but the daily variability is increasing. The current set of predictors I’m using in this RNN doesn’t easily capture changes in variability. This is one clue for additional predictors (features; see next subsection) that could be added to the model. This could be an interesting part of the story and something that can be addressed not only with machine learning models but also with the other models we considered in this chapter. Models that are not capturing this change in variance are possibly missing something important about the underlying process (which could be something as simple as not all jurisdictions in New York state reporting daily, but is nonetheless an important property to understand if we want to use these approaches to make forecasts). We could check whether this increasing variance occurs in other states and at different levels of analysis, such as the county level. We could check whether this increase in variance is consistent with a multiplicative error model, where the error (the “plus \(\epsilon\)” term we discussed earlier) could provide a clue to the underlying process. The point is that there is a potential signal here that our current models are not capturing, and we need to figure out whether we need to move to a different class of models or whether this heterogeneity is due to a simple factor that can be easily addressed.
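One way to give a model access to changing variability, sketched here purely as an illustration (the 7-day window is an arbitrary choice, and `rolling_sd` and `runsd7` are names I'm making up), is a trailing standard deviation of the daily counts added as another feature:

```r
# trailing 7-day standard deviation of daily counts, NA-padded
# like the other lagged features
rolling_sd <- function(x, window = 7) {
  out <- rep(NA_real_, length(x))
  if (length(x) >= window) {
    for (i in window:length(x)) out[i] <- sd(x[(i - window + 1):i])
  }
  out
}
# e.g., data.mat$runsd7 <- rolling_sd(data.mat$previous.day.count)
```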
The increasing variance was not easy to detect in the cumulative plots and models on cumulative data such as the log linear, exponential and logistic models we covered earlier in this chapter. If you go back and look at those plots, even the ones using the curve fitting spline methods, they show systematic “under and over the curve” errors consistent with what we are seeing in the incidence plot. The increasing variance is there but just difficult to spot in the way we examined these data previously. Sometimes converting data to different forms, such as cumulative and incidence forms, helps highlight different aspects of the same data. It is not easy to see in the barplot of the New York incidence presented at the beginning of the RNN section. It is often useful to cross-check one’s interpretation with different model forms and plots.
There are many other approaches that fall under the large machine learning umbrella. These include extensions of RNNs, such as long short-term memory (LSTM) deep learning models that can learn patterns further back in time than simple RNNs, general additive models with penalized smoothers, genetic algorithms, and symbolic regression, as well as several R packages to compute these methods. Overall, I was underwhelmed by the effort so far. Some of the R packages gave error messages even when running the very examples supplied in the package, or they made strange predictions. Part of the issue is that we don’t have enough data to allow these machine learning methods to shine. For a more detailed and systematic discussion of machine learning approaches for time series data see Makridakis et al.
5.8.2 Gaussian Process Regression
There are many approaches that fall under the machine learning umbrella, with the neural networks presented in the previous subsection being just one. In this subsection I consider a different approach that mingles Bayesian methods with machine learning and standard statistical models. The approach is to treat the function as a draw from a parent distribution of functions (see Erickson for a short introduction and Rasmussen & Williams (2006) for a book-length treatment). I demonstrate one implementation using the GauPro package in R. The default is a Gaussian kernel, but other kernels can be used as well. In this demonstration I stick with the default settings, but these can be tuned as with most machine learning algorithms.
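For reference, the standard predictive equations behind Gaussian process regression (Rasmussen & Williams, Ch. 2): with kernel \(k\), training inputs \(X\), outputs \(y\), noise variance \(\sigma^2\), and a new input \(x_*\), the predictive mean and variance are

\[
\bar{f}_* = k(x_*, X)\left[K(X,X) + \sigma^2 I\right]^{-1} y
\]

\[
\mathbb{V}[f_*] = k(x_*, x_*) - k(x_*, X)\left[K(X,X) + \sigma^2 I\right]^{-1} k(X, x_*)
\]

The \(\pm 2\) standard error bands plotted below are computed from the square root of this predictive variance.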
FIX: temporarily commented out
# warning: sometimes this plot can look strange; need to
# figure out how to get a more stable solution using the
# newest version on github
first.order.train <- c(data_train[1, 2], diff(data_train[,
2]))
first.order.test <- data_test[1, 2] - data_train[nrow(data_train),
2]
gp <- GauPro(data_train[, 1], first.order.train, parallel = F)
## Error in deviance_fngr_joint(X, Z, K) :
## BLAS/LAPACK routine 'DLASCLMOIEDDLASRTDSP' gave error code -4
## Error in deviance_fngr_joint(X, Z, K) :
## BLAS/LAPACK routine 'DLASCLMOIEDDLASRTDSP' gave error code -4
## Error in deviance_fngr_joint(X, Z, K) :
## BLAS/LAPACK routine 'DLASCLMOIEDDLASRTDSP' gave error code -4
gp.predict <- predict(gp, c(nrow(data_train):(nrow(data_train) +
5)), se.fit = T)
x <- data_train[, 1]
plot(data_train[, 1], first.order.train, xlim = c(0, nrow(data_train) +
5))
curve(gp$predict(x), add = T, col = 2)
curve(gp$predict(x) + 2 * gp$predict(x, se = T)$se, add = T,
col = 4)
curve(gp$predict(x) - 2 * gp$predict(x, se = T)$se, add = T,
col = 4)
points((nrow(data_train) + 1):(nrow(data_train) + 5), gp.predict[-1,
1], col = "green")
There are related approaches in the area of functional data analysis that also treat the function as a sample, but they use different estimation approaches than those used in Gaussian process models.
5.8.3 Feature engineering
To implement machine learning methods for these data I need to spend more time on the feature engineering side of things. In machine learning parlance, feature engineering refers to how the input variables are processed. When one works with time, for instance, it is common to think about multiple ways that time could enter the model and include all of them; machine learning algorithms are designed to drop the irrelevant formulations. There is a commonly used time series package in machine learning (TSFRESH in python, with an effort to port it to R, tsfeaturex) that computes several hundred possible measures on each time series (mean, max, min, variance, entropy, auto-correlation, etc.) and stores these computations in a large matrix so that they can be used downstream in subsequent analyses. I didn’t put in the effort to work through the sensible features to compute and include them as input into the models, so I shouldn’t be surprised at the relatively low performance of these models given that I didn’t play by the rules of machine learning.
The concept of feature engineering is much like entering linear, quadratic, cubic terms of a predictor into a regression equation. The idea is that variants of the predictor in the form of a polynomial are treated as predictors. Take that idea, inject it with steroids, and you have feature engineering in machine learning.
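In base R this polynomial analogy is just a few lines; `poly()` expands one predictor into several (orthogonal) columns that all enter the regression as separate features:

```r
# one predictor x expanded into linear, quadratic, and cubic features
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- d$x^2 + rnorm(20, sd = 10)
fit <- lm(y ~ poly(x, 3), data = d)
length(coef(fit))  # intercept plus 3 polynomial terms = 4
```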
As a simple illustration of feature extraction, here is the output of the daily count variable for each of the 50 states.
FIX: temporarily commented out
# since this is a package I'm just using here for
# illustration I'll read it from github and load it here
# rather than in preliminary.Rmd; if you want to
# download just uncomment the next line
# create just a submatrix of the allstates.long data
# frame
tsdat <- allstates.long[, c("day.numeric", "state.abb",
"count")]
out.featurex <- extract_features(tsdat, group_var = "state.abb",
value_var = "count", features = "all", verbose = F,
return_timing = F)
# list of features computed by tsfeaturex for each state
names(out.featurex)
# example for just the median
head(out.featurex$f.median)
# double check we get the median if we had computed it
# by hand
# median(allstates.long$count[allstates.long$state.abb
# == c('Alaska')], na.rm=T)
As you can see, tsfeaturex computes 82 different measures on the count data for each state. Those 82 measures could then be included as state-level predictors in a machine learning algorithm using a type of regularizer that introduces sparseness (i.e., drops predictors that are not useful in the prediction).
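A sparseness-inducing regularizer of this kind is, for example, the lasso. A sketch using the glmnet package on simulated data standing in for the 50-state by 82-feature matrix (the objects `feat.mat` and `y` here are placeholders, not anything computed above):

```r
library(glmnet)
set.seed(1)
feat.mat <- matrix(rnorm(50 * 82), nrow = 50)  # 50 states x 82 features
y <- 2 * feat.mat[, 1] + rnorm(50)             # outcome driven by feature 1
# alpha = 1 gives the lasso; coefficients of unhelpful features
# are shrunk exactly to zero
cvfit <- cv.glmnet(feat.mat, y, alpha = 1)
coef(cvfit, s = "lambda.min")                  # sparse coefficient vector
```

Typically only a handful of the 82 coefficients survive at the cross-validated penalty.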
A more appropriate machine learning strategy would be to take the time trajectory data for every country reported in the Johns Hopkins site (about 140 countries) and see if we can learn patterns about the nature of the trajectories, given various aspects of the country’s demographics, dates they instituted various policies (shelter at home), information about their health care system, etc. One could use the principles developed here about web scraping and gathering these other sources of data, merging them into a single file, and running appropriate machine learning algorithms to find patterns.
An interesting use of machine learning methods (specifically computer vision) is the development of a physical distancing index by a group at UM, which detects physical movement and social distancing of people in major cities. Their website has some live cameras and charts to illustrate the index.
I’ll continue adding to this section and improving the quality of the machine learning example as time frees up from completing other sections.
5.9 Summary
We covered a lot of material in this chapter. We started with a simple linear mixed model on log-transformed data that seemingly did a reasonable job of accounting for the patterns across states. The finding that the linear relation in the log scale holds fairly well suggests that there is exponential growth. The ability to make a statement about the underlying process, in this case exponential growth, is far more valuable than finding a good fit to a data set. But there was a suggestion that the exponential may not be the ideal fit, as there was some systematic deviation from a growth curve. The rest of the chapter covered a more direct way of fitting exponential growth models through nonlinear regression, a different type of growth model based on the logistic curve, and a toy example using standard machine learning methods.
A key idea I followed in this chapter is not to have a single pattern that represents all states but to use models that simultaneously tell us something about the overall pattern across states (the fixed effect portion of the output) and the variability, or heterogeneity, across the states (the random effect portion of the model). We could get fancier with this idea where, for example, we could weight the results based on population size of each state, so counts from more populous states receive more weight. There are several generalizations of these notions from different perspectives such as statistics and machine learning that you can follow in papers (e.g., varying coefficient models such as one by Hastie and Tibshirani and extensions of tree-based methods like CART in the R package vcrpart) as well as books (Hastie et al; Ramsey & Silverman). Once we properly characterize the shape of the trajectories and the variability across units, we can then study the associations between predictors and the parameters of the model, such as I attempted to do with the state-level poverty data and a parameter from the logistic growth model.