Chapter 5 Descriptive Modeling: Basic Regressions

The previous chapter presented several descriptive plots of raw data. In the next two chapters I would like to show some simple modeling methods of the same data. By no means is any of the modeling presented here considered breakthrough modeling. I call this descriptive modeling because all I do is describe the patterns using statistical methods. In a later chapter (Chapter 6) I’ll give examples of explanatory models that attempt to capture mechanism and underlying process.

The difference between descriptive models and explanatory models is that the former are summaries of the data and, whie they can be used to forecast, but they do not provide understanding of the underling mechanisms. Descriptive models do not give insight into what will happen if a change is imposed on the system. Descriptive models don’t tell us the effect social distancing may have, for example, on the future rate of covid-19. Descriptive models are fine for describing what has happened in the past but can be limited in helping us understand how the future will play out under various scenarios including interventions. I sometimes refer to descriptive modeling as “models of data.” Explanatory, or process, models focus on modeling underlying mechanism. Explanatory models also use data to estimate parameters and make predictions, but the focus of process models is the underlying mechanism.

I’ll try to use the minimum R features necessary to make the points I want to highlight. There are more complicated approaches using time series packages that format the data differently (e.g., they use special time series objects in R). There are benefits of doing this because there are automatic plotting and estimation routines that can take time series objects and do relevant computations. But I believe that as you are learning this material it makes more sense to learn how to do some of the programming. This will put you in a better position to learn more advanced approaches like those for time series data, which use even more complex programming capabilities. It is best to learn how to program with simpler problems and then gradually add complexity rather than jump immediately to complex cases. So I’ll start with standard R objects and slowly introduce some elements unique to time series data.

For more general treatments on working with time series data see Hyndman & Athanasopoulos; Holmes, Scheuerell & Ward and the classic Box & Jenkins, with additional authors in newer editions of the book.

5.1 Regression on US State Percentages: Linear Mixed Models on Logs

In the previous chapter Descriptive Plots I showed that on the log scale the curves do not follow a straight line by state for the count/population variable, at least through the entire time series and likely not even in the early days of the pandemic. There is a belief that pandemics tend to follow an exponential process early on but not necessarily deeper into the pandemic. It isn’t clear that belief is consistent with the Covid-19 data.

I’ll illustrate one type of exponential growth with this simple example. Image each infected person infects 2 people and transmission occurs in the first day. The first person who contracts covid in the US infects 2 people for a total of 3 infected people. The next day those two newly infected individuals each transmit to 2 people, so on that day there are 4 newly infected people plus the 3 infected people from the previous day. This process continues each day, so that each newly infected person transmits to 2 new individuals. What is surprising is that if this process continues unchecked, it will take merely 28 days for the entire US population, approximately 330 million, to be infected with covid. Clearly, this simple exponential process did not play out because 10 months into the pandemic, the US is at 14 million known positive cases. I think 28 days would qualify as falling into the “early days of the pandemic,” so other processes, constraints and factors must be contributing to the spread of the disease. As we will see later in this chapter, exponential growth characterizes a family of growth rates, not just the specific one used in this example.

Still, for the time being let’s continue applying the belief that pandemics follow exponential growth. An exponential growth curve can be transformed into a straight line (i.e., linear growth) by taking logs of the data (this will be described later in this chapter). It is much easier to fit linear functions than nonlinear functions such as exponential curves, so many researchers follow the heuristic of applying a log transformation when dealing with these types of nonlinear curves.

Our ongoing example will focus on state-level effects so they are a natural candidate for a random effects linear regression. That is, each state will have its own intercept and slope on the log scale. The value of the state’s intercept can be interpreted as the percentage of the population infected with covid-19 on day 0; it can be tricky to interpret the intercept as I have formulated it so far, but I’ll present a different interpretation as we develop more machinery. The slope can be interpreted as the growth rate of the state.

We could use standard regression model (ordinary least squares) to create a complicated interaction between day and state, so that we have a state factor with 50 levels (i.e., 49 effect or contrast codes), giving us 49 intercepts (main effect of the state factor) and 49 slopes (interaction of the state factor with time) along with a main effect of time, but that is overkill. The random effects approach is a more efficient implementation. Intuitively, each state is modeled as a random draw from a multivariate normal distribution such that we draw 50 pairs of slopes and intercepts from that multivariate distribution. Once we have those 50 slope and intercept pairs, then we compute the predicted scores for each state. The algorithm finds the mean and covariance matrix of that multivariate distribution to minimize the discrepancy (i.e., squared residuals) between the observed and the predicted data. The multivariate normal distribution also allows a covariance between the intercept and slope to account for possible associations such as states that start early have more rapid growth rates.’’ As shown in the previous chapter, the log of the percentage of the state population follows an approximate line, though there was some evidence of curvature in the descriptive plots suggesting a deviation from a straight line, implying a deviation from a simple exponential growth curve.

In the following analyses, the variable day is entered into the model as a linear time variable (1, 2, 3, etc). In subsequent analyses we will consider more complicated treatments of the day variable to allow for additional nonlinearity over time.

I decided to drop rows from the data set that have counts of 0, in part because the log of 0 is undefined. Other strategies such as re-centering the time variable such that for each state day 1 corresponds to the first day the state showed a case as opposed to calendar day. In this strategy, day 1 would be assigned to March 18 for a state that showed its first case on that calendar day, whereas day 1 would be assigned to March 23 for the state that showed its first case on March 23. A drawback of this strategy is that it makes it a little trickier to model the date at which each state begins to show cases. I call this the liftoff’’ of the curve because it is when cases move away from 0. One may want to include predictors of liftoff, or attempt to explain liftoff differences across states or units, and it is easier to model these effects if the day variable follows calendar time. There are more complex modeling approaches, such as a structural equations modeling approach, that allow for more flexibility.

# drop all 0 counts from the allstates.long data base.
allstates.long <- allstates.long %>% mutate(count = na_if(count,
"0"))

# save percent variable and log10 version
allstates.long$percent <- allstates.long$count/allstates.long$population * 100 allstates.long$percent.log10 <- log10(allstates.long$percent) 5.1.1 Examine distributional assumptions It is good practice to check the distribution of the log percent variable to make sure there aren’t major issues with symmetry. The log of the percent of the state population that tests positive will be the dependent variable in the regressions we will conduct. # histogram of percent.log10; not a bad looking # distribution hist(allstates.long$percent.log10, xlab = "log percent",
main = "Histogram")

There is evidence of skewness in this variable, which raises concern about the distributional assumptions when conducting subsequent regressions. We will address this issue in later analyses but for now let’s plug along with the simpleminded idea that these data are a exponential function of day so a log transformation will be used.

5.1.2 Linear mixed model

Here are the results of this simple linear mixed model on the log scale.

# run linear mixed model default optimizer didn't
# converge so switched to Nelder_Mead
out.lmer <- lmer(percent.log10 ~ 1 + day.numeric + (1 +
day.numeric | state.abb), allstates.long, control = lmerControl(optimizer = "Nelder_Mead"))

if (is_html_output()) tab_model(out.lmer, show.se = T, show.stat = T)
percent log 10
Predictors Estimates std. Error CI Statistic p
(Intercept) -1.26 0.05 -1.35 – -1.17 -27.92 <0.001
day.numeric 0.01 0.00 0.01 – 0.01 45.16 <0.001
Random Effects
σ2 0.20
τ00 state.abb 0.10
τ11 state.abb.day.numeric 0.00
ρ01 state.abb -0.82
ICC 0.20
N state.abb 50
Observations 20648
Marginal R2 / Conditional R2 0.709 / 0.767

The estimates, se(est), CI, t and pvalue for the fixed effects are printed first, then the parameters of the random effect part are printed in the second part of the table. The random effect parameters, in order, are the variance of the residuals, the variance of the random effect intercept, the variance of the random effect slope, the correlation between the random intercept and the random slope, the adjusted ICC, number of states, number of observations (recall 0s have been removed from this data file so the N refers to the number of days by state for which there was a nonzero number of positive covid-19 cases), and the R$$^2$$ values. For more documentation on the computation of these terms see the help page for function tab_model().

5.1.3 Diagnostics

Now let’s examine plots of residuals to diagnose potential issues with this model. The plot of the residuals against the fitted values should look like a random point cloud in the ideal case and the QQ plot should show a relatively straight line. For a discussion of using residuals to diagnose model assumptions, see my lecture notes.

# plots of residuals
plot(out.lmer)

qqnorm(resid(out.lmer))
qqline(resid(out.lmer))

The residuals against the fitted values suggest a nonlinear pattern remains in the data. This suggests that we may not have completely linearized the data with the log transformation. Recall the logic rule modus tollens that states “if p, then q is accepted and not-q holds, then we can infer not-p.” Applied to this situation, if a functional relation is exponential (p), then a log transformation will linearize the relation (q). We observe from the residuals that a log transform did not linearlize the relation (not-q), so we conclude that the original functional relation was not exponential (not-p). While it may be disappointing that the residuals suggest the assumptions of regression are violated, we can make a relatively strong conclusion that the original relation, whatever it is, cannot be purely an exponential.

The qqplot also suggests a deviation from a typical normal distribution; that is, the residuals appear to have longer tails than expected by a normal distribution. Both of these plots suggest there are issues that need to be addressed. We probably don’t want to jump right into doing another transformation to deal with the skewness as that may create new issues with interpretation. We selected the log transformation because we are checking for exponential growth and on the log scale the exponential growth should become linear. So maybe the exponential growth model may not be appropriate here, or there may be other issues with the data that are contaminating the exponential growth pattern. Without additional modeling we cannot adjudicate between these possibilities just from this present regression alone.

Even though the residual plots raise concern about this model, let’s continue with this example and examine the state-level slopes and intercepts more directly with a table listing the slope and intercept for each state and with some plots. The the following plot, the light gray points have confidence intervals that overlap 0 so those parameter estimates are not significantly different from zero. We did not perform a multiple test correction such as Bonferroni or Scheffe to address the 100 tests we just did (50 states on each of slope and intercept). This would just require a small modification to the code to adjust the level of the CI to correspond, say, to the Bonferroni corrected levels. There has been some extensive debate in the literature as to whether multilevel models require additional corrections for multiple tests.

# list of the set of 50 intercept/slope pairs
coef(out.lmer)$state.abb ## (Intercept) day.numeric ## Alabama -1.193 0.00675 ## Alaska -1.732 0.00791 ## Arizona -1.341 0.00734 ## Arkansas -1.314 0.00712 ## California -1.312 0.00679 ## Colorado -1.154 0.00602 ## Connecticut -0.818 0.00505 ## Delaware -0.927 0.00561 ## Florida -1.221 0.00675 ## Georgia -1.102 0.00635 ## Hawaii -1.847 0.00667 ## Idaho -1.378 0.00721 ## Illinois -0.978 0.00596 ## Indiana -1.248 0.00675 ## Iowa -1.207 0.00693 ## Kansas -1.444 0.00746 ## Kentucky -1.560 0.00754 ## Louisiana -0.661 0.00496 ## Maine -1.557 0.00591 ## Maryland -1.013 0.00560 ## Massachusetts -0.616 0.00447 ## Michigan -1.000 0.00547 ## Minnesota -1.445 0.00734 ## Mississippi -1.044 0.00627 ## Missouri -1.479 0.00746 ## Montana -1.851 0.00845 ## Nebraska -1.234 0.00694 ## Nevada -1.229 0.00675 ## New Hampshire -1.406 0.00616 ## New Jersey -0.572 0.00453 ## New Mexico -1.322 0.00680 ## New York -0.404 0.00384 ## North Carolina -1.404 0.00705 ## North Dakota -1.382 0.00769 ## Ohio -1.438 0.00703 ## Oklahoma -1.598 0.00797 ## Oregon -1.633 0.00663 ## Pennsylvania -1.150 0.00596 ## Rhode Island -0.828 0.00567 ## South Carolina -1.333 0.00711 ## South Dakota -1.189 0.00700 ## Tennessee -1.257 0.00706 ## Texas -1.439 0.00740 ## Utah -1.331 0.00726 ## Vermont -1.504 0.00530 ## Virginia -1.282 0.00639 ## Washington -1.076 0.00513 ## West Virginia -1.701 0.00749 ## Wisconsin -1.371 0.00731 ## Wyoming -1.577 0.00756 # include 95% CI, could do instead +/- 1 standard error, # which is roughly .68 CIs light grey points are not # statistically different from 0 raeff <- REsim(out.lmer) plotREsim(raeff, labs = T, stat = "mean", level = 0.95) Exercise 5.1 Redo the plot of coefficients by altering the confidence interval to correspond to a Bonferroni correction for 50 tests within a “family.” This way the Bonferroni adjusts the confidence interval widths within family—50 confidence intervals around the intercepts and 50 confidence intervals around the slopes. It turns out that computing standard errors for the prediction is not straightforward for linear mixed models and quite complicated for nonlinear mixed models that we will perform later. You can read about the complexity here. Here is one method for computing the predicted value and a prediction interval that relies on sampling. To maintain clarity I’ll focus just on two states, but one could do these plots for all 50 states. I’ll print the data for the first 30 days in the state of New York. The table includes the day, the fit from the regression equation, the confidence interval around the fit, the fit placed back on the original scale (i.e., $$10^{\mbox{fit}}$$) and the original data on the percentage scale. # First New York nydf <- allstates.long %>% filter(state.abb == "New York") pred <- data.frame(day.numeric = nydf$day.numeric, predictInterval(out.lmer,
nydf, which = "full", level = 0.95), fit.per = 10^predictInterval(out.lmer,
nydf)$fit, percap100 = nydf$percap100)
head(round(pred, 3), 30)
##    day.numeric    fit   upr   lwr fit.per percap100
## 1            0 -0.410 0.454 -1.30   0.376     0.001
## 2            1 -0.391 0.493 -1.30   0.408     0.001
## 3            2 -0.441 0.518 -1.31   0.376     0.001
## 4            3 -0.397 0.500 -1.33   0.395     0.002
## 5            4 -0.335 0.482 -1.23   0.409     0.002
## 6            5 -0.394 0.518 -1.25   0.435     0.003
## 7            6 -0.398 0.551 -1.31   0.430     0.003
## 8            7 -0.335 0.489 -1.29   0.430     0.005
## 9            8 -0.404 0.523 -1.31   0.418     0.007
## 10           9 -0.368 0.497 -1.28   0.422     0.013
## 11          10 -0.374 0.554 -1.25   0.445     0.022
## 12          11 -0.375 0.507 -1.31   0.417     0.039
## 13          12 -0.361 0.543 -1.25   0.435     0.054
## 14          13 -0.352 0.559 -1.27   0.456     0.078
## 15          14 -0.333 0.580 -1.15   0.457     0.107
## 16          15 -0.342 0.517 -1.23   0.451     0.132
## 17          16 -0.333 0.505 -1.22   0.443     0.159
## 18          17 -0.328 0.600 -1.19   0.429     0.192
## 19          18 -0.312 0.565 -1.21   0.476     0.230
## 20          19 -0.326 0.522 -1.16   0.429     0.269
## 21          20 -0.291 0.600 -1.22   0.479     0.307
## 22          21 -0.346 0.508 -1.19   0.516     0.343
## 23          22 -0.319 0.563 -1.17   0.494     0.390
## 24          23 -0.317 0.570 -1.12   0.479     0.465
## 25          24 -0.274 0.617 -1.18   0.521     0.518
## 26          25 -0.302 0.604 -1.17   0.498     0.573
## 27          26 -0.316 0.645 -1.18   0.490     0.626
## 28          27 -0.313 0.578 -1.22   0.472     0.682
## 29          28 -0.314 0.583 -1.17   0.515     0.733
## 30          29 -0.285 0.633 -1.14   0.541     0.771
ggplot(pred) + geom_line(aes(day.numeric, fit)) + geom_ribbon(aes(day.numeric,
ymin = lwr, ymax = upr), alpha = 0.2) + geom_point(aes(day.numeric,
y = log10(percap100))) + ggtitle("New York")

Note that the linear pattern (line and prediction band) doesn’t do a good job of capturing the actual data points. Some points are over predicted in the early days, then under predicted, then over predicted again. This would produce a pattern where the residuals are negative, then positive, then negative again as we saw in one of the earlier residual plots. The data looked approximately linear but upon close inspection a model that is actually linear in the log scale seems to systematically miss the underlying pattern.

To check that this pattern is not just a fluke for New York, let’s look at another state. For Michigan a similar story, where we see over, under and over prediction. These plots suggest that the linear model may show some systematic deviations from linearity.

# similarly for michigan
midf <- allstates.long %>% filter(state.abb == "Michigan")
pred <- data.frame(day.numeric = midf$day.numeric, predictInterval(out.lmer, midf), fit.per = 10^predictInterval(out.lmer, midf)$fit,
percap100 = midf$percap100) head(round(pred, 3), 30) ## day.numeric fit upr lwr fit.per percap100 ## 1 0 -1.003 -0.488 -1.53 0.098 0.000 ## 2 1 -1.009 -0.417 -1.56 0.099 0.000 ## 3 2 -1.012 -0.428 -1.54 0.101 0.000 ## 4 3 -0.957 -0.387 -1.53 0.097 0.000 ## 5 4 -0.995 -0.379 -1.57 0.103 0.000 ## 6 5 -0.960 -0.370 -1.56 0.110 0.000 ## 7 6 -0.987 -0.370 -1.54 0.113 0.000 ## 8 7 -0.934 -0.370 -1.55 0.106 0.001 ## 9 8 -0.973 -0.392 -1.53 0.116 0.001 ## 10 9 -0.953 -0.389 -1.54 0.116 0.002 ## 11 10 -0.951 -0.379 -1.54 0.111 0.003 ## 12 11 -0.951 -0.411 -1.53 0.111 0.005 ## 13 12 -0.948 -0.346 -1.52 0.117 0.006 ## 14 13 -0.921 -0.372 -1.50 0.114 0.011 ## 15 14 -0.933 -0.366 -1.52 0.124 0.015 ## 16 15 -0.950 -0.367 -1.56 0.117 0.019 ## 17 16 -0.899 -0.322 -1.47 0.129 0.025 ## 18 17 -0.905 -0.330 -1.50 0.121 0.030 ## 19 18 -0.889 -0.269 -1.50 0.123 0.039 ## 20 19 -0.884 -0.318 -1.45 0.124 0.049 ## 21 20 -0.919 -0.312 -1.48 0.126 0.058 ## 22 21 -0.880 -0.292 -1.49 0.131 0.068 ## 23 22 -0.893 -0.305 -1.46 0.127 0.080 ## 24 23 -0.886 -0.315 -1.42 0.134 0.097 ## 25 24 -0.905 -0.333 -1.47 0.127 0.112 ## 26 25 -0.887 -0.316 -1.42 0.134 0.132 ## 27 26 -0.864 -0.285 -1.43 0.142 0.148 ## 28 27 -0.861 -0.286 -1.46 0.141 0.163 ## 29 28 -0.884 -0.308 -1.40 0.147 0.178 ## 30 29 -0.861 -0.267 -1.41 0.143 0.197 ggplot(pred) + geom_line(aes(day.numeric, fit)) + geom_ribbon(aes(day.numeric, ymin = lwr, ymax = upr), alpha = 0.2) + geom_point(aes(day.numeric, y = log10(percap100))) + ggtitle("Michigan") These plots suggest a major issue with this linear mixed model regression. The observation that the trajectories do not follow a straight line in the log scale suggests these data do not follow an exponential growth model. It is more complicated than that. We see similar violations of straight lines in the death data (e.g., see the New York Times graphs on the number of deaths for a similar curvature away from a straight line). The violation of the exponential could be due to many factors, such as interventions of social distancing, diseases tend to follow an exponential curve early but then switch to a different process (see below for logistic growth curves), additional sources of noise contaminating the data such as reporting bias, etc. Now that we understand the slopes and intercepts, let’s make some predictions. I’ll go 5 days out. The first plot will be on the log scale, but be mindful that these predictions are based on an exponential growth model that the residual analysis suggests may not hold for these data. max.day.numeric <- max(allstates.long$day.numeric)
max.day <- max(allstates.long$day) # predict day.ahead number of days days.ahead <- 5 # be sure order of states is the same across all parts # of this command predmat <- data.frame(day.numeric = rep(c(0:(max.day.numeric + days.ahead)), 50), state.abb = rep(rownames(coef(out.lmer)$state.abb),
each = max.day.numeric + days.ahead + 1), prettyval = rep(allstates.long$prettyval[order(match(allstates$Province/State,
state.population$NAME))], each = max.day.numeric + days.ahead + 1), population = rep(allstates.long$population[order(match(allstates$Province/State, state.population$NAME))], each = max.day.numeric + days.ahead +
1))

# compute predictions
predmat$prediction <- predict(out.lmer, predmat) # linear plot of predictions as estimated predmat %>% mutate(label = if_else(day.numeric == max(day.numeric), as.character(prettyval), NA_character_)) %>% ggplot(aes(day.numeric, prediction, group = prettyval, color = prettyval)) + geom_line() + geom_label_repel(aes(label = label), nudge_x = 2, na.rm = T, segment.color = "grey50", label.size = 0.01, size = 2.5, show.legend = F) + coord_cartesian(clip = "off") + scale_x_continuous(limits = c(0, max.day.numeric + days.ahead + 3)) + theme(legend.position = "none", plot.margin = margin(0.1, 1, 0.1, 0.1, "cm")) + geom_vline(xintercept = max.day.numeric + 0.5, color = "red") + annotate(geom = "text", x = max.day.numeric - 30, y = -5, label = "Days with observations", color = "red") + annotate(geom = "text", x = max.day.numeric + 8, y = -5, label = "Forecast", color = "red") Using the same data I take the exponential of those predictions to see the percentages with their predicted curves. The Y axis is on the percentage scale (count/population * 100). # plot of same predictions as 10^prediction to put back # on original percentage scale predmat %>% mutate(label = if_else(day.numeric == max(day.numeric), as.character(prettyval), NA_character_)) %>% ggplot(aes(day.numeric, 10^prediction, group = prettyval, color = prettyval)) + geom_line() + geom_label_repel(aes(label = label), nudge_x = 2, na.rm = T, segment.color = "grey50", label.size = 0.01, size = 2.5, show.legend = F) + coord_cartesian(clip = "off") + scale_x_continuous(limits = c(0, max.day.numeric + days.ahead + 3)) + theme(legend.position = "none", plot.margin = margin(0.1, 1, 0.1, 0.1, "cm")) + geom_vline(xintercept = max.day.numeric + 0.5, color = "red") + annotate(geom = "text", x = max.day.numeric - 30, y = 16, label = "Days with observations", color = "red") + annotate(geom = "text", x = max.day.numeric + 8, y = 16, label = "Forecast", color = "red") Other approaches I could have taken include using the state’s population as a weight and computed weighted linear mixed models on the log of the count data, or, because I’m working in log scale, I could use the state population as an offset in the linear mixed model (as in offset(log(population))) and use log counts as the dependent variable. These are alternative ways of taking population size into account. Further checks include taking into the account possible correlated residuals. For example, it isn’t reasonable to assume that the residual at day t is independent from the residual at day t+1 for a given state, which is what my model assumed in what I presented above. I’ll show how to include a correlation pattern on the residuals later in this chapter. There is clearly a lot more one would need to do in order to justify the use of the analytic strategy in this section. Exercise 5.2 Rerun the linear mixed model on the log transform of the counts for each state and include population size as an offset. Compare these results to the previous regression on the log of the percentage. Of course, once the basic structure of this linear mixed model on the log scale is set (e.g., error structure on residuals is determined, how best to use the state’s population in the model, etc), then one could bring in additional predictors. We could, for example, take some of the data sets we downloaded that have the dates of when states took different measures and see if any of those state mandates changed, say, the slopes for the states. At this point, once the complex error structure, linearity, and random effect structure are determined, then the rest is just routine regression that follows the analogous criteria in regression. For example, one could include interactions between predictors or controls using other predictors. 5.1.4 Comparison with polynomial regression In the previous model I showed that the linear mixed model regression (allowing for each state to have their own slope and intercept) on the log of the state-level percentages can be used to model these data. Here I want to go on a little tangent and illustrate how well a polynomial can mimic the exponential pattern we see in the data. Let’s take the case that in a particular locale the rate of covid-19 follows this general curve. In this subsection I made up the data so I know the curve follows exponential growth with an additive error term (the plus $$\epsilon$$) because I added random normally distributed noise with mean 0 and standard deviation .25. I’ll run several regressions in order and produce one plot at the end. The points on the graph are the observed data points, the colored curves emerge from different regression models. The first regression is a linear regression with just x (days 1-25) as a linear predictor. I added a violet straight line to indicate the best fitting straight line—not a good fit. The second regression adds a quadratic term and will appear as a red curve. This is an improvement, but it is nonmonotonic and shows a decrease in the first few days (which is not a pattern displayed in the raw data). The third regression adds a cubic term and fits the points relatively well. Next, I estimate the exponential directly using nonlinear regression. Instead of the parameter being a multiplier of the predictor (like in linear regression), the parameter serves as the growth factor. The green curve follows the data closely. Here only one parameter was estimated and it was .1502, very close to the actual .15 I used to estimate the data. The polynomial did quite well but needed 4 parameters (intercept and three slopes); those parameters are not easy to interpret. The single parameter of the exponential yields a very good fit with only one interpretable parameter. This makes sense because these hypothetical data were generated from an exponential model. Note that here I am using the nls() command, which estimates a nonlinear regression. Finally, for comparison, I run a linear regression through these data after transforming the Y variable into the log, then re-expressing the predicted scores back to the original scale like I did in the predictions using the linear mixed model with the lmer() command above. I’ll draw that curve in black—very similar result to the nonlinear regression with the exponential and the third order polynomial. # set random seed to results appear the same each run set.seed(101202019) x <- 1:25 # here is the true model that generated the data y <- 1 * exp(0.15 * x) + rnorm(length(x), 0, 1.5) plot(x, y) # linear regression with one predictor, draw violet # curve out.lm <- lm(y ~ x) summary(out.lm) ## ## Call: ## lm(formula = y ~ x) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.150 -3.676 -0.611 1.977 16.006 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -7.13 2.23 -3.19 0.0041 ** ## x 1.47 0.15 9.76 1.2e-09 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 5.42 on 23 degrees of freedom ## Multiple R-squared: 0.806, Adjusted R-squared: 0.797 ## F-statistic: 95.3 on 1 and 23 DF, p-value: 1.2e-09 lines(x, predict(out.lm), col = "violet") # add quadratic term, draw red curve out.q <- lm(y ~ x + I(x^2)) summary(out.q) ## ## Call: ## lm(formula = y ~ x + I(x^2)) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.269 -2.061 0.498 1.971 7.055 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.2569 1.7863 2.38 0.0262 * ## x -1.0626 0.3166 -3.36 0.0029 ** ## I(x^2) 0.0973 0.0118 8.23 3.7e-08 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.74 on 22 degrees of freedom ## Multiple R-squared: 0.952, Adjusted R-squared: 0.948 ## F-statistic: 220 on 2 and 22 DF, p-value: 2.86e-15 lines(x, predict(out.q), col = "red") # add cubic term, draw blue curve out.c <- lm(y ~ x + I(x^2) + I(x^3)) summary(out.c) ## ## Call: ## lm(formula = y ~ x + I(x^2) + I(x^3)) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.326 -1.077 -0.044 0.807 4.874 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.89469 1.85025 -1.02 0.31748 ## x 1.52620 0.60425 2.53 0.01965 * ## I(x^2) -0.14682 0.05345 -2.75 0.01208 * ## I(x^3) 0.00626 0.00135 4.63 0.00015 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.98 on 21 degrees of freedom ## Multiple R-squared: 0.976, Adjusted R-squared: 0.973 ## F-statistic: 290 on 3 and 21 DF, p-value: <2e-16 lines(x, predict(out.c), col = "blue") # nonlinear exponential directly estimated, draw green # curve out.exp <- nls(y ~ exp(beta * x), start = list(beta = 0.5)) summary(out.exp) ## ## Formula: y ~ exp(beta * x) ## ## Parameters: ## Estimate Std. Error t value Pr(>|t|) ## beta 0.150017 0.000941 159 <2e-16 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.76 on 24 degrees of freedom ## ## Number of iterations to convergence: 12 ## Achieved convergence tolerance: 5.2e-06 lines(x, predict(out.exp), col = "green") # log transform followed by linear regression out.ln <- lm(log(y) ~ x) summary(out.ln) ## ## Call: ## lm(formula = log(y) ~ x) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.1274 -0.1167 -0.0457 0.2062 0.6039 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.24687 0.14114 -1.75 0.094 . ## x 0.16446 0.00949 17.32 1.1e-14 *** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.342 on 23 degrees of freedom ## Multiple R-squared: 0.929, Adjusted R-squared: 0.926 ## F-statistic: 300 on 1 and 23 DF, p-value: 1.08e-14 lines(x, exp(predict(out.ln)), col = "black") Polynomials can do a reasonable job at capturing functions but they are not always easy to interpret without some additional work (e.g., Cudek & du Toit, 2002). The third order polynomial got close but good luck in interpreting the four parameters (intercept and three $$\beta$$s). The quadratic though while it was close predicts that the counts in the early days will drop (the red curves is decreasing from day 1 to day 4) so that’s not good because the data points don’t see to be showing that pattern. The nonlinear model (nls) estimated the parameters quite closely (the true multiplier was .15 and that is what we estimated; the standard error of the residuals was .25 and it was estimated as .29). Another detail to highlight is that the linear model on the log data (black curve) was almost indistinghisable from the true exponential pattern as estimated by the fits of the green line. The estimates from the linear model (lm) on the log transformed data are also quite close to the true value. The intercept in the log linear model should correspond to the multiplier in the true model, which is 1, and the exponential of the intercept is .98 (exp(-.0183)). The value of the multiplier in the log linear model (.1512) is close to the true multiplier in the model I created .15. While polynomials may do a good job of fitting data in hand, they may not be the useful for prediction and forecast—basically, extrapolating out of the range of the data. Here I take the same plot just examined but extend the prediction out 25 days beyond the data. The vertical black line marks the region of data (left) and prediction (right). The green curve is the estimated exponential growth fit to the data and extended 25 days out. The blue curve is the cubic form that fits the data relatively well (left side) but in extrapolation out 25 days from the data may not do a good job of predicting the actual curvature of the exponential function. The gray bands refer to the 95% prediction bands. Note that the bands for the exponential curve appear to be too narrow, and this is an outgrowth that the theory for such intervals in the nonlinear case is not clear cut. A simulation-based approach to estimate the prediction interval may be better. x <- 1:50 # pad y with missing data from day 26 to 50 y <- c(y, rep(NA, 25)) plot(x, y, ylim = c(0, 1800)) # lines(x,exp(predict(out.ln,newdata= data.frame(x=x))), # col='black') title("Failure of polynomial extrapolation") abline(v = 25.5, col = "black") text(20, 500, "Data") text(30, 500, "Prediction") text(45, 1800, "True Exponential", col = "green") text(45, 100, "Estimated Polynomial", col = "blue") pred.exp <- predFit(out.exp, newdata = data.frame(x = x[26:50]), se.fit = T, interval = "prediction", level = 0.95) polygon(c(x[26:50], rev(x[26:50])), c(pred.exp$fit[, "lwr"],
rev(pred.exp$fit[, "upr"])), col = "gray", border = NA) pred.c <- predFit(out.c, newdata = data.frame(x = x[26:50]), se.fit = T, interval = "prediction", level = 0.95) polygon(c(x[26:50], rev(x[26:50])), c(pred.c$fit[, "lwr"],
rev(pred.cfit[, "upr"])), col = "gray", border = NA) lines(x, predict(out.exp, newdata = data.frame(x = x)), col = "green") lines(x, predict(out.c, newdata = data.frame(x = x)), col = "blue") Polynomials are frequently used. I found this website that uses a 4th order polynomial to model covid-19 data from Spain, Italy and US. This page also focuses on the daily incidence rate, a topic I mentioned earlier. Personally, I’m suspicious of using such high order terms to model such curves especially if they are used to forecast out from the data (i.e., extrapolation). [As of June 17, the website no longer shows polynomials.] The polynomial idea to approximate functions can be extended in more complex ways. Here is a tutorial using the MARS package in R; this is probably the direction I’d go if I was interested in taking this type of approximation approach. But I’d rather stick to more interpretable functional forms for most of the remaining sections. 5.1.5 Thoughts on the log transformation This type of model where we transform the dependent variable by converting to logs, creates an additive error term in the log scale; that is, log(counts) = f(days) + $$\epsilon$$. But maybe that isn’t the right error structure. Maybe the error is added to the counts rather than the log of the counts; that is, the data may exhibit a counts = f(days) + $$\epsilon$$ structure, which is a different error structure than the one on the log scale. The point here is that while one may think about using a transformation to linearize a model, that transformation also has an implicit set of assumptions and has potential side-effects on the modeling such as introducing particular error structure. We now turn to modeling the curvature directly using nonlinear regression methods that not only permits complicated functioanl forms such as exponentials but also permits modeling the error as an additive effect on the counts as in the form: counts = f(days) + $$\epsilon$$ model. 5.2 Exponential Nonlinear Mixed Model Regression In the previous section I illustrated how to run a nonlinear regression using the nls() command to estimate exponential growth. 5.2.1 Exponential Background For a good explanation of exponential growth and logistic growth (which will be described in the next subsection) see this video. You can think of this exponential form as being similar to a balance in a savings account that builds with compound interest. At each time step, a percentage is added to the current principal thus increasing the balance in the savings account. At the subsequent time step, the principal is now higher so the interest earned on that day will also be higher than the previous time step. In the case of the covid-19 positive test counts, if each day the total count grows by some percentage, analogous to the principal in the savings account, then the growth is exponential. allstates.long.new <- groupedData(count ~ day.numeric | state.abb, data = allstates.long) # this is a more general function that can estimate # liftoff l (point at which counts begin to be nonzero) # for for now set l=0 richfu <- function(d, l, rate, y, b1) { ifelse(d < l, 0, y * rate^(b1 * (d - l))) } # example exponential growth model plot(0:max.day.numeric, richfu(0:max.day.numeric, 0, 1.15, 0.000555, 1) * 19500000/100, type = "l", xlab = "days", ylab = "counts", main = "Hypothetical plot of counts following an exponential growth \n(tied to population of New York)") I’m using one form of the exponential where the “rate” parameter is the base and it is raised to the power t. I did this to maintain a connection to the savings account analogy where the principal grows at rate x (where, say, a 5% interest rate has x = 1.05). The form is \begin{aligned} \mbox{count} &= y r^t \end{aligned} where t is the day, r is the rate (i.e., an “interest rate” of 5% has rate = 1.05), and y is a scaling parameter. The hypothetical plot above used a rate of 1.15 and a multiplier of .000555, which is in part based on the population of NY of 19.5M. I estimated these numbers from data collected in mid March, early into the covid-19 pandemic, and extrapolated out 60 days. Back in mid-March the idea of having 350,000 cases in New York within a couple of months seemed crazy. On May 7 New York reported 327,000 cases. There are other forms of the exponential that are equivalent, just on a different scale. For example, in the natural log scale, the form I’m using $$yr^t$$ is equivalently expressed as $$y*\exp{(log(r)t)}$$ and in base 10 it is equivalently expressed as $$y*10^{log10(r)t}$$. These are equivalent formulations, one just needs to keep track of the scale of the rate across these different choices of the base. 5.2.2 Exponential: The Liftoff Parameter The exponential function can vary in terms of liftoff, where the curve breaks away from 0. Liftoff is the parameter “l” I included in the richfu function in the previous code chunk. Here I plot two hypothetical states that have the identical set of parameters except that one state (black) has a liftoff of 0 and the other state (blue) has a liftoff of 4. It looks like the counts in the state depicted by the black curve is shooting up and by the end of the sequence almost double the count than the state represented by the blue curve. plot(0:max.day.numeric, richfu(0:max.day.numeric, 1, 1.23, 0.000555, 1) * 19500000/100, type = "l", xlab = "days", ylab = "counts", main = "Hypothetical plot of counts for two states\n with different liftoffs") lines(0:max.day.numeric, richfu(0:max.day.numeric, 4, 1.23, 0.000555, 1) * 19500000/100, col = "blue") To illustrate how time locking can show structure, let me use the previous plot, where the two states look to have different patterns. Time locking the two states to the first day when they each have nonzero cases shows that both states have identical trajectories. It is just that one state started earlier than the other state and so the patterns in the previous plot look (dramatically) different, and just by a delay of 4 days! Much like singing “Row, row, row your boat” in rounds—voices sing the same melody but start at different times creating a harmonious whole. # kludgy code to impose time locking plot(c(0:max.day.numeric)[-1], richfu(c(0:max.day.numeric)[-1], 1, 1.23, 0.000555, 1) * 19500000/100, type = "l", xlab = "days", ylab = "counts", main = "Hypothetical plot of counts for two states \n with different liftoffs but time locked") lines(c(0:max.day.numeric)[-1 * c(1:6)] - 5, richfu(c(0:max.day.numeric)[-1 * c(1:6)] - 2, 4, 1.23, 0.000555, 1) * 19500000/100, col = "blue") There are better ways to time lock using specific tools written for time series data, but those involve much more explanation that I want to go into just for illustrating the usefulness of time locking time series. For additional information see R packages written for time series data such as functional data analysis R packages. A good example of controlling for different liftoff patterns is by John Burn-Murdoch of the Financial Times. These visualization are unique from the ones I have seen on the web in that their curves are not plotted on calendar time but in terms of when the country hit X cases (sometimes X is 3 or 10 depending on the graph). This normalizes the onset and allows us to compare across units that had different starting points. A common phrase for this is “time locking” to a particular event, in this case time locking to when a country hit, say, 10 cases. There is a sense in which we didn’t have to worry about liftoff in the linear mixed model on the log scale we ran in the previous subsection because the intercept parameter essentially took care of that. We could complicate the exponential model by including a random effect intercept, but I’ll keep things simple in these notes. 5.2.3 Estimating the exponential with the nonlinear mixed model Let’s estimate an exponential growth model using a nonlinear mixed model regression. This mimics the LMER model we ran earlier in these notes but rather than transforming with a log to convert the data into straight lines, this time we model the curvature directly with an exponential function. To simplify things I constrain the liftoff parameter to 0 and the multiplier b1 to 1. To estimate all four parameters simultaneously we need to move to a different R package saemix that uses a stochastic EM algorithm to estimate this nonlinear mixed model because nlme can’t handle very complex models like this one without doing more ground work in setting up a self-starting function that has information about the gradient of the function. But, some additional work needs to be done to check that all four parameters can be estimated. There may be a redundancy with the rate and the b1 parameters (related to the base of the exponent that I described earlier in the Exponential Growth Background subsection when moving to different bases like base 10 or $$\exp$$). Sometimes parameters cannot be identified. For this estimation of the exponential function I’m using percent of positive cases in the state (i.e., count/population) rather than the raw count. currentversion <- system("git describe --tags", intern = T) save.image(file = paste0("data-archive/", currentversion, "-dataworkspace.RData")) print(paste0("File name associated with this run is ", currentversion, "-dataworkspace.RData")) ## [1] "File name associated with this run is v.95-123-g5a7aba6-dataworkspace.RData" out <- nlme(percent ~ richfu(day.numeric, l = 0, rate, y, b1 = 1), fixed = rate + y ~ 1, random = rate + y ~ 1, data = na.omit(allstates.long.new), start = c(rate = 1.01, y = 0.35), verbose = T, control = nlmeControl(msMaxIter = 500, opt = "nlminb", pnlsTol = 0.01, msVerbose = T, minScale = 0.01)) out.exponential <- out if (is_html_output()) pander(summary(out.exponential)) Fixed effects: rate + y ~ 1 Value Std.Error DF t-value p-value rate 1.007 9.558e-05 20597 10540 0 y 0.5937 0.03331 20597 17.82 1.559e-70 Standardized Within-Group Residuals Min Q1 Med Q3 Max -4.284 -0.803 -0.3264 0.2745 4.104 Linear mixed-effects model fit by maximum likelihood : rate + y ~ 1 Observations Groups Log-restricted-likelihood state.abb 20648 50 -29663 if (is_latex_output()) stargazer(out.exponential, header = F, type = "latex", report = "vcst", single.row = T, omit.stat = "n", omit.table.layout = "n", digits = 2, digits.extra = 2) # intervals(out.exponential, which='fixed') knitr::kable(intervals(out.exponential)[[1]], caption = "Confidence Intervals for Exponential Model Estimates") (#tab:ci.expo.model)Confidence Intervals for Exponential Model Estimates lower est. upper rate 1.007 1.007 1.008 y 0.528 0.594 0.659 # idea for boostrapping to get predicted interval but be # careful about the time aspect; maybe better to # simulate data # https://stackoverflow.com/questions/55552080/how-to-obtain-confidence-intervals-for-non-linear-mixed-model-logistic-growth-c The fixed effects parameters refer to state-level parameters. You can think of these as the typical values for the states. The random effects part refers to the variability across states in those parameters. For example, while the fixed effects y parameter is estimated at 0.59, the states have a standard deviation of 0.233. Let’s plot the fitted values of each state. These are the predicted exponential curves according to the model. Note that this plot represents the model estimates rather than the actual data. predmatnonlin <- data.frame(day.numeric = rep(c(0:max.day.numeric), 50), state.abb = rep(rownames(coef(out.exponential)), each = max.day.numeric + 1)) predmatnonlinprediction <- predict(out.exponential, list(day.numeric = predmatnonlin$day.numeric, state.abb = predmatnonlin$state.abb), level = 1)
predmatnonlin.exponential <- predmatnonlin
nonlinreg <- ggplot(predmatnonlin, aes(day.numeric, prediction,
group = state.abb, color = state.abb)) + geom_line()
nonlinreg

We can examine specific states. The best fitting exponential curve for New York is shown in the next figure (observed data points are in blue).

nydf <- predmatnonlin[predmatnonlin$state.abb == "New York", ] nydf$percent <- allstates.long$percent[allstates.long$state.abb ==
"New York"]
nydf.exponential <- nydf
ggplot(nydf, aes(day.numeric, prediction, group = state.abb,
color = state.abb)) + geom_line() + geom_point(aes(x = day.numeric,
y = percent), col = "blue") + ggtitle("New York: Observed data in blue; exponential growth estimate in red")

Consistent to what we observed in the log transformed data and the linear mixed model, the points do not appear to follow an exponential curve, with the case of New York being one example.

We now examine the residuals to evaluate whether there is any structure the model is failing to pick up. The overall plot of the residuals suggests some major issues. For example, I colored in blue the residuals corresponding to New York and connected them with line segments to help them stand out. There is other structure in those residuals.

plot(fitted(out.exponential), resid(out.exponential), main = "New York in blue")
# suspect weird points are new york; so put NY in blue
# points and connected dots with a curve
points(fitted(out.exponential)[names(fitted(out.exponential)) ==
"New York"], resid(out.exponential)[names(resid(out.exponential)) ==
"New York"], col = "blue")
lines(fitted(out.exponential)[names(fitted(out.exponential)) ==
"New York"], resid(out.exponential)[names(resid(out.exponential)) ==
"New York"], col = "blue")

Here is a zoomed in version of the residual plot focusing in on the major “clump” of points.

[eval off]

# let's drop outliers and look at the resid plot (zoom
# in for fits < .08 and abs(resid) < .01) this helps us
# see the majority of the residual plot
plot(fitted(out.exponential)[fitted(out.exponential) < 0.08 &
abs(resid(out.exponential)) < 0.01], resid(out.exponential)[fitted(out.exponential) <
0.08 & abs(resid(out.exponential)) < 0.01], xlab = "fits < .08",
ylab = "abs(resid) < .01", main = "zoom in ")

The QQ plot raises further suspicion on this model due to the long tails.

qqnorm(resid(out.exponential))
qqline(resid(out.exponential))

Overall, the exponential curve does not appear to offer a solid representation of these data. This is consistent with the conclusion we reached from the linear mixed model on the log data. We will move to a different form that may be more appropriate for these data.

5.3 Logistic Nonlinear Mixed Regression

Let’s switch to a logistic growth model (which is not the same as logistic regression for binary data). The nlme package has some additional functions to simplify the fitting of logistic growth models (e.g., SSlogis), which make it easier to set up starting values. In other work I’ve seen more efficient estimation (and less fussiness around starting values) with the stochastic EM approach to fitting nonlinear regression models such as in the package saemix.

The nlme() command has a difficult time with the three parameters of the logistic growth model. I did a little bit of tinkering with starting values and the three parameters. I was able to get a reasonable fit with fitting three parameters of the logistic growth model and having two of them be random effects (i.e., they are allowed to vary by state). A more careful analysis would need to be done to make this publishable, including exploring other parameterizations of logistic growth that may behave better with nlme. I give hints in the commented out sections of the R code.

The logistic growth model starts off being close to an exponential at first but eventually tapers off. Many biological systems follow a logistic growth function. Here is a link to a relatively simple explanation of logistic growth.

This is the form of logistic growth I’m using

\begin{aligned} count &= \frac{asymptote}{1 + ye^{\beta_1(day.mid - day)}} \end{aligned} The asymptote parameter controls the maximum value the function reaches, the day.mid parameter is both the day at which the function is half way to the asymptote and is also the inflection point and $$\beta_1$$ is a scaling parameter defined to be the distance between the inflection point day.mid and point at which the count is 73% of the asymptote (basically, the difference between day.mid and the day corresponding to the count asymptote*$$\frac{1}{1+exp(-1)}$$, see Pinhero & Bates). It is possible to have an extra parameter y, which is a multiplier, but be careful that the model is identified because $$y e^{x_1}$$ can be reexpressed as $$e^{u + x1}$$ where $$u=\log{(y)}$$ creating issues in separately estimating y from additive parameters that are in the exponential. For most of the following I’ll constrain y to 1, thus setting the scale, and not estimate the y parameter.

The next three plots examine how each of the three parameters control the shape of the logistic growth curve. I’m making use of the SSlogis function to define these logistic growth curves. The SSlogis() function labels these parameters asym, xmid and scal, respectively.

# vary asymp
x <- 1:80
xmid <- 25
scal <- 10
ggplot(data.frame(day.numeric = x, y = SSlogis(x, 10000,
xmid, scal)), aes(x = x)) + stat_function(fun = SSlogis,
args = list(12000, xmid, scal), aes(colour = "12000")) +
stat_function(fun = SSlogis, args = list(10000, xmid,
scal), aes(colour = "10000")) + stat_function(fun = SSlogis,
args = list(8000, xmid, scal), aes(colour = " 8000")) +
scale_colour_manual(name = "Asymp", values = c("deeppink",
"dodgerblue3", "green"), labels = c(12000 = "12000",
10000 = "10000", 8000 = "8000")) + guides(color = guide_legend(reverse = T)) +
xlab("day.numeric") + ylab("Cumulative Count") + ggtitle("Vary the asymptote parameter (how high)")

# vary xmid
x <- 1:80
xmid <- c(25, 45, 65)
scal <- 10
ggplot(data.frame(day.numeric = x, y = SSlogis(x, 10000,
xmid, scal)), aes(x = x)) + stat_function(fun = SSlogis,
args = list(10000, xmid[1], scal), aes(colour = as.character(xmid[1]))) +
stat_function(fun = SSlogis, args = list(10000, xmid[2],
scal), aes(colour = as.character(xmid[2]))) + stat_function(fun = SSlogis,
args = list(10000, xmid[3], scal), aes(colour = as.character(xmid[3]))) +
scale_colour_manual(name = "xmid", values = c("deeppink",
"dodgerblue3", "green")) + guides(color = guide_legend(reverse = T)) +
xlab("day.numeric") + ylab("Cumulative Count") + ggtitle("Vary the xmid parameter (when liftoff happens)")

# vary scale
x <- 1:80
xmid <- 45
scal <- c(2, 10, 20)
ggplot(data.frame(day.numeric = x, y = SSlogis(x, 10000,
xmid, scal)), aes(x = x)) + stat_function(fun = SSlogis,
args = list(10000, xmid, scal[1]), aes(colour = " 2")) +
stat_function(fun = SSlogis, args = list(10000, xmid,
scal[2]), aes(colour = as.character(scal[2]))) +
stat_function(fun = SSlogis, args = list(10000, xmid,
scal[3]), aes(colour = as.character(scal[3]))) +
scale_colour_manual(name = "scal", values = c("deeppink",
"dodgerblue3", "green")) + guides(color = guide_legend(reverse = T)) +
xlab("day.numeric") + ylab("Cumulative Count") + ggtitle("Vary the scale parameter (once start, how steep)")

Let’s fit this logistic growth function to these data. I now use the per capita count of positive tests out of 10,000 to normalize by state population.

The following R code makes use of the self-starting function SSlogis(), which has some useful features for running nonlinear regressions. I could incorporate population by including a weight parameter (weights=varIdent(form= ~1|state.abb)), which allows each state to have its own residual variance, or change the dependent variable from raw counts to, say, per capita counts. I’ll leave that as an exercise for the reader to check on how that changes the results.

# version was defined at the beginning of the the
# index.Rmd file get current tag and use a filename
# prefix
currentversion <- system("git describe --tags", intern = T)
save.image(file = paste0("data-archive/", currentversion,
"-dataworkspace.RData"))
print(paste0("File name associated with this run is ", currentversion,
"-dataworkspace.RData"))
## [1] "File name associated with this run is v.95-123-g5a7aba6-dataworkspace.RData"
# per capita out of 10,000
allstates.long.new$percap10000 <- allstates.long.new$percap100 *
100
# starts <- getInitial(percap10000~ SSlogis(day.numeric,
# Asym=600, xmid=200, scal=50), data=allstates.long.new)

out.perc <- nlme(percap10000 ~ SSlogis(day.numeric, Asym,
xmid, scal), fixed = Asym + xmid + scal ~ 1, random = Asym +
xmid + scal ~ 1, data = na.omit(allstates.long.new),
start = c(Asym = 850, xmid = 275, scal = 70), verbose = T,
control = nlmeControl(msMaxIter = 500, opt = "nlminb",
pnlsTol = 1e-04, msVerbose = T, minScale = 1e-04,
maxIter = 400, eval.max = 400, tolerance = 1, msTol = 1e-04))
# count version out.logistic <- out use per capita
# version throughout
out.logistic <- out.perc

if (is_html_output()) pander(summary(out.logistic))
Fixed effects: Asym + xmid + scal ~ 1
Value Std.Error DF t-value p-value
Asym 1161 107.8 20596 10.77 5.77e-27
xmid 292.2 9.276 20596 31.5 1.014e-212
scal 55.25 3.066 20596 18.02 5.129e-72
Standardized Within-Group Residuals
Min Q1 Med Q3 Max
-3.499 -0.4451 0.1083 0.819 5.096
Linear mixed-effects model fit by maximum likelihood : Asym + xmid + scal ~ 1
Observations Groups Log-restricted-likelihood
state.abb 20648 50 -103273
# if (is_latex_output()) texreg(out.logistic,
# booktabs=T, dcolumn = T, use.packages=F,
# float.pos='h', single.row=T)
if (is_latex_output()) stargazer(out.logistic, header = F,
type = "latex", report = "vcst", single.row = T, omit.stat = "n",
omit.table.layout = "n", digits = 2, digits.extra = 2,
align = T)
# confidence intervals of the parameters
knitr::kable(intervals(out.logistic, which = "fixed")[[1]],
caption = "95% confidence intervals: Logistic growth model")
(#tab:ci.logistic)95% confidence intervals: Logistic growth model
lower est. upper
Asym 949.9 1161.2 1372.6
xmid 274.0 292.2 310.4
scal 49.2 55.2 61.3
# Checking residuals for assumptions
plot(fitted(out.logistic), resid(out.logistic), main = "New York in blue")
# suspect weird points are new york; so put NY in blue
# points and connected dots with a curve
points(fitted(out.logistic)[names(fitted(out.logistic)) ==
"New York"], resid(out.logistic)[names(resid(out.logistic)) ==
"New York"], col = "blue")
lines(fitted(out.logistic)[names(fitted(out.logistic)) ==
"New York"], resid(out.logistic)[names(resid(out.logistic)) ==
"New York"], col = "blue")

# let's drop outliers and look at the resid plot (zoom
# in for fits < .08 and abs(resid) < .01) this helps us
# see the majority of the residual plot
plot(fitted(out.logistic)[fitted(out.logistic) < 50 & abs(resid(out.logistic)) <
7.5], resid(out.logistic)[fitted(out.logistic) < 50 &
abs(resid(out.logistic)) < 7.5], xlab = "fits < 50",
ylab = "abs(resid) < 7.5", main = "zoom in ")

qqnorm(resid(out.logistic))
qqline(resid(out.logistic))

# autocorrelation, set lag to 6 given small data set
ACF(out.logistic, maxLag = 6)
##   lag   ACF
## 1   0 1.000
## 2   1 0.999
## 3   2 0.997
## 4   3 0.994
## 5   4 0.989
## 6   5 0.984
## 7   6 0.978
plot(ACF(out.logistic, maxLag = 6))

We examine the same three residual plots that we used with the exponential growth model and see a slight improvement in the zoomed in version and slightly less extreme tails. While better, these residual plots suggest there is still some structure that is not adequately modeled. I also include the autocorrelation function plot, suggesting significant autocorrelation in the residuals that is not being modeled.

Below is the plot of the predicted curves for this logistic growth model. Notice that the fits for some states begin to taper off according to the logistic growth model. One could examine these differences to see which model performs better in predictive accuracy (i.e., see if the actual data for the state tapers off or continues to grow exponentially). Such patterns are testable and the ideal way to compare models and decide which one provides a better representation of the data.

predmatnonlin <- data.frame(day.numeric = rep(c(0:max.day.numeric),
50), state.abb = rep(rownames(coef(out.logistic)), each = max.day.numeric +
1))
predmatnonlin$prediction <- predict(out.logistic, list(day.numeric = predmatnonlin$day.numeric,
state.abb = predmatnonlin$state.abb), level = 1) predmatnonlin.logistic <- predmatnonlin nonlinreg.logistic <- ggplot(predmatnonlin, aes(day.numeric, prediction, group = state.abb, color = state.abb)) + geom_line() + ggtitle("Logistic Growth Model: New York") nonlinreg.logistic temp <- allstates.long.new$percap10000[order(predmatnonlin$state.abb, predmatnonlin$day.numeric)]
predmatnonlin$percap10000 <- temp ggplot(predmatnonlin, aes(day.numeric, prediction)) + geom_line(size = 1) + geom_point(data = allstates.long.new, aes(x = day.numeric, percap10000, group = state.abb), color = "red", size = 1, alpha = 0.1) + facet_wrap_paginate(~state.abb, nrow = 5, ncol = 5, page = 1) ggplot(predmatnonlin, aes(day.numeric, prediction)) + geom_line(size = 1) + geom_point(data = allstates.long.new, aes(x = day.numeric, percap10000, group = state.abb), color = "red", size = 1, alpha = 0.1) + facet_wrap_paginate(~state.abb, nrow = 5, ncol = 5, page = 2) Here is the predicted curve for New York with the actual data points in blue. We know from the analyses of the residuals that the model fit is still not ideal, but this plots suggests a logistic growth pattern is a reasonable fit to the data and better than the pure exponential growth model. Where the model is not capturing the data well is the liftoff and the sharp increase once the curve started to increase. There is also a pattern in the data around day 75 that is worth examining (e.g., checking what happened around that time in New York, such as a change in counting policy or a change in diagnostic test) and the actual data pattern is continuing to increase more rapidly than the predicted logistic growth model after day 55. nydf <- predmatnonlin[predmatnonlin$state.abb == "New York",
]

nydf$percap10000 <- allstates.long.new$percap10000[allstates.longstate.abb == "New York"] nydf.logistic <- nydf ggplot(nydf, aes(day.numeric, prediction, group = state.abb, color = state.abb)) + geom_line() + geom_point(aes(x = day.numeric, y = percap10000), col = "blue") + ggtitle("New York (blue=Actual Per Capita 10,000; red=logistic growth") Another way to compare across models is to use an index such as the AIC, which is a measure based on information theory and akin to R^2 adjusted that can be used to compare regression models that vary in the number of parameters. One needs to be careful when using AIC across models where the dependent variable is on different scales as in these notes. For example, my log transform of percentage cases (count/population) in the linear mixed model is on a different scale than the exponential nonlinear regression model I computed above. R provides such information-based criteria measures such as AIC() and an alternative called BIC(). Here I give AIC and BIC for the logistic growth model. When comparing models on comparable dependent variables the model with the lower AIC (or BIC) value is selected. AIC(out.logistic) ## [1] 206567 BIC(out.logistic) ## [1] 206646 The AIC is a function of the likelihood function and the number of parameters; BIC is a function of those two values as well as the number of data points. The two indices are functionally related by BIC = AIC + k*(log(n)-2). The logistic growth model I estimated has 7 parameters (three fixed effects, two random effects variances, one correlation between the two random effects and the residual variance) with n=20648. Exercise 5.3 Verify that you can compute BIC from the AIC, number of parameters and the number of data points. Exercise 5.4 Rerun the logistic growth model using the weight parameter to allow each state to have a different residual variance. Exercise 5.5 Use the anova command to compare the fits of the regressions with equal and unequal state error variances. This is is an increment in fit test comparing the reduced model (e.g., one residual variance for all states) to a full model (e.g., each state has its own residual variance). The command is anova(reduced, full) where you replace reduced and full with the name of the object you saved the nlme (e.g., reduced = out.logistic in the example below). Another consideration is that the residuals may not be independent given that these are time series data. Yesterday’s residual may be correlated with today’s residual. Here I add a correlation structure to the nlme command using an autocorrelation lag of 1 (AR p =1) and a moving average lag of 1 (MA q = 1). The ARMA implementation in nlme, though, operates completely on the residuals. So, the AR refers to the previous residual correlation and the MA refers to a set of independent and identically distributed noise terms apart from the residuals (see Pinhero & Bates, 2000, as this is not standard). You can think of the MA noise parameters are “factors” that are added to produce an additional source of correlation across residuals, much like a common factor would in a factor analysis model, but separate from the serial correlation of residuals. One could also try a heterogeneous residual variance model (the weight argument described above) along with the correlated error. More diagnostics would be needed to settle on the appropriate correlational structure for the error term plus the usual checks on the residuals to ensure we are properly meeting the statistical assumptions. NEED TO FIX out <- update(out.logistic, corr = corARMA(form = ~day.numeric | state.abb, p = 0, q = 1)) Here is a test of the two logistic growth models (with and without correlated error structure). The model with correlated error structure fits significantly better with only two additional parameters: one for the autogressive portion and one for the moving average portion. # test the reduced (no correlation) to the full (with # ARMA(0,1) correlational structure) anova(out.logistic, out.logistic2) For illustration purposes, here is another way of fitting this model using a different package lme4. While lme4 is in some ways easier to use, it is not as flexible as nlme, which allows, for example, correlated error structure on the residuals and the ability to impose structure on the covariance matrix between random effects. This model includes random effects from the asym and xmid parameters but there are convergence issues when all three parameters are allowed to be random effects. [now got it working with 3 RE parameters] # starts <- getInitial(percap10000 ~ # SSlogis(day.numeric, asym,xmid,scal), # data=na.omit(allstates.long.new)) starts # second way to get starting values, just run a regular # nonlinear regression without random effects to plug # into nlmer or nlme temp <- nls(percap10000 ~ # SSlogis(day.numeric, asym, xmid,scal), # data=na.omit(allstates.long.new)) coef(temp) out.logistic.nlmer <- nlmer(percap10000 ~ SSlogis(day.numeric, asym, xmid, scal) ~ asym + xmid + scal | state.abb, data = na.omit(allstates.long.new), start = c(asym = 600, xmid = 125, scal = 25), control = nlmerControl(tolPwrss = 0.1)) if (is_html_output()) pander(summary(out.logistic.nlmer)[[10]]) Estimate Std. Error t value asym 1140 1.369 833.2 xmid 298.4 0.2039 1463 scal 57.5 0.1404 409.6 if (is_latex_output()) stargazer(out.logistic.nlmer, header = F, type = "latex", report = "vcst", single.row = T, omit.stat = "n", omit.table.layout = "n", digits = 2, digits.extra = 2, align = T) 5.3.1 Additional Predictors: Example with State-Level Poverty Rate Now that we have a reasonable model, let’s see how we can include predictors to the model. One way to include predictors is to examine if the parameters of the logistic growth regression are influenced by the predictor, e.g. do the state asymptote parameters vary according to the predictor. We can accomplish this by, say, having the asymp parameter for each state be a linear function of the predictor, as in asym = $$\beta_0$$ + $$\beta_2$$ predictor + $$\epsilon$$. Our current model is the special case where $$\beta_2$$ is set to zero, thus asym = $$\beta_0$$ + $$\epsilon$$ where the intercept is the fixed effect (same value for each state) and the “$$\epsilon$$” is the random effect (a term associated with each state). By expanding the model to include a predictor associated with the asymp parameter we can examine if the predictor is linearly related to the asymp parameter. For illustration, I examine whether the poverty rate in each state is associated with the “scal” random effect coefficient in the logistic growth model. par(pty = "s") plot(allstatespovertyperc, coef(out.logistic)[, 3], xlab = "State Poverty Percentage",
ylab = "random effect parameter scal in logistic growth")
abline(coef(lm(coef(out.logistic)[, 3] ~ allstatespovertyperc))) par(pty = "m") It looks like a slight positive association is present though the data are noisy. In principle, I could test this regression slope through a regular regression but it is better to do it in the context of the nonlinear logistic growth regression model directly. The reason is that a linear regression using the coefficients from the logistic growth as though they were an observed variable doesn’t take into account the error structure of these parameters—the parameters aren’t observed data but they are estimates from another model. By including the poverty predictor simultaneously in the logistic growth model, we can model the uncertainty much better. It would be even better to run this as a Bayesian model, which I’ll add in a later version using the brm package. # use earlier estimates as starting values start.vals <- fixef(out.logistic) # drop random 7/6/20 xmid out <- nlme(percap10000 ~ SSlogis(day.numeric, Asym, xmid, scal + b2 * povertyperc), fixed = Asym + xmid + scal + b2 ~ 1, random = Asym + scal ~ 1, data = na.omit(allstates.long.new), start = c(Asym = start.vals[1], xmid = start.vals[2], scal = start.vals[3], b2 = -0.3), verbose = T, control = nlmeControl(msMaxIter = 1000, opt = "nlminb", tolerance = 0.1, pnlsTol = 0.01, msVerbose = T, minScale = 0.01, maxIter = 1000, eval.max = 1000)) # results=ifelse(longform, 'markup', 'hide') out.logistic.poverty <- out if (is_html_output()) pander(summary(out.logistic.poverty)) Fixed effects: Asym + xmid + scal + b2 ~ 1 Value Std.Error DF t-value p-value Asym 1034 42.45 20595 24.36 3.323e-129 xmid 274 0.2494 20595 1099 0 scal 47.82 7.383 20595 6.477 9.587e-11 b2 0.2385 0.5625 20595 0.424 0.6716 Standardized Within-Group Residuals Min Q1 Med Q3 Max -3.489 -0.3714 0.08368 0.6568 5.946 Linear mixed-effects model fit by maximum likelihood : Asym + xmid + scal + b2 ~ 1 Observations Groups Log-restricted-likelihood state.abb 20648 50 -109411 if (is_latex_output()) stargazer(out.logistic.poverty, header = F, type = "latex", report = "vcst", single.row = T, omit.stat = "n", omit.table.layout = "n", digits = 2, digits.extra = 2, align = T) knitr::kable(intervals(out.logistic.poverty, which = "fixed")[[1]]) lower est. upper Asym 950.740 1033.937 1117.13 xmid 273.544 274.033 274.52 scal 33.346 47.815 62.28 b2 -0.864 0.238 1.34 DOUBLECHECK: significance status has changed. The $$\beta$$ associated with poverty is statistically significant in this example so there is evidence that the state variability in the scal parameter of this logistic growth model is associated with the state’s poverty level. In other words, state-level poverty is associated with how quickly a state increases in positive tests (the scal parameter). We need to be careful in interpreting this association given that the dependent variable we are using could reflect covid incidence but could also reflect the number of tests a state has conducted. Exercise 5.6 Check whether poverty is related to the two other parameters of the logistic growth model: asymp and xmid. Interpret whether the findings suggest that poverty is related to how high the positive test cases reach or the timing of positive cases, respectively. This example illustrates how one goes about adding predictors to such a nonlinear model. One could use more complicated predictors, even test differences across groups on parameter values (e.g., the question do states with democratic governors have different parameter values than states with republican governors, can be tested by creating an appropriate grouping code for the political party of the governor and including that code as a predictor in the same way I included the predictor poverty). While the presentation has focused on associations, the overall framework can be extended to strengthen inference using various methods including instrumental variables (e.g., Carrol et al, 2004) and newer approaches (e.g., Runge et al, 2019), though much theoretical work remains to be done to move these nonlinear mixed models from mere description to stronger inferences by allowing more direct tests of possible alternative explanations. For an introduction to some basic modern methods in causal inference see Bauer and Cohen. 5.3.2 Additional assumption checking: random effect parameters I’ll revert back to the simple logistic growth model at the start of this subsection. The residual plots already raised concern about the model fit. We can also examine additional assumptions of the model such as the distribution of the random effects. The random effect parameters in the nonlinear mixed logistic growth model are assumed to follow a multivariate normal distribution, which seems to be reasonable so many days into the pandemic. Here is a 2D density plot (I know, a lot to ask for when there are relatively few pairs of points, one for each state), separate qqplots and boxplots for the asymptote and scale parameters. The qqplot for the asymptote parameter has a long tail, mostly due to New York and New Jersey; the ones for the xmid and scal parameters look ok. # edit this function to be pairs of scatterplots for 3 # random effects ggplot(coef(out.logistic), aes(x=l, # y=b1)) + stat_density_2d(aes(fill = ..level..), geom = # 'polygon', colour='white') + xlab('l = asymptote') par(mfcol = c(1, 3)) qqnorm(coef(out.logistic)[, 1], main = "qqplot for asymptote parameter") qqline(coef(out.logistic)[, 1]) qqnorm(coef(out.logistic)[, 2], main = "qqplot for xmid") qqline(coef(out.logistic)[, 2]) qqnorm(coef(out.logistic)[, 3], main = "qqplot for scal") qqline(coef(out.logistic)[, 3]) boxplot(coef(out.logistic)[, 1], ylab = "asymptote parameter") boxplot(coef(out.logistic)[, 2], ylab = "xmid parameter") boxplot(coef(out.logistic)[, 3], ylab = "scal parameter") par(mfcol = c(1, 1)) 5.4 Evaluating Exponential and Logistic Growth Models The overall point is that testing whether the data fit an exponential growth model, a logistic growth model, or various other plausible growth trajectories is not merely a modeling fitting exercise. There is much diagnostic testing one needs to do in order to figure out in what ways is a model working well and where it is messing up. The particular way a model is going awry may suggest directions for improvement and may offer clues about the underlying processes that generated the data, which is usually the important part of such an exercise. The model selection game is more complex than merely selecting the model that has the best fit value on a standard metric such as AIC or BIC or predictive accuracy. To me, there is deeper insight that emerges from understanding the type of growth that is underlying the processes under investigation. We are in a better position to understand how to intervene if we understand the growth process that underlies a data set. Just because a model achieves relatively good performance on measures of fit does not imply that the model is following the same mechanisms of the processes that generated the actual data. This is where the diagnostic tests come in to help us diagnose such situations. Let me illustrate what I mean by deeper implications. We know that if a growth pattern follows an exponential function, then the plot should be linear in the log scale. Think, the logical implication: “if p, then q”. Now, our data suggest that in the log scale we do not see linearity. What can we conclude? Well, through the logical contrapositive, the observation of not q, implies not p. In other words, the data rule out the simple exponential. Concluding that we can rule out an entire functional form from consideration is a more powerful result than merely claiming one model fits better than another. The Center for Quality, Safety & Value Analytics at Rush University released a calculator that lets the user interactively change the model on cumulative cases. For example, one can click on exponential, or logistic, or (gasp) quadratic and run a model fit along with a forecast. The user can also select the state. To get deeper in such process models of growth, we need to develop some additional machinery including differential equations, which are equations that govern changes over time. I’ll briefly touch on these models in the next chapter. check on glmer and gamma out.gamma <- glmer(percap100 ~ day.numeric + (day.numeric|state.abb), data=na.omit(allstates.long.new), family=quasi(link="identity")) 5.5 Splines, Smoothers and Time Series I’ll briefly give an example of splines because it will be helpful in subsequent sections. I don’t recommend going the route of splines and smoothers in this context. While splines can provide great fit to the data and can be used for interpolation (e.g., if we missed a day of data collection, we could interpolate), splines and smoothers are not helpful at informing one of the underlying process nor in forecasting (extrapolating) from one’s data, which is what we want to be able to do in this case. For illustration, and to show that I too can get curves that approximate the data points closely, I ran a mixed model spline regression where the knots are treated as a random effect by state and show the plots just for New York and Michigan for brevity. One would need to do all the usual diagnostics for this mixed model as well. # linear mixed model using splines on the count data out.sp <- lmer(count ~ (bs(day.numeric, intercept = T, degree = 5) | state.abb), data = na.omit(allstates.long)) # diagnostics omitted for brevity plot(out.sp) # qqnorm(resid(out.sp)) qqline(resid(out.sp)) out.temp <- na.omit(allstates.long) out.tempfit <- fitted(out.sp)
subset(out.temp, state.abb %in% c("New York", "Michigan")) %>%
ggplot(aes(x = day.numeric, y = fit, group = state.abb,
color = state.abb)) + geom_line() + geom_point(aes(x = day.numeric,
y = count), color = "black") + ggtitle("Black points are actual data; curves are spline regression fits")

The splines approach can be extended in the context of time series to permit forecasts. Here I’ll show the use of the forecast package and the excellent online book by Hyndman and Athanasopoulos. I’ll forecast 10 days out from the current date using New York data and compute an error band, show how to check residuals with the checkresiduals() command, and how to use hierarchical time series to model state trajectories within the US.

# better to redo plots with dates like 4/1/20 and weekly
# markers

nydays <- allstates.long$day[allstates.long$state.abb ==
"New York"]
# check
nycount.ts <- ts(nydf$percap10000, start = 1, end = max(nydf$day.numeric) +
1, frequency = 1)
autoplot(nycount.ts)

plot(forecast::forecast(nycount.ts), xlab = "Day", ylab = "count")

# splines, h= number days predicted
h <- 10
splne.forecast.ny <- splinef(nycount.ts, h = h)
autoplot(splne.forecast.ny)

checkresiduals(splne.forecast.ny)

# comparison (may drop) splne.forecast.ny <-
# splinef(nycount.ts,h=h, lambda='auto')
# autoplot(splne.forecast.ny)
# checkresiduals(splne.forecast.ny)

I tend to lump a different class of models into the bucket of models considered in this subsection. There are models that directly model the coefficients as varying by time. Polynomials can be thought of as having time varying coefficients. For example, the quadratic $$Y = \beta_0 + \beta_1 t + \beta_2 t^2$$ can be rewritten as a linear regression with one predictor $$t$$ that has intercept $$\beta_0$$ and a time-varying slope term ($$\beta_1 + \beta_2 t)$$. Splines extend this logic by allowing the slope at a given $$t$$ to vary continuously. There are models that take this idea and run with it. One such approach is a nonhomogenous Poisson process, which uses the Poisson distribution to model counts and allows the rate parameter to vary by time (hence the term nonhomogenous). There are several R packages that implement such models, including NHPoisson and some extensions to spatial cases such as spatstat. .In the modeling presented in this book, we mostly work with counts so a Poisson would be a natural standard distribution to consider rather than blindly relying on normal distributions assumptions as I have done in several sections. I have highlighted how the normal distribution assumptions have been suspect. A different approach for implementing time-varying coefficients is in the R package tvReg.

5.6 Bayesian Estimation of Nonlinear Regression: Adding value to prediction and uncertainty

I will illustrate the use of Bayesian nonlinear regression modeling to estimate prediction uncertainty. The Bayesian approach also allows for more general models than the standard regression approaches described earlier in this chapter, these include expanding the assumptions limitations of the classic approach. For example, within a Bayesian approach one can include predictors of the variance terms, which is not easy to do in the classic setting. This could be useful, for example, if a predictor influences not only the value of a random effect parameter but also its variance. Or, we may be able to do a better job of developing models that are more appropriate for the count data we are using throughout these notes. Another example of something that is difficult to do in the classical framework is dealing with the distribution of random effects that do not follow a normal distribution. Earlier we saw that there may be two clusters of states and that the joint distribution of random effect parameters appears to be skewed. This makes suspect inferences based on random effects that assume a joint normal distribution. These nonstandard cases can be easily handled in the Bayesian framework.

I co-wrote an introductory chapter on Bayesian statistics with Fred Feinberg that highlights some of the advantages and disadvantages of the Bayesian framework. For a complete book on Bayesian statistics I very much like the comprehensive book by Gelman et al. Bayesian Data Analysis. A more readable textbook is Kruschke’s Doing Bayesian Data Analysis.

My Bayesian example here will take a problem that was extremely difficult in what we did earlier but is relatively easy to do in the Bayesian context: put intervals (or, as a Bayesian would say, credible intervals because the interpretation is slightly different than a standard confidence interval) around predictions form a nonlinear random effects regression model. This is an important detail that the classical approach has a difficult time addressing, as I mentioned in earlier subsections of this chapter when I wasn’t able to establish the uncertainty around the predictions. While uncertainty intervals around the parameter estimates are possible in the classical approach, computing error bars around the prediction interval in a random effects nonlinear model is not straightforward.

I’ll follow the same modeling implementation as used earlier. The brm() function in the brms package provides a convenient wrapper to the Stan Bayesian modeling using typical R commands. I’ll follow a similar set of assumptions as I did before (e.g., normally distributed random effects using relatively “flat” priors). The computation takes over four hours on my desktop computer so for now I’ll set the iteration to 2000 with 4 chains, which are the default settings. When these notes are complete and I’m no longer running the code each evening, I will increase the number of iterations, say, to 10000 and vary some of the other computational details to produce more stable estimates. Experts in Bayesian analysis will see some issues with the diagnostic plots I show but I’m sure many will agree with my decision to keep things relatively quick for now in this draft form.

NEED TO FIX BAYESIAN MODEL TO USE percap10000 dv

# once working correctly move these library calls to
# prelim file already there but leaving here for now
# during debugging
library(brms)
library(bayesplot)
library(tidybayes)
# continue to use save trick (maybe cache) to minimize
# computation time
fit_loss <- brm(bf(count ~ asymp/(1 + exp(b1 * (liftoff -
day.numeric))), asymp ~ 1 + (1 | state.abb), b1 ~ 1,
liftoff ~ 1 + (1 | state.abb), nl = TRUE), data = allstates.long,
family = gaussian(), prior = c(prior(normal(29000, 40000),
nlpar = "asymp"), prior(normal(70, 40), nlpar = "liftoff"),
prior(normal(0.2, 10), nlpar = "b1", lb = 0)), iter = 2000,
control = list(adapt_delta = 0.95))
saveRDS(fit_loss, file = "fitloss.RDS")

Here are the signature Bayesian plots of each parameter (density and trace plots) as well as the intercorrelations across the parameters.

# in interest of time just read in the saved object
# currently using data through 4/22/20 in what I saved

summary(fit_loss)

plot(fit_loss, N = 3, ask = FALSE)
pairs(fit_loss)

These are the prediction bands around the “fixed effect” part of the model and a plot for the state of Michigan.

plot(conditional_effects(fit_loss))
int <- 1:100
out.predict <- predict(fit_loss, newdata = data.frame(state.abb = rep("Michigan",
max(int)), day.numeric = int), re_formula = NULL)
temp <- data.frame(day.numeric = int, out.predict)
ggplot(temp) + geom_line(aes(day.numeric, Estimate)) + geom_ribbon(aes(day.numeric,
ymin = Q2.5, ymax = Q97.5), alpha = 0.2) + geom_point(data = midf,
aes(x = day.numeric, y = count)) + ggtitle("Michigan")

5.6.1 Bayesian extensions

Pending: show a Bayesian approach to the generalized nonlinear mixed model

5.7 State-space modeling

The simple idea of a time series introduced earlier in Section 5.5 can be extend to a context involving a latent variable, or as it is called in this context, a state. The basic idea is that there are two equations governing different levels of analysis. One equation governs our observations and another equation govern a latent state.

One form of this approach has these equations. For the state-space portion, $S(t+1) = S(t) + \epsilon_s$ where S(t) is the state at time and the $$\epsilon_s$$ is the noise in the state model. This equation governs the transition from one state to another at each time step possibly with error. The error is assumed to be normally distributed with mean = 0 and some variance. If the variance is constrained to be 0, then there is no noise in process modeling the transition across states. This model can be extend to include parameters, transition probabilities, covariates, structure on the error terms and other bells-and-whistles usually associated with regression models.

The second equation governs what we observe $Y(t) = S(t) + \epsilon_m$ where Y(t) is our observation at time t, S(t) is the state at time t and the $$\epsilon_m$$ is the noise associated with our observation. Typically, the two $$\epsilon$$s in these two equations are assumed to be independent. If they are both normally distributed, then this is typically known as a dynamic linear model.

The way to think about these two equations is that there is a process governing the way way the system moves from time step to time step and there is an additional process that can lead to noisy observations of those states. There are analogous models in other areas of statistics (e.g., factor analysis can be represented in a similar way of a measurement model separate from a structural model).

I adapted some of the code from this webpage.

R example pending.

I will return to state-space models in Chapter 6 when discussing process models.

5.8 Basic Machine Learning Examples

In my opinion, while impressive progress has been made in text analysis and image analysis, the machine learning literature has lagged behind where it needs to be on development of methods that address multiple time series (e.g., times series by state, country or county). Here I review one approach, an implementation of a recurrent neural network (RNN) that I used for the data from New York state. We need to use an RNN, a specialized version of a neural network, because we have time series data. Here is one description, another description and a third description by UM’s Ivo Dinov of RNNs. The R package I will illustrate is rnn and it is a basic implementation of a recurrent neural network. There are much better implementations, such as the keras package and I’ll explore that later once I make more progress on other parts of these notes. For now, we’ll use the very simplified rnn package in R.

The underlying common element of a neural network is that they take inputs, weight them, sum and add a “bias” term. This is similar to regression in weighting predictors by betas, sum the products and add and intercept. The weighted sum is then converted into a number between 0 and 1, using the logistic form similar to logistic regression that converts the model on the real number line into the scale of the dependent variable that is a probability between 0 and 1. This can be done repeatedly to create multiple units, or different sets of weighted sums, in a manner similar to principal components analysis (PCA) where the observed variables are weighted and summed to produce components. It is possible to create additional layers with new outputs based on these weighted sums (so, weighted sums of nonlinear transformations of weighted sums). The the final layer of units is then mapped onto the outcome variable.

Here is a graphical representation of what I mean. This example has four inputs (i.e., four variables or features), one hidden layer with 8 units, and one output layer that is a single number prediction (like our counts of positive tests). The hidden layer is a set of 8 units. The two B terms are the “bias” terms that feed into the weighted sums. The thick black lines refer to higher positive weights and the thicker gray refer to smaller negative. Example adapted from the tutorial in the NeuralNetTools R package.

data(neuraldat)

neuraldat$X4 <- neuraldat$X1 * neuraldat$X2 mod <- nnet(Y1 ~ X1 + X2 + X3 + X4, data = neuraldat, size = 8) # plot par(mar = numeric(4)) plotnet(mod) The RNN approach we will use is a simplified version. More appropriate approaches would be long short-term memory models. It is common to model time series using the counts for each day (such as with the incidence plot we saw in the Section 4.4) rather than the cumulative counts we have been using throughout most of these notes. I’ll follow that convention here. The incident plot for New York State. data.temp <- allstates.long %>% filter(state.abb == "New York") data.temp$daily <- c(data.temp$count[1], diff(data.temp$count))

ggplot(data.temp, aes(x = day, y = daily)) + geom_bar(stat = "identity") +
scale_x_date(date_breaks = "1 week", date_labels = "%d %b") +
labs(y = "Daily incremental incidence ", title = "Positive US Covid-19 Cases New York") +
theme(legend.position = "none", strip.text.y = element_text(size = 11))

5.8.1 Simple RNN example

The following involves a relatively large chunk of code. Other implementations of deep learning networks require even more code. The basic idea is that we put all our data into one data frame, then partition it into testing data and training data. We want to be sure we partition early so as not to contaminate the testing data with any information that is in the testing data. For example, if we compute a mean we may inadvertently compute a mean for all the data (testing and training), and then if we use that mean in training we have contaminated the training data with information from the testing data.

The only information I have to work with so far is the number days and the previously observed daily counts; that is, as of this point in these notes we don’t have access to other possible predictors of a state’s daily count. Here, “previously observed” means if I’m at time t, then any data prior to t is “previous” data that I have but any data after t has not occurred. In the case of data I’m downloading, t may be, say, 4/10/20 so from that perspective any data after 4/10/20 cannot be used because as of that time point t I don’t know of data after t. Part of the machine learning algorithm works by varying t to different days, repeatedly feeding those sequences of data to the model and trying to learn the patterns by figuring out what it is getting wrong, making adjustments in the model’s parameters (the weights and bias terms) and making new predictions, and iterating that process.

The rest of the code is creating daily counts and several features to use as predictors. We don’t have a rich set of predictors set up yet so I created some from the day variable. I will use day, day$$^2$$ and day$$^3$$ (of course, mean centering day before computing the higher order terms). I created a predictor of how much daily change there was (e.g., the count at t minus the count at t - 1). This serves as a slope estimate of change, which is a first order difference because I’m working in discrete time (days); technically, it is a secant rather than the slope (which would correspond to the tangent). I also created a “difference of difference” (or second order difference analogous to acceleration, or how the change is changing) to measure how much the slope is changing from day to day. In other words how different is the change today from the change we saw yesterday. This process of creating new variables from existing variables is called “feature engineering.”

This model has 7 inputs: day, day$$^2$$, day$$^3$$, count, count difference, count second order difference and running mean of the counts from the previous 3 days. I currently have just have a few units in a single hidden layer, and one unit in the output layer. Of course, to make this more interesting I could add many more predictors including state-level and daily information (e.g., measures of compliance to social distancing measures; weather, which could affect people’s behavior going outside; changes in a state’s social distancing policy).

So, overall the long section of code is because we have to do some work to put the data in the right format for machine learning to do its magic.

In the figure below the red points are the actual daily cases in New York. The green curve is what the machine learning model learned. I intentionally over trained the model. The last day shows the prediction by the machine learning model in solid blue and the actual in solid red.

FIX: computer memory problem staring 8/27 with R upgrade

# set.seed(03302020) for rollify for running mean
y = allstates.long.new[allstates.long.new$state.abb == "New York", "count"] actualny <- data.frame(y = y, x = 0:(length(y) - 1)) data <- data.frame(days = actualny$x[-1], counts = actualny$y[-1]) # partition train and test data; good to do that first # so that test data doesn't creep into training this # could happen accidentally say when computing a mean # for this classroom example the test set is the last # day, so I'm testing by a prediction one day out data_train <- data[-nrow(data), ] data_test <- data[nrow(data), ] # diff computes first order differences so need to add # day 1 at the begining of the series time.series.day.counts <- ts(c(173, diff(data_train$counts)))
time.series.day.counts
## Time Series:
## Start = 1
## End = 412
## Frequency = 1
##   [1]   173    64    80    83   178    76   328   446
##   [9]  1100  1714  3439  2750  4781  5693  4797  5160
##  [17]  6556  7304  7699  7248  7015  9190 14524 10403
##  [25] 10675 10250 10995  9861  7413 10211 12071 11473
##  [33]  8533  8373  7853  7471  6658  7889  6986  6877
##  [41]  6644  6386  4633  4259  5497  6668  9442  5677
##  [49]  4313  2908  3136  4377  4490  4375  3511  3143
##  [57]  2106  2123  3211  2899  2635  2400  1824  1330
##  [65]  1518  2339  2248  2518  2023  1498  1386  1166
##  [73]  1866  1900  2071  1808  1691   961   933  1156
##  [81]  1663  1372  1209  1136  1069   789  1212  1123
##  [89]  1055   857   736   598   461   810   816   760
##  [97]   693   667   509   411   699   824   695   666
## [105]   642   501   473   814   829   734   611   534
## [113]   503   547   907   914   777   722   522   448
## [121]   568   765   920   878   747   635   873   702
## [129]   890   827   829   644   674   700   556   853
## [137]   750   760   628   635   505   572   797   690
## [145]   755   553   590   633   484   742   578   698
## [153]   574   521   582   464   642   650   692   610
## [161]   473   538   517   683   737   683   542   473
## [169]   555   482   798   669   694   713   685   603
## [177]   648   917   870   820   736   567   502   592
## [185]   662  1031   926   834   662   689   684  1025
## [193]   831   902   852   635   588   569  1101  1000
## [201]  1173  1032   969   979   984  1324  1759  1687
## [209]  1386  1147  1149  1218  2078  1609  1503  1274
## [217]  1254  1204  1098  1538  1747  1695  1512  1209
## [225]  1238  1716  1892  1834  2124  1920  1545  1781
## [233]  1711  2351  2328  2393  2122  1870  2069  2068
## [241]  3217  3211  3900  3579  3446  3957  4253  5225
## [249]  5666  5577  4192  4210  4612  4876  5824  5733
## [257]  6312  5680  6437  4882  6033  7312  8278  6861
## [265]  5478  7772  7087  8135 10148 11846 11054 10151
## [273]  8453  8435  9258 11124 11124 11000 10587 10138
## [281]  8964  9419 11723 12652 10067  9305  9805  9393
## [289] 10532 13024 13048 11129  8008  8268 10183 12917
## [297] 15393 17033 15849 12232 11242 13142 16016 17588
## [305] 19560 17839 16308 14179 14791 14704 14187 19469
## [313] 15945 13694 13410 12721 11119 13908 15684 13909
## [321] 12383 11972 11040 10365 13783 12951 12892 11209
## [329]  8932  7949  7154  6970  7897 11492  9923  8972
## [337]  8279  6584 10011  8715  8755  8330  6538  6709
## [345]  5925  9582  8573  8238  6237  6297  6923  6139
## [353]  8893  8074  8042  7827  6436  6381  6564  7468
## [361]  8151  8126  7658  5440  6833  6106  7169  4226
## [369] 10251  6687  7119  6711  6582  5515  5077  6667
## [377]  3735  2796  3309 27644  8551  8059  8694  8910
## [385]  8742  5875  7789  9170  8038  7755  7636  6659
## [393]  6041  6643  8727  8662  7495  6630  5631  5023
## [401]  6060  7029  6344  6590  5641  5026  4576  4292
## [409]  5238  4847  4138  4064
# first order difference
first.diff <- diff(time.series.day.counts)
first.diff
## Time Series:
## Start = 2
## End = 412
## Frequency = 1
##   [1]   -109     16      3     95   -102    252    118
##   [8]    654    614   1725   -689   2031    912   -896
##  [15]    363   1396    748    395   -451   -233   2175
##  [22]   5334  -4121    272   -425    745  -1134  -2448
##  [29]   2798   1860   -598  -2940   -160   -520   -382
##  [36]   -813   1231   -903   -109   -233   -258  -1753
##  [43]   -374   1238   1171   2774  -3765  -1364  -1405
##  [50]    228   1241    113   -115   -864   -368  -1037
##  [57]     17   1088   -312   -264   -235   -576   -494
##  [64]    188    821    -91    270   -495   -525   -112
##  [71]   -220    700     34    171   -263   -117   -730
##  [78]    -28    223    507   -291   -163    -73    -67
##  [85]   -280    423    -89    -68   -198   -121   -138
##  [92]   -137    349      6    -56    -67    -26   -158
##  [99]    -98    288    125   -129    -29    -24   -141
## [106]    -28    341     15    -95   -123    -77    -31
## [113]     44    360      7   -137    -55   -200    -74
## [120]    120    197    155    -42   -131   -112    238
## [127]   -171    188    -63      2   -185     30     26
## [134]   -144    297   -103     10   -132      7   -130
## [141]     67    225   -107     65   -202     37     43
## [148]   -149    258   -164    120   -124    -53     61
## [155]   -118    178      8     42    -82   -137     65
## [162]    -21    166     54    -54   -141    -69     82
## [169]    -73    316   -129     25     19    -28    -82
## [176]     45    269    -47    -50    -84   -169    -65
## [183]     90     70    369   -105    -92   -172     27
## [190]     -5    341   -194     71    -50   -217    -47
## [197]    -19    532   -101    173   -141    -63     10
## [204]      5    340    435    -72   -301   -239      2
## [211]     69    860   -469   -106   -229    -20    -50
## [218]   -106    440    209    -52   -183   -303     29
## [225]    478    176    -58    290   -204   -375    236
## [232]    -70    640    -23     65   -271   -252    199
## [239]     -1   1149     -6    689   -321   -133    511
## [246]    296    972    441    -89  -1385     18    402
## [253]    264    948    -91    579   -632    757  -1555
## [260]   1151   1279    966  -1417  -1383   2294   -685
## [267]   1048   2013   1698   -792   -903  -1698    -18
## [274]    823   1866      0   -124   -413   -449  -1174
## [281]    455   2304    929  -2585   -762    500   -412
## [288]   1139   2492     24  -1919  -3121    260   1915
## [295]   2734   2476   1640  -1184  -3617   -990   1900
## [302]   2874   1572   1972  -1721  -1531  -2129    612
## [309]    -87   -517   5282  -3524  -2251   -284   -689
## [316]  -1602   2789   1776  -1775  -1526   -411   -932
## [323]   -675   3418   -832    -59  -1683  -2277   -983
## [330]   -795   -184    927   3595  -1569   -951   -693
## [337]  -1695   3427  -1296     40   -425  -1792    171
## [344]   -784   3657  -1009   -335  -2001     60    626
## [351]   -784   2754   -819    -32   -215  -1391    -55
## [358]    183    904    683    -25   -468  -2218   1393
## [365]   -727   1063  -2943   6025  -3564    432   -408
## [372]   -129  -1067   -438   1590  -2932   -939    513
## [379]  24335 -19093   -492    635    216   -168  -2867
## [386]   1914   1381  -1132   -283   -119   -977   -618
## [393]    602   2084    -65  -1167   -865   -999   -608
## [400]   1037    969   -685    246   -949   -615   -450
## [407]   -284    946   -391   -709    -74
# second order difference
second.diff <- diff(first.diff)
second.diff
## Time Series:
## Start = 3
## End = 412
## Frequency = 1
##   [1]    125    -13     92   -197    354   -134    536
##   [8]    -40   1111  -2414   2720  -1119  -1808   1259
##  [15]   1033   -648   -353   -846    218   2408   3159
##  [22]  -9455   4393   -697   1170  -1879  -1314   5246
##  [29]   -938  -2458  -2342   2780   -360    138   -431
##  [36]   2044  -2134    794   -124    -25  -1495   1379
##  [43]   1612    -67   1603  -6539   2401    -41   1633
##  [50]   1013  -1128   -228   -749    496   -669   1054
##  [57]   1071  -1400     48     29   -341     82    682
##  [64]    633   -912    361   -765    -30    413   -108
##  [71]    920   -666    137   -434    146   -613    702
##  [78]    251    284   -798    128     90      6   -213
##  [85]    703   -512     21   -130     77    -17      1
##  [92]    486   -343    -62    -11     41   -132     60
##  [99]    386   -163   -254    100      5   -117    113
## [106]    369   -326   -110    -28     46     46     75
## [113]    316   -353   -144     82   -145    126    194
## [120]     77    -42   -197    -89     19    350   -409
## [127]    359   -251     65   -187    215     -4   -170
## [134]    441   -400    113   -142    139   -137    197
## [141]    158   -332    172   -267    239      6   -192
## [148]    407   -422    284   -244     71    114   -179
## [155]    296   -170     34   -124    -55    202    -86
## [162]    187   -112   -108    -87     72    151   -155
## [169]    389   -445    154     -6    -47    -54    127
## [176]    224   -316     -3    -34    -85    104    155
## [183]    -20    299   -474     13    -80    199    -32
## [190]    346   -535    265   -121   -167    170     28
## [197]    551   -633    274   -314     78     73     -5
## [204]    335     95   -507   -229     62    241     67
## [211]    791  -1329    363   -123    209    -30    -56
## [218]    546   -231   -261   -131   -120    332    449
## [225]   -302   -234    348   -494   -171    611   -306
## [232]    710   -663     88   -336     19    451   -200
## [239]   1150  -1155    695  -1010    188    644   -215
## [246]    676   -531   -530  -1296   1403    384   -138
## [253]    684  -1039    670  -1211   1389  -2312   2706
## [260]    128   -313  -2383     34   3677  -2979   1733
## [267]    965   -315  -2490   -111   -795   1680    841
## [274]   1043  -1866   -124   -289    -36   -725   1629
## [281]   1849  -1375  -3514   1823   1262   -912   1551
## [288]   1353  -2468  -1943  -1202   3381   1655    819
## [295]   -258   -836  -2824  -2433   2627   2890    974
## [302]  -1302    400  -3693    190   -598   2741   -699
## [309]   -430   5799  -8806   1273   1967   -405   -913
## [316]   4391  -1013  -3551    249   1115   -521    257
## [323]   4093  -4250    773  -1624   -594   1294    188
## [330]    611   1111   2668  -5164    618    258  -1002
## [337]   5122  -4723   1336   -465  -1367   1963   -955
## [344]   4441  -4666    674  -1666   2061    566  -1410
## [351]   3538  -3573    787   -183  -1176   1336    238
## [358]    721   -221   -708   -443  -1750   3611  -2120
## [365]   1790  -4006   8968  -9589   3996   -840    279
## [372]   -938    629   2028  -4522   1993   1452  23822
## [379] -43428  18601   1127   -419   -384  -2699   4781
## [386]   -533  -2513    849    164   -858    359   1220
## [393]   1482  -2149  -1102    302   -134    391   1645
## [400]    -68  -1654    931  -1195    334    165    166
## [407]   1230  -1337   -318    635
# put into data matrix and pad differences with NA
data.mat <- data.frame(days.numeric = data_train$days - mean(data_train$days), first = c(NA, first.diff), second = c(NA,
NA, second.diff))
# create day sq and day cube with centering
data.mat$daysq <- (data.mat$days.numeric - mean(data.mat$days.numeric))^2 data.mat$daycube <- (data.mat$days.numeric - mean(data.mat$days.numeric))^3
data.mat$previous.day.count <- c(NA, time.series.day.counts[-length(time.series.day.counts)]) # mean of last 3 days running_mean <- rollify(mean, window = 5) data.mat$runmean5 <- running_mean(data.mat$previous.day.count) # but to make future prediction is sensible I can't # compute first and second order difference knowing the # future count I'm predicting so move the 1st and 2nd # order diff down one row and pad with NAs; save the # last count I'm dropping here because I do know it when # making prediction to the training set # (previous.day.count is already moved up) save.last.row.data.mat <- data.mat[nrow(data.mat), ] data.mat[, 2:3] <- cbind(c(NA, data.mat[1:(nrow(data.mat) - 1), 2]), c(NA, data.mat[1:(nrow(data.mat) - 1), 3])) # drop first five rows to avoid missing data data.mat <- data.mat[-c(1:5), ] # define scaling function for scaling 0-1 sc <- function(x) { (x - min(x, na.rm = T))/(max(x, na.rm = T) - min(x, na.rm = T)) } # testing function sc as identity to check creation of X # and Y arrays sc <- function(x) x # save the original mean, sd, min and max of the # data_train features for later user cut first 5 points # of time.series.day.counts too tempmat <- cbind(data.mat$days.numeric, data.mat$daysq, data.mat$daycube, data.mat$first, data.mat$second, data.mat$previous.day.count, data.mat$runmean5, counts = time.series.day.counts[-c(1:5)])
minmat <- apply(tempmat, 2, min, na.rm = T)
maxmat <- apply(tempmat, 2, max, na.rm = T)
meanmat <- apply(tempmat, 2, mean, na.rm = T)
sdmat <- apply(tempmat, 2, function(i) sqrt(var(i, na.rm = T)))

# rnn programs usually require 3 dimensional arrays for
# X and Y
X <- array(c(sc(data.mat$days.numeric), sc(data.mat$daysq),
sc(data.mat$daycube), sc(data.mat$first), sc(data.mat$second), sc(data.mat$previous.day.count), sc(data.mat$runmean5)), dim = c(1, nrow(data.mat), 7)) Y <- array(sc(time.series.day.counts[-c(1:5)]), dim = c(1, nrow(data.mat), 1)) # I tinkered with different settings for learningrate, momentum #batchsize should not be greater than the size of the training set model <- trainr(Y = Y, X = X, learningrate = .2, momentum=.5, batchsize=5, hidden_dim = c(3), use_bias=T, numepochs = 1000, network_type="rnn") # Predicted values Yp <- predictr(model, X) # Plot predicted vs actual. Training set + testing set dv.Y <- (maxmat[7]-minmat[7])*Yp + minmat[7] dv.Y ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] ## [1,] 96.4 96.3 96.3 96.3 96.3 96.3 96.3 96.3 96.3 ## [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] ## [1,] 96.3 96.3 96.3 96.3 96.3 96.3 96.3 96.3 ## [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] ## [1,] 96.3 96.2 96.3 96.2 96.2 96.2 96.2 96.2 ## [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,50] [,51] [,52] [,53] [,54] [,55] [,56] [,57] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,66] [,67] [,68] [,69] [,70] [,71] [,72] [,73] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,74] [,75] [,76] [,77] [,78] [,79] [,80] [,81] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,82] [,83] [,84] [,85] [,86] [,87] [,88] [,89] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,90] [,91] [,92] [,93] [,94] [,95] [,96] [,97] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,98] [,99] [,100] [,101] [,102] [,103] [,104] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,105] [,106] [,107] [,108] [,109] [,110] [,111] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,112] [,113] [,114] [,115] [,116] [,117] [,118] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,119] [,120] [,121] [,122] [,123] [,124] [,125] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,126] [,127] [,128] [,129] [,130] [,131] [,132] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,133] [,134] [,135] [,136] [,137] [,138] [,139] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,140] [,141] [,142] [,143] [,144] [,145] [,146] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,147] [,148] [,149] [,150] [,151] [,152] [,153] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,154] [,155] [,156] [,157] [,158] [,159] [,160] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,161] [,162] [,163] [,164] [,165] [,166] [,167] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,168] [,169] [,170] [,171] [,172] [,173] [,174] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,175] [,176] [,177] [,178] [,179] [,180] [,181] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,182] [,183] [,184] [,185] [,186] [,187] [,188] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,189] [,190] [,191] [,192] [,193] [,194] [,195] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,196] [,197] [,198] [,199] [,200] [,201] [,202] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,203] [,204] [,205] [,206] [,207] [,208] [,209] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,210] [,211] [,212] [,213] [,214] [,215] [,216] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,217] [,218] [,219] [,220] [,221] [,222] [,223] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,224] [,225] [,226] [,227] [,228] [,229] [,230] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,231] [,232] [,233] [,234] [,235] [,236] [,237] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,238] [,239] [,240] [,241] [,242] [,243] [,244] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,245] [,246] [,247] [,248] [,249] [,250] [,251] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,252] [,253] [,254] [,255] [,256] [,257] [,258] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,259] [,260] [,261] [,262] [,263] [,264] [,265] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,266] [,267] [,268] [,269] [,270] [,271] [,272] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,273] [,274] [,275] [,276] [,277] [,278] [,279] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,280] [,281] [,282] [,283] [,284] [,285] [,286] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,287] [,288] [,289] [,290] [,291] [,292] [,293] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,294] [,295] [,296] [,297] [,298] [,299] [,300] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,301] [,302] [,303] [,304] [,305] [,306] [,307] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,308] [,309] [,310] [,311] [,312] [,313] [,314] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,315] [,316] [,317] [,318] [,319] [,320] [,321] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,322] [,323] [,324] [,325] [,326] [,327] [,328] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,329] [,330] [,331] [,332] [,333] [,334] [,335] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,336] [,337] [,338] [,339] [,340] [,341] [,342] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.2 96.2 ## [,343] [,344] [,345] [,346] [,347] [,348] [,349] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.3 96.2 ## [,350] [,351] [,352] [,353] [,354] [,355] [,356] ## [1,] 96.2 96.2 96.2 96.2 96.2 96.3 96.3 ## [,357] [,358] [,359] [,360] [,361] [,362] [,363] ## [1,] 96.3 96.3 96.3 96.3 96.3 96.3 96.3 ## [,364] [,365] [,366] [,367] [,368] [,369] [,370] ## [1,] 96.3 96.2 96.3 96.3 96.3 96.3 96.3 ## [,371] [,372] [,373] [,374] [,375] [,376] [,377] ## [1,] 96.3 96.3 96.3 96.3 96.3 96.2 96.3 ## [,378] [,379] [,380] [,381] [,382] [,383] [,384] ## [1,] 96.2 96.3 96.2 96.3 96.3 96.3 96.3 ## [,385] [,386] [,387] [,388] [,389] [,390] [,391] ## [1,] 96.3 96.3 96.3 96.3 96.3 96.3 96.3 ## [,392] [,393] [,394] [,395] [,396] [,397] [,398] ## [1,] 96.3 96.3 96.3 96.3 96.3 96.3 96.3 ## [,399] [,400] [,401] [,402] [,403] [,404] [,405] ## [1,] 96.3 96.3 96.3 96.3 96.3 96.3 96.3 ## [,406] [,407] ## [1,] 96.3 96.4 # when sq and cubing, center using day mean from training set [1], save.last.row.datamat use [2] or [3] for first and second,respectively Xnew <- array(c((data_train[nrow(data_train),1]-minmat[1])/(maxmat[1]-minmat[1]), (data_train[nrow(data_train),1]^2-minmat[2])/(maxmat[2]-minmat[2]), (data_train[nrow(data_train),1]^3-minmat[3])/(maxmat[3]-minmat[3]), unlist((save.last.row.data.mat[2]-minmat[4])/(maxmat[4]-minmat[4])), unlist((save.last.row.data.mat[3]-minmat[5])/(maxmat[5]-minmat[5])), #(time.series.day.counts[length(time.series.day.counts)]-minmat[6])/(maxmat[6]-minmat[6])), unlist((save.last.row.data.mat[6]-minmat[6])/(maxmat[6]-minmat[6])), unlist((save.last.row.data.mat[7]-minmat[7])/(maxmat[7]-minmat[7]))), dim=c(1,1,7)) #X <- array(c(sc(data.mat$days.numeric),sc(data.mat$daysq), sc(data.mat$daycube), sc(data.mat$first),sc(data.mat$second), sc(data.mat$previous.day.count)), dim=c(1,nrow(data.mat),6)) #Y <- array(sc(time.series.day.counts[-c(1:3)]), dim=c(1,nrow(data.mat),1)) Yp.new <- predictr(model, Xnew) Yp.new <- (maxmat[7]-minmat[7])*Yp.new + minmat[7] Yp.new ## [,1] ## [1,] 96.9 plot(data$days[-c(1:5, length(data$days))], time.series.day.counts[-c(1:5)] , col = 'red', type = 'p', main = "RNN for New York State Daily Confirmed Cases\n actual counts (red & solid blue), training (green); predicted (solid blue)", ylab = "Daily Counts", xlab="Day", xlim=c(0,length(data$days)), ylim=c(0,max(c(Yp.new, diff(data$counts))+10)), cex.main=.7) points(data$days[length(data$days)], Yp.new,col = 'blue' ,pch=19) points(data$days[length(data$days)], diff(data$counts)[nrow(data)-1],col = 'red' , pch=19)
lines(data$days[-c(1:5)], c(dv.Y,Yp.new),col="green") legend(c(60,95),c(max(diff(data$counts)),max(diff(data$counts))-1500) ,legend=paste("Last Day: \n difference between predicted (blue) and actual (red):",round(Yp.new-diff(data$counts)[nrow(data)-1],0)),cex=.5)
abline(v=nrow(data)-.5)
text(15,0,"training")
text(nrow(data)+.5,0,"test")

#plot commented out; shows error decrease over iterations
#plot(colMeans(model$error), type="l") I’ll be the first to acknowledge that this example is just for demonstration purposes. Our data set is way, way too small to have any chance of estimating meaningful machine learning models, I created a kludgy set of predictors (aka features) without much thought, I haven’t included more predictors to the model and I’ve only focused on number of positive tests (e.g., a better demonstration of machine learning methods would be to use positive testing rates as predictors of outcomes such as subsequent death rates). the particular forms of the predictors I’m using (polynomial, first and second order differences) are known not to behave well when extrapolating beyond one’s data, as we are doing here when predicting the subsequent day. I also have not optimized for the various tuning parameters including the learning rate, so there is much one can do to improve the fit of this particular example. There are more diagnostics to consider such as error in prediction and we can iterate over the tuning parameters to find optimal values that give good predictive validity without overfitting. Note that between day 25 and 40 the overall “trend” of the open red circles is relatively flat (as an average) but the daily variability is increasing. The current set of predictors I’m using in this RNN doesn’t have the ability to easily capture changes in the variability. This is one clue for additional predictors (features, see next subsection) that could be added to the model. This could be an interesting part of the story and something that can be addressed not only with machine learning models but also for the other models we considered in this chapter. Models that are not capturing this change in variance are possibly missing something important about the underlying process (which could just be something as simple as not all jurisdictions in New York state are reporting daily but nonetheless an important property to understand if we want to use these approaches to make forecasts). We could check whether this increasing variance occurs in other states and at different levels of analysis such as the county level. We could check whether this increase in variance is consistent with a multiplicative error model where error (the “plus $$\epsilon$$” term we discussed earlier) could provide a clue to the underlying process. The point is that there is a potential signal here that our current models are not capturing, and we need to figure out whether we need to move to a different class of models or whether this heterogeneity is due to a simple factor that can be easily addressed. The increasing variance was not easy to detect in the cumulative plots and models on cumulative data such as the log linear, exponential and logistic models we covered earlier in this chapter. If you go back and look at those plots, even the ones using the curve fitting spline methods, they show systematic “under and over the curve” errors consistent with what we are seeing in the incidence plot. The increasing variance is there but just difficult to spot in the way we examined these data previously. Sometimes converting data to different forms, such as cumulative and incidence forms, helps highlight different aspects of the same data. It is not easy to see in the barplot of the New York incidence presented at the beginning of the RNN section. It is often useful to cross-check one’s interpretation with different model forms and plots. There are many other approaches that fall under the large machine learning umbrella, these include extensions of RNNs such as long short-term memory deep learning models that can learn patterns further back in time than simple RNNs, general additive models with penalized smoothers, genetic algorithms, and symbolic regression as well as several packages in R to compute these methods. Overall, I was underwhelmed by the effort so far. Some of the R packages gave error messages even running the very examples supplied in the package or they made strange predictions. Part of the issue is that we don’t have enough data to allow these machine learning methods to shine. For a more detailed and systematic discussion machine learning approaches for time series data see Makridakis et al. 5.8.2 Gaussian Process Regression There are many approaches that fall under the Machine Learning umbrella, with neural networks presented in the previous subsection being just one type of approach. In this subsection I will consider a different approach that mingles Bayesian methods with machine learning approaches and standard statistical models. The approach is to treat the function as a draw from a parent distribution of functions (see Erickson for a short introduction and Rasmussen & Williams (2006) for a book length treatment). I demonstrate one implementation using the GauPro package in R. The default is a Gaussian kernel but other can be used as well. In this demonstration I stick with the default settings but these can be tuned as with most machine learning algorithms. FIX: temporarily commented out # warning: sometimes this plot can look strange need to # figure out how to get a more stable solution using # newest version on github first.order.train <- c(data_train[1, 2], diff(data_train[, 2])) first.order.test <- data_test[1, 2] - data_train[nrow(data_train), 2] gp <- GauPro(data_train[, 1], first.order.train, parallel = F) ## Error in deviance_fngr_joint(X, Z, K) : ## BLAS/LAPACK routine 'DLASCLMOIEDDLASRTDSP' gave error code -4 ## Error in deviance_fngr_joint(X, Z, K) : ## BLAS/LAPACK routine 'DLASCLMOIEDDLASRTDSP' gave error code -4 ## Error in deviance_fngr_joint(X, Z, K) : ## BLAS/LAPACK routine 'DLASCLMOIEDDLASRTDSP' gave error code -4 plot(gp, n2 = 500, nn = 1000) gp.predict <- predict(gp, c(nrow(data_train):(nrow(data_train) + 5)), se.fit = T) x <- data_train[, 1] plot(data_train[, 1], first.order.train, xlim = c(0, nrow(data_train) + 5)) curve(gp$predict(x), add = T, col = 2)
curve(gp$predict(x) + 2 * gp$predict(x, se = T)$se, add = T, col = 4) curve(gp$predict(x) - 2 * gp$predict(x, se = T)$se, add = T,
col = 4)
points((nrow(data_train) + 1):(nrow(data_train) + 5), gp.predict[-1,
1], col = "green")

There are related approaches in the area of functional data analysis that also treat the function as a sample but they use different estimation approaches than those in used in Gaussian process models.

5.8.3 Feature engineering

To implement machine learning methods for these data I need to spend more time thinking about the feature engineering side of things. In machine learning parlance feature engineering refers to how the input variables are processed. When one works with time, for instance, it is common to think about multiple ways that time could enter into the model and include all of them. Machine learning algorithms are designed to drop out the irrelevant formulations. There is a commonly used time series package used in machine learning (TSFRESH in python and an effort to port this to R tsfeaturex) that computes several hundred possible measures on each time series (mean, max, min, variance, entropy, auto-correlation, etc) and stores these computations into a large matrix so that they can be used down stream in subsequent analyses. I didn’t put in the effort to go through the sensible features to compute and include those as input into the models so I shouldn’t be surprised at the relatively low performance of these models given that I didn’t play by the rules of machine learning.

The concept of feature engineering is much like entering linear, quadratic, cubic terms of a predictor into a regression equation. The idea is that variants of the predictor in the form of a polynomial are treated as predictors. Take that idea, inject it with steroids, and you have feature engineering in machine learning.

As a simple illustration of feature extraction, here is the output of the daily count variable for each of the 50 states.

FIX: temporarily commented out

# since this is a package I'm just using here for
# illustration I'll read it from github and load it here
# rather than in preliminary.Rmd; if you want to
# download just uncomment the next line

# create just a submatrix of the allstates.long data
# frame
tsdat <- allstates.long[, c("day.numeric", "state.abb",
"count")]
out.featurex <- extract_features(tsdat, group_var = "state.abb",
value_var = "count", features = "all", verbose = F,
return_timing = F)

# list of features computed by tsfeaturex for each state
names(out.featurex)
# example for just the median
head(out.featurex$f.median) # double check we get the median if we had computed it # by hand # median(allstates.long$count[allstates.long\$state.abb
# == c('Alaska')], na.rm=T)

As you can see tsfeaturex computes 82 different measures on the count data for each state. Those 82 measures could then be included as state-level predictors in a machine learning algorithm using a type of regularizer that introduces sparseness (drops predictors that are not useful in the prediction).

A more appropriate machine learning strategy would be to take the time trajectory data for every country reported in the Johns Hopkins site (about 140 countries) and see if we can learn patterns about the nature of the trajectories, given various aspects of the country’s demographics, dates they instituted various policies (shelter at home), information about their health care system, etc. One could use the principles developed here about web scraping and gathering these other sources of data, merging them into a single file, and running appropriate machine learning algorithms to find patterns.

An interesting use of machine learning methods (specifically computer vision) is the development of a physical distancing index by a group at UM. Physical movement of humans and social distancing is detected in major cities. This website has some live cameras and charts to illustrate their index.

I’ll continue adding to this section and improving the quality of the machine learning example as time frees up from completing other sections.

5.9 Summary

We covered a lot of material in this chapter. We started with a simple linear mixed model on log transformed data that seemingly did a reasonable job of accounting for the patterns across states. The finding that the linear relation in the log scale holds fairly well suggests that there is exponential growth. The ability to make a statement about the underlying process, in this case exponential growth, is far more valuable than finding a good fit to a data set. But there was a suggestion that the exponential may not be the ideal fit as there was some systematic deviation from a growth curve. The rest of these notes covered a more direct way of fitting exponential growth models through nonlinear regression, a different type of growth model based on the logistic curve and a toy example using a standard machine learning example.

A key idea I followed in this chapter is not to have a single pattern that represents all states but to use models that simultaneously tell us something about the overall pattern across states (the fixed effect portion of the output) and the variability, or heterogeneity, across the states (the random effect portion of the model). We could get fancier with this idea where, for example, we could weight the results based on population size of each state, so counts from more populous states receive more weight. There are several generalizations of these notions from different perspectives such as statistics and machine learning that you can follow in papers (e.g., varying coefficient models such as one by Hastie and Tibshirani and extensions of tree-based methods like CART in the R package vcrpart) as well as books (Hastie et al; Ramsey & Silverman). Once we properly characterize the shape of the trajectories and the variability across units, we can then study the associations between predictors and the parameters of the model, such as I attempted to do with the state-level poverty data and a parameter from the logistic growth model.