wip: update tooling book for epiprocess 0.9.0 slide API

dshemetov · dshemetov · commit f891cec5512c · 2024-09-26T19:08:00.000-07:00
diff --git a/outliers.qmd b/outliers.qmd
@@ -12,12 +12,12 @@ source("_common.R")
 ```
 
 ```{r}
-x <- incidence_num_outlier_example
+incidence_num_outlier_example
 ```
 
 ```{r, warning=FALSE, message=FALSE}
 #| code-fold: true
-ggplot(x, aes(x = time_value, y = cases, color = geo_value)) +
+ggplot(incidence_num_outlier_example, aes(x = time_value, y = cases, color = geo_value)) +
   geom_line() +
   scale_color_manual(values = c(3, 6)) +
   geom_hline(yintercept = 0, linetype = 3) +
@@ -63,11 +63,7 @@ detection_methods = bind_rows(
          args = list(list(detect_negatives = TRUE,
                           detection_multiplier = 2.5,
                           seasonal_period = 7)),
-         abbr = "stl_seasonal"),
-  tibble(method = "stl",
-         args = list(list(detect_negatives = TRUE,
-                          detection_multiplier = 2.5)),
-         abbr = "stl_nonseasonal"))
+         abbr = "stl_seasonal"))
 
 detection_methods
 ```
@@ -78,14 +74,15 @@ Note that using this combined median threshold is equivalent to using a majority
 vote across the base methods to determine whether a value is an outlier.
 
 ```{r}
-x <- x %>%
+x <- incidence_num_outlier_example %>%
   group_by(geo_value) %>%
   mutate(
     outlier_info  = detect_outlr(
       x = time_value, y = cases,
       methods = detection_methods,
       combiner = "median")
   ) %>%
+  unpack(outlier_info) %>%
   ungroup()
 
 x
diff --git a/renv.lock b/renv.lock
@@ -775,14 +775,15 @@
     },
     "epipredict": {
       "Package": "epipredict",
-      "Version": "0.0.20",
+      "Version": "0.0.21",
       "Source": "GitHub",
       "RemoteType": "github",
-      "RemoteUsername": "cmu-delphi",
-      "RemoteRepo": "epipredict",
-      "RemoteRef": "dev",
-      "RemoteSha": "f76961cfb8ddf73f04412ef5432fd657933d0e49",
       "RemoteHost": "api.github.com",
+      "RemoteRepo": "epipredict",
+      "RemoteUsername": "cmu-delphi",
+      "RemotePkgRef": "cmu-delphi/epipredict@ds/epiprocess-0.9.0",
+      "RemoteRef": "ds/epiprocess-0.9.0",
+      "RemoteSha": "93f41405a631d4564c2bf4d1655363db90668112",
       "Requirements": [
         "R",
         "checkmate",
@@ -806,18 +807,19 @@
         "vctrs",
         "workflows"
       ],
-      "Hash": "5abb03e30a3e9a9337de4e75f64b6d34"
+      "Hash": "e94275cb50c856e89156fa7b600329e3"
     },
     "epiprocess": {
       "Package": "epiprocess",
       "Version": "0.9.0",
       "Source": "GitHub",
       "RemoteType": "github",
       "RemoteHost": "api.github.com",
-      "RemoteUsername": "cmu-delphi",
       "RemoteRepo": "epiprocess",
-      "RemoteRef": "lcb/slide-improvements-2024-06",
-      "RemoteSha": "a96297e6811ec1eba62e914c9428f6abbcd281b8",
+      "RemoteUsername": "cmu-delphi",
+      "RemotePkgRef": "cmu-delphi/epiprocess@187bd5d",
+      "RemoteRef": "187bd5d",
+      "RemoteSha": "187bd5d87eea3a4acc323a715d1ba32edb34cd49",
       "Requirements": [
         "R",
         "checkmate",
@@ -841,7 +843,7 @@
         "vctrs",
         "waldo"
       ],
-      "Hash": "a87577d084c2f364f3bf60283afec69e"
+      "Hash": "c5339d0a09b1deefb263d087ee834a9a"
     },
     "evaluate": {
       "Package": "evaluate",
diff --git a/slide.qmd b/slide.qmd
@@ -1,7 +1,7 @@
 # Sliding computations {#sec-sliding}
 
 A central tool in the `{epiprocess}` package is `epi_slide()`, which is based
-on the powerful functionality provided in the 
+on the powerful functionality provided in the
 [`slider`](https://cran.r-project.org/web/packages/slider) package. In
 `{epiprocess}`, to "slide" means to apply a computation---represented as a
 function or formula---over a sliding/rolling data window. Suitable
@@ -10,7 +10,7 @@ groupings can always be achieved by a preliminary call to `group_by()`.
 By default, the meaning of one time step is inferred from the `time_value`
 column of the `epi_df` object under consideration, based on the way this column
 understands addition and subtraction. For example, if the time values are coded
-as `Date` objects, then one time step is one day, since 
+as `Date` objects, then one time step is one day, since
 `as.Date("2022-01-01") + 1` equals `as.Date("2022-01-02")`. Alternatively, the time step can be specified
 manually in the call to `epi_slide()`; you can read the documentation for more
 details. Furthermore, the alignment of the running window used in `epi_slide()`
@@ -51,10 +51,10 @@ order to smooth the signal, by passing in a formula for the first argument of
 `epi_slide()`. To do this computation per state, we first call `group_by()`.
 
 ```{r}
-x %>% 
-  group_by(geo_value) %>% 
-  epi_slide(~ mean(.x$cases), before = 6) %>%
-  ungroup() 
+x %>%
+  group_by(geo_value) %>%
+  epi_slide(~ mean(.x$cases), .window_size = 7) %>%
+  ungroup()
 ```
 
 The formula specified has access to all non-grouping columns present in the
@@ -65,9 +65,9 @@ default. We can of course change this post hoc, or we can instead specify a new
 name up front using the `new_col_name` argument:
 
 ```{r}
-x %>% 
+x %>%
   group_by(geo_value) %>%
-  epi_slide(~ mean(.x$cases), before = 6, new_col_name = "cases_7dav") %>%
+  epi_slide(~ mean(.x$cases), .window_size = 7, .new_col_name = "cases_7dav") %>%
   ungroup()
 ```
 
@@ -81,7 +81,7 @@ Like in `group_modify()`, there are alternative names for these variables as
 well: `.` can be used instead of `.x`, `.y` instead of `.group_key`, and `.z`
 instead of `.ref_time_value`.
 
-## Slide with a function 
+## Slide with a function
 
 We can also pass a function for the first argument in `epi_slide()`. In this
 case, the passed function must accept the following arguments:
@@ -97,10 +97,10 @@ receives to `f`.
 Recreating the last example of a 7-day trailing average:
 
 ```{r}
-x %>% 
-  group_by(geo_value) %>% 
-  epi_slide(function(x, gk, rtv) mean(x$cases), 
-            before = 6, new_col_name = "cases_7dav") %>%
+x %>%
+  group_by(geo_value) %>%
+  epi_slide(function(x, gk, rtv) mean(x$cases),
+            .window_size = 7, .new_col_name = "cases_7dav") %>%
   ungroup()
 ```
 
@@ -113,9 +113,9 @@ to a computation in which we can access any columns of `x` by name, just as we
 would in a call to `dplyr::mutate()`, or any of the `dplyr` verbs. For example:
 
 ```{r}
-x <- x %>% 
-  group_by(geo_value) %>% 
-  epi_slide(cases_7dav = mean(cases), before = 6) %>%
+x <- x %>%
+  group_by(geo_value) %>%
+  epi_slide(cases_7dav = mean(cases), .window_size = 7) %>%
   ungroup()
 ```
 In addition to referring to individual columns by name, you can refer to the
@@ -128,7 +128,7 @@ top of the original counts.
 #| code-fold: true
 cols <- RColorBrewer::brewer.pal(7, "Set1")[-6]
 ggplot(x, aes(x = time_value)) +
-  geom_col(aes(y = cases, fill = geo_value), alpha = 0.5, 
+  geom_col(aes(y = cases, fill = geo_value), alpha = 0.5,
            show.legend = FALSE) +
   scale_y_continuous(expand = expansion(c(0, 0.05))) +
   geom_line(aes(y = cases_7dav, col = geo_value), show.legend = FALSE) +
@@ -139,14 +139,14 @@ ggplot(x, aes(x = time_value)) +
   labs(x = "Date", y = "Reported COVID-19 cases")
 ```
 
-As we can see from the center top panel, it looks like Florida moved to weekly 
+As we can see from the center top panel, it looks like Florida moved to weekly
 reporting of COVID-19 cases in summer of 2021, while California occasionally reported negative cases counts!
 
 ## Running a local forecaster {#sec-local-forecaster}
 
 As a more complex example, we preview some of the functionality of `{epipredict}` described in future chapters, and use a forecaster based on a
 local (in time)
-autoregression or "AR model". AR models can be fit in numerous ways 
+autoregression or "AR model". AR models can be fit in numerous ways
 (using base R
 functions and various packages), but here we the `arx_forecaster()`, implemented in `{epipredict}` both
 provides a more advanced example of sliding a function over an `epi_df` object,
@@ -165,46 +165,46 @@ considered in this vignette).
 
 ```{r eval=FALSE}
 arx_forecaster <- function(
-    epi_df, 
+    epi_df,
     outcome, # the outcome column name in `epi_df`
     predictors, # a character vector, containing 1 or more predictors in `epi_df`
-    trainer = quantile_reg(), 
+    trainer = quantile_reg(),
     args_list = arx_args_list(
-      lags = c(0, 7, 14), 
+      lags = c(0, 7, 14),
       ahead = 7,
       quantile_levels = c(0.05, 0.95)
     )
 )
 
 ```
 
-We go ahead and slide this AR forecaster over the working `epi_df` of COVID-19 
-cases. Note that we actually model the `cases_7dav` column, to operate on the 
+We go ahead and slide this AR forecaster over the working `epi_df` of COVID-19
+cases. Note that we actually model the `cases_7dav` column, to operate on the
 scale of smoothed COVID-19 cases. This is clearly equivalent, up to a constant,
 to modeling weekly sums of COVID-19 cases.
 
 ```{r, warning=FALSE}
 fc_time_values <- seq(
-  from = as.Date("2020-06-01"), 
-  to = as.Date("2021-12-01"), 
+  from = as.Date("2020-06-01"),
+  to = as.Date("2021-12-01"),
   by = "1 months")
 
 fcasts <- epi_slide(
-  x, 
-  ~ arx_forecaster(
-    epi_data = .x, 
-    outcome = "cases_7dav", 
-    predictors = "cases_7dav", 
-    trainer = quantile_reg(), 
-    args_list = arx_args_list(ahead = 7))$predictions, 
-  before = 119, 
-  ref_time_values = fc_time_values,
-  new_col_name = "fc")
+  x,
+  .f = ~ arx_forecaster(
+    epi_data = .x,
+    outcome = "cases_7dav",
+    predictors = "cases_7dav",
+    trainer = quantile_reg(),
+    args_list = arx_args_list(ahead = 7))$predictions,
+  .window_size = 120,
+  .ref_time_values = fc_time_values,
+  .new_col_name = "fc")
 
 # grab just the relevant columns, and make them easier to plot
 fcasts <- fcasts %>%
-  select(geo_value, time_value, cases_7dav,
-         contains("_distn"), fc_target_date) %>%
+  unpack(fc, names_sep = "_") %>%
+  select(geo_value, time_value, cases_7dav, starts_with("fc")) %>%
   pivot_quantiles_wider(contains("_distn"))
 fcasts
 ```
@@ -216,29 +216,29 @@ that correspond to the date the forecast is for (rather than the date it was mad
 95\% prediction band.[^1]
 
 [^1]: If instead we had set `as_list_col = TRUE`
-in the call to `epi_slide()`, then we would have gotten a list column `fc`, 
+in the call to `epi_slide()`, then we would have gotten a list column `fc`,
 where each element of `fc` contains these results.
 
 To finish off, we plot the forecasts at some times (spaced out by a few months)
-over the last year, at multiple horizons: 7, 14, 21, and 28 days ahead. To do 
-so, we encapsulate the process of generating forecasts into a simple function, 
+over the last year, at multiple horizons: 7, 14, 21, and 28 days ahead. To do
+so, we encapsulate the process of generating forecasts into a simple function,
 so that we can call it a few times.
 
 ```{r, message = FALSE, warning = FALSE}
 k_week_ahead <- function(ahead = 7) {
   epi_slide(
-    x, 
+    x,
     ~ arx_forecaster(
-      epi_data = .x, 
-      outcome = "cases_7dav", 
-      predictors = "cases_7dav", 
-      trainer = quantile_reg(), 
-      args_list = arx_args_list(ahead = ahead))$predictions, 
-    before = 119, 
-    ref_time_values = fc_time_values,
-    new_col_name = "fc") %>%
-    select(geo_value, time_value, cases_7dav, contains("_distn"), 
-           fc_target_date) %>%
+      epi_data = .x,
+      outcome = "cases_7dav",
+      predictors = "cases_7dav",
+      trainer = quantile_reg(),
+      args_list = arx_args_list(ahead = ahead))$predictions,
+    .window_size = 120,
+    .ref_time_values = fc_time_values,
+    .new_col_name = "fc") %>%
+    unpack(fc, names_sep = "_") %>%
+    select(geo_value, time_value, cases_7dav, starts_with("fc")) %>%
     pivot_quantiles_wider(contains("_distn"))
 }
 
@@ -247,15 +247,16 @@ z <- map(c(7, 14, 21, 28), k_week_ahead) %>% list_rbind()
 ```
 
 Then we can plot the on top of the observed data
+
 ```{r, fig.width=8, fig.height=9}
 #| code-fold: true
 ggplot(z) +
-  geom_line(data = x, aes(x = time_value, y = cases_7dav), color = "gray50") + 
+  geom_line(data = x, aes(x = time_value, y = cases_7dav), color = "gray50") +
   geom_ribbon(aes(x = fc_target_date, ymin = `0.05`, ymax = `0.95`,
-                  group = time_value, fill = geo_value), alpha = 0.4) + 
-  geom_line(aes(x = fc_target_date, y = `0.5`, group = time_value)) + 
-  geom_point(aes(x = fc_target_date, y = `0.5`, group = time_value), size = 0.5) + 
-  #geom_vline(data = tibble(x = fc_time_values), aes(xintercept = x), 
+                  group = time_value, fill = geo_value), alpha = 0.4) +
+  geom_line(aes(x = fc_target_date, y = `0.5`, group = time_value)) +
+  geom_point(aes(x = fc_target_date, y = `0.5`, group = time_value), size = 0.5) +
+  #geom_vline(data = tibble(x = fc_time_values), aes(xintercept = x),
   #           linetype = 2, alpha = 0.5) +
   facet_wrap(vars(geo_value), scales = "free_y", nrow = 3) +
   scale_y_continuous(expand = expansion(c(0, 0.05))) +
@@ -269,22 +270,22 @@ spotty. At various points in time, we can see that its forecasts are volatile
 (its point predictions are all over the place), or overconfident (its bands are
 too narrow), or both at the same time. This is only meant as a simple demo and
 not entirely unexpected given the way the AR model is set up. The
-[`epipredict`](https://cmu-delphi.github.io/epipredict) package, 
-offers a suite of predictive modeling tools 
-that improve on many of the shortcomings of the above simple AR model (simply 
+[`epipredict`](https://cmu-delphi.github.io/epipredict) package,
+offers a suite of predictive modeling tools
+that improve on many of the shortcomings of the above simple AR model (simply
 using all states for training rather than 6 is a huge improvement).
 
 Second, the AR forecaster here is using finalized data, meaning, it uses the
 latest versions of signal values (reported COVID-19 cases) available, for both
 training models and making predictions historically. However, this is not
 reflective of the provisional nature of the data that it must cope with in a
 true forecast task. Training and making predictions on finalized data can lead
-to an overly optimistic sense of accuracy; see, for example, 
+to an overly optimistic sense of accuracy; see, for example,
 [@McDonaldBien2021] and references
 therein. Fortunately, the `epiprocess` package provides a data structure called
 `epi_archive` that can be used to store all data revisions, and furthermore, an
 `epi_archive` object knows how to slide computations in the correct
 version-aware sense (for the computation at each reference time $t$, it uses
-only data that would have been available as of $t$). We will revisit this 
-example in the [archive 
+only data that would have been available as of $t$). We will revisit this
+example in the [archive
 vignette](https://cmu-delphi.github.io/epiprocess/articles/archive.html).
diff --git a/sliding-forecasters.qmd b/sliding-forecasters.qmd