Skip to content

Commit f891cec

Browse files
committed
wip: update tooling book
1 parent a260551 commit f891cec

File tree

4 files changed

+94
-102
lines changed

4 files changed

+94
-102
lines changed

outliers.qmd

Lines changed: 5 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -12,12 +12,12 @@ source("_common.R")
1212
```
1313

1414
```{r}
15-
x <- incidence_num_outlier_example
15+
incidence_num_outlier_example
1616
```
1717

1818
```{r, warning=FALSE, message=FALSE}
1919
#| code-fold: true
20-
ggplot(x, aes(x = time_value, y = cases, color = geo_value)) +
20+
ggplot(incidence_num_outlier_example, aes(x = time_value, y = cases, color = geo_value)) +
2121
geom_line() +
2222
scale_color_manual(values = c(3, 6)) +
2323
geom_hline(yintercept = 0, linetype = 3) +
@@ -63,11 +63,7 @@ detection_methods = bind_rows(
6363
args = list(list(detect_negatives = TRUE,
6464
detection_multiplier = 2.5,
6565
seasonal_period = 7)),
66-
abbr = "stl_seasonal"),
67-
tibble(method = "stl",
68-
args = list(list(detect_negatives = TRUE,
69-
detection_multiplier = 2.5)),
70-
abbr = "stl_nonseasonal"))
66+
abbr = "stl_seasonal"))
7167
7268
detection_methods
7369
```
@@ -78,14 +74,15 @@ Note that using this combined median threshold is equivalent to using a majority
7874
vote across the base methods to determine whether a value is an outlier.
7975

8076
```{r}
81-
x <- x %>%
77+
x <- incidence_num_outlier_example %>%
8278
group_by(geo_value) %>%
8379
mutate(
8480
outlier_info = detect_outlr(
8581
x = time_value, y = cases,
8682
methods = detection_methods,
8783
combiner = "median")
8884
) %>%
85+
unpack(outlier_info) %>%
8986
ungroup()
9087
9188
x

renv.lock

Lines changed: 12 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -775,14 +775,15 @@
775775
},
776776
"epipredict": {
777777
"Package": "epipredict",
778-
"Version": "0.0.20",
778+
"Version": "0.0.21",
779779
"Source": "GitHub",
780780
"RemoteType": "github",
781-
"RemoteUsername": "cmu-delphi",
782-
"RemoteRepo": "epipredict",
783-
"RemoteRef": "dev",
784-
"RemoteSha": "f76961cfb8ddf73f04412ef5432fd657933d0e49",
785781
"RemoteHost": "api.github.com",
782+
"RemoteRepo": "epipredict",
783+
"RemoteUsername": "cmu-delphi",
784+
"RemotePkgRef": "cmu-delphi/epipredict@ds/epiprocess-0.9.0",
785+
"RemoteRef": "ds/epiprocess-0.9.0",
786+
"RemoteSha": "93f41405a631d4564c2bf4d1655363db90668112",
786787
"Requirements": [
787788
"R",
788789
"checkmate",
@@ -806,18 +807,19 @@
806807
"vctrs",
807808
"workflows"
808809
],
809-
"Hash": "5abb03e30a3e9a9337de4e75f64b6d34"
810+
"Hash": "e94275cb50c856e89156fa7b600329e3"
810811
},
811812
"epiprocess": {
812813
"Package": "epiprocess",
813814
"Version": "0.9.0",
814815
"Source": "GitHub",
815816
"RemoteType": "github",
816817
"RemoteHost": "api.github.com",
817-
"RemoteUsername": "cmu-delphi",
818818
"RemoteRepo": "epiprocess",
819-
"RemoteRef": "lcb/slide-improvements-2024-06",
820-
"RemoteSha": "a96297e6811ec1eba62e914c9428f6abbcd281b8",
819+
"RemoteUsername": "cmu-delphi",
820+
"RemotePkgRef": "cmu-delphi/epiprocess@187bd5d",
821+
"RemoteRef": "187bd5d",
822+
"RemoteSha": "187bd5d87eea3a4acc323a715d1ba32edb34cd49",
821823
"Requirements": [
822824
"R",
823825
"checkmate",
@@ -841,7 +843,7 @@
841843
"vctrs",
842844
"waldo"
843845
],
844-
"Hash": "a87577d084c2f364f3bf60283afec69e"
846+
"Hash": "c5339d0a09b1deefb263d087ee834a9a"
845847
},
846848
"evaluate": {
847849
"Package": "evaluate",

slide.qmd

Lines changed: 64 additions & 63 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
# Sliding computations {#sec-sliding}
22

33
A central tool in the `{epiprocess}` package is `epi_slide()`, which is based
4-
on the powerful functionality provided in the
4+
on the powerful functionality provided in the
55
[`slider`](https://cran.r-project.org/web/packages/slider) package. In
66
`{epiprocess}`, to "slide" means to apply a computation---represented as a
77
function or formula---over a sliding/rolling data window. Suitable
@@ -10,7 +10,7 @@ groupings can always be achieved by a preliminary call to `group_by()`.
1010
By default, the meaning of one time step is inferred from the `time_value`
1111
column of the `epi_df` object under consideration, based on the way this column
1212
understands addition and subtraction. For example, if the time values are coded
13-
as `Date` objects, then one time step is one day, since
13+
as `Date` objects, then one time step is one day, since
1414
`as.Date("2022-01-01") + 1` equals `as.Date("2022-01-02")`. Alternatively, the time step can be specified
1515
manually in the call to `epi_slide()`; you can read the documentation for more
1616
details. Furthermore, the alignment of the running window used in `epi_slide()`
@@ -51,10 +51,10 @@ order to smooth the signal, by passing in a formula for the first argument of
5151
`epi_slide()`. To do this computation per state, we first call `group_by()`.
5252

5353
```{r}
54-
x %>%
55-
group_by(geo_value) %>%
56-
epi_slide(~ mean(.x$cases), before = 6) %>%
57-
ungroup()
54+
x %>%
55+
group_by(geo_value) %>%
56+
epi_slide(~ mean(.x$cases), .window_size = 7) %>%
57+
ungroup()
5858
```
5959

6060
The formula specified has access to all non-grouping columns present in the
@@ -65,9 +65,9 @@ default. We can of course change this post hoc, or we can instead specify a new
6565
name up front using the `new_col_name` argument:
6666

6767
```{r}
68-
x %>%
68+
x %>%
6969
group_by(geo_value) %>%
70-
epi_slide(~ mean(.x$cases), before = 6, new_col_name = "cases_7dav") %>%
70+
epi_slide(~ mean(.x$cases), .window_size = 7, .new_col_name = "cases_7dav") %>%
7171
ungroup()
7272
```
7373

@@ -81,7 +81,7 @@ Like in `group_modify()`, there are alternative names for these variables as
8181
well: `.` can be used instead of `.x`, `.y` instead of `.group_key`, and `.z`
8282
instead of `.ref_time_value`.
8383

84-
## Slide with a function
84+
## Slide with a function
8585

8686
We can also pass a function for the first argument in `epi_slide()`. In this
8787
case, the passed function must accept the following arguments:
@@ -97,10 +97,10 @@ receives to `f`.
9797
Recreating the last example of a 7-day trailing average:
9898

9999
```{r}
100-
x %>%
101-
group_by(geo_value) %>%
102-
epi_slide(function(x, gk, rtv) mean(x$cases),
103-
before = 6, new_col_name = "cases_7dav") %>%
100+
x %>%
101+
group_by(geo_value) %>%
102+
epi_slide(function(x, gk, rtv) mean(x$cases),
103+
.window_size = 7, .new_col_name = "cases_7dav") %>%
104104
ungroup()
105105
```
106106

@@ -113,9 +113,9 @@ to a computation in which we can access any columns of `x` by name, just as we
113113
would in a call to `dplyr::mutate()`, or any of the `dplyr` verbs. For example:
114114

115115
```{r}
116-
x <- x %>%
117-
group_by(geo_value) %>%
118-
epi_slide(cases_7dav = mean(cases), before = 6) %>%
116+
x <- x %>%
117+
group_by(geo_value) %>%
118+
epi_slide(cases_7dav = mean(cases), .window_size = 7) %>%
119119
ungroup()
120120
```
121121
In addition to referring to individual columns by name, you can refer to the
@@ -128,7 +128,7 @@ top of the original counts.
128128
#| code-fold: true
129129
cols <- RColorBrewer::brewer.pal(7, "Set1")[-6]
130130
ggplot(x, aes(x = time_value)) +
131-
geom_col(aes(y = cases, fill = geo_value), alpha = 0.5,
131+
geom_col(aes(y = cases, fill = geo_value), alpha = 0.5,
132132
show.legend = FALSE) +
133133
scale_y_continuous(expand = expansion(c(0, 0.05))) +
134134
geom_line(aes(y = cases_7dav, col = geo_value), show.legend = FALSE) +
@@ -139,14 +139,14 @@ ggplot(x, aes(x = time_value)) +
139139
labs(x = "Date", y = "Reported COVID-19 cases")
140140
```
141141

142-
As we can see from the center top panel, it looks like Florida moved to weekly
142+
As we can see from the center top panel, it looks like Florida moved to weekly
143143
reporting of COVID-19 cases in summer of 2021, while California occasionally reported negative cases counts!
144144

145145
## Running a local forecaster {#sec-local-forecaster}
146146

147147
As a more complex example, we preview some of the functionality of `{epipredict}` described in future chapters, and use a forecaster based on a
148148
local (in time)
149-
autoregression or "AR model". AR models can be fit in numerous ways
149+
autoregression or "AR model". AR models can be fit in numerous ways
150150
(using base R
151151
functions and various packages), but here we the `arx_forecaster()`, implemented in `{epipredict}` both
152152
provides a more advanced example of sliding a function over an `epi_df` object,
@@ -165,46 +165,46 @@ considered in this vignette).
165165

166166
```{r eval=FALSE}
167167
arx_forecaster <- function(
168-
epi_df,
168+
epi_df,
169169
outcome, # the outcome column name in `epi_df`
170170
predictors, # a character vector, containing 1 or more predictors in `epi_df`
171-
trainer = quantile_reg(),
171+
trainer = quantile_reg(),
172172
args_list = arx_args_list(
173-
lags = c(0, 7, 14),
173+
lags = c(0, 7, 14),
174174
ahead = 7,
175175
quantile_levels = c(0.05, 0.95)
176176
)
177177
)
178178
179179
```
180180

181-
We go ahead and slide this AR forecaster over the working `epi_df` of COVID-19
182-
cases. Note that we actually model the `cases_7dav` column, to operate on the
181+
We go ahead and slide this AR forecaster over the working `epi_df` of COVID-19
182+
cases. Note that we actually model the `cases_7dav` column, to operate on the
183183
scale of smoothed COVID-19 cases. This is clearly equivalent, up to a constant,
184184
to modeling weekly sums of COVID-19 cases.
185185

186186
```{r, warning=FALSE}
187187
fc_time_values <- seq(
188-
from = as.Date("2020-06-01"),
189-
to = as.Date("2021-12-01"),
188+
from = as.Date("2020-06-01"),
189+
to = as.Date("2021-12-01"),
190190
by = "1 months")
191191
192192
fcasts <- epi_slide(
193-
x,
194-
~ arx_forecaster(
195-
epi_data = .x,
196-
outcome = "cases_7dav",
197-
predictors = "cases_7dav",
198-
trainer = quantile_reg(),
199-
args_list = arx_args_list(ahead = 7))$predictions,
200-
before = 119,
201-
ref_time_values = fc_time_values,
202-
new_col_name = "fc")
193+
x,
194+
.f = ~ arx_forecaster(
195+
epi_data = .x,
196+
outcome = "cases_7dav",
197+
predictors = "cases_7dav",
198+
trainer = quantile_reg(),
199+
args_list = arx_args_list(ahead = 7))$predictions,
200+
.window_size = 120,
201+
.ref_time_values = fc_time_values,
202+
.new_col_name = "fc")
203203
204204
# grab just the relevant columns, and make them easier to plot
205205
fcasts <- fcasts %>%
206-
select(geo_value, time_value, cases_7dav,
207-
contains("_distn"), fc_target_date) %>%
206+
unpack(fc, names_sep = "_") %>%
207+
select(geo_value, time_value, cases_7dav, starts_with("fc")) %>%
208208
pivot_quantiles_wider(contains("_distn"))
209209
fcasts
210210
```
@@ -216,29 +216,29 @@ that correspond to the date the forecast is for (rather than the date it was mad
216216
95\% prediction band.[^1]
217217

218218
[^1]: If instead we had set `as_list_col = TRUE`
219-
in the call to `epi_slide()`, then we would have gotten a list column `fc`,
219+
in the call to `epi_slide()`, then we would have gotten a list column `fc`,
220220
where each element of `fc` contains these results.
221221

222222
To finish off, we plot the forecasts at some times (spaced out by a few months)
223-
over the last year, at multiple horizons: 7, 14, 21, and 28 days ahead. To do
224-
so, we encapsulate the process of generating forecasts into a simple function,
223+
over the last year, at multiple horizons: 7, 14, 21, and 28 days ahead. To do
224+
so, we encapsulate the process of generating forecasts into a simple function,
225225
so that we can call it a few times.
226226

227227
```{r, message = FALSE, warning = FALSE}
228228
k_week_ahead <- function(ahead = 7) {
229229
epi_slide(
230-
x,
230+
x,
231231
~ arx_forecaster(
232-
epi_data = .x,
233-
outcome = "cases_7dav",
234-
predictors = "cases_7dav",
235-
trainer = quantile_reg(),
236-
args_list = arx_args_list(ahead = ahead))$predictions,
237-
before = 119,
238-
ref_time_values = fc_time_values,
239-
new_col_name = "fc") %>%
240-
select(geo_value, time_value, cases_7dav, contains("_distn"),
241-
fc_target_date) %>%
232+
epi_data = .x,
233+
outcome = "cases_7dav",
234+
predictors = "cases_7dav",
235+
trainer = quantile_reg(),
236+
args_list = arx_args_list(ahead = ahead))$predictions,
237+
.window_size = 120,
238+
.ref_time_values = fc_time_values,
239+
.new_col_name = "fc") %>%
240+
unpack(fc, names_sep = "_") %>%
241+
select(geo_value, time_value, cases_7dav, starts_with("fc")) %>%
242242
pivot_quantiles_wider(contains("_distn"))
243243
}
244244
@@ -247,15 +247,16 @@ z <- map(c(7, 14, 21, 28), k_week_ahead) %>% list_rbind()
247247
```
248248

249249
Then we can plot the on top of the observed data
250+
250251
```{r, fig.width=8, fig.height=9}
251252
#| code-fold: true
252253
ggplot(z) +
253-
geom_line(data = x, aes(x = time_value, y = cases_7dav), color = "gray50") +
254+
geom_line(data = x, aes(x = time_value, y = cases_7dav), color = "gray50") +
254255
geom_ribbon(aes(x = fc_target_date, ymin = `0.05`, ymax = `0.95`,
255-
group = time_value, fill = geo_value), alpha = 0.4) +
256-
geom_line(aes(x = fc_target_date, y = `0.5`, group = time_value)) +
257-
geom_point(aes(x = fc_target_date, y = `0.5`, group = time_value), size = 0.5) +
258-
#geom_vline(data = tibble(x = fc_time_values), aes(xintercept = x),
256+
group = time_value, fill = geo_value), alpha = 0.4) +
257+
geom_line(aes(x = fc_target_date, y = `0.5`, group = time_value)) +
258+
geom_point(aes(x = fc_target_date, y = `0.5`, group = time_value), size = 0.5) +
259+
#geom_vline(data = tibble(x = fc_time_values), aes(xintercept = x),
259260
# linetype = 2, alpha = 0.5) +
260261
facet_wrap(vars(geo_value), scales = "free_y", nrow = 3) +
261262
scale_y_continuous(expand = expansion(c(0, 0.05))) +
@@ -269,22 +270,22 @@ spotty. At various points in time, we can see that its forecasts are volatile
269270
(its point predictions are all over the place), or overconfident (its bands are
270271
too narrow), or both at the same time. This is only meant as a simple demo and
271272
not entirely unexpected given the way the AR model is set up. The
272-
[`epipredict`](https://cmu-delphi.github.io/epipredict) package,
273-
offers a suite of predictive modeling tools
274-
that improve on many of the shortcomings of the above simple AR model (simply
273+
[`epipredict`](https://cmu-delphi.github.io/epipredict) package,
274+
offers a suite of predictive modeling tools
275+
that improve on many of the shortcomings of the above simple AR model (simply
275276
using all states for training rather than 6 is a huge improvement).
276277

277278
Second, the AR forecaster here is using finalized data, meaning, it uses the
278279
latest versions of signal values (reported COVID-19 cases) available, for both
279280
training models and making predictions historically. However, this is not
280281
reflective of the provisional nature of the data that it must cope with in a
281282
true forecast task. Training and making predictions on finalized data can lead
282-
to an overly optimistic sense of accuracy; see, for example,
283+
to an overly optimistic sense of accuracy; see, for example,
283284
[@McDonaldBien2021] and references
284285
therein. Fortunately, the `epiprocess` package provides a data structure called
285286
`epi_archive` that can be used to store all data revisions, and furthermore, an
286287
`epi_archive` object knows how to slide computations in the correct
287288
version-aware sense (for the computation at each reference time $t$, it uses
288-
only data that would have been available as of $t$). We will revisit this
289-
example in the [archive
289+
only data that would have been available as of $t$). We will revisit this
290+
example in the [archive
290291
vignette](https://cmu-delphi.github.io/epiprocess/articles/archive.html).

0 commit comments

Comments
 (0)