Evan L. Ray*,1, Logan C. Brooks*,2, Jacob Bien3, Johannes Bracher4, Aaron Gerding1, Aaron Rumack2, Matthew Biggerstaff5, Michael A. Johansson5, Ryan J. Tibshirani2, Nicholas G. Reich1
1 School of Public Health and Health Sciences, University of Massachusetts Amherst
2 Machine Learning Department, Carnegie Mellon University
3 Department of Data Sciences and Operations, University of Southern California
4 Chair of Econometrics and Statistics, Karlsruhe Institute of Technology and Computational Statistics Group, Heidelberg Institute for Theoretical Studies
5 COVID-19 Response, U.S. Centers for Disease Control and Prevention
The * indicates authors who contributed equally to this work.
***The findings and conclusions in this report are those of the authors and do not necessarily represent the views of the Centers for Disease Control and Prevention.
Correspondence to: Nicholas G Reich, email: nick@umass.edu and Ryan J Tibshirani, email: ryantibs@cmu.edu
In a nutshell
Numerous research groups provide weekly forecasts of key COVID-19 outcomes in the United States. Public health officials and other stakeholders can benefit from the availability of a system that combines these predictions into a single weekly forecast with good performance relative to its components. This post updates a prior report on our work in this area, presenting several approaches to combining short-term probabilistic COVID-19 forecasts, both untrained and trained on previous forecast behavior, and analyzing their performance. We find that:
- when performance is averaged across all forecasts, a “trained” ensemble that weights models based on past performance may show improved accuracy relative to simpler untrained mean and median ensembles that treat all component forecasters equally, particularly for forecasting deaths due to COVID-19;
- mean ensembles (both weighted and unweighted) have been sensitive to occasional problems with the component forecasts; and
- a simple unweighted median ensemble has had very competitive performance on average while maintaining an appealing robustness to outlying component forecasts that is valuable in an applied public health setting.
Introduction
In October 2020, we wrote a post about our efforts to create a trained ensemble to better predict COVID-19 mortality in the United States. Our effort, a collaboration among more than 50 forecasting groups and the US Centers for Disease Control and Prevention (CDC), is based around weekly prospective forecasts submitted since early April 2020 to the US COVID-19 Forecast Hub. CDC communicates the ensemble forecasts that the Hub generates to the public every week. The forecasts have been among the most consistently accurate short-term probabilistic forecasts of incident deaths and incident cases over the past 11 months.
Our initial post analyzed forecasts from late April to mid-September 2020. We assessed the performance of “trained” ensemble approaches; “trained” methods weight the forecasts from individual models or “forecasters” based on past performance. We concluded that these approaches did not show a substantial and consistent improvement over an untrained median ensemble that gave equal weight to all component forecasters, regardless of past performance.
In brief, our trained ensemble approach combines component forecasts by taking a weighted average of the quantiles that specify the predictive distribution from each forecast. Weights are estimated for each forecaster by minimizing the weighted interval score (WIS) of the resulting ensemble over a sliding window of weeks before the forecast date. WIS is a measure that generalizes absolute error to probabilistic forecasts and approximates the continuous ranked probability score (CRPS); smaller WIS values are better, indicating that the forecast was more closely aligned with the eventually observed data. Weights are held constant across all horizons and locations, but separate weights are estimated for cases and deaths. To reduce the number of parameters to estimate, we constrain the model weights to be equal within each of three groups of quantile levels representing the lower tail, middle, and upper tail of the predictive distribution.
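For concreteness (using notation introduced here rather than in the original post), the ensemble's quantile at level $\tau$ is a weighted average of the component quantiles, with weights shared within each quantile group:

$$
q^{\text{ens}}_{\tau} \;=\; \sum_{m=1}^{M} w_{m,\,g(\tau)}\, q_{m,\tau},
\qquad w_{m,g} \ge 0, \quad \sum_{m=1}^{M} w_{m,g} = 1,
$$

where $g(\tau)$ indexes the quantile group (lower tail, middle, or upper tail) containing level $\tau$. The WIS itself, in one standard formulation (Bracher et al., 2021) with $K$ central prediction intervals plus the predictive median $q_{0.5}$, is

$$
\mathrm{WIS}_{\alpha_{0:K}}(F, y) \;=\; \frac{1}{K + 1/2}\left( \frac{1}{2}\,\lvert y - q_{0.5}\rvert \;+\; \sum_{k=1}^{K} \frac{\alpha_k}{2}\, \mathrm{IS}_{\alpha_k}(F, y) \right),
$$

where $\mathrm{IS}_{\alpha}(F, y) = (u - l) + \tfrac{2}{\alpha}(l - y)\,\mathbf{1}\{y < l\} + \tfrac{2}{\alpha}(y - u)\,\mathbf{1}\{y > u\}$ is the interval score for the central $(1 - \alpha) \times 100\%$ prediction interval $[l, u]$ of forecast $F$ and observation $y$.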
In this update on our work and ongoing analyses, we provide an overall update on the performance of various ensemble approaches and then examine two challenges facing trained ensemble approaches: (1) sensitivity to outlying forecasts; and (2) behavior of forecasts near local peaks.
Update on performance of trained ensembles
In updated experiments conducted retrospectively in which we compared trained and untrained approaches, our results differ from our initial report. Specifically, we find that a trained ensemble (a weighted combination of predictive distribution quantiles) shows modest but consistent improvements over untrained approaches when forecasting reported COVID-19 deaths at the state level in the United States (Figures 1 and 2). However, the results are not as clear for incident reported cases per state in the United States. As in the previous analysis, weekly “incidence” is defined as the difference in reported cumulative counts of cases or deaths on consecutive Saturdays.
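As a minimal sketch of this definition (with made-up numbers, not the Hub's actual data pipeline), weekly incidence is obtained by differencing the Saturday cumulative counts:

```r
# Hypothetical cumulative deaths reported on four consecutive Saturdays
cumulative_saturdays <- c(1000, 1080, 1210, 1400)

# Weekly incidence: difference in cumulative counts between consecutive Saturdays
weekly_incidence <- diff(cumulative_saturdays)  # 80 130 190
```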
We note a few key differences from the trained methods outlined in the first post. First, this analysis includes an evaluation of both case and death forecasts. Second, we include a different set of dates, looking at death forecasts from June 22, 2020, through January 18, 2021. We examine case forecasts starting September 14, 2020, because the Hub started processing forecasts of incident cases later. Third, the eligibility criteria for forecasts were modified slightly to include models in the ensemble that did not submit for all 50 states. In addition, in this implementation, a single set of weights is shared across all time horizons, rather than using separate weights for each horizon as was done in the first post. Finally, while the trained ensemble used in the first post estimated weights by minimizing the mean WIS of the ensemble forecasts over the four weeks before the forecast week, in this post we introduce an additional variation that uses the full history of component forecasts for estimation. We also considered a range of training-window sizes, capturing performance from 3 to 10 weeks prior to each forecast date. For brevity, we present only results from the variations using the 4-week training window and the full submission history here.
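The sketch below illustrates the kind of estimation described here: component weights chosen to minimize the mean WIS of the weighted-quantile-average ensemble over a training window. It is a simplified illustration under stated assumptions (a single weight per model rather than three quantile groups, hypothetical array names, and an off-the-shelf optimizer), not the Hub's production code.

```r
# Pinball-loss form of WIS: twice the mean quantile (pinball) loss across levels,
# assuming the quantile levels consist of the median plus symmetric pairs.
wis <- function(q, tau, y) {
  2 * mean(ifelse(y < q, (1 - tau) * (q - y), tau * (y - q)))
}

# component_q: array [n_models, n_quantiles, n_train] of component forecast quantiles
# tau: vector of quantile levels; y: vector of observed values (length n_train)
estimate_weights <- function(component_q, tau, y) {
  n_models <- dim(component_q)[1]
  softmax <- function(theta) exp(theta - max(theta)) / sum(exp(theta - max(theta)))

  mean_wis <- function(theta) {
    w <- softmax(theta)  # map unconstrained parameters onto the simplex
    scores <- sapply(seq_along(y), function(i) {
      ens_q <- colSums(w * component_q[, , i])  # weighted average of quantiles
      wis(ens_q, tau, y[i])
    })
    mean(scores)  # mean WIS over the training window
  }

  fit <- optim(rep(0, n_models), mean_wis, method = "BFGS")
  softmax(fit$par)  # estimated ensemble weights
}
```

The softmax reparameterization keeps the weights nonnegative and summing to one, so an unconstrained optimizer can be used.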
The relative accuracy of the different ensemble approaches differs between cases and deaths (Figure 1). For case forecasts, trained approaches appeared to show better accuracy in November 2020, but in January 2021 they performed worse than the simpler ensemble methods, especially at longer horizons. For death forecasts, trained approaches appear to have shown worse (or at least more variable) accuracy in late July and August. However, from November 2020 through January 2021, their performance was nearly uniformly better than that of the other approaches at the 3- and 4-week-ahead horizons. In general, WIS tends to be higher near inflection points and at longer forecast horizons. This phenomenon is particularly salient for the trained ensembles near the national peak in cases that occurred in early January 2021; we discuss this further below. Summarizing across all forecasts, the trained ensemble using the full history of past forecasts achieves the lowest mean WIS for both cases and deaths, though the differences are small for cases (Figure 2).
Fig. 1. Newly reported cases and deaths at the national level and mean weighted interval scores (WIS) for state-level forecasts over time for four ensembles: 1) an untrained mean ensemble, 2) an untrained median ensemble, 3) an ensemble trained on a 4-week training window, and 4) an ensemble trained on all historical data. Means are calculated separately for each combination of forecast horizon and forecast creation date, averaging across all states and territories. Lower scores indicate better forecast performance.
Fig. 2. Summaries of mean weighted interval scores (WIS) for ensemble forecasts of incident cases and deaths. Each point summarized by a box plot is the mean WIS for a single combination of target variable, forecast horizon, and target end date, averaging across all states and territories. Each box plot summarizes the variability of these mean WIS values across the forecast horizons and target end dates shown in Figure 1. A cross is plotted at the overall mean WIS for each model.
Figure 3 displays the estimated weights assigned to the component forecasters over time. When the full history of component forecasts is used to estimate model weights, the weights are more stable over time, and in general more weight is assigned to a few top-performing models.
Fig. 3. Component forecaster weights over time, faceted by the target variable (cases or deaths) and trained ensemble specification (4 weeks or full history). Separate weights are estimated for each of three groups of quantile levels. This figure shows weights only for the quantile group describing the middle of the predictive distribution; general patterns were similar for the other quantile groups. Each component forecaster is represented with a different color; not all are visibly apparent, as the fitted weights tend to exhibit some degree of sparsity, particularly in the full history approach. Note that not all component forecasters submitted forecasts for both cases and deaths. For the full history ensemble forecasts of incident cases, the five component forecasters with the highest weight in any week were (maximum weekly weight in parentheses): LNQ-ens1 (0.667); DDS-NBDS (0.308); LANL-GrowthRate (0.275); CU-select (0.233); and USACE-ERDC_SEIR (0.187). For the full history ensemble forecasts of incident deaths, the five component forecasters with highest weight in any week were: YYG-ParamSearch (0.595); UMass-MechBayes (0.415); Covid19Sim-Simulator (0.285); UCLA-SuEIR (0.181); and OliverWyman-Navigator (0.167).
Despite the relatively strong performance of the trained ensemble for forecasting incident deaths, several factors have prevented us from using it in place of the median ensemble. The following sections provide illustrative examples of these challenges.
Challenge 1: Sensitivity to outlying forecasts and data anomalies
In forecasting a non-stationary epidemic process with input data of varying quality, it is not uncommon for forecasts to follow unrealistic trajectories1. One key feature of data that arise from many public health surveillance systems is that initially reported data may be revised in subsequent data updates. The impact of these data revisions on outbreak forecast accuracy has been documented previously by Reich et al., Chakraborty et al., and Brooks et al. At the COVID-19 Forecast Hub, forecasts that follow unrealistic trajectories occur primarily for one of two reasons: either the forecaster has estimated an unrealistically large growth rate, which has been carried forward into the forecasts, or the forecaster has been “fooled” by a recent under- or over-reporting of cases or deaths.
We have observed that even forecasters that have been fairly reliable in previous weeks can be susceptible to generating occasional outlying forecasts. Even if such a forecaster is assigned a small weight, its outlying forecasts can still have a strong impact on the ensemble performance in particular weeks.
We illustrate this with an example of forecasts of incident deaths in Ohio generated on February 15, 2021 (Figure 4). In that week, Ohio issued a bulk report of deaths that had occurred earlier; these deaths were later redistributed back into the time series, but the report caused approximately one third of the forecasts submitted to the Hub that week to predict much higher numbers than had ever been reported previously (examples shown in panels A and B). The median ensemble was robust to these extreme forecasts (panel C), but a trained ensemble that gave them even a small amount of weight showed unreasonable behavior (panel D).
Fig. 4. Forecasts of incident deaths in Ohio generated on February 15, 2021. We display forecasts from two example component models that had irregular forecasts at the time of the reporting anomaly (panels A and B), the median ensemble (panel C), and a trained ensemble (panel D). The vertical axis scale is different in each facet, reflecting differences across five orders of magnitude in forecasts from different models; the reference data are the same in each plot. The data shown with a solid line are the weekly incident deaths that had been reported as of Monday, February 15, 2021, including a large spike of deaths that were reported in the week ending Saturday, February 13, 2021 but occurred earlier. In the revised time series with data available as of Monday, February 22, 2021 (dashed line), these deaths had been redistributed into the history of the time series.
Problems with reporting and component forecasts can affect trained ensemble forecasts directly, as illustrated above, but they might also affect trained ensemble performance more indirectly. It is conceivable that a component forecaster that is generally quite good would be given less weight because of one bad forecast in the past, for example due to a coding bug that has since been corrected. This could reduce the overall skill of the ensemble. In addition, if the data available at the time a forecast was made were later revised substantially, an otherwise strong model that did not account for the possibility of such data revisions could be heavily penalized. Conversely, a misspecified model that consistently overpredicts could be given unduly high weight if the data were later revised upwards. We have several general ideas for how to address these problems. One option is to use automated approaches to detect and remove outlying component forecasts. Another is to use a weighted median, though estimating component model weights for a weighted median would be computationally challenging. Finally, it might be beneficial to omit forecasts of values that experienced large revisions from the training set used during weight estimation.
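As a sketch of the weighted-median idea (an illustration with hypothetical function and argument names, not an implemented Hub method), a weighted median could be taken separately at each quantile level:

```r
# Lower weighted median: smallest value whose cumulative weight reaches one half
weighted_median <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord] / sum(w)
  x[which(cumsum(w) >= 0.5)[1]]
}

# component_q: matrix [n_models, n_quantiles] of component forecast quantiles for
# one location, horizon, and forecast date; weights: one weight per component model
ensemble_quantiles <- function(component_q, weights) {
  apply(component_q, 2, weighted_median, w = weights)
}
```

With equal weights this reduces to an unweighted median of the component quantiles, retaining the robustness to outlying forecasts illustrated in Figure 4.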
1 In early months of generating the ensemble, we manually removed forecasts that appeared to be outliers. However, the switch to using a median instead of a mean ensemble removed this sensitivity and enabled us to include all models that passed a basic data format quality control check. For more on this see the initial preprint on the COVID-19 Forecast Hub ensemble and our previous blog post on this topic.
Challenge 2: Behavior near local peaks
Predicting the peak of an ongoing epidemic surge has proven to be an Achilles’ heel for many forecasters in the last year. Given that a key role a forecast could play is to anticipate an inflection point, the failure of many models to do so has been notable. The Forecast Hub has stated that forecasters will be evaluated based on the mean weighted interval score across all forecasts. When models are tuned to perform well across the entire pandemic, without a special focus on “peaks,” they will naturally predict well during non-peak times, since the vast majority of predicted time points fall outside of peaks.
Without an explicit, prospective stated interest in performance around the peaks, we risk falling prey to the forecaster’s dilemma by focusing evaluation post hoc on the most extreme moments of the pandemic. Nonetheless, performance around peaks has been a topic of some interest, as there is an awareness that a particularly bad performance at the peak can erode trust in the forecasts.
Trained ensembles have shown a tendency both to predict sustained increases and decreases in cases and deaths more accurately than an untrained ensemble and to overshoot at the peak (Figure 5). Anecdotally, we have observed that the trained ensembles give the most weight to component forecasters that tend to predict a continuation of recent trends in incidence. This is a reasonable strategy most of the time but does not perform well at inflection points. In an informal examination of past forecasts, this overshooting near local peaks shows up consistently for forecasts of incident cases across many different locations and times. It also shows up for forecasts of incident deaths, but component forecasters do appear to have been able to see the turn coming more often for incident deaths than for incident cases, perhaps because inflections in reported cases often foreshadow subsequent inflections in reported deaths.
Fig. 5. Forecasts of incident cases in Oklahoma generated January 11, 2021. We display forecasts from the two component models that were assigned highest weight by the full history ensemble method on the week of January 11, along with the median ensemble and a trained ensemble forecast based on the full history. Together, these two component models accounted for about 80% of the weight that week. These models are not the same as the component models displayed in Figure 4.
Conclusions
It is natural to consider trained ensembles as an approach for forecasts of the short-term trajectory of COVID-19, given the intuition that some component forecasters are better than others and the successful application of trained ensemble methods for forecasting other diseases (such as dengue, Ebola, influenza, and respiratory syncytial virus) in recent years. Indeed, for forecasts of incident deaths attributable to COVID-19, we have found evidence that a trained ensemble may offer benefits relative to simpler untrained mean and median ensembles that treat all component forecasters equally—both in terms of overall mean WIS and during periods of increasing deaths. Yet, it is notable that simple approaches that require no training data have remained quite competitive with more complex approaches that try to leverage historical forecast performance.
An important limitation of the ensemble evaluation conducted here is that we display results for retrospectively selected ensemble formulations. Although we are careful to train using only data that would have been available in real time, this is not a true prospective evaluation. In analyses omitted here for brevity, we have found that an approach that selects an ensemble formulation to use each week based on the best-performing method in past weeks would not have done as well as the full history method discussed in this post. On average, however, it was still better than the median ensemble for forecasts of incident deaths.
We believe that it is important to consider not only the mean WIS of all forecasts, but also the stability and consistency of forecast performance across all times and locations. An ensemble that has good average performance but occasional dramatic failures might have limited public health utility and erode public faith in the validity of the forecasts if the failures occur at important times. In our investigations, mean ensembles (both weighted and unweighted) have been sensitive to occasional problems with the component forecasts, as illustrated in the discussion above.
These challenges have been especially prominent during times when the ground truth data were subject to large reporting anomalies. However, limitations of trained mean ensembles have also appeared near local peaks in incidence. It will be necessary to address the limitations of trained ensemble approaches that we have seen here before putting them to use in an official capacity. On the other hand, on average, a simple unweighted median ensemble has had very competitive performance—similar to or better than the best of the component forecasters—while maintaining a robustness to outlying component forecasts.
Reproducibility
The R code used to fit the ensemble models, the resulting ensemble forecasts and component model weights, and the code used to generate the figures in this post are available in public GitHub repositories.
Acknowledgements
This material is based upon work supported by the U.S. Centers for Disease Control and Prevention under grant numbers 1U01IP001121 and U01IP001122, Google, and the Center for Machine Learning and Health.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily represent the views of the Centers for Disease Control and Prevention, Google, or the Center for Machine Learning and Health.
The work of Johannes Bracher was supported by the Helmholtz Foundation via the SIMCARD Information and Data Science Pilot Project.