Logan C. Brooks*,1, Evan L. Ray*,2, Jacob Bien3, Johannes Bracher4, Aaron Rumack1, Ryan J. Tibshirani†,1, Nicholas G. Reich†,2

1 Machine Learning Department, Carnegie Mellon University
2 School of Public Health and Health Sciences, University of Massachusetts Amherst
3 Department of Data Sciences and Operations, University of Southern California
4 Chair of Econometrics and Statistics, Karlsruhe Institute of Technology and Computational Statistics Group, Heidelberg Institute for Theoretical Studies

The * and † indicate authors who contributed equally to this work.

Correspondence to: Nicholas G Reich, email: [email protected] and Ryan J Tibshirani, email: [email protected]


In a nutshell

    Numerous research groups provide weekly forecasts of key COVID-19 outcomes in the U.S. Public health officials and other stakeholders can benefit from the availability of a system that combines these predictions into a single weekly forecast with good performance relative to its components. This post presents a few approaches to combining short-term, probabilistic COVID-19 forecasts, both untrained and trained on previous forecast behavior, and analyzes their performance. We find that

  1. combining forecasts in various reasonable ways consistently leads to improved performance over component forecasts and a baseline method;
  2. a trained weighted mean of available forecasts, where training is based on past forecaster behavior, can improve on the performance of an untrained mean (straight average) of forecasts;
  3. a simple untrained median of available forecasts is competitive with the trained weighted mean approach in terms of performance, and appears more robust to changes in component behavior.



Short-term forecasts of the future trajectory of COVID-19 cases and deaths can be an important input to planning public health responses such as allocation of resources and vaccine trial site selection (Lipsitch et al., 2011; Dean et al., 2020). Previous work in epidemiological forecasting has consistently found that ensemble approaches that combine forecasts from many models can yield improved forecast skill relative to the individual models that the ensemble builds on, with some evidence that a weighted (trained) combination of individual forecasts can yield better performance on average and less variability across different targets (Brooks et al., 2018; Ray and Reich, 2018; Viboud et al., 2018; Johansson et al., 2019; McGowan et al., 2019; Reich et al., 2019). In the context of COVID-19 forecasting, existing ensemble analyses have focused on unweighted approaches (Taylor and Taylor, 2020; Ray et al., 2020). We expand on this line of work by considering ensemble formulations with estimated weights for the component models, building on the models submitted to the COVID-19 Forecast Hub, a public repository for short-term forecasts of cases, hospitalizations, and deaths in the U.S. (“the Hub” – https://covid19forecasthub.org/ – Reich et al., 2020).

This post will focus on challenges in building multi-model ensemble forecasts, specifically on the combination of numerous forecasts of weekly reported deaths attributable to COVID-19 at the state and territory level in the U.S., starting in April 2020 (Figure 1). In total, more than 70 models have been contributed by over 50 research groups across academia, industry, and government to the Hub, covering multiple surveillance signals and geographical resolutions; over 50 models provide forecasts of the weekly incident death data at the state and territory level.

Fig. 1. Illustration of forecasts of incident weekly deaths due to COVID-19 in California, displayed on the CDC website — https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html (Centers for Disease Control and Prevention, 2020). The left panel shows 1 through 4 week ahead point forecasts and 95% prediction intervals from teams contributing to the Hub. The right panel shows the point forecasts from the individual models and the point predictions and 50% and 95% prediction intervals from the QuantMedian ensemble described below.

One important point worth making at the outset is that forecasts submitted to the Hub are all probabilistic, with predictive distributions represented by quantiles. This is already somewhat of a novelty for epidemiological forecasting. Only in recent years have epidemiological forecasts been standardized and specified as probabilistic rather than point predictions, sometimes with prediction intervals (Johansson et al., 2019; McGowan et al., 2019; Reich et al., 2019). The standard for probabilistic forecasts and therefore ensemble generation in this field has been to consider binned incidence and assign probability mass to each bin, rather than specify predictive quantiles as is the standard format established by the Hub. The two are in general not equivalent perspectives. Moreover, the real-time nature of this forecasting exercise has led to several challenges that any procedure for constructing a weighted ensemble must address.

  • Not all models submit forecasts for all geographic units, and modeling teams submit forecasts at varying times. This means that a procedure for estimating ensemble weights must account for missing forecasts for some locations or past weeks.
  • Teams have made frequent changes to their models, leading to the possibility of changes in the relative skill of individual models, which in turn might call for changing model weights over time.
  • Many of the models contributed to the Hub suffer from a lack of probabilistic calibration; it is critical that an ensemble that might be used to inform public policy decisions be well calibrated.
  • Some of the “ground truth” time-series have seen substantial revisions that retrospectively change the observed data1. These updates make it challenging to obtain useful scores to determine model weights in real-time.

In what follows, we compare approaches to combine these predictive distributions into a single ensemble forecast in the context of this real-time forecasting exercise. The presented analysis is fairly brief; more comprehensive results will be presented in a future paper.

Forecast structure and evaluation

Forecasts were submitted to the Hub for each state and territory (56 locations in total). To be eligible for inclusion in the ensemble, submissions were required to contain probabilistic forecasts of incident deaths for at least the 50 states. Predictive distributions were represented by 23 quantiles \{0.01, 0.025, 0.05, 0.10, 0.15, \ldots, 0.90, 0.95, 0.975, 0.99\} and a point estimate (which may or may not be equivalent to the median). In this post, we examine approaches to distributional ensemble forecasting for forecast horizons of 1 to 4 weeks ahead at the state and territory level.

All ensembles were evaluated using the weighted interval score (WIS) (Bracher et al., 2020) to simultaneously evaluate the 23 reported forecast quantiles. This is a proper score, a well-known quantile-based approximation of the commonly used continuous ranked probability score (CRPS), and a generalization of the absolute error (AE) (Gneiting and Raftery, 2007). It can be interpreted as a measure of distance between the predictive distribution and the observed value, where the units are those of the absolute error, on the natural scale of the data. Ground truth was taken to be newly reported death “incidence” data (technically, differences in cumulative counts) from the Johns Hopkins University Center for Systems Science and Engineering (Dong et al., 2020). These data were accessed via the COVIDcast API (Delphi Group at Carnegie Mellon University); we removed daily observations containing abnormal spikes up or down in reporting2, and aggregated to weekly resolution.
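Since WIS can be written as a rescaled sum of pinball (quantile) losses over the reported quantile levels, it is easy to sketch. The snippet below is our own illustrative implementation (function and variable names are ours, not the Hub's evaluation code); it also shows the sense in which WIS generalizes the absolute error:

```python
import numpy as np

# The 23 quantile levels used by the Hub.
TAUS = np.array([0.01, 0.025, 0.05]
                + list(np.round(np.arange(0.10, 0.91, 0.05), 2))
                + [0.95, 0.975, 0.99])

def wis(quantiles, taus, y):
    """WIS from predictive quantiles, via the pinball (quantile) loss.

    For K central intervals plus the median (2K + 1 quantile levels),
    WIS equals the summed pinball loss divided by K + 1/2, so a
    point-mass forecast (all quantiles equal) scores its absolute error.
    """
    quantiles, taus = np.asarray(quantiles), np.asarray(taus)
    pinball = np.where(y >= quantiles,
                       (y - quantiles) * taus,
                       (quantiles - y) * (1 - taus))
    return pinball.sum() / (len(taus) / 2)  # len(taus)/2 == K + 1/2

# A forecast whose 23 quantiles all sit at 10 while 14 deaths are
# observed scores a WIS of 4 -- exactly its absolute error.
print(wis(np.full(23, 10.0), TAUS, 14.0))
```

Sharper predictive distributions that still cover the observation score lower, so WIS rewards both calibration and sharpness.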

Ensemble approaches

For each combination of spatial unit s, time point t, and forecast horizon h, we receive forecasts from up to M models. The forecast from model m comprises a set of K=23 predictive quantiles: q^m_{s,t,h,k}, k=1,\ldots,K. All approaches we consider obtain the ensemble forecast at each quantile level as a combination of predictive quantiles from all contributing models for the given spatial unit, time point, forecast horizon, and quantile level:

    \[ q_{s,t,h,k} = f(q^1_{s,t,h,k}, \ldots, q^M_{s,t,h,k}), \quad k=1,\ldots,K. \]

In this post, we describe the results from two simple ensemble approaches, and a third more sophisticated one, corresponding to different choices of the combination function f:

  • QuantMedian: takes an equal-weights median at every quantile level.
  • QuantMean: takes an equal-weights mean at every quantile level.
  • QuantTrained: takes a trained-weights mean at every quantile level (meaning, the weights are trained in real-time, to optimize WIS based on recent data, as detailed below).
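Concretely, the two untrained approaches reduce to applying a median or mean across components at each quantile level. A minimal sketch, with made-up numbers (our own illustrative code):

```python
import numpy as np

def combine_quantiles(component_quantiles, method="median"):
    """Combine an (M components x K quantile levels) array of predictive
    quantiles into one ensemble quantile vector, separately at each
    quantile level."""
    q = np.asarray(component_quantiles, dtype=float)
    if method == "median":
        return np.median(q, axis=0)  # QuantMedian
    if method == "mean":
        return np.mean(q, axis=0)    # QuantMean
    raise ValueError(f"unknown method: {method}")

# Three hypothetical components reporting (0.25, 0.5, 0.75) quantiles;
# the third has an extreme upper tail.
q = [[80.0, 100.0, 130.0],
     [90.0, 110.0, 140.0],
     [85.0, 120.0, 400.0]]
print(combine_quantiles(q, "median"))  # per-level medians: 85, 110, 140
print(combine_quantiles(q, "mean"))    # the 0.75 quantile is pulled up to ~223
```

Note how the single extreme upper-tail quantile drags the mean combination far upward while leaving the median combination unaffected, previewing the robustness comparison below.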

The equally-weighted approaches (QuantMedian and QuantMean) were explored in Taylor and Taylor (2020) and Ray et al. (2020), respectively. The trained ensemble approach (QuantTrained) that we additionally consider here obtains the ensemble forecast as a linear combination of predictive quantiles from the component forecasters:

    \[ q_{s,t,h,k} = \sum_{m=1}^M \beta^m_{t,h,k} \cdot q^m_{s,t,h,k}, \quad k=1,\ldots,K. \]

For each time t and forecast horizon h, the set of weights \beta^m_{t,h,k}, m=1,\ldots,M, k=1,\ldots,K are fit by optimizing WIS of the corresponding ensemble, summed over the last 4 weeks of reported data as of time t and all spatial units s. We constrain the optimization so that these weights are nonnegative, and so that \sum_{m=1}^M \beta^m_{t,h,k} = 1 (which makes the combination a weighted average), for each quantile level k. In cases where a forecaster did not submit predictions for a given location, time point, and forecast horizon, the missing predictive quantiles were handled by preprocessing via mean imputation3. Estimation is performed using the quantgen R package (Tibshirani, 2020).

The trained ensemble therefore seeks weights to directly optimize forecast skill (as measured by WIS). Two important notes are in order. First, we allow separate weights per quantile level to accommodate the fact that forecasters may have varying performance in the centers versus the tails of their predictive distributions (e.g., some forecasters may have more accurate point predictions, others may be better calibrated)4. Second, we optimize WIS over the most recent 4 weeks of data to accommodate nonstationarity in forecaster performance over time.
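To make the training idea concrete, here is a toy sketch for just two component models with a single shared weight across quantile levels, found by grid search over the simplex to minimize total pinball loss (proportional to summed WIS). This is our own illustrative code: the actual estimation handles many models, per-quantile weights, missingness, and the 4-week training window, and is performed with the quantgen package.

```python
import numpy as np

def pinball(q, tau, y):
    """Pinball (quantile) loss of predictive quantile q at level tau."""
    return np.where(y >= q, (y - q) * tau, (q - y) * (1 - tau))

def fit_simplex_weight(q1, q2, taus, ys, grid=1001):
    """Grid-search the weight w in [0, 1] minimizing the total pinball
    loss of the combined quantiles w*q1 + (1-w)*q2 over past
    forecast/outcome pairs.

    q1, q2: (N instances x K levels) arrays of past predictive quantiles;
    taus: (K,) quantile levels; ys: (N,) observed values.
    """
    ws = np.linspace(0.0, 1.0, grid)
    losses = [pinball(w * q1 + (1 - w) * q2, taus, ys[:, None]).sum()
              for w in ws]
    return ws[int(np.argmin(losses))]

rng = np.random.default_rng(0)
taus = np.array([0.25, 0.5, 0.75])
ys = rng.normal(100.0, 10.0, size=50)                  # observed values
good = ys[:, None] + np.array([-7.0, 0.0, 7.0])        # well-centered model
bad = good + rng.normal(0.0, 40.0, size=good.shape)    # same model plus noise
w = fit_simplex_weight(good, bad, taus, ys)
print(w)  # close to 1: nearly all weight goes to the cleaner model
```

With more than two components, the same objective is a convex problem over the probability simplex, which is what quantgen solves via linear programming.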

For reference, we also consider a baseline forecaster that propagates the last observed incidence forward as the point prediction (the “flat-line” forecaster), with fanning uncertainty based on quantiles of symmetrized week-to-week historical differences. (This is the baseline method implemented in the Hub.)
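A simplified sketch of such a baseline, restricted to the 1-week-ahead case (our own illustrative code; the Hub's implementation also handles longer horizons, where the uncertainty fans out further):

```python
import numpy as np

def baseline_quantiles(history, taus):
    """Flat-line baseline, 1 week ahead: the point forecast is the last
    observation, and uncertainty comes from quantiles of the symmetrized
    week-to-week differences in the history."""
    history = np.asarray(history, dtype=float)
    diffs = np.diff(history)
    sym = np.concatenate([diffs, -diffs])  # symmetrize around zero
    q = history[-1] + np.quantile(sym, taus)
    return np.maximum.accumulate(q)        # guard against quantile crossing

weekly_deaths = [10.0, 12.0, 11.0, 15.0, 14.0]
taus = [0.025, 0.25, 0.5, 0.75, 0.975]
q = baseline_quantiles(weekly_deaths, taus)
# Because the differences are symmetrized, the median forecast equals
# the last observation (14.0).
print(q)
```

Symmetrizing the differences centers the predictive distribution on the last observation, which is what makes this a "flat-line" forecaster.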

Ensemble comparisons

Figure 2 illustrates these three approaches for an example of developing a forecast distribution for California. In this example, some outlying component models produce extreme forecast quantiles, particularly in the upper tail. These outliers have a large effect on the mean ensemble forecast, but a smaller impact on the median ensemble and the trained ensemble. The median is inherently robust to outliers, while the trained ensemble is protected because the models generating the outlying submissions have been assigned small weights based on their recent performance.

Fig. 2. Illustration of the three ensemble methods for forecasting incident deaths in California at a forecast horizon of 1 week from June 13, 2020. Each line corresponds to the forecast distribution from one component model or ensemble, and is obtained by interpolating between the 23 predictive quantiles; the resulting curves approximate the predictive CDFs associated with these forecasts. At each quantile level along the vertical axis, the ensemble forecasts are obtained as a combination of the component model forecasts at that quantile level.

Figure 3 compares the mean WIS of the ensemble methods and the baseline across forecast horizons and forecast times. Comparing the forecasters against one another, we make the following observations:

  • QuantMedian and QuantTrained “strictly dominate” the baseline in the sense that their mean WIS is comparable or better for every forecast horizon and time, with substantial differences in some forecast horizon-time combinations.
  • QuantMean “nearly dominates” the baseline in that only a small number of forecast horizon-time pairs (4-week-ahead forecasts for two dates in August) prevent it from strictly dominating the baseline, while at the same time showing substantial improvements for many other horizon-time combinations.
  • QuantMedian and QuantTrained nearly dominate QuantMean, with QuantMedian slightly more consistent (i.e., having more forecast time-horizon combinations where it outperforms QuantMean, although fewer with larger improvements).
  • Compared head to head, QuantMedian and QuantTrained appear competitive with each other, with neither dominating nor nearly dominating the other across all forecast horizon-time combinations.

Fig. 3. Mean weighted interval score by forecast horizon and forecast time for three ensemble methods and the baseline; each point averages WIS over all locations. Evaluations are limited to test instances where forecasts are available for all four systems. Different sets of weeks are shown for each horizon because additional weeks of observed data are needed to provide inputs for the QuantTrained model at longer horizons and to provide ground-truth evaluation data for all models for the forecasts made in late September and early October.

These findings agree with a variety of other forecast comparisons (not shown here due to space constraints) based on other methods of comparing aggregated WIS.

Consistent with the dominance relations noted above, Figure 3 visually suggests that QuantMedian may be more consistent in performance than the two mean-based ensembles: it does not display some of the larger swings in mean WIS from one forecast date to the next seen for QuantTrained and QuantMean in the available test data. QuantMean appears the least robust in this sense.

Repeating the same type of performance comparisons against individual ensemble components with top-ranking mean WIS over this period rather than the baseline, we find that QuantMedian and QuantTrained have better overall WIS, with QuantMedian nearly dominating these components and QuantTrained somewhat less consistent; see Figure 4. That ensemble forecasters tend to outperform all other individual forecasters is consistent with recent findings of other authors (Gu, 2020; Burant, 2020; McConnell, 2020). Compared instead against several other experimental ensemble forecasters, QuantMedian and QuantTrained are either competitive or dominate these alternatives. Thus, QuantMedian and QuantTrained are “admissible” in the sense that, in the pool of Hub forecasters and several ensemble methods, their performance is not dominated by any alternative.

Fig. 4. Mean weighted interval score by forecast horizon and forecast time for two ensemble methods and two top-ranking individual components (in terms of mean WIS); each point averages WIS over all locations. Evaluations are limited to test instances where forecasts are available for all four systems, resulting in a slightly different test set than in Figure 3, due to a different set of forecasters being analyzed. Different sets of weeks are shown for each horizon because additional weeks of observed data are needed to provide ground-truth evaluation data at longer horizons for the forecasts made in late September and early October.


The performance comparisons above indicate that both QuantMedian and QuantTrained would fulfill the purpose of providing a single set of forecasts with good performance relative to their components. The simplicity and greater robustness of the QuantMedian approach support its current use (as of late October 2020) as the “official” ensemble implemented by the Hub and visualized on the CDC COVID-19 Forecasting website (Figure 1) (Centers for Disease Control and Prevention, 2020). The performance gains from moving beyond an untrained mean to either a trained mean or an untrained median motivate investigating a trained median approach, and in updated experiments over the last few weeks, such a method has started to show competitive results. However, a broad range of approaches explored alongside the particular QuantTrained method above failed to show significant improvements over the simple QuantMedian method (details not shown in this post), suggesting that in the early months of forecasting, when limited historical data were available and component models were rapidly evolving, there may truly have been limited room for improvement. Additional investigations are ongoing to identify the central obstacles to improving ensemble accuracy (e.g., variability in individual model performance over time, lack of a predictable signal in the observed data, lack of sufficient training data) and to determine which approaches show the most promise in the coming months.


1 For example, a change in the criteria used to identify deaths associated with COVID-19, from reporting only deaths with a laboratory confirmed diagnosis of COVID-19 to additionally including deaths that are likely attributable to COVID-19.
2 Due to, e.g., recalculations of cumulative counts using new reporting criteria.
3 This is coupled with discounting coefficients assigned to forecasters with higher amounts of missingness; this weight redistribution process can result in different model weights for different locations depending on missingness patterns.
4 To decrease the number of free parameters, as a form of regularization, we actually constrain the 23 weights per model so that the bottom 4, middle 15, and top 4—as ordered by quantile level—are equal.



Acknowledgments

The authors acknowledge funding from the U.S. Centers for Disease Control and Prevention and collaborative support from CDC colleagues, in particular from Michael Johansson, Matthew Biggerstaff, Rachel Slayton, Velma Lopez, and Jo Walker. The findings and conclusions in this report are those of the authors and do not necessarily represent the views of the CDC.



References

Johannes Bracher, Evan L. Ray, Tilmann Gneiting, and Nicholas G. Reich. Evaluating epidemic forecasts in an interval format, 2020. URL https://arxiv.org/abs/2005.12881.

Logan C Brooks, David C Farrow, Sangwon Hyun, Ryan J Tibshirani, and Roni Rosenfeld. Nonmechanistic forecasts of seasonal influenza with iterative one-week-ahead distributions. PLoS Computational Biology, 14(6): e1006134, 2018.

John Burant. Covid-19 power rankings, 2020. URL https://twitter.com/JohnBurant/status/1317044535709630466. Accessed October 16, 2020.

Centers for Disease Control and Prevention. COVID-19 forecasts: Deaths, 2020. URL https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html. Accessed October 16, 2020.

Natalie E Dean, Ana Pastore y Piontti, Zachary J Madewell, Derek AT Cummings, Matthew DT Hitchings, Keya Joshi, Rebecca Kahn, Alessandro Vespignani, M Elizabeth Halloran, and Ira M Longini Jr. Ensemble forecast modeling for the design of COVID-19 vaccine efficacy trials. Vaccine, 38(46): 7213-7216, 2020.

Delphi Group at Carnegie Mellon University. Delphi Epidemiological Data API. https://github.com/cmu-delphi/delphi-epidata, Accessed October 11, 2020.

Ensheng Dong, Hongru Du, and Lauren Gardner. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases, 20(5): 533-534, 2020.

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477): 359-378, 2007.

Youyang Gu. COVID-19Projections.com: Historical performance, 2020. URL https://covid19-projections.com/about/#historical-performance. Accessed October 16, 2020.

Michael A Johansson, Karyn M Apfeldorf, Scott Dobson, Jason Devita, Anna L Buczak, Benjamin Baugher, Linda J Moniz, Thomas Bagley, Steven M Babin, Erhan Guven, et al. An open challenge to advance probabilistic forecasting for dengue epidemics. Proceedings of the National Academy of Sciences, 116(48): 24268-24274, 2019.

Marc Lipsitch, Lyn Finelli, Richard T Heffernan, Gabriel M Leung, and Stephen C Redd; for the 2009 H1N1 Surveillance Group. Improving the evidence base for decision making during a pandemic: the example of 2009 influenza A/H1N1. Biosecurity and Bioterrorism: Biodefense Strategy, Practice, and Science, 9(2): 89-115, 2011.

Steve McConnell. COVID-19 CDC model forecast scorecards, 2020. URL https://stevemcconnell.com/covid-19-cdc-forecast-model-scorecards/. Accessed October 16, 2020.

Craig J McGowan, Matthew Biggerstaff, Michael Johansson, Karyn M Apfeldorf, Michal Ben-Nun, Logan Brooks, Matteo Convertino, Madhav Erraguntla, David C Farrow, John Freeze, et al. Collaborative efforts to forecast seasonal influenza in the United States, 2015–2016. Scientific Reports, 9(1): 1-13, 2019.

Evan L Ray and Nicholas G Reich. Prediction of infectious disease epidemics via weighted density ensembles. PLoS Computational Biology, 14(2): e1005910, 2018.

Evan L Ray, Nutcha Wattanachit, Jarad Niemi, Abdul Hannan Kanji, Katie House, Estee Y Cramer, Johannes Bracher, Andrew Zheng, Teresa K Yamana, Xinyue Xiong, Spencer Woody, Yuanjia Wang, Lily Wang, Robert L Walraven, Vishal Tomar, Katherine Sherratt, Daniel Sheldon, Robert C Reiner, B. Aditya Prakash, Dave Osthus, Michael Lingzhi Li, Elizabeth C Lee, Ugur Koyluoglu, Pinar Keskinocak, Youyang Gu, Quanquan Gu, Glover E George, Guido España, Sabrina Corsetti, Jagpreet Chhatwal, Sean Cavany, Hannah Biegel, Michal Ben-Nun, Jo Walker, Rachel Slayton, Velma Lopez, Matthew Biggerstaff, Michael A Johansson, Nicholas G Reich, and COVID-19 Forecast Hub Consortium. Ensemble forecasts of coronavirus disease 2019 (COVID-19) in the U.S. medRxiv, 2020. doi: 10.1101/2020.08.19.20177493. URL https://www.medrxiv.org/content/early/2020/08/22/2020.08.19.20177493.

Nicholas G Reich, Craig J McGowan, Teresa K Yamana, Abhinav Tushar, Evan L Ray, Dave Osthus, Sasikiran Kandula, Logan C Brooks, Willow Crawford-Crudell, Graham Casey Gibson, et al. Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the U.S. PLoS Computational Biology, 15(11): e1007486, 2019.

Nicholas G Reich, Jarad Niemi, Katie House, Abdul Hannan, Estee Cramer, Steve Horstman, Shanghong Xie, Youyang Gu, Nutcha Wattanachit, Johannes Bracher, Serena Yijin Wang, Casey Gibson, Spencer Woody, Michael Lingzhi Li, Robert Walraven, Xinyu Zhang, Xinyue Xiong, Hannah Biegel, Lauren Castro, Elizabeth Lee, Arden Baxter, Sangeeta Bhatia, Evan Ray, Andrea Brennen, and ERDC CV19 Modeling Team. reichlab/covid19-forecast-hub: pre-publication snapshot, 2020. URL http://dx.doi.org/10.5281/zenodo.3963372.

Kathryn S Taylor and James W Taylor. A comparison of aggregation methods for probabilistic forecasts of COVID-19 mortality in the United States. arXiv preprint arXiv:2007.11103, 2020.

Ryan J Tibshirani. quantgen: Tools for generalized quantile modeling, 2020. URL https://github.com/ryantibs/quantgen.

Cecile Viboud, Kaiyuan Sun, Robert Gaffey, Marco Ajelli, Laura Fumanelli, Stefano Merler, Qian Zhang, Gerardo Chowell, Lone Simonsen, Alessandro Vespignani, et al. The RAPIDD ebola forecasting challenge: Synthesis and lessons learnt. Epidemics, 22: 13-21, 2018.
