Summary, introduction, data sources, accounting for delays, statistical methods, summary of results and limitations.
Our methods are also outlined in this preprint.
Authors: Sam Abbott *, Joel Hellewell *, Robin N Thompson, Katharine Sherratt, Hamish P Gibbs, Nikos I Bosse, James D Munday, Sophie Meakin, Emma L Doughty, June Young Chun, Yung-Wai Desmond Chan, Flavio Finger, Paul Campbell, Akira Endo, Carl A B Pearson, Amy Gimma, Tim Russell, CMMID COVID modelling group, Stefan Flasche, Adam J Kucharski, Rosalind M Eggo, Sebastian Funk
* contributed equally
Background: Interventions are now in place worldwide to reduce transmission of the novel coronavirus. Assessing temporal variations in transmission in different countries is essential for evaluating the effectiveness of public health interventions and the impact of changes in policy.
Methods: We use case notification data to generate daily estimates of the time-dependent reproduction number in different regions and countries. Our modelling framework, based on open source tooling, accounts for reporting delays, so that temporal variations in reproduction number estimates can be compared directly with the times at which interventions are implemented.
Results: We provide three example uses of our framework. First, we demonstrate how the toolset displays temporal changes in the reproduction number. Second, we show how the framework can be used to reconstruct case counts by date of infection from case counts by date of notification, as well as to estimate the reproduction number. Third, we show how maps can be generated to clearly show if case numbers are likely to decrease or increase in different regions. Results are shown for regions and countries worldwide on our website (https://epiforecasts.io/covid/) and are updated daily. Our tooling is provided as an open-source R package to allow replication by others.
Conclusions: This decision-support tool can be used to assess changes in virus transmission in different regions and countries worldwide. This allows policymakers to assess the effectiveness of current interventions, and will be useful for inferring whether or not transmission will increase when interventions are lifted. As well as providing daily updates on our website, we also provide adaptable computing code so that our approach can be used directly by researchers and policymakers on confidential datasets. We hope that our tool will be used to support decisions in countries worldwide throughout the ongoing COVID-19 pandemic.
Keywords: Covid-19, SARS-CoV-2, surveillance, forecasting, time-varying reproduction number
The coronavirus disease 2019 (COVID-19) pandemic that emerged in December 2019 has since spread to over 100 countries in every continent except Antarctica. While some information on the progress of an outbreak in a given country can be gained from the reported numbers of confirmed cases and deaths, these numbers can obscure changes in the underlying dynamics of the outbreak due to delays between infection and the eventual reporting of a case or death. Accounting for the delays from infection to symptom onset, and delays from symptom onset to hospital admission, diagnostic testing or potential death, followed by further delays until data are recorded in official statistics, requires the use of specific statistical methods for handling right-truncated data
Another method for tracking the progress of an outbreak is measuring changes in the time-varying reproduction number (effective reproduction number), which represents the average number of secondary infections generated by each new infectious case
This paper details the methods we have developed for nowcasting and forecasting global time-varying reproduction number estimates, which are presented on a regularly updated website (https://epiforecasts.io/covid/). We first estimate cases by date of infection based on reported cases, accounting for right truncation and uncertainty in the reporting delay and incubation period. We then estimate the time-varying reproduction number and use an ensemble of time series models to forecast future changes in the reproduction number by extrapolating underlying temporal trends. We then reverse the process of estimating the reproduction number from cases by date of infection to derive forecasts of future reported cases by date of infection. Our estimates use reports of confirmed cases at the national, or sub-national, level that are extracted from publicly available repositories. This work builds on previously published tools
We use daily counts of confirmed cases reported by the European Centre for Disease Control for all analyses conducted at the national level
For sub-national analyses, the source of the data is reported on the respective page on our website. The data are fetched from government departments or from individuals who maintain a data source if no official data are available. Subnational entities within countries are only reported if at least 40 cases have been reported in a single day. A lower limit is possible for sub-national compared to national data due to more consistent case reporting in the source datasets.
All analyses described below are run independently for each national or subnational entity under consideration.
To estimate the reporting delay with appropriate uncertainty, we fit exponential and gamma distributions to 100 subsampled bootstraps (each with 250 samples drawn with replacement) of the delay between symptom onset and case notification, accounting for left and right censoring occurring in the data as each date is rounded to the nearest day and truncated to the maximum observed delay. We fit each model in the statistical modelling program stan
The distribution that gave the lowest LOOIC was selected as the most appropriate and 10 samples of the fitted distribution parameters were then drawn per bootstrap (giving 1000 in total). For a given country, we used sample from the posterior distribution of delay distribution parameters, , to draw a sample of delays, , to transform each observed notification date, , into a sample onset date, , as follows:
,
where or
This resulted in 1000 date of onset samples for each confirmed case. For countries/regions with high case loads (more than 10,000 reported cases) the sampling step is approximated using the probability density function of the reporting delay.
When moving from notification dates to onset dates it is important to consider that the total number of confirmed cases lags behind the number of cases that have onset, since there is a delay between onset occurring and the case being counted upon notification. To account for this right truncation, we used binomial upscaling to increase the estimated numbers of case onsets close to the present. After transforming the observed notification dates to onset dates and tallying case onset numbers by day, we then drew a sample of the number of case onsets that occurred but have not yet been confirmed.
If is the last date on which cases were reported then the number of onsets on day , denoted as , is regarded as the result of a Bernoulli trial from the total true number of cases with symptom onsets on day . is then distributed according to a negative binomial distribution as follows:
where is the cumulative distribution function for the given delay distribution. It gives the proportion of onset cases from days ago that are expected to have been confirmed over the days from that time until the present. The final numbers of case onsets that were used to estimate the time varying reproduction numbers for day are consequently given by . As our approach could not fully reconstruct unreported cases without bias we truncated our results and did not use estimates from the last 3 days. To prevent spurious estimates we truncated the allowed amount of upscaling to be less than 10 times the reported cases on any given day.
We repeated the steps outlined above for adjusting for the delay from report to symptom onset to account for the delay between symptom onset and infection by replacing the bootstrapped report delay distribution with an incubation period distribution with a mean of 5 days
We used the EpiEstim R package
We incorporated uncertainty in the generation time distribution by providing EpiEstim with 1000 samples each derived using a different sample of the log mean and log standard deviation of the assumed log normal distribution. We evaluated the reproduction number by assuming that it is constant over a backwards looking sliding time window
The estimates of the time-varying reproduction number at each time point were combined over 1000 samples, using the optimal window for each, to give a credible interval that incorporates uncertainty from the delay from case onset to notification, the incubation period and the generation time.
We defined the estimated change in daily cases to correspond to the proportion of reproduction number estimates for the current day that are below 1 (the value at which an outbreak is in decline). It was assumed that if less than 5% of samples were subcritical then an increase in cases was definite, if less than 20% of samples were subcritical then an increase in cases was likely, if more than 80% of samples were subcritical then a decrease in cases was likely and if more than 95% of samples were subcritical then a decrease in cases was definite. For countries/regions with between 20% and 80% of samples being subcritical we could not make a statement about the likely change in cases (defined as unsure).
As another metric of outbreak progression, we estimated the rate of spread () using a quasipoisson regression model
The results presented here are sensitive to changes in COVID-19 testing practices and the level of effort put into detecting COVID-19 cases, e.g. through contact tracing. For example, if numbers of incident infections remain constant but a country begins to find and report a higher proportion of cases, then an increasing value of the reproduction number will be inferred. This is because all changes in the number of cases are attributed to changes in the number of infections resulting from previously reported cases, and are not assumed to be a result of improved testing and surveillance. On the other hand, if a country reports a lower proportion of cases because a lower number of tests are performed (which can happen if reagents required for testing are no longer available, for example) or the surveillance system captures a lower proportion of infections, then the model will attribute this to a drop in the reproduction number that may not be a true reduction. In order for our estimates to be unbiased not all cases have to be reported, but the level of testing effort (and therefore the proportion of detected cases) must be constant
Countries may also change the focus of their surveillance over the course of the outbreak. They may initially focus on identifying travellers returning from areas of known COVID-19 transmission and performing contact tracing on the contacts of known cases. As the outbreak evolves this may change to passive surveillance at hospitals. Here, the case definition may also change from tests based on polymerase chain reaction (PCR) to diagnoses based on symptoms and computed tomography (CT) scans. In the future, different kinds of COVID-19 tests may be deployed that could influence results, such as tests that detect both active and past infections.
We forecast the time-varying effective reproduction number over a 14 day time horizon using the best performing ensemble of time series models
We report the median and 90% highest density credible intervals for all measures with 50% and 90% high density regions shown in figures. The analysis was conducted independently for all regions and is updated regularly as new data becomes available. As our credible intervals do not capture the proportion of cases that have been upscaled (when correcting for right truncation), we represent this in figures using translucency. This is presented as our confidence in the estimates which we define as the proportion of symptom onsets that are expected to have been reported by the date of estimation. All results are available in the source repository in a comma-separated values file.
Daily updated estimates of the time-varying reproduction number, epidemic doubling time, and rate of spread at the national level are given for more than 90 countries on our website. New countries are being added as data become available. An example plot created on the 23rd of May 2020, showing the numbers of cases by date of infection for Austria and the inferred time-varying reproduction number, is shown in Figure 1.
Figure 1: Confirmed cases by date of report (bars; top) and their estimated date of infection (ribbon; top) and time-varying reproduction number (bottom) in Austria. The light and dark blue ribbons show the 90% and 50% credible interval, respectively. The estimates were generated on the 23rd May 2020. Due to the delay between being infectious and becoming a confirmed case, the estimates lag behind the present. Confidence in the estimated values is indicated by translucency with increased translucency corresponding to increased uncertainty deriving from right-truncation of reported cases in the estimate. The forecast (red) is coloured differently from the nowcast (blue) to indicate the drop in reliability when trying to forecast future cases.
Sub-national breakdowns can highlight differences in how the outbreak is progressing within a country (Figure 2) and are currently provided for 6 countries (Italy, Germany, United Kingdom, United States of America, Brazil, India). For example, on the 23rd of May 2020 we estimated that cases were likely decreasing in every country in the United Kingdom except Northern Ireland where the estimate was classified as unsure.
Figure 2: Estimates of numbers of cases that will be newly infected on the date the estimate was made and that will end up being reported (top panel), and the time-varying reproduction number (bottom panel) across different nations/regions of the United Kingdom. Estimates were produced on 23rd of May 2020. Nations/regions with fewer than 40 confirmed cases reported on a single day are not included in the analysis.
We present the nowcasting results on a map to effectively visualise regional differences in transmission (Figure 3). This helps identify areas where intervention policies are more effective at reducing transmission than others, which can inform decision-making going forwards.
Figure 3: A map of the estimates of the expected change in daily cases at the state level for the United States of America. Estimates were produced on 23rd of May 2020. Regions with fewer than 40 confirmed cases reported on a single day are not included in the analysis (light grey).
The methodology and toolset described here has also been used separately to produce estimates of the time-varying reproduction number at the state level in Australia, an example of how researchers and policymakers can apply the methods to their own data
We provide a centralised resource, which generates comparable daily estimates of the time-varying reproduction number and a daily nowcast of the number of cases newly infected derived using a standardised method. The estimates are free of any hypotheses about the impact of interventions, since they are derived only from reported case counts and an estimate of the generation time. We explicitly account for the delay between infection and case notification and include all sources of quantifiable uncertainty. This resource may be useful for policymakers to track the progression of the COVID-19 outbreak and evaluate the effectiveness of intervention measures. As new data become available, we will include sub-national estimates for additional countries, and provide additional support for public health agencies or researchers interested in applying our methods to their data. We routinely utilise our own tooling to provide estimates of the reproduction number in the United Kingdom for policymakers using confidential hospital admissions data.
There are several advantages associated with our approach. Firstly, reported case counts are the only data required, which allows our approach to be used in a wide variety of contexts. Secondly, we apply the same methodology to all countries. This means that estimates can be compared without having to consider differences in the underlying methodology (even if differences in testing should still be accounted for as discussed below). Finally, we have constructed our approach using open source tools and all of our code, raw data, and results are available online. This means our approach can be applied by others to non-public data and be fully evaluated by end users.
Our approach is also subject to several limitation. Firstly, the model requires that the proportion of infections that are notified is constant. In other words, it requires consistency in the focus of the surveillance method, level of effort spent on testing, and case definition. Yet it is often the case that the level of under-reporting in a country changes over the course of an outbreak
In addition, the model is limited by how representative the delay that we use from infection to notification distribution is for a given location. As there is limited data to assess this, we estimate a bootstrapped global delay distribution using the combined data from every country. In particular, the delay from onset to notification can especially impact the upscaling of cases by date of onset that accounts for cases that have onset but not yet been reported. If the true delay from onset to notification for a given country is shorter than our global delay, then we will overestimate onset case numbers, and vice versa for true delays longer than the distribution we used. Additionally, estimates of the reporting delay distribution are known to be biased early in an epidemic and may vary over time
Our model is also limited by the data avaliable to us. For example, the publically available linelists contain little data on the importation status of cases. This means that cases counts may be biased upwards by attributing imported cases to local transmission. This bias is particularly problematic when case counts are low. Unfortunately, in the absence of data, this issue can only be explored via scenario analysis. However, if and when data on the importation of cases is available, our approach (via EpiEstim
As more data becomes available, future work should look to refine the distributions used for generation time, incubation period, and the report delay. There is also the potential to extend the present model to account for overdispersion in the number of secondary infections
Our approach, providing real-time estimates of the reproduction number, serves as a valuable tool for decision makers looking to track the course of COVID-19 outbreaks. The nowcasts explicitly account for delays, using the same methodology across all countries and sub-national regions. These reproduction number estimates can be used during the initial stages of an outbreak to ascertain the likely outbreak trajectory if no interventions have been implemented. They can also provide real-time feedback on whether transmission is decreasing following a particular intervention, or whether it is increasing following the relaxing or lifting of current intervention measures. We hope that our website and the related toolkit will provide a valuable resource for devising strategies to contain COVID-19 outbreaks worldwide.
Latest data: https://github.com/epiforecasts/covid
Archived data at the time of publication: https://doi.org/10.5281/zenodo.3841818
License: MIT
License: MIT
This project was enabled through access to the MRC eMedLab Medical Bioinformatics infrastructure, supported by the Medical Research Council (MR/L016311/1). Additional compute infrastructure and support was provided by the Met office. We thank Venexia Walker for comments on a version of this draft. The following authors were part of the Centre for Mathematical Modelling of Infectious Disease 2019-nCoV working group. Each contributed in processing, cleaning and interpretation of data, interpreted findings, contributed to the manuscript, and approved the work for publication: Samuel Clifford, Mark Jit, Stéphane Hué, Eleanor M Rees, Petra Klepac, Damien C Tully, Rachel Lowe, Kathleen O’Reilly, Nicholas G. Davies, Quentin J Leclerc, Arminder K Deol, Gwenan M Knight, C Julian Villabona-Arenas, Fiona Yueqian Sun, Emily S Nightingale, Alicia Rosello, Adam J Kucharski, Yang Liu, Billy J Quilty, Matthew Quaife, Jon C Emery, Katherine E. Atkins, Simon R Procter, W John Edmunds, Megan Auzenbergs, Christopher I Jarvis, David Simons, Kiesha Prem, Graham Medley, Thibaut Jombart, Charlie Diamond, Anna M Foss, Rein M G J Houben, Kevin van Zandvoort, Georgia R Gore-Langton.
The following funding sources are acknowledged as providing funding for the named authors. Alan Turing Institute (AE). This research was partly funded by the Bill & Melinda Gates Foundation (NTD Modelling Consortium OPP1184344: CABP). DFID/Wellcome Trust (Epidemic Preparedness Coronavirus research programme 221303/Z/20/Z: CABP). This research was partly funded by the Global Challenges Research Fund (GCRF) project ‘RECAP’ managed through RCUK and ESRC (ES/P010873/1: AG). HDR UK (MR/S003975/1: RME). Nakajima Foundation (AE). UK DHSC/UK Aid/This research was partly funded by the National Institute for Health Research (NIHR) using UK aid from the UK Government to support global health research. The views expressed in this publication are those of the author(s) and not necessarily those of the NIHR or the UK Department of Health and Social Care (ITCRZ 03010: HPG). UK MRC (MC_PC 19065: RME). Wellcome Trust (206250/Z/17/Z: TWR; 208812/Z/17/Z: SFlasche; 210758/Z/18/Z: JDM, JH, NIB, SA, SFunk, SRM).
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/epiforecasts/covid, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".