Estimating $R_e$ and overdispersion in secondary cases from the size of identical sequence clusters of SARS-CoV-2
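This repository contains the code of the statistical analysis of the paper "Estimating $R_e$ and overdispersion in secondary cases from the size of identical sequence clusters of SARS-CoV-2".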
The aim of this repository is to provide everything necessary to reproduce the statistical analysis of the paper cited above. All files (R-scripts as well as data) used to obtain the results are contained in this repository. Furthermore, the complete simulated data used for the validation of the model as well as all stanfit files containing the results of the parameter estimation can be found in this repository.
- The whole R code is structured in an R-project (`R_overdispersion_cluster_size.Rproj`).
- Before running any other R file, the file `setup.R` (contained in folder `R`) needs to be run. In this file, all paths to data and results files are defined (with respect to the path of `R_overdispersion_cluster_size.Rproj`).
- R files are grouped by topic (data processing, creating plots, ...).
- Running the file `main.R` (contained in folder `R`) calls all R scripts necessary to redo the processing of data and results as well as the creation of plots and tables. Data and result files need to be stored at the paths defined in `setup.R`.
- Parameter estimation, both from simulated data and from data from Switzerland, Denmark and Germany, has been run on the high performance computing cluster of the University of Bern, UBELIX.
- Simulation of clusters for the simulation study and for the posterior predictive check has been run on the high performance computing cluster of the University of Bern, UBELIX.
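The run order described in the list above (first `setup.R`, then `main.R`) can be sketched as follows. This is a minimal sketch and not part of the repository: the helper name `run_analysis` is hypothetical, and it assumes that `project_dir` is the folder containing `R_overdispersion_cluster_size.Rproj`.

```r
# Hypothetical helper (not part of the repository):
# source setup.R before main.R and fail early if either script is missing.
run_analysis <- function(project_dir = ".") {
  scripts <- file.path(project_dir, "R", c("setup.R", "main.R"))
  missing <- scripts[!file.exists(scripts)]
  if (length(missing) > 0) {
    stop("Script(s) not found: ", paste(missing, collapse = ", "))
  }
  # setup.R comes first because it loads the packages and defines all paths
  for (script in scripts) {
    source(script)
  }
  invisible(scripts)
}
```

Calling `run_analysis()` from the project folder would then be equivalent to sourcing the two files by hand in the required order.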
The simulation of identical sequence clusters as well as the models to estimate parameters from the sequence cluster size distribution are implemented as functions in an R-package, called estRodis. The estRodis package can be found here: GitHub Martin Wohlfender: estRodis
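To give an idea of what simulating an identical sequence cluster involves, here is a minimal base R sketch. It is not the estRodis implementation and does not use the estRodis API. It assumes a branching process in which the number of secondary cases is negative binomial with mean `r_e` and dispersion `k`, each secondary case keeps the identical sequence with probability `1 - p_mut`, and each case in a cluster is sequenced independently with probability `p_seq`. All function and parameter names are hypothetical.

```r
# Illustrative sketch only; function and parameter names are hypothetical
# and do not correspond to functions of the estRodis package.
simulate_cluster_size <- function(r_e, k, p_mut, max_size = 10000L) {
  n_cases <- 1L   # the cluster starts with one index case
  active <- 1L    # cases whose secondary cases have not been drawn yet
  while (active > 0L && n_cases < max_size) {
    # secondary cases of all active cases (negative binomial, mean r_e, dispersion k)
    offspring <- sum(rnbinom(active, size = k, mu = r_e))
    # only secondary cases without a mutation stay in the identical sequence cluster
    identical_cases <- rbinom(1L, size = offspring, prob = 1 - p_mut)
    n_cases <- n_cases + identical_cases
    active <- identical_cases
  }
  # the cap keeps the loop finite when the process is supercritical
  min(n_cases, max_size)
}

# Thin the true cluster sizes by sequencing; clusters with no sequenced case are unobserved.
observe_cluster_sizes <- function(sizes, p_seq) {
  observed <- rbinom(length(sizes), size = sizes, prob = p_seq)
  observed[observed > 0]
}

set.seed(1)
true_sizes <- replicate(1000, simulate_cluster_size(r_e = 0.8, k = 0.3, p_mut = 0.3))
observed_sizes <- observe_cluster_sizes(true_sizes, p_seq = 0.2)
table(observed_sizes)
```

The sketch advances one generation at a time, which is enough when only the final cluster size matters and the timing of cases is not needed.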
The structuring of sequence data into clusters of identical sequences was done by Emma Hodcroft. Her code can be found here: GitHub Emma Hodcroft: sc2_rk_public
Whenever "model five" is mentioned in comments in the code, this refers to the standard model presented in the paper. "model one", "model two", "model three", "model four" and "model six" refer to the alternative models described in the section "Sensitivity analysis" of the supplementary material of the paper "Estimating $R_e$ and overdispersion in secondary cases from the size of identical sequence clusters of SARS-CoV-2".
Load all necessary R-packages and define paths. setup.R needs to be run first.
Contains all steps needed to process data and results and to create figures and tables.
Custom functions for creating plots. All files in this folder are sourced when running setup.R.
All R-scripts covering the processing of data and results.
(a) model one
- `01_sim_setup_model_one.R`: setting up the simulation study: define parameter combinations for which clusters of identical sequences shall be simulated
- `02_sim_simulate_data_model_one_parallel.R`: simulation of identical sequence clusters (run in parallel on the high performance computing cluster of the University of Bern, UBELIX)
- `03_sim_estimate_parameters_model_one_parallel.R`: estimation of parameters from simulated data using model one (run in parallel on the high performance computing cluster of the University of Bern, UBELIX)
(b) model five
- `01_sim_setup_model_five.R`: setting up the simulation study: define parameter combinations for which clusters of identical sequences shall be simulated
- `02_sim_simulate_data_model_five_parallel.R`: simulation of identical sequence clusters (run in parallel on the high performance computing cluster of the University of Bern, UBELIX)
- `03_sim_estimate_parameters_model_five_parallel.R`: estimation of parameters from simulated data using model five (run in parallel on the high performance computing cluster of the University of Bern, UBELIX)
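As an illustration of the setup step (defining parameter combinations) and of how a parallel task can pick its own combination, consider the sketch below. It is not taken from the repository scripts. The parameter names, the values and the helper `get_task_parameters` are hypothetical, and reading the task number from the environment variable `SLURM_ARRAY_TASK_ID` is an assumption about the scheduler, not something stated in this repository.

```r
# Hypothetical parameter names and values, chosen only for illustration.
parameter_grid <- expand.grid(
  r_e = c(0.7, 0.9, 1.1),
  k = c(0.1, 0.5, 1.0),
  p_mutation = c(0.2, 0.3),
  p_sequencing = c(0.1, 0.5)
)
parameter_grid$index <- seq_len(nrow(parameter_grid))

# Select the parameter combination handled by one parallel task.
# Assumption: the scheduler exposes the task number in SLURM_ARRAY_TASK_ID;
# the value "1" is used when the variable is not set.
get_task_parameters <- function(grid,
                                task_id = Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1")) {
  task_id <- as.integer(task_id)
  stopifnot(!is.na(task_id), task_id >= 1, task_id <= nrow(grid))
  grid[task_id, , drop = FALSE]
}

get_task_parameters(parameter_grid, task_id = 5)
```

Storing an explicit `index` column next to the parameters makes it easy to match each output file back to the combination that produced it.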
All R-scripts to estimate parameters from data from Switzerland, Denmark and Germany using the main model (model five) or the alternative models (models one, two, three, four and six). These files were run in parallel on the high performance computing cluster of the University of Bern, UBELIX.
(a) model one
- `01_ppc_model_one_setup.R`: setting up the posterior predictive check: define parameter combinations for which clusters of identical sequences shall be simulated
- `02_ppc_model_one_simulations_parallel.R`: simulation of identical sequence clusters (run in parallel on the high performance computing cluster of the University of Bern, UBELIX)
(b) model five
- `01_ppc_model_five_setup.R`: setting up the posterior predictive check: define parameter combinations for which clusters of identical sequences shall be simulated
- `02_ppc_model_five_simulations_parallel.R`: simulation of identical sequence clusters (run in parallel on the high performance computing cluster of the University of Bern, UBELIX)
All R-scripts covering the creation of figures (contained in paper and supplementary material).
All R-scripts covering the creation of overview tables of data and results (contained in supplementary material).
R-scripts used to do some minor extra analysis.
- `clusters_months.R`: check how many clusters extend across more than one month
- `summary_statistics.R`: basic description of cluster data set (number of identical sequence clusters, number of cases in identical sequence clusters, fraction of identical sequence clusters that are of size one and average cluster size)
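The four summary statistics listed above can be computed from a table of cluster sizes as in the following sketch. It is not the content of `summary_statistics.R`. The function name and the column names `size` and `frequency` are hypothetical, and the example data are made up for illustration.

```r
# Hypothetical input format: one row per cluster size,
# with the number of clusters of that size in column "frequency".
summarise_clusters <- function(cluster_sizes) {
  stopifnot(all(c("size", "frequency") %in% names(cluster_sizes)))
  n_clusters <- sum(cluster_sizes$frequency)
  n_cases <- sum(cluster_sizes$size * cluster_sizes$frequency)
  n_size_one <- sum(cluster_sizes$frequency[cluster_sizes$size == 1])
  data.frame(
    n_clusters = n_clusters,                      # number of identical sequence clusters
    n_cases = n_cases,                            # number of cases in those clusters
    fraction_size_one = n_size_one / n_clusters,  # share of clusters of size one
    mean_cluster_size = n_cases / n_clusters      # average cluster size
  )
}

# Made-up example data: 6 clusters of size 1, 3 of size 2, 1 of size 5
example_clusters <- data.frame(size = c(1, 2, 5), frequency = c(6, 3, 1))
summarise_clusters(example_clusters)
```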
The repository contains the following data files (see folder data):
- `Switzerland_cluster_distribution_dates_100whole.tsv`: distribution of size of identical sequence clusters (obtained from GitHub Emma Hodcroft: sc2_rk_public)
- `data_new_confirmed_cases_ch_raw.csv`: number of new confirmed cases (obtained from COVID-19 Switzerland)
- `data_r_e_ch_raw.csv`: estimate of effective reproduction number on daily basis based on number of confirmed cases (obtained from GitHub covid-19-Re: dailyRe-Data)
- `switzerland_date_only.csv`: date of sampling of all sequences contained in identical sequence clusters used for the analysis
- `data_cases_sequences_clusters_ch_2021_months.csv`: overview of number of confirmed cases, number of sequences sampled, number of clusters and size of largest cluster per month in 2021
- `data_cluster_sizes_ch_2021_months.csv`: number of clusters of each size in each month of 2021
- `data_clusters_ch_processed.csv`: distribution of size of identical sequence clusters, clusters without a valid sampling date eliminated
- `data_new_confirmed_cases_ch_processed.csv`: number of new confirmed cases, filtered to 2021
- `data_r_e_ch_processed.csv`: estimate of effective reproduction number on daily basis based on number of confirmed cases, filtered to 2021
- `sequencing_probas_ch_2021_months.csv`: probability of a confirmed case being sequenced on monthly basis during 2021
- `Denmark_cluster_distribution_dates_100whole.tsv`: distribution of size of identical sequence clusters (obtained from GitHub Emma Hodcroft: sc2_rk_public)
- `data_new_confirmed_cases_dk_raw.csv`: number of new confirmed cases (obtained from Statens Serum Institut)
- `data_r_e_dk_raw.csv`: estimate of effective reproduction number on daily basis based on number of confirmed cases (obtained from GitHub covid-19-Re: dailyRe-Data)
- `denmark_date_only.csv`: date of sampling of all sequences contained in identical sequence clusters used for the analysis
- `data_cases_sequences_clusters_dk_2021_months.csv`: overview of number of confirmed cases, number of sequences sampled, number of clusters and size of largest cluster per month in 2021
- `data_cluster_sizes_dk_2021_months.csv`: number of clusters of each size in each month of 2021
- `data_clusters_dk_processed.csv`: distribution of size of identical sequence clusters, clusters without a valid sampling date eliminated
- `data_new_confirmed_cases_dk_processed.csv`: number of new confirmed cases, filtered to 2021
- `data_r_e_dk_processed.csv`: estimate of effective reproduction number on daily basis based on number of confirmed cases, filtered to 2021
- `sequencing_probas_dk_2021_months.csv`: probability of a confirmed case being sequenced on monthly basis during 2021
- `Germany_cluster_distribution_dates_100whole.tsv`: distribution of size of identical sequence clusters (obtained from GitHub Emma Hodcroft: sc2_rk_public)
- `data_new_confirmed_cases_de_raw.csv`: number of new confirmed cases (obtained from GitHub Robert Koch Institut: COVID-19_7-Tage-Inzidenz_in_Deutschland)
- `data_r_e_de_raw.csv`: estimate of effective reproduction number on daily basis based on number of confirmed cases (obtained from GitHub covid-19-Re: dailyRe-Data)
- `germany_date_only.csv`: date of sampling of all sequences contained in identical sequence clusters used for the analysis
- `data_cases_sequences_clusters_de_2021_months.csv`: overview of number of confirmed cases, number of sequences sampled, number of clusters and size of largest cluster per month in 2021
- `data_cluster_sizes_de_2021_months.csv`: number of clusters of each size in each month of 2021
- `data_clusters_de_processed.csv`: distribution of size of identical sequence clusters, clusters without a valid sampling date eliminated
- `data_new_confirmed_cases_de_processed.csv`: number of new confirmed cases, filtered to 2021
- `data_r_e_de_processed.csv`: estimate of effective reproduction number on daily basis based on number of confirmed cases, filtered to 2021
- `sequencing_probas_de_2021_months.csv`: probability of a confirmed case being sequenced on monthly basis during 2021
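One plausible way to obtain a monthly sequencing probability from the monthly overview of confirmed cases and sampled sequences is to divide the two counts. The sketch below makes exactly that assumption; it is not confirmed to be the calculation used in this repository. The function name, the column names and the example numbers are hypothetical.

```r
# Assumption (not confirmed by the repository): the monthly sequencing probability
# is the number of sequences sampled divided by the number of confirmed cases.
# Column names are hypothetical; the real files may use different ones.
monthly_sequencing_probability <- function(overview) {
  stopifnot(all(c("month", "n_cases", "n_sequences") %in% names(overview)))
  stopifnot(all(overview$n_cases > 0),
            all(overview$n_sequences <= overview$n_cases))
  data.frame(
    month = overview$month,
    p_sequenced = overview$n_sequences / overview$n_cases
  )
}

# Made-up example numbers
example_overview <- data.frame(
  month = 1:3,
  n_cases = c(1000, 2000, 500),
  n_sequences = c(100, 500, 50)
)
monthly_sequencing_probability(example_overview)
```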
- `df_cluster_by_period_NZ.rds`: number of clusters of each size in different periods (April 2020 - July 2021) (obtained from GitHub CecileTK: size-genetic-clusters)
- `df_p_trans_before_mut_with_uncertainty.rds`: probability of mutation before transmission for different pathogens (obtained from GitHub CecileTK: size-genetic-clusters)
- `df_prop_sequenced_per_period.rds`: probability of a confirmed case being sequenced for each period (obtained from GitHub CecileTK: size-genetic-clusters)
- `mutation_probas_sarscov2_omicron.rds`: probability of mutation before transmission for Omicron variant of SARS-CoV-2
- `mutation_probas_sarscov2_pre_omicron.rds`: probability of mutation before transmission before Omicron variant of SARS-CoV-2
- `data_variants_shares_ch_dk_de_raw.csv`: shares of SARS-CoV-2 variants (alpha, delta, omicron and other) among sequences on bi-weekly interval during 2021 (obtained from CoVariants)
- `data_variants_shares_ch_dk_de_processed.csv`: shares of SARS-CoV-2 variants (alpha, delta, omicron and other) among sequences on bi-weekly interval during 2021 and auxiliary variables needed for plotting
(a) model one
- `data_parameters_ppc_model_one.csv`: all parameter combinations for which identical sequence clusters were simulated during the posterior predictive check
- `index_parameters_ppc_model_one.txt`: auxiliary file needed during the posterior predictive check for the parallel execution of cluster simulation
(b) model five
- `data_parameters_ppc_model_five.csv`: all parameter combinations for which identical sequence clusters were simulated during the posterior predictive check
- `index_parameters_ppc_model_five.txt`: auxiliary file needed during the posterior predictive check for the parallel execution of cluster simulation
(a) model one
- `parameters_grid_simulation_model_one.csv`: all parameter combinations for which identical sequence clusters were simulated during the simulation study
- `indices_simulation_model_one.txt` and `indices_estimation_model_one.txt`: auxiliary files needed during the simulation study for the parallel execution of cluster simulation and parameter estimation, respectively
- `simulated_clusters`: simulated data based on parameters defined in `parameters_grid_simulation_model_one.csv`
(b) model five
- `parameters_grid_simulation_model_five.csv`: all parameter combinations for which identical sequence clusters were simulated during the simulation study
- `indices_simulation_model_five.txt` and `indices_estimation_model_five.txt`: auxiliary files needed during the simulation study for the parallel execution of cluster simulation and parameter estimation, respectively
- `simulated_clusters`: simulated data based on parameters defined in `parameters_grid_simulation_model_five.csv`
stanfit files containing the results of the parameter estimation from data of Switzerland using models one, two, three, four, five and six
stanfit files containing the results of the parameter estimation from data of Denmark using models one, two, three, four, five and six
stanfit files containing the results of the parameter estimation from data of Germany using models one, two, three, four, five and six
- `results_model_one_ch_dk_de_2021_months.csv`: summary of the results of the parameter estimation from data of Switzerland, Denmark and Germany using model one
- `results_model_two_ch_dk_de_2021_months.csv`: summary of the results of the parameter estimation from data of Switzerland, Denmark and Germany using model two
- `results_model_three_ch_dk_de_2021_months.csv`: summary of the results of the parameter estimation from data of Switzerland, Denmark and Germany using model three
- `results_model_four_ch_dk_de_2021_months.csv`: summary of the results of the parameter estimation from data of Switzerland, Denmark and Germany using model four
- `results_model_five_ch_dk_de_2021_months.csv`: summary of the results of the parameter estimation from data of Switzerland, Denmark and Germany using model five
- `results_model_six_ch_dk_de_2021_months.csv`: summary of the results of the parameter estimation from data of Switzerland, Denmark and Germany using model six
- `results_ppc_model_one_ch_dk_de.csv`: summary of the results of the posterior predictive check of the results of the parameter estimation from data of Switzerland, Denmark and Germany using model one
- `results_ppc_model_five_ch_dk_de.csv`: summary of the results of the posterior predictive check of the results of the parameter estimation from data of Switzerland, Denmark and Germany using model five
(a) model one
- `results_goodness_fit_mean_model_one_ch_dk_de.csv`: cluster size distribution parametrized by mean estimates of parameters
- `results_goodness_fit_low_model_one_ch_dk_de.csv`: cluster size distribution parametrized by 2.5% and 97.5% quantile estimates of parameters (lower limit)
- `results_goodness_fit_high_model_one_ch_dk_de.csv`: cluster size distribution parametrized by 2.5% and 97.5% quantile estimates of parameters (upper limit)
(b) model five
- `results_goodness_fit_mean_model_five_ch_dk_de.csv`: cluster size distribution parametrized by mean estimates of parameters
- `results_goodness_fit_low_model_five_ch_dk_de.csv`: cluster size distribution parametrized by 2.5% and 97.5% quantile estimates of parameters (lower limit)
- `results_goodness_fit_high_model_five_ch_dk_de.csv`: cluster size distribution parametrized by 2.5% and 97.5% quantile estimates of parameters (upper limit)
(a) model one
stanfit files containing the results of the parameter estimation from simulated data using model one (see data/simulation/01_model_one/simulated_clusters)
- `results_sim_model_one_processed.csv`: summary of the results of the parameter estimation from simulated data using model one
(b) model five
stanfit files containing the results of the parameter estimation from simulated data using model five (see data/simulation/02_model_five/simulated_clusters)
- `results_sim_model_five_processed.csv`: summary of the results of the parameter estimation from simulated data using model five
Graphical and tabular representations of data, model and results of the simulation study, parameter estimation, posterior predictive check and goodness of fit check, for Switzerland, Denmark and Germany both individually and for all three countries together, as well as of the parameter estimation for New Zealand.