diff --git a/020-Data-generating-models.Rmd b/020-Data-generating-models.Rmd index e0332da..5d05d73 100644 --- a/020-Data-generating-models.Rmd +++ b/020-Data-generating-models.Rmd @@ -242,6 +242,7 @@ Here is a plot of 30 observations from the bivariate Poisson distribution with m #| fig.width: 6 #| fig.height: 4 +set.seed( 40440 ) bvp_dat <- r_bivariate_Poisson( 30, rho = 0.65, @@ -252,9 +253,12 @@ ggplot(bvp_dat) + aes(C1, C2) + geom_vline(xintercept = 0) + geom_hline(yintercept = 0) + - scale_x_continuous(expand = expansion(0,c(0,1))) + - scale_y_continuous(expand = expansion(0,c(0,1))) + - geom_point(position = position_jitter(width = 0.2, height = 0.2)) + + scale_x_continuous(expand = expansion(0,c(0,1)), + breaks = c( 2, 4, 6, 8, 10, 12, 14, 16 )) + + scale_y_continuous(expand = expansion(0,c(0,1)), + breaks = c( 2, 4, 6, 8, 10, 12 )) + + coord_fixed() + + geom_point(position = position_jitter(width = 0.075, height = 0.075)) + theme_minimal() ``` Plots like these are useful for building intuitions about a model. For instance, we can inspect \@ref(fig:bivariate-Poisson-scatter) to get a sense of the order of magnitude and range of the observations, as well as the likelihood of obtaining multiple observations with identical counts. 
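As an aside on how such a generator can work internally, one common construction for correlated Poisson counts is the shared-component ("trivariate reduction") approach. The sketch below is an assumption about one possible implementation, not necessarily how the book's `r_bivariate_Poisson()` is written:

```r
# Sketch of a bivariate Poisson generator via a shared component.
# If Z0, Z1, Z2 are independent Poisson draws, then C1 = Z0 + Z1 and
# C2 = Z0 + Z2 are correlated Poisson counts with means mu1 and mu2,
# and correlation lambda0 / sqrt(mu1 * mu2).
r_bivariate_Poisson_sketch <- function(N, rho, mu1, mu2) {
  lambda0 <- rho * sqrt(mu1 * mu2)      # shared component drives correlation
  stopifnot(lambda0 <= min(mu1, mu2))   # feasibility constraint on rho
  Z0 <- rpois(N, lambda0)
  data.frame(C1 = Z0 + rpois(N, mu1 - lambda0),
             C2 = Z0 + rpois(N, mu2 - lambda0))
}
```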
@@ -444,9 +448,9 @@ The mathematical model gives us the details we need to execute with each of thes Here is the skeleton of a DGP function with arguments for each of the parameters we might want to control, including defaults for each (see \@ref(default-arguments) for more on function defaults): ```{r, echo = FALSE} -source( "case_study_code/gen_cluster_RCT.R" ) +source( "case_study_code/gen_cluster_RCT_bug.R" ) -cluster_RCT_fun <- readLines("case_study_code/gen_cluster_RCT.R") +cluster_RCT_fun <- readLines("case_study_code/gen_cluster_RCT_bug.R") skeleton <- c(cluster_RCT_fun[1:10], " # Code (see below) goes here", "}") ``` @@ -477,7 +481,7 @@ Next, we use the site characteristics to generate the individual-level variables A key piece here is the `rep()` function that takes a list and repeats each element of the list a specified number of times. In particular, `rep()` repeats each number ($1, 2, \ldots, J$), the corresponding number of times as listed in `nj`. -Putting the code above into the function skeleton will produce a complete DGP function (view the [complete function here](/case_study_code/gen_cluster_RCT.R)). +Putting the code above into the function skeleton will produce a complete DGP function (view the [complete function here](/case_study_code/gen_cluster_RCT_bug.R)). We can then call the function as so: ```{r} dat <- gen_cluster_RCT( @@ -486,7 +490,7 @@ dat <- gen_cluster_RCT( sigma2_u = 0.4, sigma2_e = 1 ) -dat +head( dat ) ``` With this function, we can control the average size of the clusters (`n`), the number of clusters (`J`), the proportion treated (`p`), the average outcome in the control group (`gamma_0`), the average treatment effect (`gamma_1`), the site size by treatment interaction (`gamma_2`), the amount of cross site variation (`sigma2_u`), the residual variation (`sigma2_e`), and the amount of site size variation (`alpha`). 
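The `rep()` pattern described above can be illustrated with toy values: each site ID $j$ is repeated `nj[j]` times, yielding one row per individual.

```r
# Expand site IDs so that site j appears nj[j] times
nj  <- c(3, 1, 2)
sid <- rep(1:3, nj)
sid
#> [1] 1 1 1 2 3 3
```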
@@ -534,10 +538,12 @@ A further consequence of setting the overall scale of the outcome to 1 is that t The standardized mean difference for a treatment impact is defined as the average impact over the standard deviation of the outcome among control observations.^[An alternative definition is based on the pooled standard deviation, but this is usually a bad choice if one suspects treatment variation. More treatment variation should not reduce the effect size for the same absolute average impact.] Letting $\delta$ denote the standardized mean difference parameter, $$ \delta = \frac{E(Y | Z_j = 1) - E(Y | Z_j = 0)}{SD( Y | Z_j = 0 )} = \frac{\gamma_1}{\sqrt{ \sigma^2_u + \sigma^2_\epsilon } } $$ -Because we have constrained the total variance, $\gamma_1$ is equivalent to $\delta$. This equivalence holds for any value of $\gamma_0$, so we do not have to worry about manipulating $\gamma_0$ in the simulations---we can simply leave it at its default value. - +Because we have constrained the total variance to 1, $\gamma_1$ is equivalent to $\delta$. +This equivalence holds for any value of $\gamma_0$, so we do not have to worry about manipulating $\gamma_0$ in the simulations---we can simply leave it at its default value. +The $\gamma_2$ parameter only impacts outcomes on the treatment side, with larger values inducing more variation. Standardizing on something stable (here the control side) rather than something that changes in reaction to various parameters (which would happen if we standardized by overall variation or pooled variation) will lead to more interpretable quantities. - + + ## Sometimes a DGP is all you need {#three-parameter-IRT} @@ -719,8 +725,7 @@ The exercises below will give you further practice in developing, testing, and e Of course, we have not covered every consideration and challenge that arises when writing DGPs for simulations---there is _much_ more to explore! We will cover some further challenges in subsequent chapters.
- - +See, for example, Chapters \@ref(sec:power) and \@ref(potential-outcomes). Yet more challenges will surely arise as you apply simulations in your own work. Even though such challenges might not align with the examples or contexts that we have covered here, the concepts and principles that we have introduced should allow you to reason about potential solutions. In doing so, we encourage you to adopt the attitude of a chef in a test kitchen by trying out your ideas with code. diff --git a/030-Estimation-procedures.Rmd b/030-Estimation-procedures.Rmd index 01a2088..c6778b5 100644 --- a/030-Estimation-procedures.Rmd +++ b/030-Estimation-procedures.Rmd @@ -16,7 +16,7 @@ set.seed(20250606) source("case_study_code/generate_ANOVA_data.R") source("case_study_code/ANOVA_Welch_F.R") source("case_study_code/r_bivariate_Poisson.R") -source("case_study_code/gen_cluster_RCT_rev.R") +source("case_study_code/gen_cluster_RCT.R") source("case_study_code/analyze_cluster_RCT.R") ``` @@ -40,7 +40,7 @@ The easiest approach here is to implement each estimator in turn, and then bundl We describe how to do this in Section \@ref(multiple-estimation-procedures). In Section \@ref(validating-estimation-function), we describe strategies for validating the coded-up estimator before running a full simulation. -We offer a few strategies for validating one's code, including checking against existing implementations, checking theoretical properties, and using simulations to check for bugs. +We offer a few strategies for validating one's code, including checking against existing implementations, checking theoretical properties, and using simulations to check for bugs. For a full simulation, an estimator needs to be reliable, not crashing and giving answers across the wide range of datasets to which it is applied. An estimator will need to be able to handle odd edge cases or pathological datasets that might be generated as we explore a full range of simulation scenarios.
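As a sketch of this kind of robustness check (using function names assumed from this chapter's case study), one can feed the estimator a deliberately extreme dataset and confirm that it returns a result rather than crashing:

```r
# Stress test: tiny clusters and no cluster-level variation.
# The estimator should return a (possibly flagged) result,
# not an error, even on this pathological input.
dat_edge <- gen_cluster_RCT( J = 4, n_bar = 2, sigma2_u = 0 )
analysis_MLM( dat_edge )
```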
@@ -119,11 +119,14 @@ Several different procedures might be used to estimate an overall average effect All three of these methods are widely used and have some theoretical guarantees supporting their use. Education researchers tend to be more comfortable using multi-level regression models, whereas economists tend to use OLS with clustered standard errors. - + + + We next develop estimation functions for each of these procedures, focusing on a simple model that does not include any covariates besides the treatment indicator. Each function needs to produce a point estimate, standard error, and $p$-value for the average treatment effect. -To have data to practice on, we generate a sample dataset using [a revised version of `gen_cluster_RCT()`](/case_study_code/gen_cluster_RCT_rev.R), which corrects the bug discussed in Exercise \@ref(cluster-RCT-checks): +To have data to practice on, we generate a sample dataset using [a revised version of `gen_cluster_RCT()`](/case_study_code/gen_cluster_RCT.R), which corrects the bug discussed in Exercise \@ref(cluster-RCT-checks): + ```{r} dat <- gen_cluster_RCT( J=16, n_bar = 30, alpha = 0.8, p = 0.5, @@ -396,7 +399,6 @@ Furthermore, warnings can clutter up the console and slow down code execution, s On a conceptual level, we need to decide how to use the information contained in errors and warnings, whether that be by further elaborating the estimators to address different contingencies or by evaluating the performance of the estimators in a way that appropriately accounts for these events. We consider both these problems here, and then revisit the conceptual considerations in Chapter \@ref(performance-measures), where we discuss assessing estimator performance. - ### Capturing errors and warnings Some estimation functions will require complicated or stochastic calculations that can sometimes produce errors. 
@@ -458,6 +460,7 @@ However, `quietly()` does not trap errors: rs <- replicate(20, my_quiet_function( 7 )) ``` To handle both errors and warnings, we double-wrap the function, first with `possibly()` and then with `quietly()`: + ```{r} my_safe_quiet_function <- quietly( possibly( my_complex_function, otherwise = NA ) ) my_safe_quiet_function(7) @@ -470,39 +473,53 @@ set.seed(101012) # (hand-picked to show a warning.) dat <- gen_cluster_RCT( J = 50, n_bar = 100, sigma2_u = 0 ) mod <- analysis_MLM(dat) ``` + Wrapping `lmer()` with `quietly()` makes it possible to catch such output and store it along with other results, as in the following: ```{r} quiet_safe_lmer <- quietly( possibly( lmerTest::lmer, otherwise=NULL ) ) M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1|sid), data=dat ) M1 ``` -We then pick apart the pieces and construct a dataset of results: + +In our analysis method, we then pick apart the pieces and assemble them into a data frame (we also specify the generally preferred `bobyqa` optimizer, to try to avoid warnings and errors in the first place): + ```{r} -if ( is.null( M1$result ) ) { - # we had an error! - tibble( ATE_hat = NA, SE_hat = NA, p_value = NA, - message = M1$message, - warning = M1$warning, - error = TRUE ) -} else { - sum <- summary( M1$result ) - tibble( - ATE_hat = sum$coefficients["Z","Estimate"], - SE_hat = sum$coefficients["Z","Std. Error"], - p_value = sum$coefficients["Z", "Pr(>|t|)"], - message = list( M1$message ), - warning = list( M1$warning ), - error = FALSE ) +analysis_MLM_safe <- function( dat ) { + + M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1 | sid), + data=dat, + control = lme4::lmerControl( + optimizer = "bobyqa" + ) ) + + if ( is.null( M1$result ) ) { + # we had an error! + tibble( ATE_hat = NA, SE_hat = NA, p_value = NA, + message = length( M1$messages ) > 0, + warning = length( M1$warnings ) > 0, + error = TRUE ) + } else { + sum <- summary( M1$result ) + + tibble( + ATE_hat = sum$coefficients["Z","Estimate"], + SE_hat = sum$coefficients["Z","Std. Error"], + p_value = sum$coefficients["Z", "Pr(>|t|)"], + message = length( M1$messages ) > 0, + warning = length( M1$warnings ) > 0, + error = FALSE + ) + } } ``` -Now we can plug in the above code to make a nicely quieted and safe function. -This version runs without extraneous messages: + +Our improved function runs without extraneous messages and returns the estimation results along with any diagnostic information from messages or warnings: + ```{r} mod <- analysis_MLM_safe(dat) mod ``` -Now we have the estimation results along with any diagnostic information from messages or warnings. Storing this information will let us evaluate what proportion of the time there was a warning or message, run additional analyses on the subset of replications where there was no such warning, or even modify the estimator to take the diagnostics into account. We have solved the technical problem---our code will run and give results---but not the conceptual one: what does it mean when our estimator gives an NA or a convergence warning with a nominal answer? How do we decide how good our estimator is when it does this? diff --git a/035-running-simulation.Rmd b/035-running-simulation.Rmd index 8ee5dfd..0e97a7d 100644 --- a/035-running-simulation.Rmd +++ b/035-running-simulation.Rmd @@ -20,15 +20,15 @@ source("case_study_code/analyze_cluster_RCT.R") # Running the Simulation Process {#running-the-simulation-process} In the prior two chapters we saw how to write functions that generate data according to a particular model and functions that implement data-analysis procedures on simulated data. -The next step in a simulation involves putting these two pieces together, running the DGP function and the data-analysis function repeatedly to obtain results (in the form of point estimates, standard errors, confidence intervals, p-values, or other quantities) from many replications of the whole process.
+The next step to build a simulation involves putting these two pieces together, running the DGP function and the data-analysis function repeatedly to obtain results (in the form of point estimates, standard errors, confidence intervals, $p$-values, or other quantities) from many replications of the whole process. As with most things R-related, there are many different techniques that can be used to repeat a set of calculations over and over. -In this chapter, we demonstrate several techniques for doing so. +In this chapter, we demonstrate two key techniques for doing so. We then explain how to ensure reproducibility of simulation results by setting the seed used by R's random number generator. ## Repeating oneself {#repeating-oneself} -Suppose that we want to simulate Pearson's correlation coefficient calculated based on a sample from the bivariate Poisson function. +Suppose that we want to repeatedly generate bivariate Poisson data and then estimate the correlation for those data. We saw a DGP function for the bivariate Poisson in Section \@ref(DGP-functions), and an estimation function in Section \@ref(estimation-functions). To produce a simulated correlation coefficient, we need to run these two functions in turn: ```{r} @@ -56,7 +56,7 @@ The second argument is an R expression that will be evaluated. The expression is wrapped in curly braces (`{}`) because it involves more than a single line of code. Including the option `id = "rep"` returns a dataset that includes a variable called `rep` to identify each replication of the process. Setting the option `stack = TRUE` will stack up the output of each expression into a single tibble, which will facilitate later calculations on the results. -Setting this option is not necessary because it is `TRUE` by default; setting `stack = FALSE` will return the results in a list rather than a tibble (try this for yourself to see!). 
+Explicitly setting this option is not necessary because it is `TRUE` by default; setting `stack = FALSE` will return the results in a list rather than a tibble (try this for yourself to see!). There are many other functions that work very much like `repeat_and_stack()`, including the base-R function `replicate()` and the now-deprecated function `rerun()` from `{purrr}`. The functions in the `map()` family from `{purrr}` can also be used to do the same thing as `repeat_and_stack()`. @@ -124,31 +124,15 @@ Note how the `bind_rows()` method can take naming on the fly, and give us a colu Once we have a function to execute a single run, we can produce multiple results using `repeat_and_stack()`: -```{r secret_run_cluster_rand_sim, include=FALSE} -R <- 1000 -ATE <- 0.30 - -if ( !file.exists("results/cluster_RCT_simulation.rds") ) { - tictoc::tic() # Start the clock! - set.seed( 40404 ) - runs <- repeat_and_stack(R, one_run( n_bar = 30, J=20, gamma_1 = ATE ), id = "runID") - tictoc::toc() - - saveRDS( runs, file = "results/cluster_RCT_simulation.rds" ) -} else { - runs <- readRDS( file = "results/cluster_RCT_simulation.rds" ) -} -``` - - ```{r cluster_rand_sim, eval=FALSE} R <- 1000 ATE <- 0.30 -runs <- repeat_and_stack(R, - one_run( n_bar = 30, J=20, gamma_1 = ATE ), - id = "runID") -saveRDS( runs, file = "results/cluster_RCT_simulation.rds" ) +runs <- repeat_and_stack( R, + one_run( n_bar = 30, J = 20, + gamma_1 = ATE ), + id = "runID") ``` + Setting `id = "runID"` lets us keep track of which iteration number produced which result. Once our simulation is complete, we save our results to a file so that we can avoid having to re-run the full simulation if we want to explore the results in some future work session. @@ -157,11 +141,12 @@ The next step is to evaluate how well the estimators did. For example, we will want to examine questions about bias, precision, and accuracy of the three point estimators. 
In Chapter \@ref(performance-measures), we will look systematically at ways to quantify the performance of estimation methods. -### Reparameterizing {#one-run-reparameterization} +## Reparameterizing {#one-run-reparameterization} In Section \@ref(DGP-standardization), we discussed how to index the DGP of the cluster-randomized experiment using an intra-class correlation (ICC) instead of using two separate variance components. This type of re-parameterization can be handled as part of writing a wrapper function for executing the DGP and estimation procedures. Here is a revised version of `one_run()`, which also renames some of the more obscure model parameters using terms that are easier to interpret: + ```{r revised_CRT, eval=FALSE} one_run <- function( n_bar = 30, J = 20, ATE = 0.3, size_coef = 0.5, @@ -184,11 +169,66 @@ one_run <- function( Note the `stopifnot`: it is wise to ensure our parameter transforms are all reasonable, so we do not get unexplained errors or strange results later on. It is best if your code fails as soon as possible! Otherwise debugging can be quite hard. + Controlling how we use the foundational elements such as our data generating code is a key technique for making the higher level simulations sensible and more easily interpretable. In the revised `one_run()` function, we transform the `ICC` input parameter into the parameters used by `gen_cluster_RCT()` so as to maintain the effect size interpretation of our simulation. We have not modified `gen_cluster_RCT()` at all: instead, we specify the parameters of the DGP function in terms of the parameters we want to directly control in the simulation. -Here we have put our entire simulation into effect size units, and are now providing "knobs" to the simulation that are directly interpretable. 
+ +We can then wrap the reparameterization and replication into a function that runs an entire simulation: + + ```{r secret_run_cluster_rand_sim, include=FALSE} +R <- 1000 +ATE <- 0.30 +source( here::here( "case_study_code/clustered_data_simulation.R" ) ) +if ( !file.exists("results/cluster_RCT_simulation.rds") ) { + tictoc::tic() # Start the clock! + set.seed( 40404 ) + runs <- run_CRT_sim( reps = 1000, + n_bar = 30, J = 20, + ATE = 0.3, ICC = 0.2, + size_coef = 0.5, alpha = 0.5 ) + + tictoc::toc() + + saveRDS( runs, file = "results/cluster_RCT_simulation.rds" ) +} else { + runs <- readRDS( file = "results/cluster_RCT_simulation.rds" ) +} +``` + +```{r} +run_CRT_sim <- function( reps, + n_bar = 10, J = 30, p = 0.5, + ATE = 0, ICC = 0.4, + size_coef = 0, alpha = 0 ) { + + res <- + simhelpers::repeat_and_stack( reps, { + dat <- gen_cluster_RCT( n_bar = n_bar, J = J, p = p, + gamma_0 = 0, gamma_1 = ATE, gamma_2 = size_coef, + sigma2_u = ICC, sigma2_e = 1 - ICC, + alpha = alpha ) + quiet_analyze_data(dat) + }, id = "runID" ) + + res +} +``` + +Using it is then quite simple: +```{r, eval = FALSE} +runs <- run_CRT_sim( reps = 1000, + n_bar = 30, J = 20, + ATE = 0.3, ICC = 0.2, + size_coef = 0.5, alpha = 0.5 ) +saveRDS( runs, file = "results/cluster_RCT_simulation.rds" ) +``` +Notice our simulation takes easy-to-interpret parameters (`ATE`, `ICC`) rather than the raw variance components. +We have put our simulation into effect size units, and are now providing "knobs" to the simulation that are directly interpretable. + ## Bundling simulations with `simhelpers` {#bundle-sim-demo} @@ -197,17 +237,20 @@ The `simhelpers` package provides a function `bundle_sim()` that abstracts this Thus, `bundle_sim()` provides a convenient alternative to writing your own `one_run()` function for each simulation, thereby saving a bit of typing (and avoiding an opportunity for bugs to creep into your code).
`bundle_sim()` takes a DGP function and an estimation function as inputs and gives us back a new function that will run a simulation using whatever parameters we give it. -Here is a basic example, which creates a function for simulating Pearson correlation coefficients with a bivariate Poisson distribution: +The `bundle_sim()` function is a "factory" in that it creates a new function for you on the fly. +Think of it as a (rather dumb) coding assistant who will take some direction and give you code back. + +Here is a basic example, where we create a function for running our simulation to evaluate how well our Pearson correlation coefficient confidence interval works on a bivariate Poisson distribution: ```{r} -sim_r_Poisson <- bundle_sim(f_generate = r_bivariate_Poisson, - f_analyze = r_and_z, - id = "rep") +sim_r_Poisson <- bundle_sim( f_generate = r_bivariate_Poisson, + f_analyze = r_and_z, + id = "rep" ) ``` -If we specify the optional argument `id = "rep"`, the function will include a variable called `rep` with a unique identifier for each replication of the simulation process. +If we specify the optional argument `id = "rep"`, the built function will generate a variable called `rep` with a unique identifier for each replication of the simulation. We can use the newly created function like so: ```{r, messages=FALSE} -sim_r_Poisson( 4, N = 30, rho = 0.4, mu1 = 8, mu2 = 14) +sim_r_Poisson( 4, N = 30, rho = 0.4, mu1 = 8, mu2 = 14 ) ``` To create this simulation function, `bundle_sim()` examined `r_bivariate_Poisson()`, figured out what its input arguments are, and made sure that the simulation function includes the same input arguments. @@ -223,7 +266,8 @@ The `bundle_sim()` function requires specifying a DGP function and a _single_ es For our cluster-randomized experiment example, we would then need to use our `estimate_Tx_Fx()` function that organizes all of the estimators (see Chapter \@ref(multiple-estimation-procedures)). 
We then use `bundle_sim()` to create a function for running an entire simulation: ```{r, messages=FALSE} -sim_cluster_RCT <- bundle_sim( gen_cluster_RCT, estimate_Tx_Fx, id = "runID" ) +sim_cluster_RCT <- bundle_sim( gen_cluster_RCT, estimate_Tx_Fx, + id = "runID" ) ``` We can call the newly created function like so: ```{r, messages=FALSE} @@ -236,20 +280,20 @@ Again, `bundle_sim()` produces a function with input names that exactly match th It is not possible to re-parameterize or change argument names, as we did with `one_run()` in Section \@ref(one-run-reparameterization). See Exercise \@ref(reparameterization-redux) for further discussion of this limitation. -To use the simulation function in practice, we call it by specifying the number of replications desired (which we have stored in `R`) and any relevant input parameters. +To use the simulation function in practice, we would call it by specifying the number of replications desired (which we have stored in `R`) and any relevant input parameters. ```{r, eval=FALSE} runs <- sim_cluster_RCT( reps = R, J = 20, n_bar = 30, alpha = 0.5, gamma_1 = ATE, gamma_2 = 0.5, sigma2_u = 0.2, sigma2_e = 0.8 ) -saveRDS( runs, file = "results/cluster_RCT_simulation.rds" ) ``` The `bundle_sim()` function is just a convenient way to create a function that pieces together the steps in the simulation process, which is especially useful when the component functions include many input parameters. The function has several further features, which we will demonstrate in subsequent chapters. + ## Seeds and pseudo-random number generators {#seeds-and-pseudo-RNGs} -In prior chapters, we have used built-in functions to generate random numbers and also written our own data-generating functions that produce artificial data following a specific random process. +In prior chapters, we used built-in functions to generate random numbers and wrote our own data-generating functions that produce artificial data following a specific random process. 
With either type of function, re-running it with the exact same input parameters will produce different results. For instance, running the `rchisq` function with the same set of inputs will produce two different sequences of $\chi^2$ random variables: ```{r} @@ -269,7 +313,8 @@ sim_B <- sim_r_Poisson( 10, N = 30, rho = 0.4, mu1 = 8, mu2 = 14) identical(sim_A, sim_B) ``` Of course, this is the intended behavior of these functions, but it has an important consequence that needs some care and attention. -Using functions like `rchisq()`, `r_bivariate_Poisson()`, or `r_bivariate_Poisson()` in a simulation study means that the results will not be fully reproducible. +Using functions like `rchisq()`, `r_bivariate_Poisson()`, or `sim_r_Poisson()` in a simulation study means that the results will not be fully reproducible: if we run our simulation and write up our results, then someone else who runs our very same simulation using our very same code might not get the very same results. +This can undermine transparency, and make it hard to verify whether a set of presented results really came from a given codebase. When using DGP functions for simulations, it is useful to be able to exactly control the process of generating random numbers. This is much more feasible than it sounds: Monte Carlo simulations are random, at least in theory, but computers are deterministic. @@ -293,6 +338,8 @@ set.seed(6) c2 <- rchisq(4, df = 3) rbind(c1, c2) ``` +The sequences are the same! + Similarly, we can set the seed and run a series of calculations involving multiple functions that make use of the random number generator: ```{r} # First time @@ -308,9 +355,23 @@ dat_B <- r_bivariate_Poisson(20, rho = 0.5, mu1 = 4, mu2 = 7) bind_rows(A = r_and_z(dat_A), B = r_and_z(dat_B), .id = "Rep") ``` -The `bundle_sim()` function demonstrated in Section \@ref(bundle-sim-demo) creates a function for repeating the process of generating and analyzing data.
-By default, the function it produces includes an argument `seed`, which allows the user to set a seed value before repeatedly evaluating the DGP function and estimation function. -By default, the `seed` argument is `NULL` and so the current seed is not modified. +When writing code, you can pass the seed as a parameter, such as the following: + +```{r, eval=FALSE} +run_CRT_sim <- function(reps, + n_bar = 10, J = 30, p = 0.5, + ATE = 0, ICC = 0.4, + size_coef = 0, alpha = 0, + seed = NULL ) { + + if (!is.null(seed)) set.seed(seed) + ... +} +``` +A `seed` argument allows the user to set a seed value before running the full simulation. +By default, the `seed` argument is `NULL` and so the current seed is not modified, but if you set the seed, you will get the same run each time. + +The `bundle_sim()` function, demonstrated in Section \@ref(bundle-sim-demo), gives back a function with a `seed` argument like the one above. Specifying an integer-valued seed will make the results exactly reproducible: ```{r} sim_A <- sim_r_Poisson( 10, N = 30, rho = 0.4, mu1 = 8, mu2 = 14, @@ -330,7 +391,19 @@ If we set a seed and find that the code crashes, we can debug and then rerun the If it now runs without error, we know we fixed the problem. If we had not set the seed, we would not know if we were just getting (un)lucky, and avoiding the error by chance. - + +## Conclusions + +The act of running a simulation is not itself the goal of a simulation study. Rather, it is the mechanism by which we generate many realizations of a data analysis procedure under a specified data-generating process. +As we have seen in this chapter, doing so requires carefully stitching together the one-two step of data generation and analysis, repeating that combined process many times, and then organizing the resulting output in a form that can be easily and systematically evaluated.
+Different coding strategies can be used to accomplish this, but they all serve the same purpose: to make repeated evaluation of a procedure reliable, transparent, and easy to extend. + +Finally, because simulation studies rely on stochastic data generation, careful attention must be paid to reproducibility. +By controlling the pseudo-random number generator through explicit use of seeds, we can ensure that simulation results can be exactly reproduced, checked, and debugged. +This reproducibility is essential for both scientific transparency and practical development of simulation code. + +With the tools of this chapter in place, we are now positioned to move from running simulations to evaluating what they tell us about estimator behavior, which is the focus of the next chapter. + ## Exercises diff --git a/040-Performance-criteria.Rmd b/040-Performance-criteria.Rmd index 5bedcee..5aa0c4b 100644 --- a/040-Performance-criteria.Rmd +++ b/040-Performance-criteria.Rmd @@ -6,16 +6,19 @@ output: ```{r include = FALSE} library(tidyverse) +library(simhelpers) conflicted::conflicts_prefer(dplyr::select) conflicted::conflicts_prefer(dplyr::filter) - options(list(dplyr.summarise.inform = FALSE)) -source("case_study_code/gen_cluster_RCT_rev.R") +source("case_study_code/gen_cluster_RCT.R") cluster_RCT_res <- readRDS("results/cluster_RCT_simulation.rds") %>% mutate( - method = case_match(method, "Agg" ~ "Aggregation", "LR" ~ "Linear regression", "MLM" ~ "Multilevel model") + method = case_match(method, + "Agg" ~ "Aggregation", + "LR" ~ "Linear regression", + "MLM" ~ "Multilevel model") ) ``` @@ -23,17 +26,25 @@ cluster_RCT_res <- # Performance Measures {#performance-measures} Once we run a simulation, we end up with a pile of results to sort through. -For example, Figure \@ref(fig:CRT-ATE-hist) depicts the distribution of average treatment effect estimates from the cluster-randomized experiment simulation, which we generated in Chapter \@ref(running-the-simulation-process). 
+For example, Figure \@ref(fig:CRT-ATE-hist) depicts the distribution of average treatment effect estimates from the cluster-randomized experiment simulation we generated in Section \@ref(one-run-reparameterization). There are three different estimators, each with 1000 replications. Each histogram is an approximation of the _sampling distribution_ of the estimator, meaning its distribution across repetitions of the data-generating process. -With results such as these, the question before us is now how to evaluate how well each procedure works. If we are comparing several different estimators, we also need to determine which ones work better or worse than others. In this chapter, we look at a variety of __performance measures__ that can answer these questions. +Visually, we see the three estimators look about the same---they sometimes estimate too low, and sometimes estimate too high. +We also see that linear regression is shifted up a bit, and perhaps aggregation and MLM have a few more outliers. +These observations speak to how well each estimator _performs_; next we dig into quantifying that performance more precisely with what we call _performance measures_. + + ```{r CRT-ATE-hist} #| fig.width: 8 #| fig.height: 3 +#| out.width: "100%" #| echo: false #| message: false -#| fig.cap: "Sampling distribution of average treatment effect estimates from a cluster-randomized trial with a true average treatment effect of 0.3." +#| fig.cap: "Sampling distribution of average treatment effect estimates from a cluster-randomized trial with a true average treatment effect of 0.3 (dashed line)."
cluster_RCT_mean <- cluster_RCT_res %>% @@ -42,72 +53,91 @@ cluster_RCT_mean <- ggplot(cluster_RCT_res) + aes(ATE_hat, fill = method) + geom_vline(xintercept = 0.3, linetype = "dashed") + - geom_vline(data = cluster_RCT_mean, aes(xintercept = ATE_hat, color = method)) + - geom_point(data = cluster_RCT_mean, aes(x = ATE_hat, y = -1, color = method)) + + geom_vline(data = cluster_RCT_mean, aes(xintercept = ATE_hat), color="black") + + geom_point(data = cluster_RCT_mean, aes(x = ATE_hat, y = -1), color = "black") + geom_histogram(alpha = 0.5) + facet_wrap(~ method, nrow = 1) + theme_minimal() + labs(x = "Average treatment effect estimate", y = "") + scale_y_continuous(labels = NULL) + + scale_x_continuous( breaks=c( -.3, 0, 0.3, 0.6, 0.9 ) ) + theme(legend.position = "none") ``` -Performance measures are summaries of a sampling distribution that describe how an estimator or data analysis procedure behaves on average if we could repeat the data-generating process an infinite number of times. +A _performance measure_ summarizes an estimator's sampling distribution to describe how an estimator would behave on average if we were to repeat the data-generating process an infinite number of times. For example, the bias of an estimator is the difference between the average value of the estimator and the corresponding target parameter. -Bias measures the central tendency of the sampling distribution, capturing how far off, on average, the estimator would be from the true parameter value if we repeated the data-generating process an infinite number of times. -In Figure \@ref(fig:CRT-ATE-hist), black dashed lines mark the true average treatment effect of 0.3 and the colored vertical lines with circles at the end mark the means of the estimators. -The distance between the colored lines and the black dashed lines corresponds to the bias of the estimator. 
+Bias measures the central tendency of the sampling distribution, capturing whether we are systematically off, on average, from the true parameter value. + +In Figure \@ref(fig:CRT-ATE-hist), the black dashed lines mark the true average treatment effect of 0.3 and the solid lines with dots mark the means of the estimators. +The horizontal distance between the solid and dashed lines corresponds to the bias of the estimator. This distance is nearly zero for the aggregation estimator and the multilevel model estimator, but larger for the linear regression estimator. +Our performance measure of bias would be about zero for Agg and MLM, and around 0.05 to 0.1 for linear regression. + +Many performance measures exist. +For point estimates, conventional performance measures include bias, variance, and root mean squared error. +Our definition of "success" could depend on which measure we find more important: is low bias more important, or low variance? -Different types of data-analysis results produce different types of information, and so the relevant set of performance measures depends on the type of data analysis result we are studying. -For procedures that produce point estimates or point predictions, conventional performance measures include bias, variance, and root mean squared error. If the point estimates come with corresponding standard errors, then we may also want to evaluate how accurately the standard errors represent the true uncertainty of the point estimators; conventional performance measures for capturing this include the relative bias and relative root mean squared error of the variance estimator. For procedures that produce confidence intervals or other types of interval estimates, conventional performance measures include the coverage rate and average interval width.
Finally, for inferential procedures that involve hypothesis tests (or more generally, classification tasks), conventional performance measures include Type I error rates and power. We describe each of these measures in Sections \@ref(assessing-point-estimators) through \@ref(assessing-inferential-procedures). -Performance measures are defined with respect to sampling distributions, or the results of applying a data analysis procedure to data generated according to a particular process across an infinite number of replications. -In defining specific measures, we will follow statistical conventions to denote the mean, variance, and other moments of the sampling distribution. -For a random variable $T$, we will use the expectation operator $\E(T)$ to denote the mean of the sampling distribution of $T$, $\M(T)$ to denote the median of its sampling distribution, $\Var(T)$ to denote the variance of its sampling distribution, and $\Prob()$ to denote probabilities of specific outcomes with respect to its sampling distribution. -We will use $\Q_p(T)$ to denote the $p^{th}$ quantile of a distribution, which is the value $x$ such that $\Prob(T \leq x) = p$. With this notation, the median of a continuous distribution is equivalent to the 0.5 quantile: $\M(T) = \Q_{0.5}(T)$. - -For some simple combinations of data-generating processes and data analysis procedures, it may be possible to derive exact mathematical formulas for calculating some performance measures (such as exact mathematical expressions for the bias and variance of the linear regression estimator). -But for many problems, the math is difficult or intractable---that's why we do simulations in the first place. -Simulations do not produce the _exact_ sampling distribution or give us _exact_ values of performance measures. -Instead, simulations generate _samples_ (usually large samples) from the the sampling distribution, and we can use these to compute _estimates_ of the performance measures of interest. 
-In Figure \@ref(fig:CRT-ATE-hist), we calculated the bias of each estimator by taking the mean of 1000 observations from its sampling distribution. If we were to repeat the whole set of calculations (with a different seed), then our bias results would shift slightly because they are imperfect estimates of the actual bias. - -In working with simulation results, it is important to keep track of the degree of uncertainty in performance measure estimates. +One might use performance measures other than these, depending on what kind of estimator or data analysis procedure we are evaluating. +Perhaps it is some measure of classification accuracy such as the F1 score. +Or perhaps it is something very specific, such as the chance of an error larger than some threshold. + +Regardless of the measure chosen, the true performance of an estimator depends on the data-generating process being used. +We might see good performance against some DGPs, and bad performance against other DGPs. +Performance measures are defined with respect to the sampling distribution of our estimator, which is the distribution of values we would see across different datasets that all came from the same DGP. + +In the following, we define many of the most common measures in terms of standard statistical quantities such as the mean, variance, and other moments of the sampling distribution. +Notation-wise, for a random variable $T$ (e.g., for some estimator $T$), we will use the expectation operator $\E(T)$ to denote the mean, $\M(T)$ to denote the median, $\Var(T)$ to denote the variance, and $\Prob()$ to denote probabilities of specific outcomes, all with respect to the sampling distribution of $T$. +We also use $\Q_p(T)$ to denote the $p^{th}$ quantile of a distribution, which is the value $x$ such that $\Prob(T \leq x) = p$. +With this notation, the median of a continuous distribution is equivalent to the 0.5 quantile: $\M(T) = \Q_{0.5}(T)$.
+Using this notation we can write, for example, that the bias of $T$ is $\E(T) - \theta$, where $\theta$ is the target parameter. +We develop all of this further below. + +For some simple combinations of data-generating processes and data analysis procedures, it may be possible to derive exact mathematical formulas for some performance measures (such as exact mathematical expressions for the bias and variance of the linear regression estimator). +But for many problems, the math is difficult or intractable---which is partly why we do simulations in the first place. +Simulations, however, do not produce the _exact_ sampling distribution or give us _exact_ values of performance measures. +Instead, simulations generate _samples_ (usually large samples) from the sampling distribution, and we can use these to _estimate_ the performance measures of interest. +In Figure \@ref(fig:CRT-ATE-hist), for example, we estimated the bias of each estimator by taking the mean of 1000 observations from its sampling distribution and subtracting the true parameter value. +If we were to repeat the whole simulation (with a different seed), then our bias results would shift slightly because they are imperfect estimates of the actual bias. + +In working with simulation results, it is important to track how much uncertainty we have in our performance measure estimates. We call such uncertainty _Monte Carlo error_ because it is the error arising from using a finite number of replications of the Monte Carlo simulation process. -One way to quantify it is with the _Monte Carlo standard error (MCSE)_, or the standard error of a performance estimate based on a finite number of replications. +One way to quantify the simulation uncertainty is with the _Monte Carlo Standard Error (MCSE)_, which is the standard error of a performance estimate given a specific number of replications.
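+To make this concrete, here is a small sketch in R. The vector of estimates is simulated as a stand-in (the names and values are purely illustrative, not output from our case study):

```{r}
# Illustrative sketch: a stand-in for R = 1000 simulated estimates of a
# parameter whose true value is theta (all values here are made up).
set.seed( 40221 )
theta <- 0.3
R <- 1000
T_r <- rnorm( R, mean = 0.32, sd = 0.2 )

bias_hat <- mean( T_r ) - theta      # estimated bias
MCSE_bias <- sd( T_r ) / sqrt( R )   # Monte Carlo SE of the bias estimate
```

+The MCSE of the bias estimate shrinks at rate $1/\sqrt{R}$, which is why quadrupling the number of replications halves the Monte Carlo error.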
Just as when we analyze real data, we can apply statistical techniques to estimate the MCSE and even to generate confidence intervals for performance measures. -The magnitude of MCSE is driven by how many replications we use: if we only use a few, we will have noisy estimates of performance with large MCSEs; if we use millions of replications, the MCSE will usually be tiny. -It is important to keep in mind that the MCSE is not measuring anything about how a data analysis procedure performs in general. -It only describes how precisely we have approximated a performance criterion, an artifact of how we conducted the simulation. -Moreover, MCSEs are under our control. -Given a desired MCSE, we can determine how many replications we would need to ensure our performance estimates have the specified level of precision. +The magnitude of the MCSE is driven by how many replications we use: if we only use a few, we will have noisy estimates of performance with large MCSEs; if we use millions of replications, the MCSE will usually be tiny. +It is important to keep in mind that the MCSE is not measuring anything about how our estimator performs. +It only describes how precisely we have _estimated_ its performance; MCSE is an artifact of how we conducted the simulation. +The MCSE is also under our control: given a desired MCSE, we can determine how many replications we would need to ensure our performance estimates have an MCSE of that size. Section \@ref(MCSE) provides details about how to compute MCSEs for conventional performance measures, along with some discussion of general techniques for computing MCSE for less conventional measures. ## Measures for Point Estimators {#assessing-point-estimators} -The most common performance measures used to assess a point estimator are bias, variance, and root mean squared error. -Bias compares the mean of the sampling distribution to the target parameter.
+The most common performance measures used to assess a point estimator are **bias**, **variance**, and **root mean squared error**. + +**Bias** compares the mean of the sampling distribution to the target parameter. +It answers, "Are we systematically too high or too low, on average?" Positive bias implies that the estimator tends to systematically over-state the quantity of interest, while negative bias implies that it systematically under-shoots the quantity of interest. If bias is zero (or nearly zero), we say that the estimator is unbiased (or approximately unbiased). -Variance (or its square root, the true standard error) describes the spread of the sampling distribution, or the extent to which it varies around its central tendency. -All else equal, we would like estimators to have low variance (or to be more precise). -Root mean squared error (RMSE) is a conventional measure of the overall accuracy of an estimator, or its average degree of error with respect to the target parameter. + +**Variance** (or its square root, the **true standard error**) describes the spread of the sampling distribution, or the extent to which it varies around its central tendency. +All else equal, we would like estimators to have low variance (which means being more precise). + +**Root mean squared error (RMSE)** is a conventional measure of the overall accuracy of an estimator, or its average degree of error with respect to the target parameter. For absolute assessments of performance, an estimator with low bias, low variance, and thus low RMSE is desired. In making comparisons of several different estimators, one with lower RMSE is usually preferable to one with higher RMSE. If two estimators have comparable RMSE, then the estimator with lower bias would usually be preferable. -To define these quantities more precisely, let us consider a generic estimator $T$ that is targeting a parameter $\theta$. 
+To define these quantities more precisely, consider a generic estimator $T$ that is targeting a parameter $\theta$. We call the target parameter the _estimand_. -In most cases, in running our simulation we set the estimand $\theta$ and then generate a (typically large) series of $R$ datasets, for each of which $\theta$ is the true target parameter. -We then analyze each dataset, obtaining a sample of estimates $T_1,...,T_R$. -Formally, the bias, variance, and RMSE of $T$ are defined as +In most cases, in running our simulation we set the estimand $\theta$ and then generate a (typically large) series of $R$ datasets, with $\theta$ being the true target parameter for each. +We then analyze each dataset, obtaining a sample of estimates $T_1,...,T_R$. +Formally, the bias, variance, and RMSE of $T$ are then defined as $$ \begin{aligned} \Bias(T) &= \E(T) - \theta, \\ @@ -116,6 +146,10 @@ $$ \end{aligned} (\#eq:bias-variance-RMSE) $$ +In words, bias is the average estimate minus the truth---is it systematically too high or too low, on average? +The variance is the average squared deviation of the estimator from its own mean---how much does it vary around its average estimate? Is it precise? +RMSE is the square root of the average squared deviation of the estimator from the truth---how far off are we, on average, from the truth? Is the estimator accurate overall? + These three measures are inter-connected. In particular, RMSE is the combination of (squared) bias and variance, as in $$ @@ -137,8 +171,11 @@ $$ S_T^2 = \frac{1}{R - 1}\sum_{r=1}^R \left(T_r - \bar{T}\right)^2. (\#eq:var-estimator) $$ -$S_T$ (the square root of $S^2_T$) is an estimate of the empirical standard error of $T$, or the standard deviation of the estimator across an infinite set of replications of the data-generating process.[^SE-meaning] -We usually prefer to work with the empirical SE $S_T$ rather than the sampling variance $S_T^2$ because the former quantity has the same units as the target parameter.
+$S_T$ (the square root of $S^2_T$) is an estimate of the _true_ standard error, $SE$, of $T$, which is the standard deviation of the estimator across an infinite set of replications of the data-generating process. +Generally, when people say "Standard Error" they actually mean _estimated_ Standard Error ($\widehat{SE}$), as we would calculate in a real data analysis (where we have only a single realization of the data-generating process). +It is easy to forget that this standard error is itself an estimate of a parameter---the true SE---and thus has its own uncertainty. +We usually prefer to work with the SE estimate $S_T$ rather than the sampling variance $S_T^2$ because the former quantity has the same units as the target parameter. + Finally, the RMSE estimate can be calculated as $$ \widehat{\RMSE}(T) = \sqrt{\frac{1}{R} \sum_{r = 1}^R \left( T_r - \theta\right)^2 }. @@ -147,29 +184,31 @@ $$ Often, people talk about the MSE (Mean Squared Error), which is just the square of RMSE. Just as the true SE is usually easier to interpret than the sampling variance, units of RMSE are easier to interpret than the units of MSE. -[^SE-meaning]: Generally, when people say "Standard Error" they actually mean _estimated_ Standard Error, ($\widehat{SE}$), as we would calculate in a real data analysis (where we have only a single realization of the data-generating process). It is easy to forget that this standard error is itself an estimate of a parameter--the true or empirical SE---and thus has its own uncertainty. - -It is important to recognize that the above performance measures depend on the scale of the parameter. +The above performance measures depend on the scale of the parameter. For example, if our estimators are measuring a treatment impact in dollars, then the bias, SE, and RMSE of the estimators are all in dollars. (The variance and MSE would be in dollars squared, which is why we take their square roots to put them back on the more interpretable scale of dollars.)
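+These estimators translate directly into code. A minimal sketch in R, using a made-up vector of estimates (the names and numbers are illustrative, not from the case study):

```{r}
# Hypothetical estimates T_r from R replications, with true estimand theta.
set.seed( 40222 )
theta <- 0.3
T_r <- rnorm( 1000, mean = 0.32, sd = 0.2 )

bias_hat <- mean( T_r ) - theta               # estimated bias
S_T <- sd( T_r )                              # estimated SE (same units as theta)
RMSE_hat <- sqrt( mean( (T_r - theta)^2 ) )   # estimated RMSE
```

+Squaring `RMSE_hat` recovers the squared bias plus the (1/R-scaled) variance of the estimates, mirroring the bias-variance decomposition above.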
In many simulations, the scale of the outcome is an arbitrary feature of the data-generating process, making the absolute magnitude of performance measures less meaningful. -To ease interpretation of performance measures, it is useful to consider their magnitude relative to the baseline level of variation in the outcome. -One way to achieve this is to generate data so the outcome has unit variance (i.e., we generate outcomes in _standardized units_). -Doing so puts the bias, empirical standard error, and root mean squared error on the scale of standard deviation units, which can facilitate interpretation about what constitutes a meaningfully large bias or a meaningful difference in RMSE. +To ease interpretation of performance measures, it is useful to consider their magnitude relative to some reference amount. +We generally recommend using the baseline level of variation in the outcome. +For example, we might report a bias of $0.2\sigma$, where $\sigma$ is the standard deviation of the outcome in the population. -In addition to understanding the scale of these performance measures, it is also important to recognize that their magnitude depends on the metric of the parameter. -A non-linear transformation of a parameter will generally lead to changes in the magnitude of the performance measures. +One way to automatically get all your performance metrics on such a standardized scale is to generate data so the outcome has unit variance (i.e., we generate outcomes in _standardized units_). +This automatically puts the bias, empirical standard error, and root mean squared error on the scale of standard deviation units, which can facilitate interpretation about what constitutes a meaningfully large bias or a meaningful difference in RMSE. + +Finally, a non-linear transformation of a parameter will also generally change a performance measure. For instance, suppose that $\theta$ measures the proportion of time that something occurs. 
-One natural way to transform this parameter would be to put it on the log-odds (logit) scale. +This parameter could also be given on the log-odds (logit) scale. However, because the log-odds transformation is non-linear, $$ \text{Bias}\left[\text{logit}(T)\right] \neq \text{logit}\left(\text{Bias}[T]\right), \qquad \text{RMSE}\left[\text{logit}(T)\right] \neq \text{logit}\left(\text{RMSE}[T]\right), $$ and so on. + + ### Comparing the Performance of the Cluster RCT Estimation Procedures {#clusterRCTperformance} @@ -178,9 +217,7 @@ runs <- readRDS( file = "results/cluster_RCT_simulation.rds" ) ``` We now demonstrate the calculation of performance measures for the point estimators of average treatment effects in the cluster-RCT example. -In Chapter \@ref(running-the-simulation-process), we generated a large set of replications of several different treatment effect estimators. -Using these results, we can assess the bias, standard error, and RMSE of three different estimators of the ATE. -These performance measures address the following questions: +Our three performance measures address the following questions: - Is the estimator systematically off? (bias) - Is it precise? (standard error) @@ -192,7 +229,7 @@ Let us see how the three estimators compare on these measures. Bias is defined with respect to a target estimand. Here we assess whether our estimates are systematically different from the $\gamma_1$ parameter, which we defined in standardized units by setting the standard deviation of the student-level distribution of the outcome equal to one. -For these data, we generated data based on a school-level ATE parameter of 0.30 SDs. +In our simulation, we generated data with a school-level ATE parameter of 0.30 SDs. ```{r cluster-bias} ATE <- 0.30 @@ -206,9 +243,9 @@ runs %>% ``` There is no indication of major bias for aggregation or multi-level modeling. -Linear regression, with a bias of about 0.09 SDs, appears about ten times as biased as the other estimators. 
+Linear regression, with a bias of about 0.08 SDs, appears much more biased than the other estimators. -This is because the linear regression is targeting the person-level average average treatment effect. +This is because the linear regression is targeting the person-level average treatment effect. -The data-generating process of this simulation makes larger sites have larger effects, so the person-level average effect is going to be higher because those larger sites will count more. +Our data-generating process makes larger sites have larger effects, so the person-level average effect is going to be higher because those larger sites will count more. In contrast, our estimand is the school-level average treatment effect, or the simple average of each school's true impact, which we have set to 0.30. The aggregation and multi-level modeling methods target this school-level average effect. If we had instead decided that the target estimand should be the person-level average effect, then we would find that linear regression is unbiased whereas aggregation and multi-level modeling are biased. @@ -216,9 +253,9 @@ This example illustrates how crucial it is to think carefully about the appropri #### Which method has the smallest standard error? {-} -The empirical standard error measures the degree of variability in a point estimator. -It reflects how stable our estimates are across replications of the data-generating process. -We calculate the standard error by taking the standard deviation of the replications of each estimator. +The standard error measures the degree of variability in a point estimator. +It reflects how stable an estimator is across replications of the data-generating process. +We estimate the standard error by taking the standard deviation of the set of estimates for each estimator. For purposes of interpretation, it is useful to compare the empirical standard errors to the variation in a benchmark estimator.
Here, we treat the linear regression estimator as the benchmark and compute the magnitude of the empirical SEs of each method _relative_ to the SE of the linear regression estimator: @@ -233,7 +270,7 @@ true_SE ``` In a real data analysis, these standard errors are what we would be trying to approximate with a standard error estimator. -Aggregation and multi-level modeling have SEs about 8% smaller than linear regression. +Aggregation and multi-level modeling have SEs about 4 to 5% smaller than linear regression, meaning these estimates tend to be a bit less variable across different datasets. For these data-generating conditions, aggregation and multi-level modeling are preferable to linear regression because they are more precise. #### Which method has the smallest Root Mean Squared Error? {-} @@ -260,31 +297,38 @@ RMSE takes into account both bias and variance. For aggregation and multi-level modeling, the RMSE is the same as the standard error, which makes sense because these estimators are not biased. For linear regression, the combination of bias plus increased variability yields a higher RMSE, with the standard error dominating the bias term (note how RMSE and SE are more similar than RMSE and bias). The difference between the estimators are pronounced because RMSE is the square root of the _squared_ bias and _squared_ standard error. -Overall, aggregation and multi-level modeling have RMSEs around 17% smaller than linear regression---a consequential difference in accuracy. +Overall, aggregation and multi-level modeling have RMSEs around 10% smaller than linear regression, meaning they tend to give estimates that are closer to the truth than linear regression. -### Less Conventional Performance Measures {#less-conventional-measures} +### Robust Performance Measures {#less-conventional-measures} -Depending on the model and estimation procedures being examined, a range of different measures might be used to assess estimator performance. 
-For point estimation, we have introduced bias, variance and RMSE as three core measures of performance. + +For point estimation, we introduced bias, variance and RMSE as three core measures of performance. However, all of these measures are sensitive to outliers in the sampling distribution. -Consider an estimator that generally does well, except for an occasional large mistake. Because conventional measures are based on arithmetic averages, they will indicate that the estimator performs very poorly overall. -Other measures such as the median bias and the median absolute deviation of $T$ are less sensitive to outliers in the sampling distribution compared to the conventional measures. -Estimating these measures will involve calculating sample quantiles of $T_1,...,T_R$, which are functions of the sample ordered from smallest to largest. -We will denote the $r^{th}$ order statistic as $T_{(r)}$ for $r = 1,...,R$. +Consider an estimator that generally does well, except for an occasional large mistake. +Because conventional measures are based on arithmetic averages, they could indicate that the estimator performs very poorly overall, when in fact it performs very well most of the time and terribly only rarely. + +If we are more curious about typical performance, then we could use other measures, such as the __median bias__ and the __median absolute deviation__ of $T$, that are less sensitive to outliers in the sampling distribution. +These are called "robust" measures because they are not massively changed by only a few extreme values. +The overall approach for most robust measures is to simply chop off some percent of the most extreme values, and then look at the center or spread of the remaining values. -Median bias is an alternative measure of the central tendency of a sampling distribution. +For example, median bias is an alternative measure of the central tendency of a sampling distribution. 
+Here we "chop off" essentially all the values, focusing on the middle one only. Positive median bias implies that more than 50% of the sampling distribution exceeds the quantity of interest, while negative median bias implies that more than 50% of the sampling distribution fall below the quantity of interest. Formally, $$ \text{Median-Bias}(T) = \M(T) - \theta (\#eq:median-bias). $$ -An estimator of median bias is computed using the sample median, as +where $\M(T)$ is the median of $T$. + +We estimate median bias with the sample median, as $$ \widehat{\text{Median-Bias}}(T) = M_T - \theta (\#eq:sample-median-bias) $$ -where $M_T = T_{((R+1)/2)}$ if $R$ is odd or $M_T = \frac{1}{2}\left(T_{(R/2)} + T_{(R/2+1)}\right)$ if $R$ is even. +where, if we define $T_{(r)}$ as the $r^{th}$ order statistic for $r = 1,...,R$, we have $M_T = T_{((R+1)/2)}$ if $R$ is odd or $M_T = \frac{1}{2}\left(T_{(R/2)} + T_{(R/2+1)}\right)$ if $R$ is even. + + Another robust measure of central tendency uses the $p \times 100\%$-trimmed mean, which ignores the estimates in the lowest and highest $p$-quantiles of the sampling distribution. Formally, the trimmed-mean bias is @@ -297,17 +341,17 @@ To estimate the trimmed bias, we take the mean of the middle $1 - 2p$ fraction o $$ \widehat{\text{Trimmed-Bias}}(T; p) = \tilde{T}_{\{p\}} - \theta. (\#eq:sample-trimmed-bias) $$ -where +where (assuming $pR$ is a whole number) $$ \tilde{T}_{\{p\}} = \frac{1}{(1 - 2p)R} \sum_{r=pR + 1}^{(1-p)R} T_{(r)} $$ For a symmetric sampling distribution, trimmed-mean bias will be the same as the conventional (mean) bias, but its estimator $\tilde{T}_{\{p\}}$ will be less affected by outlying values (i.e., values of $T$ very far from the center of the distribution) compared to $\bar{T}$. -However, if a sampling distribution is not symmetric, trimmed-mean bias become distinct performance measures, which put less emphasis on large errors compared to the conventional bias measure. 
+However, if a sampling distribution is not symmetric, trimmed-mean bias will be a distinct performance measure, different from mean bias, that puts less emphasis on large errors compared to the conventional bias measure. A further robust measure of central tendency is based on winsorizing the sampling distribution, or truncating all errors larger than a certain maximum size. -Using a winsorized distribution amounts to arguing that you don't care about errors beyond a certain size, so anything beyond a certain threshold will be treated the same as if it were exactly on the threshold. -The threshold for truncation is usually defined relative to the first and third quartiles of the sampling distribution, along with a given span of the inter-quartile range. -The thresholds for truncation are taken as +A winsorized distribution amounts to arguing that you don't care about errors beyond a certain size, so anything beyond a certain threshold will be treated the same as if it were exactly on the threshold. +The threshold for truncation is often defined relative to the first and third quartiles of the sampling distribution, along with a given span of the inter-quartile range. +In this case, the thresholds for truncation would be $$ \begin{aligned} L_w &= \Q_{0.25}(T) - w \times (\Q_{0.75}(T) - \Q_{0.25}(T)) \\ @@ -315,7 +359,7 @@ U_w &= \Q_{0.75}(T) + w \times (\Q_{0.75}(T) - \Q_{0.25}(T)), \end{aligned} $$ where $\Q_{0.25}(T)$ and $\Q_{0.75}(T)$ are the first and third quartiles of the distribution of $T$, respectively, and $w$ is the number of inter-quartile ranges below which an observation will be treated as an outlier.[^fences] -Let $X = \min\{\max\{T, L_w\}, U_w\}$. +Once we have our thresholds, we let our truncated estimate be $X = \min\{\max\{T, L_w\}, U_w\}$. 
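+As a sketch of this winsorizing step in R (the estimates, $\theta$, and $w$ below are made up for illustration):

```{r}
# Hypothetical estimates: mostly well-behaved, with a few wild values.
set.seed( 40223 )
theta <- 0.3
T_r <- c( rnorm( 995, mean = 0.3, sd = 0.2 ),
          rnorm( 5, mean = 0.3, sd = 3 ) )
w <- 1.5   # number of IQRs beyond the quartiles to treat as outlying

q1 <- quantile( T_r, 0.25 )
q3 <- quantile( T_r, 0.75 )
L_w <- q1 - w * (q3 - q1)
U_w <- q3 + w * (q3 - q1)

X <- pmin( pmax( T_r, L_w ), U_w )   # winsorized estimates
winsorized_bias_hat <- mean( X ) - theta
```

+The `pmin( pmax( ... ) )` call clamps each estimate to the thresholds, so the few wild values no longer dominate the averages.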
The winsorized bias, variance, and RMSE are then defined using winsorized values in place of the raw values of $T$, as $$ \begin{aligned} @@ -343,78 +387,142 @@ The `robustbase` package [@robustbase] provides functions for calculating many o ## Measures for Variance Estimators -Statistics is concerned not only with how to estimate things, but also with understanding the extent of uncertainty in estimates of target parameters. -These concerns apply in Monte Carlo simulation studies as well. -In a simulation, we can simply compute an estimator's actual properties. -When we use an estimator with real data, we need to _estimate_ its associated standard error and generate confidence intervals and other assessments of uncertainty. -To understand if these uncertainty assessments work in practice, we need to evaluate not only the behavior of the estimator itself, but also the behavior of these associated quantities. - -Commonly used measures for quantifying the performance of estimated standard errors include relative bias, relative standard error, and relative root mean squared error. -These measures are defined in relative terms (rather than absolute ones) by comparing their magnitude to the _true_ degree of uncertainty. -Typically, performance measures are computed for _variance_ estimators rather than standard error estimators. -There are a few reasons for working with variance rather than standard error. -First, in practice, so-called unbiased standard errors usually are not actually unbiased.[^Jokes-on-us] +Statistics is concerned not only with how to estimate things (e.g., point estimates), but also with understanding how well we have estimated them (e.g., standard errors). +Monte Carlo simulation studies can help us evaluate both these concerns. +The prior section focused on assessing how well our estimators work. +In this section we focus on assessing how well our uncertainty assessments work. +We focus on the standard error. 
+ +The first order question is whether our estimated standard errors are __calibrated__, meaning they are the right size, on average. +The second order question is whether our estimated standard errors are __stable__, meaning they do not vary too much from one dataset to another. +For this question we can think of the standard errors in terms of effective degrees of freedom, or look at the standard error (or RMSE) of the standard error. + + + + +### Calibration + +Calibration is a relative measure, where we look at the ratio of the expected (average) value of our uncertainty estimator to the true uncertainty. +The relative measure removes scale, which is important because the absolute magnitude of uncertainty will depend on many features of the data-generating process, such as sample size. +In particular, across simulation scenarios, even those with the same target parameter $\theta$, the true standard error of our estimator $T$ might change considerably. +With a relative measure, we will, even under such conditions, end up with statements such as "Our standard errors are 10% too large, on average," which are easy to interpret. + +Typically, we assess performance for _variance_ estimators ($\widehat{SE^2}$) rather than standard error estimators. +There are a few reasons it is more common to work with variance rather than standard error. +First, in practice, so-called unbiased standard errors usually are not actually unbiased, while variance estimators might be.[^Jokes-on-us] For linear regression, for example, the classic standard error estimator is an unbiased _variance_ estimator, but the standard error estimator is not exactly unbiased because $$ \E[ \sqrt{ V } ] \neq \sqrt{ \E[ V ] }. $$ -Variance is also the measure that gives us the bias-variance decomposition of Equation \@ref(eq:RMSE-decomposition). Thus, if we are trying to determine whether MSE is due to instability or systematic bias, operating in this squared space may be preferable. 
+Variance is also the measure that gives us the bias-variance decomposition of Equation \@ref(eq:RMSE-decomposition).
+Thus, if we are trying to determine whether an overall MSE is due to instability or systematic bias, operating in this squared space may be preferable.

[^Jokes-on-us]: See the delightfully titled section 11.5, "The Joke Is on Us: The Standard Deviation Estimator is Biased after All," in @westfall2013understanding for further discussion.

-To make this concrete, let us consider a generic standard error estimator $\widehat{SE}$ to go along with our generic estimator $T$ of target parameter $\theta$, and let $V = \widehat{SE}^2$.
-We can simulate to obtain a large sample of standard errors, $\widehat{SE}_1,...,\widehat{SE}_R$ and variance estimators $V_r = \widehat{SE}_r^2$ for $r = 1,...,R$.
-Formally, the relative bias, standard error, and RMSE of $V$ are defined as
+To make assessing calibration concrete, consider a generic standard error estimator $\widehat{SE}$ to go along with our generic estimator $T$ of target parameter $\theta$, and let $V = \widehat{SE}^2$ be the corresponding variance estimator.
+
+In our simulation we obtain a large sample of standard errors, $\widehat{SE}_1,...,\widehat{SE}_R$, and corresponding variance estimators $V_r = \widehat{SE}_r^2$ for $r = 1,...,R$.
+Formally, the __relative bias__ or __calibration__ of $V$ is then defined as
$$
\begin{aligned}
\text{Relative Bias}(V) &= \frac{\E(V)}{\Var(T)}. \\
-\text{Relative SE}(V) &= \frac{\Var(V)}{\Var(T)} \\
-\text{Relative RMSE}(V) &= \frac{\sqrt{\E\left[\left(V - \Var(T)\right)^2 \right]}}{\Var(T)}.
\end{aligned}
(\#eq:relative-bias-SE-RMSE)
$$
-In contrast to performance measures for $T$, we define these measures in relative terms because the raw magnitude of $V$ is not a stable or interpretable parameter.
+Relative bias describes whether the central tendency of $V$ aligns with the actual degree of uncertainty in the point estimator $T$.
+We say our variance estimator $V$ is _calibrated_ if the relative bias is close to one.
+When calibrated, $V$ accurately reflects how much uncertainty there actually is, on average.
+If relative bias is less than one, $V$ is _anti-conservative_, meaning it tends to under-estimate the true variance of $T$.
+This is usually considered quite bad.
+If it is more than one, it is _conservative_, meaning it tends to over-estimate the true variance of $T$.
+This is usually considered non-ideal, but livable.
+
+Calibration also has implications for the performance of other uncertainty assessments such as hypothesis tests and confidence intervals.
+A relative bias of less than one implies that $V$ tends to under-state the amount of uncertainty, which will lead to confidence intervals that are overly narrow and do not cover the true parameter value at the desired rate.
+It will also lead to rejecting the null more often than desired, which elevates Type I error.
+Relative bias greater than one implies that $V$ tends to over-state the amount of uncertainty in the point estimator, making confidence intervals too wide and null hypotheses hard to reject.
+
+
+

-To estimate these relative performance measures, we proceed by substituting sample quantities in place of the expectations and variances.
-In contrast to the performance measures for $T$, we will not generally be able to compute the true degree of uncertainty exactly.
-Instead, we must estimate the target quantity $\Var(T)$ using $S_T^2$, the sample variance of $T$ across replications.
-Denoting the arithmetic mean of the variance estimates as
+To estimate relative bias, simply substitute sample quantities in place of $\E(V)$ and $\Var(T)$.
+Unlike with performance measures for $T$, where we know the target quantity $\theta$, here we do not know $\Var(T)$ exactly.
+Instead, we must estimate it using $S_T^2$, the sample variance of $T$ across replications.
+Denoting the arithmetic mean of the variance estimates across simulation trials as
$$
\bar{V} = \frac{1}{R} \sum_{r=1}^R V_r
$$
-and the sample variance as
-$$
-S_V^2 = \frac{1}{R - 1}\sum_{r=1}^R \left(V_r - \bar{V}\right)^2,
-$$
-we estimate the relative bias, standard error, and RMSE of $V$ using
+we estimate the relative bias (calibration) of $V$ using
$$
\begin{aligned}
-\widehat{\text{Relative Bias}}(V) &= \frac{\bar{V}}{S_T^2} \\
-\widehat{\text{Relative SE}}(V) &= \frac{S_V}{S_T^2} \\
-\widehat{\text{Relative RMSE}}(V) &= \frac{\sqrt{\frac{1}{R}\sum_{r=1}^R\left(V_r - S_T^2\right)^2}}{S_T^2}.
+\widehat{\text{Relative Bias}}(V) &= \frac{\bar{V}}{S_T^2}. \\
\end{aligned}
(\#eq:relative-bias-SE-RMSE-estimators)
$$
-These performance measures are informative about the properties of the uncertainty estimator $V$ (or standard error $\widehat{SE}$), which have implications for the performance of other uncertainty assessments such as hypothesis tests and confidence intervals.
+If we then take the square root of our estimate, we can obtain an (approximate) estimate of the relative bias of the standard error.
+
+
+
+
+
+### Stability {#performance-stability}
+
+Beyond calibration, we also care about the stability of our variance estimator.
+Are the estimates of uncertainty $V$ close to the true uncertainty $\Var(T)$, from dataset to dataset?
+
+We offer two ways of assessing this.
+The first is to look at the relative SE or RMSE of the variance estimator $V$.
+The second is to assess the effective "degrees of freedom" of $V$, which reflects how one would account for the uncertainty in $V$ itself when generating confidence intervals or conducting hypothesis tests (this is what a degrees of freedom correction does---for example, when $V$ is unstable, as represented by low degrees of freedom, we hedge against it being randomly too small by making our confidence intervals systematically larger).
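To make these calculations concrete, here is a toy mini-simulation (hypothetical, and separate from the cluster RCT case study) that checks the calibration and stability of the usual variance estimator for a sample mean, where the true sampling variance is known to be $\sigma^2/n$:

```r
# Toy check: T is the mean of n normal draws, and V = s^2 / n is its
# estimated variance. The true sampling variance of T is sigma^2 / n = 0.2.
set.seed(1234)
R <- 10000; n <- 20; sigma <- 2

res <- replicate(R, {
  y <- rnorm(n, mean = 0, sd = sigma)
  c(T = mean(y), V = var(y) / n)
})
T_hat <- res["T", ]
V <- res["V", ]

calib <- mean(V) / var(T_hat)   # relative bias: should be close to 1
rel_SE <- sd(V) / var(T_hat)    # stability: in theory sqrt(2/(n-1)), about 0.32
c(calib = calib, rel_SE = rel_SE)
```

Here $V$ is exactly unbiased, so the estimated calibration should land within Monte Carlo error of one; in messier problems, such as the cluster RCT example, it need not.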
+The relative variance and MSE of $V$ as an estimate of $\Var(T)$ are the following:
+$$
+\begin{aligned}
+\text{Relative Var}(V) &= \frac{\Var(V)}{\Var(T)^2} \\
+\text{Relative MSE}(V) &= \frac{\E\left[\left(V - \Var(T)\right)^2 \right]}{\Var(T)^2}.
+\end{aligned}
+$$
+As usual, we would generally work with the square root of these values (to get relative SE and relative RMSE).

Relative standard errors describe the variability of $V$ in comparison to the true degree of uncertainty in $T$---the lower the better.
-A relative standard error of 0.5 would mean that the variance estimator has average error of 50% of the true uncertainty, implying that $V$ will often be off by a factor of 2 compared to the true sampling variance of $T$.
-Ideally, a variance estimator will have small relative bias, small relative standard errors, and thus small relative RMSE.
+A relative standard error of 0.5 would mean that the variance estimator has average error of 50% of the true uncertainty, implying that $V$ could often be off by a factor of 2---this would be very bad.
+Ideally, a variance estimator will have relative bias close to one (i.e., it is calibrated), small relative standard error (i.e., it is stable), and thus a small relative RMSE.
+
+We can estimate the relative variance and MSE simply by plugging in estimates of the various quantities:
+$$
+\begin{aligned}
+\widehat{\text{Relative Var}}(V) &= \frac{S^2_V}{\left(S_T^2\right)^2} \\
+\widehat{\text{Relative MSE}}(V) &= \frac{\frac{1}{R}\sum_{r=1}^R\left(V_r - S_T^2\right)^2}{\left(S_T^2\right)^2},
+\end{aligned}
+(\#eq:relative-Var-RMSE-estimators)
+$$
+where the sample variance of $V$ is
+$$
+S_V^2 = \frac{1}{R - 1}\sum_{r=1}^R \left(V_r - \bar{V}\right)^2.
+$$

-### Satterthwaite degrees of freedom

Another more abstract measure of the stability of a variance estimator is its Satterthwaite degrees of freedom.
-For some simple statistical models such as classical analysis of variance and linear regression with homoskedastic errors, the variance estimator is computed by taking a sum of squares of normally distributed errors.
-In such cases, the sampling distribution of the variance estimator is a multiple of a $\chi^2$ distribution, with degrees of freedom corresponding to the number of independent observations used to compute the sum of squares.
-In the context of analysis of variance problems, @Satterthwaite1946approximate described a method of approximating the variability of more complex statistics, involving linear combinations of sums of squares, by using a chi-squared distribution with a certain degrees of freedom.
-When applied to an arbitrary variance estimator $V$, these degrees of freedom can be interpreted as the number of independent, normally distributed errors going into a sum of squares that would lead to a variance estimator that is equally precise as $V$.
+For some simple statistical models such as classical analysis of variance and linear regression with homoskedastic errors, the variance estimator is a sum of squares of normally distributed errors.
+In such cases, the sampling distribution of the variance estimator is a multiple of a $\chi^2$ distribution, with degrees of freedom corresponding to the number of independent observations used to compute the sum of squares.
+This is what leads to the exact $t$ statistics used for inference in these models.
+
+In the context of analysis of variance problems, @Satterthwaite1946approximate described a method of extending this idea by approximating the variability of more complex statistics, involving linear combinations of sums of squares, by using a chi-squared distribution with a certain number of degrees of freedom.
+When applied to an arbitrary variance estimator $V$, these degrees of freedom can be interpreted as the number of independent, normally distributed errors going into a sum of squares that would lead to a variance estimator that is as precise as $V$.
More succinctly, these degrees of freedom correspond to the number of independent observations used to estimate $V$.

Following @Satterthwaite1946approximate, we define the degrees of freedom of $V$ as
@@ -430,9 +538,9 @@ $$
For simple statistical methods in which $V$ is based on a sum-of-squares of normally distributed errors, the Satterthwaite degrees of freedom will be constant and correspond exactly to the number of independent observations in the sum of squares.
Even with more complex methods, the degrees of freedom are interpretable: higher degrees of freedom imply that $V$ is based on more observations, and thus will be a more precise estimate of the actual degree of uncertainty in $T$.

-### Assessing SEs for the Cluster RCT Simulation
+### Assessing SEs for the Cluster RCT Simulation {#assess-cluster-RCT-SEs}

-Returning to the cluster RCT example, we will assess whether our estimated SEs are about right by comparing the average _estimated_ (squared) standard error versus the empirical sampling variance.
+Returning to the cluster RCT example, we will assess whether our estimated SEs are about right by comparing the average _estimated_ (squared) standard error to the empirical sampling variance.
Our standard errors are _inflated_ if they are systematically larger than they should be, across the simulation runs.
We will also look at how stable our variance estimates are by comparing their standard deviation to the empirical sampling variance and by computing the Satterthwaite degrees of freedom.
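As a quick sanity check of this interpretation (a toy example, not part of the case study): the sample variance of $n$ independent normal observations is proportional to a $\chi^2_{n-1}$ variate, so the Satterthwaite formula $2\E(V)^2/\Var(V)$ should recover roughly $n - 1$ degrees of freedom:

```r
# The sample variance of n normal draws is distributed as
# sigma^2 * chi-squared_{n-1} / (n - 1), so its Satterthwaite
# degrees of freedom should come out close to n - 1.
set.seed(42)
R <- 50000; n <- 10
V <- replicate(R, var(rnorm(n)))

df_satt <- 2 * mean(V)^2 / var(V)
df_satt   # should be close to n - 1 = 9
```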
@@ -444,20 +552,21 @@ SE_performance <- summarise( SE_sq = var( ATE_hat ), V_bar = mean( V ), - rel_bias = V_bar / SE_sq, + calib = V_bar / SE_sq, S_V = sd( V ), rel_SE_V = S_V / SE_sq, df = 2 * mean( V )^2 / var( V ) ) -SE_performance +SE_performance %>% + knitr::kable( digits = c(0,3,3,2,3,2,1) ) ``` -The variance estimators for the aggregation estimator and multilevel model estimator appear to be a bit conservative on average, with relative bias of around `r round(SE_performance$rel_bias[1], 2)`, or about `r round(100 * SE_performance$rel_bias[1]) - 100`% higher than the true sampling variance. +The variance estimators all appear to be a bit conservative on average, with relative bias of around `r round(mean(SE_performance$calib), 2)`, or about `r round(100 * mean(SE_performance$calib)) - 100`% higher than the true sampling variance. The column labelled `rel_SE_V` reports how variable the variance estimators are relative to the true sampling variances of the estimators. The column labelled `df` reports the Satterthwaite degrees of freedom of each variance estimator. Both of these measures indicate that the linear regression variance estimator is less stable than the other methods, with around `r round(diff(SE_performance$df[2:1]))` fewer degrees of freedom. -The linear regression method uses a cluster-robust variance estimator, which is known to be a bit unstable [@cameronPractitionerGuideClusterRobust2015]. +The linear regression method uses a cluster-robust variance estimator, which is known to be unstable with few clusters [@cameronPractitionerGuideClusterRobust2015]. Overall, it is a bad day for linear regression. ## Measures for Confidence Intervals @@ -469,19 +578,21 @@ However, with the exception of some simple methods and models, methods for const We typically measure confidence interval performance along two dimensions: __coverage rate__ and __expected width__. 
Suppose that the confidence interval is for the target parameter $\theta$ and has intended coverage level $\beta$ for $0 < \beta < 1$.
Denote the lower and upper end-points of the $\beta$-level confidence interval as $A$ and $B$.
-$A$ and $B$ are random quantities---they will differ each time we compute the interval on a different replication of the data-generating process.
+$A$ and $B$ are random quantities---they will differ each time we compute the interval on a different dataset.

The coverage rate of a $\beta$-level interval estimator is the probability that it covers the true parameter, formally defined as
$$
\text{Coverage}(A,B) = \Prob(A \leq \theta \leq B).
(\#eq:coverage)
$$
For a well-performing interval estimator, $\text{Coverage}$ will be at least $\beta$ and, ideally, will not exceed $\beta$ by too much.
+
The expected width of a $\beta$-level interval estimator is the average difference between the upper and lower endpoints, formally defined as
$$
\text{Width}(A,B) = \E(B - A).
(\#eq:expected-width)
$$
-Smaller expected width means that the interval tends to be narrower, on average, and thus more informative about the value of the target parameter.
+Smaller expected width means that the interval tends to be narrower, on average, and thus more informative about the value of the target parameter.
+Expected width is related to the average estimated standard error, but also includes how much wider we need to make our interval if we have a degree-of-freedom correction.

In practice, we approximate the coverage and width of a confidence interval by summarizing across replications of the data-generating process.
Let $A_r$ and $B_r$ denote the lower and upper end-points of the confidence interval from simulation replication $r$, and let $W_r = B_r - A_r$, all for $r = 1,...,R$.
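These two measures can be estimated directly from the simulated intervals. Here is a minimal sketch (a hypothetical example using the classic $t$-interval for a normal mean, not the cluster RCT):

```r
# Estimate coverage and expected width of the 95% t-interval
# for the mean of n normal observations, with true mean theta = 0.
set.seed(2023)
R <- 10000; n <- 15; theta <- 0

CIs <- replicate(R, {
  y <- rnorm(n, mean = theta, sd = 1)
  as.numeric(t.test(y, conf.level = 0.95)$conf.int)
})
A <- CIs[1, ]
B <- CIs[2, ]

coverage <- mean(A <= theta & theta <= B)   # should be close to 0.95
width <- mean(B - A)
c(coverage = coverage, width = width)
```

Because the $t$-interval is exact for this model, the estimated coverage should land within Monte Carlo error of 0.95.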
@@ -502,7 +613,7 @@ For example, a conventional Wald-type confidence interval is centered on a point $$ A = T - c \times \widehat{SE}, \quad B = T + c \times \widehat{SE} $$ -for some critical value $c$ (e.g.,for a normal critical value with a $\beta = 0.95$ confidence level, $c = `r round(qnorm(0.975), 3)`$). +for some critical value $c$ (e.g., for a normal critical value with a $\beta = 0.95$ confidence level, $c = `r round(qnorm(0.975), 3)`$). Because of these connections, confidence interval coverage will often be closely related to the performance of the point estimator and variance estimator. Biased point estimators will tend to have confidence intervals with coverage below the desired level because they are not centered in the right place. Likewise, variance estimators that have relative bias below 1 will tend to produce confidence intervals that are too short, leading to coverage below the desired level. @@ -531,19 +642,20 @@ runs_CIs %>% ) ``` -The coverage rate is close to the desired level of 0.95 for the multilevel model and aggregation estimators, but it is around 5 percentage points too low for linear regression. +The coverage rate is close to the desired level of 0.95 for the multilevel model and aggregation estimators, but it is a bit too low for linear regression. The lower-than-nominal coverage level occurs because of the bias of the linear regression point estimator. The linear regression confidence intervals are also a bit wider than the other methods due to the larger sampling variance of its point estimator and higher variability (lower degrees of freedom) of its standard error estimator. The normal Wald-type confidence intervals we have examined here are based on fairly rough approximations. In practice, we might want to examine more carefully constructed intervals such as ones that use critical values based on $t$ distributions or ones constructed by profile likelihood. 
Especially in scenarios with a small or moderate number of clusters, such methods might provide better intervals, with coverage closer to the desired confidence level.
+In our earlier assessment of the variance estimators (see Section \@ref(assess-cluster-RCT-SEs)), we also found that the linear regression method had lower degrees of freedom, meaning it might produce notably wider intervals if they were calculated more carefully.
See Exercise \@ref(cluster-RCT-t-confidence-intervals).

## Measures for Inferential Procedures (Hypothesis Tests) {#assessing-inferential-procedures}

Hypothesis testing entails first specifying a null hypothesis, such as that there is no difference in average outcomes between two experimental groups.
One then collects data and evaluates whether the observed data is compatible with the null hypothesis.
-Hypothesis test results are often describes in terms of a $p$-value, which measures how extreme or surprising a feature of the observed data (a test statistic) is relative to what one would expect if the null hypothesis is true.
+Hypothesis test results are often described in terms of a $p$-value, which measures how extreme or surprising a feature of the observed data (a test statistic) is relative to what one would expect if the null hypothesis is true.
A small $p$-value (such as $p < .05$ or $p < .01$) indicates that the observed data would be unlikely to occur if the null is true, leading the researcher to reject the null hypothesis.
Alternately, testing procedures might be formulated by comparing a test statistic to a specified critical value; a test statistic exceeding the critical value would lead the researcher to reject the null.
@@ -551,6 +663,7 @@ Hypothesis testing procedures aim to control the level of false positives, corre
The level of a testing procedure is often denoted as $\alpha$, and it has become conventional in many fields to conduct tests with a level of $\alpha = .05$.[^significance]
Just as in the case of confidence intervals, hypothesis testing procedures can sometimes be developed that will have false positive rates exactly equal to the intended level $\alpha$.
However, in many other problems, hypothesis testing procedures involve approximations or assumption violations, so that the actual rate of false positives might deviate from the intended $\alpha$-level.
+Knowing the extent of this risk is important.
When we evaluate a hypothesis testing procedure, we are concerned with two primary measures of performance: *validity* and *power*.

[^significance]: The convention of using $\alpha = .05$ does not have a strong theoretical rationale. Many scholars have criticized the rote application of this convention and argued for using other $\alpha$ levels. See @Benjamin2017redefine and @Lakens2018justify for spirited arguments about choosing $\alpha$ levels for hypothesis testing.

@@ -559,7 +672,7 @@ When we evaluate a hypothesis testing procedure, we are concerned with two prima

Validity pertains to whether we erroneously reject a true null more than we should.
An $\alpha$-level testing procedure is valid if it has no more than an $\alpha$ chance of rejecting the null, when the null is true.
-If we were using the conventional $\alpha = .05$ level, then a valid testing procedure will reject the null in only 50 of 1000 replications of a data-generating process where the null hypothesis actually holds true.
+If we were using the conventional $\alpha = .05$ level, then a valid testing procedure will reject the null in no more than about 50 of 1000 replications of a data-generating process where the null hypothesis actually holds true.
To assess validity, we will need to specify a data generating process where the null hypothesis holds (e.g., where there is no difference in average outcomes between experimental groups). We then generate a large series of data sets with a true null, conduct the testing procedure on each dataset and record the $p$-value or critical value, then score whether we reject the null hypothesis. @@ -580,26 +693,38 @@ In other words, we will need to ensure that the null hypothesis is violated (and The process of evaluating the power of a testing procedure is otherwise identical to that for evaluating its validity: generate many datasets, carry out the testing procedure, and track the rate at which the null hypothesis is rejected. The only difference is the _conditions_ under which the data are generated. -We find it useful to think of power as a _function_ rather than as a single quantity because its absolute magnitude will generally depend on the sample size of a dataset and the magnitude of the effect of interest. +We often find it useful to think of power as a _function_ rather than as a single quantity because its absolute magnitude will generally depend on the sample size of a dataset and the magnitude of the effect of interest. Because of this, power evaluations will typically involve examining a _sequence_ of data-generating scenarios with varying sample size or varying effect size. -Further, if our goal is to evaluate several different testing procedures, the absolute power of a procedure will be of less concern than the _relative_ performance of one procedure compared to another. +If we are actually planning a future analysis, we would then look to determine what sample size or effect size would give us 80% power (the traditional target power used for planning in many cases). +See Chapter \@ref(sec:power) for more on power analysis for study planning. 
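The idea of power as a function can be sketched by estimating rejection rates over a grid of effect sizes (a toy one-sample $t$-test example, separate from the case study; the function name is illustrative):

```r
# Estimate power of the one-sample t-test by simulation,
# at each point on a grid of effect sizes.
power_at <- function(effect, n = 25, reps = 2000, alpha = 0.05) {
  rejections <- replicate(reps, {
    y <- rnorm(n, mean = effect, sd = 1)
    t.test(y)$p.value < alpha
  })
  mean(rejections)
}

set.seed(20240)
effects <- c(0, 0.2, 0.4, 0.6, 0.8)
power_curve <- sapply(effects, power_at)
round(power_curve, 2)   # rises from about alpha toward 1
```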
+Separately, if our goal is to compare testing procedures, then absolute power at substantively large effects is often less informative than relative performance near the null.
+In this setting, attention naturally shifts to infinitesimal power---that is, how quickly power increases as we move from a null effect to very small alternatives.
+Examining infinitesimal power helps identify which tests are most sensitive to small departures from the null in a more standardized way.
+
+

-

+
+
+

### Rejection Rates

-When evaluating either validity or power, the main performance measure is the __rejection rate__ of the hypothesis test. Letting $P$ be the p-value from a procedure for testing the null hypothesis that a parameter $\theta = 0$, generated under a data-generating process with parameter $\theta$ (which could in truth be zero or non-zero). The rejection rate is then
+When evaluating either validity or power, the main performance measure is the __rejection rate__ of the hypothesis test.
+Let $P$ be the $p$-value from a procedure for testing the null hypothesis that a parameter $\theta = 0$, where the data are generated under a process with true parameter $\theta$ (which could, in truth, be zero or non-zero).
+The rejection rate is then
$$
\rho_\alpha(\theta) = \Prob(P < \alpha).
(\#eq:rejection-rate)
$$
-When data are simulated from a process in which the null hypothesis is true, then the rejection rate is equivalent to the Type-I error rate of the test, which should ideally be near the desired $\alpha$ level.
-When the data are simulated from a process in which the null hypothesis is violated, then the rejection rate is equivalent to the power of the test (for the given alternate hypothesis specified in the data-generating process).
-Ideally, a testing procedure should have actual Type-I error equal to the nominal level $\alpha$ (this is the definition of validity), but such exact tests are rare.
+ +When data are simulated from a process in which the null hypothesis is true (i.e., $\theta$ really does equal 0), then the rejection rate is equivalent to the Type-I error rate of the test. +Ideally, a testing procedure should have actual Type-I error exactly equal to the nominal level $\alpha$. +When the data are simulated from a process in which the null hypothesis is violated ($\theta \neq 0$), then the rejection rate is equivalent to the power of the test (for the given scenario defined by the DGP and the alternate hypothesis specified by the value of $\theta$). To estimate the rejection rate of a test, we calculate the proportion of replications where the test rejects the null hypothesis. -Letting $P_1,...,P_R$ be the p-values simulated from $R$ replications of a data-generating process with true parameter $\theta$, we estimate the rejection rate by calculating +Letting $P_1,...,P_R$ be the $p$-values simulated from $R$ replications of a data-generating process with true parameter $\theta$, we estimate the rejection rate by calculating $$ r_\alpha(\theta) = \frac{1}{R} \sum_{r=1}^R I(P_r < \alpha). (\#eq:rejection-rate-estimate) @@ -612,13 +737,18 @@ When simulating from a data-generating process where the null hypothesis holds, Methodologists hold a variety of perspectives on how close the actual Type-I error rate should be in order to qualify as suitable for use in practice. Following a strict statistical definition, a hypothesis testing procedure is said to be __level-$\alpha$__ if its actual Type-I error rate is _always_ less than or equal to $\alpha$, for any specific conditions of a data-generating process. Among a collection of level-$\alpha$ testing procedures, we would prefer the one with highest power. If looking only at null rejection rates, then the test with Type-I error closest to $\alpha$ would usually be preferred. 
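A minimal sketch of this calculation (a toy example: $p$-values from the one-sample $t$-test under a true null, together with the Monte Carlo standard error of the estimated rate):

```r
# Simulate R p-values under a true null (theta = 0) and estimate
# the rejection rate at alpha = 0.05.
set.seed(7)
R <- 5000; n <- 30; alpha <- 0.05

p_values <- replicate(R, t.test(rnorm(n, mean = 0))$p.value)

reject_rate <- mean(p_values < alpha)
mcse <- sqrt(reject_rate * (1 - reject_rate) / R)
c(reject_rate = reject_rate, MCSE = mcse)
```

Because the $t$-test is exact here, the estimated rate should sit within a few Monte Carlo standard errors of 0.05.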
-However, some scholars prefer to use a less stringent criterion, where the Type-I error rate of a testing procedure would be considered acceptable if it is within 50\% of the desired $\alpha$ level. + + + + + ### Inference in the Cluster RCT Simulation -Returning to the cluster RCT simulation, we will evaluate the validity and power of hypothesis tests for the average treatment effect based on each of the three estimation methods. +Returning to the cluster RCT simulation, we evaluate the validity and power of hypothesis tests for the average treatment effect based on each of the three estimation methods. The data used in previous sections of the chapter was simulated under a process with a non-null treatment effect parameter (equal to 0.3 SDs), so the null hypothesis of zero average treatment effect does not hold. Thus, the rejection rates for this scenario correspond to estimates of power. We compute the rejection rate for tests with an $\alpha$ level of $.05$: @@ -631,28 +761,21 @@ runs %>% For this particular scenario, none of the tests have especially high power, and the linear regression estimator apparently has higher power than the aggregation method and the multi-level model. -To make sense of this power pattern, we need to also consider the validity of the testing procedures. -We can do so by re-running the simulation using code we constructed in Chapter \@ref(running-the-simulation-process) using the `simhelpers` package. -To evaluate the Type-I error rate of the tests, we will set the average treatment effect parameter to zero by specifying `ATE = 0`: +To make sense of this power pattern, we should consider the validity of the testing procedures. 
+We can do so by running a new simulation using the function we wrote in Section \@ref(one-run-reparameterization), but with an average treatment effect of zero (i.e., we make the null hypothesis true), elevated `size_coef` and `alpha` parameters to induce additional treatment effect heterogeneity, and a larger number of clusters to get more stable estimation:

-```{r secret-run-cluster-rct, include=FALSE}
-library(simhelpers)
-source("case_study_code/gen_cluster_RCT.R")
-source("case_study_code/analyze_cluster_RCT.R")
-sim_cluster_RCT <- bundle_sim( gen_cluster_RCT, estimate_Tx_Fx, id = "runID" )
+```{r secret-run-null-cluster-rct, include=FALSE}
+source("case_study_code/clustered_data_simulation.R")

if ( !file.exists("results/cluster_RCT_simulation_validity.rds" )) {
  tictoc::tic() # Start the clock!
-  set.seed( 404044 )
-  runs_val <- sim_cluster_RCT(
-    reps = 1000,
-    J = 20, n_bar = 30, alpha = 0.75,
-    gamma_1 = 0, gamma_2 = 0.5,
-    sigma2_u = 0.2, sigma2_e = 0.8
-  )
-
+  set.seed( 4404044 )
+  runs_val <- run_CRT_sim( reps = 1000,
+                           n_bar = 30, J = 30,
+                           ATE = 0, ICC = 0.2,
+                           size_coef = 0.75, alpha = 0.75 )
  tictoc::toc()

  saveRDS( runs_val, file = "results/cluster_RCT_simulation_validity.rds" )
@@ -664,39 +787,58 @@ if ( !file.exists("results/cluster_RCT_simulation_validity.rds" )) {
```

```{r run-cluster-rct, eval=FALSE}
-set.seed( 404044 )
-runs_val <- sim_cluster_RCT(
-  reps = 1000,
-  J = 20, n_bar = 30, alpha = 0.75,
-  gamma_1 = 0, gamma_2 = 0.5,
-  sigma2_u = 0.2, sigma2_e = 0.8
-)
+set.seed( 4404044 )
+runs_val <- run_CRT_sim( reps = 1000,
+                         n_bar = 30, J = 30,
+                         ATE = 0, ICC = 0.2,
+                         size_coef = 0.75, alpha = 0.75 )
```

-Assessing validity involves repeating the exact same rejection rate calculations as we did for power:
+Assessing validity involves repeating the exact same rejection rate calculations as we did for power.
+We also estimate the bias of the point estimator, since bias can affect the validity of a testing procedure:

```{r demo-calc-validity}
runs_val %>%
-  group_by( estimator ) %>%
-  summarise( power = mean( p_value <= 0.05 ) )
+  group_by( method ) %>%
+  summarise( bias = mean(ATE_hat),
+             power = mean( p_value <= 0.05 ) ) %>%
+  knitr::kable( digits = 3 )
```

-The Type-I error rates of the tests for the aggregation and multi-level modeling approaches are around 5%, as desired.
-The test for the linear regression estimator has Type-I error above the specified $\alpha$-level due to the upward bias of the point estimator used in constructing the test.
-The elevated rejection rate might be part of the reason that the linear regression test has higher power than the other procedures.
-It is not entirely fair to compare the power of these testing procedures, because one of them has Type-I error in excess of the desired level.
-
+
+The aggregation and multi-level modeling approaches both seem to reject a bit below 0.05, at 0.04, while linear regression rejects at about 0.10.
+The test for the linear regression estimator has an elevated Type-I error, probably driven by the upward bias of the point estimator used in constructing the test (see the bias column).
+The elevated rejection rate due to bias might then be part of the reason that the linear regression test had higher power than the other procedures in our power example, above.
+If so, it is not entirely fair to compare the power of the three testing procedures, because one of them has Type-I error in excess of the desired level.

As discussed above, linear regression targets the person-level average treatment effect.
-To see if this is why the linear regression test has an inflated Type-I error rate, we could re-run the simulation using settings where both the school-level and person-level average effects are truly zero.
+In the scenario we just simulated for evaluating validity, the person-level average effect is not zero even though our cluster-average effect is zero, because we have specified a non-zero impact heterogeneity parameter (`size_coef = 0.75`) along with school size variation (`alpha = 0.75`), meaning that the school-specific treatment effects vary around 0.
+To see if this is why the linear regression test has an inflated Type-I error rate, we could re-run the simulation using settings where both the school-level and person-level average effects are truly zero by setting `size_coef = 0`.
+
+

## Relative or Absolute Measures? {#sec-relative-performance}

+A __relative measure__ compares a performance measure to the target quantity by taking a ratio.
+For example, __relative bias__ is the expected value of an estimator divided by the true parameter value.
+As we saw above when evaluating variance estimators, relative measures can be very powerful, giving statements such as "this estimator's standard errors are 10% too big, on average."
+In their best case, they are easy to interpret, since they are unitless.
+That said, they also suffer from some notable drawbacks.
+
+In particular, ratios can be deceiving when the denominator is near zero.
+This can be a problem when, for example, examining bias relative to a benchmark estimator.
+They can also be problematic when the denominator (i.e., the reference value) can take on either negative or positive values.
+Finally, if the reference value changes unexpectedly with changes in the simulation context, then relative measures can be misleading.
+
+We next discuss these issues in more detail, and provide guidance on when to use relative versus absolute performance measures.
+ +### Performance relative to the target parameter + In considering performance measures for point estimators, we have defined the measures in terms of differences (bias, median bias) and average deviations (variance and RMSE), all of which are on the scale of the target parameter. -In contrast, for evaluating estimated standard errors we have defined measures in relative terms, calculated as _ratios_ of the target quantity rather than as differences. -In the latter case, relative measures are justified because the target quantity (the true degree of uncertainty) is always positive and is usually strongly affected by design parameters of the data-generating process. +In contrast, for evaluating estimated standard errors we have defined measures in relative terms, calculated as _ratios_ of the average estimate to the target quantity, rather than as differences. +In the latter case, relative measures are justified because the target quantity (the true degree of uncertainty) is always positive and is usually strongly affected by design parameters of the data-generating process. Is it ever reasonable to use relative measures for point estimators? If so, how should we decide whether to use relative or absolute measures? -Many published simulation studies have used relative performance measures for evaluating point estimators. For instance, studies might use relative bias or relative RMSE, defined as +Many published simulation studies do use relative performance measures for evaluating point estimators. +For instance, studies might use relative bias or relative RMSE, defined as $$ \begin{aligned} \text{Relative }\Bias(T) &= \frac{\E(T)}{\theta}, \\ @@ -716,15 +858,27 @@ As justification for evaluating bias in relative terms, authors often appeal to However, @hoogland1998RobustnessStudiesCovariance were writing about a very specific context---robustness studies of structural equation modeling techniques---that have parameters of a particular form. 
In our view, their proposed rule-of-thumb is often generalized far beyond the circumstances where it might be defensible, including to problems where it is clearly arbitrary and inappropriate.

-A more principled approach to choosing between absolute and relative measures is to consider how the magnitude of the measure changes across different values of the target parameter $\theta$.
-If the estimand of interest is a location parameter, then shifting $\theta$ by 0.1 or by 10.1 would not usually lead to changes in the magnitude of bias, variance, or RMSE.
-The relationship between bias and the target parameter might be similar to Scenario A in Figure \@ref(fig:absolute-relative), where bias is roughly constant across a range of different values of $\theta$.
+The problem with relative bias, in particular, arises when the target estimand is zero or near zero.
+Say you have a bias of 0.1 when estimating a parameter with a true value of 0.2.
+Then the bias is 50% of the target parameter.
+But if the true value were 0.01, the bias would be 1000% of the target.
+As the true value gets smaller and smaller, any relative measure of bias will explode.
+
+Another problem with relative bias is that, when estimating a location parameter, the location is often fairly arbitrary.
+In our Cluster RCT case, for example, the relative bias of our estimate of the ATE would change if we moved the ATE from 0.2 to 0.5, or from 0.2 to 0 (where relative bias would be entirely undefined).
+The absolute bias, however, would stay the same regardless of the ATE.
+
+A more principled approach to choosing between absolute and relative measures is to first consider whether the magnitude of the measure changes across different values of the target parameter $\theta$.
+If the estimand of interest is a location parameter, then the bias, variance, or RMSE of an estimator will likely not change as $\theta$ changes (Scenario A in Figure \@ref(fig:absolute-relative)).
+Do not use relative measures in this case!
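The explosive behavior of relative measures near zero is easy to check numerically. Here is a small sketch (with made-up numbers, not from our simulation) that holds the absolute bias fixed while shrinking the true parameter value:

```{r}
# A fixed absolute bias of 0.1, evaluated against shrinking true values.
bias <- 0.1
theta <- c( 1, 0.5, 0.2, 0.1, 0.01 )

# Relative bias, defined as E(T) / theta:
rel_bias <- ( theta + bias ) / theta
round( rel_bias, 2 )
```

The absolute bias is identical throughout, yet the relative bias climbs from 1.1 to 11 as $\theta$ shrinks toward zero.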
+
+
-Another possibility is that shifting $\theta$ by 0.1 or 10.1 will lead to proportionate changes in the magnitude of bias, variance, or RMSE.
-The relationship between bias and the target parameter might be similar to Scenario B in \@ref(fig:absolute-relative), where bias is is roughly a constant multiple of the target parameter $\theta$.
-Focusing on relative measures in this scenario is useful because it leads to a simple story: relative bias is always around 1.12 across all values of $\theta$, even though the raw bias varies considerably.
-We would usually expect this type of pattern to occur for scale parameters.
+If, however, changing $\theta$ would lead to proportionate changes in the magnitude of bias, variance, or RMSE, then relative bias may be useful because it could lead to a simple story such as "the estimator is usually estimating around 12% too high," or similar.
+This is what is shown as Scenario B in Figure \@ref(fig:absolute-relative).
+We usually expect this type of pattern to occur for scale parameters, such as standard deviations or, as we saw earlier, standard errors.

```{r absolute-relative}
#| fig.width: 8
#| fig.height: 3
@@ -734,7 +888,7 @@ We would usually expect this type of pattern to occur for scale parameters.
#| fig.cap: "Hypothetical relationships between bias and a target parameter $\\theta$. In Scenario A, bias is unrelated to $\\theta$ and absolute bias is a more appropriate measure. In Scenario B, bias is proportional to $\\theta$ and relative bias is a more appropriate measure."

-theta <- seq(-0.8, 0.8, 0.2)
+theta <- seq(0, 0.8, 0.2)
bias1 <- rnorm(length(theta), mean = 0.06, sd = 0.004)
bias2 <- rnorm(length(theta), mean = theta * 0.12, sd = 0.004)
type <- rep(c("Scenario A","Scenario B"), each = length(theta))
@@ -753,33 +907,37 @@ ggplot(dat, aes(theta, bias, color = type)) +
```

How do we know which of these scenarios is a better match for a particular problem?
-For some estimators and data-generating processes, it may be possible to analyze a problem with statistical theory and examine the how bias or variance would be expected to change as a function of $\theta$.
+For some estimators and data-generating processes, it may be possible to use statistical theory to examine how the bias or variance would be expected to change as a function of $\theta$.
However, many problems are too complex to be tractable.
-Another, much more feasible route is to evaluate performance for _multiple_ values of the target parameter.
-As done in many simulation studies, we can simulate sampling distributions and calculate performance measures (in raw terms) for several different values of a parameter, selected so that we can distinguish between constant and multiplicative relationships.
-Then, in analyzing the simulation results, we can generate graphs such as those in Figure \@ref(fig:absolute-relative) to understand how performance changes as a function of the target parameter.
-If the absolute bias is roughly the same for all values of $\theta$ (as in Scenario A), then it makes sense to report absolute bias as the summary performance criterion.
+A generally more feasible route is to evaluate performance for _multiple_ values of the target parameter and then generate a graph, such as the one shown in Figure \@ref(fig:absolute-relative), to understand how performance changes as a function of the target parameter.
+If the absolute bias is roughly the same for all values of $\theta$ (as in Scenario A), then use absolute bias.
On the other hand, if the bias grows roughly in proportion to $\theta$ (as in Scenario B), then relative bias might be a better summary criterion.
+Unless you believe bias must be zero whenever the parameter $\theta$ is zero, absolute bias is usually the better choice.
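The multiple-$\theta$ check can be prototyped in just a few lines. Below is a toy sketch (a deliberately biased estimator of a normal mean, not our cluster-RCT DGP) that estimates the bias at several values of $\theta$; plotting `biases` against `thetas` would produce a graph like those in Figure \@ref(fig:absolute-relative):

```{r}
# Toy check of how bias varies with theta: T = 1.1 * mean(Y) is
# deliberately biased by 10% of theta, i.e., Scenario B.
set.seed( 101 )
R <- 10000

bias_at <- function( theta ) {
  T_r <- replicate( R, 1.1 * mean( rnorm( 20, mean = theta ) ) )
  mean( T_r ) - theta
}

thetas <- c( 0.2, 0.4, 0.8, 1.6 )
biases <- sapply( thetas, bias_at )

# Raw bias grows with theta, but bias / theta is stable around 0.10:
round( biases / thetas, 2 )
```

If the ratios were instead unstable while the raw biases stayed flat, Scenario A would be the better match and absolute bias the right summary.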
+ ### Performance relative to a benchmark estimator -Another way to define performance measures in relative terms to by taking the ratio of the performance measure for one estimator over the performance measure for a benchmark estimator. -We have already demonstrated this approach in calculating performance measures for the cluster RCT example (Section \@ref(clusterRCTperformance)), where we used the linear regression estimator as the benchmark against which to compare the other estimators. +Another way to define performance measures in relative terms is to take the ratio of the performance measure for one estimator over the performance measure for a benchmark estimator (or some other benchmark quantity). +We have already demonstrated this approach in comparing standard errors and RMSEs for the cluster RCT example (Section \@ref(clusterRCTperformance)), where we used the linear regression estimator as the benchmark. This approach is natural in simulations that involve comparing the performance of multiple estimators and where one of the estimators could be considered the current standard or conventional method. Comparing the performance of one estimator relative to another can be especially useful when examining measures whose magnitude varies drastically across design parameters. -For most statistical methods, we would usually expect precision and accuracy to improve (variance and RMSE to decrease) as sample size increases. -Comparing estimators in terms of _relative_ precision or _relative_ accuracy may make it easier to identify consistent patterns in the simulation results. -For instance, this approach might allow us to summarize findings by saying that "the aggregation estimator has standard errors that are consistently 6-10% smaller than the standard errors of the linear regression estimator." -This is much easier to interpret than saying that "aggregation has standard errors that are around 0.01 smaller than linear regression, on average." 
-In the latter case, it is very difficult to determine whether a difference of 0.01 is large or small, and focusing on an average difference conceals relevant variation across scenarios involving different sample sizes. +For example, we typically expect precision and accuracy to improve (variance and RMSE to decrease) as sample size increases. +Comparing estimators in terms of _relative_ precision or _relative_ accuracy may then make it easier to identify consistent patterns in the simulation results. +A relative measure might allow us to summarize findings by saying that "the aggregation estimator has standard errors that are consistently 6-10% smaller than the standard errors of the linear regression estimator." +This is much easier to interpret than saying that "aggregation has standard errors that are around 0.01 smaller than linear regression, on average." +In the latter case, it is very difficult to determine whether a difference of 0.01 is large or small. +Furthermore, focusing on an average difference conceals relevant variation across scenarios involving different sample sizes. Comparing performance relative to a benchmark method can be an effective tool, but it also has potential drawbacks. Because these relative performance measures are inherently comparative, higher or lower ratios could either be due to the behavior of the method of interest (the numerator) or due to the behavior of the benchmark method (the denominator). -Ratio comparisons are also less effective for performance measures that are on a constrained scale, such as power. -If we have a power of 0.05, and we improve it to 0.10, we have doubled our power, but if it is 0.10 and we increase to 0.15, we have only increased by 50%. -Ratios can also be very deceiving when the denominator quantity is near zero or when it can take on either negative or positive values; this can be a problem when examining bias relative to a benchmark estimator. 
-Because of these drawbacks, it is prudent to compute and examine performance measures in absolute terms in addition to examining relative comparisons between methods.
+Consider what happens if the benchmark completely falls apart under some circumstance while the method of interest worsens only slightly.
+The relative measure, focused on the comparison, would appear to show a marked improvement, which could be deceptive.
+
+Ratio comparisons are also less effective for performance measures, such as power, that are on a constrained scale.
+If we have a power of 0.05, and we improve it to 0.10, we have doubled our power, but if it is 0.50 and we increase to 0.55, we have only increased by 10%.
+Whether this is misleading or not will depend on context.
+

## Estimands Not Represented By a Parameter {#implicit-estimands}

@@ -787,18 +945,17 @@ In our Cluster RCT example, we focused on the estimand of the school-level ATE,
What if we were instead interested in the person-level average effect?
This estimand does not correspond to any input parameter in our data generating process.
Instead, it is defined _implicitly_ by a combination of other parameters.
-In order to compute performance characteristics such as bias and RMSE, we would need to calculate the parameter based on the inputs of the data-generating processes.
-There are at least three possible ways to accomplish this.
+In order to compute performance characteristics such as bias and RMSE, we would first need to calculate what the target estimand is, based on the inputs of the data-generating process.
+There are at least three possible ways to accomplish this: mathematical theory, generating a massive dataset, or tracking the estimand during the simulation.

-One way is to use mathematical distribution theory to compute an implied parameter.
+**Mathematical theory.** The first way is to use mathematical distribution theory to compute the implied parameter.
Our target parameter will be some function of the parameters and random variables in the data-generating process, and it may be possible to evaluate that function algebraically or numerically (i.e., using numerical integration functions such as `integrate()`).
This can be a very worthwhile exercise if it provides insights into the relationship between the target parameter and the inputs of the data-generating process.
However, this approach requires knowledge of distribution theory, and it can get quite complicated and technical.[^cluster-RCT-estimand]
-Other approaches are often feasible and more closely aligned with the tools and techniques of Monte Carlo simulation.

-[^cluster-RCT-estimand]: In the cluster-RCT example, the distribution theory is tractable. See Exercise \@ref(cluster-RCT-SPATE)
+[^cluster-RCT-estimand]: In the cluster-RCT example, the distribution theory is tractable. See Exercise \@ref(cluster-RCT-SPATE).

-An alternative approach is to simply generate a massive dataset---so large that it can stand in for the entire data-generating model---and then simply calculate the target parameter of interest in this massive dataset. In the cluster-RCT example, we can apply this strategy by generating data from a very large number of clusters and then simply calculating the true person-average effect across all generated clusters.
+**Massive dataset.** Instead, one can generate a massive dataset---one so large that it can stand in for the entire data-generating model---and then simply calculate the target parameter of interest in it. In the cluster-RCT example, we can apply this strategy by generating data from a very large number of clusters and then calculating the true person-average effect across all generated clusters.
+If the dataset is big enough, then the uncertainty in this estimate will be negligible compared to the uncertainty in our simulation.
We implement this approach as follows, generating a dataset with 100,000 clusters:
@@ -807,14 +964,15 @@ dat <- gen_cluster_RCT(
  n_bar = 30, J = 100000,
  gamma_1 = 0.3, gamma_2 = 0.5,
  sigma2_u = 0.20, sigma2_e = 0.80,
-  alpha = 0.75
+  alpha = 0.5
)
+
ATE_person <- mean( dat$Yobs[dat$Z==1] ) - mean( dat$Yobs[dat$Z==0] )
-ATE_person
```

-The extremely precise estimate of the person-average effect is `r round( ATE_person, 2)`, which is consistent with what we would expect given the bias we saw earlier for the linear model.
+Our extremely precise estimate of the person-average effect is `r round( ATE_person, 3)`, which is consistent with what we would expect given the bias we saw earlier for the linear model.

-If we recalculate performance measures for all of our estimators with respect to the `ATE_person` estimand, the bias and RMSE of our estimators will shift but the standard errors will stay the same as in previous performance calculations using the school-level average effect:
+If we recalculate performance measures for all of our estimators with respect to the `ATE_person` estimand, the bias and RMSE of our estimators will shift.
+The standard errors will stay the same, however.

```{r}
performance_person_ATE <-
@@ -841,9 +999,10 @@ RMSE_ratio <-
For the person-weighted estimand, the aggregation estimator and multilevel model are biased but the linear regression estimator is unbiased.
However, the aggregation estimator and multilevel model estimator still have smaller standard errors than the linear regression estimator.
RMSE now captures the trade-off between bias and reduced variance.
-Overall, aggregation and multilevel modeling have RMSE that is around `r round(100 * (RMSE_ratio - 1))`% larger than linear regression.
+Overall, aggregation and multilevel modeling have an RMSE that is around `r round(100 * (RMSE_ratio - 1))`% larger than linear regression.
+For this estimand, then, the precision gains of these estimators do not fully offset the cost of their bias.
-A further approach for calculating `ATE_person` would be to record the true person average effect of the dataset with each simulation iteration, and then average the sample-specific parameters at the end.
+**Track during the simulation.** A further approach for calculating `ATE_person` would be to record the true person average effect of the dataset with each simulation iteration, and then average the sample-specific parameters at the end.
The overall average of the dataset-specific `ATE_person` parameters corresponds to the population person-level ATE.
This approach is equivalent to generating a single massive dataset---we just generate it piece by piece.
@@ -875,23 +1034,28 @@ analyze_data = function( dat ) {

Now when we run our simulation, we will have a column corresponding to the true person-level average treatment effect for each dataset.
We could then take the average of these values across replications to estimate the true person average treatment effect in the population, and then use this as the target parameter for performance calculations.

-An estimand not represented by any single input parameter is more difficult to work with than one that corresponds directly to an input parameter.
-Still, it is feasible to examine such estimands with a bit of forethought and careful programming.
+An estimand not represented by any single input parameter is more difficult to work with than one that corresponds directly to an input parameter.
+That said, it might be a more important estimand than the ones represented directly by your DGP.
+As the above shows, it is quite feasible to examine such estimands with a bit of forethought and careful programming.
The key is to be clear about what you are trying to estimate because the performance of an estimator depends critically on the estimand against which it is compared.
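As a miniature illustration of this bookkeeping, here is a toy simulation (with a made-up clustered DGP, not our `gen_cluster_RCT()`) in which each replication stores both an estimate and its dataset's sample person-level effect; averaging the stored sample effects recovers the implicit target:

```{r}
set.seed( 2024 )

# One toy replication: cluster sizes vary and larger clusters have
# larger effects, so person- and cluster-average effects differ.
one_run <- function() {
  nj <- sample( c( 10, 50 ), size = 20, replace = TRUE )
  tau_j <- 0.1 + 0.01 * nj                 # cluster-specific effects
  data.frame(
    ATE_hat    = mean( tau_j ) + rnorm( 1, sd = 0.05 ), # cluster-average estimate
    sample_ATE = weighted.mean( tau_j, w = nj )         # person-level target
  )
}

runs_toy <- do.call( rbind, replicate( 500, one_run(), simplify = FALSE ) )

# Average the per-dataset targets to estimate the population estimand,
# then evaluate bias against it:
ATE_person_toy <- mean( runs_toy$sample_ATE )
mean( runs_toy$ATE_hat ) - ATE_person_toy
```

Because the estimator tracks the cluster-average effect while the target is person-weighted, the bias against `ATE_person_toy` is negative here---exactly the kind of mismatch this bookkeeping is designed to expose.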
+
+
## Uncertainty in Performance Estimates (the Monte Carlo Standard Error) {#MCSE}

-The performance measures we have described are all defined with respect to the sampling distribution of an estimator, or its distribution across an infinite number of replications of the data-generating process.
-Of course, simulations will only involve a finite set of replications, based on which we calculate _estimates_ of the performance measures.
+The performance measures we have described are all defined in terms of an infinite number of replications of the data-generating process.
+Of course, simulations will only involve a finite set of replications, based on which we _estimate_ the measures.
These estimates involve some Monte Carlo error because they are based on a limited number of replications.
-It is important to understand the extent of Monte Carlo error when interpreting simulation results, so we need methods for asssessing this source of uncertainty.
+If we re-ran the simulation (with a different seed), our estimates would change.
+We next discuss how to quantify how much they might change.

-To account for Monte Carlo error, we can think of our simulation results as a sample from a population.
-Each replication is an independent and identically distributed draw from the population of the sampling distribution.
-Once we frame the problem in these terms, standard statistical techniques for independent and identically distributed random variables can be applied to calculate standard errors.
+To account for Monte Carlo error, we can think of our simulation results as a sample from a population.
+Each replication is independently drawn from the sampling distribution of the estimator.
+Once we frame the problem in these terms, we can easily apply standard statistical techniques to calculate standard errors.
We call these standard errors Monte Carlo Simulation Errors, or MCSEs.
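Framing the replications as an i.i.d. sample means many MCSEs are just familiar standard errors computed on the replicates. A minimal sketch (toy normal draws standing in for an estimator's replications):

```{r}
set.seed( 3 )
R <- 1000

# Pretend these are R replications of an estimator whose true value is 0.
T_r <- rnorm( R, mean = 0, sd = 0.3 )

# Estimated bias, and its MCSE: just the standard error of the mean.
bias_hat  <- mean( T_r )
MCSE_bias <- sd( T_r ) / sqrt( R )

c( bias = bias_hat, MCSE = MCSE_bias )
```

The MCSE here is roughly $0.3 / \sqrt{1000} \approx 0.0095$, telling us the estimated bias is only pinned down to within about $\pm 0.02$.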
-For most of the performance measures, closed-form expressions are available for calculating MCSEs.
-For a few of the measures, we can apply techniques such as the jackknife to calculate reasonable approximations for MCSEs.
+For most performance measures, we have closed-form expressions for their MCSEs.
+We can estimate MCSEs for the rest via techniques such as the jackknife or bootstrap.
+

### Conventional measures for point estimators
@@ -907,21 +1071,22 @@ $$
(\#eq:skewness-kurtosis)
$$

-[^estimated-MCSEs]: To be precise, the formulas that we give are _estimators_ for the Monte Carlo standard errors of the performance measure estimators. Our presentation does not emphasize this point because the performance measures will usually be estimated using a large number of replications from an independent and identically distributed process, so the distinction between empirical and estimated standard errors will not be consequential.
+[^estimated-MCSEs]: To be precise, the formulas that we give are _estimators_ for the Monte Carlo standard errors of the performance measure estimators. Our presentation does not emphasize this point because the performance measures will usually be estimated using a large number of replications from an independent and identically distributed process, so the distinction between empirical and estimated Monte Carlo standard errors will not be consequential. One could write $SE$ and $\widehat{SE}$ to distinguish between the two, but we avoid this notation for clarity and simplicity.

-The bias of $T$ is estimated as $\bar{T} - \theta$, so the MCSE for bias is equal the MCSE of $\bar{T}$. It can be estimated as
+The bias of $T$ is estimated as $\bar{T} - \theta$, so the MCSE for bias, equal to the MCSE of $\bar{T}$ since $\theta$ is fixed, can be estimated as
$$
MCSE\left(\widehat{\Bias}(T)\right) = \sqrt{\frac{S_T^2}{R}}.
(\#eq:MCSE-bias)
$$

-The sampling variance of $T$ is estimated as $S_T^2$, with MCSE of
+The sampling variance of $T$ is estimated as $S_T^2$, with an MCSE of
$$
MCSE\left(\widehat{\Var}(T)\right) = S_T^2 \sqrt{\frac{k_T - 1}{R}}.
(\#eq:MCSE-var)
$$

-The empirical standard error (the square root of the sampling variance) is estimated as $S_T$. Using a delta method approximation[^delta-method], the MCSE of $S_T$ is
+The empirical standard error (the square root of the sampling variance) is estimated as $S_T$.
+Using a delta method approximation,[^delta-method] the MCSE of $S_T$ is approximately
$$
-MCSE\left(S_T\right) = \frac{S_T}{2}\sqrt{\frac{k_T - 1}{R}}.
+MCSE\left(S_T\right) \approx \frac{S_T}{2}\sqrt{\frac{k_T - 1}{R}}.
(\#eq:MCSE-SE)
$$

@@ -931,7 +1096,7 @@ where $g'(\phi)$ is the derivative of $g(\cdot)$ evaluated at $\phi$.
Following this approximation,
$$ SE( g(X) ) \approx \left| g'(\theta) \right| \times SE(X) .$$
For estimation, we plug in $\hat{\theta}$ and our estimate of $SE(X)$ into the above.
-To find the MCSE for $S_T$, we can apply the delta method approximation to $X = S_T^2$ with $g(x) = \sqrt(x)$ and $g'(x) =\frac{1}{2\sqrt{x}}$.
+To find the MCSE for $S_T$, we can apply the delta method approximation to $X = S_T^2$ with $g(x) = \sqrt{x}$ and $g'(x) = \frac{1}{2\sqrt{x}}$. See [this blog post](https://jepusto.com/posts/Multivariate-delta-method/index.html) for a friendly introduction.

We estimate RMSE using Equation \@ref(eq:rmse-estimator), which can also be written as
$$
@@ -949,7 +1114,7 @@ MCSE( \widehat{RMSE} ) = \frac{\sqrt{\frac{1}{R}\left[S_T^4 (k_T - 1) + 4 S_T^3
$$

Section \@ref(sec-relative-performance) discussed circumstances where we might prefer to calculate performance measures in relative rather than absolute terms.
-For measures that are calculated by dividing a raw measure by the target parameter, the MCSE for the relative measure is simply the MCSE for the raw measure divided by the target parameter.
+For measures that are calculated by dividing a raw measure by the known target parameter, the MCSE for the relative measure is simply the MCSE for the raw measure divided by the target parameter.
For instance, the MCSE of relative bias $\bar{T} / \theta$ is
$$
MCSE\left( \frac{\bar{T}}{\theta} \right) = \frac{1}{\theta} MCSE(\bar{T}) = \frac{S_T}{\theta \sqrt{R}}.
@@ -957,7 +1122,7 @@ $$
MCSEs for relative variance and relative RMSE follow similarly.

-### Less conventional measures for point estimators
+### MCSEs for robust measures for point estimators

In Section \@ref(less-conventional-measures) we described several alternative performance measures for evaluating point estimators, which are less commonly used but are more robust to outliers compared to measures such as bias and variance.
MCSEs for these less conventional measures can be obtained using results from the theory of robust statistics [@Hettmansperger2010robust; @Maronna2006robust].
@@ -977,19 +1142,56 @@ $$
MCSE\left(\tilde{T}_{\{p\}}\right) = \sqrt{\frac{U_p}{R}},
(\#eq:MCSE-trimmed-mean)
$$
-where
+where, following [@Maronna2006robust, Eq. 2.85],
$$
-U_p = \frac{1}{(1 - 2p)R}\left( pR\left(T_{(pR)} - \tilde{T}_{\{p\}}\right)^2 + pR\left(T_{((1-p)R + 1)} - \tilde{T}_{\{p\}}\right)^2 + \sum_{r=pR + 1}^{(1 - p)R} \left(T_{(r)} - \tilde{T}_{\{p\}}\right)^2 \right)
+U_p = \frac{1}{(1 - 2p)R}\left( pR\left(T_{(pR)} - \tilde{T}_{\{p\}}\right)^2 + pR\left(T_{((1-p)R + 1)} - \tilde{T}_{\{p\}}\right)^2 + \sum_{r=pR + 1}^{(1 - p)R} \left(T_{(r)} - \tilde{T}_{\{p\}}\right)^2 \right) .
$$
-[@Maronna2006robust, Eq. 2.85].
+The additional uncertainty due to the estimated trimming thresholds is ignorable.

Performance measures based on winsorization include winsorized bias, winsorized standard error, and winsorized RMSE.
MCSEs for these measures can be computed using the same formulas as for the conventional measures of bias, empirical standard error, and RMSE, but using sample moments of $\hat{X}_r$ in place of the sample moments of $T_r$.

-### MCSE for Relative Variance Estimators {#MCSE-for-relative-variance}
+
+
+
+### MCSE for Confidence Intervals and Hypothesis Tests
+
+Performance measures for confidence intervals and hypothesis tests are simple compared to those we have described for point and variance estimators.
+For evaluating hypothesis tests, the main measure is the rejection rate of the test, which is a proportion estimated as $r_\alpha$ (Equation \@ref(eq:rejection-rate-estimate)).
+An estimated MCSE for the estimated rejection rate is
+$$
+MCSE(r_\alpha) = \sqrt{\frac{r_\alpha ( 1 - r_\alpha)}{R}}.
+(\#eq:MCSE-rejection-rate)
+$$
+This MCSE uses the estimated rejection rate to approximate its Monte Carlo error.
+When evaluating the validity of a test, we may expect the rejection rate to be fairly close to the nominal $\alpha$ level, in which case we could compute an MCSE using $\alpha$ in place of $r_\alpha$, taking $\sqrt{\alpha(1 - \alpha) / R}$.
+When evaluating power, we will not usually know the neighborhood of the rejection rate in advance of the simulation.
+However, a conservative upper bound on the MCSE can be derived by observing that the MCSE is maximized when $\rho_\alpha = \frac{1}{2}$, and so
+$$
+MCSE(r_\alpha) \leq \sqrt{\frac{1}{4 R}}.
+$$
+
+When evaluating confidence interval performance, we focus on coverage rates and expected widths.
+MCSEs for the estimated coverage rate work similarly to those for rejection rates.
+If the coverage rate is expected to be in the neighborhood of the intended coverage level $\beta$, then we can approximate the MCSE as
+$$
+MCSE(\widehat{\text{Coverage}}(A,B)) = \sqrt{\frac{\beta(1 - \beta)}{R}}.
+(\#eq:MCSE-coverage) +$$ +Alternately, Equation \@ref(eq:MCSE-coverage) could be computed using the estimated coverage rate $\widehat{\text{Coverage}}(A,B)$ in place of $\beta$. + +Finally, the expected confidence interval width can be estimated as $\bar{W}$, with MCSE +$$ +MCSE(\bar{W}) = \sqrt{\frac{S_W^2}{R}}, +(\#eq:MCSE-width) +$$ +where $S_W^2$ is the sample variance of $W_1,...,W_R$, the widths of the confidence interval from each replication. + + +### Using the Jackknife to obtain MCSEs (for relative variance estimators) {#MCSE-for-relative-variance} Estimating the MCSE of relative performance measures for variance estimators is complicated by the appearance of an estimated quantity in the denominator of the ratio. -For instance, the relative bias of $V$ is estimates as the ratio $\bar{V} / S_T^2$, and both the numerator and denominator are estimated quantities that will include some Monte Carlo error. +For instance, the relative bias of $V$ is estimated as the ratio $\bar{V} / S_T^2$, and both the numerator and denominator are estimated quantities that will include some Monte Carlo error. To properly account for the Monte Carlo uncertainty of the ratio, one possibility is to use formulas for the standard errors of ratio estimators. Alternately, we can use general uncertainty approximation techniques such as the jackknife or bootstrap [@boos2015Assessing]. The jackknife involves calculating a statistic of interest repeatedly, each time excluding one observation from the calculation. @@ -997,7 +1199,7 @@ The variance of this set of one-left-out statistics then serves as a reasonable To apply the jackknife to assess MCSEs of relative bias or relative RMSE of a variance estimator, we will need to compute several statistics repeatedly. Let $\bar{V}_{(j)}$ and $S_{T(j)}^2$ be the average variance estimate and the empirical variance estimate calculated from the set of replicates __*that excludes replicate $j$*__, for $j = 1,...,R$. 
-The relative bias estimate, excluding replicate $j$ would then be $\bar{V}_{(j)} / S_{T(j)}^2$.
+The relative bias estimate, excluding replicate $j$, would then be $\bar{V}_{(j)} / S_{T(j)}^2$.
Calculating all $R$ versions of this relative bias estimate and taking the variance of these $R$ versions yields a jackknife MCSE:
$$
MCSE\left( \frac{ \bar{V}}{S_T^2} \right) = \sqrt{\frac{1}{R} \sum_{j=1}^R \left(\frac{\bar{V}_{(j)}}{S_{T(j)}^2} - \frac{\bar{V}}{S_T^2}\right)^2}.
@@ -1037,48 +1239,90 @@ Instead, all we need to do is calculate the overall mean and overall variance, a

Jackknife methods are useful for approximating MCSEs of other performance measures beyond just those for variance estimators.
For instance, the jackknife is a convenient alternative for computing the MCSE of the empirical standard error or (raw) RMSE of a point estimator, which avoids the need to compute skewness or kurtosis.
-However, @boos2015Assessing notes that the jackknife does not work for performance measures involving medians, although bootstrapping remains valid.
+However, @boos2015Assessing notes that the jackknife does not work for performance measures involving percentiles, including medians, although bootstrapping (discussed next) usually remains valid.

-### MCSE for Confidence Intervals and Hypothesis Tests
+### Using the bootstrap for MCSEs of relative performance measures

-Performance measures for confidence intervals and hypothesis tests are simple compared to those we have described for point and variance estimators.
-For evaluating hypothesis tests, the main measure is the rejection rate of the test, which is a proportion estimated as $r_\alpha$ (Equation \@ref(eq:rejection-rate-estimate)).
-A MCSE for the estimated rejection rate is
-$$
-MCSE(r_\alpha) = \sqrt{\frac{r_\alpha ( 1 - r_\alpha)}{R}}.
-(\#eq:MCSE-rejection-rate)
-$$
-This MCSE uses the estimated rejection rate to approximate its Monte Carlo error.
-When evaluating the validity of a test, we may expect the rejection rate to be fairly close to the nominal $\alpha$ level, in which case we could compute a MCSE using $\alpha$ in place of $r_\alpha$, taking $\sqrt{\alpha(1 - \alpha) / R}$. -When evaluating power, we will not usually know the neighborhood of the rejection rate in advance of the simulation. -However, a conservative upper bound on the MCSE can be derived by observing that MCSE is maximized when $\rho_\alpha = \frac{1}{2}$, and so -$$ -MCSE(r_\alpha) \leq \sqrt{\frac{1}{4 R}}. -$$ +An alternative to the jackknife for estimating MCSEs is the bootstrap [@efron1994introduction]. +The bootstrap involves repeatedly resampling the set of simulation replications with replacement, calculating the performance measure of interest for each resample, and then calculating the standard deviation of the performance measures across the resamples. -When evaluating confidence interval performance, we focus on coverage rates and expected widths. -MCSEs for the estimated coverage rate work similarly to those for rejection rates. -If the coverage rate is expected to be in the neighborhood of the intended coverage level $\beta$, then we can approximate the MCSE as -$$ -MCSE(\widehat{\text{Coverage}}(A,B)) = \sqrt{\frac{\beta(1 - \beta)}{R}}. -(\#eq:MCSE-coverage) -$$ -Alternately, Equation \@ref(eq:MCSE-coverage) could be computed using the estimated coverage rate $\widehat{\text{Coverage}}(A,B)$ in place of $\beta$. +To illustrate, we next calculate a bootstrap MCSE for two quantities: the relative improvement in variance for Aggregation vs. Linear Regression, and the relative bias of Aggregation's variance estimator (how well calibrated it is). -Finally, the expected confidence interval width can be estimated as $\bar{W}$, with MCSE -$$ -MCSE(\bar{W}) = \sqrt{\frac{S_W^2}{R}}, -(\#eq:MCSE-width) -$$ -where $S_W^2$ is the sample variance of $W_1,...,W_R$, the widths of the confidence interval from each replication. 
+We first calculate these performance measures for quick reference and for making confidence intervals after we bootstrap: -### Calculating MCSEs With the `simhelpers` Package +```{r} +ests <- runs %>% + group_by( method ) %>% + summarise( + SE2 = var( ATE_hat ), + Vbar = mean( SE_hat^2 ) + ) %>% + mutate( calib = Vbar / SE2, + imp_var = SE2 / SE2[method=="LR"] ) +ests +``` +As before, we see Aggregation has a variance around 9% smaller than LR, and that its variance estimator is inflated by around 13%. +Let's make bootstrap confidence intervals on these two quantities to account for Monte Carlo error in our simulation. + +We first make our data wide format to ease bootstrapping: +```{r} +runW <- runs %>% + dplyr::select( runID, method, SE_hat, ATE_hat ) %>% + rename( S=SE_hat, A=ATE_hat ) %>% + tidyr::pivot_wider( names_from = method, + values_from = c( S, A ) ) +print( runW, n = 4 ) +``` -The `simhelpers` package provides several functions for calculating most of the performance measures that we have reviewed, along with MCSEs for each performance measures. +We make the data wide because we want to bootstrap simulation trials, which in the long form are groups of 3 rows each. +We want to resample the _simulation iteration_ (the sets of three estimates that go together) to take within-dataset correlation of the model estimates into account. +By going wide, we can just bootstrap the rows, as so: + +```{r, cache=TRUE} +rps <- repeat_and_stack( 1000, { + run_star = sample_n( runW, size = nrow(runW), replace = TRUE ) + tibble( + cal_agg = mean( run_star$S_Agg^2 ) / var( run_star$A_Agg ), + rel_imp_var = var( run_star$A_Agg ) / var( run_star$A_LR ) + ) +}) +``` + +For simplicity, we are only looking at the aggregated estimator---we could extend our code to bootstrap for everything. 
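+As a quick aside, we could also summarize the bootstrap distribution directly with percentile intervals, taking the 2.5% and 97.5% empirical quantiles of the resampled measures (a sketch, reusing the `rps` tibble from above):
+
+```{r}
+rps %>%
+  pivot_longer( everything(), names_to = "measure", values_to = "value" ) %>%
+  group_by( measure ) %>%
+  summarise( CI_l = quantile( value, 0.025 ),
+             CI_u = quantile( value, 0.975 ) )
+```
+
+When the bootstrap distribution is roughly symmetric, these intervals will be close to normal-approximation intervals of the form estimate $\pm 2 \cdot MCSE$.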
+Regardless, once we have our 1000 bootstrap iterations, we see how much our performance measures change across iterations:
+
+```{r}
+rps %>%
+  pivot_longer( everything(), names_to = "measure", values_to = "value" ) %>%
+  group_by( measure ) %>%
+  summarise( 
+    avg = mean( value ),
+    MCSE = sd( value ) ) %>%
+  mutate( theta = c( ests$calib[[1]], ests$imp_var[[1]] ),
+          CI_l = theta - 2*MCSE,
+          CI_u = theta + 2*MCSE
+  ) %>%
+  relocate( measure, theta ) %>%
+  knitr::kable( digits = 3 )
+```
+We see two results:
+First, Aggregation has a lower variance than linear regression, with the upper end of the confidence interval being 97%.
+Second, we have evidence that the Aggregation variance estimator is systematically too high, with the lower end of the confidence interval being above 1 (but only just).
+If we had run fewer bootstrap iterations, we may well not have been able to make these claims with confidence.
+
+
+The bootstrap is a very general tool.
+You can calculate MCSE similarly for nearly any performance measure you can come up with.
+
+
+
+## Performance Calculations with the `simhelpers` Package
+
+The `simhelpers` package provides several functions for calculating most of the performance measures that we have reviewed, along with MCSEs for each measure.
The functions are easy to use.
Consider this set of simulation runs on the Welch dataset:

-```{r simhelpers-MCSEs}
+```{r simhelpers-MCSEs, include=FALSE}
library( simhelpers )
data( welch_res )

@@ -1086,19 +1330,24 @@ welch <- welch_res %>%
  dplyr::select(-seed, -iterations ) %>%
  mutate(method = case_match(method, "Welch t-test" ~ "Welch", .default = method))
+```
+```{r}
head(welch)
```
-
-We can calculate performance measures across all the range of scenarios.
-Here is the rejection rate for the traditional $t$-test based on the subset of simulation results with sample sizes of $n_1 = n_2 = 50$ and a mean difference of 0, using $\alpha$ levels of .01 and .05:
+We first assess the rejection rate for the traditional $t$-test based on the subset of simulation results with sample sizes of $n_1 = n_2 = 50$ and a mean difference of 0, using $\alpha$ levels of .01 and .05:

```{r}
-welch_sub <- filter(welch, method == "t-test", n1 == 50, n2 == 50, mean_diff == 0 )
+welch_sub <- filter(welch, method == "t-test", 
+                    n1 == 50, n2 == 50, mean_diff == 0 )

calc_rejection(welch_sub, p_values = p_val, alpha = c(.01, .05))
```

The column labeled `K_rejection` reports the number of replications used to calculate the performance measures.
+The MCSE columns report the Monte Carlo uncertainty of each measure.
+Here we see that we are measuring the rejection rate quite precisely.

Here is the coverage rate calculated for the same condition:
```{r}
@@ -1109,18 +1358,23 @@ calc_coverage(
)
```

-The performance functions are designed to be used within a `tidyverse`-style workflow, including on grouped datasets. For instance, we can calculate rejection rates for every distinct scenario examined in the simulation:
+The performance functions are designed to be used within a `tidyverse`-style workflow, including on grouped datasets.
+For instance, we can calculate the performance measures for all the methods for our given scenario:

```{r}
-all_rejection_rates <- 
-  welch %>%
-  group_by( n1, n2, mean_diff, method ) %>%
+welch_sub <- filter(welch, n1 == 50, n2 == 50, 
+                    mean_diff == 0 )
+welch_sub %>%
+  group_by( method ) %>%
  summarise( calc_rejection( p_values = p_val, 
                             alpha = c(.01, .05) ) )
```

-The resulting summaries are reported in table \@ref(tab:Welch-rejection).
-```{r Welch-rejection, echo = FALSE, message = FALSE}
+See below for more examples of using `simhelpers` to calculate different performance measures.
+Later, when we analyze multifactor simulations, this grouping becomes even more powerful: we will group by the simulation factors and evaluate performance for all our scenarios at once.
+
+

-
-
-### MCSE Calculation in our Cluster RCT Example
+### Quick Performance Calculations for our Cluster RCT Example

In Section \@ref(clusterRCTperformance), we computed performance measures for three point estimators of the school-level average treatment effect in a cluster RCT.
-We can carry out the same calculations using the `calc_absolute()` function from `simhelpers`, which also provides MCSEs for each measure.
-Examining the MCSEs is useful to ensure that 1000 replications of the simulation is suffiicent to provide reasonably precise estimates of the performance measures.
+We can carry out the same calculations using the `calc_absolute()` function from `simhelpers`, which also provides MCSEs.
+Examining the MCSEs is useful to ensure that 1000 replications of the simulation are sufficient to provide reasonably precise estimates of performance.

In particular, we have:

```{r cluster-MCSE-calculation}
@@ -1159,14 +1411,61 @@ runs %>%
  )
```

-We see the MCSEs are quite small relative to the linear regression bias term and all the SEs (`stddev`) and RMSEs. Results based on 1000 replications seems adequate to support our conclusions about the gross trends identified.
-We have _not_ simulated enough to rule out the possibility that the aggregation estimator and multilevel modeling estimator could be slightly biased. Given our MCSEs, they could have true bias of as much as 0.01 (two MCSEs).
+We see the MCSEs are quite small relative to the linear regression bias term, the SEs (`stddev`), and RMSEs.
+Results based on 1000 replications seem adequate to support our conclusions about the gross trends identified.
+We have _not_ simulated enough to rule out the possibility that the aggregation estimator and multilevel modeling estimator could be slightly biased.
+Given our MCSEs, they could have true bias of as much as 0.01 (two MCSEs).

-## Summary of Peformance Measures
+For our variance estimators, we can use the `calc_relative_var()` function to compute performance measures and MCSEs using the jackknife:

-
-We list most of the performance criteria we saw in this chapter in the table below, for reference:
+```{r cluster-variance-MCSE-calculation}
+perf <- runs %>%
+  group_by(method) %>%
+  summarise(
+    calc_relative_var(
+      estimates = ATE_hat,
+      var_estimates = SE_hat^2,
+      criteria = c("relative bias", "relative rmse")
+    )
+  )
+perf %>%
+  dplyr::select( -K_relvar ) %>%
+  knitr::kable( digits = 2 )
+```
+Comparing our calibration results to those we got from the bootstrap, above, we can use our performance table to make confidence intervals as so:
+```{r}
+perf %>%
+  dplyr::select( method, rel_bias_var, rel_bias_var_mcse ) %>%
+  mutate( CI_l = rel_bias_var - 2 * rel_bias_var_mcse,
+          CI_u = rel_bias_var + 2 * rel_bias_var_mcse ) %>%
+  knitr::kable( digits = 2 )
+```
+We get essentially the same results!
+
+
+## Concluding thoughts
+
+The performance of an estimator is not usually any single number.
+As we saw above, there are many different qualities a given estimator can have, bias and precision being just two of them.
+We list the main ones we discussed in the table below.
+Furthermore, many data analysis procedures produce multiple pieces of information---not just point estimates, but also standard errors and confidence intervals and $p$-values from null hypothesis tests---and those pieces are inter-related.
+For instance, a confidence interval is usually computed from a point estimate and its standard error.
+Consequently, the performance of that confidence interval will be strongly affected by whether the point estimator is biased and whether the standard error tends to understate or overstate the true uncertainty.
+Likewise, the performance of a hypothesis testing procedure will often strongly depend on the properties of the point estimator and standard error used to compute the test.
+Thus, to understand "performance," most simulations will involve evaluating a data analysis procedure on several measures and synthesizing them into a holistic picture.
+Moreover, the main aim of many simulations is to compare the performance of several different estimators or to determine which of several data analysis procedures is preferable.
+For such aims, we use a suite of performance measures to understand how different procedures work differently, when and how one is superior to the other, and what factors influence differences in performance.
+
+Finally, we will actually need to assess how performance _changes_ in response to different conditions as given by the DGP.
+In the next chapters we talk about how to expand a simulation from a single scenario to multiple scenarios.
+Once we do this, we will then begin to graph how performance changes as a function of the DGP conditions, which will help us truly understand when and why different data analysis procedures work well or poorly.
+
+
+### Summary of Performance Measures
+
+
+

| Criterion | Definition | Estimator | Monte Carlo Standard Error |
|-------------------------|---------------------------------------------------------|----------------------|----------------------------|
@@ -1190,17 +1489,6 @@ We list most of the performance criteria we saw in this chapter in the table bel

* The median absolute deviation (MAD) is another measure of overall accuracy that is less sensitive to outlier estimates. The RMSE can be driven up by a single bad egg. The MAD is less sensitive to this.
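+To make the RMSE's sensitivity concrete, here is a toy illustration (with made-up numbers, not from our simulations): a single outlying error inflates the RMSE while leaving the MAD essentially untouched.
+
+```{r}
+errors <- c( rep( 0.1, 99 ), 5 )   # 99 small errors and one bad egg
+sqrt( mean( errors^2 ) )           # RMSE jumps to about 0.51
+median( abs( errors ) )            # MAD stays at 0.10
+```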
-## Concluding thoughts - -In practice, many data analysis procedures produce multiple pieces of information---not just point estimates, but also standard errors and confidence intervals and p-values from null hypothesis tests---and those pieces are inter-related. -For instance, a confidence interval is usually computed from a point estimate and its standard error. -Consequently, the performance of that confidence interval will be strongly affected by whether the point estimator is biased and whether the standard error tends to understates or over-states the true uncertainty. -Likewise, the performance of a hypothesis testing procedure will often strongly depend on the properties of the point estimator and standard error used to compute the test. -Thus, most simulations will involve evaluating a data analysis procedure on several measures to arrive at a holistic understanding of its performance. - -Moreover, the main aim of many simulations is to compare the performance of several different estimators or to determine which of several data analysis procedures is preferable. -For such aims, we will need to use the performance measures to understand whether a set of procedures work differently, when and how one is superior to the other, and what factors influence differences in performance. -To fully understand the advantages and trade-offs among a set of estimators, we will generally need to compare them using several performance measures. ## Exercises @@ -1311,16 +1599,14 @@ How do the normal Wald-type intervals compare to the cluster-robust intervals? ### Jackknife calculation of MCSEs for RMSE {#jackknife-MCSE} -The following code generates 100 replications of a simulation of three average treatment effect estimators in a cluster RCT, using a simulation driver function we developed in Section \@ref(bundle-sim-demo) using components described in Sections \@ref(case-cluster) and \@ref(multiple-estimation-procedures). 
+The following code generates 100 replications of a simulation of three average treatment effect estimators in a cluster RCT, using a simulation driver function we developed in Section \@ref(one-run-reparameterization) using components described in Sections \@ref(case-cluster) and \@ref(multiple-estimation-procedures). ```{r, eval = FALSE} set.seed( 20251029 ) -runs_val <- sim_cluster_RCT( +runs_val <- run_RCT_sim( reps = 100, J = 16, n_bar = 20, alpha = 0.5, - gamma_1 = 0.3, gamma_2 = 0.8, - sigma2_u = 0.25, sigma2_e = 0.75 -) + ATE = 0.3, ICC = 0.2 ) ``` Compute the RMSE of each estimator, and use the jackknife technique described in Section \@ref(MCSE-for-relative-variance) to compute a MCSE for the RMSE. diff --git a/070-experimental-design.Rmd b/070-experimental-design.Rmd index 3320ca3..a3d4669 100644 --- a/070-experimental-design.Rmd +++ b/070-experimental-design.Rmd @@ -201,7 +201,6 @@ saveRDS( res, file = "results/simulation_CRT.rds" ) ```{r secret_run_full_CRT, include=FALSE} if ( !file.exists( "results/simulation_CRT.rds" ) ) { - source( here::here( "case_study_code/clustered_data_simulation_runner.R" ) ) } else { res = readRDS("results/simulation_CRT.rds") @@ -266,9 +265,44 @@ sres <- glimpse( sres ) ``` +Either way, we now have a lot of results to wade through. +After we calculate our performance measures, we have 810 rows of results across 270 different scenarios. +We can simply aggregate across everything to get overall comparisons: + +```{r} +sres %>% + group_by( method ) %>% + summarise( mean_bias = mean( bias ), + mean_SE = mean( SE ), + mean_rmse = mean( rmse ) ) +``` + +Even with this massive simplification, we see tantalizing hints of differences between the methods. +Linear Regression does seem biased, on average, across the scenarios considered. +But so does multilevel modeling---we did not see bias before, but apparently for some scenarios it is biased as well. 
+Overall SE and RMSE seem roughly the same across scenarios, and it is clear that uncertainty, on average, trumps bias.
+This could be due, however, to those scenarios with few clusters that are small in size.
+In order to understand the more complete story, we need to dig further into the results.
+
+
+
+
+
+
+## Conclusions
+
+In this chapter, we have moved from isolated simulation scenarios to fully specified multifactor simulation designs.
+By systematically varying multiple features of a data-generating process while holding others fixed, we can create a structured way to explore how estimator performance depends on the context in which the estimators are used.
+By treating simulations as designed experiments, we gain a principled framework for choosing parameters and curating our set of scenarios to explore.
+The result is a rich collection of simulation output that can capture bias, variability, uncertainty estimation, or testing behavior across a wide range of plausible conditions.
+
+However, the volume and complexity of these results also create a new challenge: interpretation.
+With many factors, levels, and performance measures, it is no longer possible to understand results by inspecting tables alone.
+The next step is therefore to summarize, aggregate, and visualize these outcomes in ways that reveal patterns and trade-offs.
+In the next few chapters, we turn to the problem of analyzing and presenting such results, with an emphasis on graphical approaches that clarify how performance varies across conditions.

-
diff --git a/074-building-good-vizualizations.Rmd b/074-building-good-vizualizations.Rmd
index 6443228..596d384 100644
--- a/074-building-good-vizualizations.Rmd
+++ b/074-building-good-vizualizations.Rmd
@@ -119,7 +119,7 @@ We connect the points with lines to help us see trends within each of the small

The lines help us visually track which group of points goes with which.
We immediately see that when there is a lot of site variation (`alpha = 0.80`), and it relates to outcome (`size_coef=0.2`), linear regression is very different from the other two methods.
-We also see that when `alpha` is 0 and `size_coef` is 0, we may also have a negative bias when $J = 5`.
+We also see that when `alpha` is 0 and `size_coef` is 0, we may have a negative bias when $J = 5$.

Before we get too excited about this surprising result, we add MCSEs to our plot to see if this is a real effect or just noise:

```{r}
@@ -139,9 +139,9 @@ ggplot( sres_sub, aes( as.factor(J), bias,

Our confidence intervals exclude zero!
Our excitement mounts.
We next subset our overall results again to check all the scenarios with `alpha = 0`, and `size_coef = 0` to see if this is a real effect.
-We can still plot all our data, as for that subset of scenarios we still only have three factors, `ICC`, `J_bar`, and `n_bar`, left to plot.
+We can still plot all our data, as for that subset of scenarios we still only have three factors, `ICC`, `J`, and `n_bar`, left to plot (we omit the code as it is similar to above).

-```{r}
+```{r, echo=FALSE}
null_sub <- sres %>%
  dplyr::filter( alpha == 0, size_coef == 0 )
ggplot( null_sub, aes( ICC, bias,
@@ -188,14 +188,14 @@ If the boxes are wide, then we know that the factors that vary within the box ma

With bundling, we generally need a good number of simulation runs per scenario, so that the MCSE in the performance measures does not make our boxplots look substantially more variable (wider) than the truth.
Consider a case where all the scenarios within a box have zero _true_ bias; if the MCSE were large, the _estimated_ biases would still vary and we would see a wide boxplot when we should not.
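+To see this concern in miniature (a toy sketch with made-up numbers): suppose the 15 scenarios in a box all have zero true bias, but each estimated bias carries an MCSE of 0.02.
+
+```{r}
+set.seed( 1234 )
+est_bias <- rnorm( 15, mean = 0, sd = 0.02 )  # estimated biases; the truth is 0 for all
+round( range( est_bias ), 3 )
+```
+
+The spread here is pure Monte Carlo error, yet a boxplot of these values would look like genuine variation in bias across the scenarios in the box.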
-To illustrate bundling, we replicate our small subset figure from above, but instead of each point (with a given `J', `alpha`, and `size_coef`) just being the single scenario with `n_bar=80` and `ICC = 0.20`, we plot all the scenarios in a boxplot at that location.
+To illustrate bundling, we replicate our small subset figure from above, but instead of each point (with a given `J`, `alpha`, and `size_coef`) just being the single scenario with `n_bar=80` and `ICC = 0.20`, we plot all the scenarios (across the `n_bar` and `ICC` values) in a boxplot at that location.
We put the boxes for the three methods side-by-side to directly compare them:

```{r clusterRCT_plot_bias_v1}
ggplot( sres, aes( as.factor(J), bias,
                   col=method, group=paste0(method, J) ) ) +
  facet_grid( size_coef ~ alpha, labeller = label_both ) +
-  geom_boxplot( coef = Inf, width=0.7, fill="grey" ) +
+  geom_boxplot( width=0.7, fill="grey", outlier.size = 0.5 ) +
  geom_hline( yintercept = 0 ) +
  theme_minimal()

@@ -218,8 +218,8 @@ ggplot( sres, aes( as.factor(ICC), bias, col=as.factor(alpha), group=paste0(ICC,
-->

All of our simulation trials are represented in this plot.
-Each box is a collection of simulation trials. E.g., for `J = 5`, `size_coef = 0`, and `alpha = 0.8` each of the three boxes contains 15 scenarios representing the varying ICC and cluster size.
-Here are the 15 results in the top right box for the Aggregation method:
+Each box is a collection of trials. E.g., for `J = 5`, `size_coef = 0`, and `alpha = 0.8` each of the three boxes contains 15 scenarios representing the varying ICC and cluster size.
+Here are the 15 results in the `J = 5` box in the top right facet for the Aggregation method:

```{r}
filter( sres, J == 5,
@@ -240,24 +240,25 @@ The apparent outliers (long tails) for some of the boxplots suggest that the two
They could also be due to MCSE, and given that we primarily see these tails when $J$ is small, this is a real concern.
MCSE aside, a long tail means that some scenario in the box had a high level of estimated bias.
We could try bundling along different aspects to see if either of the remaining factors (e.g., ICC) explains these differences.
-Here we try bundling cluster size and number of clusters.
+Here we try bundling cluster size and number of clusters (we again omit the code).

-```{r clusterRCT_plot_bias_v2}
+```{r clusterRCT_plot_bias_v2, echo=FALSE}
ggplot( sres, aes( as.factor(alpha), bias,
                   col=method, group=paste0(method, alpha) ) ) +
  facet_grid( size_coef ~ ICC, labeller = label_both ) +
-  geom_boxplot( coef = Inf, width=0.7, fill="grey" ) +
+  geom_boxplot( width=0.7, fill="grey", outlier.size=0.5 ) +
  geom_hline( yintercept = 0 ) +
  theme_minimal()
```

-We have some progress now: the long tails are primarily when the ICC is high, but we also see that MLM has bias with ICC is 0, if alpha is nonzero.
+We have some progress now: the long tails appear primarily when the ICC is high, but we also see that MLM has bias when the ICC is 0, if alpha is nonzero.

-We know things are more unstable in smaller samples sizes, so the tails could still be MCSE, with some of our bias estimates being large due to random chance.
+Estimates of bias will be more unstable in smaller sample sizes, because the estimators themselves are more unstable (recall the MCSE of bias is $SE/\sqrt{R}$, with $R$ the number of trials and $SE$ the true standard error of the estimator).
+Because of this instability, the tails could still be MCSE, with some of our bias estimates being large due to random chance.
Or perhaps there is still some specific combination of factors that allows for large bias (e.g., perhaps small sample sizes make our estimators more vulnerable to bias).
In an actual analysis, we would make a note to investigate these anomalies later on.
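+To put rough numbers on this point (hypothetical values, not from our study): with a true standard error of 0.5, the MCSE of the estimated bias shrinks only slowly as we add replications.
+
+```{r}
+# MCSE of bias = SE / sqrt(R), here for R = 100 and R = 1000 replications
+0.5 / sqrt( c( 100, 1000 ) )
+```
+
+So even with 1000 replications, an estimated bias of 0.02 would be well within two MCSEs of zero.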
-In general, trying to group your simulation scenarios so that their boxes are generally narrow is a good idea; narrow boxes means that you have found a representation of the data where you know what is driving the variation in your performance measure, and that the factors bundled inside the boxes are less important.
+In general, it is a good idea to group your simulation scenarios so that their boxes are small; small boxes mean that you have found a representation of the data where you know what is driving the variation in your performance measure, and that the factors bundled inside the boxes are less important.
This might not always be possible, if all your factors matter; in this case the width of your boxes tells you to what extent the bundled factors matter relative to the factors explicitly present in your plot.

One might wonder, with only a few trials per box, whether we should instead look at the individual scenarios.
@@ -280,10 +281,9 @@ Using boxplots, even over such a few number of points, notably clarifies a visua

Boxplots can make seeing trends more difficult, as the eye is drawn to the boxes and tails, and the range of your plot axes can be large due to needing to accommodate the full tails and outliers of your results; this can compress the mean differences between groups, making them look small.
They can also be artificially inflated, especially if the MCSEs are large.

-Instead of bundling, we can therefore aggregate, where we average all the scenarios within a box to get a single number of average performance.
-This will show us overall trends rather than individual simulation variation.
+Instead of bundling, we can therefore aggregate, where we average all the scenarios within a box to get a single number summarizing average performance, to see overall trends.

-When we aggregate, and average over some of the factors, we collapse our simulation results down to fewer moving parts.
+When we aggregate, i.e.
average over some of the factors, we collapse our simulation results down to fewer moving parts. Aggregation across factors is better than not having varied those factors in the first place! A performance measure averaged over a factor is a more general answer of how things work in practice than having not varied the factor at all. @@ -318,46 +318,33 @@ ggplot( ssres, aes( as.factor(J), bias, col=method, group=method ) ) + theme_minimal() ``` -We now see quite clearly that as `alpha` grows, linear regression gets more biased if cluster size relates to average impact in the cluster (`size_coef`). +We now see quite clearly that as `alpha` grows, linear regression gets more biased if cluster size relates to average impact in the cluster (`size_coef = 0.2`). Our finding makes sense given our theoretical understanding of the problem---if size is not related to treatment effect, it is hard to imagine how varying cluster sizes would cause much bias. We are looking at an interaction between our simulation factors: we only see bias for linear regression when cluster size relates to impact and there is variation in cluster size. -We also see that all the estimators have near zero bias when there is no variation in cluster size or the cluster size does not relate to outcome, as shown by the top row and left column facets. +We also see that all the estimators have near zero bias when there is no variation in cluster size or the cluster size does not relate to outcome, as shown by the top row and left column of facets. Finally, we see the methods all likely give the same answers when there is no cluster size variation, given the overplotted lines on the left column of the figure. -We might take this figure as still too complex. So far we have learned that MLM does seem to react to ICC, and that LR reacts to `alpha` and `size_coef` in combination. +We can continue to explore by changing our $x$-axis, e.g., to ICC. 
More broadly, with many levels of a factor, as we have with ICC, we can let ggplot aggregate directly by taking advantage of `geom_smooth()`. This leads to the following: ```{r, echo=FALSE, message=FALSE} -ggplot( sres, aes( ICC, bias, col=as.factor(alpha), +ggplot( sres, aes( ICC, bias, col=method, group=interaction(method,alpha) ) ) + - facet_grid( size_coef ~ method, labeller = label_both ) + + facet_grid( size_coef ~ alpha, labeller = label_both ) + geom_smooth( alpha=0.75, se=FALSE, method="loess", span=1.5 ) + geom_hline( yintercept = 0 ) + theme_minimal() ``` -Our story is fairly clear now: LR is biased when alpha is large and the cluster size relates to impact. -MLM can be biased when ICC is low, if cluster size relates to impact (this is because it is driving towards person-weighting when there is little cluster variation). - - - Aggregation is powerful, but it can be misleading if you have scaling issues or extreme outliers. With bias, our scale is fairly well set, so we are good. @@ -381,22 +368,22 @@ agg_perf <- sres %>% summarise( SE = sqrt( mean( SE^2 ) ) ) ``` -Because bias is linear, you do not need to worry about the MCSE. -But if you are looking at the magnitude of bias ($|bias|$), then you can run into issues when the biases are close to zero, if they are measured noisily. -For example, imagine you have two scenarios with true bias of 0.0, but your MCSE is 0.02. +MCSE can sometimes distort an aggregation of a performance measure. +For example, if you are looking at the magnitude of bias ($|bias|$), then you can run into issues when the biases are close to zero, if they are measured noisily. +In particular, imagine you have two scenarios with true bias of 0.0, but your MCSE is 0.02. In one scenario, you estimate a bias of 0.017, and in the other -0.023. If you average the estimated biases, you get -0.003, which suggests a small bias as we would wish. Averaging the absolute biases, on the other hand, gives you 0.02, which could be deceptive. 
With high MCSE and small magnitudes of bias, looking at average bias, not average $|bias|$, is safer. Alternatively, you can use the formula $RMSE^2 = Bias^2 + SE^2$ to back out the average absolute bias from the RMSE and SE. +In this approach, first aggregate to get overall RMSE and SE, and then calculate $Bias^2$ as $RMSE^2 - SE^2$. ## Comparing true SEs with standardization We just did a deep dive into bias. Uncertainty (standard errors) is another primary performance criterion of interest. - As an initial exploration, we plot the standard error estimates from our Cluster RCT simulation, using smoothed lines to visualize trends. We use `ggplot`'s `geom_smooth` to aggregate over `size_coef` and `alpha`, which we leave out of the plot. We include individual data points to visualize variation around the smoothed estimates: @@ -415,12 +402,12 @@ Second, standard error decreases as the number of clusters (J) increases, which In contrast, increasing the cluster size (n_bar) has relatively little effect on the standard error (the rows of our facets look about the same). Lastly, all methods show fairly similar levels of standard error overall. -While we can extract all of these from the figure, the figure is still not ideal for _comparing_ our methods. +While we can extract all of these observations from the figure, the figure is not ideal for _comparing_ our methods. The dominant influence of design features like ICC and sample size obscures our ability to detect meaningful differences between methods. In other words, even though SE changes across scenarios, it’s difficult to tell which method is actually performing better within each scenario. We can also view the same information by bundling over the left-out dimensions. 
-We put `n_bar` in our bundles because maybe it does not matter that much: +We add `n_bar` to our bundles, instead of faceting by it, because maybe it does not matter that much: ```{r} ggplot( sres, aes( ICC, SE, col=method, group=paste0( ICC, method ) ) ) + facet_grid( . ~ J, labeller = label_both ) + @@ -431,18 +418,19 @@ ggplot( sres, aes( ICC, SE, col=method, group=paste0( ICC, method ) ) ) + Our plots are still dominated by the strong effects of ICC and the number of clusters (J). When performance metrics like standard error vary systematically with design features, it becomes difficult to compare methods meaningfully across scenarios. -To address this, we shift our focus through standardization. Instead of noting that: +To address this, we shift our focus through standardization. +We want to shift from observations such as: > “All my SEs are getting smaller,” -we want to conclude that: +to conclusions such as: > “Estimator 1 has systematically higher SEs than Estimator 2 across scenarios.” -Simulation results are often driven by broad design effects, which can obscure the specific methodological questions we care about. Standardizing helps bring those comparisons to the forefront. -Let's try that next. - -One straightforward strategy for standardization is to compare each method’s performance to a designated baseline. In this example, we use Linear Regression (LR) as our baseline. +Simulation results are often driven by broad design effects, which can obscure the specific methodological questions we care about. +We can therefore standardize, by calculating a relative performance measure, to bring our desired comparisons to the forefront. +One straightforward strategy for standardization is to compare each method’s performance to a designated baseline. +In this example, we use Linear Regression (LR) as our baseline. We standardize by, for each simulation scenario, dividing each method’s SE by the SE of LR, to produce `SE.scale`. 
This relative measure, `SE.scale`, allows us to examine how much better or worse, across our scenarios, each method performs relative to a chosen reference method. @@ -458,7 +446,7 @@ ssres <- We can then treat `SE.scale` as a measure like any other. Here we bundle, showing how relative SE changes by J, `n_bar` and ICC: -```{r} +```{r, out.width="100%"} ggplot( ssres, aes( ICC, SE.scale, col=method, group = interaction(ICC, method) ) ) + facet_grid( n_bar ~ J, labeller = label_both ) + @@ -467,14 +455,15 @@ ggplot( ssres, aes( ICC, SE.scale, col=method, ``` The figure above shows how each method compares to LR across simulation scenarios. -Aggregation clearly performs worse than LR when the Intraclass Correlation Coefficient (ICC) is zero. However, when ICC is greater than zero, Aggregation yields improved precision. +Aggregation clearly performs worse than LR when the Intraclass Correlation Coefficient (ICC) is zero. +However, when the ICC is greater than zero, Aggregation yields improved precision. The Multilevel Model (MLM), in contrast, appears more adaptive. -It captures the benefits of aggregation when ICC is high, but avoids the precision cost when ICC is zero. -This adaptivity makes MLM appealing in practice when ICC is unknown or variable across contexts. +It captures the benefits of aggregation when the ICC is high, but avoids the precision cost when ICC is zero. +This adaptability makes MLM appealing in practice when the ICC is unknown or variable across contexts. -In looking at the plot we are seeing essentially identical rows and and fairly similar across columns. -This suggests we should bundle the `n_bar` to get a cleaner view of the main patterns, and that we can also bundle over `J` as well. -We finally drop the `LR` results entirely, as it is the reference method and always has a relative SE of 1. +In looking at the plot we are seeing essentially identical rows and fairly similar columns. 
+This suggests we should bundle over `n_bar` to get a cleaner view of the main patterns, and that we could also bundle over `J` as well.
+We can finally drop the `LR` results entirely, as it is the reference method and always has a relative SE of 1.

```{r}
ssres %>%
@@ -488,7 +477,8 @@ ssres %>%
   scale_y_continuous( breaks=seq(90, 125, by=5 ) )
```

-The pattern is clear: when ICC = 0, Aggregation performs worse than LR, and MLM performs about the same. But as ICC increases, Aggregation and MLM both improve, and perform about the same to each other.
+The pattern is clear: when ICC = 0, Aggregation performs worse than LR, and MLM performs about the same.
+But as ICC increases, Aggregation and MLM both improve, and perform about the same as each other.

This highlights the robustness of MLM across diverse conditions.


@@ -581,9 +571,9 @@ This is the kind of diagnostic plot we often wish were included in more applied

So far we have examined the performance of our _point estimators_.
We next look at ways to assess our _estimated_ standard errors.
-A good first question is whether they are about the right size, on average, across all the scenarios.
+A good first question is whether they are _calibrated_ (about the right size on average) across all the scenarios.

-When assessing estimated standard errors it is very important to see if they are _reliably_ the right size, making the bundling method an especially important tool here.
+When assessing estimated standard errors it is very important to see if they are _reliably_ the right size across the scenarios, making bundling an especially important tool here.
We first see if the average estimated SE, relative to the true SE, is usually around 1 across all scenarios:

```{r, out.width="100%"}
People often look at confidence interval coverage and confidence interval width, instead, to assess the quality of estimated SEs.
+We can instead try using the degree of freedom calculations we saw in Section \@ref(performance-stability).
+We plug in $\widehat{df} = \frac{2 \bar{V}^2}{S_V^2}$ for each scenario and plot:
+
+```{r}
+sres <- mutate( sres,
+                df_hat = 2 * ( ESE_hat^4 ) / ( SD_SE_hat^4 ) )
+ggplot( sres,
+        aes( ICC, df_hat, col=method,
+             group = interaction(ICC,method) ) ) +
+  facet_grid( . ~ J, labeller = label_both) +
+  geom_boxplot( position="dodge" ) +
+  labs( color="Method" ) +
+  scale_y_log10( breaks = c( 2.5, 5, 10, 20, 40, 80, 160 ),
+                 labels = c( 2.5, 5, 10, 20, 40, 80, 160 )
+  ) +
+  geom_hline( yintercept = c(5,20,80), lty=2, col="grey" ) +
+  scale_x_continuous( breaks = unique( sres$ICC ) )
+```
+
+The multilevel model generally has higher estimated degrees of freedom, especially when ICC is low, and Linear Regression has generally lower degrees of freedom overall (except when ICC is zero).
+

## Assessing confidence intervals


@@ -700,11 +711,12 @@ c_sub <- covres %>%
ggplot( c_sub, aes( ICC, coverage, col=method, group=method ) ) +
  facet_grid( . ~ J, labeller = label_both ) +
  geom_line( position = position_dodge( width=0.05)) +
-  geom_point( position = position_dodge( width=0.05) ) +
+  geom_point( position = position_dodge( width=0.05), size = 2 ) +
  geom_errorbar( aes( ymax = coverage + 2*coverage_mcse,
                      ymin = coverage - 2*coverage_mcse ),
                 width=0,
                 position = position_dodge( width=0.05) ) +
-  geom_hline( yintercept = 0.95 )
+  geom_hline( yintercept = 0.95 ) +
+  scale_y_continuous( labels = scales::percent_format() )
```

Generally coverage is good unless $J$ is low or ICC is 0.
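To read those error bars, note that coverage is a sample proportion over $R$ replicates, so (assuming independent replicates) its Monte Carlo standard error is

$$
MCSE_{coverage} = \sqrt{ \frac{p(1-p)}{R} }.
$$

For example, with nominal coverage $p = 0.95$ and $R = 1000$ replicates, this gives $\sqrt{0.95 \times 0.05 / 1000} \approx 0.007$, so bars of $\pm 2\, MCSE$ span roughly $\pm 1.4$ percentage points.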
@@ -742,8 +754,19 @@ There are mild differences due to differences in how the degrees of freedom are +## Conclusions + +Visualization is rarely a single plot; it is the process of iterating from messy, exploratory graphs to a small set of figures that communicate the main findings. +In this chapter we walked through four core tools for that process---subsetting, many small multiples, bundling, and aggregation---and showed how they can be used together to manage the complexity of multifactor simulations. +Subsetting helps with orientation and sanity checks, small multiples reveal interactions, bundling summarizes variability across scenarios, and aggregation highlights broad trends when the raw results are too noisy or too high dimensional to interpret directly. +We illustrated these tools using the Cluster RCT case study, beginning with bias and then extending the same workflow to standard errors, RMSE, estimated standard errors, and confidence interval performance. +A recurring theme was that raw performance measures are often dominated by design effects such as ICC and sample size, which can obscure the method comparisons we care about. +Standardization---especially expressing performance relative to a baseline method---can make comparisons far clearer, though it also propagates uncertainty from the baseline and should be interpreted with care. +Ultimately, the goal of visualization is to discover and communicate structure: where is performance stable and when does it change? +The plots in this chapter provide a template for an overall workflow, and we encourage you to use the code and approach in your own work. +There are further tools, and further concerns, that build on this core set of techniques; that is what we discuss in the following chapter. ## Exercises @@ -781,6 +804,13 @@ sres <- res %>% names( sres ) ``` +### Checking bias + +Under the discussion of aggregation, we closed with a plot suggesting negative bias as ICC increases when `size_coef=0`. 
+Investigate that finding, either by digging into the results more or by running additional simulations.
+_Do not_ re-run the entire simulation; sometimes a targeted investigation can be just as effective for answering a specific question.
+
+
### Assessing uncertainty


diff --git a/075-special-topics-on-reporting.Rmd b/075-special-topics-on-reporting.Rmd
index 6ba7a31..e5537f2 100644
--- a/075-special-topics-on-reporting.Rmd
+++ b/075-special-topics-on-reporting.Rmd
@@ -337,6 +337,11 @@ Another exploration approach we might use is regression trees.
We wrote a utility method, a wrapper to the `rpart` package, to do this ([script here](code/create_analysis_tree.R)).
Here, for example, we see what predicts larger bias amounts:

+```{r, include=FALSE}
+# To silence loading messages
+source( here::here( "code/create_analysis_tree.R" ) )
+```
+
```{r}
source( here::here( "code/create_analysis_tree.R" ) )

@@ -394,11 +399,11 @@ More generally, however, the Monte Carlo Standard Errors (MCSEs) may be so large
One tool to handle few iterations is aggregation: if you average across scenarios, those averages will have more precise estimates of (average) performance than the estimates of performance within the scenarios.
Do not, by contrast, trust the bundling approach--the MCSEs will make your boxes wider, and give the impression that there is more variation across scenarios than there really is.
-Meta regression approaches such as we saw above can be particularly useful: a regression will effectively average performance across scenario, and give summaries of overall trends.
+Meta regression approaches such as we saw above can also be useful: a regression effectively averages performance across scenarios and gives summaries of overall trends.
You can even fit random effects regression, specifically accounting for the noise in the scenario-specific performance measures.
For more on using random effects for your meta regression see @gilbert2024multilevel.
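As a sketch of what such a random effects meta regression could look like (assuming a scenario-level results table `sres` with bias estimates in `bias` and their MCSEs in `bias.mcse`; the `metafor` package is one option here, not the only one):

```{r, eval=FALSE}
# Sketch: precision-weighted meta regression of scenario-level bias on
# the simulation factors, treating each scenario's MCSE as known
# measurement error. Column names here are assumptions, not from the text.
library( metafor )
meta_mod <- rma( yi = bias,          # scenario-level performance estimate
                 sei = bias.mcse,    # its Monte Carlo standard error
                 mods = ~ method + ICC + alpha + size_coef,
                 data = sres )
summary( meta_mod )
```

The `mods` formula carries the simulation factors, while `sei` down-weights noisily estimated scenarios.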
-### Example: ClusterRCT with only 100 replicates per scenario +### Example: ClusterRCT with only 25 replicates per scenario ```{r, include=FALSE} @@ -407,7 +412,7 @@ set.seed( 40440 ) # Make small dataset res_small <- res %>% mutate( runID = as.numeric( runID ) ) %>% - filter( runID <= 100 ) + filter( runID <= 25 ) ssres <- res_small %>% group_by( n_bar, J, ATE, size_coef, ICC, alpha, method ) %>% @@ -427,13 +432,12 @@ summary( ssres$R ) ``` In the prior chapter we analyzed the results of our cluster RCT simulation with 1000 iterations per scenario. -But say we only had 25 per scenario. +But say we only had 25 iterations per scenario. Using the prior chapter as a guide, we next recreate some of the plots to show how MCSE can distort the picture of what is going on. -First, we look at our single plot of the raw results. -Before we plot, however, we calculate MCSEs and add them to the plot as error bars. +We start again with bias, and generate a single plot of the raw results for a subset of our scenarios, along with error bars from the MCSEs. -```{r} +```{r, echo=FALSE} sres_sub <- ssres %>% filter( n_bar == 320, J == 20 ) %>% @@ -456,10 +460,11 @@ ggplot( sres_sub, aes( as.factor(alpha), bias, Our uncertainty is much less when ICC is 0; this is because our estimators are far more precise due to not having cluster variation to contend with. Other than the ICC = 0 case, we see substantial amounts of uncertainty, making it very hard to tell the different estimators apart. -In the top row, second plot from left, we see that the three estimators are co-dependent: they all react similarly to the same datasets, so if we end up with datasets that randomly lead to large estimates, all three will give large estimates. -The shape we are seeing is not a systematic bias, but rather a shared random variation. +In the top row, second plot from left, we see that the three estimators are co-dependent: they all react similarly to the same datasets. 
+In other words, the point estimates are correlated: if a dataset happens to have high-outcome units in the treated group, all the estimators will give larger estimates of an average effect.
+The apparent trend we are seeing is not systematic bias, but rather shared random variation.

-Here is the same plot with the full 1000 replicates, with the 100 replicate results overlaid in light color for comparison:
+Here is the same plot with the full 1000 replicates, with the 25 replicate results overlaid in light color for comparison:

```{r, echo=FALSE}
sres_sub_full <-
@@ -489,9 +494,9 @@ ggplot( sres_sub_full, aes( as.factor(alpha), bias,
   coord_cartesian( ylim = c(-0.10,0.10) )
```

-The MCSEs have shrunk by around $1/\sqrt{10} = 0.32$, as we would expect (generally the MCSEs will be on the order of $1/\sqrt{R}$, where $R$ is the number of replicates, so to halve the MCSE you need to quadruple the number of replicates).
-Also note the ICC=0.2 top facet has shifted to a flat, slightly elevated line: we do not yet know if the elevation is real, just as we did not know if the dip in the prior plot was real.
-Our confidence intervals are still including 0: it is possible there is no bias at all when the size coefficient is 0 (in fact we are fairly sure it is indeed the case).
+With 1000 iterations, MCSEs are around $1/\sqrt{40} \approx 16\%$ the size, as we would expect (generally MCSEs are on the order of $1/\sqrt{R}$, where $R$ is the number of replicates, so to halve an MCSE you need to quadruple the number of replicates).
+Also note the ICC=0.2 top facet has shifted to a flat, slightly elevated line: we do not yet know if the elevation is real, just as we did not know if the dip in the prior plot was real, but we do know that whatever might be happening is of a much smaller scale than suggested by the prior plot.
+Our confidence intervals mostly still include 0: it is possible there is no bias at all when the size coefficient is 0 (in fact we are fairly sure this is indeed the case).

```{r, include=FALSE}
# Checking MCSEs are smaller as expected
@@ -499,12 +504,13 @@ summary( sres_sub_full$bias.mcse / sres_sub$bias.mcse )
```

Moving back to our "small replicates" simulation, we can use aggregation to smooth out some of our uncertainty.
-For example, if we aggregate across 9 scenarios, our number of replicates goes from 100 to 900; our MCSEs should then be about a third the size.
-To calculate an aggregated MCSE, we aggregate our scenario-specific MCSEs as follows:
+For example, if we aggregate across 9 scenarios, our number of replicates goes from 25 to 225 and our MCSEs should be about a third the size.
+To calculate an aggregated MCSE, aggregate the scenario-specific MCSEs as follows:

$$
MCSE_{agg} = \sqrt{ \frac{1}{K^2} \sum_{k=1}^{K} MCSE_k^2 }
$$
where $MCSE_k$ is the Monte Carlo Standard Error for scenario $k$, and $K$ is the number of scenarios being averaged.
Assuming a collection of estimates are independent, the overall $SE^2$ of an average is the average $SE^2$ divided by $K$.
+
In code we have:

```{r, echo=FALSE}
+Also note how our three methods continue to track each other up and down in the top row, due to shared error.
+This shared error can be deceptive.
+It can also be a boon: if we are explicitly comparing the performance of one method vs another, we can subtract out the shared error, similar to what happens in a blocked experiment [@gilbert2024multilevel].

-One way to take advantage of this is to fit a multilevel regression model to our raw simulation results with a random effect for dataset.
-We next fit such a model, taking advantage of the fact that bias is simply the average of the error across replicates.
+One way to remove this error to get a more precise head-to-head comparison of the estimators is to fit a multilevel regression model to our raw simulation results with a random effect for dataset.
+We next fit such a model, taking advantage of the fact that estimated bias is simply the average of the error across replicates.
We first make a unique ID for each scenario and dataset, and then fit the model with a random effect for both.
The first random effect allows for specific scenarios to have more or less bias beyond what our model predicts.
The second random effect allows for a given dataset to have a larger or smaller error than expected, shared across the three estimators.
@@ -595,22 +600,38 @@ knitr::kable(ranef_vars, digits = 2)
```

The random variation for `simID` captures unexplained variation due to the interactions of the simulation factors.
-It appears to be a trivial amount; almost all the variation is due to the dataset.
-This makes sense: each datasets is unbalanced due to random assignment, and that estimation error is part of the dataset random effect.
+Relative to the dataset variation, it seems to be a trivial amount.
+This makes sense: each dataset is unbalanced due to random assignment, and that estimation error, shared across all three estimators, is part of the dataset random effect.
-So far we have not included any simulation factors: we are pushing variation across simulation into the random effect terms. We can instead include the simulation factors as fixed effects, to see how they impact bias. +So far we have not included any simulation factors: we are pushing variation across simulation into the random effect terms. +We can instead include the simulation factors as fixed effects, to see how they impact bias. ```{r} M2 <- lmer( - error ~ method*(J + n_bar + ICC + alpha + size_coef) + (1|dataID) + (1|simID), + error ~ method*(J + n_bar + ICC + alpha * size_coef) + (1|dataID) + (1|simID), data = res_small ) -texreg::screenreg(M2) ``` -The above models allow us to estimate how bias varies with method and simulation factor, while accounting for the uncertainty in the simulation. +We then extract the fixed effects and sort by p-value to see which factors are most important for bias. +We adjust our p-values with a simple Benjamini-Hochberg false discovery rate correction to try and cut down the number of terms to those that are most likely to be real, and not just noise. +```{r} +library( broom.mixed ) +terms <- tidy( M2 ) %>% + dplyr::filter( effect == "fixed" ) %>% + dplyr::select( term, estimate, std.error, p.value ) %>% + mutate(p_fdr = p.adjust(p.value, method = "BH") ) + +terms %>% + filter( p_fdr < 0.10 ) %>% + arrange( p_fdr ) %>% + knitr::kable( digits=c(3) ) +``` -Finally, we can see how much variation has been explained by comparing the random effect variances: +We see what we have seen before: the LR method is notably biased when `size_coef` is non-zero and `alpha` is high. +We have estimated how bias varies with method and simulation factor, while accounting for the uncertainty in the simulation. 
+
+Finally, we can see how much variation has been explained by the simulation factors by comparing the random effect variances:

```{r}
ranef_vars1 <- as.data.frame(VarCorr(M)) %>%
@@ -625,10 +646,12 @@ ranef_vars2 <-
rr = left_join( ranef_vars1, ranef_vars2, by = "grp", suffix = c(".null", ".full") )
rr <- rr %>%
-  mutate( sd.red = sd.full / sd.null )
+  mutate( var.red = sd.full^2 / sd.null^2 )
knitr::kable(rr, digits = 2)
```

+Our model is explaining an estimated 16% of the variation across simulation scenarios.
+


@@ -638,82 +661,66 @@ knitr::kable(rr, digits = 2)

## What to do with warnings in simulations

Sometimes our analytic strategy might give some sort of warning (or fail altogether).
-For example, from the cluster randomized experiment case study we have:
+In Section \@ref(error-handling) we saw how to trap such warnings or errors.
+In particular, we illustrated this error trapping on the multilevel model for the cluster RCT experiment by revising our `analysis_MLM()` method to an `analysis_MLM_safe()` method, which gave output from the MLM method that looked like this:

```{r, include=FALSE}
source( "case_study_code/clustered_data_simulation.R" )
```

```{r}
-set.seed(101012) # (I picked this to show a warning.)
+set.seed(101012) # (I picked this to show an issue.)
dat <- gen_cluster_RCT( J = 50, n_bar = 100, sigma2_u = 0 )
-mod <- lmer( Yobs ~ 1 + Z + (1|sid), data=dat )
-```
-
-
-
-We have to make a deliberate decision as to what to do about this:
-
- - Keep these "weird" trials?
- - Drop them?
-
-Generally, when a method fails or gives a warning is something to investigate in its own right.
-Ideally, failure would not be too common, meaning we could drop those trials, or keep them, without really impacting our overall results.
-But one should at least know what one is ignoring.
-
-If you decide to drop them, you should drop the entire simulation iteration including the other estimators, even if they worked fine!
-If there is something particularly unusual about the dataset, then dropping for one estimator, and keeping for the others that maybe didn't give a warning, but did struggle to estimate the estimand, would be unfair: in the final performance measures the estimators that did not give a warning could be being held to a higher standard, making the comparisons between estimators biased. - -If your estimators generate warnings, you should calculate the rate of errors or warning messages as a performance measure. -Especially if you drop some trials, it is important to see how often things are acting pecularly. - -As discussed earlier, the main tool for doing this is the `quietly()` function: -```{r} -quiet_lmer = quietly( lmer ) -qmod <- quiet_lmer( Yobs ~ 1 + Z + (1|sid), data=dat ) -qmod -``` - -You then might have, in your analyzing code: -```{r, eval=FALSE} -analyze_data <- function( dat ) { - - M1 <- quiet_lmer( Yobs ~ 1 + Z + (1|sid), data=dat ) - message1 = ifelse( length( M1$message ) > 0, 1, 0 ) - warning1 = ifelse( length( M1$warning ) > 0, 1, 0 ) - - # Compile our results - tibble( ATE_hat = coef(M1)["Z"], - SE_hat = se.coef(M1)["Z"], - message = message1, - warning = warning1 ) -} +analysis_MLM_safe( dat ) ``` -Now you have your primary estimates, and also flags for whether there was a convergence issue. -In the analysis section you can then evaluate what proportion of the time there was a warning or message, and then do subset analyses to those simulation trials where there was no such warning. - +For this dataset, for example, we see that there was a convergence issue that gave rise to the indicated message. -For example, in our cluster RCT running example, we know that ICC is an important driver of when these convergence issues might occur, so we can explore how often we get a convergence message by ICC level: +Once the full simulation is run, we can examine the patterns of these warnings and errors to see if they are a serious or minor problem. 
+We first do this by calculating the rate of errors or warning messages as a performance measure across the simulation scenarios.
+Especially if we plan on possibly dropping some trials, it is important to see how often things are acting peculiarly.
-
+For example, in our cluster RCT running example, we know that ICC is an important driver of when convergence issues might occur, so we can explore how often we get a convergence message by ICC level by calculating the number of messages, warnings, or errors per 1000 iterations:

```{r examine_convergence_rates}
res %>%
+  dplyr::filter( method == "MLM" ) %>%
  group_by( method, ICC ) %>%
-  summarise( message = mean( message ) ) %>%
-  pivot_wider( names_from = "method", values_from="message" )
+  summarise( message = 1000*mean( message ),
+             warning = 1000*mean( warning ),
+             error = 1000*mean( error ) ) %>%
+  pivot_wider( names_from = method,
+               values_from=c( "message", "warning", "error" ) )
```

We see that when the ICC is 0 we get a lot of convergence issues, but as soon as we pull away from 0 it drops off considerably.
-At this point we might decide to drop those runs with a message or keep them.
-In this case, we decide to keep.
-It should not matter much, except possibly when ICC = 0, and we know the convergence issues are driven by trying to estimate a 0 variance, and thus is in some sense expected.
+Thankfully, we never saw actual errors and our warnings were very rare.
+
+At this point we have to decide whether to drop the problematic runs or keep them.
+Ideally, failure would not be too common, meaning we could drop or keep the problematic iterations without really impacting our overall results.
+
+In our case, for example, it should not matter much, except possibly when ICC = 0, and we know the convergence issues are driven by trying to estimate a 0 variance, which is thus in some sense expected.
Furthermore, we know people using these methods would likely ignore these messages, and thus we are faithfully capturing how these methods would be used in practice.
We might eventually, however, want to do a separate analysis of the ICC = 0 context to see if the MLM approach is actually falling apart, or if it is just throwing warnings.

+Importantly, if the decision is to drop, then the entire simulation iteration, including the other estimators, should be dropped even if the other estimators worked fine!
+If there is something particularly unusual about the dataset that caused the concern, then dropping for one estimator and keeping for the others would be unfair: in the final performance measures the estimators that did not give a warning could be held to a higher standard, making the comparisons between estimators biased.
+Consider, for example, if the dropped datasets were fundamentally harder to evaluate.
+
+
+## Conclusions
+
+Designed experiments in simulation are only as useful as the summaries you make of their output.
+In this chapter, we showed a few advanced ways of moving from raw simulation outputs to interpretable summaries using regression, ANOVA, LASSO, trees, and multilevel meta-models.
+These tools let you parse the main effects of your simulation factors along with their interactions, and can help turn many scenario-level estimates into clear statements about where methods work and where they break.
+They can also help guard against Monte Carlo simulation error distorting your findings.
+We also focused on two specific areas where analyzing a set of results can be tricky: when each iteration is expensive (so you do not have too many of them), and when your estimators sometimes fail or perform poorly in a way that raises warnings or flags.
+In the former case, aggregation and multilevel models can be especially useful.
+In the latter, evaluate when failures occur, and then either drop them consistently or explicitly model them. +Overall, these tools give a way to go deeper into a set of simulation results. +Often the findings here would not be presented in the main text, but they can be important supporting detail, especially when placed in a supplement. diff --git a/080-simulations-as-evidence.Rmd b/080-simulations-as-evidence.Rmd index d840551..077f9d7 100644 --- a/080-simulations-as-evidence.Rmd +++ b/080-simulations-as-evidence.Rmd @@ -2,144 +2,439 @@ # Simulations as evidence +```{r, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) + +library(tidyverse) +knitr::opts_chunk$set(echo = TRUE, + fig.width = 5, + fig.height = 3, + out.width = "75%", + fig.align = "center") +options(list(dplyr.summarise.inform = FALSE)) +theme_set( theme_classic() ) + + +library( MASS ) +library( randomForest ) +``` + We began this book with an acknowledgement that simulation is fraught with the potential for misuse: _simulations are doomed to succeed_. -We close this section by reiterating this point, and then discuss several ways researchers might design their simulations so they more arguably produce real evidence on how well things might work. +We close this part by reiterating this point, and then discussing several ways researchers might design their simulations so they more arguably produce real evidence. -When our work is done, ideally we will have generated simulations that provide a sort of "consumer product testing" for the study designs and statistical methods we are exploring. -Clearly, it is important to do some consumer product tests that are grounded in how consumers will actually use the product in real life---but before you get to that point, it can be helpful to cover an array of conditions that are probably more extreme than what will be experienced in practice (like dropping things off buildings or driving over them with cars or whatever else). 
+When our work is done, ideally we will have generated simulations that provide a sort of "consumer product testing" for the statistical methods we are exploring. +Clearly, it is important to do some consumer product tests that are grounded in how consumers will actually use the product in real life---but before you get to that point, it can be helpful to cover an array of conditions that are probably more extreme than what will be experienced in practice (simulations akin to dropping things off buildings or driving over them with cars or whatever else). For simulation, extreme levels of a factor make understanding the role they play more clear. -They can also help us understand the limits of our methods. -Extreme can go in both the good and bad directions. -Extreme contexts should also include best-case (or at least optimistic) scenarios for when our estimator of interest will excel, giving in effect an upper bound on how well things could go. +Extreme contexts can also help lay bare the limits of our methods. + +"Extreme" can go in both the good and bad directions. +Extreme contexts can include best-case (or at least optimistic) scenarios for when our estimator of interest will excel, giving a nominal upper bound on how well things could go. +Usually extreme simulations are simple ones, with fairly straightforward data generating processes that do not attempt to capture the complexities of real data. +Such extreme simulations might be considered overly _clean_ tests. -Extreme and clear simulations are also usually easier to write, manipulate, and understand. +Extreme and clean simulations are also usually easier to write, manipulate, and understand. Such simulations can help us learn about the estimators themselves. For example, we can use simulation to uncover how different aspects of a data generating process affect the performance of an estimator. To discover this, we would need a data generation process with clear levers controlling those aspects. 
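For instance, a lever could be as simple as an explicit contamination knob in the DGP (a minimal sketch; the names here are hypothetical, not from the case study):

```{r}
# A DGP with one clear lever: the proportion of outlier observations.
# All names here are illustrative.
gen_with_outliers <- function( N, prop_outlier = 0 ) {
  Y <- rnorm( N )
  n_out <- round( N * prop_outlier )
  if ( n_out > 0 ) {
    # Replace a random subset with draws from a much wider distribution.
    Y[ sample( N, n_out ) ] <- rnorm( n_out, sd = 10 )
  }
  data.frame( Y = Y )
}

dat <- gen_with_outliers( 100, prop_outlier = 0.10 )
```

Turning `prop_outlier` from 0 upward then traces out how an estimator degrades as contamination grows, holding everything else fixed.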
-Simple simulations can be used to push theory further---we can experiment to gain an inclination of whether a model is a good idea at all, or to verify we are right about an intuition or derivation about how well an estimator can work.
+Simple simulations can be used to push theory further---we can experiment to get an indication of whether a novel approach to something is a good idea at all, or to verify we are right about an intuition or derivation about how well an estimator can work.
But such simulations cannot be the end of our journey.

In general, a researcher should work to ensure their simulation evidence is _relevant_.
+The evidence should inform how things are likely to work out in practice.
A set of simulations only using unrealistic data generating processes may not be that useful.
Unless the estimators being tested have truly universally stable properties, we will not learn much from a simulation if it is not relevant to the problem at hand.
-We have to circle back to providing testing of the sorts of situations an eventual user of our methods might encounter in practice.
+After exploring the overly extreme or overly clean, we have to circle back to providing testing of the sorts of situations an eventual user of our methods might encounter in practice.

So how can we make our simulations more relevant?
-
-## Strategies for making relevant simulations
-
In the following subsections we go through a range of general strategies for making relevant simulations:
-
- 1. Break symmetries and regularities
- 2. Use extensive multi-factor simulations
- 3. Generate simulations based on prior literature.
- 4. Pick simulation factors based on real data
- 5. Resample real data directly to get authentic distributions
- 6. Design a fully calibrated simulation
+ 1. Break symmetries and regularities.
+ 2. Use extensive multi-factor simulations.
+ 3. Use simulation DGPs from prior literature.
+ 4. Pick simulation factors based on real data.
+ 5. 
Resample real data directly to get authentic distributions. + 6. Design a fully calibrated simulation. -### Break symmetries and regularities +## Break symmetries and regularities In a series of famous causal inference papers [@lin2013agnostic; @freedman2008regression], researchers examined when linear regression adjustment of a randomized experiment (i.e., when controlling for baseline covariates in a randomized experiment) could cause problems. -Critically, if the treatment assignment is 50%, then the concerns that these researchers examined do not come into play, as asymmetries between the two groups gets perfectly cancelled out. -That said, if the treatment proportion is more lopsided, then under some circumstances you can get bias, and you can get invalid standard errors, depending on other structures of the data. +Critically, if the treatment assignment is 50%, then the concerns that these researchers examined do not come into play, as asymmetries between the two groups get perfectly cancelled out. +That said, if the treatment proportion is more lopsided, then under some circumstances you can get bias and invalid standard errors, depending on other structures of the data. Simulations can be used to explore these issues, but only if we break the symmetry of the 50% treatment assignment. -When designing simulations, it is worth looking for places of symmetry, because in those contexts estimators will often work better than they might otherwise, and other factors may not have as much of an effect as anticipated. +When designing simulations, it is worth looking for places of symmetry, because in those contexts estimators will often work better than they might otherwise, and other factors may not have as much of an effect on performance as anticipated. Similarly, in recent work on best practices for analyzing multisite experiments [@miratrix2021applied], we identified how different estimators could be targeting different estimands. 
In particular, some estimators target site-average treatment effects, some target person-average treatment effects, and some target a kind of precision-weighted blend of the two.
-To see this play out in practice, our simulations needed the sizes of sites to vary, and also the proportion of treated within site to vary.
-If we had run simulations with equal site size and equal proportion treated, we would not see the broader behavior that separates the estimators considered.
+To see this play out in practice, our simulations needed the sizes of sites to vary, and also the proportion of treated within site to vary across sites.
+If we had run simulations with equal site size or equal proportion treated, we would not see the broader behavior that separates the estimators considered.
Overall, it is important to make your data generation processes irregular.

-### Make your simulation general with an extensive multi-factor experiment
+## Use extensive multi-factor simulations

"If a single simulation is not convincing, use more of them," is one principle a researcher might take.
-By conducting extensive multifactor simulations, once can explore a large space of possible data generating scenarios.
+By conducting extensive multifactor simulations, one can explore a large space of possible data generating scenarios.
If, across the full range of scenarios, a general story bears out, then perhaps that will be more convincing than a narrower range.
-Of course, the critic will claim that some aspect that is not varying is the real culprit.
+Of course, the critic will claim that some aspect of your DGP that is not varying is the real culprit.
If this aspect is unrealistic, then the findings, across the board, may be less relevant.
-Thus, pick the factors one varies with care.
+Thus, pick the factors one varies with care, and pick factors in anticipation of reviewer critique.

-### Use previously published simulations to beat them at their own game
+## Use simulation DGPs from prior literature
If a relevant prior paper uses a simulation to make a case, one approach is to replicate that simulation, adding in the new estimator one wants to evaluate.
-This makes it (more) clear that you are not fishing: you are using something established in the literature as a published benchmark.
+In effect, use previously published simulations to beat them at their own game.
+Using an established DGP makes it (more) clear that you are not fishing: you are using something that already exists in the literature as a published benchmark.
By constraining oneself to published simulations, one has less wiggle room to cherry pick a data generating process that works the way you want.
+Indeed, such a benchmarking approach is often taken in the machine learning literature, where canonical datasets are used to evaluate the performance of new methods.

-### Calibrate simulation factors to real data
+## Pick simulation factors based on real data

Use real data to inform choice of factor levels or other data generation features.
-For example, in James's work on designing methods for meta-analysis, there is often a question of how big sample sizes should be and how many different outcomes per study there should be when simulating effect size estimates from hypothetical studies to be included in a hypothetical meta-analysis.
-To make the simulations realistic, James obtained data from a set of past real meta-analyses and fit parametric models to these features, and used the resulting parameter estimates as benchmarks for how large and how varied the simulated meta analyses should be.
+For example, in James's work on designing methods for meta-analysis, there is often a question of how big simulated sample sizes should be and how many different simulated outcomes per study there should be when generating sets of hypothetical studies to be included in hypothetical meta-analyses.
+In one project, to make such simulations realistic, James obtained data from a set of past real meta-analyses and fit parametric models to these features, and then used the resulting parameter estimates as benchmarks for how large and how varied the simulated meta-analyses should be.
+

-In this case, one would probably not use the exact estimated values, but instead use them as a point of reference and possibly explore a range of values around them.
+When doing this, we recommend not using the exact estimated values, but instead using them as a point of reference.
For instance, say we find that the distribution of study sizes fits a Poisson(63) pretty well.
-We might then then simulate study sizes using a Poisson with mean parameters of 40, 60, or 80 (where 40, 60, and 80 would be one factor in our multifactor experiment).
+We might then simulate study sizes using a Poisson with a mean of 40, 60, or 80 (where 40, 60, and 80 would be the levels of one factor in our multifactor experiment).

-### Use real data to obtain directly
+## Resample real data directly to get authentic distributions

-You can also use real data directly to avoid a parametric model.
+You can also use real data directly to avoid a parametric model entirely.
For example, say you need a population distribution for a slew of covariates.
-Rather than using something artificial (like multivariate normal), you can pull population data on a bunch of covariates from some administrative data and then use those data as a (finite) population distribution.
-You then simple sample (possibly with replacement) rows from your reference population to generate your simulationsample.
+Rather than using something artificial (such as a multivariate normal), you can pull population data on a bunch of covariates from some administrative data and then use those data as a population distribution.
+You then sample (probably with replacement) rows from your reference population to generate your simulation sample.
The appeal here is that your covariates will then have distributions that are much more authentic than some multivariate normal--they will include both continuous and categorical variables, they will tend to be skewed, and they will be correlated in different ways that are more interesting than anything someone could make up. -As an illustration of this approach, an old paper on heteroscedasticity-robust standard errors (Long and Irvin, 2000) does something similar. -It was important in the context that they’re studying because the behavior of heteroscedasticity-robust SEs is influenced by leverage, which is a function of the covariate distribution. -Getting authentic leverage was hard to do with a parametric model, so getting examples of it from real life made the simulation more relevant. +@longUsingHeteroscedasticityConsistent2000 used an approach like this in their study of heteroscedasticity-robust standard errors. +They wanted to explore how well these standard errors worked in practice, with a focus on how high leverage points could undermine their effectiveness. +Leverage is purely a function of the covariate distribution, so they needed to have an authentic covariate distribution in order to properly explore this issue. +Generating realistic leverage is hard to do with a parametric model, so instead they used examples from real life to make the simulation more relevant. -### Fully calibrated simulations + +## Fully calibrated simulations {#calibrated-simulations} Extending some of the prior ideas even further, one practice in increasing vogue is to generate _calibrated simulations_. -These are simulations tailored to a specific applied contexts, where we design our simulation study to more narrowly inform what assumptions and structures are necessary in order to make progress in that specific context. +See @Kern2014calibrated for an example of this in the context of generalizing experiments. 
+This paper has one of the earliest uses of this term, as far as we know.
+
+A calibrated simulation is one that is tailored to a specific applied context, with a DGP that is close to what we think the real world is like in that context.
+Usually, this means we try to generate data that looks as much like a target dataset as possible, with the same covariate distributions and the same relationships between covariates and outcomes.
+These simulations are not particularly general, but instead are designed to more narrowly inform how well things would work for a particular circumstance.

-Often we would do this by building our simulations out of existing data.
-For an example from above, one might sample, with replacement, from the covariate distribution of an actual dataset so that the distribution of covariates is authentic in how the covariates are distributed and, more importantly, how they co-relate.
+Once we have a fully calibrated DGP, targeting a specific dataset, we can then modify it by adjusting sample sizes or other aspects to explore a space of DGPs that are all connected to this core initial calibrated DGP.
+We can thus still build more general evidence, but the evidence will always be anchored by our initial target.
+
+We usually would build a calibrated simulation out of existing data.
+As discussed in the prior section, we might start with sampling from the covariate distribution of an actual dataset so that the distribution of covariates is authentic in how the covariates are distributed and, more importantly, how they co-relate.
But this is not far enough.
We also need to generate a realistic relationship between our covariates and outcome to truly assess how well the estimators work in practice.
It is very easy to accidentally put a very simple model in place for this final component, thus making a calibrated simulation quite naive in, perhaps, the very way that counts.
-We next walk through how you might calibrate further in the context of evaluating estimators for some sort of causal inference context where we are assessing methods of estimating a treatment effect of some binary treatment.
-If we just resample our covariates, but then layer a constant treatment effect on top, we may be missing critical aspects of how our estimators might fail in practice.
+One approach to getting a more realistic model of the relationship between covariates and outcomes is to fit a flexible model to the original data, and then use the model to predict outcomes from the covariates.
+One would then generate residuals to add to the predictions, ideally in a way that preserves the heteroscedasticity and overall outcome distribution of the original data.
+For example, in the subsection below, we provide a copula-based approach to doing this, which ensures the outcome distribution is in line with the original data while allowing control over how tightly coupled the covariates are to the outcome.
+We do not say this is the only approach, but rather one of a large range of possibilities.

-In the area of causal inference, the potential outcomes framework provides a natural path for generating calibrated simulations [@Kern2014calibrated].
-Also see \@ref(potential-outcomes) for more discussion of simulations in the potential outcomes framework.
-Under this framework, we would take an existing randomized experiment or observational study and then impute all the missing potential outcomes under some specific scheme.
-This fully defines the sample of interest and thus any target parameters, such as a measure of heterogeneity, are then fully known.
-For our simulation we then synthetically, and repeatedly, randomize and ``observe'' outcomes to be analyzed with the methods we are testing.
-We could also resample from our dataset to generate datasets of different size, or to have a superpopulation target as our estimand.
+Clearly, these calibration games can be fairly complex.
+Calibrated simulations frequently do not lend themselves to a clear factor-based structure with levers that change targeted aspects of the data generating process (although sometimes you can craft your approach to make such levers possible).
+In exchange for this loss of clean structure, we end up with a simulation that may more faithfully capture aspects of a specific context, making our simulations more relevant to answering the narrower question of "how will things work here?" even if they are less relevant for answering the question of "how will things work generally?"

-The key feature here is the imputation step: how do we build our full set of covariates and outcomes?
-One baseline method one can use is to generate a matched-pairs dataset by, for each unit, finding a close match given all the demographic and other covariate information of the sample. We then use the matched unit as the imputed potential outcome.
-By doing this (with replacement) for all units we can generate a fully imputed dataset which we then use as our population, with all outcomes being "real," as they are taken from actual data.
-Such matching can preserve complex relationships in the data that are not model dependent.
-In particular, if outcomes tend to be coarsely defined (e.g., on an integer scale) or have specific clumps (such as zero-inflation or rounding), this structure will be preserved.
-One concern with this approach is the noise in the matching could in general dilute the structure of the treatment effect as the control- and treatment-side potential outcomes may be very unrelated, creating a lot of so-called idiosyncratic treatment variation (unit-to-unit variation in the treatment effects that is not explained by the covariates).
-This is akin to measurement error diluting found relationships in linear models.
-We could reduce such variation by first imputing missing outcomes using some model (e.g., a random forest) fit to the original data, and then matching on all units including the imputed potential outcome as a hidden "covariate."
-This is not a data analysis strategy, but instead a method of generating synthetic data that both has a given structure of interest and also remains faithful to the idiosyncrasies of an actual dataset.
+## Generating calibrated data with a copula
+
+
+In this section, we borrow from a blog post^[See https://cares-blog.gse.harvard.edu/post/copulas-for-simulation/.] to illustrate how to use copulas to generate data that is closely calibrated to a target dataset.
+In this example, we will generate new observations $(X, Y)$, where $X$ is a vector of covariates (some continuous, some categorical) and $Y$ is a continuous outcome of an arbitrary distribution as described by a target dataset.
+
+A copula is a mathematical device used to describe a multivariate distribution in terms of its marginals and the correlation structure between them; here, we use one to generate new $Y_i$ values based on the $X_i$ in a way that maintains the same marginal distributions for all variables as the empirical dataset.
+
+We found the copula to have several benefits.
+First, the copula approach can give a different $Y$ for a given $X$.
+By contrast, simply bootstrapping our empirical data (a common way to generate synthetic data that looks like an empirical target) would not achieve this.
+Second, we could control the tightness of the coupling between $X$ and $Y$, while maintaining a fairly complex and nonlinear relationship between $X$ and $Y$, and while maintaining possible heteroscedasticity found in the initial dataset.
+ + +```{r setup_sim_evidence, include=FALSE} + +N = 3000 +Sigma_X = diag( c(1,2,3) ) + 1 +Sigma_X[1,2] = Sigma_X[2,1] = -0.8 +Sigma_X + +#I use some fake data generated with this code: + +X = mvrnorm( 2*N, mu=c(1,2,3), Sigma = Sigma_X) +colnames(X) = paste0( "X", 1:ncol(X) ) + + dat <- X %>% as_tibble() %>% + mutate( X4 = X1 + rnorm( n() ), + Y = (1 * X1^2 + 2 * X1*X2 + exp(X3 / 2 ) + rnorm(2*N, sd=4))^2, + Y = pmin( Y, quantile( Y, 0.95 ) ), + X1 = round( 2*X1 ), + X2 = 0 + ( X2 >= 0.5 ) ) + + dat$Y[ dat$Y > 300 & dat$Y < 800 ] = 550 + +``` + +```{r, include=FALSE} +new_data = dat[1:N,] +new_data <- rename( new_data, + trueY = Y ) +dat = dat[(N+1):(2*N),] + +#plot(dat) +``` + +To illustrate, we will use a (fake) dataset with four covariates, X1, X2, X3, and X4, and an outcome, Y. +Here are a few rows of this "empirical" data that we are trying to match: + +```{r, echo=FALSE} +dat %>% + slice_tail( n = 5 ) %>% + knitr::kable( digits = 2 ) +``` + +Our outcome in this case is a bit funky: + +```{r, message=FALSE, warning=FALSE, echo=FALSE } +qplot( dat$Y, bins = 30 ) +``` + +The covariates have a correlation structure with themselves and also Y. +Here are the pairwise correlations on the raw values: + +```{r} +print( round( cor( dat ), digits = 2 ) ) +``` + +We want to generate synthetic data that captures the correlation structure of our empirical data along with the marginal distributions of each covariate and outcome. +We will do this via the following steps: + +**Steps for making data with a copula:** + +1. Predict $Y$ given the full set of covariates $X$ using a machine learner. +2. Tie $\hat{Y}_i$ and $Y_i$ together with a copula. +3. Generate new $X$s (in our case, via the bootstrap). +4. Generate a predicted $\hat{Y}_i$ for each $X_i$ using the machine learner. +5. Generate the final, observed, $Y_i$ using our copula. + +### Step 1. Predict the outcome given the covariates + +We can use some machine learning model to make predictions of $Y$ given $X$. 
+Here we use a random forest:
+
+```{r}
+library(randomForest)
+mod <- randomForest( Y ~ .,
+                     data = dat,
+                     ntree = 200 )
+dat$Y_hat = predict( mod )
+```
+
+Our predictive model for $Y$ can be anything we want: we are just trying to get some predicted $Y$ that is appropriately tied to the $X$.
+
+Our predictions will be underdispersed and not natural in appearance.
+For starters, the standard deviation of the $\hat{Y}$ will be less than the original $Y$, because our predictions are only getting at the structural aspects, not the residual noise:
+
+```{r}
+tibble( sd_Yhat = sd( dat$Y_hat ),
+        sd_Y = sd( dat$Y ),
+        cor = cor( dat$Y_hat, dat$Y ) ) %>%
+  knitr::kable( digits = c(1,1,2) )
+```
+
+Our synthetic $Y$ values will need to be "noised up" so they look more natural.
+We also want them to match the empirical distribution of our original data.
+
+
+### Step 2. Tie $\hat{Y}$ to $Y$ with a copula
+
+The first step of the copula approach is to make a function that converts empirical data to normalized $z$-scores based on their percentile rank, as we will be doing all our data generation in "normal $z$-score space."
+In other words, we take a value, calculate its percentile, and then calculate what $z$-score that percentile would have, if the value came from a standard normal distribution.
+
+Here is a function to do this:
+```{r}
+convert_to_z <- function( Y ) {
+  K = length(Y)
+  r_Y <- rank( Y, ties.method = "random" )/K - 1/(2*K)
+  z <- qnorm( r_Y )
+  z
+}
+```
+We calculate percentiles by ranking the values, and then taking the midpoint of each rank.
+If we had 10 values, for example, we would then have percentiles of 5%, 15%, ..., 95%.
+(We can't have 0% or 100%, since these would turn into $\pm\infty$ when converting to a $z$-score.)
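+As a quick check of the midpoint logic (this small demo is ours, not part of the original post), four distinct values get percentiles of 12.5%, 37.5%, 62.5%, and 87.5%:
+
+```{r}
+# With no ties, the ranks are deterministic, so the z-scores are
+# qnorm( c(0.125, 0.375, 0.625, 0.875) ),
+# i.e., roughly -1.15, -0.32, 0.32, 1.15.
+round( convert_to_z( c( 3, 10, 48, 200 ) ), 2 )
+```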
+
+Using our function, we convert both our $Y$ and predicted $Y$ to these normal $z$-scores:
+```{r, message=FALSE}
+z_Y = convert_to_z( dat$Y )
+z_Yhat = convert_to_z( dat$Y_hat )
+
+qplot( z_Y, z_Yhat, asp = 1, alpha=0.05 ) +
+  geom_abline( slope=1, intercept=0, col="red" ) +
+  labs( title = "z-scores of Y vs predicted Y" ) +
+  coord_fixed() +
+  theme( legend.position = "none" )
+```
+
+Note how the $z$-scores have smoothed our Y distribution due to the random tie-breaking, breaking up our spikes of identical values.
+We can then see how coupled the predicted and actual are:
+
+```{r}
+stats <- tibble(
+  rho = cor( z_Y, z_Yhat ),
+  sd_Y = sd( z_Y ),
+  sd_Yhat = sd( z_Yhat ) )
+stats
+```
+
+### Step 3. Get some new Xs
+
+So far we are just modeling the relationship in our original data.
+We next set the stage for generating a new dataset.
+We start with making more $X$ values.
+A simple trick is to bootstrap the data, keeping the $X$ only.
+We might, for example, generate a new dataset of 500 observations like so:
+
+```{r}
+nd = dat %>%
+  dplyr::select( starts_with("X") ) %>%
+  slice_sample( n = 500, replace = TRUE )
+```
+
+If we don't want exact duplicates of the vectors of $X$ values in our data, we can use other methods such as the `synthpop` package, which generates data based on a series of conditional distributions.
+
+So that we can check our work later, we are instead going to cheat and use a dataset, `new_data`, that is generated from the same DGP as our original fake data.
+It looks like so:
+
+```{r, echo=FALSE}
+new_data
+```
+
+To be very specific, `trueY` would not normally be in our data at this point; we have it here so we can compare our generated Y to the true Y to see how well our DGP is working.
+In practice, we would not be able to do this: we would need to generate new $X$ as shown above.
+
+
+### Step 4. Generate a predicted $\hat{Y}_i$ for each $X_i$ using the machine learner.
+
+This step is quite easy:
+
+```{r}
+new_data$Yhat = predict( mod, newdata = new_data )
+```
+
+Done!
+
+### Step 5. Generate the final, observed, $Y_i$ using a copula that ties $\hat{Y}_i$ to $Y_i$.
+
+Now we generate our new $Y$ by converting our predictions (the Yhat) to the new $Y$ via the copula.
+We are basically using the same pieces as in the $z$-score function, above.
+
+We first generate the percentiles the Y-hats represent:
+```{r}
+K = nrow(new_data)
+new_data$r_Yhat = rank( new_data$Yhat ) / K - 1/(2*K)
+summary( new_data$r_Yhat )
+```
+
+We then convert the percentiles to the $z$-scores they represent:
+```{r}
+new_data$z_Yhat = qnorm( new_data$r_Yhat )
+summary( new_data$z_Yhat )
+```
+
+We next use the formula for the conditional distribution of one normal variable given another: if $X_1$ and $X_2$ are jointly standard normal with correlation $\rho$, then
+$$X_2 | X_1 = a \sim N( \rho a, 1 - \rho^2 ) .$$
+With this formula, we generate $Y$ like so:
+
+```{r}
+rho = cor( z_Y, z_Yhat ) # from our empirical data
+new_data$z_Y = rnorm( K,
+                      mean = rho * new_data$z_Yhat,
+                      sd = sqrt( 1-rho^2 ) )
+```
+
+Now, finally, for each observation in z-space, we figure out what rank it would correspond to in our empirical distribution, and then take that empirical data point as the value:
+
+```{r}
+empDist = sort(dat$Y)
+K = length(empDist)
+new_data$r_Y = ceiling( pnorm( new_data$z_Y ) * K )
+new_data$Y = empDist[ new_data$r_Y ]
+```
+
+And we are done!
+
+### How did we do?
+
+Let's explore results: did we end up with something "authentic"?
+We first compare the distribution of our generated $Y$ to the true $Y$ from the real DGP:
+
+```{r, echo=FALSE}
+dtL = new_data %>%
+  pivot_longer( cols = c( Y, Yhat, trueY ),
+                names_to = "kind", values_to = "Y" )
+
+ggplot( dtL, aes( Y ) ) +
+  facet_wrap( ~ kind, nrow = 1 ) +
+  geom_histogram( alpha=0.5, binwidth = 100 )
+```
+`trueY` and $Y$ are very close! Note the Yhat is quite different, by contrast.
+
+We further check our process by looking at how closely the correlations of our new Y with the Xs match the correlations of the true Y and the Xs:
+
+```{r, echo=FALSE}
+cor( new_data[ c( "X1","X2", "X3", "X4", "Yhat", "Y", "trueY" ) ] ) %>%
+  as.data.frame() %>%
+  rownames_to_column( "Variable" ) %>%
+  dplyr::select( Variable, Y, trueY ) %>%
+  dplyr::filter( !(Variable %in% c( "Yhat", "Y", "trueY" )) ) %>%
+  knitr::kable( digits = 2 )
+```
+Our new Y roughly looks like the true Y, with respect to the correlation with the Xs.
+We gave our copula approach a hard test here---simpler covariate relationships will be better captured.
+That said, this doesn't seem that bad.
+
+
+
+
+### Building a multifactor simulation from our calibrated DGP
+
+In the above, we made a DGP that mimics, as closely as we can, the empirical data we are targeting.
+In an actual simulation, however, we would generally want to manipulate different aspects of our DGP to explore how different methods might function.
+
+We offer some examples of how to do this for this context.
+
+First, we can directly set `rho` to control how tightly coupled the outcome is to the covariates.
+If `rho` is 0, then our outcome distribution would be exactly the same, but there would be no connection between it and the covariates.
+For `rho` of 1, the outcome would be perfectly determined by the predicted outcome, which is a function of the covariates, maximizing the strength of the connection between $X$ and $Y$.
+
+
+Second, we can easily control sample size by generating larger or smaller datasets of $X$ and then generating $Y$ from those $X$.
+
+Third, we can drop some of our $X$s by simply not including them in the prediction model, thus making them noise variables.
+We may still end up with relationships between a dropped $X$ and $Y$ due to the dropped $X$'s correlation structure with other $X$s that are tied to $Y$.
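+To make `rho` and sample size explicit levers, the copula steps above can be wrapped into a single generator function (a sketch using the `dat`, `mod`, and `convert_to_z` objects defined earlier; the function name is ours):
+
+```{r, eval=FALSE}
+# Generate n calibrated observations with coupling strength rho.
+gen_calibrated <- function( n, rho ) {
+  # Bootstrap covariates from the empirical data.
+  nd <- dat %>%
+    dplyr::select( starts_with("X") ) %>%
+    slice_sample( n = n, replace = TRUE )
+  # Predict, convert to z-scores, and apply the copula step.
+  z_Yhat <- convert_to_z( predict( mod, newdata = nd ) )
+  z_Y <- rnorm( n, mean = rho * z_Yhat, sd = sqrt( 1 - rho^2 ) )
+  # Map z-scores back onto the empirical outcome distribution.
+  empDist <- sort( dat$Y )
+  nd$Y <- empDist[ ceiling( pnorm( z_Y ) * length( empDist ) ) ]
+  nd
+}
+```
+
+Calling `gen_calibrated( 500, rho = 0 )` would give outcomes with the empirical distribution but no tie to the covariates, while `rho` near 1 recovers the tight coupling of the calibrated DGP.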
+ -A second approach that allows for varying the level of a systematic effect is to specify a treatment effect model, predict treatment effects for all units and use those to impute the treatment potential outcome for all control units. -This will perfectly preserve the complex structure between the covariates and the $Y_i(0)$s. -Unfortunately, this would also give no idiosyncratic treatment variation . -To add in idiosyncratic variation we could then need to generate a distribution of perturbations and add these to the imputed outcomes just as an error term in a regression model. -Regardless of how we generate them, once we have a "fully observed" sample with the full set of treatment and control potential outcomes for all of our units, we can calculate any target estimands we like on our population, and then compare our estimators to these ground truths (even if they have no parametric analog) as desired. -Clearly, these calibration games can be fairly complex. -They do not lend themselves to a clear factor-based structure that have levers that change targeted aspects of the data generating process (although sometimes you can build in controls to make such levers possible). -In exchange, we end up with a simulation that might be more faithfully capturing aspects of a specific context, making our simulations more relevant to answering the narrower question of "how will things work here?" diff --git a/105-file-management.Rmd b/105-file-management.Rmd index 8d824d9..9c97ed2 100644 --- a/105-file-management.Rmd +++ b/105-file-management.Rmd @@ -37,8 +37,9 @@ The third phase involves analyzing the simulation results and drawing conclusion The end-product of this phase is often a set of figures, tables, and text (perhaps taking the form of a memo, blog post, slide deck, or manuscript) that summarizes what you find. To stay organized when conducting a larger, multifactor simulation, we find it generally useful to keep a clear separation between phases. 
-In practice, this means keeping the code for each phase in a separate file (or set of files) and storing the results from each phase so that code for subsequent phases can be re-run or revised without re-executing the entire code base. -This approach is in keeping with a general principle for organizing large computational projects: keep the code for distinct tasks in different files. +In practice, this means keeping the code for each phase in a separate file (or set of files) and storing the results from each phase so that code for subsequent phases can be re-run or revised without re-executing the entire code base. +In particular, dividing your simulations in this way allows for easily changing how one _analyzes_ an experiment without re-running the entire thing. +The modular approach we propose is in keeping with a general principle for organizing large computational projects: keep the code for distinct tasks in different files. In this chapter, we describe strategies for organizing the code and work products involved in a multifactor simulation study. We start by offering guidance and recommendations for how to structure individual files and make use of code that is organized across more than one file. @@ -65,10 +66,11 @@ However, it is difficult to work this way if the simulation involves more intens For larger or more computationally demanding projects, we find it useful to make more use of the first type of file, the humble `.R` script. Ideally, you should follow the principles of modular programming for the code in these files. Each `.R` file should hold a collection of code that is related to a single task. -Accordingly, we distinguish between two types of of `.R` scripts: those that only contain functions that can be used for other purposes and those that carry out numerical calculations. 
+Accordingly, we distinguish between two types of `.R` scripts: those that only contain functions that can be used for other purposes and those that carry out numerical calculations. The former type just holds functions, so that the only consequence of running them from top to bottom will be to load some new functions into your workspace. The latter type are traditional scripts that do calculations, store or load data files, and create graphs and other summaries of data. + ### Putting headers in your .R file When you write an `.R` script, it is a good idea to put a header at the top of the file that describes the file's purpose. @@ -86,10 +88,13 @@ Using four trailing dashes indicates a section of the code that If you click on the dropdown at the bottom of the RStudio source pane, you will see a pop-up table of contents that allows you to quickly navigate to different parts of your source file. You can also see the same table of contents by clicking the `Outline` button in the top right corner of the source pane. + + + ### The source command {#about-source-command} If your code is organized across more than one file, you will need a way to call one file from another. -This is the purpose of the `source()` command, which effectively cuts-and-pastes the contents of the given file into your R session. +This is the purpose of the `source()` command, which effectively cuts-and-pastes the contents of the named file into your R session. For example, the following code runs three separate R scripts in turn: ```{r demo_source_multiple, eval=FALSE} @@ -113,8 +118,8 @@ With this structure, there would be no way to source the contents of `simulation For this reason, we tend to avoid using `source()` in scripts that might themselves be sourced. -One reason for storing code in individual files and using `source()` is that you can then include testing code in each of your files to both demonstrate the syntax and check the correctness of your functions.
-Then, when you are not focused on that particular function or component, you don't have to look at that testing code. +One reason for storing code in individual files and using `source()` is that you can also then include testing code in each of your files to both demonstrate the syntax and check the correctness of your functions. +Then, when you are not focused on that particular function or component, you do not have to look at that testing code. Another good reason for this type of modular organization is that it supports developing a broader simulation universe, potentially with a variety of data generating functions. With a whole library of options, you can then readily run multiple simulations that involve some different components and some shared features. @@ -130,18 +135,23 @@ When we sourced it, we ended up with the following methods available to call: [7] "make_dat_tuned" "rand_exp" "summarize_sim_data" ``` -The `describe()` and `summarize()` methods printed various statistics about a sample dataset; we used used these to examine generated datasets and debug the data generating functions. -We also had a variety of different data generating methods, which we developed over the course of the project as we were trying to chase down errors in our estimators and understand strange behavior. +The main methods in the file are a set of different data generating functions, each developed over the course of the project as we chased down errors in our estimators and worked to understand strange behavior. +The `describe()` and `summarize()` methods print summary statistics for a given dataset; we used these to explore and debug our different data generating functions. + +We also had a separate file, `estimators.R`, that contained the different estimators we were comparing. +Having our estimators in a separate file made it easy to use them for other purposes, such as our empirical data analysis.
+We could simply source the `estimators.R` file and then apply them to a real dataset. +Having a single file ensured that our simulation and applied analysis were exactly aligned in terms of the estimators we were using. +After we debugged or tweaked our estimators, we could immediately re-run our applied analysis to update the results, without worrying about whether the calculations were up-to-date and consistent with the simulation code. -For this project, we stored estimation functions in a separate file from the data-generating functions. -Doing so made it easier to use these functions for purposes other than running simulations. -The project also involved doing some empirical data analysis, and so we could simply source the file with our estimation functions and apply them to a real dataset. -This ensured that our simulation and applied analysis were exactly aligned in terms of the estimators we were using. -As we debugged and tweaked our estimators, we could immediately re-run our applied analysis to update the results, without worrying about whether the calculations were up-to-date and consistent with the simulation code. -### Storing testing code {#storing-testing-code} + + + +### Storing testing code in your scripts {#about-keeping-tests-with-FALSE} If you have an extended `.R` file with a collection of functions, you might also want to store some code that runs each function in turn, so you can easily remind yourself of what it does, or what the output looks like. + One way to keep this code around, but not have it run all the time when you run your script, is to put the code inside a "FALSE block," that might look like so: ```{r} @@ -168,9 +178,10 @@ You would then need to `source()` the files containing the function to be tested This approach does have the drawback that the testing code is not in close proximity to the function to be tested, so you need to work across multiple files when developing and running the test code. 
However, unlike with the `FALSE` blocks, organizing your test code in a separate file makes it easier to re-run a collection of several tests. This approach to organizing becomes more appealing when the validation code involves calls to functions that are themselves stored in separate files or when it involves invoking packages that are not needed for the main calculations involved in the simulation. -It is also closer to how you might work when developing code for an R package, with unit tests that check the correctness of critical functions.[^unit-testing] +It is also closer to how you might work when developing code for an R package, with unit tests that check the correctness of critical functions. +We discuss unit testing further in Section \@ref(unit-testing). +Also see Chapters 13 through 15 of @Wickham2023packages. -[^unit-testing]: For more on unit testings, see Chapters 13 through 15 of @Wickham2023packages. ## Principled directory structures diff --git a/120-parallel-processing.Rmd b/120-parallel-processing.Rmd index 0930d23..cfc2caa 100644 --- a/120-parallel-processing.Rmd +++ b/120-parallel-processing.Rmd @@ -16,30 +16,38 @@ library(furrr) ``` -Especially if you take our advice of "when in doubt, go more general" and if you calculate monte carlo standard errors, you will quickly come up against the limits of your computer. -Simultions can be incredibly computationally intensive, and there are a few means for dealing with that. -The first, touched on at times throughout the book, is to optimize ones code by looking for ways to remove extraneous calculation (e.g., by writing ones own methods rather than using the safety-checking and thus sometimes slower methods in R, or by saving calculations that are shared across different estimation approaches). -The second is to use more computing power. 
+Especially if you take our advice of "when in doubt, go more general" and if you are running enough replicates to get nice and small Monte Carlo errors, you will quickly come up against the computational limits of your computer. +Simulations can be incredibly computationally intensive, and there are a few means for dealing with that. +The first is to optimize code by removing extraneous calculation (e.g., by writing methods from scratch rather than using the safety-checking and thus sometimes slower methods in R, or by saving calculations that are shared across different estimation approaches) to make it run faster. +This approach is usually quite hard, and the benefits often minimal; see Appendix \@ref(optimize-code) for further discussion and examples. +The second is to use more computing power by making the simulation _parallel_. This latter approach is the topic of this chapter. -There are two general ways to do parallel calculation. -The first is to take advantage of the fact that most modern computers have multiple cores (i.e., computers) built in. -With this approach, we tell R to use more of the processing power of your desktop or laptop. +_Parallel computation_ is where you have multiple computers working on a problem at the same time. +Simulations are very easy to do in parallel. +With a multifactor experiment it is very easy to break apart the overall simulation into pieces. +For example, each computer could run one of the scenarios. +Even without a multifactor experiment, due to the cycle of "generate data, then analyze," it is easy to have a bunch of computers doing the same thing, as long as they are generating different datasets, with a final collection step where all the individual iterations are combined into a single set at the end. +In either case, the _total_ work is the same, but since each computer is doing a piece at the same time, the overall simulation will be completed much sooner.
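+To make the scenario-splitting idea concrete, here is a hedged sketch; the factor names, the `run_scenario()` driver, and the file paths are hypothetical, not from this book's case study:

```{r, eval=FALSE}
# Hypothetical sketch: one scenario per worker, gathered at the end.
library( tidyr )

sim_factors <- expand_grid( n = c( 50, 200 ),
                            J = c( 10, 30 ),
                            sigma2_u = c( 0.1, 0.4 ) )

# Worker k runs row k of the grid and saves its own result file:
# res <- run_scenario( sim_factors[ k, ] )   # hypothetical driver function
# saveRDS( res, paste0( "results/scenario_", k, ".rds" ) )

# Final collection step, once all workers finish:
# all_res <- purrr::map_dfr( list.files( "results", full.names = TRUE ),
#                            readRDS )
```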
+ +There are three general ways to do parallel calculation. +The first is to take advantage of the fact that most modern computers come with multiple cores (i.e., computers) built in. +For this simplest approach, we simply tell R to use more of this processing power. If your computer has eight cores, you can easily get a near eight-fold increase in the speed of your simulation. -The second is to use cloud computing, or compute on a cluster. -A computing cluster is a network of hundreds or thousands of computers, coupled with commands where you break apart a simulation into pieces and send the pieces to your army of computers. -Conceptually, this is the same as when you do baby parallel on your desktop: more cores equals more simulations per minute and thus faster simulation overall. +The second is to use cloud computing, putting your simulation on a virtual computer that might have 50 or more cores. +Other than having to handle moving files back and forth from your machine to the cloud, this is not much different from the first approach: you are still just telling R to use more cores, but now you have more cores to use. +Even if the speed-up itself is not massive, this approach also takes the computing off your computer entirely, making it easier to set up a job to run for days or weeks without making your day-to-day life any more difficult. + +The third way is to use a computing cluster. +A cluster is a network of hundreds or thousands of computers, coupled with commands for breaking apart a simulation into pieces and sending the pieces to that vast army. +Conceptually, this approach is the same as when you do baby parallel on your desktop: more cores equals more simulations per minute and thus faster simulation overall. But the interface to a cluster can be a bit tricky, and very cluster-dependent.
-First, it takes the computing off your computer entirely, making it easier to set up a job to run for days or weeks without making your day to day life any more difficult. -Second, it gives you hundreds of cores, potentially, which means a speed-up of hundreds rather than four or eight. +But once you get a cluster up and running, it can be a very powerful tool. +Not only is it off your personal computer, but a cluster can also potentially give you hundreds of cores, which means a speed-up of hundreds rather than four or eight. + -Simulations are a very natural choice for parallel computation. -With a multifactor experiment it is very easy to break apart the overall into pieces. -For example, you might send each factor combination to a single machine. -Even without multi factor experiments, due to the cycle of "generate data, then analyze," it is easy to have a bunch of computers doing the same thing, with a final collection step where all the individual iterations are combined into one at the end. ## Parallel on your computer @@ -94,8 +102,8 @@ tictoc::tic() Note the `.options = furrr_options(seed = NULL)` part of the argument. This is to silence some warnings. -Given how tasks are handed out, R will get upset if you don't do some handholding regarding how it should set seeds for pseudoranom number generation. -In particular, if you don't set the seed, the multiple sessions could end up having the same starting seed and thus run the exact same simulations (in principle). +Given how tasks are handed out, R will get upset if you do not do some handholding regarding how it should set seeds for pseudorandom number generation. +In particular, if you do not set the seed, the multiple sessions could end up having the same starting seed and thus run the exact same simulations (in principle). We have seen before how to set a specific seed for each simulation scenario, but `furrr` doesn't know we have done this.
This is the purpose of the extra argument about seeds: it makes explicit that we are handling seed setting on our own. @@ -138,26 +146,26 @@ In general, a "cluster" is a system of computers that are connected up to form a These clusters will have some overlaying coordinating programs that you, the user, will interact with to set up a "job," or set of jobs, which is a set of tasks you want some number of the computers on the cluster to do for you in tandem. These coordinating programs will differ, depending on what cluster you are using, but have some similarities that bear mention. -For running simulations, you only need the smallest amount of knowledge about how to engage with these systems because you don't need all the individual computers working on your project communicating with each other (which is the hard part of distributed computing, in general). +For running simulations, you only need the smallest amount of knowledge about how to engage with these systems because you do not need all the individual computers working on your project communicating with each other (which is the hard part of distributed computing, in general). -### What is a command-line interface? +### Command-line interfaces and CMD mode in R In the good ol' days, when things were simpler, yet more difficult, you would interact with your computer via a "command-line interface." The easiest way to think about this is as an R console, but in a different language that the entire computer speaks. -A command line interface is designed to do things like find files with a specific name, or copy entire directories, or, importantly, start different programs. -Another place you may have used a command line inteface is when working with Git: anything fancy with Git is often done via command-line. +A command line interface is designed to do things such as find files with a specific name, or copy entire directories, or, importantly, start different programs.
+One place you may have used a command line interface is when working with Git: anything fancy with Git is often done via command-line. People will talk about a "shell" (a generic term for this computer interface) or "bash" or "csh." You can get access to a shell from within RStudio by clicking on the "Terminal" tab. -Try it, if you've never done anything like this before, and type +Try it, if you have never done anything like this before, and type (the ">" is not part of what you type): ``` -ls +> ls ``` It should list some file names. Note this command does _not_ have the parentheses after the command, like in R or most other programming languages. -The syntax of a shell is usually mystifying and brutal: it is best to just steal scripts from the internet and try not to think about it too much, unless you want to think about it a lot. +The syntax of a shell is usually mystifying and brutal: it is best to just steal scripts from the internet or some generative AI and try not to think about it too much, unless you want to think about it a lot. Importantly for us, from the command line interface you can start an R program, telling it to start up and run a script for you. This way of running R is noninteractive: you say "go do this thing," and R starts up, goes and does it, and then quits. @@ -176,11 +184,9 @@ Sys.sleep(30) cat( "Finished\n" ) ``` -Then open the terminal and type (the ">" is not part of what you type): -``` -> ls -``` -Do you see your `dumb_job.R` file? If not, your terminal session is in the wrong directory. +After you save, open the terminal and type `ls` to list the files, as above. +Do you see your `dumb_job.R` file? +If not, your terminal session is in the wrong directory. In your computer system, files are stored in a directory structure, and when you open a terminal, you are somewhere in that structure.
To find out where, you can type @@ -198,69 +204,93 @@ Once you are in the directory with your file, type: > R CMD BATCH dumb_job.R R_output.txt --no-save ``` -The above command says "Run R" (the first part) in batch mode (the "CMD BATCH" part), meaning source the `dumb_job.R` script as soon as R starts, saving all console output in the file `R_output.txt` (it will be saved in the current directory where you run the program), and where you don't save the workspace when finished. +The above command says "Run R" (the first part) in batch mode (the "CMD BATCH" part), meaning source the `dumb_job.R` script as soon as R starts, saving all console output in the file `R_output.txt` (it will be saved in the current directory where you run the program), and telling it not to save the workspace when finished. This command should take about a minute to complete, because our script sleeps a lot (the sleep represents your script doing a lot of work, like a real simulation would do). Once the command completes (you will see your ">" prompt come back), verify that you have the `R_output.txt` and the data file `sim_results.csv` by typing `ls`. If you open up your Finder or Microsoft equivalent, you can actually see the `R_output.txt` file appear half-way through, while your job is running. -If you open it, you will see the usual header of R telling you what it loading, the "Making numbers" comment, and so forth. -R is saving everything as it works through your script. +If you open it, you will see the usual header of R loading up, the "Making numbers" comment, and so forth. +R is saving all the output as it works through your script. -Running R in this fashion is the key element to a basic way of setting up a massive job on the cluster: you will have a bunch of R programs all "going and doing something" on different computers in the cluster.
-They will all save their results to files (they will have files of different names, or you will not be happy with the end result) and then you will gather these files together to get your final set of results. +Running R in this fashion is the key element of a basic way of setting up a massive job on the cluster: you launch a bunch of R programs that are each running a script like this on different computers in the cluster. +They will all save their results to files (they will have files of different names so you do not overwrite your work) and then you will gather these files together to get your final set of results. *Small Exercise:* Try putting an error in your `dumb_job.R` script. What happens when you run it in batch mode? -### Running a job on a cluster +### Task dispatchers and command line scripts -In the above, you can run a command on the command-line, and the command line interfact will pause while it runs. +In the above, you can run a command on the command line, but the command line interface will pause while it runs. As you saw, when you hit return with the above R command, the program just sat there for a minute before you got your command-line prompt back, due to the sleep. -When you properlly run a big job (program) on a cluster, it doesn't quite work that way. -You will instead set a program to run, but tell the cluster to run it somewhere else (people might say "run in the background"). -This is good because you get your command-line prompt back, and can do other things, while the program runs in the background. +When you properly run a big job (program) on a cluster, it does not quite work that way. +You will instead set a program to run, but tell the cluster to run it somewhere else (people might say "run it in the background"). +This is good because you get your command-line prompt back, and can use it to tell your computer to do other things, all while the program is running.
-There are various methods for doing this, but they usually boil down to a request from you to some sort of managerial process that takes requests and assigns some computer, somewhere, to do them. -(Imagine a dispatcher at a taxi company. You call up, ask for a ride, and it sends you a taxi to do it. The dispatcher is just fielding requests, assinging them to taxis.) +There are various methods for starting a job, but they usually boil down to a request from you to some sort of managerial process (a dispatcher) that takes requests and assigns some computer, somewhere, to do them. +For a sense of this process, imagine a dispatcher at a taxi company. +You call up, ask for a ride, and the dispatcher sends you a taxi to do it. +The dispatcher is just fielding requests, assigning them to taxis. -For example, one dispatcher is the slurm (which may or may not be on the cluster you are attempting to use; this is where a lot of this information gets very cluster-specific). +One dispatcher on some clusters is called `slurm` (which may or may not be on the cluster you are attempting to use; this is where a lot of this information gets very cluster-specific). You first set up a script that describes the job to be run. It is like a work request.
-This would be a plain text file, such as this example (`sbatch_runScript.txt`): +These scripts are plain text files, such as this example (`sbatch_runScript.txt`): ``` #!/bin/bash -#SBATCH -n 32 # Number of cores requested -#SBATCH -N 1 # Ensure that all cores are on one machine -#SBATCH -t 480 # Runtime in minutes -#SBATCH -p stats # Partition to submit to -#SBATCH --mem-per-cpu=1000 # Memory per cpu in MB -#SBATCH --open-mode=append # Append to output file, don't truncate -#SBATCH -o /output/directory/out/%j.out # Standard out goes to this file -#SBATCH -e /output/directory/out/%j.err # Standard err goes to this file -#SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL -#SBATCH --mail-user=email@gmail.com # Email address - -# You might have some special loading of modules in the computing environment + +# Number of cores requested +#SBATCH -n 32 + +# Ensure that all cores are on one machine +#SBATCH -N 1 + +# Upper bound on runtime in minutes +#SBATCH -t 480 + +# Partition to submit to +#SBATCH -p stats + +# Memory per cpu in MB +#SBATCH --mem-per-cpu=1000 + +# Append to output file, don't truncate +#SBATCH --open-mode=append + +# Standard out goes to this file +#SBATCH -o /output/directory/out/%j.out + +# Standard err goes to this file +#SBATCH -e /output/directory/out/%j.err + +# Email notification options: BEGIN, END, FAIL, ALL +#SBATCH --mail-type=ALL +#SBATCH --mail-user=email@gmail.com + +# Load modules for the computing environment source new-modules.sh module load gcc/7.1.0-fasrc01 module load R + +# Set user R library path export R_LIBS_USER=$HOME/apps/R:$R_LIBS_USER -#R file to run, and txt files to produce for output and errors -R CMD BATCH estimator_performance_simulation.R logs/R_output_${INDEX_VAR}.txt --no-save --no-restore +# Run R script; write output to indexed log file +R CMD BATCH estimator_performance_simulation.R \ + logs/R_output_${INDEX_VAR}.txt \ + --no-save --no-restore ``` -This file starts with a bunch of
variables that describe how sbatch should handle the request. -It then has a series of commands that get the computer environment ready. +This file starts by setting a bunch of variables that describe how sbatch should handle the request. +It then has a series of commands that get the computer environment ready. Finally, it has the `R CMD BATCH` command that does the work you want. These scripts can be quite confusing to understand. There are so many options! What do these things even do? The answer is, for researchers early on their journey to do this kind of work, "Who knows?" -The general rule is to find an example file for the system you are working on that works, and then modify it for your own purposes. +The general rule is to find a working example file for the system you are on, and then modify it for your own purposes. Once you have such a file, you could run it on the command line, like this: @@ -271,21 +301,28 @@ sbatch -o stdout.txt \ ``` You do this, and it will _not_ sit there and wait for the job to be done. -The `sbatch` command will instead send the job off to some computer which will do the work in parallel. +The `sbatch` command will instead send the job off to some computer which will do the work in the background. -Interestingly, your R script could, at this point, do the "one computer" parallel type code listed above. -Note the script above has 32 cores; your single job could then have 32 cores all working away on their individual pieces of the simulation, as before (e.g., with `future_pmap`). +Interestingly, your R script could, at this point, do the "one computer" parallel type code listed above and use `future_pmap()` or similar. +Note the script above asks for 32 cores; your single job could then have 32 cores all working away on their individual pieces of the simulation, as before. You would have a 32-fold speedup, in this case. This is the core element to having your simulation run on a cluster.
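+As a hedged sketch of what that within-job parallelism might look like (the scenario grid and `run_one_scenario()` below are illustrative stand-ins, not code from this book):

```{r, eval=FALSE}
# Illustrative sketch: use all cores allocated to this single cluster job.
library( future )
library( furrr )
library( tidyr )

# availableCores() should pick up the job's allocation (e.g., the 32
# cores requested via "#SBATCH -n 32" above).
plan( multisession, workers = availableCores() )

sim_factors <- expand_grid( n = c( 50, 200 ), J = c( 10, 30 ) )

# A stand-in for your actual simulation driver:
run_one_scenario <- function( n, J ) {
  tibble::tibble( n = n, J = J, est = mean( rnorm( n * J ) ) )
}

# Here seed = TRUE asks furrr for parallel-safe seeds, since this sketch
# does not manage seeds by hand.
res <- future_pmap( sim_factors, run_one_scenario,
                    .options = furrr_options( seed = TRUE ) )
```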
The next step is to do this _a lot_, sending off a bunch of these jobs to different computers. +### Moving files to and from a cluster -Some final tips +One of the larger issues with working on a cluster is that you will need to get all your code _to_ the cluster and get all of your results _back_. +One approach for getting the code to the cluster is to use GitHub. +Make a project, and then check out the project on the cluster. +This has the advantage of making it easier to develop your code on your local computer, where you can easily run your RStudio or other IDE. +It also has the advantage of allowing you to record which GitHub version of code you ran, so you can know exactly which code produced which results. - * Remember to save a workspace or RDS!! Once you tell Odyssey to run an R file, it, well, runs the R file. But, you probably want information after it's done - like an R object or even an R workspace. For any R file you want to run on Odyssey, remember at the end of the R file to put a command to save something after everything else is done. If you want to save a bunch of R objects, an R workspace might be a good way to go, but those files can be huge. A lot of times I find myself wanting only one or two R objects, and RDS files are a lot smaller. +For getting your results off the cluster, it is not recommended to use GitHub, as the total of all your results may take many gigabytes of space. +They are also not really in the spirit of a version control system. +Instead, use an scp client such as FileZilla to transfer files back to your personal computer. - * Moving files from a cluster to your computer. You will need to first upload your files and code to the cluster, and then, once you've saved your workspace/RDS, you need those back on your computer. Using a scp client such as FileZilla is an easy way to do this file-transfer stuff.
You can also use a Git repo for the code, but checking in the simulation results is not generally advisable: they are big, and not really in the spirit of a verson control system. Download your simulation results outside of Git, and keep your code in Git, is a good rule of thumb. +Overall, a good rule of thumb is to download your simulation results outside of Git and keep your code in Git. ### Checking on a job @@ -337,7 +374,7 @@ This multiple dispatching of sbatch commands is the final component for large simulations. Asking for multiple, smaller jobs is also nicer for the cluster than having one giant job that goes on for a long time. By dividing a job into smaller pieces, and asking the scheduler to schedule those pieces, you can let the scheduler share and allocate resources between you and others more fairly. It can make a list of your jobs, and farm them out as it has space. -This might go faster for you; with a really big job, the scheduler can't even allocate it until the needed number of workers is available. +This might go faster for you; with a really big job, the scheduler cannot even allocate it until the needed number of workers is available. With smaller jobs, you can take a lot of little spaces to get your work done. Especially since simulation is so independent (just doing the same thing over and over) there is rarely any need for one giant process that has to do everything. @@ -375,7 +412,7 @@ done ``` One question is then how do the different processes know what part of the simulation they should be working on? -E.g., each worker needs to have its own seed so it don't do exactly the same simulation as a different worker! +E.g., each worker needs to have its own seed so it does not do exactly the same simulation as a different worker! The workers also need their own filenames so they save things in their own files. The key is the `export INDEX_VAR` line: this puts a variable in the environment that will be set to a specific number.
Inside your R script, you can get that index like so: @@ -402,7 +439,7 @@ This still doesn't exactly answer how to have each worker know what to work on. Consider the case of our multifactor experiment, where we have a large combination of simulation trials we want to run. There are two approaches one might use here. -One simple approach is the following: we first generate all the factors with `expand_grid()` as usual, and then we take the row of this grid that corresponds to our index. +One simple approach is the following: generate all the factors with `expand_grid()` as usual, and then take the row of this grid that corresponds to our index. ```{r, eval=FALSE} sim_factors = expand_grid( ... ) diff --git a/130-debugging_and_testing.Rmd b/130-debugging_and_testing.Rmd index ec043ca..ba1a2de 100644 --- a/130-debugging_and_testing.Rmd +++ b/130-debugging_and_testing.Rmd @@ -21,7 +21,7 @@ There are some tools, however, that can mitigate that to some extent. When you follow modular design, you will often have methods you wrote calling other methods you wrote, which call even more methods you wrote. When you get an error, you might not be sure where to look, or what is happening. -A simple method (a method often reviled by professional programmers, but which is still useful for more ordinary folks) for debugging is to use `print()` statements in your code. +A simple debugging method (a method often reviled by professional programmers, but which is still useful for more ordinary folks) is to use `print()` statements in your code. For example, consider the following code: ```{r demo_print, eval=FALSE} @@ -32,7 +32,8 @@ For example, consider the following code: Here we are printing out a value that we suspect we might want to check. -There are a few methods for printing to the console: `print()`, which takes any object and prints it out. You can print a dataframe, variable, or a string: +There are a few methods for printing to the console.
+The first is `print()`, which takes any object and prints it out. You can print a dataframe, variable, or a string:
```{r, eval=FALSE}
print( paste0( "My var is ", my_var ) )
print( my_tibble )
```
@@ -50,7 +51,8 @@ cli::cli_alert("My var is {my_var}")
The problem with printing is it is easy to have a lot of printout statements, and then when you run your code you get a wall of text.
For simulations, it is easy to print so much that it will meaningfully slow your simulation down!
-You can use `print()` statements to help you figure out what is going on, but it is often better to use a more interactive debugging tool, such as the `browser()` function or `stopifnot()` statements, which we will discuss next.
+You can use `print()` statements to help you figure out what is going on, but it is often better to use a more interactive debugging tool, such as the `browser()` function or `stopifnot()` statements.
+We discuss these next.

@@ -64,16 +66,16 @@ Consider the following code taken from a simulation:
}
```
-The `browser()` command stops your code and puts you in an interactive console where you can look at different objects and see what is happening.
+The `browser()` command stops your code and puts you in an interactive console where you can look at different objects in your workspace and see what is happening.
Having it triggered when something bad happens (in this case when a set of estimates has an unexpected NA) can help untangle what is driving a rare event.

The interactive console allows you to look at the current state of the code, and you can type in commands to see what is going on.
It is just like a normal R workspace, but if you look at the Environment, you will only see what the code has available at the time `browser()` was called.
If you are inside a function, for example, you will only see the things passed to the function, and the variables the function has made.
-This can be very important to, for example, check what values were passed to your function--many bugs are due to the wrong thing getting passed to some code that would otherwise work.
+The browser is particularly useful for checking, for example, what values were passed to your function--many bugs are due to the wrong thing getting passed to some code that would otherwise work.

-Once in a browser, you can say `q` to quit out.
+Once in a browser, you can type `q` to quit out.
You can also type `n` to go to the next line of code.
This allows you to walk through the code step by step, seeing what happens as you move along.
Much of the time, RStudio will even jump to the part of your script where you paused, so you can see the code that will be run with each step.
@@ -89,11 +91,11 @@ debug( gen_dat )
run_simulation( some_parameters )
```
-Now, when `run_simulation()` eventually calls `gen_dat` the script will stop, and you can see exactly what was passed to `gen_dat` and also then walk through `gen_dat` line by line to see what is going on.
+Now, when `run_simulation()` eventually calls `gen_dat`, the script will stop, and you can see exactly what was passed to `gen_dat` and then walk through `gen_dat` line by line to see what it does.

-## Protecting functions with `stop()` {#about-stopifnot}
+## Protecting functions with `stopifnot()` {#about-stopifnot}

When writing functions, especially those that take a lot of parameters, it is often wise to include `stopifnot()` statements at the top to verify the function is getting what it expects.
These are sometimes called "assert statements" and are a tool for making errors show up as early as possible.
@@ -112,11 +114,12 @@
make_groups( c(100,200,300,400), c(1,100,10000) )
```
What is nasty about this possible error is nothing is telling you that something is wrong!
-You could build an entire simulation on this, not realizing that your fourth group has the variance of your first, and get results that make no sense to you.
-You could even publish something based on a finding that depends on this error, which would eventually be quite embarrasing.
+You could build an entire simulation on this, not realizing that your fourth group has the variance of your first, and spend a lot of time wrestling with results that make no sense to you.
+You could even publish something based on a finding that depends on this error, which would eventually be quite embarrassing.
+Even if you know something is wrong with your simulation, it might take some time and effort to track down the origin of the problem, as nothing would be alerting you to the error.

-If this function was used in our data generating code, we might eventually see some warning that something is off, but this would still not tell us where things went off the rails.
-We can instead protect our function by putting in an *assert statement* using `stopifnot()`:
+To avoid this kind of hardship, we can instead use *assert statements* to verify that the arguments to our function are as they should be, and, if they are not, stop the function in its tracks.
+We implement assert statements via the `stopifnot()` method:
```{r}
make_groups <- function( means, sds ) {
stopifnot( length(means) == length(sds) )
@@ -125,15 +128,14 @@ make_groups <- function( means, sds ) {
}
```
-Now we get this:
+Now, if we call our function incorrectly, we get this:
```{r, error=TRUE}
make_groups( c(100,200,300,400), c(1,100,10000) )
```
The `stopifnot()` command ensures your code is getting called as you intended.
-
-These statements can also serve as a sort of documentation as to what you expect.
+Assert statements can also serve as a type of documentation as to what you expect.
Consider, for example:
```{r}
make_xy <- function( N, mu_x, mu_y, rho ) {
@@ -143,19 +145,23 @@
tibble(X = X, Y=Y)
}
```
-Here we see that rho should be between -1 and 1 quite clearly.
-A good reminder of what the parameter is for.
+Here we see that rho should be between -1 and 1.
+The assert serves as a good reminder of what the parameter is for.

-This also protects you from inadvetently misremembering the order of your parameters when you call the function (although it is good practice to name your parameters as you pass).
+Assert statements also protect you from inadvertently misremembering the order of your parameters when you call your function (although it is good practice to name your parameters as you pass).
Consider:
```{r, error=TRUE}
a <- make_xy( 10, 2, 3, 0.75 )
b <- make_xy( 10, 0.75, 2, 3 )
+```
+
+That said, we should usually do this:
+```{r}
c <- make_xy( 10, rho = 0.75, mu_x = 2, mu_y = 3 )
```

-## Testing code
+## Testing code {#unit-testing}

Testing your code is a good way to ensure that it does what you expect.
We have seen some demonstration of testing code early on, such as when we made plots of our simulated data to see if it looked like we expected.
@@ -167,9 +173,9 @@ It can also be hard to track down the ripple effects of changing a low-level met
This is why people developed "unit testing," an approach to testing where you write code that you can just run whenever you want, code which runs a series of tests on your code and prints out which tests work as expected, and which do not.
In R, the most common way of doing this is the `testthat` package.
-There are two general aspects to `testthat` that can be useful, the `expect_*()` methods and the `test_that()` function.
+There are two general parts to `testthat`: the `expect_*()` methods and the `test_that()` function.
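For instance, a minimal self-contained test could look like the following (the `add_one()` helper is a toy function invented here for illustration, not part of the case study code):

```{r}
library( testthat )

# Toy function to test
add_one <- function( x ) {
  x + 1
}

test_that( "add_one increments its input", {
  expect_equal( add_one( 1 ), 2 )       # checks a specific value
  expect_length( add_one( 1:3 ), 3 )    # checks vectorized behavior
} )
```

Each `expect_*()` call checks one fact about the code, and `test_that()` bundles related checks under a label so that a failure report points you to the right place.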
-Consider the following simple DGP to generate an X and Y variable that have a given relationship: +Consider the following simple DGP to generate an X and Y variable with a given relationship: ```{r} my_DGP <- function( N, mu, beta ) { stopifnot( N > 0, beta <= 1 ) @@ -228,17 +234,18 @@ test_file(here::here( "code/demo_test_file.R" ) ) ``` The `test_file()` method will then give an overall printout of all the tests made, and list which passed, which gave warnings, and which were skipped. -You can also run the test from inside RStudio: at the top-right you should see "Run Tests"--if you click on it, it will start an entirely new work session, and source the file to test it. +You can also run the test from inside RStudio: when in a unit test file, you should see "Run Tests" at the top-right of the script pane--if you click on it, RStudio will start an entirely new work session, and source your file to test it. This is the best way to use these testing files. -This stand-alone work session approach means it is important to make the test file stand-alone: the file should source the code you want to test, and load any needed libraries, before running the testing code. +Tests are run "from scratch" and this stand-alone work session approach means you have to load any needed libraries before running the testing code. +If your testing code is testing your core code (stored in the `R/` folder of your project, see Chapter \@ref(organizing-a-simulation-project) for further discussion), you will have to source the relevant files at the top of your test file. -You can finally make an entire directory of these testing files, and run them all at once with `test_dir()`. -The usual way to store the files is in a `tests/testthat/` directory inside your project. +For a full testing work flow, put all the testing files in a single directory, and run them all at once with `test_dir()`. +The usual place to store testing files is in a `tests/testthat/` directory inside your project. 
You can then have a `tests/testthat.R` file that runs `test_dir()` on the `tests/testthat/` directory. -The `testthat` package is designed to allow for including unit testing in an R package, but we are repurposing it for general projects here. +The `testthat` package was designed for unit testing new R packages, but we are repurposing it for general projects here. -Once you have your unit testing all set up, you can work on your project, and then run the unit tests to see if you broke anything. +Once you have your unit testing all set up, you can work on your project, and then easily run all the unit tests to see if you broke anything. Even more important, if you are working with a collaborator, you can both run unit tests to ensure you have not broken something that someone else was counting on! Furthermore, you can use the test code as a reference for how the code should be used, and what the expected output is. For any reasonably complex project, having test code can be of enormous benefit. diff --git a/140-simulation-for-power-analysis.Rmd b/140-simulation-for-power-analysis.Rmd index 8961ad0..81675d0 100644 --- a/140-simulation-for-power-analysis.Rmd +++ b/140-simulation-for-power-analysis.Rmd @@ -52,10 +52,14 @@ We had pilot data from school administrative records (in particular discipline r We assume our experimental sample will be on schools that have chronic issues with discipline, so we filtered our historic data to get schools we imagined to likely be in our study. -We ended up with the following data, with log-transformed discipline rates for each year (we did this to put things on a multiplicative scale, and to make our data more normal given heavy skew in the original). Each row is a potential school in the district. 
+We ended up with the following data, with log-transformed discipline rates for each year (we used the log transform to put things on a multiplicative scale, and to make the distribution of rates more normal given a heavy skew in the original).
+Each row is a potential school in the district.

-```{r, messages=FALSE, warnings=FALSE}
+```{r, message=FALSE, warning=FALSE, include=FALSE}
datW = read_csv( "data/discipline_data.csv" )
+```
+
+```{r}
datW
```
@@ -65,7 +69,7 @@
lpd_mns = apply( datW[,-1], 2, mean )
lpd_mns
lpd_cov = cov( datW[,-1] )
-lpd_cov
+round( lpd_cov, digits=2 )
```

## The data generating process
diff --git a/150-potential-outcomes-framework.Rmd b/150-potential-outcomes-framework.Rmd
index 10c715f..7e85aad 100644
--- a/150-potential-outcomes-framework.Rmd
+++ b/150-potential-outcomes-framework.Rmd
@@ -338,4 +338,51 @@ runs %>% mutate( inflate = 100 * hat / true )
```
+## Calibrated simulations within the potential outcomes framework
+
+In the area of causal inference, the potential outcomes framework provides a natural path for generating calibrated simulations.
+Also see Chapter \@ref(calibrated-simulations) for more discussion of calibration at a high level.
+
+Under the calibration framework, we would take an existing randomized experiment or observational study and then impute all the missing potential outcomes under some specific scheme.
+This fully defines the sample of interest, and thus any target parameters, such as a measure of heterogeneity, are then fully known.
+For our simulation, we then synthetically, and repeatedly, randomize and "observe" outcomes to be analyzed with the methods we are testing.
+We could also resample from our dataset to generate datasets of different sizes, or to have a superpopulation target as our estimand.
+
+The key feature here is the imputation step: how do we build our full set of covariates and outcomes?
+There are a variety of options that have different pros and cons.
+
+### Matching-based imputation
+
+One baseline method is to generate a matched-pairs dataset by, for each unit, finding a close match given all the demographic and other covariate information of the sample.
+The match of each unit then provides that unit's imputed counterfactual potential outcome.
+By doing this (with replacement) for all units, we can generate a fully imputed dataset which we then use as our population, with all outcomes being "real," as they are taken from actual data.
+Such matching can mostly preserve complex relationships between covariates and outcomes in a manner that is not model dependent.
+In particular, if outcomes tend to be coarsely defined (e.g., on an integer scale) or have specific clumps (such as zero-inflation or rounding), this structure will be preserved.
+
+One concern with the matching approach is that the noise in the matching could dilute the structure of the treatment effect, as the control- and treatment-side potential outcomes may be only weakly related, creating a lot of so-called idiosyncratic treatment variation (unit-to-unit variation in the treatment effects that is not explained by the covariates).
+This is akin to measurement error diluting estimated relationships in linear models.
+
+We could reduce such variation by first imputing missing outcomes using some model (e.g., a random forest) fit to the original data, and then matching on all units including the imputed potential outcome as a hidden "covariate."
+This is not a data analysis strategy, but instead a method of generating synthetic data that both has a given structure of interest and also remains faithful to the idiosyncrasies of an actual dataset.
+
+### Model-based imputation
+
+A second approach for imputing the missing potential outcomes is to specify a treatment effect model, predict treatment effects for all units, and use the predicted effects to impute the treatment potential outcome for all control units.
+For example, if we fit $\hat{\tau}(X)$ to the original data, where $\hat{\tau}(X)$ estimates the treatment effect for a unit with covariate value $X$, we can then impute $Y_i(1) = Y_i(0) + \hat{\tau}(X_i)$ for all control units and $Y_i(0) = Y_i(1) - \hat{\tau}(X_i)$ for all treated units.
+This imputation perfectly preserves the complex structure between the covariates and the $Y_i(0)$s for the original control units, and between the covariates and the $Y_i(1)$s for the original treated units.
+
+One drawback is the imputed outcomes may not be values we would actually see on the counterfactual side.
+For example, if the original outcomes are binary but $\hat{\tau}(X)$ estimates fractional effects, then the imputed outcomes will not be binary, and could even be negative or above 1.
+
+A second possible drawback is this form of imputation does not give any idiosyncratic treatment variation.
+To add in idiosyncratic variation, we would need to generate a distribution of perturbations and add these to the imputed outcomes, just as we would an error term in a regression model.
+
+One advantage of the modeling approach is it allows for varying the amount of treatment effect.
+We can, for example, scale $\hat{\tau}(X)$ by some factor to increase or decrease the size of the treatment effects in our imputed dataset.
+
+### Using the imputed data
+
+Regardless of how we generate the missing potential outcomes, once we have a "fully observed" sample with the full set of treatment and control potential outcomes for all of our units, we can calculate any target estimands we like on our population, and then compare our estimators to these ground truths (even if they have no parametric analog) as desired.
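As a rough sketch of this overall workflow (everything here is invented for illustration, including the constant-effect stand-in for a fitted $\hat{\tau}(X)$ and the toy original study; it is not the case study's code):

```{r}
set.seed( 2024 )

# A toy "original study": covariate X, treatment Z, observed outcome Y
N <- 200
X <- rnorm( N )
Z <- rbinom( N, 1, 0.5 )
Y <- 1 + 0.5 * X + 0.3 * Z + rnorm( N )

tau_hat <- function( X ) 0.3   # stand-in for a fitted treatment-effect model

# Model-based imputation of the missing potential outcomes,
# defining the fully observed "population"
Y0 <- ifelse( Z == 1, Y - tau_hat( X ), Y )
Y1 <- ifelse( Z == 1, Y, Y + tau_hat( X ) )
true_ATE <- mean( Y1 - Y0 )    # target estimand, now fully known

# Repeatedly re-randomize, "observe" outcomes, and estimate
one_run <- function() {
  Znew <- sample( Z )                        # synthetic randomization
  Yobs <- ifelse( Znew == 1, Y1, Y0 )        # observe one potential outcome
  mean( Yobs[ Znew == 1 ] ) - mean( Yobs[ Znew == 0 ] )
}
ests <- replicate( 1000, one_run() )
mean( ests ) - true_ATE    # Monte Carlo estimate of estimator bias
```

Because the full schedule of potential outcomes is fixed before the loop, the only randomness across runs is the synthetic treatment assignment, which is exactly what makes the ground truth comparison valid.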
+ + diff --git a/200-coding-tidbits.Rmd b/200-coding-tidbits.Rmd index 27ccda7..2e6fcd2 100644 --- a/200-coding-tidbits.Rmd +++ b/200-coding-tidbits.Rmd @@ -221,7 +221,7 @@ For example, we can time how long it takes to analyze some data from our cluster library( bench ) dat <- gen_cluster_RCT(n_bar=100, J = 100) timings <- mark( - MLM = quiet_analysis_MLM(dat), + MLM = analysis_MLM_safe(dat), OLS = analysis_OLS(dat), agg = analysis_agg(dat), check = FALSE @@ -246,7 +246,7 @@ timings <- bench::press( { dat <- gen_cluster_RCT(n_bar=n_bar, J = J) bench::mark( - MLM = quiet_analysis_MLM(dat), + MLM = analysis_MLM_safe(dat), OLS = analysis_OLS(dat), agg = analysis_agg(dat), check = FALSE @@ -278,7 +278,7 @@ For example, you might have the following: profvis( { replicate( 10, { data <- gen_cluster_RCT( n_bar = 100, J = 100 ) - res1 <- quiet_analysis_MLM(data) + res1 <- analysis_MLM_safe(data) res2 <- analysis_OLS(data) res3 <- analysis_agg(data) } ) diff --git a/Designing-Simulations-in-R.toc b/Designing-Simulations-in-R.toc index fc6e815..60cb53a 100644 --- a/Designing-Simulations-in-R.toc +++ b/Designing-Simulations-in-R.toc @@ -1,294 +1,314 @@ -\contentsline {chapter}{Welcome}{7}{chapter*.2}% -\contentsline {section}{License}{8}{section*.3}% -\contentsline {section}{About the authors}{8}{section*.4}% -\contentsline {section}{Acknowledgements}{9}{section*.5}% -\contentsline {part}{I\hspace {1em}An Introductory Look}{11}{part.1}% -\contentsline {chapter}{\numberline {1}Introduction}{13}{chapter.1}% -\contentsline {section}{\numberline {1.1}Some of simulation's many uses}{14}{section.1.1}% -\contentsline {subsection}{\numberline {1.1.1}Comparing statistical approaches}{15}{subsection.1.1.1}% -\contentsline {subsection}{\numberline {1.1.2}Assessing performance of complex pipelines}{15}{subsection.1.1.2}% -\contentsline {subsection}{\numberline {1.1.3}Assessing performance under misspecification}{16}{subsection.1.1.3}% -\contentsline {subsection}{\numberline 
{1.1.4}Assessing the finite-sample performance of a statistical approach}{16}{subsection.1.1.4}% -\contentsline {subsection}{\numberline {1.1.5}Conducting Power Analyses}{17}{subsection.1.1.5}% -\contentsline {subsection}{\numberline {1.1.6}Simulating processess}{18}{subsection.1.1.6}% -\contentsline {section}{\numberline {1.2}The perils of simulation as evidence}{19}{section.1.2}% -\contentsline {section}{\numberline {1.3}Simulating to learn}{21}{section.1.3}% -\contentsline {section}{\numberline {1.4}Why R?}{21}{section.1.4}% -\contentsline {section}{\numberline {1.5}Organization of the text}{23}{section.1.5}% -\contentsline {chapter}{\numberline {2}Programming Preliminaries}{25}{chapter.2}% -\contentsline {section}{\numberline {2.1}Welcome to the tidyverse}{25}{section.2.1}% -\contentsline {section}{\numberline {2.2}Functions}{26}{section.2.2}% -\contentsline {subsection}{\numberline {2.2.1}Rolling your own}{26}{subsection.2.2.1}% -\contentsline {subsection}{\numberline {2.2.2}A dangerous function}{27}{subsection.2.2.2}% -\contentsline {subsection}{\numberline {2.2.3}Using Named Arguments}{30}{subsection.2.2.3}% -\contentsline {subsection}{\numberline {2.2.4}Argument Defaults}{31}{subsection.2.2.4}% -\contentsline {subsection}{\numberline {2.2.5}Function skeletons}{32}{subsection.2.2.5}% -\contentsline {section}{\numberline {2.3}\texttt {\textbackslash {}\textgreater {}} (Pipe) dreams}{32}{section.2.3}% -\contentsline {section}{\numberline {2.4}Recipes versus Patterns}{33}{section.2.4}% -\contentsline {section}{\numberline {2.5}Exercises}{34}{section.2.5}% -\contentsline {chapter}{\numberline {3}An initial simulation}{37}{chapter.3}% -\contentsline {section}{\numberline {3.1}Simulating a single scenario}{39}{section.3.1}% -\contentsline {section}{\numberline {3.2}A non-normal population distribution}{41}{section.3.2}% -\contentsline {section}{\numberline {3.3}Simulating across different scenarios}{42}{section.3.3}% -\contentsline {section}{\numberline 
{3.4}Extending the simulation design}{45}{section.3.4}% -\contentsline {section}{\numberline {3.5}Exercises}{45}{section.3.5}% -\contentsline {part}{II\hspace {1em}Structure and Mechanics of a Simulation Study}{49}{part.2}% -\contentsline {chapter}{\numberline {4}Structure of a simulation study}{51}{chapter.4}% -\contentsline {section}{\numberline {4.1}General structure of a simulation}{51}{section.4.1}% -\contentsline {section}{\numberline {4.2}Tidy, modular simulations}{53}{section.4.2}% -\contentsline {section}{\numberline {4.3}Skeleton of a simulation study}{54}{section.4.3}% -\contentsline {subsection}{\numberline {4.3.1}Data-Generating Process}{56}{subsection.4.3.1}% -\contentsline {subsection}{\numberline {4.3.2}Data Analysis Procedure}{56}{subsection.4.3.2}% -\contentsline {subsection}{\numberline {4.3.3}Repetition}{57}{subsection.4.3.3}% -\contentsline {subsection}{\numberline {4.3.4}Performance summaries}{58}{subsection.4.3.4}% -\contentsline {subsection}{\numberline {4.3.5}Multifactor simulations}{59}{subsection.4.3.5}% -\contentsline {section}{\numberline {4.4}Exercises}{60}{section.4.4}% -\contentsline {chapter}{\numberline {5}Case Study: Heteroskedastic ANOVA and Welch}{61}{chapter.5}% -\contentsline {section}{\numberline {5.1}The data-generating model}{63}{section.5.1}% -\contentsline {subsection}{\numberline {5.1.1}Now make a function}{65}{subsection.5.1.1}% -\contentsline {subsection}{\numberline {5.1.2}Cautious coding}{67}{subsection.5.1.2}% -\contentsline {section}{\numberline {5.2}The hypothesis testing procedures}{67}{section.5.2}% -\contentsline {section}{\numberline {5.3}Running the simulation}{69}{section.5.3}% -\contentsline {section}{\numberline {5.4}Summarizing test performance}{70}{section.5.4}% -\contentsline {section}{\numberline {5.5}Exercises}{71}{section.5.5}% -\contentsline {subsection}{\numberline {5.5.1}Other \(\alpha \)'s}{71}{subsection.5.5.1}% -\contentsline {subsection}{\numberline {5.5.2}Compare 
results}{72}{subsection.5.5.2}% -\contentsline {subsection}{\numberline {5.5.3}Power}{72}{subsection.5.5.3}% -\contentsline {subsection}{\numberline {5.5.4}Wide or long?}{72}{subsection.5.5.4}% -\contentsline {subsection}{\numberline {5.5.5}Other tests}{73}{subsection.5.5.5}% -\contentsline {subsection}{\numberline {5.5.6}Methodological extensions}{73}{subsection.5.5.6}% -\contentsline {subsection}{\numberline {5.5.7}Power analysis}{73}{subsection.5.5.7}% -\contentsline {chapter}{\numberline {6}Data-generating processes}{75}{chapter.6}% -\contentsline {section}{\numberline {6.1}Examples}{75}{section.6.1}% -\contentsline {subsection}{\numberline {6.1.1}Example 1: One-way analysis of variance}{76}{subsection.6.1.1}% -\contentsline {subsection}{\numberline {6.1.2}Example 2: Bivariate Poisson model}{76}{subsection.6.1.2}% -\contentsline {subsection}{\numberline {6.1.3}Example 3: Hierarchical linear model for a cluster-randomized trial}{76}{subsection.6.1.3}% -\contentsline {section}{\numberline {6.2}Components of a DGP}{77}{section.6.2}% -\contentsline {section}{\numberline {6.3}A statistical model is a recipe for data generation}{80}{section.6.3}% -\contentsline {section}{\numberline {6.4}Plot the artificial data}{82}{section.6.4}% -\contentsline {section}{\numberline {6.5}Check the data-generating function}{83}{section.6.5}% -\contentsline {section}{\numberline {6.6}Example: Simulating clustered data}{85}{section.6.6}% -\contentsline {subsection}{\numberline {6.6.1}A design decision: What do we want to manipulate?}{85}{subsection.6.6.1}% -\contentsline {subsection}{\numberline {6.6.2}A model for a cluster RCT}{86}{subsection.6.6.2}% -\contentsline {subsection}{\numberline {6.6.3}From equations to code}{89}{subsection.6.6.3}% -\contentsline {subsection}{\numberline {6.6.4}Standardization in the DGP}{91}{subsection.6.6.4}% -\contentsline {section}{\numberline {6.7}Sometimes a DGP is all you need}{93}{section.6.7}% -\contentsline {section}{\numberline {6.8}More to 
explore}{98}{section.6.8}% -\contentsline {section}{\numberline {6.9}Exercises}{98}{section.6.9}% -\contentsline {subsection}{\numberline {6.9.1}The Welch test on a shifted-and-scaled \(t\) distribution}{98}{subsection.6.9.1}% -\contentsline {subsection}{\numberline {6.9.2}Plot the bivariate Poisson}{99}{subsection.6.9.2}% -\contentsline {subsection}{\numberline {6.9.3}Check the bivariate Poisson function}{99}{subsection.6.9.3}% -\contentsline {subsection}{\numberline {6.9.4}Add error-catching to the bivariate Poisson function}{100}{subsection.6.9.4}% -\contentsline {subsection}{\numberline {6.9.5}A bivariate negative binomial distribution}{100}{subsection.6.9.5}% -\contentsline {subsection}{\numberline {6.9.6}Another bivariate negative binomial distribution}{101}{subsection.6.9.6}% -\contentsline {subsection}{\numberline {6.9.7}Plot the data from a cluster-randomized trial}{102}{subsection.6.9.7}% -\contentsline {subsection}{\numberline {6.9.8}Checking the Cluster RCT DGP}{102}{subsection.6.9.8}% -\contentsline {subsection}{\numberline {6.9.9}More school-level variation}{102}{subsection.6.9.9}% -\contentsline {subsection}{\numberline {6.9.10}Cluster-randomized trial with baseline predictors}{102}{subsection.6.9.10}% -\contentsline {subsection}{\numberline {6.9.11}3-parameter IRT datasets}{103}{subsection.6.9.11}% -\contentsline {subsection}{\numberline {6.9.12}Check the 3-parameter IRT DGP}{104}{subsection.6.9.12}% -\contentsline {subsection}{\numberline {6.9.13}Explore the 3-parameter IRT model}{104}{subsection.6.9.13}% -\contentsline {subsection}{\numberline {6.9.14}Random effects meta-regression}{104}{subsection.6.9.14}% -\contentsline {subsection}{\numberline {6.9.15}Meta-regression with selective reporting}{105}{subsection.6.9.15}% -\contentsline {chapter}{\numberline {7}Estimation procedures}{107}{chapter.7}% -\contentsline {section}{\numberline {7.1}Writing estimation functions}{108}{section.7.1}% -\contentsline {section}{\numberline {7.2}Including Multiple 
Data Analysis Procedures}{110}{section.7.2}% -\contentsline {section}{\numberline {7.3}Validating an Estimation Function}{114}{section.7.3}% -\contentsline {subsection}{\numberline {7.3.1}Checking against existing implementations}{114}{subsection.7.3.1}% -\contentsline {subsection}{\numberline {7.3.2}Checking novel procedures}{116}{subsection.7.3.2}% -\contentsline {subsection}{\numberline {7.3.3}Checking with simulations}{119}{subsection.7.3.3}% -\contentsline {section}{\numberline {7.4}Handling errors, warnings, and other hiccups}{120}{section.7.4}% -\contentsline {subsection}{\numberline {7.4.1}Capturing errors and warnings}{121}{subsection.7.4.1}% -\contentsline {subsection}{\numberline {7.4.2}Adapting estimators for errors and warnings}{126}{subsection.7.4.2}% -\contentsline {subsection}{\numberline {7.4.3}The \texttt {safely()} option}{128}{subsection.7.4.3}% -\contentsline {section}{\numberline {7.5}Exercises}{131}{section.7.5}% -\contentsline {subsection}{\numberline {7.5.1}More Heteroskedastic ANOVA}{131}{subsection.7.5.1}% -\contentsline {subsection}{\numberline {7.5.2}Contingent testing}{132}{subsection.7.5.2}% -\contentsline {subsection}{\numberline {7.5.3}Check the cluster-RCT functions}{132}{subsection.7.5.3}% -\contentsline {subsection}{\numberline {7.5.4}Extending the cluster-RCT functions}{133}{subsection.7.5.4}% -\contentsline {subsection}{\numberline {7.5.5}Contingent estimator processing}{133}{subsection.7.5.5}% -\contentsline {subsection}{\numberline {7.5.6}Estimating 3-parameter item response theory models}{134}{subsection.7.5.6}% -\contentsline {subsection}{\numberline {7.5.7}Meta-regression with selective reporting}{134}{subsection.7.5.7}% -\contentsline {chapter}{\numberline {8}Running the Simulation Process}{137}{chapter.8}% -\contentsline {section}{\numberline {8.1}Repeating oneself}{137}{section.8.1}% -\contentsline {section}{\numberline {8.2}One run at a time}{138}{section.8.2}% -\contentsline {subsection}{\numberline 
-\contentsline {subsection}{\numberline {8.2.1}Reparameterizing}{141}{subsection.8.2.1}%
-\contentsline {section}{\numberline {8.3}Bundling simulations with \texttt {simhelpers}}{142}{section.8.3}%
-\contentsline {section}{\numberline {8.4}Seeds and pseudo-random number generators}{143}{section.8.4}%
-\contentsline {section}{\numberline {8.5}Exercises}{146}{section.8.5}%
-\contentsline {subsection}{\numberline {8.5.1}Welch simulations}{146}{subsection.8.5.1}%
-\contentsline {subsection}{\numberline {8.5.2}Compare sampling distributions of Pearson's correlation coefficients}{146}{subsection.8.5.2}%
-\contentsline {subsection}{\numberline {8.5.3}Reparameterization, redux}{147}{subsection.8.5.3}%
-\contentsline {subsection}{\numberline {8.5.4}Fancy clustered RCT simulations}{147}{subsection.8.5.4}%
-\contentsline {chapter}{\numberline {9}Performance Measures}{149}{chapter.9}%
-\contentsline {section}{\numberline {9.1}Measures for Point Estimators}{151}{section.9.1}%
-\contentsline {subsection}{\numberline {9.1.1}Comparing the Performance of the Cluster RCT Estimation Procedures}{153}{subsection.9.1.1}%
-\contentsline {subsubsection}{Are the estimators biased?}{154}{section*.12}%
-\contentsline {subsubsection}{Which method has the smallest standard error?}{154}{section*.13}%
-\contentsline {subsubsection}{Which method has the smallest Root Mean Squared Error?}{155}{section*.14}%
-\contentsline {subsection}{\numberline {9.1.2}Less Conventional Performance Measures}{156}{subsection.9.1.2}%
-\contentsline {section}{\numberline {9.2}Measures for Variance Estimators}{158}{section.9.2}%
-\contentsline {subsection}{\numberline {9.2.1}Satterthwaite degrees of freedom}{160}{subsection.9.2.1}%
-\contentsline {subsection}{\numberline {9.2.2}Assessing SEs for the Cluster RCT Simulation}{161}{subsection.9.2.2}%
-\contentsline {section}{\numberline {9.3}Measures for Confidence Intervals}{162}{section.9.3}%
-\contentsline {subsection}{\numberline {9.3.1}Confidence Intervals in the Cluster RCT Simulation}{163}{subsection.9.3.1}%
-\contentsline {section}{\numberline {9.4}Measures for Inferential Procedures (Hypothesis Tests)}{164}{section.9.4}%
-\contentsline {subsection}{\numberline {9.4.1}Validity}{165}{subsection.9.4.1}%
-\contentsline {subsection}{\numberline {9.4.2}Power}{165}{subsection.9.4.2}%
-\contentsline {subsection}{\numberline {9.4.3}Rejection Rates}{166}{subsection.9.4.3}%
-\contentsline {subsection}{\numberline {9.4.4}Inference in the Cluster RCT Simulation}{167}{subsection.9.4.4}%
-\contentsline {section}{\numberline {9.5}Relative or Absolute Measures?}{168}{section.9.5}%
-\contentsline {subsection}{\numberline {9.5.1}Performance relative to a benchmark estimator}{170}{subsection.9.5.1}%
-\contentsline {section}{\numberline {9.6}Estimands Not Represented By a Parameter}{171}{section.9.6}%
-\contentsline {section}{\numberline {9.7}Uncertainty in Performance Estimates (the Monte Carlo Standard Error)}{174}{section.9.7}%
-\contentsline {subsection}{\numberline {9.7.1}Conventional measures for point estimators}{174}{subsection.9.7.1}%
-\contentsline {subsection}{\numberline {9.7.2}Less conventional measures for point estimators}{176}{subsection.9.7.2}%
-\contentsline {subsection}{\numberline {9.7.3}MCSE for Relative Variance Estimators}{177}{subsection.9.7.3}%
-\contentsline {subsection}{\numberline {9.7.4}MCSE for Confidence Intervals and Hypothesis Tests}{178}{subsection.9.7.4}%
-\contentsline {subsection}{\numberline {9.7.5}Calculating MCSEs With the \texttt {simhelpers} Package}{179}{subsection.9.7.5}%
-\contentsline {subsection}{\numberline {9.7.6}MCSE Calculation in our Cluster RCT Example}{180}{subsection.9.7.6}%
-\contentsline {section}{\numberline {9.8}Summary of Peformance Measures}{182}{section.9.8}%
-\contentsline {section}{\numberline {9.9}Concluding thoughts}{183}{section.9.9}%
-\contentsline {section}{\numberline {9.10}Exercises}{183}{section.9.10}%
-\contentsline {subsection}{\numberline {9.10.1}Brown and Forsythe (1974) results}{183}{subsection.9.10.1}%
-\contentsline {subsection}{\numberline {9.10.2}Size-adjusted power}{184}{subsection.9.10.2}%
-\contentsline {subsection}{\numberline {9.10.3}Three correlation estimators}{184}{subsection.9.10.3}%
-\contentsline {subsection}{\numberline {9.10.4}Confidence interval comparison}{186}{subsection.9.10.4}%
-\contentsline {subsection}{\numberline {9.10.5}Jackknife calculation of MCSEs for RMSE}{186}{subsection.9.10.5}%
-\contentsline {subsection}{\numberline {9.10.6}Jackknife calculation of MCSEs for RMSE ratios}{187}{subsection.9.10.6}%
-\contentsline {subsection}{\numberline {9.10.7}Distribution theory for person-level average treatment effects}{187}{subsection.9.10.7}%
-\contentsline {part}{III\hspace {1em}Systematic Simulations}{189}{part.3}%
-\contentsline {chapter}{\numberline {10}Simulating across multiple scenarios}{191}{chapter.10}%
-\contentsline {section}{\numberline {10.1}Simulating across levels of a single factor}{192}{section.10.1}%
-\contentsline {subsection}{\numberline {10.1.1}A performance summary function}{195}{subsection.10.1.1}%
-\contentsline {subsection}{\numberline {10.1.2}Adding performance calculations to the simulation driver}{196}{subsection.10.1.2}%
-\contentsline {section}{\numberline {10.2}Simulating across multiple factors}{198}{section.10.2}%
-\contentsline {section}{\numberline {10.3}Using pmap to run multifactor simulations}{200}{section.10.3}%
-\contentsline {section}{\numberline {10.4}When to calculate performance metrics}{204}{section.10.4}%
-\contentsline {subsection}{\numberline {10.4.1}Aggregate as you simulate (inside)}{204}{subsection.10.4.1}%
-\contentsline {subsection}{\numberline {10.4.2}Keep all simulation runs (outside)}{204}{subsection.10.4.2}%
-\contentsline {subsection}{\numberline {10.4.3}Getting raw results ready for analysis}{206}{subsection.10.4.3}%
-\contentsline {section}{\numberline {10.5}Summary}{208}{section.10.5}%
-\contentsline {section}{\numberline {10.6}Exercises}{209}{section.10.6}%
-\contentsline {subsection}{\numberline {10.6.1}Extending Brown and Forsythe}{209}{subsection.10.6.1}%
-\contentsline {subsection}{\numberline {10.6.2}Comparing the trimmed mean, median and mean}{209}{subsection.10.6.2}%
-\contentsline {subsection}{\numberline {10.6.3}Estimating latent correlations}{210}{subsection.10.6.3}%
-\contentsline {subsection}{\numberline {10.6.4}Meta-regression}{211}{subsection.10.6.4}%
-\contentsline {subsection}{\numberline {10.6.5}Examine a multifactor simulation design}{211}{subsection.10.6.5}%
-\contentsline {chapter}{\numberline {11}Designing multifactor simulations}{213}{chapter.11}%
-\contentsline {section}{\numberline {11.1}Choosing parameter combinations}{214}{section.11.1}%
-\contentsline {section}{\numberline {11.2}Case Study: A multifactor evaluation of cluster RCT estimators}{216}{section.11.2}%
-\contentsline {subsection}{\numberline {11.2.1}Choosing parameters for the Clustered RCT}{216}{subsection.11.2.1}%
-\contentsline {subsection}{\numberline {11.2.2}Redundant factor combinations}{218}{subsection.11.2.2}%
-\contentsline {subsection}{\numberline {11.2.3}Running the simulations}{218}{subsection.11.2.3}%
-\contentsline {subsection}{\numberline {11.2.4}Calculating performance metrics}{219}{subsection.11.2.4}%
-\contentsline {chapter}{\numberline {12}Exploring and presenting simulation results}{221}{chapter.12}%
-\contentsline {section}{\numberline {12.1}Tabulation}{222}{section.12.1}%
-\contentsline {subsection}{\numberline {12.1.1}Example: estimators of treatment variation}{224}{subsection.12.1.1}%
-\contentsline {section}{\numberline {12.2}Visualization}{225}{section.12.2}%
-\contentsline {subsection}{\numberline {12.2.1}Example 0: RMSE in Cluster RCTs}{226}{subsection.12.2.1}%
-\contentsline {subsection}{\numberline {12.2.2}Example 1: Biserial correlation estimation}{227}{subsection.12.2.2}%
-\contentsline {subsection}{\numberline {12.2.3}Example 2: Variance estimation and Meta-regression}{227}{subsection.12.2.3}%
-\contentsline {subsection}{\numberline {12.2.4}Example 3: Heat maps of coverage}{228}{subsection.12.2.4}%
-\contentsline {subsection}{\numberline {12.2.5}Example 4: Relative performance of treatment effect estimators}{229}{subsection.12.2.5}%
-\contentsline {section}{\numberline {12.3}Modeling}{231}{section.12.3}%
-\contentsline {subsection}{\numberline {12.3.1}Example 1: Biserial, revisited}{232}{subsection.12.3.1}%
-\contentsline {subsection}{\numberline {12.3.2}Example 2: Comparing methods for cross-classified data}{233}{subsection.12.3.2}%
-\contentsline {section}{\numberline {12.4}Reporting}{234}{section.12.4}%
-\contentsline {chapter}{\numberline {13}Building good visualizations}{237}{chapter.13}%
-\contentsline {section}{\numberline {13.1}Subsetting and Many Small Multiples}{238}{section.13.1}%
-\contentsline {section}{\numberline {13.2}Bundling}{241}{section.13.2}%
-\contentsline {section}{\numberline {13.3}Aggregation}{245}{section.13.3}%
-\contentsline {subsubsection}{\numberline {13.3.0.1}Some notes on how to aggregate}{247}{subsubsection.13.3.0.1}%
-\contentsline {section}{\numberline {13.4}Comparing true SEs with standardization}{248}{section.13.4}%
-\contentsline {section}{\numberline {13.5}The Bias-SE-RMSE plot}{253}{section.13.5}%
-\contentsline {section}{\numberline {13.6}Assessing the quality of the estimated SEs}{255}{section.13.6}%
-\contentsline {subsection}{\numberline {13.6.1}Stability of estimated SEs}{257}{subsection.13.6.1}%
-\contentsline {section}{\numberline {13.7}Assessing confidence intervals}{258}{section.13.7}%
-\contentsline {section}{\numberline {13.8}Exercises}{260}{section.13.8}%
-\contentsline {subsection}{\numberline {13.8.1}Assessing uncertainty}{261}{subsection.13.8.1}%
-\contentsline {subsection}{\numberline {13.8.2}Assessing power}{261}{subsection.13.8.2}%
-\contentsline {subsection}{\numberline {13.8.3}Going deeper with coverage}{262}{subsection.13.8.3}%
-\contentsline {subsection}{\numberline {13.8.4}Pearson correlations with a bivariate Poisson distribution}{262}{subsection.13.8.4}%
-\contentsline {subsection}{\numberline {13.8.5}Making another plot for assessing SEs}{262}{subsection.13.8.5}%
-\contentsline {chapter}{\numberline {14}Special Topics on Reporting Simulation Results}{263}{chapter.14}%
-\contentsline {section}{\numberline {14.1}Using regression to analyze simulation results}{263}{section.14.1}%
-\contentsline {subsection}{\numberline {14.1.1}Example 1: Biserial, revisited}{263}{subsection.14.1.1}%
-\contentsline {subsection}{\numberline {14.1.2}Example 2: Cluster RCT example, revisited}{266}{subsection.14.1.2}%
-\contentsline {subsubsection}{\numberline {14.1.2.1}Using LASSO to simplify the model}{267}{subsubsection.14.1.2.1}%
-\contentsline {section}{\numberline {14.2}Using regression trees to find important factors}{272}{section.14.2}%
-\contentsline {section}{\numberline {14.3}Analyzing results with few iterations per scenario}{274}{section.14.3}%
-\contentsline {subsection}{\numberline {14.3.1}Example: ClusterRCT with only 100 replicates per scenario}{275}{subsection.14.3.1}%
-\contentsline {section}{\numberline {14.4}What to do with warnings in simulations}{281}{section.14.4}%
-\contentsline {chapter}{\numberline {15}Case study: Comparing different estimators}{285}{chapter.15}%
-\contentsline {section}{\numberline {15.1}Bias-variance tradeoffs}{288}{section.15.1}%
-\contentsline {chapter}{\numberline {16}Simulations as evidence}{293}{chapter.16}%
-\contentsline {section}{\numberline {16.1}Strategies for making relevant simulations}{294}{section.16.1}%
-\contentsline {subsection}{\numberline {16.1.1}Break symmetries and regularities}{294}{subsection.16.1.1}%
-\contentsline {subsection}{\numberline {16.1.2}Make your simulation general with an extensive multi-factor experiment}{295}{subsection.16.1.2}%
-\contentsline {subsection}{\numberline {16.1.3}Use previously published simulations to beat them at their own game}{295}{subsection.16.1.3}%
-\contentsline {subsection}{\numberline {16.1.4}Calibrate simulation factors to real data}{295}{subsection.16.1.4}%
-\contentsline {subsection}{\numberline {16.1.5}Use real data to obtain directly}{295}{subsection.16.1.5}%
-\contentsline {subsection}{\numberline {16.1.6}Fully calibrated simulations}{296}{subsection.16.1.6}%
-\contentsline {part}{IV\hspace {1em}Computational Considerations}{299}{part.4}%
-\contentsline {chapter}{\numberline {17}Organizing a simulation project}{301}{chapter.17}%
-\contentsline {section}{\numberline {17.1}Simulation project structure}{301}{section.17.1}%
-\contentsline {section}{\numberline {17.2}Well structured code files}{302}{section.17.2}%
-\contentsline {subsection}{\numberline {17.2.1}Putting headers in your .R file}{303}{subsection.17.2.1}%
-\contentsline {subsection}{\numberline {17.2.2}The source command}{303}{subsection.17.2.2}%
-\contentsline {subsection}{\numberline {17.2.3}Storing testing code}{305}{subsection.17.2.3}%
-\contentsline {section}{\numberline {17.3}Principled directory structures}{306}{section.17.3}%
-\contentsline {section}{\numberline {17.4}Saving simulation results}{307}{section.17.4}%
-\contentsline {subsection}{\numberline {17.4.1}File formats}{307}{subsection.17.4.1}%
-\contentsline {subsection}{\numberline {17.4.2}Saving simulations as you go}{308}{subsection.17.4.2}%
-\contentsline {chapter}{\numberline {18}Parallel Processing}{313}{chapter.18}%
-\contentsline {section}{\numberline {18.1}Parallel on your computer}{314}{section.18.1}%
-\contentsline {section}{\numberline {18.2}Parallel on a virtual machine}{315}{section.18.2}%
-\contentsline {section}{\numberline {18.3}Parallel on a cluster}{315}{section.18.3}%
-\contentsline {subsection}{\numberline {18.3.1}What is a command-line interface?}{316}{subsection.18.3.1}%
-\contentsline {subsection}{\numberline {18.3.2}Running a job on a cluster}{318}{subsection.18.3.2}%
-\contentsline {subsection}{\numberline {18.3.3}Checking on a job}{320}{subsection.18.3.3}%
-\contentsline {subsection}{\numberline {18.3.4}Running lots of jobs on a cluster}{320}{subsection.18.3.4}%
-\contentsline {subsection}{\numberline {18.3.5}Resources for Harvard's Odyssey}{323}{subsection.18.3.5}%
-\contentsline {subsection}{\numberline {18.3.6}Acknowledgements}{323}{subsection.18.3.6}%
-\contentsline {chapter}{\numberline {19}Debugging and Testing}{325}{chapter.19}%
-\contentsline {section}{\numberline {19.1}Debugging with \texttt {print()}}{325}{section.19.1}%
-\contentsline {section}{\numberline {19.2}Debugging with \texttt {browser()}}{326}{section.19.2}%
-\contentsline {section}{\numberline {19.3}Debugging with \texttt {debug()}}{327}{section.19.3}%
-\contentsline {section}{\numberline {19.4}Protecting functions with \texttt {stop()}}{327}{section.19.4}%
-\contentsline {section}{\numberline {19.5}Testing code}{328}{section.19.5}%
-\contentsline {part}{V\hspace {1em}Complex Data Structures}{333}{part.5}%
-\contentsline {chapter}{\numberline {20}Using simulation as a power calculator}{335}{chapter.20}%
-\contentsline {section}{\numberline {20.1}Getting design parameters from pilot data}{336}{section.20.1}%
-\contentsline {section}{\numberline {20.2}The data generating process}{337}{section.20.2}%
-\contentsline {section}{\numberline {20.3}Running the simulation}{341}{section.20.3}%
-\contentsline {section}{\numberline {20.4}Evaluating power}{342}{section.20.4}%
-\contentsline {subsection}{\numberline {20.4.1}Checking validity of our models}{342}{subsection.20.4.1}%
-\contentsline {subsection}{\numberline {20.4.2}Assessing Precision (SE)}{344}{subsection.20.4.2}%
-\contentsline {subsection}{\numberline {20.4.3}Assessing power}{345}{subsection.20.4.3}%
-\contentsline {subsection}{\numberline {20.4.4}Assessing Minimum Detectable Effects}{346}{subsection.20.4.4}%
-\contentsline {section}{\numberline {20.5}Power for Multilevel Data}{347}{section.20.5}%
-\contentsline {chapter}{\numberline {21}Simulation under the Potential Outcomes Framework}{351}{chapter.21}%
-\contentsline {section}{\numberline {21.1}Finite vs.\nobreakspace {}Superpopulation inference}{352}{section.21.1}%
-\contentsline {section}{\numberline {21.2}Data generation processes for potential outcomes}{352}{section.21.2}%
-\contentsline {section}{\numberline {21.3}Finite sample performance measures}{355}{section.21.3}%
-\contentsline {section}{\numberline {21.4}Nested finite simulation procedure}{358}{section.21.4}%
-\contentsline {chapter}{\numberline {22}The Parametric bootstrap}{363}{chapter.22}%
-\contentsline {section}{\numberline {22.1}Air conditioners: a stolen case study}{364}{section.22.1}%
-\contentsline {chapter}{\numberline {A}Coding Reference}{367}{appendix.A}%
-\contentsline {section}{\numberline {A.1}How to repeat yourself}{367}{section.A.1}%
-\contentsline {subsection}{\numberline {A.1.1}Using \texttt {replicate()}}{367}{subsection.A.1.1}%
-\contentsline {subsection}{\numberline {A.1.2}Using \texttt {map()}}{368}{subsection.A.1.2}%
-\contentsline {subsection}{\numberline {A.1.3}map with no inputs}{370}{subsection.A.1.3}%
-\contentsline {subsection}{\numberline {A.1.4}Other approaches for repetition}{370}{subsection.A.1.4}%
-\contentsline {section}{\numberline {A.2}Default arguments for functions}{371}{section.A.2}%
-\contentsline {section}{\numberline {A.3}Profiling Code}{372}{section.A.3}%
-\contentsline {subsection}{\numberline {A.3.1}Using \texttt {Sys.time()} and \texttt {system.time()}}{372}{subsection.A.3.1}%
-\contentsline {subsection}{\numberline {A.3.2}The \texttt {tictoc} package}{373}{subsection.A.3.2}%
-\contentsline {subsection}{\numberline {A.3.3}The \texttt {bench} package}{373}{subsection.A.3.3}%
-\contentsline {subsection}{\numberline {A.3.4}Profiling with \texttt {profvis}}{376}{subsection.A.3.4}%
-\contentsline {section}{\numberline {A.4}Optimizing code (and why you often shouldn't)}{376}{section.A.4}%
-\contentsline {subsection}{\numberline {A.4.1}Hand-building functions}{377}{subsection.A.4.1}%
-\contentsline {subsection}{\numberline {A.4.2}Computational efficiency versus simplicity}{378}{subsection.A.4.2}%
-\contentsline {subsection}{\numberline {A.4.3}Reusing code to speed up computation}{379}{subsection.A.4.3}%
-\contentsline {chapter}{\numberline {B}Further readings and resources}{385}{appendix.B}%
+\contentsline {chapter}{Welcome}{9}{chapter*.2}%
+\contentsline {section}{License}{10}{section*.3}%
+\contentsline {section}{About the authors}{10}{section*.4}%
+\contentsline {section}{Acknowledgements}{11}{section*.5}%
+\contentsline {part}{I\hspace {1em}An Introductory Look}{13}{part.1}%
+\contentsline {chapter}{\numberline {1}Introduction}{15}{chapter.1}%
+\contentsline {section}{\numberline {1.1}Some of simulation's many uses}{16}{section.1.1}%
+\contentsline {subsection}{\numberline {1.1.1}Comparing statistical approaches}{17}{subsection.1.1.1}%
+\contentsline {subsection}{\numberline {1.1.2}Assessing performance of complex pipelines}{17}{subsection.1.1.2}%
+\contentsline {subsection}{\numberline {1.1.3}Assessing performance under misspecification}{18}{subsection.1.1.3}%
+\contentsline {subsection}{\numberline {1.1.4}Assessing the finite-sample performance of a statistical approach}{18}{subsection.1.1.4}%
+\contentsline {subsection}{\numberline {1.1.5}Conducting Power Analyses}{19}{subsection.1.1.5}%
+\contentsline {subsection}{\numberline {1.1.6}Simulating processess}{20}{subsection.1.1.6}%
+\contentsline {section}{\numberline {1.2}The perils of simulation as evidence}{21}{section.1.2}%
+\contentsline {section}{\numberline {1.3}Simulating to learn}{23}{section.1.3}%
+\contentsline {section}{\numberline {1.4}Why R?}{23}{section.1.4}%
+\contentsline {section}{\numberline {1.5}Organization of the text}{25}{section.1.5}%
+\contentsline {chapter}{\numberline {2}Programming Preliminaries}{27}{chapter.2}%
+\contentsline {section}{\numberline {2.1}Welcome to the tidyverse}{27}{section.2.1}%
+\contentsline {section}{\numberline {2.2}Functions}{28}{section.2.2}%
+\contentsline {subsection}{\numberline {2.2.1}Rolling your own}{28}{subsection.2.2.1}%
+\contentsline {subsection}{\numberline {2.2.2}A dangerous function}{29}{subsection.2.2.2}%
+\contentsline {subsection}{\numberline {2.2.3}Using Named Arguments}{32}{subsection.2.2.3}%
+\contentsline {subsection}{\numberline {2.2.4}Argument Defaults}{33}{subsection.2.2.4}%
+\contentsline {subsection}{\numberline {2.2.5}Function skeletons}{34}{subsection.2.2.5}%
+\contentsline {section}{\numberline {2.3}\texttt {\textbackslash {}\textgreater {}} (Pipe) dreams}{34}{section.2.3}%
+\contentsline {section}{\numberline {2.4}Recipes versus Patterns}{35}{section.2.4}%
+\contentsline {section}{\numberline {2.5}Exercises}{36}{section.2.5}%
+\contentsline {chapter}{\numberline {3}An initial simulation}{39}{chapter.3}%
+\contentsline {section}{\numberline {3.1}Simulating a single scenario}{41}{section.3.1}%
+\contentsline {section}{\numberline {3.2}A non-normal population distribution}{43}{section.3.2}%
+\contentsline {section}{\numberline {3.3}Simulating across different scenarios}{44}{section.3.3}%
+\contentsline {section}{\numberline {3.4}Extending the simulation design}{47}{section.3.4}%
+\contentsline {section}{\numberline {3.5}Exercises}{47}{section.3.5}%
+\contentsline {part}{II\hspace {1em}Structure and Mechanics of a Simulation Study}{51}{part.2}%
+\contentsline {chapter}{\numberline {4}Structure of a simulation study}{53}{chapter.4}%
+\contentsline {section}{\numberline {4.1}General structure of a simulation}{53}{section.4.1}%
+\contentsline {section}{\numberline {4.2}Tidy, modular simulations}{55}{section.4.2}%
+\contentsline {section}{\numberline {4.3}Skeleton of a simulation study}{56}{section.4.3}%
+\contentsline {subsection}{\numberline {4.3.1}Data-Generating Process}{58}{subsection.4.3.1}%
+\contentsline {subsection}{\numberline {4.3.2}Data Analysis Procedure}{58}{subsection.4.3.2}%
+\contentsline {subsection}{\numberline {4.3.3}Repetition}{59}{subsection.4.3.3}%
+\contentsline {subsection}{\numberline {4.3.4}Performance summaries}{60}{subsection.4.3.4}%
+\contentsline {subsection}{\numberline {4.3.5}Multifactor simulations}{61}{subsection.4.3.5}%
+\contentsline {section}{\numberline {4.4}Exercises}{62}{section.4.4}%
+\contentsline {chapter}{\numberline {5}Case Study: Heteroskedastic ANOVA and Welch}{63}{chapter.5}%
+\contentsline {section}{\numberline {5.1}The data-generating model}{65}{section.5.1}%
+\contentsline {subsection}{\numberline {5.1.1}Now make a function}{67}{subsection.5.1.1}%
+\contentsline {subsection}{\numberline {5.1.2}Cautious coding}{69}{subsection.5.1.2}%
+\contentsline {section}{\numberline {5.2}The hypothesis testing procedures}{69}{section.5.2}%
+\contentsline {section}{\numberline {5.3}Running the simulation}{71}{section.5.3}%
+\contentsline {section}{\numberline {5.4}Summarizing test performance}{72}{section.5.4}%
+\contentsline {section}{\numberline {5.5}Exercises}{73}{section.5.5}%
+\contentsline {subsection}{\numberline {5.5.1}Other \(\alpha \)'s}{73}{subsection.5.5.1}%
+\contentsline {subsection}{\numberline {5.5.2}Compare results}{74}{subsection.5.5.2}%
+\contentsline {subsection}{\numberline {5.5.3}Power}{74}{subsection.5.5.3}%
+\contentsline {subsection}{\numberline {5.5.4}Wide or long?}{74}{subsection.5.5.4}%
+\contentsline {subsection}{\numberline {5.5.5}Other tests}{75}{subsection.5.5.5}%
+\contentsline {subsection}{\numberline {5.5.6}Methodological extensions}{75}{subsection.5.5.6}%
+\contentsline {subsection}{\numberline {5.5.7}Power analysis}{75}{subsection.5.5.7}%
+\contentsline {chapter}{\numberline {6}Data-generating processes}{77}{chapter.6}%
+\contentsline {section}{\numberline {6.1}Examples}{77}{section.6.1}%
+\contentsline {subsection}{\numberline {6.1.1}Example 1: One-way analysis of variance}{78}{subsection.6.1.1}%
+\contentsline {subsection}{\numberline {6.1.2}Example 2: Bivariate Poisson model}{78}{subsection.6.1.2}%
+\contentsline {subsection}{\numberline {6.1.3}Example 3: Hierarchical linear model for a cluster-randomized trial}{78}{subsection.6.1.3}%
+\contentsline {section}{\numberline {6.2}Components of a DGP}{79}{section.6.2}%
+\contentsline {section}{\numberline {6.3}A statistical model is a recipe for data generation}{82}{section.6.3}%
+\contentsline {section}{\numberline {6.4}Plot the artificial data}{84}{section.6.4}%
+\contentsline {section}{\numberline {6.5}Check the data-generating function}{85}{section.6.5}%
+\contentsline {section}{\numberline {6.6}Example: Simulating clustered data}{87}{section.6.6}%
+\contentsline {subsection}{\numberline {6.6.1}A design decision: What do we want to manipulate?}{87}{subsection.6.6.1}%
+\contentsline {subsection}{\numberline {6.6.2}A model for a cluster RCT}{88}{subsection.6.6.2}%
+\contentsline {subsection}{\numberline {6.6.3}From equations to code}{91}{subsection.6.6.3}%
+\contentsline {subsection}{\numberline {6.6.4}Standardization in the DGP}{93}{subsection.6.6.4}%
+\contentsline {section}{\numberline {6.7}Sometimes a DGP is all you need}{95}{section.6.7}%
+\contentsline {section}{\numberline {6.8}More to explore}{100}{section.6.8}%
+\contentsline {section}{\numberline {6.9}Exercises}{100}{section.6.9}%
+\contentsline {subsection}{\numberline {6.9.1}The Welch test on a shifted-and-scaled \(t\) distribution}{100}{subsection.6.9.1}%
+\contentsline {subsection}{\numberline {6.9.2}Plot the bivariate Poisson}{101}{subsection.6.9.2}%
+\contentsline {subsection}{\numberline {6.9.3}Check the bivariate Poisson function}{101}{subsection.6.9.3}%
+\contentsline {subsection}{\numberline {6.9.4}Add error-catching to the bivariate Poisson function}{101}{subsection.6.9.4}%
+\contentsline {subsection}{\numberline {6.9.5}A bivariate negative binomial distribution}{102}{subsection.6.9.5}%
+\contentsline {subsection}{\numberline {6.9.6}Another bivariate negative binomial distribution}{103}{subsection.6.9.6}%
+\contentsline {subsection}{\numberline {6.9.7}Plot the data from a cluster-randomized trial}{104}{subsection.6.9.7}%
+\contentsline {subsection}{\numberline {6.9.8}Checking the Cluster RCT DGP}{104}{subsection.6.9.8}%
+\contentsline {subsection}{\numberline {6.9.9}More school-level variation}{104}{subsection.6.9.9}%
+\contentsline {subsection}{\numberline {6.9.10}Cluster-randomized trial with baseline predictors}{104}{subsection.6.9.10}%
+\contentsline {subsection}{\numberline {6.9.11}3-parameter IRT datasets}{105}{subsection.6.9.11}%
+\contentsline {subsection}{\numberline {6.9.12}Check the 3-parameter IRT DGP}{106}{subsection.6.9.12}%
+\contentsline {subsection}{\numberline {6.9.13}Explore the 3-parameter IRT model}{106}{subsection.6.9.13}%
+\contentsline {subsection}{\numberline {6.9.14}Random effects meta-regression}{106}{subsection.6.9.14}%
+\contentsline {subsection}{\numberline {6.9.15}Meta-regression with selective reporting}{107}{subsection.6.9.15}%
+\contentsline {chapter}{\numberline {7}Estimation procedures}{109}{chapter.7}%
+\contentsline {section}{\numberline {7.1}Writing estimation functions}{110}{section.7.1}%
+\contentsline {section}{\numberline {7.2}Including Multiple Data Analysis Procedures}{112}{section.7.2}%
+\contentsline {section}{\numberline {7.3}Validating an Estimation Function}{116}{section.7.3}%
+\contentsline {subsection}{\numberline {7.3.1}Checking against existing implementations}{116}{subsection.7.3.1}%
+\contentsline {subsection}{\numberline {7.3.2}Checking novel procedures}{118}{subsection.7.3.2}%
+\contentsline {subsection}{\numberline {7.3.3}Checking with simulations}{121}{subsection.7.3.3}%
+\contentsline {section}{\numberline {7.4}Handling errors, warnings, and other hiccups}{122}{section.7.4}%
+\contentsline {subsection}{\numberline {7.4.1}Capturing errors and warnings}{123}{subsection.7.4.1}%
+\contentsline {subsection}{\numberline {7.4.2}Adapting estimators for errors and warnings}{128}{subsection.7.4.2}%
+\contentsline {subsection}{\numberline {7.4.3}The \texttt {safely()} option}{130}{subsection.7.4.3}%
+\contentsline {section}{\numberline {7.5}Exercises}{133}{section.7.5}%
+\contentsline {subsection}{\numberline {7.5.1}More Heteroskedastic ANOVA}{133}{subsection.7.5.1}%
+\contentsline {subsection}{\numberline {7.5.2}Contingent testing}{134}{subsection.7.5.2}%
+\contentsline {subsection}{\numberline {7.5.3}Check the cluster-RCT functions}{135}{subsection.7.5.3}%
+\contentsline {subsection}{\numberline {7.5.4}Extending the cluster-RCT functions}{135}{subsection.7.5.4}%
+\contentsline {subsection}{\numberline {7.5.5}Contingent estimator processing}{136}{subsection.7.5.5}%
+\contentsline {subsection}{\numberline {7.5.6}Estimating 3-parameter item response theory models}{136}{subsection.7.5.6}%
+\contentsline {subsection}{\numberline {7.5.7}Meta-regression with selective reporting}{136}{subsection.7.5.7}%
+\contentsline {chapter}{\numberline {8}Running the Simulation Process}{139}{chapter.8}%
+\contentsline {section}{\numberline {8.1}Repeating oneself}{139}{section.8.1}%
+\contentsline {section}{\numberline {8.2}One run at a time}{140}{section.8.2}%
+\contentsline {section}{\numberline {8.3}Reparameterizing}{143}{section.8.3}%
+\contentsline {section}{\numberline {8.4}Bundling simulations with \texttt {simhelpers}}{144}{section.8.4}%
+\contentsline {section}{\numberline {8.5}Seeds and pseudo-random number generators}{146}{section.8.5}%
+\contentsline {section}{\numberline {8.6}Conclusions}{149}{section.8.6}%
+\contentsline {section}{\numberline {8.7}Exercises}{150}{section.8.7}%
+\contentsline {subsection}{\numberline {8.7.1}Welch simulations}{150}{subsection.8.7.1}%
+\contentsline {subsection}{\numberline {8.7.2}Compare sampling distributions of Pearson's correlation coefficients}{150}{subsection.8.7.2}%
+\contentsline {subsection}{\numberline {8.7.3}Reparameterization, redux}{150}{subsection.8.7.3}%
+\contentsline {subsection}{\numberline {8.7.4}Fancy clustered RCT simulations}{150}{subsection.8.7.4}%
+\contentsline {chapter}{\numberline {9}Performance Measures}{153}{chapter.9}%
+\contentsline {section}{\numberline {9.1}Measures for Point Estimators}{155}{section.9.1}%
+\contentsline {subsection}{\numberline {9.1.1}Comparing the Performance of the Cluster RCT Estimation Procedures}{158}{subsection.9.1.1}%
+\contentsline {subsubsection}{Are the estimators biased?}{158}{section*.12}%
+\contentsline {subsubsection}{Which method has the smallest standard error?}{159}{section*.13}%
+\contentsline {subsubsection}{Which method has the smallest Root Mean Squared Error?}{160}{section*.14}%
+\contentsline {subsection}{\numberline {9.1.2}Robust Performance Measures}{160}{subsection.9.1.2}%
+\contentsline {section}{\numberline {9.2}Measures for Variance Estimators}{163}{section.9.2}%
+\contentsline {subsection}{\numberline {9.2.1}Calibration}{163}{subsection.9.2.1}%
+\contentsline {subsection}{\numberline {9.2.2}Stability}{164}{subsection.9.2.2}%
+\contentsline {subsection}{\numberline {9.2.3}Assessing SEs for the Cluster RCT Simulation}{166}{subsection.9.2.3}%
+\contentsline {section}{\numberline {9.3}Measures for Confidence Intervals}{167}{section.9.3}%
+\contentsline {subsection}{\numberline {9.3.1}Confidence Intervals in the Cluster RCT Simulation}{168}{subsection.9.3.1}%
+\contentsline {section}{\numberline {9.4}Measures for Inferential Procedures (Hypothesis Tests)}{170}{section.9.4}%
+\contentsline {subsection}{\numberline {9.4.1}Validity}{170}{subsection.9.4.1}%
+\contentsline {subsection}{\numberline {9.4.2}Power}{171}{subsection.9.4.2}%
+\contentsline {subsection}{\numberline {9.4.3}Rejection Rates}{172}{subsection.9.4.3}%
+\contentsline {subsection}{\numberline {9.4.4}Inference in the Cluster RCT Simulation}{172}{subsection.9.4.4}%
+\contentsline {section}{\numberline {9.5}Relative or Absolute Measures?}{174}{section.9.5}%
+\contentsline {subsection}{\numberline {9.5.1}Performance relative to the target parameter}{174}{subsection.9.5.1}%
+\contentsline {subsection}{\numberline {9.5.2}Performance relative to a benchmark estimator}{176}{subsection.9.5.2}%
+\contentsline {section}{\numberline {9.6}Estimands Not Represented By a Parameter}{177}{section.9.6}%
+\contentsline {section}{\numberline {9.7}Uncertainty in Performance Estimates (the Monte Carlo Standard Error)}{180}{section.9.7}%
+\contentsline {subsection}{\numberline {9.7.1}Conventional measures for point estimators}{180}{subsection.9.7.1}%
+\contentsline {subsection}{\numberline {9.7.2}MCSEs for robust measures for point estimators}{182}{subsection.9.7.2}%
+\contentsline {subsection}{\numberline {9.7.3}MCSE for Confidence Intervals and Hypothesis Tests}{183}{subsection.9.7.3}%
+\contentsline {subsection}{\numberline {9.7.4}Using the Jackknife to obtain MCSEs (for relative variance estimators)}{183}{subsection.9.7.4}%
+\contentsline {subsection}{\numberline {9.7.5}Using bootstrap for MCSEs for relative performance measures}{185}{subsection.9.7.5}%
+\contentsline {section}{\numberline {9.8}Performance Calculations with the \texttt {simhelpers} Package}{187}{section.9.8}%
+\contentsline {subsection}{\numberline {9.8.1}Quick Performance Calculations for our Cluster RCT Example}{189}{subsection.9.8.1}%
+\contentsline {section}{\numberline {9.9}Concluding thoughts}{190}{section.9.9}%
+\contentsline {subsection}{\numberline {9.9.1}Summary of Peformance Measures}{191}{subsection.9.9.1}%
+\contentsline {section}{\numberline {9.10}Exercises}{192}{section.9.10}%
+\contentsline {subsection}{\numberline {9.10.1}Brown and Forsythe (1974) results}{192}{subsection.9.10.1}%
+\contentsline {subsection}{\numberline {9.10.2}Size-adjusted power}{192}{subsection.9.10.2}%
+\contentsline {subsection}{\numberline {9.10.3}Three correlation estimators}{193}{subsection.9.10.3}%
+\contentsline {subsection}{\numberline {9.10.4}Confidence interval comparison}{194}{subsection.9.10.4}%
+\contentsline {subsection}{\numberline {9.10.5}Jackknife calculation of MCSEs for RMSE}{195}{subsection.9.10.5}%
+\contentsline {subsection}{\numberline {9.10.6}Jackknife calculation of MCSEs for RMSE ratios}{195}{subsection.9.10.6}%
+\contentsline {subsection}{\numberline {9.10.7}Distribution theory for person-level average treatment effects}{195}{subsection.9.10.7}%
+\contentsline {part}{III\hspace {1em}Systematic Simulations}{197}{part.3}%
+\contentsline {chapter}{\numberline {10}Simulating across multiple scenarios}{199}{chapter.10}%
+\contentsline {section}{\numberline {10.1}Simulating across levels of a single factor}{200}{section.10.1}%
+\contentsline {subsection}{\numberline {10.1.1}A performance summary function}{203}{subsection.10.1.1}%
+\contentsline {subsection}{\numberline {10.1.2}Adding performance calculations to the simulation driver}{204}{subsection.10.1.2}%
+\contentsline {section}{\numberline {10.2}Simulating across multiple factors}{206}{section.10.2}%
+\contentsline {section}{\numberline {10.3}Using pmap to run multifactor simulations}{208}{section.10.3}%
+\contentsline {section}{\numberline {10.4}When to calculate performance metrics}{212}{section.10.4}%
+\contentsline {subsection}{\numberline {10.4.1}Aggregate as you simulate (inside)}{212}{subsection.10.4.1}%
+\contentsline {subsection}{\numberline {10.4.2}Keep all simulation runs (outside)}{212}{subsection.10.4.2}%
+\contentsline {subsection}{\numberline {10.4.3}Getting raw results ready for analysis}{214}{subsection.10.4.3}%
+\contentsline {section}{\numberline {10.5}Summary}{216}{section.10.5}%
+\contentsline {section}{\numberline {10.6}Exercises}{217}{section.10.6}%
+\contentsline {subsection}{\numberline {10.6.1}Extending Brown and Forsythe}{217}{subsection.10.6.1}%
+\contentsline {subsection}{\numberline {10.6.2}Comparing the trimmed mean, median and mean}{217}{subsection.10.6.2}%
+\contentsline {subsection}{\numberline {10.6.3}Estimating latent correlations}{218}{subsection.10.6.3}%
+\contentsline {subsection}{\numberline {10.6.4}Meta-regression}{219}{subsection.10.6.4}%
+\contentsline {subsection}{\numberline {10.6.5}Examine a multifactor simulation design}{219}{subsection.10.6.5}%
+\contentsline {chapter}{\numberline {11}Designing multifactor simulations}{221}{chapter.11}%
+\contentsline {section}{\numberline {11.1}Choosing parameter combinations}{222}{section.11.1}%
+\contentsline {section}{\numberline {11.2}Case Study: A multifactor evaluation of cluster RCT estimators}{224}{section.11.2}%
+\contentsline {subsection}{\numberline {11.2.1}Choosing parameters for the Clustered RCT}{224}{subsection.11.2.1}%
+\contentsline {subsection}{\numberline {11.2.2}Redundant factor combinations}{226}{subsection.11.2.2}%
+\contentsline {subsection}{\numberline {11.2.3}Running the simulations}{226}{subsection.11.2.3}%
+\contentsline {subsection}{\numberline {11.2.4}Calculating performance metrics}{227}{subsection.11.2.4}%
+\contentsline {section}{\numberline {11.3}Conclusions}{229}{section.11.3}%
+\contentsline {chapter}{\numberline {12}Exploring and presenting simulation results}{231}{chapter.12}%
+\contentsline {section}{\numberline {12.1}Tabulation}{232}{section.12.1}%
+\contentsline {subsection}{\numberline {12.1.1}Example: estimators of treatment variation}{234}{subsection.12.1.1}%
+\contentsline {section}{\numberline {12.2}Visualization}{235}{section.12.2}%
+\contentsline {subsection}{\numberline {12.2.1}Example 0: RMSE in Cluster RCTs}{236}{subsection.12.2.1}%
+\contentsline {subsection}{\numberline {12.2.2}Example 1: Biserial correlation estimation}{237}{subsection.12.2.2}%
+\contentsline {subsection}{\numberline {12.2.3}Example 2: Variance estimation and Meta-regression}{237}{subsection.12.2.3}%
+\contentsline {subsection}{\numberline {12.2.4}Example 3: Heat maps of coverage}{238}{subsection.12.2.4}%
+\contentsline {subsection}{\numberline {12.2.5}Example 4: Relative performance of treatment effect estimators}{239}{subsection.12.2.5}%
+\contentsline {section}{\numberline {12.3}Modeling}{241}{section.12.3}%
+\contentsline {subsection}{\numberline {12.3.1}Example 1: Biserial, revisited}{242}{subsection.12.3.1}%
+\contentsline {subsection}{\numberline {12.3.2}Example 2: Comparing methods for cross-classified data}{243}{subsection.12.3.2}%
+\contentsline {section}{\numberline {12.4}Reporting}{244}{section.12.4}%
+\contentsline {chapter}{\numberline {13}Building good visualizations}{247}{chapter.13}%
+\contentsline {section}{\numberline {13.1}Subsetting and Many Small Multiples}{248}{section.13.1}%
+\contentsline {section}{\numberline {13.2}Bundling}{251}{section.13.2}%
+\contentsline {section}{\numberline {13.3}Aggregation}{254}{section.13.3}%
+\contentsline {subsubsection}{\numberline {13.3.0.1}Some notes on how to aggregate}{257}{subsubsection.13.3.0.1}%
+\contentsline {section}{\numberline {13.4}Comparing true SEs with standardization}{257}{section.13.4}%
+\contentsline {section}{\numberline {13.5}The Bias-SE-RMSE plot}{262}{section.13.5}%
+\contentsline {section}{\numberline {13.6}Assessing the quality of the estimated SEs}{264}{section.13.6}%
+\contentsline {subsection}{\numberline {13.6.1}Stability of estimated SEs}{266}{subsection.13.6.1}%
+\contentsline {section}{\numberline {13.7}Assessing confidence intervals}{267}{section.13.7}%
+\contentsline {section}{\numberline {13.8}Conclusions}{270}{section.13.8}%
+\contentsline {section}{\numberline {13.9}Exercises}{270}{section.13.9}%
+\contentsline {subsection}{\numberline {13.9.1}Checking bias}{272}{subsection.13.9.1}%
+\contentsline {subsection}{\numberline {13.9.2}Assessing uncertainty}{272}{subsection.13.9.2}%
+\contentsline {subsection}{\numberline {13.9.3}Assessing power}{272}{subsection.13.9.3}%
+\contentsline {subsection}{\numberline {13.9.4}Going deeper with coverage}{272}{subsection.13.9.4}%
+\contentsline {subsection}{\numberline {13.9.5}Pearson correlations with a bivariate Poisson distribution}{272}{subsection.13.9.5}%
+\contentsline {subsection}{\numberline {13.9.6}Making another plot for assessing SEs}{272}{subsection.13.9.6}%
+\contentsline {chapter}{\numberline {14}Special Topics on Reporting Simulation Results}{273}{chapter.14}%
+\contentsline {section}{\numberline {14.1}Using regression to analyze simulation results}{273}{section.14.1}%
+\contentsline {subsection}{\numberline {14.1.1}Example 1: Biserial, revisited}{273}{subsection.14.1.1}%
+\contentsline {subsection}{\numberline {14.1.2}Example 2: Cluster RCT example, revisited}{276}{subsection.14.1.2}%
+\contentsline {subsubsection}{\numberline {14.1.2.1}Using LASSO to simplify the model}{277}{subsubsection.14.1.2.1}%
+\contentsline {section}{\numberline {14.2}Using regression trees to find important factors}{282}{section.14.2}%
+\contentsline {section}{\numberline {14.3}Analyzing results with few iterations per scenario}{284}{section.14.3}%
+\contentsline {subsection}{\numberline {14.3.1}Example: ClusterRCT with only 25 replicates per scenario}{285}{subsection.14.3.1}%
+\contentsline {section}{\numberline {14.4}What to do with warnings in simulations}{290}{section.14.4}%
+\contentsline {section}{\numberline {14.5}Conclusions}{291}{section.14.5}%
+\contentsline {chapter}{\numberline {15}Case study: Comparing different estimators}{293}{chapter.15}%
+\contentsline {section}{\numberline {15.1}Bias-variance tradeoffs}{296}{section.15.1}%
+\contentsline {chapter}{\numberline {16}Simulations as evidence}{301}{chapter.16}%
+\contentsline {section}{\numberline {16.1}Break symmetries and regularities}{302}{section.16.1}%
+\contentsline {section}{\numberline {16.2}Use extensive multi-factor simulations.}{303}{section.16.2}%
+\contentsline {section}{\numberline {16.3}Use simulation DGPs from prior literature.}{303}{section.16.3}%
+\contentsline {section}{\numberline {16.4}Pick simulation factors based on real data}{303}{section.16.4}%
+\contentsline {section}{\numberline {16.5}Resample real data directly to get authentic distributions}{304}{section.16.5}%
+\contentsline
{section}{\numberline {16.6}Fully calibrated simulations}{304}{section.16.6}% +\contentsline {section}{\numberline {16.7}Generating calibrated data with a copula}{305}{section.16.7}% +\contentsline {subsection}{\numberline {16.7.1}Step 1. Predict the outcome given the covariates}{307}{subsection.16.7.1}% +\contentsline {subsection}{\numberline {16.7.2}Step 2. Tie \(\hat {Y}\) to \(Y\) with a copula}{307}{subsection.16.7.2}% +\contentsline {subsection}{\numberline {16.7.3}Step 3. Get some new Xs}{309}{subsection.16.7.3}% +\contentsline {subsection}{\numberline {16.7.4}Step 4. Generate a predicted \(\hat {Y}_i\) for each \(X_i\) using the machine learner.}{310}{subsection.16.7.4}% +\contentsline {subsection}{\numberline {16.7.5}Step 5. Generate the final, observed, \(Y_i\) using a copula that ties \(\hat {Y}_i\) to \(Y_i\).}{310}{subsection.16.7.5}% +\contentsline {subsection}{\numberline {16.7.6}How did we do?}{311}{subsection.16.7.6}% +\contentsline {subsection}{\numberline {16.7.7}Building a multifactor simulation from our calibrated DGP}{312}{subsection.16.7.7}% +\contentsline {part}{IV\hspace {1em}Computational Considerations}{313}{part.4}% +\contentsline {chapter}{\numberline {17}Organizing a simulation project}{315}{chapter.17}% +\contentsline {section}{\numberline {17.1}Simulation project structure}{315}{section.17.1}% +\contentsline {section}{\numberline {17.2}Well structured code files}{316}{section.17.2}% +\contentsline {subsection}{\numberline {17.2.1}Putting headers in your .R file}{317}{subsection.17.2.1}% +\contentsline {subsection}{\numberline {17.2.2}The source command}{317}{subsection.17.2.2}% +\contentsline {subsection}{\numberline {17.2.3}Storing testing code in your scripts}{319}{subsection.17.2.3}% +\contentsline {section}{\numberline {17.3}Principled directory structures}{320}{section.17.3}% +\contentsline {section}{\numberline {17.4}Saving simulation results}{321}{section.17.4}% +\contentsline {subsection}{\numberline {17.4.1}File 
formats}{321}{subsection.17.4.1}% +\contentsline {subsection}{\numberline {17.4.2}Saving simulations as you go}{323}{subsection.17.4.2}% +\contentsline {chapter}{\numberline {18}Parallel Processing}{327}{chapter.18}% +\contentsline {section}{\numberline {18.1}Parallel on your computer}{328}{section.18.1}% +\contentsline {section}{\numberline {18.2}Parallel on a virtual machine}{329}{section.18.2}% +\contentsline {section}{\numberline {18.3}Parallel on a cluster}{330}{section.18.3}% +\contentsline {subsection}{\numberline {18.3.1}Command-line interfaces and CMD mode in R}{330}{subsection.18.3.1}% +\contentsline {subsection}{\numberline {18.3.2}Task dispatchers and command line scripts}{332}{subsection.18.3.2}% +\contentsline {subsection}{\numberline {18.3.3}Moving files to and from a cluster}{334}{subsection.18.3.3}% +\contentsline {subsection}{\numberline {18.3.4}Checking on a job}{334}{subsection.18.3.4}% +\contentsline {subsection}{\numberline {18.3.5}Running lots of jobs on a cluster}{335}{subsection.18.3.5}% +\contentsline {subsection}{\numberline {18.3.6}Resources for Harvard's Odyssey}{337}{subsection.18.3.6}% +\contentsline {subsection}{\numberline {18.3.7}Acknowledgements}{337}{subsection.18.3.7}% +\contentsline {chapter}{\numberline {19}Debugging and Testing}{339}{chapter.19}% +\contentsline {section}{\numberline {19.1}Debugging with \texttt {print()}}{339}{section.19.1}% +\contentsline {section}{\numberline {19.2}Debugging with \texttt {browser()}}{340}{section.19.2}% +\contentsline {section}{\numberline {19.3}Debugging with \texttt {debug()}}{341}{section.19.3}% +\contentsline {section}{\numberline {19.4}Protecting functions with \texttt {stopifnot()}}{341}{section.19.4}% +\contentsline {section}{\numberline {19.5}Testing code}{342}{section.19.5}% +\contentsline {part}{V\hspace {1em}Complex Data Structures}{347}{part.5}% +\contentsline {chapter}{\numberline {20}Using simulation as a power calculator}{349}{chapter.20}% +\contentsline {section}{\numberline 
{20.1}Getting design parameters from pilot data}{350}{section.20.1}% +\contentsline {section}{\numberline {20.2}The data generating process}{351}{section.20.2}% +\contentsline {section}{\numberline {20.3}Running the simulation}{354}{section.20.3}% +\contentsline {section}{\numberline {20.4}Evaluating power}{355}{section.20.4}% +\contentsline {subsection}{\numberline {20.4.1}Checking validity of our models}{355}{subsection.20.4.1}% +\contentsline {subsection}{\numberline {20.4.2}Assessing Precision (SE)}{358}{subsection.20.4.2}% +\contentsline {subsection}{\numberline {20.4.3}Assessing power}{358}{subsection.20.4.3}% +\contentsline {subsection}{\numberline {20.4.4}Assessing Minimum Detectable Effects}{359}{subsection.20.4.4}% +\contentsline {section}{\numberline {20.5}Power for Multilevel Data}{360}{section.20.5}% +\contentsline {chapter}{\numberline {21}Simulation under the Potential Outcomes Framework}{365}{chapter.21}% +\contentsline {section}{\numberline {21.1}Finite vs.\nobreakspace {}Superpopulation inference}{366}{section.21.1}% +\contentsline {section}{\numberline {21.2}Data generation processes for potential outcomes}{366}{section.21.2}% +\contentsline {section}{\numberline {21.3}Finite sample performance measures}{369}{section.21.3}% +\contentsline {section}{\numberline {21.4}Nested finite simulation procedure}{372}{section.21.4}% +\contentsline {section}{\numberline {21.5}Calibrated simulations within the potential outcomes framework}{376}{section.21.5}% +\contentsline {subsection}{\numberline {21.5.1}Matching-based imputation}{376}{subsection.21.5.1}% +\contentsline {subsection}{\numberline {21.5.2}Model-based imputation}{377}{subsection.21.5.2}% +\contentsline {subsection}{\numberline {21.5.3}Using the imputed data}{377}{subsection.21.5.3}% +\contentsline {chapter}{\numberline {22}The Parametric bootstrap}{379}{chapter.22}% +\contentsline {section}{\numberline {22.1}Air conditioners: a stolen case study}{380}{section.22.1}% +\contentsline 
{chapter}{\numberline {A}Coding Reference}{383}{appendix.A}% +\contentsline {section}{\numberline {A.1}How to repeat yourself}{383}{section.A.1}% +\contentsline {subsection}{\numberline {A.1.1}Using \texttt {replicate()}}{383}{subsection.A.1.1}% +\contentsline {subsection}{\numberline {A.1.2}Using \texttt {map()}}{384}{subsection.A.1.2}% +\contentsline {subsection}{\numberline {A.1.3}map with no inputs}{386}{subsection.A.1.3}% +\contentsline {subsection}{\numberline {A.1.4}Other approaches for repetition}{386}{subsection.A.1.4}% +\contentsline {section}{\numberline {A.2}Default arguments for functions}{387}{section.A.2}% +\contentsline {section}{\numberline {A.3}Profiling Code}{388}{section.A.3}% +\contentsline {subsection}{\numberline {A.3.1}Using \texttt {Sys.time()} and \texttt {system.time()}}{388}{subsection.A.3.1}% +\contentsline {subsection}{\numberline {A.3.2}The \texttt {tictoc} package}{389}{subsection.A.3.2}% +\contentsline {subsection}{\numberline {A.3.3}The \texttt {bench} package}{389}{subsection.A.3.3}% +\contentsline {subsection}{\numberline {A.3.4}Profiling with \texttt {profvis}}{392}{subsection.A.3.4}% +\contentsline {section}{\numberline {A.4}Optimizing code (and why you often shouldn't)}{392}{section.A.4}% +\contentsline {subsection}{\numberline {A.4.1}Hand-building functions}{393}{subsection.A.4.1}% +\contentsline {subsection}{\numberline {A.4.2}Computational efficiency versus simplicity}{394}{subsection.A.4.2}% +\contentsline {subsection}{\numberline {A.4.3}Reusing code to speed up computation}{395}{subsection.A.4.3}% +\contentsline {chapter}{\numberline {B}Further readings and resources}{401}{appendix.B}% diff --git a/book.bib b/book.bib index c280cf8..bd632aa 100644 --- a/book.bib +++ b/book.bib @@ -1,13 +1,21 @@ %% This BibTeX bibliography file was created using BibDesk. 
%% https://bibdesk.sourceforge.io/ -%% Created for Luke Miratrix at 2025-12-18 10:05:30 -0500 +%% Created for Luke Miratrix at 2026-01-16 15:00:23 -0500 %% Saved with string encoding Unicode (UTF-8) +@book{efron1994introduction, + author = {Efron, Bradley and Tibshirani, Robert J}, + date-added = {2026-01-16 15:00:18 -0500}, + date-modified = {2026-01-16 15:00:18 -0500}, + publisher = {Chapman and Hall/CRC}, + title = {An introduction to the bootstrap}, + year = {1994}} + @article{Benjamin2017redefine, author = {Benjamin, Daniel J. and Berger, James O. and Johannesson, Magnus and Nosek, Brian A. and Wagenmakers, E.-J. and Berk, Richard and Bollen, Kenneth A. and Brembs, Bj{\"o}rn and Brown, Lawrence and Camerer, Colin and Cesarini, David and Chambers, Christopher D. and Clyde, Merlise and Cook, Thomas D. and De Boeck, Paul and Dienes, Zoltan and Dreber, Anna and Easwaran, Kenny and Efferson, Charles and Fehr, Ernst and Fidler, Fiona and Field, Andy P. and Forster, Malcolm and George, Edward I. and Gonzalez, Richard and Goodman, Steven and Green, Edwin and Green, Donald P. and Greenwald, Anthony G. and Hadfield, Jarrod D. and Hedges, Larry V. and Held, Leonhard and Hua Ho, Teck and Hoijtink, Herbert and Hruschka, Daniel J. and Imai, Kosuke and Imbens, Guido and Ioannidis, John P. A. and Jeon, Minjeong and Jones, James Holland and Kirchler, Michael and Laibson, David and List, John and Little, Roderick and Lupia, Arthur and Machery, Edouard and Maxwell, Scott E. and McCarthy, Michael and Moore, Don A. and Morgan, Stephen L. and Munaf{\'o}, Marcus and Nakagawa, Shinichi and Nyhan, Brendan and Parker, Timothy H. and Pericchi, Luis and Perugini, Marco and Rouder, Jeff and Rousseau, Judith and Savalei, Victoria and Sch{\"o}nbrodt, Felix D. and Sellke, Thomas and Sinclair, Betsy and Tingley, Dustin and Van Zandt, Trisha and Vazire, Simine and Watts, Duncan J. and Winship, Christopher and Wolpert, Robert L. 
and Xie, Yu and Young, Cristobal and Zinman, Jonathan and Johnson, Valen E.}, doi = {10.1038/s41562-017-0189-z}, @@ -485,7 +493,7 @@ @article{Bloom2016using title = {{Using Multisite Experiments to Study Cross-Site Variation in Treatment Effects: A Hybrid Approach With Fixed Intercepts and a Random Treatment Coefficient}}, volume = {10}, year = {2016}, - bdsk-file-1 = {YnBsaXN0MDDS... [base64 BibDesk file bookmark, elided]}, + bdsk-file-1 = {YnBsaXN0MDDS... [base64 BibDesk file bookmark, elided]}, bdsk-url-1 = {https://doi.org/10.1080/19345747.2016.1264518}} @article{brown1974SmallSampleBehavior,
diff --git a/case_study_code/analyze_cluster_RCT.R b/case_study_code/analyze_cluster_RCT.R index 632365f..f699c1a 100644 --- a/case_study_code/analyze_cluster_RCT.R +++ b/case_study_code/analyze_cluster_RCT.R @@ -76,24 +76,28 @@ analysis_MLM_safe <- function( dat, all_results = FALSE ) { if ( is.null( M1$result ) ) { # we had an error! tibble( ATE_hat = NA, SE_hat = NA, p_value = NA, - message = M1$message, - warning = M1$warning, + message = length( M1$message ) > 0, + warning = length( M1$warning ) > 0, error = TRUE ) } else { sum <- summary( M1$result ) + tibble( ATE_hat = sum$coefficients["Z","Estimate"], SE_hat = sum$coefficients["Z","Std. Error"], p_value = sum$coefficients["Z", "Pr(>|t|)"], - message = list( M1$message ), - warning = list( M1$warning ), + message = length( M1$message ) > 0, + warning = length( M1$warning ) > 0, error = FALSE ) } } - +#' This is to illustrate how we can implement a data analytic workflow +#' mimicking human decision-making. +#' +#' This is _not_ what we use for the simulation results. analysis_MLM_contingent <- function( dat ) { M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1 | sid), data=dat ) diff --git a/case_study_code/clustered_data_simulation.R b/case_study_code/clustered_data_simulation.R index 9b4ad69..db9a77f 100644 --- a/case_study_code/clustered_data_simulation.R +++ b/case_study_code/clustered_data_simulation.R @@ -11,6 +11,9 @@ library( estimatr ) options(list(dplyr.summarise.inform = FALSE)) +source( here::here( "case_study_code/gen_cluster_RCT.R" ) ) + +source( here::here( "case_study_code/analyze_cluster_RCT.R" ) ) # Utility function for printout @@ -19,148 +22,33 @@ scat = function( str, ... 
) { } -gen_cluster_RCT <- function( n_bar = 10, - J = 30, - p = 0.5, - gamma_0 = 0, gamma_1 = 0, gamma_2 = 0, - sigma2_u = 0, sigma2_e = 1, - alpha = 0 ) { - - stopifnot( alpha >= 0 ) - - # generate site sizes - n_min = round( n_bar * (1 - alpha) ) - n_max = round( n_bar * (1 + alpha) ) - if ( n_max == n_min ) { - if ( alpha > 0 ) { - warning( "alpha > 0 has no effect when there is no variation in site size" ) - } - nj = rep( n_min, J ) - } else { - nj <- sample( n_min:n_max, J, replace=TRUE ) - } - - # Generate average control outcome and average ATE for all sites - # (The random effects) - u0j = rnorm( J, mean=0, sd=sqrt( sigma2_u ) ) - - # randomize units within each site (proportion p to treatment) - Zj = ifelse( sample( 1:J ) <= J * p, 1, 0) - - # Calculate site intercept for each site - beta_0j = gamma_0 + gamma_1 * Zj + gamma_2 * Zj * (nj-n_bar)/n_bar + u0j - - # Make individual site membership - sid = as.factor( rep( 1:J, nj ) ) - dd = data.frame( sid = sid ) - - # Make individual level tx variables - dd$Z = Zj[ dd$sid ] - - # Generate the residuals - N = sum( nj ) - e = rnorm( N, mean=0, sd=sqrt( sigma2_e ) ) - - # Bundle and send out - dd <- mutate( dd, - sid=as.factor(sid), - Yobs = beta_0j[sid] + e, - Z = Zj[ sid ] ) - - dd -} - - - -#### Analysis functions ##### - - -quiet_lmer = purrr::quietly( lmerTest::lmer ) -quiet_analysis_MLM <- function( dat ) { - M1 = quiet_lmer( Yobs ~ 1 + Z + (1|sid), - data=dat ) - message1 = ifelse( length( M1$message ) > 0, 1, 0 ) - warning1 = ifelse( length( M1$warning ) > 0, 1, 0 ) - M1 = M1$result +quiet_analyze_data <- function(dat, + CR_se_type = "CR2", + agg_se_type = "HC2") { - est = fixef( M1 )[["Z"]] - se = se.fixef( M1 )[["Z"]] - sc = summary(M1)$coefficients - pv = sc["Z",5] - df = sc["Z", "df"] - tibble( ATE_hat = est, SE_hat = se, df = df, p_value = pv, - message = message1, warning = warning1 ) -} - - -analysis_MLM <- function( dat ) { - M1 = lmer( Yobs ~ 1 + Z + (1|sid), - data=dat ) + # MLM + a = Sys.time() 
+ MLM <- analysis_MLM_safe(dat) + MLM$seconds <- as.numeric(Sys.time() - a, units = "secs") - est = fixef( M1 )[["Z"]] - se = se.fixef( M1 )[["Z"]] - sc = summary(M1)$coefficients - pv = sc["Z",5] - df = sc["Z", "df"] - tibble( ATE_hat = est, SE_hat = se, df = df, p_value = pv ) -} - - -analysis_OLS <- function( dat ) { - M2 <- lm_robust( Yobs ~ 1 + Z, - data=dat, clusters=sid ) - est <- M2$coefficients[["Z"]] - se <- M2$std.error[["Z"]] - pv <- M2$p.value[["Z"]] - df <- M2$df[["Z"]] - tibble( ATE_hat = est, SE_hat = se, df = df, p_value = pv ) -} - - -analysis_agg <- function( dat ) { - datagg <- - dat %>% - group_by( sid, Z ) %>% - summarise( - Ybar = mean( Yobs ), - n = n(), - .groups = "drop" - ) + # OLS + a = Sys.time() + LR <- analysis_OLS(dat, se_type = CR_se_type) + LR$seconds <- as.numeric(Sys.time() - a, units = "secs") + LR$message <- 0 + LR$warning <- 0 + LR$error <- 0 - stopifnot( nrow( datagg ) == length(unique(dat$sid) ) ) + # Aggregated + a = Sys.time() + Agg <- analysis_agg(dat, se_type = agg_se_type) + Agg$seconds <- as.numeric(Sys.time() - a, units = "secs") + Agg$message <- 0 + Agg$warning <- 0 + Agg$error <- 0 - M3 <- lm_robust( Ybar ~ 1 + Z, - data=datagg, se_type = "HC2" ) - est <- M3$coefficients[["Z"]] - se <- M3$std.error[["Z"]] - df <- M3$df[["Z"]] - pv <- M3$p.value[["Z"]] - tibble( ATE_hat = est, SE_hat = se, df = df, p_value = pv ) -} - - - -analyze_data = function( dat ) { - MLM = analysis_MLM( dat ) - LR = analysis_OLS( dat ) - Agg = analysis_agg( dat ) - - bind_rows( MLM = MLM, LR = LR, Agg = Agg, - .id = "method" ) -} - - -quiet_analyze_data = function( dat ) { - MLM = quiet_analysis_MLM( dat ) - LR = analysis_OLS( dat ) - Agg = analysis_agg( dat ) - LR$message = 0 - LR$warning = 0 - Agg$message = 0 - Agg$warning = 0 - bind_rows( MLM = MLM, LR = LR, Agg = Agg, - .id = "method" ) + bind_rows(MLM = MLM, LR = LR, Agg = Agg, .id = "method") } @@ -197,7 +85,6 @@ if ( FALSE ) { dat <- gen_cluster_RCT( 5, 3 ) dat - undebug( 
quiet_analysis_MLM ) quiet_analyze_data( dat ) res <- run_CRT_sim( reps = 5, alpha=0.5 ) diff --git a/case_study_code/clustered_data_simulation_runner.R b/case_study_code/clustered_data_simulation_runner.R index ee11647..a5bb91d 100644 --- a/case_study_code/clustered_data_simulation_runner.R +++ b/case_study_code/clustered_data_simulation_runner.R @@ -5,6 +5,8 @@ # It is sourced in the Secret Run code block in the designing # multifactor simulation area. +# You can also run it directly here. + library( tidyverse ) library( future ) library( furrr ) @@ -25,6 +27,10 @@ params <- expand_grid(!!!crt_design_factors) params$seed = 1:nrow(params) * 17 + 100000 params +set.seed( 40440 ) +params = slice_sample( params, n=nrow(params) ) +params + plan(multisession, workers = parallel::detectCores() - 2) params$res = future_pmap(params, .f = run_CRT_sim, reps = 1000, @@ -36,6 +42,9 @@ res <- params %>% unnest( cols=c(res) ) saveRDS( res, file = "results/simulation_CRT.rds" ) + +# Quick check of the simulation ---- + if ( FALSE ) { # check sim format run_CRT_sim( reps = 5, diff --git a/case_study_code/clustered_data_simulation_timer.R b/case_study_code/clustered_data_simulation_timer.R new file mode 100644 index 0000000..3ffa463 --- /dev/null +++ b/case_study_code/clustered_data_simulation_timer.R @@ -0,0 +1,293 @@ + +# This script runs the multifactor simulation in parallel to generate +# the result files used in the book. + +# It is sourced in the Secret Run code block in the designing +# multifactor simulation area. + +# You can also run it directly here. 
+ +library( tidyverse ) +library( future ) +library( furrr ) + +source( "case_study_code/clustered_data_simulation.R") + + +# Updated MLM analysis with max iterations + flags for convergence / maxfun hit +analysis_MLM_safe <- function( dat, all_results = FALSE, maxfun = 1000 ) { + + M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1 | sid), + data=dat, + control = lme4::lmerControl( + optimizer = "bobyqa", + optCtrl = list(maxfun = maxfun) + ) ) + + if (all_results) { + return(M1) + } + + if ( is.null( M1$result ) ) { + # we had an error! + tibble( ATE_hat = NA, SE_hat = NA, p_value = NA, + message = length( M1$message ) > 0, + warning = length( M1$warning ) > 0, + error = TRUE ) + } else { + fit = M1$result + sum <- summary( fit ) + + optinfo <- fit@optinfo + + feval = optinfo$feval + + #if ( length( optinfo$conv$lme4 ) > 0 || + # length( optinfo$messages ) > 0 ) { + # browser() + #} + + tibble( + ATE_hat = sum$coefficients["Z","Estimate"], + SE_hat = sum$coefficients["Z","Std. Error"], + p_value = sum$coefficients["Z", "Pr(>|t|)"], + message = length( M1$messages ) > 0, + warning = length( M1$warning ) > 0 , + error = FALSE, + hit_maxfun = feval >= maxfun + ) + + } +} + + + + +run_CRT_sim <- function( reps, + n_bar = 10, J = 30, p = 0.5, + ATE = 0, ICC = 0.4, + size_coef = 0, alpha = 0, + seed = NULL ) { + + a = Sys.time() + stopifnot( ICC >= 0 && ICC < 1 ) + + scat( "Running n=%d, J=%d, ICC=%.2f, ATE=%.2f (%d replicates)\n", + n_bar, J, ICC, ATE, reps) + + if (!is.null(seed)) set.seed(seed) + + res <- + simhelpers::repeat_and_stack( reps, { + dat <- gen_cluster_RCT( n_bar = n_bar, J = J, p = p, + gamma_0 = 0, gamma_1 = ATE, gamma_2 = size_coef, + sigma2_u = ICC, sigma2_e = 1 - ICC, + alpha = alpha ) + quiet_analyze_data(dat) + }, + id = "runID" ) + + res$total_time <- as.numeric( Sys.time() - a, units="secs" ) + + res +} + + + + +if ( FALSE ) { + # Test the updated analysis function + set.seed( 40440 ) + test_dat <- gen_cluster_RCT( n_bar = 320, + J = 80, + gamma_1 = 0.2, 
+ gamma_2 = 0.2, + sigma2_u = 0, + alpha = 0.5 ) + + debug( analysis_MLM_safe ) + analysis_MLM_safe( test_dat ) + + quiet_analyze_data( test_dat ) + + rr <- run_CRT_sim( reps = 5, + n_bar = 320, + J = 30, + ATE = 0.2, + size_coef = 0.2, + ICC = 0, + alpha = 0.5, + seed = 1234567) + rr + rr$total_time[[1]] - sum( rr$seconds) +} + + + +# Run the simulation ---- + + +crt_design_factors <- list( + n_bar = c( 20, 80, 320 ), + J = c( 5, 20, 80 ), + ATE = c( 0.2 ), + size_coef = c( 0, 0.2 ), + ICC = c( 0, 0.2, 0.4, 0.6, 0.8 ), + alpha = c( 0, 0.5, 0.8 ) +) + +params <- expand_grid(!!!crt_design_factors) + +params$seed = 1:nrow(params) * 17 + 100000 +params + +set.seed( 404450 ) +params = slice_sample( params, n=nrow(params) ) +params + + +start_time = Sys.time() + +plan(multisession, workers = parallel::detectCores() - 2) + +cat( "Starting simulation\n" ) + +R = 1000 +params$res = future_pmap(params, .f = run_CRT_sim, + reps = R, + .options = furrr_options(seed = NULL), + .progress = TRUE ) +plan(sequential) +gc() +res <- params %>% unnest( cols=c(res) ) +saveRDS( res, file = "results/simulation_CRT_timing.rds" ) + +end_time = Sys.time() + +cat( "Total computation time:\n" ) +print( end_time - start_time ) + + +# Analyze timing results ---- + + +res +nrow( params ) +nrow( res ) +nrow( res ) / nrow( params ) + +mt = mean( res$seconds ) +mt + +# Estimated total time in hours for full 1000 reps +comp_time <- mt * nrow(params) * 3 * 1000 / (60*60) +comp_time + +# Across 6 threads (est.) +comp_time / 5 + +# Timing for longest runs + +rs <- res %>% + group_by( n_bar, J, ATE, size_coef, ICC, alpha, method ) %>% + summarise( Etime = mean( seconds ) * 1000 / 60, + ttime = total_time[1], + sdttime = sd( total_time ), + conv = mean( hit_maxfun ) ) + +ggplot( rs, aes( method, Etime ) ) + + geom_boxplot() + + coord_flip() + + +sum( rs$Etime >= 30 ) + +# Any failure to converge? 
+rs$conv[ rs$method == "MLM" ] + + +ggplot( rs, aes( as.factor(n_bar), Etime, color=as.factor(J), group=interaction( J, n_bar ) ) ) + + geom_boxplot( width = 0.5 ) + + #geom_smooth() + + facet_wrap( ~ method ) + + #geom_line( aes( group=method ) ) + + #facet_grid( ICC ~ J, scales="free_y" ) + + scale_y_log10( breaks =c( 0.1, 1, 10, 30, 60 ) ) + + theme_minimal() + +# As above, but swap n_bar and J +ggplot( rs, aes( as.factor(J), Etime, color=as.factor(n_bar), group=interaction( J, n_bar ) ) ) + + geom_boxplot( width = 0.5 ) + + #geom_smooth() + + facet_wrap( ~ method ) + + #geom_line( aes( group=method ) ) + + #facet_grid( ICC ~ J, scales="free_y" ) + + scale_y_log10( breaks =c( 0.1, 1, 10, 30, 60 ) ) + + theme_minimal() + +# ICC matters? +ggplot( rs, aes( ICC, Etime, color=as.factor(n_bar), lty=as.factor(J), + group=interaction( J, n_bar ) ) ) + + geom_smooth( se = FALSE, method="lm" ) + + facet_wrap( ~ method ) + + #geom_line( aes( group=method ) ) + + #facet_grid( ICC ~ J, scales="free_y" ) + + scale_y_log10( breaks =c( 0.1, 1, 10, 30, 60 ) ) + + theme_minimal() + + +# Calculate _total_ time to run simulation ---- + +rst <- res %>% + group_by( n_bar, J, ATE, size_coef, ICC, alpha ) %>% + summarise( ttime = total_time[1], + sdttime = sd( total_time ), + dtime = ttime - sum( seconds ) ) + +nrow( rst ) +sum( rst$ttime ) / (60*60) + +rst$sdttime +summary( rst$dtime ) + +nrow( rst ) + +# Time for single iteration for all (minutes) +single_iter <- sum( rst$ttime / (R*60) ) +single_iter + +# Total sim, in hours +1000*single_iter / 60 + + +3 * 4 * 60 / single_iter + + + +# Save to core results file ---- + +res$hit_maxfun = NULL +res$total_time = NULL +res +saveRDS( res, file = "results/simulation_CRT.rds" ) + + +if ( FALSE ) { + # check sim format + run_CRT_sim( reps = 5, + n_bar = 20, + J = 5, + ATE = 0.2, + size_coef = 0.2, + ICC = 0, + alpha = 0.5, + seed = 1234567) + + + # Check the results + res = readRDS( "results/simulation_CRT.rds" ) + res +} + + + diff --git 
a/case_study_code/gen_cluster_RCT.R b/case_study_code/gen_cluster_RCT.R index 907d71b..0be58c1 100644 --- a/case_study_code/gen_cluster_RCT.R +++ b/case_study_code/gen_cluster_RCT.R @@ -6,12 +6,21 @@ gen_cluster_RCT <- function( gamma_0 = 0, gamma_1 = 0, gamma_2 = 0, sigma2_u = 0, sigma2_e = 1 ) { + stopifnot( alpha >= 0 ) # generate school sizes n_min <- round( n_bar * (1 - alpha) ) n_max <- round( n_bar * (1 + alpha) ) - nj <- sample( n_min:n_max, J, replace = TRUE ) + if (n_min < n_max) { + nj <- sample( n_min:n_max, J, replace = TRUE ) + } else { + if ( alpha > 0 ) { + warning( "alpha > 0 has no effect when there is no variation in site size" ) + } + nj <- rep(n_bar, J) + } + # Generate average control outcome for all schools # (the random effects) u0j <- rnorm( J, mean = 0, sd = sqrt(sigma2_u) ) diff --git a/case_study_code/gen_cluster_RCT_rev.R b/case_study_code/gen_cluster_RCT_bug.R similarity index 82% rename from case_study_code/gen_cluster_RCT_rev.R rename to case_study_code/gen_cluster_RCT_bug.R index 647d636..907d71b 100644 --- a/case_study_code/gen_cluster_RCT_rev.R +++ b/case_study_code/gen_cluster_RCT_bug.R @@ -10,14 +10,10 @@ gen_cluster_RCT <- function( # generate school sizes n_min <- round( n_bar * (1 - alpha) ) n_max <- round( n_bar * (1 + alpha) ) + nj <- sample( n_min:n_max, J, replace = TRUE ) - if (n_min < n_max) { - nj <- sample( n_min:n_max, J, replace = TRUE ) - } else { - nj <- rep(n_bar, J) - } - - # Generate average control outcome for all schools (the random effects) + # Generate average control outcome for all schools + # (the random effects) u0j <- rnorm( J, mean = 0, sd = sqrt(sigma2_u) ) # randomize schools (proportion p to treatment) diff --git a/results/cluster_RCT_simulation.rds b/results/cluster_RCT_simulation.rds index 74a5ba0..7f67515 100644 Binary files a/results/cluster_RCT_simulation.rds and b/results/cluster_RCT_simulation.rds differ diff --git a/results/cluster_RCT_simulation_validity.rds 
b/results/cluster_RCT_simulation_validity.rds index 8c9b58f..af16721 100644 Binary files a/results/cluster_RCT_simulation_validity.rds and b/results/cluster_RCT_simulation_validity.rds differ diff --git a/results/simulation_CRT.rds b/results/simulation_CRT.rds index 7421776..91b762a 100644 Binary files a/results/simulation_CRT.rds and b/results/simulation_CRT.rds differ diff --git a/results/simulation_CRT_timing.rds b/results/simulation_CRT_timing.rds new file mode 100644 index 0000000..c5f40b6 Binary files /dev/null and b/results/simulation_CRT_timing.rds differ
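
The error-capturing hunks in `analyze_cluster_RCT.R` above rely on a `quiet_safe_lmer` wrapper that is defined elsewhere in the book's code. A minimal sketch of that pattern, assuming (as the surrounding code suggests, but the patch does not show) that the wrapper composes `purrr::safely()` over `purrr::quietly()` so a single call captures errors, warnings, and messages without ever halting a long simulation run:

```r
library(purrr)

# Hypothetical reconstruction of the quiet-safe wrapper pattern:
# quietly() records warnings and messages; safely() turns a thrown error
# into a NULL $result plus an $error condition object.
quiet_safe <- function(f) {
  safely(quietly(f))
}

quiet_safe_log <- quiet_safe(log)

ok  <- quiet_safe_log(10)      # ok$result$result holds log(10); ok$error is NULL
bad <- quiet_safe_log("ten")   # bad$result is NULL; bad$error holds the condition

# Boolean flags in the style of analysis_MLM_safe():
had_error   <- is.null(ok$result)
had_warning <- length(ok$result$warnings) > 0
```

With this structure, `is.null( M1$result )` detects an outright error, while `length( M1$message ) > 0` and `length( M1$warning ) > 0` flag captured messages and warnings (R's `$` partial matching resolves these to the `messages` and `warnings` components that `quietly()` returns).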