max_time Exceeded in runArraySimulation Saves Unusable Results #79

@mronkko

Summary

When max_time is exceeded in runArraySimulation(), the function saves a result file that appears valid but contains no usable simulation results. This contradicts the documentation's promise that "any evaluations completed before the cluster is terminated can be saved."

Expected Behavior

According to the documentation (lines 93-100 in R/runArraySimulation.R):

max_time specifies the maximum time allowed for a single simulation condition to execute... In general, this input should be set to somewhere around 80-90% of the true termination time so that any evaluations completed before the cluster is terminated can be saved.

When max_time is exceeded, the saved file should contain:

  1. All successfully completed replications in stored_results
  2. Summary statistics computed from those partial results
  3. A clear indicator of actual replications completed (e.g., REPLICATIONS should show the actual count, not the target)

Actual Behavior

When max_time is exceeded, the saved RDS file contains:

  • Empty stored_results (list())
  • No summary statistics (only design conditions and metadata columns)
  • Misleading REPLICATIONS count (shows target count, e.g., 10000, not actual completions)
  • No way to determine how many replications actually completed

Example from simulation where some designs exceeded the time limit

Successful completion (Design 11):

str(readRDS("results/sim_res-11.rds"))
# SimDesign [1 × 79] - Contains all 66 summary statistics
# $ REPLICATIONS  : num 10000
# $ stored_results: tibble [10,000 × 23]  # All replications present

Failed due to max_time (Design 12):

str(readRDS("results/sim_res-12.rds"))
# SimDesign [1 × 11] - Only 7 design + 4 metadata columns
# $ REPLICATIONS  : int 10000  # THIS IS THE TARGET, NOT ACTUAL
# $ SIM_TIME      : num 3000   # Hit the time limit
# $ stored_results: list()     # EMPTY - no replications saved
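
Until this is fixed, one way to screen saved files for the symptom is sketched below. It assumes only what the two str() outputs above suggest: that each saved object exposes a stored_results element through `$`, and that this element is empty after a timeout. The helper name flag_empty_results is hypothetical and not part of SimDesign.

```r
# Hypothetical helper (not part of SimDesign): return the paths of saved
# result files whose stored_results element is missing or empty, which is
# the symptom shown for Design 12 above.
flag_empty_results <- function(files) {
  empty <- vapply(files, function(f) {
    res <- readRDS(f)
    stored <- res$stored_results  # assumption: accessible via `$`
    # NULL, an empty list(), or a zero-row table all mean nothing was kept
    is.null(stored) || length(stored) == 0L || isTRUE(NROW(stored) == 0L)
  }, logical(1))
  names(empty)[empty]
}

# Usage, assuming the files live in results/ as in the example above:
# flag_empty_results(list.files("results", pattern = "\\.rds$", full.names = TRUE))
```

This only detects the empty-file case; it cannot recover how many replications actually completed, which is the information the issue asks to have saved.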

Impact

This is a critical issue for HPC users because:

  1. Silent failure: The file exists, so SimCheck() may not flag it as problematic
  2. No diagnostic information: Cannot determine if 0, 247, or 9,999 replications completed
  3. Wasted computation: All completed replications are lost, forcing complete re-runs
  4. Impossible to optimize: Cannot make informed decisions about time allocation adjustments

Root Cause Analysis (By Claude Code)

Looking at the code flow:

  1. lapply_timer() (R/util.R:376-402) correctly handles timeouts and returns partial results with a message:

    if(time_left <= 0){
        message(sprintf("Simulation terminated due to max_time constraint (%i/%i replications evaluated).", i, length(ret)))
        ret <- ret[1L:i]  # Return partial results
        break
    }
  2. However, when partial results are returned to the analysis workflow (R/analysis.R):

    • obs_reps <- length(results) (line 205) correctly captures the actual count
    • ret <- c(sim_results, 'REPLICATIONS'=obs_reps, ...) (line 231) should save it
    • But if summarise() fails or the workflow terminates early, this never happens
  3. runSimulation() falls back to the input parameter (R/runSimulation.R:1872):

    REPLICATIONS=replications  # Uses target, not actual
  4. runArraySimulation() saves whatever is returned (R/runArraySimulation.R:376):

    saveRDS(ret, filename.u)  # Saves incomplete/empty result
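
The pattern the analysis above points toward can be sketched as a time-budgeted loop that returns the observed replication count together with the partial results, so that no caller needs to fall back to the target. This is an illustrative stand-in, not the SimDesign implementation; run_with_budget and its return fields are hypothetical names.

```r
# Illustrative sketch only (NOT SimDesign code): run up to n replications
# within a time budget and report how many actually completed.
run_with_budget <- function(n, fun, max_time) {
  n <- as.integer(n)
  ret <- vector("list", n)
  completed <- 0L
  start <- proc.time()[["elapsed"]]
  for (i in seq_len(n)) {
    time_left <- max_time - (proc.time()[["elapsed"]] - start)
    if (time_left <= 0) {
      message(sprintf(
        "Terminated due to max_time constraint (%i/%i replications evaluated).",
        completed, n))
      break
    }
    ret[i] <- list(fun(i))  # list() wrapper keeps NULL results in place
    completed <- i
  }
  # Carry the observed count alongside the results
  list(results = ret[seq_len(completed)],
       completed = completed,
       target = n,
       timed_out = completed < n)
}
```

Because completed travels with the results, a downstream step could write it to REPLICATIONS even when summarising is skipped.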

Suggested Fix

The fix should ensure that when max_time is exceeded:

  1. Save actual replication count: Even if summarise() isn't called, store the actual number of completed replications
  2. Save partial results: Ensure stored_results contains all completed replications
  3. Add completion status flag: Include a field like INCOMPLETE = TRUE or TIMEOUT = TRUE
  4. Call summarise on partial data: Compute summary statistics from whatever completed, even if incomplete

Example structure for incomplete results:

list(
  REPLICATIONS_TARGET = 10000,
  REPLICATIONS_COMPLETED = 247,
  INCOMPLETE = TRUE,
  TIMEOUT_REASON = "max_time",
  stored_results = <partial results>,
  <summary statistics from partial data>
)
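
A minimal sketch of how such a structure might be assembled from partial results follows. The field names come from the example above; build_partial_result and summarise_fun are hypothetical names introduced for illustration, and the sketch assumes the completed replications arrive as a plain list.

```r
# Hypothetical sketch (not SimDesign code): assemble a result object for a
# run that may have stopped early.
build_partial_result <- function(partial, target, summarise_fun) {
  completed <- length(partial)
  incomplete <- completed < target
  # Summarise only when at least one replication finished
  summary_stats <- if (completed > 0L) summarise_fun(partial) else list()
  c(list(REPLICATIONS_TARGET = target,
         REPLICATIONS_COMPLETED = completed,
         INCOMPLETE = incomplete,
         TIMEOUT_REASON = if (incomplete) "max_time" else NA_character_,
         stored_results = partial),
    as.list(summary_stats))
}

# Example with three made-up replication values and a toy summary function
res <- build_partial_result(
  partial = list(0.9, 1.1, 1.0),
  target = 10000,
  summarise_fun = function(x) c(mean_est = mean(unlist(x)))
)
str(res)
```

Keeping the target and completed counts as separate fields avoids overloading REPLICATIONS with two meanings.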
