checkpointing #20

@vlimant

With a bit more detail this time, @vloncar. What we would need is to run

mpirun -n 11 --tag-output python3 hyperparameter_search_option3.py --block-size 5 --verbose --example mnist --epochs 100 --num-iterations 20

kill it halfway through (which Slurm might do on an HPC cluster), and then be able to do

mpirun -n 11 --tag-output python3 hyperparameter_search_option3.py --block-size 5 --verbose --example mnist --epochs 100 --num-iterations 20 --resume <some identifier>

which would restart the master optimizer (with the previously fitted values; a coordinator*.pkl file already exists) and resume every master on the model it was training, so that we do not lose too much of what has already been done for that parameter set.
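
To make the request concrete, here is a rough sketch of the per-master part: save the training state after every epoch, and pick it up again on restart. This is only an illustration under assumptions, not code from the repository. `save_checkpoint`, `load_checkpoint`, `run_block`, the `train_one_epoch` callback and the layout of the state dictionary are all hypothetical names. The write goes through a temporary file and `os.replace`, so a job killed in the middle of a write should leave the previous checkpoint intact.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Pickle `state` to `path` via a temp file, so a kill mid-write keeps the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Atomic rename: readers see either the old or the new checkpoint.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or `default` when no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run_block(checkpoint_path, total_epochs, train_one_epoch):
    """Train up to `total_epochs`, resuming from the last completed epoch.

    `train_one_epoch(epoch)` is a hypothetical callback that returns a metric.
    """
    state = load_checkpoint(checkpoint_path, {"epoch": 0, "history": []})
    for epoch in range(state["epoch"], total_epochs):
        state["history"].append(train_one_epoch(epoch))
        state["epoch"] = epoch + 1
        save_checkpoint(state, checkpoint_path)
    return state
```

With something like this, each master would pass a checkpoint path derived from the `--resume` identifier and its block number. A real version would also have to store the model weights and optimizer state, which this sketch leaves out.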
