Benchmarking Performance Different Than Reported

Hi, thanks for such an all-around repo for working with 3DSG planning!

I would like to reproduce the benchmarking results under the `benchmark` folder of your repo to make sure everything runs properly before testing my own planners. However, in my testing the planners behave quite differently from what is reported.

As of 07/20/2023, I ran all available planners in `pddlgym_planners/__init__.py` on the PDDL domain `taskographyv2tiny1` using the command `python scripts/benchmark/plan.py --domain-name $DOMAIN_NAME --planner $PLANNER`. The results are as follows:

1. `FF`: error while running
> gcc -o ff main.o memory.o output.o parse.o inst_pre.o inst_easy.o inst_hard.o inst_final.o orderings.o relax.o search.o scan-fct_pddl.tab.o scan-ops_pddl.tab.o  -Wall -g -std=gnu99    -O6 -lm
/usr/bin/ld: search.o:/home/fjd/miniconda3/envs/taskographypy37/lib/python3.7/site-packages/pddlgym_planners/FF-v2.3/search.c:110: multiple definition of `lcurrent_goals'; relax.o:/home/fjd/miniconda3/envs/taskographypy37/lib/python3.7/site-packages/pddlgym_planners/FF-v2.3/relax.c:111: first defined here
/usr/bin/ld: scan-fct_pddl.tab.o:/home/fjd/miniconda3/envs/taskographypy37/lib/python3.7/site-packages/pddlgym_planners/FF-v2.3/lex-fct_pddl.l:9: multiple definition of `gbracket_count'; main.o:/home/fjd/miniconda3/envs/taskographypy37/lib/python3.7/site-packages/pddlgym_planners/FF-v2.3/main.c:147: first defined here
collect2: error: ld returned 1 exit status
make: *** [makefile:74: ff] Error 1
2. `FF-X`: the same error as FF
3. `FD-lama-first`: plan failure
> {'failure_rate': 1.0,
 'num_node_expansions': nan,
 'num_node_expansions_std': nan,
 'plan_length': nan,
 'plan_length_std': nan,
 'search_time': nan,
 'search_time_std': nan,
 'success_rate': 0.0,
 'timeout_rate': 0.0,
 'total_time': nan,
 'total_time_std': nan}
4. `Cerberus-seq-sat`: plan failure
> {'failure_rate': 1.0,
 'num_node_expansions': nan,
 'num_node_expansions_std': nan,
 'plan_length': nan,
 'plan_length_std': nan,
 'search_time': nan,
 'search_time_std': nan,
 'success_rate': 0.0,
 'timeout_rate': 0.0,
 'total_time': nan,
 'total_time_std': nan}
5. `Cerberus-seq-agl`: plan failure
> {'failure_rate': 1.0,
 'num_node_expansions': nan,
 'num_node_expansions_std': nan,
 'plan_length': nan,
 'plan_length_std': nan,
 'search_time': nan,
 'search_time_std': nan,
 'success_rate': 0.0,
 'timeout_rate': 0.0,
 'total_time': nan,
 'total_time_std': nan}
6. `DecStar-agl-decoupled`: plan failure
> {'failure_rate': 1.0,
 'num_node_expansions': nan,
 'num_node_expansions_std': nan,
 'plan_length': nan,
 'plan_length_std': nan,
 'search_time': nan,
 'search_time_std': nan,
 'success_rate': 0.0,
 'timeout_rate': 0.0,
 'total_time': nan,
 'total_time_std': nan}
7. `lapkt-bfws`: slightly different results from those in `benchmark/taskographyv2tiny1_bfws`. My result:
> 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [03:21<00:00,  5.04s/it]
{'failure_rate': 0.0,
 'num_node_expansions': 468.48387096774195,
 'num_node_expansions_std': 192.6469059835003,
 'plan_length': 14.709677419354838,
 'plan_length_std': 3.828530825661262,
 'search_time': 0.4536315483870968,
 'search_time_std': 0.3696494008728636,
 'success_rate': 0.775,
 'timeout_rate': 0.225,
 'total_time': 0.4536315483870968,
 'total_time_std': 0.3696494008728636}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55/55 [05:57<00:00,  6.51s/it]
{'failure_rate': 0.0,
 'num_node_expansions': 573.3225806451613,
 'num_node_expansions_std': 338.3147405651472,
 'plan_length': 15.32258064516129,
 'plan_length_std': 4.394917128465223,
 'search_time': 0.5754497419354839,
 'search_time_std': 0.8765903350261305,
 'success_rate': 0.5636363636363636,
 'timeout_rate': 0.43636363636363634,
 'total_time': 0.5754497419354839,
 'total_time_std': 0.8765903350261305}

Reported in `benchmark/taskographyv2tiny1_bfws/taskographyv2tiny1_bfws_test.json`:
> {
    "failure_rate": 0.0,
    "num_node_expansions": 609.6279069767442,
    "num_node_expansions_std": 339.64208406455214,
    "plan_length": 15.55813953488372,
    "plan_length_std": 4.15570398469826,
    "search_time": 0.8969197023255813,
    "search_time_std": 1.3382104019851668,
    "success_rate": 0.7818181818181819,
    "timeout_rate": 0.21818181818181817,
    "total_time": 0.8969197023255813,
    "total_time_std": 1.3382104019851668
}

8. `FD-seq-opt-lmcut`: plan failure
> {'failure_rate': 1.0,
 'num_node_expansions': nan,
 'num_node_expansions_std': nan,
 'plan_length': nan,
 'plan_length_std': nan,
 'search_time': nan,
 'search_time_std': nan,
 'success_rate': 0.0,
 'timeout_rate': 0.0,
 'total_time': nan,
 'total_time_std': nan}

9. `Delfi`: plan failure
> {'failure_rate': 1.0,
 'num_node_expansions': nan,
 'num_node_expansions_std': nan,
 'plan_length': nan,
 'plan_length_std': nan,
 'search_time': nan,
 'search_time_std': nan,
 'success_rate': 0.0,
 'timeout_rate': 0.0,
 'total_time': nan,
 'total_time_std': nan}

10. `DecStar-opt-decoupled`: plan failure
> {'failure_rate': 1.0,
 'num_node_expansions': nan,
 'num_node_expansions_std': nan,
 'plan_length': nan,
 'plan_length_std': nan,
 'search_time': nan,
 'search_time_std': nan,
 'success_rate': 0.0,
 'timeout_rate': 0.0,
 'total_time': nan,
 'total_time_std': nan}
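
To put the `lapkt-bfws` gap from item 7 into concrete numbers, the success rates can be converted back into solved-problem counts. This is a minimal sketch, assuming that my 55-problem run corresponds to the reported test split and that the reported JSON was also computed over 55 problems (the reported rate of 0.7818... is consistent with 43/55, but the file itself does not state the count). The helper names `solved_count` and `metric_deltas` are made up for illustration.

```python
def solved_count(success_rate, num_problems):
    """Convert a success rate back into a whole number of solved problems."""
    return round(success_rate * num_problems)


def metric_deltas(mine, reported):
    """Return mine - reported for every metric present in both dicts."""
    return {key: mine[key] - reported[key] for key in mine if key in reported}


# Values copied from the outputs above (55-problem run vs. the reported test JSON).
mine_55 = {
    "success_rate": 0.5636363636363636,
    "timeout_rate": 0.43636363636363634,
    "plan_length": 15.32258064516129,
    "search_time": 0.5754497419354839,
}
reported_test = {
    "success_rate": 0.7818181818181819,
    "timeout_rate": 0.21818181818181817,
    "plan_length": 15.55813953488372,
    "search_time": 0.8969197023255813,
}

NUM_PROBLEMS = 55  # assumption: same problem count for both runs

print(solved_count(mine_55["success_rate"], NUM_PROBLEMS))        # 31
print(solved_count(reported_test["success_rate"], NUM_PROBLEMS))  # 43
print(metric_deltas(mine_55, reported_test))
```

Under those assumptions, my run solves 31 of 55 problems where the reported run solves 43, and the missing 12 all show up as timeouts rather than failures.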

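For anyone who wants to repeat the ten runs listed above, they can be scripted roughly as follows. This is only a sketch: the planner names are the ones I listed in this issue (the actual list in `pddlgym_planners/__init__.py` may differ), the flags are the ones from the command above, and `build_command` and `run_all` are hypothetical helper names. It assumes the working directory is the repository root.

```python
import subprocess
import sys

# Planner names as listed in this issue; not read from pddlgym_planners itself.
PLANNERS = [
    "FF",
    "FF-X",
    "FD-lama-first",
    "Cerberus-seq-sat",
    "Cerberus-seq-agl",
    "DecStar-agl-decoupled",
    "lapkt-bfws",
    "FD-seq-opt-lmcut",
    "Delfi",
    "DecStar-opt-decoupled",
]


def build_command(domain_name, planner):
    """Build the argv list for one benchmark run, using the flags shown above."""
    return [
        sys.executable,
        "scripts/benchmark/plan.py",
        "--domain-name", domain_name,
        "--planner", planner,
    ]


def run_all(domain_name="taskographyv2tiny1"):
    """Run every planner in turn and return a {planner: exit code} mapping."""
    exit_codes = {}
    for planner in PLANNERS:
        completed = subprocess.run(build_command(domain_name, planner))
        exit_codes[planner] = completed.returncode
    return exit_codes


# Usage (from the repository root): print(run_all())
```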

I followed the installation instructions at https://github.com/taskography/taskography-api#installation with only a few changes to fix some errors:
0. Ubuntu 22.04.
1. Create an empty conda env with python=3.7.
2. Add a comma `,` at the end of line https://github.com/taskography/taskography-api/blob/bcb47fc070fd25b1438421d9eb1f79f701945ad5/setup.py#L26 to separate the two lines.
3. Run `pip install -e .` and `pip install -r requirements.txt`.
4. Downgrade `importlib-metadata` from 6.7.0 to 4.12.0 to avoid error `'EntryPoints' object has no attribute 'get'`. Source: https://stackoverflow.com/questions/73929564/entrypoints-object-has-no-attribute-get-digital-ocean
5. Move `from __future__ import annotations` to the first line to avoid error `from __future__ imports must occur at the beginning of the file`. Source: https://stackoverflow.com/questions/38688504/from-future-imports-must-occur-at-the-beginning-of-the-file-what-defines
7. Run `scripts/validate/loader.py` and `scripts/validate/taskography_env.py`; both pass.
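
The `importlib-metadata` pin from step 4 can be checked before running anything else. This is a minimal sketch: as far as I know, the dict-style `EntryPoints.get` access was dropped in the 5.0 release of `importlib-metadata`, so the check below treats any major version of 5 or higher as needing a downgrade. The helper names `needs_downgrade` and `installed_version` are made up for illustration.

```python
def needs_downgrade(version):
    """Return True if this importlib-metadata version is 5.0 or newer.

    Assumption: versions from 5.0 on no longer offer EntryPoints.get,
    which is what triggers the error described in step 4.
    """
    major = int(version.split(".")[0])
    return major >= 5


def installed_version():
    """Look up the installed importlib-metadata version string."""
    try:
        from importlib_metadata import version  # backport, needed on Python 3.7
    except ImportError:
        from importlib.metadata import version  # standard library on Python 3.8+
    return version("importlib-metadata")


print(needs_downgrade("6.7.0"))   # True: the version I had before step 4
print(needs_downgrade("4.12.0"))  # False: the version I downgraded to
```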

I'm happy to provide more details if needed. I would highly appreciate any help, as a solid benchmark is a prerequisite for any future research. Thanks in advance!
