How to train and evaluate policy models with unified dataset format? #191

@JamesCao2048

Hi there, I noticed that there are APIs to load NLU, DST, policy, and NLG data in the unified data format. I also found the training and evaluation guides for NLU/DST/NLG with unified data in $model/README.md or NLU/DST/NLG/evaluate_unified_datasets.py. However, I could not find a guide on how to train and evaluate policy models with the unified data format. Specifically, I have the following questions:

  1. Training: I did not find support for training with the unified data format in $policy_model/train.py, such as ppo/train.py and mle/train.py. They seem to use MultiWozEvaluator by default.
  2. Evaluation: I did not find support for evaluation with the unified data format in policy/evaluate.py either. It also seems to use MultiWozEvaluator by default.
  3. My training experiment: I trained a PPO policy with the config file base_pipeline_rule_user.json (initialized with MLE policy weights trained with the default config) and got this result: Best Complete Rate: 0.95, Best Success Rate: 0.5, Best Average Return: 4.5. It is a good start for me, but still worse than the BERTNLU | RuleDST | PPOPolicy | TemplateNLG evaluation in the ConvLab2 README (75.5 completion rate and 71.7 success rate). Where does this gap come from?
  4. My evaluation experiment: I evaluated my previously trained PPO model with policy/evaluate.py, but got a much worse result: "Complete 500 0.372 Success 500 0.228 Success strict 500 0.174". During the evaluation there were two warnings: "Value not found in standard value set: [dontcare] (slot: name domain: restaurant)" and "Value [none] invalid! (Lexicalisation Error) (slot: name domain: hotel)". These look like a dataset format mismatch between the training and evaluation processes, but I am not sure whether I used the original MultiWOZ format or the unified data format to train and evaluate my policy model.
  5. User simulator: I found that tus, emoUS and genTUS can be trained and evaluated with the unified data format. However, I did not find unified data format support in the rule-based user simulator. Does that mean that if I train my models (NLU/NLG or policy) with the unified data format, I cannot evaluate them with the rule-based user simulator?
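
To make the suspicion in question 4 more concrete, here is a minimal sketch of the kind of value check I imagine sits behind those two warnings. This is only my guess at the mechanism: the value set, the placeholder list, and the `check_value` helper are all hypothetical and are not taken from the ConvLab code. Only the two (domain, slot, value) triples come from the warnings quoted above.

```python
# Hypothetical stand-in for a per-slot "standard value set".
# The entity names are made up for illustration, not real ontology entries.
STANDARD_VALUES = {
    ("restaurant", "name"): {"example restaurant a", "example restaurant b"},
    ("hotel", "name"): {"example hotel a", "example hotel b"},
}

# Assumed placeholder values that do not name a concrete entity.
PLACEHOLDERS = {"none", "dontcare", ""}


def check_value(domain: str, slot: str, value: str) -> str:
    """Classify a slot value as 'ok', 'placeholder', or 'unknown'."""
    normalized = value.strip().lower()
    if normalized in PLACEHOLDERS:
        return "placeholder"
    if normalized in STANDARD_VALUES.get((domain, slot), set()):
        return "ok"
    return "unknown"


# The two triples reported in the evaluation warnings.
warnings_seen = [
    ("restaurant", "name", "dontcare"),
    ("hotel", "name", "none"),
]
results = [check_value(d, s, v) for d, s, v in warnings_seen]
print(results)
```

If the policy emits placeholder values such as `dontcare` or `none` for a `name` slot, a check like this would flag them no matter which data format was used, so the sketch does not by itself tell me which format my training run actually consumed.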

Looking forward to your reply,
James Cao
