naist-nlp/atd-vso

ATD-VSO: NAIST Academic Travelogue Dataset with Visit Status and Visiting Order Annotation

How to Restore the ATD-VSO Data

  1. Restore the ATD-MCL data. Please refer to the site: https://github.com/naist-nlp/atd-mcl/.
  2. Copy or move the atd-mcl/full and atd-mcl/meta directories under the atd-mcl directory.
  3. Execute bin/gen_full_data.sh.
    • The restored data will be placed at atd-vso/full/main/json and atd-vso/full/main/json_per_doc/.
    • The data used for calculating inter-annotator agreement scores will be placed at atd-vso/full/agreement/.

Data Format

JSON Data Format

The JSON data (atd-vso/full/main/json and atd-vso/full/main/json_per_doc) holds full annotation information as follows.

  • Each document object is stored under a key that is its document ID (e.g., 00019). A document object contains sections, sentences, mentions, entities, entity pairs, and a visit order graph.
     {
       "00019": {
         "sections": {
           "001": {
           ...
           },
         },
         "sentences": {
           "001": {
           ...
           },
         },
         "mentions": {
           "001": {
             ...
           },
         },
         "entities": {
           "C1": {
             ...
           }
         },
         "entity_pairs": {
           "P001": {
             ...
           },
         },
         "visit_order_graph": {
           ...
         },
       }
     }
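As a quick sanity check after restoring the data, the top-level layout described above can be verified in a few lines. This is a minimal sketch, not part of the repository: the helper `missing_keys` and the tiny inline `raw` string are hypothetical stand-ins, and only the six key names come from the example above.

```python
import json

# Top-level keys taken from the document example above.
EXPECTED_KEYS = ("sections", "sentences", "mentions", "entities",
                 "entity_pairs", "visit_order_graph")


def missing_keys(documents):
    """Map each document ID to the expected keys it lacks (empty list if none)."""
    return {doc_id: [key for key in EXPECTED_KEYS if key not in doc]
            for doc_id, doc in documents.items()}


# Hand-written stand-in for a restored file (assumption); real files hold
# fully populated objects under each key.
raw = ('{"00019": {"sections": {}, "sentences": {}, "mentions": {}, '
       '"entities": {}, "entity_pairs": {}, "visit_order_graph": {}}}')
documents = json.loads(raw)
report = missing_keys(documents)
print(report)
```

For a real file, `json.load` on an open file handle would replace the `json.loads(raw)` call.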
    
  • A section object under sections is as follows:
    "sections": {
      "001": {
        "sentence_ids": [
          "001",
          "002",
          "003"
        ]
      },
    ...
    
  • A sentence object under sentences is as follows:
    • A sentence object may have one or more geographic entity mentions.
    • A sentence ID with a branch number (e.g., "026-01" and "026-02") indicates that a single sentence in the original ATD data was split into multiple sentences.
    "sentences": {
      "001": {
        "section_id": "001",
        "text": "奈良公園のアイドル「しか」で~す。",
        "mention_ids": [
          "001"
        ]
      },
      ...
      "026-01": {
        "section_id": "013",
        "span_in_orig_text": [
          0,
          8
        ],
        "text": "とにかく広~い!",
        "mention_ids": []
      },
      "026-02": {
        "section_id": "013",
        "span_in_orig_text": [
          8,
          16
        ],
        "text": "そして静かです。",
        "mention_ids": []
      },
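Split sentences can be joined back into the original sentence by grouping on the part of the ID before the branch number. The sketch below assumes that `span_in_orig_text` holds `[start, end)` character offsets into the original sentence, which matches the example above but is not stated explicitly; `rebuild_original_sentences` is a hypothetical helper name.

```python
from collections import defaultdict

# Sample entries copied from the example above.
sentences = {
    "001": {"section_id": "001", "text": "奈良公園のアイドル「しか」で~す。",
            "mention_ids": ["001"]},
    "026-01": {"section_id": "013", "span_in_orig_text": [0, 8],
               "text": "とにかく広~い!", "mention_ids": []},
    "026-02": {"section_id": "013", "span_in_orig_text": [8, 16],
               "text": "そして静かです。", "mention_ids": []},
}


def rebuild_original_sentences(sentences):
    """Join branch sentences (e.g. 026-01, 026-02) in order of their start offset."""
    parts = defaultdict(list)
    for sent_id, sent in sentences.items():
        base_id = sent_id.split("-")[0]
        # Unsplit sentences carry no span, so treat them as starting at 0.
        start = sent.get("span_in_orig_text", [0])[0]
        parts[base_id].append((start, sent["text"]))
    return {base_id: "".join(text for _, text in sorted(chunks))
            for base_id, chunks in parts.items()}


rebuilt = rebuild_original_sentences(sentences)
print(rebuilt["026"])
```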
    
  • A mention object under mentions is as follows:
    • A mention object may be associated with an entity.
    • visit_status is a visit status label.
    "mentions": {
      "001": {
        "sentence_id": "001",
        "entity_id": "C1",
        "span": [
          0,
          4
        ],
        "text": "奈良公園",
        "entity_type": "FAC_NAME",
        "visit_status": "Visit"
      },
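The relation between a mention's `span` and its sentence can be checked directly. This sketch assumes that `span` is a `[start, end)` character range into the text of the sentence named by `sentence_id`, as the example suggests (`[0, 4]` covers 奈良公園); `mention_surface` is a hypothetical helper.

```python
# Sample objects copied from the sentence and mention examples above.
sentences = {
    "001": {"section_id": "001", "text": "奈良公園のアイドル「しか」で~す。",
            "mention_ids": ["001"]},
}
mentions = {
    "001": {"sentence_id": "001", "entity_id": "C1", "span": [0, 4],
            "text": "奈良公園", "entity_type": "FAC_NAME",
            "visit_status": "Visit"},
}


def mention_surface(sentences, mention):
    """Slice the mention's surface string out of its sentence text."""
    start, end = mention["span"]
    return sentences[mention["sentence_id"]]["text"][start:end]


surface = mention_surface(sentences, mentions["001"])
print(surface == mentions["001"]["text"])
```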
    
  • An entity object, which corresponds to a coreference cluster of one or more mentions, under entities is as follows:
    • An entity object is associated with one or more mentions.
    • accessibility is an accessibility label.
      • Open: the entity is available for use.
      • Unopened: the entity is scheduled to become available in the future, but at the time of writing the travelogue it has not yet opened.
      • Closed: the entity was available in the past, but at the time of writing the travelogue it has already closed.
    • number_of_visits indicates how many times the entity was visited.
    • visit_order_unary_relation is one of the following unary relations: "UnknownTime", "AnnotationDisagreement", "UnknownRegion", "Unassigned", "Other".
    • entity_pair_ids lists the IDs of the entity pairs in which the entity appears.
    "entities": {
      "C1": {
        "normalized_name": "奈良公園",
        "entity_label_merged": "FAC",
        "has_name": true,
        "has_reference": true,
        "best_ref_type": "OSM",
        "best_ref_url": "https://www.openstreetmap.org/way/456314269",
        "best_ref_query": "奈良公園",
        "member_mention_ids": [
          "001",
          "011"
        ],
        "visit_status": "Visit",
        "accessibility": "Open",
        "number_of_visits": 1,
        "visit_order_unary_relation": {
          "1": null
        },
        "entity_pair_ids": {
          "1": [
            "P005",
            "P007",
            "P008",
            "P011",
            "P012"
          ]
        }
      },
    
  • An entity pair object under entity_pairs is as follows:
    • Each pair has its ID, such as "P001".
    • paired_entity_ids consists of two entity IDs.
    • visit_order_binary_relation is one of the following binary relations: "Before", "Includes", "BeforeOrIncludes", "Equals", "Overlaps".
    "entity_pairs": {
      "P001": {
        "paired_entity_ids": [
          [
            "C12",
            1
          ],
          [
            "C13",
            1
          ]
        ],
        "visit_order_binary_relation": "Before"
      },
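For lookups by entity rather than by pair ID, the pairs can be indexed on their two members. In this sketch each member is kept as an `(entity ID, integer)` tuple exactly as it appears in `paired_entity_ids`; the meaning of the integer is not interpreted here, and `index_relations` is a hypothetical helper.

```python
# Sample entry copied from the example above.
entity_pairs = {
    "P001": {
        "paired_entity_ids": [["C12", 1], ["C13", 1]],
        "visit_order_binary_relation": "Before",
    },
}


def index_relations(entity_pairs):
    """Map (first member, second member) to (pair ID, relation label)."""
    index = {}
    for pair_id, pair in entity_pairs.items():
        # JSON lists are unhashable, so convert each member to a tuple.
        first, second = (tuple(member) for member in pair["paired_entity_ids"])
        index[(first, second)] = (pair_id, pair["visit_order_binary_relation"])
    return index


relations = index_relations(entity_pairs)
print(relations[(("C12", 1), ("C13", 1))])
```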
    
  • A visit order graph consists of the following three components:
    • hierarchical_ordered_entities are hierarchical layer IDs.
    • child_and_parent_entities are pairs of entities that stand in a child–parent relation.
    • preceding_and_subsequent_entities are pairs of entities whose visit order is consecutive.
    "visit_order_graph": {
      "hierarchical_ordered_entities": {
        "ES001": {
          "level": 1,
          "parent_entity": [
            "ROOT",
            -1
          ],
          "ordered_entities": [
            [
              "C2",
              1
            ]
          ],
          "unknown_time_entities": []
        },
        ...
      },
      "child_and_parent_entities": [
        [
          [
            "C1",
            1
          ],
          [
            "C2",
            1
          ]
        ],
        ...
      "preceding_and_subsequent_entities": [
        [
          [
            "C1",
            1
          ],
          [
            "C10",
            1
          ]
        ],
        ...
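The two pair lists of the graph can be turned into lookup tables for traversal. This sketch assumes, from the key names only, that each pair is ordered `[child, parent]` and `[preceding, subsequent]`; it also assumes a single parent per child, while successors are collected in lists in case an entity has more than one. `build_lookups` is a hypothetical helper.

```python
from collections import defaultdict

# Sample graph copied from the example above (elided parts left out).
visit_order_graph = {
    "hierarchical_ordered_entities": {
        "ES001": {"level": 1, "parent_entity": ["ROOT", -1],
                  "ordered_entities": [["C2", 1]],
                  "unknown_time_entities": []},
    },
    "child_and_parent_entities": [[["C1", 1], ["C2", 1]]],
    "preceding_and_subsequent_entities": [[["C1", 1], ["C10", 1]]],
}


def build_lookups(graph):
    """Return (child -> parent, preceding -> list of subsequent) tables."""
    parent_of = {tuple(child): tuple(parent)
                 for child, parent in graph["child_and_parent_entities"]}
    next_of = defaultdict(list)
    for preceding, subsequent in graph["preceding_and_subsequent_entities"]:
        next_of[tuple(preceding)].append(tuple(subsequent))
    return parent_of, dict(next_of)


parent_of, next_of = build_lookups(visit_order_graph)
print(parent_of[("C1", 1)], next_of[("C1", 1)])
```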
    

Converting the ATD-VSO Data to the Dataset for Each Task

  • Execute bin/gen_data_for_visit_status_prediction.sh to generate the dataset for the visit status prediction task.
    • The generated data will be saved as atd-vso/full/main/json/*_visit-status.json.
    • The data structure represents mentions and the text in which they appear.
    "doc-00019": {
      "001": {
        "text": "奈良公園のアイドル「しか」で~す。",
        "mention": {
          "sentence_id": "001",
          "entity_id": "C1",
          "span": [
            0,
            4
          ],
          "text": "奈良公園",
          "entity_type": "FAC_NAME",
          "visit_status": "Visit",
          "label_id": 1
        }
        ...
      },
      ...
    },
    
    • Here, doc-00019 is the document ID.
    • The text field contains the context from the original document.
      • This context is what the model actually sees as input: it provides the local context in which a target mention appears.
    • The mention field indicates the mention information.
      • The span field gives the character offsets of the beginning and end of the mention in the text.
      • The visit_status field is the visit status label.
      • The label_id is the ID of the visit status label.
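The nested structure can be flattened into one record per target mention before feeding it to a classifier. This is a minimal sketch under the assumption that every entry below a document ID has the `text` and `mention` fields shown above; `iter_examples` and the output field names are hypothetical.

```python
# Sample entry copied from the example above.
visit_status_data = {
    "doc-00019": {
        "001": {
            "text": "奈良公園のアイドル「しか」で~す。",
            "mention": {
                "sentence_id": "001", "entity_id": "C1", "span": [0, 4],
                "text": "奈良公園", "entity_type": "FAC_NAME",
                "visit_status": "Visit", "label_id": 1,
            },
        },
    },
}


def iter_examples(data):
    """Yield one flat record per (document, entry) with its span and label."""
    for doc_id, entries in data.items():
        for entry_id, entry in entries.items():
            mention = entry["mention"]
            start, end = mention["span"]
            yield {
                "doc_id": doc_id,
                "entry_id": entry_id,
                "text": entry["text"],
                "start": start,
                "end": end,
                "label": mention["visit_status"],
                "label_id": mention["label_id"],
            }


examples = list(iter_examples(visit_status_data))
print(len(examples), examples[0]["label"])
```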
  • Execute bin/gen_data_for_inc_relation_prediction.sh to generate the dataset for the inclusion relation prediction task.
    • The generated data will be saved as atd-vso/full/main/json/*_inc-relation.json.
    • The data structure represents inclusion relations (child-parent) for entities within a document.
    "doc-00019": {
      "C1": {
        "C2": {
          "child_id": "C1",
          "child_text": "奈良公園",
          "parent_id": "C2",
          "parent_text": "奈良",
          "text": "[CHILD]奈良公園[/CHILD]のアイドル「しか」で~す。かわい~せんべいあげたくなるよね~[PARENT]奈良[/PARENT]の有名スポットですよね!",
          "label": 1
        },
        ...
      },
      ...
    },
    
    • Here, doc-00019 is the document ID.
    • C1 refers to a target child entity, and models have to identify its parent entity in the document.
    • For each child entity, candidate parent entities, such as C2, are listed.
    • The text field contains the context from the original document, where the child and parent mentions are explicitly marked.
      • The child entity mention (child_text, here 奈良公園) is wrapped with the tags [CHILD] and [/CHILD].
      • The candidate parent entity mention (parent_text, here 奈良) is wrapped with the tags [PARENT] and [/PARENT].
      • This context is what the model actually sees as input: it provides the local context in which both the child and the candidate parent appear, while clearly indicating which span is the child and which span is the candidate parent for the task.
    • The label field indicates whether the candidate is the correct parent entity: label: 1 marks the correct parent, and label: 0 marks an incorrect candidate.
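The candidate structure and the marker tags can be read as follows. This sketch collects the gold parents (label 1) for each child and pulls the tagged strings out of `text` with regular expressions; the helper names `gold_parents` and `tagged_mentions` are hypothetical, and the tag format is taken from the example above.

```python
import re

CHILD_RE = re.compile(r"\[CHILD\](.*?)\[/CHILD\]", re.DOTALL)
PARENT_RE = re.compile(r"\[PARENT\](.*?)\[/PARENT\]", re.DOTALL)

# Sample entry copied from the example above.
inc_relation_data = {
    "doc-00019": {
        "C1": {
            "C2": {
                "child_id": "C1", "child_text": "奈良公園",
                "parent_id": "C2", "parent_text": "奈良",
                "text": "[CHILD]奈良公園[/CHILD]のアイドル「しか」で~す。かわい~せんべいあげたくなるよね~[PARENT]奈良[/PARENT]の有名スポットですよね!",
                "label": 1,
            },
        },
    },
}


def tagged_mentions(text):
    """Return the strings wrapped in the CHILD and PARENT tags."""
    return CHILD_RE.search(text).group(1), PARENT_RE.search(text).group(1)


def gold_parents(data):
    """Map (document ID, child ID) to the candidate parent IDs with label 1."""
    gold = {}
    for doc_id, children in data.items():
        for child_id, candidates in children.items():
            gold[(doc_id, child_id)] = [
                parent_id for parent_id, cand in candidates.items()
                if cand["label"] == 1
            ]
    return gold


candidate = inc_relation_data["doc-00019"]["C1"]["C2"]
print(tagged_mentions(candidate["text"]), gold_parents(inc_relation_data))
```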
  • Execute bin/gen_data_for_trans_relation_prediction.sh to generate the dataset for the transition relation prediction task.
    • The generated data will be saved as atd-vso/full/main/json/*_trans-relation.json in the following format.
    • The data structure represents transition relations for entities within a document.
    "doc-00019_level-ES002": {
      "C1": {
        "C10": {
          "child_id": "C1",
          "child_count": 1,
          "child_text": "奈良公園",
          "parent_id": "C10",
          "parent_count": 1,
          "parent_text": "古墳",
          "text": "[CHILD]奈良公園[/CHILD]のアイドル「しか」で~す。かわい~せんべいあげたくなるよね~奈良の有名スポットですよね!大仏様はとっても大きかったなぁ~柱に穴があって、くぐるといいらしいんだけど、狭くて無理だった...。たくさんお祈りしてきました。健康と交通安全、あとちょっと金運も...。写真は猿沢池からも見える興福寺の五重塔です。国宝館と東金堂に行く場合は、優待券がHPから印刷できますよ。お昼の眺めもいいんだけど、夜は興福寺の五重塔がライトアップされて、とてもきれいですよ。かわらを寄贈してきました。今頃は名前を書いたかわらが、法隆寺の屋根に使われているのかなぁ。奈良の名物「にゅうめん」のセットを食べました。(写真)美味しかったし、寒い日だったので、あったまりました。眠そうな顔がかわいいです♪~写真だとわかりづらいけど、とっても大きな石が使われています。[PARENT]古墳[/PARENT]の中に入ると、さらに大きさを感じることができます。",
          "label": 1
        },
        ...
      },
      ...
    },
    
    • Here, doc-00019_level-ES002 represents the document ID and hierarchy ID.
    • C1 refers to a target (child) entity, and models have to identify its subsequent (parent) entity in the same hierarchy of inclusion relations.
      • It is assumed that gold inclusion relations have already been identified.
    • For each target entity, candidate subsequent (parent) entities, such as C10, are listed.
    • The text field contains the context from the original document, where the child and parent mentions are explicitly marked.
      • The child entity mention (child_text, here 奈良公園) is wrapped with the tags [CHILD] and [/CHILD].
      • The candidate parent entity mention (parent_text, here 古墳) is wrapped with the tags [PARENT] and [/PARENT].
      • This context is what the model actually sees as input: it provides the local context in which both the child and the candidate parent appear, while clearly indicating which span is the child and which span is the candidate parent for the task.
    • The label field is the ground-truth annotation for the transition relation (1 means that there is a transition from the child entity to this candidate parent entity; 0 would indicate no transition).
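The top-level keys combine a document ID and a hierarchy ID, so they usually need to be split before grouping results by document. The sketch below assumes the key pattern `<document ID>_level-<hierarchy ID>` seen in the single example above; `split_key` and `gold_transitions` are hypothetical helpers, and the long `text` field is left out of the sample for brevity.

```python
# Sample entry copied from the example above, without the "text" field.
trans_relation_data = {
    "doc-00019_level-ES002": {
        "C1": {
            "C10": {
                "child_id": "C1", "child_count": 1, "child_text": "奈良公園",
                "parent_id": "C10", "parent_count": 1, "parent_text": "古墳",
                "label": 1,
            },
        },
    },
}


def split_key(key):
    """Split 'doc-00019_level-ES002' into ('doc-00019', 'ES002')."""
    doc_id, _, level_id = key.partition("_level-")
    return doc_id, level_id


def gold_transitions(data):
    """List (doc ID, hierarchy ID, source, target) for candidates with label 1."""
    transitions = []
    for key, targets in data.items():
        doc_id, level_id = split_key(key)
        for candidates in targets.values():
            for cand in candidates.values():
                if cand["label"] == 1:
                    source = (cand["child_id"], cand["child_count"])
                    target = (cand["parent_id"], cand["parent_count"])
                    transitions.append((doc_id, level_id, source, target))
    return transitions


transitions = gold_transitions(trans_relation_data)
print(transitions)
```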

Contact

  • Hiroki Ouchi <hiroki.ouchi [at] is.naist.jp>

Acknowledgements

This study was supported by JSPS KAKENHI Grant Number JP23K24904. The annotation data was constructed by IR-Advanced Linguistic Technologies Inc.

Citation

Please cite the following paper.

@inproceedings{yamamoto-etal-2025-graph,
    title = "Graph-Structured Trajectory Extraction from Travelogues",
    author = "Yamamoto, Aitaro  and
      Otomo, Hiroyuki  and
      Ouchi, Hiroki  and
      Higashiyama, Shohei  and
      Teranishi, Hiroki  and
      Shindo, Hiroyuki  and
      Watanabe, Taro",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.690/",
    doi = "10.18653/v1/2025.acl-long.690",
    pages = "14116--14132",
    ISBN = "979-8-89176-251-0"
}
