- Restore the ATD-MCL data. Please refer to the site: https://github.com/naist-nlp/atd-mcl/.
- Copy or move the `atd-mcl/full` and `atd-mcl/meta` directories under the `atd-mcl` directory.
- Execute `bin/gen_full_data.sh`.
  - The restored data will be placed at `atd-vso/full/main/json` and `atd-vso/full/main/json_per_doc/`.
  - The data used for calculating inter-annotator agreement scores will be placed at `atd-vso/full/agreement/`.
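
After running the script, a quick check that the output locations named above exist can catch a misplaced `atd-mcl` directory early. The following is a minimal sketch, not part of the repository: the helper name `missing_outputs` and its `base_dir` argument are hypothetical, and it assumes the `atd-vso/...` paths are relative to the directory passed in.

```python
from pathlib import Path

# Output locations listed in the restoration steps above.
EXPECTED_DIRS = (
    "atd-vso/full/main/json",
    "atd-vso/full/main/json_per_doc",
    "atd-vso/full/agreement",
)


def missing_outputs(base_dir):
    """Return the expected output directories that do not exist under base_dir."""
    base = Path(base_dir)
    # Assumption: the paths above are relative to base_dir.
    return [rel for rel in EXPECTED_DIRS if not (base / rel).is_dir()]
```

An empty list from `missing_outputs(".")` would suggest that all three directories were created; any returned path points to a step worth re-checking.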
The JSON data (`atd-vso/full/main/json` and `atd-vso/full/main/json_per_doc`) holds full annotation information as follows.

- A document object value is associated with a key that represents the document ID (e.g., `00019`). Each document object has the sets of `sections`, `sentences`, `mentions`, `entities`, and `entity_pairs`, as well as a `visit_order_graph`.

      {
        "00019": {
          "sections": {
            "001": { ... },
          },
          "sentences": {
            "001": { ... },
          },
          "mentions": {
            "001": { ... },
          },
          "entities": {
            "C1": { ... },
          },
          "entity_pairs": {
            "P001": { ... },
          },
          "visit_order_graph": { ... },
        }
      }

- A section object under `sections` is as follows:

      "sections": {
        "001": {
          "sentence_ids": [ "001", "002", "003" ]
        },
        ...

- A sentence object under `sentences` is as follows:
  - A sentence object may have one or more geographic entity mentions.
  - Sentence IDs with a branch number (e.g., "026-01" and "026-02") indicate that a single sentence in the original ATD data was split into multiple sentences.

      "sentences": {
        "001": {
          "section_id": "001",
          "text": "奈良公園のアイドル「しか」で~す。",
          "mention_ids": [ "001" ]
        },
        ...
        "026-01": {
          "section_id": "013",
          "span_in_orig_text": [ 0, 8 ],
          "text": "とにかく広~い!",
          "mention_ids": []
        },
        "026-02": {
          "section_id": "013",
          "span_in_orig_text": [ 8, 16 ],
          "text": "そして静かです。",
          "mention_ids": []
        },

- A mention object under `mentions` is as follows:
  - A mention object may be associated with an entity.
  - `visit_status` is a visit status label.

      "mentions": {
        "001": {
          "sentence_id": "001",
          "entity_id": "C1",
          "span": [ 0, 4 ],
          "text": "奈良公園",
          "entity_type": "FAC_NAME",
          "visit_status": "Visit"
        },

- An entity object under `entities`, which corresponds to a coreference cluster of one or more mentions, is as follows:
  - An entity object is associated with one or more mentions.
  - `accessibility` is an accessibility label.
    - Open: the entity is available for use.
    - Unopened: the entity is scheduled to become available in the future, but at the time of writing the travelogue it has not yet opened.
    - Closed: the entity was available in the past, but at the time of writing the travelogue it has already closed.
  - `number_of_visits` indicates how many times the entity was visited.
  - `visit_order_unary_relation` is one of the following unary relations: "UnknownTime", "AnnotationDisagreement", "UnknownRegion", "Unassigned", "Other".
  - `entity_pair_ids` lists the IDs of the entity pairs (entries under `entity_pairs`) in which the entity appears.

      "entities": {
        "C1": {
          "normalized_name": "奈良公園",
          "entity_label_merged": "FAC",
          "has_name": true,
          "has_reference": true,
          "best_ref_type": "OSM",
          "best_ref_url": "https://www.openstreetmap.org/way/456314269",
          "best_ref_query": "奈良公園",
          "member_mention_ids": [ "001", "011" ],
          "visit_status": "Visit",
          "accessibility": "Open",
          "number_of_visits": 1,
          "visit_order_unary_relation": { "1": null },
          "entity_pair_ids": { "1": [ "P005", "P007", "P008", "P011", "P012" ] }
        },

- An entity pair object under `entity_pairs` is as follows:
  - Each pair has its ID, such as "P001".
  - `paired_entity_ids` consists of two entity IDs.
  - `visit_order_binary_relation` is one of the following binary relations: "Before", "Includes", "BeforeOrIncludes", "Equals", "Overlaps".

      "entity_pairs": {
        "P001": {
          "paired_entity_ids": [ [ "C12", 1 ], [ "C13", 1 ] ],
          "visit_order_binary_relation": "Before"
        },

- A visit order graph consists of the following three components:
  - `hierarchical_ordered_entities` holds the entities of each hierarchical layer, keyed by hierarchical layer ID (e.g., "ES001").
  - `child_and_parent_entities` are pairs of entities that stand in a child-parent relation.
  - `preceding_and_subsequent_entities` are pairs of entities whose visit order is consecutive.

      "visit_order_graph": {
        "hierarchical_ordered_entities": {
          "ES001": {
            "level": 1,
            "parent_entity": [ "ROOT", -1 ],
            "ordered_entities": [ [ "C2", 1 ] ],
            "unknown_time_entities": []
          },
          ...
        },
        "child_and_parent_entities": [
          [ [ "C1", 1 ], [ "C2", 1 ] ],
          ...
        "preceding_and_subsequent_entities": [
          [ [ "C1", 1 ], [ "C10", 1 ] ],
          ...
- Execute `bin/gen_data_for_visit_status_prediction.sh` to generate the dataset for the visit status prediction task.
  - The generated data will be saved as `atd-vso/full/main/json/*_visit-status.json`.
  - The data structure represents mentions and the text in which they appear.

        "doc-00019": {
          "001": {
            "text": "奈良公園のアイドル「しか」で~す。",
            "mention": {
              "sentence_id": "001",
              "entity_id": "C1",
              "span": [ 0, 4 ],
              "text": "奈良公園",
              "entity_type": "FAC_NAME",
              "visit_status": "Visit",
              "label_id": 1
            }
            ...
          },
          ...
        },

  - Here, `doc-00019` is the document ID.
  - The `text` field contains the context from the original document.
    - This context is what the model actually sees as input: it provides the local context in which a target mention appears.
  - The `mention` field indicates the mention information.
    - The `span` field gives the character offsets of the beginning and end of the mention in the text.
    - The `visit_status` field is the visit status label.
    - The `label_id` field is the ID of the visit status label.
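
As an illustration of this structure, the sketch below flattens the nested dictionary into one record per instance, which is a common shape for classifier input. It is a minimal sketch under stated assumptions: `iter_visit_status_examples` is a hypothetical helper, the inner key (`"001"`) is treated as an opaque instance key, and `span` is assumed to be end-exclusive, as the example suggests.

```python
def iter_visit_status_examples(dataset):
    """Yield one flat record per (document, instance) pair."""
    for doc_id, instances in dataset.items():
        for instance_key, instance in instances.items():
            mention = instance["mention"]
            start, end = mention["span"]
            yield {
                "doc_id": doc_id,
                "instance_key": instance_key,
                "text": instance["text"],
                # Assumption: span offsets are end-exclusive and index into `text`.
                "target": instance["text"][start:end],
                "visit_status": mention["visit_status"],
                "label_id": mention["label_id"],
            }


# The example instance shown above, as an in-memory dictionary.
dataset = {
    "doc-00019": {
        "001": {
            "text": "奈良公園のアイドル「しか」で~す。",
            "mention": {
                "sentence_id": "001",
                "entity_id": "C1",
                "span": [0, 4],
                "text": "奈良公園",
                "entity_type": "FAC_NAME",
                "visit_status": "Visit",
                "label_id": 1,
            },
        },
    },
}

examples = list(iter_visit_status_examples(dataset))
```

Each record pairs the input context (`text`) with the target mention and its label, so `examples` can be passed to whatever training code is in use.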
- Execute `bin/gen_data_for_inc_relation_prediction.sh` to generate the dataset for the inclusion relation prediction task.
  - The generated data will be saved as `atd-vso/full/main/json/*_inc-relation.json`.
  - The data structure represents inclusion (child-parent) relations between entities within a document.

        "doc-00019": {
          "C1": {
            "C2": {
              "child_id": "C1",
              "child_text": "奈良公園",
              "parent_id": "C2",
              "parent_text": "奈良",
              "text": "[CHILD]奈良公園[/CHILD]のアイドル「しか」で~す。かわい~せんべいあげたくなるよね~[PARENT]奈良[/PARENT]の有名スポットですよね!",
              "label": 1
            },
            ...
          },
          ...
        },

  - Here, `doc-00019` is the document ID.
  - `C1` refers to a target child entity, and models have to identify its parent entity in the document.
  - For each child entity, the candidate parent entities, such as `C2`, are listed.
  - The `text` field contains the context from the original document, where the child and parent mentions are explicitly marked.
    - The child entity mention (`child_text`, here `奈良公園`) is wrapped with the tags `[CHILD]` and `[/CHILD]`.
    - The candidate parent entity mention (`parent_text`, here `奈良`) is wrapped with the tags `[PARENT]` and `[/PARENT]`.
    - This context is what the model actually sees as input: it provides the local context in which both the child and the candidate parent appear, while clearly indicating which span is the child and which span is the candidate parent for the task.
  - The `label` field indicates whether the candidate is the correct parent entity: `label: 1` marks the correct parent and `label: 0` marks an incorrect one.
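
The nested layout (document, then child entity, then candidate parent) can be read as follows. This is a hedged sketch, not the repository's code: `gold_parents` and `marked_spans` are hypothetical helpers, and the regular expressions assume each tag pair occurs at most once per `text`, as in the example.

```python
import re

# Patterns for the marker tags described above.
CHILD_RE = re.compile(r"\[CHILD\](.*?)\[/CHILD\]", re.DOTALL)
PARENT_RE = re.compile(r"\[PARENT\](.*?)\[/PARENT\]", re.DOTALL)


def gold_parents(dataset):
    """Map (doc_id, child_id) to the candidate parent IDs whose label is 1."""
    gold = {}
    for doc_id, children in dataset.items():
        for child_id, candidates in children.items():
            gold[(doc_id, child_id)] = [
                parent_id
                for parent_id, candidate in candidates.items()
                if candidate["label"] == 1
            ]
    return gold


def marked_spans(text):
    """Return the (child, parent) strings wrapped by the tags; None where a tag is absent."""
    child = CHILD_RE.search(text)
    parent = PARENT_RE.search(text)
    return (
        child.group(1) if child else None,
        parent.group(1) if parent else None,
    )


# The example candidate shown above, as an in-memory dictionary.
dataset = {
    "doc-00019": {
        "C1": {
            "C2": {
                "child_id": "C1",
                "child_text": "奈良公園",
                "parent_id": "C2",
                "parent_text": "奈良",
                "text": "[CHILD]奈良公園[/CHILD]のアイドル「しか」で~す。かわい~せんべいあげたくなるよね~[PARENT]奈良[/PARENT]の有名スポットですよね!",
                "label": 1,
            },
        },
    },
}

gold = gold_parents(dataset)
spans = marked_spans(dataset["doc-00019"]["C1"]["C2"]["text"])
```

Checking that `marked_spans` agrees with `child_text` and `parent_text` is a simple way to validate the markup before training.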
- Execute `bin/gen_data_for_trans_relation_prediction.sh` to generate the dataset for the transition relation prediction task.
  - The generated data will be saved as `atd-vso/full/main/json/*_trans-relation.json`.
  - The data structure represents transition relations between entities within a document.

        "doc-00019_level-ES002": {
          "C1": {
            "C10": {
              "child_id": "C1",
              "child_count": 1,
              "child_text": "奈良公園",
              "parent_id": "C10",
              "parent_count": 1,
              "parent_text": "古墳",
              "text": "[CHILD]奈良公園[/CHILD]のアイドル「しか」で~す。かわい~せんべいあげたくなるよね~奈良の有名スポットですよね!大仏様はとっても大きかったなぁ~柱に穴があって、くぐるといいらしいんだけど、狭くて無理だった...。たくさんお祈りしてきました。健康と交通安全、あとちょっと金運も...。写真は猿沢池からも見える興福寺の五重塔です。国宝館と東金堂に行く場合は、優待券がHPから印刷できますよ。お昼の眺めもいいんだけど、夜は興福寺の五重塔がライトアップされて、とてもきれいですよ。かわらを寄贈してきました。今頃は名前を書いたかわらが、法隆寺の屋根に使われているのかなぁ。奈良の名物「にゅうめん」のセットを食べました。(写真)美味しかったし、寒い日だったので、あったまりました。眠そうな顔がかわいいです♪~写真だとわかりづらいけど、とっても大きな石が使われています。[PARENT]古墳[/PARENT]の中に入ると、さらに大きさを感じることができます。",
              "label": 1
            },
            ...
          },
          ...
        },

  - Here, `doc-00019_level-ES002` represents the document ID and the hierarchy ID.
  - `C1` refers to a target (child) entity, and models have to identify its subsequent (parent) entity in the same hierarchy of inclusion relations.
    - It is assumed that gold inclusion relations have already been identified.
  - For each target entity, the candidate subsequent (parent) entities, such as `C10`, are listed.
  - The `text` field contains the context from the original document, where the child and parent mentions are explicitly marked.
    - The child entity mention (`child_text`, here `奈良公園`) is wrapped with the tags `[CHILD]` and `[/CHILD]`.
    - The candidate parent entity mention (`parent_text`, here `古墳`) is wrapped with the tags `[PARENT]` and `[/PARENT]`.
    - This context is what the model actually sees as input: it provides the local context in which both the child and the candidate parent appear, while clearly indicating which span is the child and which span is the candidate parent for the task.
  - The `label` field is the ground-truth annotation for the transition relation: 1 means that there is a transition from the child entity to this candidate parent entity, and 0 means that there is no such transition.
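
Because the top-level key combines the document ID and the hierarchy ID, a consumer typically splits it before collecting the gold transitions. The sketch below shows one way to do this. It is illustrative only: `split_group_key` and `gold_transitions` are hypothetical helpers, the `_level-` separator is inferred from the single example key, and the toy entry omits the long `text` field for brevity.

```python
def split_group_key(key):
    """Split a key such as 'doc-00019_level-ES002' into (document ID, hierarchy ID)."""
    # Assumption: the separator is the literal '_level-' seen in the example key.
    doc_id, sep, level_id = key.rpartition("_level-")
    if not sep:
        raise ValueError(f"unexpected key format: {key!r}")
    return doc_id, level_id


def gold_transitions(dataset):
    """Collect (doc_id, level_id, source_id, target_id) for candidates labeled 1."""
    transitions = []
    for key, sources in dataset.items():
        doc_id, level_id = split_group_key(key)
        for source_id, candidates in sources.items():
            for target_id, candidate in candidates.items():
                if candidate["label"] == 1:
                    transitions.append((doc_id, level_id, source_id, target_id))
    return transitions


# The example candidate shown above, without its `text` field.
dataset = {
    "doc-00019_level-ES002": {
        "C1": {
            "C10": {
                "child_id": "C1",
                "child_count": 1,
                "child_text": "奈良公園",
                "parent_id": "C10",
                "parent_count": 1,
                "parent_text": "古墳",
                "label": 1,
            },
        },
    },
}

transitions = gold_transitions(dataset)
```

Here the "child" fields hold the source of the transition and the "parent" fields hold its candidate successor, mirroring the field names reused from the inclusion relation format.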
- Hiroki Ouchi <hiroki.ouchi [at] is.naist.jp>
This study was supported by JSPS KAKENHI Grant Number JP23K24904. The annotation data was constructed by IR-Advanced Linguistic Technologies Inc.
Please cite the following paper.
@inproceedings{yamamoto-etal-2025-graph,
title = "Graph-Structured Trajectory Extraction from Travelogues",
author = "Yamamoto, Aitaro and
Otomo, Hiroyuki and
Ouchi, Hiroki and
Higashiyama, Shohei and
Teranishi, Hiroki and
Shindo, Hiroyuki and
Watanabe, Taro",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.690/",
doi = "10.18653/v1/2025.acl-long.690",
pages = "14116--14132",
ISBN = "979-8-89176-251-0"
}