Accepted for publication in the IEEE Transactions on Image Processing (T-IP)

We propose a semantic-based multi-modal cognitive graph, termed SMCG, for intelligent visual navigation. SMCG provides a unified semantic-level representation of memory and reasoning, where memory is constructed by recording sequences of observed objects instead of raw perceptual features, and reasoning is performed over a semantic relation graph encoding object correlations. To effectively exploit the heterogeneous cognitive information, we further design a Hierarchical Cognition Extraction (HCE) pipeline to decode both global cognitive cues and situation-aware subgraphs for navigation decision-making. The proposed framework enables embodied agents to exhibit more informed and proactive navigation behaviors. Experimental results on image-goal navigation tasks in photorealistic environments demonstrate that SMCG significantly improves navigation success rate and path efficiency compared with existing methods.
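The semantic memory and relation graph described above can be sketched as a minimal data structure for intuition: the agent records sequences of observed object labels (rather than raw features) and accumulates co-occurrence weights between objects. Names such as `SemanticMemory` and `record` are illustrative only and do not come from the released code.

```python
from collections import defaultdict

class SemanticMemory:
    """Minimal sketch of a semantic-level memory: an ordered log of
    observed object labels plus a relation graph weighted by how often
    two objects appear in the same observation."""

    def __init__(self):
        self.sequence = []                 # ordered per-step object observations
        self.relations = defaultdict(int)  # (obj_a, obj_b) -> co-occurrence count

    def record(self, detected_objects):
        """Append one observation (an iterable of detected object labels)."""
        objs = sorted(set(detected_objects))
        self.sequence.append(objs)
        for i, a in enumerate(objs):
            for b in objs[i + 1:]:
                self.relations[(a, b)] += 1

    def related(self, obj):
        """Objects correlated with `obj`, strongest first."""
        scores = defaultdict(int)
        for (a, b), w in self.relations.items():
            if a == obj:
                scores[b] += w
            elif b == obj:
                scores[a] += w
        return sorted(scores, key=scores.get, reverse=True)

mem = SemanticMemory()
mem.record(["sofa", "tv", "lamp"])
mem.record(["sofa", "tv"])
mem.record(["bed", "lamp"])
print(mem.related("sofa"))  # ['tv', 'lamp']
```

A reasoning module could then rank unexplored directions by how strongly their visible objects correlate with the goal object in this graph.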
- Python 3.8+
- PyTorch (CUDA recommended)
- NVIDIA GPU with CUDA support (optional but strongly recommended)
```
pip install -r requirements.txt
```

Additionally, install `habitat==0.2.1` and `habitat-sim==0.2.1`.
Install Detectron2 following the official guide (must match PyTorch/CUDA version).
Pretrained weights:
- `yolov3/best.pt` (object detector)
- `RetrievalNet/best.pth` (retrieval / feature backbone)
- `detectron/model/model_final_280758.pkl` (Detectron2 weights)
We use expert demonstration data collected in the Habitat-Gibson simulator for image-goal visual navigation.
The dataset contains panoramic RGB-D observations and expert trajectories, and is organized as follows:
```
IL_data
├── train
│   ├── easy
│   │   ├── Anaheim_000_env0.dat.gz
│   │   └── ...
│   ├── medium
│   └── hard
└── test
    ├── easy
    └── ...
```
Each .dat.gz file corresponds to one navigation episode in a Gibson scene.
Difficulty splits are defined by the start-to-goal distance: easy (1.5–3 m), medium (3–5 m), and hard (5–10 m).
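The split rule above can be expressed directly as a small helper. The half-open boundary handling at 3 m and 5 m is an assumption, and the function name is hypothetical:

```python
def split_for(geodesic_dist_m: float) -> str:
    """Map a start-to-goal geodesic distance (meters) to a difficulty split,
    using the thresholds stated above: easy 1.5-3 m, medium 3-5 m, hard 5-10 m."""
    if 1.5 <= geodesic_dist_m < 3.0:
        return "easy"
    if 3.0 <= geodesic_dist_m < 5.0:
        return "medium"
    if 5.0 <= geodesic_dist_m <= 10.0:
        return "hard"
    raise ValueError(f"distance {geodesic_dist_m} m outside defined splits")

print(split_for(4.2))  # medium
```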
```
python main.py \
    --config ./configs/vgm.yaml \
    --data-dir <path_to_demos> \
    --gpu 0
```

- Logs and checkpoints are saved under `record/<date>/`.
- Hyperparameters, dataset splits, and evaluation cadence are defined in `configs/`.
- Semantic memory and reasoning graph construction is implemented in `graph.py`.
```
python evaluator.py --model_path <checkpoint_path>
```

With visualization:

```
python evaluator.py --model_path <checkpoint_path> --visualize
```

This work builds upon the following open-source projects:
We thank the reviewers of IEEE Transactions on Image Processing for their constructive feedback.
If you find this work useful in your research, please consider citing:
```bibtex
@ARTICLE{smcg2025,
  author={Liu, Qiming and Du, Xinmin and Liu, Zhe and Wang, Hesheng},
  journal={IEEE Transactions on Image Processing},
  title={Visual Navigation for Embodied Agents Using Semantic-based Multi-modal Cognitive Graph},
  year={2025},
  volume={},
  number={},
  pages={1-13},
  doi={10.1109/TIP.2025.3637722}
}
```