
Visual Navigation for Embodied Agents Using Semantic-based Multi-modal Cognitive Graph

Accepted for publication in the IEEE Transactions on Image Processing (T-IP)

Abstract

We propose a semantic-based multi-modal cognitive graph, termed SMCG, for intelligent visual navigation. SMCG provides a unified semantic-level representation of memory and reasoning, where memory is constructed by recording sequences of observed objects instead of raw perceptual features, and reasoning is performed over a semantic relation graph encoding object correlations. To effectively exploit the heterogeneous cognitive information, we further design a Hierarchical Cognition Extraction (HCE) pipeline to decode both global cognitive cues and situation-aware subgraphs for navigation decision-making. The proposed framework enables embodied agents to exhibit more informed and proactive navigation behaviors. Experimental results on image-goal navigation tasks in photorealistic environments demonstrate that SMCG significantly improves navigation success rate and path efficiency compared with existing methods.

Prerequisites

System

  • Python 3.8+
  • PyTorch (CUDA recommended)
  • NVIDIA GPU with CUDA support (optional but strongly recommended)

Install the Python dependencies with:

pip install -r requirements.txt
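A quick sanity check of this setup (a minimal sketch, not a script shipped with the repository):

# check_env.py -- hypothetical helper: verify Python, PyTorch, and CUDA
import sys
import torch

assert sys.version_info >= (3, 8), "Python 3.8+ is required"
print(f"PyTorch {torch.__version__}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available; running on CPU will be much slower")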

Simulator

  • habitat==0.2.1
  • habitat-sim==0.2.1

Install Detectron2 following the official guide (must match PyTorch/CUDA version).
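To confirm the simulator and detector packages are installed at compatible versions, a small check like the following can help (package names are assumed to match the PyPI names listed above; adjust if your installation differs):

# report installed versions of the simulator/detector dependencies
from importlib.metadata import PackageNotFoundError, version

for pkg in ("habitat", "habitat-sim", "detectron2", "torch"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")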

Pretrained Models and Data

  • yolov3/best.pt (object detector)
  • RetrievalNet/best.pth (retrieval / feature backbone)
  • detectron/model/model_final_280758.pkl
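A minimal sketch of inspecting these weights with PyTorch, assuming best.pt and best.pth are ordinary PyTorch checkpoints; the Detectron2 .pkl file is handled by Detectron2's own checkpointing utilities rather than torch.load:

import torch

# Depending on how the detector checkpoint was saved, unpickling it may
# require the detector's model definitions to be importable.
yolo_ckpt = torch.load("yolov3/best.pt", map_location="cpu")
retrieval_ckpt = torch.load("RetrievalNet/best.pth", map_location="cpu")
print(type(yolo_ckpt), type(retrieval_ckpt))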

Usage

Datasets

We use expert demonstration data collected in the Habitat-Gibson simulator for image-goal visual navigation.
The dataset contains panoramic RGB-D observations and expert trajectories, and is organized as follows:


IL_data
├── train
│   ├── easy
│   │   ├── Anaheim_000_env0.dat.gz
│   │   └── ...
│   ├── medium
│   └── hard
└── test
    ├── easy
    └── ...

Each .dat.gz file corresponds to one navigation episode in a Gibson scene.
Difficulty splits are defined by the start-to-goal distance: easy (1.5–3 m), medium (3–5 m), and hard (5–10 m).
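The exact per-episode serialization is not documented here; the sketch below assumes the .dat.gz files can be read with joblib (which reads compressed dumps transparently) and simply counts episodes per split. Adapt the loading call to whatever the repository's data loader actually uses.

from pathlib import Path

import joblib  # assumption: episodes were written with joblib.dump

data_root = Path("IL_data")
for split in ("train", "test"):
    for difficulty in ("easy", "medium", "hard"):
        episodes = sorted((data_root / split / difficulty).glob("*.dat.gz"))
        print(f"{split}/{difficulty}: {len(episodes)} episodes")

# Peek at one episode if it is present locally.
sample = data_root / "train" / "easy" / "Anaheim_000_env0.dat.gz"
if sample.exists():
    episode = joblib.load(sample)
    print(type(episode))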

Training

python main.py \
  --config ./configs/vgm.yaml \
  --data-dir <path_to_demos> \
  --gpu 0

  • Logs and checkpoints are saved under record/<date>/ (see the checkpoint-lookup sketch after this list)
  • Hyperparameters, dataset splits, and evaluation cadence are defined in configs/
  • Semantic memory and reasoning graph construction is implemented in graph.py
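Since evaluator.py expects a checkpoint path, the following sketch locates the most recently written checkpoint under record/; the file naming inside record/<date>/ is an assumption, so adjust the pattern to match your runs:

from pathlib import Path

# Find the newest .pt/.pth file under record/ and print the matching
# evaluation command.
candidates = sorted(Path("record").rglob("*.pt*"), key=lambda p: p.stat().st_mtime)
if candidates:
    latest = candidates[-1]
    print(f"python evaluator.py --model_path {latest}")
else:
    print("no checkpoints found under record/")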

Evaluation

python evaluator.py --model_path <checkpoint_path>

With visualization:

python evaluator.py --model_path <checkpoint_path> --visualize

Acknowledgments

This work builds upon open-source projects including Habitat, Habitat-Sim, Detectron2, and YOLOv3.

We thank the reviewers of IEEE Transactions on Image Processing for their constructive feedback.

Citation

If you find this work useful in your research, please consider citing:

@ARTICLE{smcg2025,
  author={Liu, Qiming and Du, Xinmin and Liu, Zhe and Wang, Hesheng},
  journal={IEEE Transactions on Image Processing}, 
  title={Visual Navigation for Embodied Agents Using Semantic-based Multi-modal Cognitive Graph}, 
  year={2025},
  volume={},
  number={},
  pages={1-13},
  doi={10.1109/TIP.2025.3637722}}
