A Natural Language Interface for Multimodal Satellite Image Understanding
“Is a picture really worth a thousand words?”
GeoSpatial NLI is an end-to-end vision–language system that enables non-expert users to analyze satellite imagery using natural language queries.

Given a single satellite image, the system can:
- Generate detailed captions
- Answer natural language questions (VQA)
- Localize objects via oriented bounding boxes (OBB grounding)
The pipeline is designed to work across RGB, SAR, IR, and False Color Composite (FCC) imagery and supports high-resolution inputs up to 2k×2k, operating robustly across 0.5–10 m/pixel spatial scales.
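The OBB grounding capability above returns boxes with an orientation angle rather than axis-aligned rectangles. The helper below is a minimal geometry sketch, not the project's actual API: the `obb_corners` function and its `(cx, cy, w, h, angle)` box format are assumptions about one common way oriented boxes are represented.

```python
import math

def obb_corners(cx, cy, w, h, theta_deg):
    """Convert an oriented bounding box (center, size, rotation angle in
    degrees) to its four corner points, listed counter-clockwise."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    dx, dy = w / 2, h / 2
    corners = []
    for ox, oy in [(-dx, -dy), (dx, -dy), (dx, dy), (-dx, dy)]:
        # Rotate each half-extent offset about the box center.
        corners.append((cx + ox * c - oy * s, cy + ox * s + oy * c))
    return corners

# Axis-aligned case (angle = 0) recovers the familiar rectangle corners.
print(obb_corners(100.0, 50.0, 40.0, 20.0, 0.0))
```

Oriented boxes matter for overhead viewpoints because ships, aircraft, and buildings rarely align with the image axes, so axis-aligned boxes would overestimate their extent.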
Key features:

- Unified natural language interface for satellite imagery
- Multi-modal handling of RGB, SAR, IR, and FCC images
- Scale-robust inference across diverse spatial resolutions
- Oriented object grounding suitable for overhead viewpoints
- SAR grounding without SAR captions, using detector + LLM reasoning
- Fully deployable web-based system
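The "detector + LLM reasoning" route for SAR grounding can be sketched as a text bridge: detector output is serialized into a prompt the language model can reason over, sidestepping the lack of SAR captions. This is a hypothetical illustration; `detections_to_prompt` and the detection dictionary fields are invented names, not the system's real interface.

```python
def detections_to_prompt(question, detections):
    """Render detector output (class label, confidence, oriented box) as
    plain text so a language model can answer grounding questions about a
    SAR scene without any SAR caption data."""
    lines = [f"Question: {question}", "Detected objects:"]
    for i, d in enumerate(detections, 1):
        cx, cy, w, h, angle = d["obb"]
        lines.append(
            f"{i}. {d['label']} (conf {d['score']:.2f}) at "
            f"center=({cx:.0f},{cy:.0f}), size={w:.0f}x{h:.0f}, "
            f"heading={angle:.0f} deg"
        )
    return "\n".join(lines)

dets = [{"label": "ship", "score": 0.91, "obb": (412, 118, 60, 14, 35)}]
print(detections_to_prompt("How many ships are near the harbor?", dets))
```

The design choice here is that the detector supplies geometry while the LLM supplies language understanding, so neither component needs paired SAR image-text training data.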
We thank the authors of SARATR-X, VRSBench, Qwen-VL, Moondream, and SAM for open-sourcing their work, which made this project possible.