This study presents a novel evaluation framework for the Vision-Language Navigation (VLN) task, aiming to diagnose current models across various instruction categories at a fine-grained level. The framework is structured around the context-free grammar (CFG) of the task, which serves as the basis for the problem decomposition and as the core premise for designing the instruction categories. We propose a semi-automatic method for CFG construction with the help of Large Language Models (LLMs), then derive and generate data spanning five principal instruction categories: direction change, landmark recognition, region recognition, vertical movement, and numerical comprehension. Our analysis of different models reveals notable performance discrepancies and recurrent issues; the stagnation of numerical comprehension, heavy selective biases over directional concepts, and other findings can inform the development of future language-guided navigation systems. A brief introduction to the project is available here.
To evaluate on R2R-style datasets, you need to set up the Matterport3DSimulator. Please follow the installation and configuration instructions in its official repository.
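For reference, obtaining the simulator usually starts from a recursive clone; the build itself (Docker image or native CMake build) should follow the official README, so the lines below are only a sketch:

```
# Clone with submodules (the simulator pulls in pybind11 as a submodule)
git clone --recursive https://github.com/peteanderson80/Matterport3DSimulator.git
cd Matterport3DSimulator
# Build per the official instructions (e.g. the provided Dockerfile, or cmake + make)
```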
Download the NavNuances data v1 from the link.
We follow the R2R naming convention, so to obtain trajectory predictions on the NavNuances splits you only need to add the split names to the validation code of standard VLN methods trained on R2R.
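In practice this usually amounts to placing the downloaded annotation files next to the method's existing R2R splits and listing the NavNuances split names wherever the validation splits are enumerated. The paths below are placeholders rather than the layout of any particular codebase:

```
# Placeholder paths: put the NavNuances annotation files alongside the R2R splits
cp /path/to/NavNuances/v1/*.json /path/to/method/datasets/R2R/annotations/
# Then add the NavNuances split names to the validation split list in the method's
# validation code or config, exactly as you would for an extra R2R validation split.
```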
We provide an example of setting up the DUET model to generate predictions for the NavNuances dataset. You can check the details in baselines/VLN-DUET.
The evaluator definitions are provided in the evaluation/evaluators directory. After generating the submission file in the standard R2R format for all NavNuances splits, modify the directories in evaluation/run_eval_template.sh. Then run:
```
cd evaluation
sh run_eval_template.sh
```

This will generate the evaluation results.
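For context, each submission file follows the standard R2R leaderboard format: a JSON list of entries, each with an instr_id and a trajectory given as [viewpoint_id, heading, elevation] triples. Modifying the template generally just means pointing a couple of path variables at your own files; the variable names below are purely illustrative (check run_eval_template.sh for its actual ones):

```
# Illustrative sketch only -- the real variable names live in evaluation/run_eval_template.sh
SUBMISSION_DIR=/path/to/your/predictions   # R2R-format prediction files for the NavNuances splits
NAVNUANCES_DIR=/path/to/NavNuances/v1      # downloaded NavNuances annotations
```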
To predict using the NavGPT4v model, follow these steps:
- Link the Matterport3D scans v1/scans to baselines/navgpt4v/data/v1/scans.
- Place all the evaluation splits into baselines/navgpt4v/data/R2R/annotations.
- Set your OPENAI_API_KEY and OPENAI_ORGANIZATION environment variables (a sketch of this and the scan-linking step is included in the command block below).
- Add the evaluation splits at line 17 of baselines/navgpt4v/NavGPT4v.py.
- Run the trajectory prediction script with the following commands:

```
cd baselines/navgpt4v
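# Sketch of the setup steps above -- the scan source path is a placeholder for your
# local Matterport3D download; credentials are your own OpenAI values.
mkdir -p data/v1
ln -s /path/to/matterport3d/v1/scans data/v1/scans
export OPENAI_API_KEY=<your_key>
export OPENAI_ORGANIZATION=<your_org>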
DEBUG=1 sh run_pred.sh # dump intermediate observations as detailed at line 339 of navgpt4v/LLMs/openai4v.py
sh run_pred.sh # no intermediate log
```

If you're using NavNuances in your research, please cite using the following BibTeX:
@inproceedings{wang2024navigating,
title={Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation},
author={Wang, Zehao and Wu, Minye and Cao, Yixin and Ma, Yubo and Chen, Meiqi and Tuytelaars, Tinne},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
year={2024},
publisher={Association for Computational Linguistics},
}