Overview
Figure 1. PinpointQA decomposes the benchmark into four progressively harder tasks for small object-centric spatial understanding.
PinpointQA is a dataset and benchmark for evaluating whether multimodal models can determine whether a specified small object appears in an indoor scene, localize it through nearby references, describe its final location precisely in natural language, and express the same grounded information in a form useful to both humans and downstream systems.
Human Assistance Evaluation Demo
This lightweight demo mirrors the human assistance setting: users browse several frames, click the target location, and immediately see the elapsed time and response accuracy for each example.
PinpointQA at a Glance
PinpointQA Task Formulation
The benchmark is organized as a progressive capability chain. It starts from target presence verification, moves to nearest reference identification and fine-grained spatial description, and ends with structured spatial prediction.
Target Presence Verification
Determine whether the target object appears in the video.
Nearest Reference Identification
Identify the reference object closest to the target.
Fine-Grained Spatial Description
Describe the target location with clear spatial language.
Structured Spatial Prediction
Organize the target location into directly usable structured fields.
Data Construction Pipeline
Figure 2. The pipeline connects scene curation, scene curation, and task-specific QA generation.
PinpointQA is built from ScanNet++ and ScanNet200, combining richly annotated indoor scenes with small object-centric spatial reasoning. The pipeline identifies candidate small objects, constructs intermediate spatial representations between the target and nearby references, and then generates aligned QA pairs for TPV, NRI, FSD, and SSP under a unified formulation.
Dataset Statistics
Figure 3. Distribution overview of tasks, data-source composition, target categories, and video duration.
The benchmark maintains a relatively balanced task distribution across TPV, NRI, FSD, and SSP. It covers a wide range of daily small objects, combines samples from both ScanNet++ and ScanNet200, and spans both short-to-medium and longer indoor videos, providing a diverse testbed for small object-centric spatial understanding.
Benchmark Performance
The benchmark reveals a clear performance gap between early-stage target perception and later-stage structured spatial prediction. Fine-tuned open-source models achieve the strongest overall results, while almost all models show a steady decline from TPV to SSP, indicating that executable spatial grounding remains the most challenging part of the task chain.
| Rank | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B-SFT Fine-tuned | 0.83 | 0.84 | 0.44 | 0.45 | 0.36 | 0.37 | 0.29 | 0.29 | 0.48 | 0.49 | |
| InternVL3.5-8B-SFT Fine-tuned | 0.82 | 0.82 | 0.41 | 0.39 | 0.34 | 0.36 | 0.23 | 0.24 | 0.45 | 0.45 | |
| Kimi K2.5 Proprietary | 0.80 | 0.84 | 0.42 | 0.44 | 0.32 | 0.33 | 0.15 | 0.15 | 0.42 | 0.44 | |
| Qwen3-VL-8B-Instruct Open-source | 0.78 | 0.80 | 0.37 | 0.37 | 0.28 | 0.29 | 0.12 | 0.12 | 0.39 | 0.40 | |
| GPT-5.4 Proprietary | 0.65 | 0.69 | 0.39 | 0.42 | 0.31 | 0.32 | 0.15 | 0.16 | 0.38 | 0.40 | |
| LLaVA-OneVision-1.5-8B Open-source | 0.76 | 0.79 | 0.30 | 0.30 | 0.26 | 0.27 | 0.07 | 0.06 | 0.35 | 0.36 | |
| Cambrian-S-7B Open-source | 0.73 | 0.78 | 0.33 | 0.35 | 0.24 | 0.25 | 0.05 | 0.06 | 0.34 | 0.36 | |
| InternVL3.5-8B-Instruct Open-source | 0.65 | 0.70 | 0.36 | 0.38 | 0.25 | 0.26 | 0.09 | 0.10 | 0.34 | 0.36 | |
| SenseNova-SI-1.3 Open-source | 0.64 | 0.66 | 0.36 | 0.40 | 0.15 | 0.16 | 0.12 | 0.13 | 0.32 | 0.34 | |
| Spatial-MLLM-v1.1 Open-source | 0.52 | 0.51 | 0.30 | 0.30 | 0.21 | 0.20 | 0.00 | 0.00 | 0.26 | 0.25 |
Citation
@article{zhou2026pinpointqa,
author = {Zhiyu Zhou and Peilin Liu and Ruoxuan Zhang and Luyang Zhang and Cheng Zhang and Hongxia Xie and Wen-Huang Cheng},
title = {PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos},
journal = {arXiv preprint arXiv:2604.08991},
year = {2026}
}