Overview

Figure 1. PinpointQA decomposes the benchmark into four progressively harder tasks for small object-centric spatial understanding.

PinpointQA is a dataset and benchmark for evaluating whether multimodal models can determine whether a specified small object appears in an indoor scene, localize it through nearby references, describe its final location precisely in natural language, and express the same grounded information in a form useful to both humans and downstream systems.

Human Assistance Evaluation Demo

This lightweight demo mirrors the human assistance setting: users browse several frames, click the target location, and immediately see the elapsed time and response accuracy for each example.

Question

Please click Start to begin the demo.

Image 1 / 4

Click Start to begin this example.

Start the demo to enable interaction

You can then switch frames, click the target location, and save your answer.

-

Current Example Sample 1 / 3

Elapsed Time -

Accuracy -

Session Summary Start the demo and complete an example to see the running summary.

PinpointQA at a Glance

10,094 QA Samples

4 Tasks

2 Data Sources

6,121 / 1,954 / 2,019 Train / Val / Test

Indoor Scene Setting

PinpointQA Task Formulation

The benchmark is organized as a progressive capability chain. It starts from target presence verification, moves to nearest reference identification and fine-grained spatial description, and ends with structured spatial prediction.

Target Presence Verification

Determine whether the target object appears in the video.

Nearest Reference Identification

Identify the reference object closest to the target.

Fine-Grained Spatial Description

Describe the target location with clear spatial language.

Structured Spatial Prediction

Organize the target location into directly usable structured fields.

Data Construction Pipeline

Figure 2. The pipeline connects scene curation, scene curation, and task-specific QA generation.

PinpointQA is built from ScanNet++ and ScanNet200, combining richly annotated indoor scenes with small object-centric spatial reasoning. The pipeline identifies candidate small objects, constructs intermediate spatial representations between the target and nearby references, and then generates aligned QA pairs for TPV, NRI, FSD, and SSP under a unified formulation.

Dataset Statistics

Figure 3. Distribution overview of tasks, data-source composition, target categories, and video duration.

The benchmark maintains a relatively balanced task distribution across TPV, NRI, FSD, and SSP. It covers a wide range of daily small objects, combines samples from both ScanNet++ and ScanNet200, and spans both short-to-medium and longer indoor videos, providing a diverse testbed for small object-centric spatial understanding.

Benchmark Performance

The benchmark reveals a clear performance gap between early-stage target perception and later-stage structured spatial prediction. Fine-tuned open-source models achieve the strongest overall results, while almost all models show a steady decline from TPV to SSP, indicating that executable spatial grounding remains the most challenging part of the task chain.

Leaderboard Sorted by Avg Micro (high to low)


Qwen3-VL-8B-SFT Fine-tuned	0.83	0.84	0.44	0.45	0.36	0.37	0.29	0.29	0.48	0.49
InternVL3.5-8B-SFT Fine-tuned	0.82	0.82	0.41	0.39	0.34	0.36	0.23	0.24	0.45	0.45
Kimi K2.5 Proprietary	0.80	0.84	0.42	0.44	0.32	0.33	0.15	0.15	0.42	0.44
Qwen3-VL-8B-Instruct Open-source	0.78	0.80	0.37	0.37	0.28	0.29	0.12	0.12	0.39	0.40
GPT-5.4 Proprietary	0.65	0.69	0.39	0.42	0.31	0.32	0.15	0.16	0.38	0.40
LLaVA-OneVision-1.5-8B Open-source	0.76	0.79	0.30	0.30	0.26	0.27	0.07	0.06	0.35	0.36
Cambrian-S-7B Open-source	0.73	0.78	0.33	0.35	0.24	0.25	0.05	0.06	0.34	0.36
InternVL3.5-8B-Instruct Open-source	0.65	0.70	0.36	0.38	0.25	0.26	0.09	0.10	0.34	0.36
SenseNova-SI-1.3 Open-source	0.64	0.66	0.36	0.40	0.15	0.16	0.12	0.13	0.32	0.34
Spatial-MLLM-v1.1 Open-source	0.52	0.51	0.30	0.30	0.21	0.20	0.00	0.00	0.26	0.25

Citation


@article{zhou2026pinpointqa,
author  = {Zhiyu Zhou and Peilin Liu and Ruoxuan Zhang and Luyang Zhang and Cheng Zhang and Hongxia Xie and Wen-Huang Cheng},
title   = {PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos},
journal = {arXiv preprint arXiv:2604.08991},
year    = {2026}
}