AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale

IDEAS NCBR, Jagiellonian University, Warsaw University of Technology, Tooploox
ECCV 2024

Intelligently explore the environment visually, zooming in on important features.

Abstract

Active Visual Exploration (AVE) is a task that involves dynamically selecting observations (glimpses), which is critical to facilitate comprehension and navigation within an environment. While modern AVE methods have demonstrated impressive performance, they are constrained to fixed-scale glimpses from rigid grids. In contrast, existing mobile platforms equipped with optical zoom capabilities can capture glimpses of arbitrary positions and scales. To address this gap between software and hardware capabilities, we introduce AdaGlimpse. It uses Soft Actor-Critic, a reinforcement learning algorithm tailored for exploration tasks, to select glimpses of arbitrary position and scale. This approach enables our model to rapidly establish a general awareness of the environment before zooming in for detailed analysis. Experimental results demonstrate that AdaGlimpse surpasses previous methods across various visual tasks while maintaining greater applicability in realistic AVE scenarios.

Adaptive Visual Exploration

Figure: visual exploration, human vs. AI.

Our approach selects and processes glimpses of arbitrary position and scale, fully exploiting the capabilities of modern hardware. In this example, AdaGlimpse first selects a low-resolution glimpse of the whole environment. Based on this glimpse, it predicts a bird with probability 0.01, which is too low to make a final decision. It therefore selects a second glimpse by zooming in on the upper-left corner. The process repeats four times, until the probability of the predicted class exceeds a specified threshold.
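To make the stopping rule concrete, here is a minimal sketch of this glimpse-until-confident loop in Python. The model, agent, and extract_glimpse callables are hypothetical stand-ins for the paper's components, not the actual AdaGlimpse API:

import torch

def explore(image, model, agent, extract_glimpse, threshold=0.75, max_steps=8):
    # Start with a low-resolution glimpse covering the whole scene;
    # an action (x, y, s) encodes glimpse position and scale in [0, 1].
    action = torch.tensor([0.5, 0.5, 1.0])
    history = []
    label, conf = None, torch.tensor(0.0)
    for _ in range(max_steps):
        glimpse = extract_glimpse(image, action)   # crop + resample (sketched below)
        history.append((glimpse, action))
        probs = model(history)                     # classify from all glimpses so far
        conf, label = probs.max(dim=-1)
        if conf.item() >= threshold:               # confident enough: stop early
            break
        action = agent.act(history)                # agent proposes the next (x, y, s)
    return label, conf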

AdaGlimpse

AdaGlimpse consists of two parts: a vision-transformer-based encoder with a task-specific head, and a Soft Actor-Critic RL agent. At each exploration step, the RL agent selects the position and scale of the next glimpse based on information about the previous glimpses: their coordinates, importance, and latent representations.
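A rough sketch of the agent side of this interface, assuming a continuous three-dimensional action (x, y, scale) produced by a stochastic actor, as is standard in Soft Actor-Critic; the layer sizes and the sigmoid squashing are illustrative choices, not the paper's exact architecture:

import torch
import torch.nn as nn

class GlimpseActor(nn.Module):
    # SAC-style actor: maps a summary of the glimpses seen so far
    # to a stochastic continuous action (x, y, scale).
    def __init__(self, state_dim=768, hidden=256, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.net(state)
        std = self.log_std(h).clamp(-5, 2).exp()
        dist = torch.distributions.Normal(self.mu(h), std)
        raw = dist.rsample()            # reparameterised sample, as SAC requires
        return torch.sigmoid(raw)       # squash into [0, 1]^3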

Efficient image recognition

Active visual exploration with AdaGlimpse can quickly recognise objects in partially observable scenes using only a fraction of the full image resolution. In this example, AdaGlimpse explores 224 × 224 images from ImageNet with 32 × 32 glimpses of variable scale, zooming in on objects of interest and stopping once the predicted probability reaches 75%. The two examples shown use only 20 and 8 transformer patches respectively, where a standard vision transformer would require 196. The rows correspond to: A) glimpse locations, B) pixels visible to the model (interpolated from glimpses for preview), C) predicted label, D) prediction probability.
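Capturing a glimpse of arbitrary position and scale reduces to a crop followed by resampling to the fixed glimpse resolution. Below is a minimal sketch using torchvision; the mapping from the scale component of the action to the crop size is a simplifying assumption:

import torch
import torchvision.transforms.functional as TF

def extract_glimpse(image, action, out_size=32):
    # image: (C, H, W) tensor; action: (x, y, s) in [0, 1].
    _, H, W = image.shape
    x, y, s = action.tolist()
    side = max(int(s * min(H, W)), out_size)  # scale picks the crop size (assumed mapping)
    top = int(y * (H - side))
    left = int(x * (W - side))
    return TF.resized_crop(image, top, left, side, side, [out_size, out_size])

Assuming standard 16 × 16 ViT patching, each 32 × 32 glimpse contributes 2 × 2 = 4 tokens, so the 20- and 8-patch runs above correspond to 5 and 2 glimpses, against 196 = (224/16)² tokens for a full-image vision transformer.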

Quick scene understanding

Robotic agents face limitations in sensor and processing capabilities. By intelligently choosing which areas to explore, an agent builds awareness of the environment faster. In this example, AdaGlimpse explores 224 × 224 images from MS COCO with 16 × 16 glimpses of variable scale, zooming in on objects of interest. Note that each glimpse consists of a single vision transformer patch. The model reconstructs the full-resolution scene using only 12 patches, where a standard vision transformer would require 196. The columns correspond to: A) glimpse locations, B) pixels visible to the model (interpolated from glimpses for preview), C) reconstruction result.
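Because each 16 × 16 glimpse maps to exactly one transformer token, tokenisation reduces to a single linear projection plus an embedding of where (and at what scale) the glimpse was taken. The sketch below assumes a simple learned linear embedding for the coordinates, which may differ from the paper's exact encoding:

import torch
import torch.nn as nn

class GlimpseTokenizer(nn.Module):
    # Turns one 16x16 RGB glimpse plus its (x, y, scale) into a single token.
    def __init__(self, embed_dim=768):
        super().__init__()
        self.pixel_proj = nn.Linear(3 * 16 * 16, embed_dim)  # flatten-and-project, as in ViT
        self.coord_proj = nn.Linear(3, embed_dim)            # encode position and scale

    def forward(self, glimpse, action):
        # glimpse: (B, 3, 16, 16); action: (B, 3) with (x, y, s) in [0, 1]
        return self.pixel_proj(glimpse.flatten(1)) + self.coord_proj(action)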

Benchmark: scene reconstruction

Reconstruction results: RMSE (lower is better) obtained by our model on the reconstruction task against AttSeg, GlAtEx, SimGlim, and AME on the ImageNet-1k, SUN360, ADE20K, and MS COCO datasets. Regardless of the number of glimpses, their resolution, and the glimpse regime, our method outperforms the competing solutions. Note that Pixel % denotes the percentage of image pixels known to the model, † marks a result we reproduced because it was not published in the relevant paper, and * marks zero-shot performance.
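For reference, the RMSE reported in the table is the standard root-mean-squared error between reconstructed and ground-truth pixels; a minimal implementation:

import torch

def rmse(pred, target):
    # Root-mean-squared error over all pixels and channels (lower is better).
    return torch.sqrt(torch.mean((pred - target) ** 2))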

Benchmark: image classification

Classification results: accuracy obtained by our model on the classification task against DRAM, GFNet, Saccader, STN, TNet, PatchDrop, and STAM on the ImageNet-1k dataset. AdaGlimpse needs 40% fewer pixels to match the performance of the best baseline method. Note that Pixel % denotes the percentage of image pixels known to the model; the glimpse regimes are described in the paper.
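As a sanity check on the Pixel % axis, the fraction of pixels revealed by a set of glimpses can be computed directly, assuming Pixel % counts captured glimpse pixels at their native resolution (a simplifying assumption; overlap and zoom change the exact figure):

def pixel_percent(n_glimpses, glimpse_size=32, image_size=224):
    # Percentage of image pixels captured by n fixed-size glimpses.
    return 100 * n_glimpses * glimpse_size**2 / image_size**2

print(pixel_percent(8))  # 8 glimpses of 32x32 on a 224x224 image -> ~16.3%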

BibTeX

@inproceedings{pardyl2024adaglimpse,
  title     = {AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale},
  author    = {Pardyl, Adam and Wronka, Michał and Wołczyk, Maciej and Adamczewski, Kamil and Trzciński, Tomasz and Zieliński, Bartosz},
  booktitle = {Computer Vision -- ECCV 2024},
  publisher = {Springer Nature Switzerland},
  year      = {2024},
  note      = {Main Track},
}