Active visual exploration addresses the issue of limited sensor capabilities in real-world scenarios, where successive observations are actively chosen based on the environment. To tackle this problem, we introduce a new technique called Attention-Map Entropy (AME). It leverages the internal uncertainty of a transformer-based model to determine the most informative observations. In contrast to existing solutions, it does not require additional loss components, which simplifies training. Through experiments, which also mimic retina-like sensors, we show that such simplified training significantly improves the performance of reconstruction, segmentation, and classification on publicly available datasets.
Humans naturally explore their surroundings visually, using already observed areas as clues to where a sought object may be located. In contrast, common state-of-the-art artificial intelligence solutions analyze all available data, which is inefficient and wastes time and computational resources. In this project, we introduce a novel Active Visual Exploration method that enables AI agents to explore their environment efficiently.
Our approach chooses the most informative observations by reusing the internal uncertainty encoded in the model's attention maps. In contrast to existing methods, it does not require any auxiliary loss functions dedicated to active exploration. Therefore, training concentrates on the target task loss rather than an auxiliary one, which improves overall performance.
The agent has observed two patches of the image, which are processed by the encoder to produce their feature representations (orange rectangles). These outputs are combined with the masked patches (gray rectangles) and passed through the decoder, which reconstructs the missing image patches. Additionally, our method computes an entropy map from one of the decoder's multi-head self-attention layers and uses it to select the location of the third glimpse. The process repeats until the assumed number of glimpses is reached, as sketched below.
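A minimal sketch of this loop, assuming PyTorch-style tensors and a MAE-like model split into an encoder and a decoder; the callables `encoder`, `decoder`, and `attention_entropy` are placeholders for illustration, not the exact interfaces used in the paper:

```python
def explore(image_patches, encoder, decoder, attention_entropy,
            num_glimpses, init_idx=0):
    """Iteratively select glimpses; `image_patches` is a (num_patches, patch_dim) tensor."""
    known = [init_idx]                          # indices of observed patches
    reconstruction = None
    for _ in range(num_glimpses):
        visible = image_patches[known]          # gather the observed patches
        features = encoder(visible, known)      # orange rectangles in the figure
        reconstruction, attn = decoder(features, known)  # fill in the gray (masked) patches
        entropy = attention_entropy(attn)       # one entropy value per patch
        entropy[known] = 0.0                    # known areas are explicitly zeroed
        known.append(int(entropy.argmax()))     # next glimpse = highest-entropy patch
    return reconstruction
```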
To explain the idea of the entropy map based on attention in a transformer layer, let us consider an image divided into four patches (2 × 2), shown on the left. Its attention map is a 4 × 4 matrix, where each row contains the attention weights used to compute the output of the next transformer layer for the corresponding patch. Calculating Shannon's entropy of each row yields four values, which are arranged into a 2 × 2 entropy map. The patch with the highest entropy value is selected as the next glimpse.
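A toy version of this 2 × 2 example, written as an assumption-level sketch in PyTorch (the attention values below are made up for illustration):

```python
import torch

# A 4 x 4 attention matrix: one row of attention weights per patch.
attn = torch.tensor([[0.70, 0.10, 0.10, 0.10],
                     [0.25, 0.25, 0.25, 0.25],
                     [0.40, 0.40, 0.10, 0.10],
                     [0.10, 0.20, 0.30, 0.40]])

# Shannon entropy of each row: H = -sum(p * log p).
row_entropy = -(attn * attn.clamp_min(1e-12).log()).sum(dim=-1)

entropy_map = row_entropy.reshape(2, 2)    # 2 x 2 entropy map
next_glimpse = int(row_entropy.argmax())   # row 1 is uniform, hence most uncertain
print(entropy_map)
print(next_glimpse)  # -> 1
```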
The figure shows the AME-based glimpse selection process for 8 glimpses of 32 × 32 pixels on a sample 256 × 128 image. The rows correspond to: a) step number, b) model input (glimpses), c) model prediction given that input, d) decoder attention entropy (known areas are explicitly set to zero). The algorithm explores the image in places where the reconstruction is blurry.
Comparison of our model on the reconstruction task against AttSeg, GlAtEx, and SimGlim on the SUN360, ADE20K, and MS COCO [3] datasets. The metric is root mean square error (RMSE; lower is better). For each experiment, we provide a training and evaluation regime defined by the number of glimpses and their resolution. Pixel % and area % denote, respectively, the percentage of image pixels known to the model and the percentage of image area seen by the model. The two measures differ for retina-like glimpses, which by design contain fewer pixels than the area they cover. Our method outperforms the competing solutions in all configurations.
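A hedged sketch of the reported measures; the glimpse layout below (a 32 × 32 retina-like glimpse stored at half resolution on a 256 × 256 image) is an illustrative assumption, not the paper's exact setup:

```python
import torch

def rmse(pred, target):
    """Root mean square error between reconstruction and ground truth."""
    return torch.sqrt(torch.mean((pred - target) ** 2))

image_pixels = 256 * 256
glimpse_area = 32 * 32        # image area covered by one glimpse
glimpse_pixels = 16 * 16      # retina-like glimpse stored at half resolution

area_pct = 100 * glimpse_area / image_pixels     # 1.5625 % of the image area
pixel_pct = 100 * glimpse_pixels / image_pixels  # 0.390625 % of the image pixels
print(area_pct, pixel_pct)
```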
See the paper for results on segmentation and classification tasks.
@inproceedings{pardyl2023active,
title = {Active Visual Exploration Based on Attention-Map Entropy},
author = {Pardyl, Adam and Rypeść, Grzegorz and Kurzejamski, Grzegorz and Zieliński, Bartosz and Trzciński, Tomasz},
booktitle = {Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, {IJCAI-23}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
editor = {Edith Elkind},
pages = {1303--1311},
year = {2023},
month = {8},
note = {Main Track},
doi = {10.24963/ijcai.2023/145},
url = {https://doi.org/10.24963/ijcai.2023/145},
}