Visual Affordance Prediction: Survey and Reproducibility

Istituto Italiano di Tecnologia1
Queen Mary University of London2
Idiap Research Institute3, École Polytechnique Fédérale de Lausanne4

Abstract

Affordances are the potential actions an agent can perform on an object, as observed by a camera. Visual affordance prediction is formulated differently across tasks such as grasp detection, affordance classification, affordance segmentation, and hand pose estimation. This diversity of formulations leads to inconsistent definitions that prevent fair comparisons between methods. In this paper, we propose a unified formulation of visual affordance prediction that accounts for the complete information on the objects of interest and on the interaction of the agent with the objects to accomplish a task. This unified formulation allows us to comprehensively and systematically review disparate visual affordance works, highlighting the strengths and limitations of both methods and datasets. We also discuss reproducibility issues, such as the unavailability of method implementations and experimental setup details, which make benchmarks for visual affordance prediction unfair and unreliable. To favour transparency, we introduce the Affordance Sheet, a document that details the solution, datasets, and validation of a method, supporting future reproducibility and fairness in the community.

Highlights

  • Unified problem definition of visual affordance and systematic review
  • Reproducibility issues in visual affordance datasets and benchmarks
  • Affordance Sheet to favour reproducibility
  • Open challenges and future directions of visual affordance for robotics

Unified affordance problem formulation

Our formulation integrates the task-specific redefinitions of affordance prediction, given the task to accomplish and the RGB image. We decompose visual affordance prediction into the following subtasks and related components:

  1. Localise the object of interest (object localisation).
  2. Predict the actions for each localised object (functional classification).
  3. Predict the object regions that enable the action to be performed (functional segmentation).
  4. Estimate the hand pose on the object, given the hand model and the previously extracted information (hand pose estimation).
  5. Render the hand on the RGB image (hand synthesis).
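The five subtasks above form a sequential pipeline, where each stage consumes the output of the previous one. The sketch below illustrates this decomposition with placeholder stages; all function names, signatures, and the dummy outputs are assumptions for illustration, not a released implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AffordancePrediction:
    """Container for the outputs of the five subtasks."""
    boxes: list = field(default_factory=list)     # 1. object localisation
    actions: dict = field(default_factory=dict)   # 2. functional classification
    masks: dict = field(default_factory=dict)     # 3. functional segmentation
    hand_poses: dict = field(default_factory=dict)  # 4. hand pose estimation
    rendering: object = None                      # 5. hand synthesis

def localise_objects(rgb):
    # Placeholder detector: returns one dummy bounding box (x, y, w, h).
    return [(10, 10, 50, 50)]

def classify_functions(rgb, boxes):
    # Placeholder classifier: maps each localised object to afforded actions.
    return {box: ["grasp"] for box in boxes}

def segment_functional_regions(rgb, actions):
    # Placeholder segmenter: one mask stand-in per (object, action) pair.
    return {(box, act): "mask" for box, acts in actions.items() for act in acts}

def estimate_hand_poses(masks, hand_model="hand_model"):
    # Placeholder estimator: one pose stand-in per functional region,
    # conditioned on a given hand model.
    return {key: "pose" for key in masks}

def synthesise_hand(rgb, hand_poses):
    # Placeholder renderer: composites the hand onto the RGB image.
    return "rgb_with_hand"

def predict_affordance(rgb):
    """Run the five subtasks in sequence on an RGB image."""
    boxes = localise_objects(rgb)
    actions = classify_functions(rgb, boxes)
    masks = segment_functional_regions(rgb, actions)
    poses = estimate_hand_poses(masks)
    rendering = synthesise_hand(rgb, poses)
    return AffordancePrediction(boxes, actions, masks, poses, rendering)
```

In practice, a method addressing only a subset of the redefinitions (e.g. affordance segmentation) would implement the corresponding stages and leave the rest out.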

Reproducibility in visual affordance

Reproducibility challenges (RCs) in different redefinitions of visual affordance prediction include: data availability for benchmarking (RC1); availability of a method's implementation (RC2); availability of the trained model (RC3); details of the experimental setups (RC4); and details of the performance measures used for the evaluation (RC5).


Research challenges

Affordance sheet

To promote reproducibility in affordance prediction and overcome these reproducibility challenges, we propose the Affordance Sheet, an organised collection of good practices that facilitates fair comparisons and the development of new solutions (see the Table below). Our Affordance Sheet builds on Model Cards, adding sections that complement the information typically released.
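As an illustration, an Affordance Sheet could be represented as a structured template whose sections mirror the solution, datasets, and validation information described in the text, annotated with the reproducibility challenge each field addresses. The exact schema below is an assumption for illustration, not the published sheet format.

```python
# Hypothetical Affordance Sheet template; section and field names are
# assumptions based on the description in the text (solution, datasets,
# validation, Model Card), annotated with the related challenge RC1-RC5.
AFFORDANCE_SHEET_TEMPLATE = {
    "solution": {
        "method_name": None,
        "implementation_url": None,   # RC2: availability of the implementation
        "trained_model_url": None,    # RC3: availability of the trained model
    },
    "datasets": {
        "training": [],               # RC1: data availability for benchmarking
        "evaluation": [],
    },
    "validation": {
        "experimental_setup": None,   # RC4: details of the experimental setup
        "performance_measures": [],   # RC5: details of performance measures
    },
    "model_card": {},                 # fields inherited from Model Cards
}

def missing_fields(sheet, prefix=""):
    """Return dotted paths of fields left unfilled (None, empty list/dict)."""
    missing = []
    for key, value in sheet.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into non-empty sections; flag empty ones outright.
            missing += missing_fields(value, path + ".") if value else [path]
        elif value is None or value == []:
            missing.append(path)
    return missing
```

A simple completeness check like `missing_fields` could flag which reproducibility items a released method still lacks before publication.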
