CVPR 2019 Workshop on

Long-Term Visual Localization under Changing Conditions

Overview

When: June 17th (Monday), 2019
Where: Hyatt Beacon B
Time: half day, afternoon (1:30 PM to 6:00 PM)
Schedule:
- 1:30 - 1:45 Introduction by organizers & overview of the challenges
- 1:45 - 2:15 Invited talk: Jan-Michael Frahm
- 2:15 - 2:45 Talks by winners of Visual Localization Challenge
  - 1st place: Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, Marcin Dymczyk, From Coarse to Fine: Robust Hierarchical Localization at Large Scale
  - 1st place: Tianxin Shi, Shuhan Shen, Xiang Gao, Lingjie Zhu, Yurun Tian, Qingtian Zhu, Visual Localization Using Sparse Semantic 3D Map
- 2:45 - 3:15 Invited talk: Bernhard Zeisl
- 3:15 - 3:40 Coffee Break
- 3:40 - 4:10 Talks by winner and runner-up of Local Feature Challenge
  - 1st place: Jerome Revaud, Philippe Weinzaepfel, Cesar De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, Martin Humenberger, R2D2: Reliable and Repeatable Detectors and Descriptors for Joint Sparse Keypoint Detection and Local Feature Extraction
  - 2nd place: Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, Bohyung Han, Large-Scale Image Retrieval with Attentive Deep Local Features
- 4:10 - 4:40 Invited talk: Niko Sünderhauf
- 4:40 - 5:10 Talks by runners-up of Visual Localization and Local Feature Challenges
  - 2nd place Visual Localization Challenge: Hugo Germain, Guillaume Bourmaud, Vincent Lepetit, Sparse-to-Dense Hypercolumn Matching for Long-Term Visual Localization
  - 3rd place Local Feature Challenge: Zhe Xin, Yinghao Cai, Shaojun Cai, Jixiang Zhang, Yiping Yang, Yanqing Wang, Self-supervised Local Feature Detector with Co-Saliency Ranking Optimization
- 5:10 - 5:40 Invited talk: Srikumar Ramalingam
- 5:40 - 6:00 Closing

Deadlines

Challenge submission opens: April 10th
Challenge submission deadline: June 1st
Notification: June 4th

Note that the workshop focuses on the submissions to the challenges. There will be no contributed papers.

Abstract

Visual localization is the problem of (accurately) estimating the position and orientation, i.e., the camera pose, from which an image was taken with respect to some scene representation. Visual localization is a vital component in many interesting Computer Vision and Robotics scenarios, including autonomous vehicles such as self-driving cars and other robots, Augmented / Mixed / Virtual Reality, Structure-from-Motion, and SLAM.

Visual localization algorithms rely on a scene representation constructed from images. Since it is impractical to capture a given scene under all potential viewing conditions, i.e., under every combination of viewpoint, illumination, and seasonal or other conditions, localization algorithms need to be robust to such changes. This is especially true if visual localization algorithms need to operate over a long period of time. This workshop thus focuses on the problem of long-term visual localization and is intended as a benchmark for the current state of visual localization under changing conditions. The workshop consists of both invited talks by experts in the field and practical challenges on recent datasets.

Detailed Description

There are multiple approaches to solving the visual localization problem. Structure-based methods establish matches between local features found in a query image and 3D points in a Structure-from-Motion (SfM) point cloud. These 2D-3D matches are then used to estimate the camera pose by applying an n-point-pose solver inside a RANSAC loop. Localization techniques based on scene coordinate regression replace the feature extraction and matching stages with a learned model that directly predicts the 3D point corresponding to a pixel patch. The resulting 2D-3D matches are then used for classical, RANSAC-based pose estimation. Camera pose regression techniques such as PoseNet replace the full localization pipeline with a CNN that learns to regress the 6DOF pose from a single image. While these approaches aim to estimate a highly accurate pose, image retrieval-based approaches aim to provide a coarser prior. Because they use compact image-level descriptors, they are typically much more scalable than methods that represent the scene either explicitly via an SfM point cloud or implicitly via a CNN.
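
As an illustration of the retrieval-based idea only, the following minimal Python sketch ranks database images by the cosine similarity of their image-level descriptors to a query descriptor and returns the position of the best match as a coarse prior. The function names, the dictionary layout, and the three-dimensional toy descriptors are all assumptions made for this example; real systems use learned descriptors with hundreds or thousands of dimensions and approximate nearest-neighbor search.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two descriptors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat an all-zero descriptor as matching nothing.
    return dot / (norm_a * norm_b)


def retrieve_coarse_pose(query_descriptor, database, top_k=1):
    """Return the top_k database entries most similar to the query.

    Each entry is a dict with a hypothetical layout: "name", "descriptor",
    and "position" (the known camera position of that database image).
    """
    ranked = sorted(
        database,
        key=lambda entry: cosine_similarity(query_descriptor, entry["descriptor"]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy database: hand-written descriptors and positions, for illustration only.
database = [
    {"name": "db_000.jpg", "descriptor": [0.9, 0.1, 0.0], "position": (0.0, 0.0, 0.0)},
    {"name": "db_001.jpg", "descriptor": [0.1, 0.9, 0.2], "position": (12.0, 0.0, 3.5)},
    {"name": "db_002.jpg", "descriptor": [0.0, 0.2, 0.9], "position": (25.0, 1.0, 7.0)},
]

query = [0.2, 0.8, 0.3]
best = retrieve_coarse_pose(query, database)[0]
# The position of the best-matching database image serves as the coarse prior.
print(best["name"], best["position"])
```

In this sketch the prior is simply the position of the top-ranked database image. A structure-based method could then refine it, for example by matching local features only against the 3D points observed in the retrieved images.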

Common to all of these approaches is that they generate a representation of the scene from a set of training images. They also (implicitly) assume that the set of training images covers all relevant viewing conditions, i.e., that the test images are taken under conditions similar to those of the training images. In practice, however, the set of training images will only depict the scene under a subset of all possible viewpoints and illumination conditions. Moreover, many scenes are dynamic. For example, the geometry and appearance of outdoor scenes change significantly over time.

While a substantial amount of work has focused on making visual localization algorithms more robust to viewpoint changes between training and test images, there is comparatively little work on handling changes in scene appearance over time. This is partly due to a lack of suitable benchmark datasets, which have only recently started to become available. Yet, changes over time, e.g., due to seasonal changes in outdoor scenes or changes in furniture in indoor scenes, pose very significant problems as they often lead to changes in both scene appearance (captured in the images) and scene geometry.