What does it take to develop an agent with human-like intelligent visual perception? The popular paradigms currently employed in computer vision are problem-specific supervised learning and, to a lesser extent, unsupervised and reinforcement learning. However, we argue that none of these will lead to truly intelligent visual perception unless the learning framework is specifically devised to gain abstraction and generalization power. Here we present our approach to this problem, inspired by the developmental stages of vision skills in humans. Specifically, rather than training a new model for every individual desired problem, we train a model to learn fundamental vision tasks that serve as the foundation for ultimately solving the desired problems. As our first effort toward validating this approach, we employ this method to learn a generic 3D representation by supervising two basic but fundamental 3D tasks. We show that the learned representation generalizes to unseen 3D tasks without any fine-tuning, while achieving human-level performance on the task it was supervised for.
Our method is based upon the premise that by providing supervision over a set of carefully selected foundational tasks, generalization to unseen tasks and abstraction capabilities can be achieved. We use this approach to learn a generic 3D representation by solving a set of supervised proxy 3D tasks: object-centric camera pose estimation and wide-baseline feature matching (please see the paper for a discussion on how these two tasks were selected). We empirically show that the internal representation of a multi-task ConvNet trained to solve the above problems generalizes to unseen 3D tasks (e.g., scene layout estimation, object pose estimation, surface normal estimation) without any fine-tuning, and shows traits of abstraction (e.g., cross-modality pose estimation).
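As a rough illustration of this multi-task setup (not the architecture from the paper; all layer sizes and names below are hypothetical), the following PyTorch sketch shows a shared-trunk siamese network whose pair encoding feeds two heads: one regressing a 6DOF relative pose and one scoring whether a patch pair matches.

```python
import torch
import torch.nn as nn

class SiameseMultiTaskNet(nn.Module):
    """Hypothetical sketch of a shared-trunk multi-task network:
    the trunk encodes each patch (weights shared across the pair),
    and two heads consume the concatenated pair encoding."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.pose_head = nn.Linear(2 * feat_dim, 6)   # 6DOF relative pose
        self.match_head = nn.Linear(2 * feat_dim, 1)  # match/non-match logit

    def forward(self, a, b):
        fa, fb = self.trunk(a), self.trunk(b)         # shared weights
        pair = torch.cat([fa, fb], dim=1)
        return self.pose_head(pair), self.match_head(pair)

net = SiameseMultiTaskNet()
a = torch.randn(4, 3, 64, 64)  # batch of 4 patch pairs
b = torch.randn(4, 3, 64, 64)
pose, match = net(a, b)
```

During training, the two heads' losses would be summed so the shared trunk is shaped by both proxy tasks at once; that shared trunk is what serves as the generic representation.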
In the context of the supervised tasks, we show that our representation achieves state-of-the-art wide-baseline feature matching results without requiring a priori rectification (unlike SIFT and the majority of learned features). We also demonstrate 6DOF camera pose estimation given a pair of local image patches.
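For reference, the relative pose between two calibrated cameras is determined by their absolute extrinsics. A minimal numpy sketch, assuming the world-to-camera convention x_cam = R·x_world + t (conventions vary between libraries):

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def relative_pose(R1, t1, R2, t2):
    """Relative pose mapping camera-1 coordinates to camera-2 coordinates,
    assuming world-to-camera extrinsics x_cam = R x_world + t."""
    R_rel = R2 @ R1.T
    t_rel = t2 - R_rel @ t1
    return R_rel, t_rel

# Example: two cameras related by a 30-degree yaw and a 1m translation.
R1, t1 = np.eye(3), np.zeros(3)
R2, t2 = rot_z(np.pi / 6), np.array([1.0, 0.0, 0.0])
R_rel, t_rel = relative_pose(R1, t1, R2, t2)
```

A network supervised for this task regresses (R_rel, t_rel) directly from the patch pair rather than computing it from known extrinsics.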
We contribute a large-scale dataset composed of object-centric Street View scenes along with point correspondences and camera pose information. We collected the dataset by integrating Street View images, their metadata, and large-scale geo-registered 3D building models scraped from the web. The simplified steps of collecting the dataset are shown in the animation above (see the paper for details). The dataset is available to the public for research purposes and currently includes the main areas of Washington DC, New York City, San Francisco, Paris, Amsterdam, Las Vegas, and Chicago. To ensure the quality of the test set and keep evaluations unaffected by potential errors introduced by the automated data collection, every datapoint in the test set was verified by at least three Amazon Mechanical Turkers. The procedure and statistics are elaborated in the supplementary material.
DOWNLOAD THE DATASET AND INSTRUCTIONS [HERE]
DOWNLOAD VISUALIZATIONS AND ACCURACY ANALYSIS OF THE TEST SET [HERE]
Each row depicts a sample collected image bundle. Each bundle shows one target physical point (placed in the center of the images) from different viewpoints. Below you can see a few snapshots of the 3D models of the 8 cities from which the dataset was collected. You can see more snapshots here. The 3D models are also available for download.
We investigate the properties of the learned representation using the following methods:
The 2-dimensional embeddings (tSNE) of our representation for the MIT Places dataset ('library' category) and an unseen subset of our dataset are provided below. The representation organizes the images based on their 3D content (scene layout, relative camera pose to the scene, etc.) and independently of their semantics (visible objects, architectural styles) or low-level properties (color, texture, etc.). This suggests that the representation has a notion of certain basic 3D concepts, even though it was never given explicit supervision for such tasks (note in particular that the supervised tasks were defined on matching images, while all of the tSNE images are non-matching). The tSNE of our dataset also suggests the patches are organized based on their coarse surface normals (again, a task the representation received no supervision for). See the section below for a quantitative evaluation of our representation on surface normal estimation on the NYUv2 dataset.
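A 2D embedding of this kind can be produced for any representation with off-the-shelf tSNE. A minimal scikit-learn sketch, using random stand-in feature vectors (in practice these would be the ConvNet activations of the images):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the representation vectors of 100 images.
features = rng.normal(size=(100, 64))

# Project to 2-D for visualization; perplexity must be below n_samples.
emb = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(features)
```

Each row of `emb` is a 2D coordinate, which can then be plotted with the corresponding image thumbnail to obtain visualizations like those shown here.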
Click on a query image to see its NNs based on AlexNet (trained on ImageNet) and on our representation. Note the geometric consistency between the NNs and their respective query.
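NN retrieval of this kind reduces to a nearest-neighbor search in the representation space. A minimal scikit-learn sketch with random stand-in features (the metric choice here is an assumption; cosine distance is a common default for ConvNet features):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
gallery = rng.normal(size=(500, 128))  # representation vectors of gallery images
queries = rng.normal(size=(3, 128))    # representation vectors of query images

# Retrieve the 5 nearest gallery images per query, nearest first.
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(gallery)
dists, idx = nn.kneighbors(queries)
```

Swapping in features from different networks (e.g., AlexNet vs. the learned representation) while keeping the search fixed is what makes the side-by-side comparison meaningful.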
We evaluated our representation on the NYUv2 benchmark to see if it has a notion of surface normals (see the discussion on the tSNEs above). A summary of the results is provided below, showing that our representation outperforms the baselines on unsupervised surface normal estimation (see the paper for more details and additional results).
The following figures show the tSNE embeddings of several ImageNet categories based on our representation and on AlexNet trained on ImageNet. Please see the paper for the tSNEs of other baseline representations. The embeddings of our representation are geometrically meaningful, while the baselines either perform a semantic organization or overfit to other aspects, such as color. NOTE: certain aspects of object pose estimation, e.g., distinguishing between the front and back of a bus, are more of a semantic task than a geometric/3D one. That adversely impacts a method that has a 3D understanding but no semantic one (e.g., our representation). In this sense, poses that are 90 degrees congruent could be considered identical and equally good (i.e., like the different sides of a uniform cube).
The following figure shows cross-category NN search results for our representation along with several baselines. This also evaluates a certain level of abstraction, as some of the object categories can look drastically different. We also quantitatively evaluated 3D object pose estimation on PASCAL3D, with the results available in the following table. Our representation outperforms a network trained from scratch and comes close to AlexNet, which has seen thousands of images of the same categories (as well as other objects) from ImageNet.
Quantitative results on PASCAL3D benchmark
We evaluated our representation on the LSUN dataset. The right table provides the results of layout estimation using a simple NN classifier on our representation along with two supervised baselines, showing that our representation achieves a performance close to Hedau et al.'s supervised method on this unseen task. The left table provides the results of 'layout classification' using an NN classifier on our representation and on AlexNet's FC7 and Pool5 representations.
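A simple NN classifier of the kind used here can be sketched as a 1-nearest-neighbor classifier over representation vectors. The scikit-learn snippet below uses random stand-in features and labels (the real evaluation would use the ConvNet representations of LSUN images and their layout categories):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
# Stand-ins: representation vectors with layout-category labels (4 classes).
X_train = rng.normal(size=(200, 64))
y_train = rng.integers(0, 4, size=200)
# Slightly perturbed copies of the first 10 training points as "test" images.
X_test = X_train[:10] + 0.01 * rng.normal(size=(10, 64))

# 1-NN: each test image inherits the label of its nearest training image.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
pred = clf.predict(X_test)
```

The appeal of this evaluation is that the classifier adds no learned capacity of its own, so the accuracy directly reflects how well the representation separates the layout categories.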
We performed an abstraction experiment on layout estimation (similar to the one on 3D object pose shown above): NN retrieval between a set of 88 images showing the interior of a synthetic cube and the images of the LSUN dataset. The same observation as in the 3D object pose abstraction experiment holds here as well, with our NNs being meaningful while the baselines mostly overfit to appearance with no clear geometric abstraction trait.
The qualitative and quantitative results of the evaluations on the supervised tasks can be seen below. We used the standard evaluation protocols for both the camera pose estimation and feature matching tasks. We also provide evaluation results on the (non-Street View) benchmarks of Brown et al. and Mikolajczyk & Schmid.
Follow the instructions to upload a pair of images. Press Run, and the relative camera pose and matching score between the two will be shown.
You can also upload a batch of images (<100) and receive the 2D embedding of our representation vs. the baselines reported in the paper.
"Generic 3D Representation via Pose Estimation and Matching", Amir R. Zamir, Tilman Wekel, Pulkit Agrawal, Jitendra Malik, Silvio Savarese, in ECCV 2016. [Paper] [Supplementary Material]