Generic Representation Learning

Results: the learned representation

We investigate the properties of the learned representation using the following methods:

1) tSNE: a large-scale 2D embedding of the representation. This makes it possible to visualize the space and get a sense of similarity from the perspective of the representation.
2) Nearest Neighbors (NN) search on the full-dimensional representation.
3) Training a readout function (a simple classifier, such as KNN or a linear classifier) on the frozen representation (i.e., no tuning) to read out a desired variable.

We refrain from fine-tuning the representation (i.e., the siamese tower is frozen and receives no supervision on the task it is being evaluated on). The reason is that, if our initial hypothesis is correct, training on the foundational tasks should lead to generalization and abstraction with no direct supervision on the secondary tasks. We compare against the representations of related methods whose models are available, various layers of AlexNet trained on ImageNet, and a number of supervised techniques for some of the tasks.
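The frozen-representation readout described above can be sketched as a KNN classifier over fixed feature vectors. This is a minimal illustration, not the authors' evaluation code: the feature vectors, the label names, the Euclidean distance, and `k=3` are all assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_readout(train_feats, train_labels, query_feat, k=3):
    """Read out a label by majority vote over the k nearest frozen features.

    The features are never updated here: only the (parameter-free) readout
    looks at the labels of the secondary task.
    """
    order = sorted(range(len(train_feats)),
                   key=lambda i: euclidean(train_feats[i], query_feat))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 3-D "frozen" features with made-up labels.
train_feats = [
    [0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.0, 0.2],
    [1.0, 0.9, 1.0], [0.9, 1.0, 1.1], [1.1, 1.0, 0.9],
]
train_labels = ["layout_A"] * 3 + ["layout_B"] * 3

prediction = knn_readout(train_feats, train_labels, [0.95, 1.0, 1.0], k=3)
print(prediction)
```

Because the readout has no trainable parameters, any accuracy it reaches can be attributed to the structure already present in the representation.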

tSNE (MIT Places & OUR DATASET)

The 2D tSNE embeddings of our representation for the MIT Places dataset (‘library’ category) and for an unseen subset of our dataset are provided below. The representation organizes the images based on their 3D content (scene layout, relative camera pose to the scene, etc.), independently of their semantics (visible objects, architectural styles) or low-level properties (color, texture, etc.). This suggests that the representation has a notion of certain basic 3D concepts, though it was never given explicit supervision for such tasks (especially not for non-matching images, and all tSNE images are non-matching).
The tSNE of our dataset also suggests that the patches are organized based on their coarse surface normals (again, a task for which the representation received no supervision). See the section below for a quantitative evaluation of our representation for surface normal estimation on the NYUv2 dataset.

[Figure: tSNE embedding of MIT Places; panels: AlexNet, Ours]

Nearest Neighbors

For each query image, the figure shows its NNs based on AlexNet (trained on ImageNet) and on our representation.
Note the geometric consistency between the NNs and their respective query.


Surface Normal Estimation

We evaluated our representation on the NYUv2 benchmark to see whether it has a notion of surface normals (see the discussion on the tSNEs above).
A summary of the results is provided below, showing that our representation outperforms the baselines on unsupervised surface normal estimation (see the paper for more details and additional results).
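As an illustration of how surface normal predictions are typically scored, the sketch below computes the angular error between predicted and ground-truth normals and summarizes it. The statistics (mean, median, and the fraction of errors within 11.25, 22.5, and 30 degrees) are an assumption based on what is commonly reported for NYUv2; the exact protocol used here is in the paper. The normals in the example are made up.

```python
import math


def angular_error_deg(pred, gt):
    """Angle in degrees between a predicted and a ground-truth normal."""
    dot = sum(p * g for p, g in zip(pred, gt))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(g * g for g in gt)))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def summarize(errors, thresholds=(11.25, 22.5, 30.0)):
    """Mean, median, and fraction of errors at or below each threshold."""
    s = sorted(errors)
    n = len(s)
    median = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    within = {t: sum(e <= t for e in s) / n for t in thresholds}
    return {"mean": sum(s) / n, "median": median, "within": within}


# Hypothetical predictions against a ground-truth normal pointing along +z.
gt = [0.0, 0.0, 1.0]
preds = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
errors = [angular_error_deg(p, gt) for p in preds]
stats = summarize(errors)
print(stats)
```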

3D OBJECT POSE ESTIMATION

The following figures show the tSNE embeddings of several ImageNet categories based on our representation and on AlexNet trained on ImageNet. Please see the paper for the tSNEs of other baseline representations. The embeddings of our representation are geometrically meaningful, while the baselines either perform a semantic organization or overfit to other aspects, such as color.
NOTE: certain aspects of object pose estimation, e.g., distinguishing between the front and back of a bus, are more of a semantic task than a geometric/3D one. This adversely affects a method that has a 3D understanding but not a semantic one (e.g., our representation). In this sense, poses that are congruent modulo 90 degrees could be considered identical and equally good (i.e., like different sides of a uniform cube).
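The 90-degree congruence in the note above can be made concrete with a small worked example. The function below is a hypothetical helper, not part of the authors' evaluation: it measures the azimuth difference on a circle of a chosen period, so that `period=90.0` treats the four sides of a cube as the same pose.

```python
def azimuth_error(a_deg, b_deg, period=360.0):
    """Smallest angular difference between two azimuths on a circle of
    length `period` degrees (360 = ordinary pose, 90 = cube-like symmetry)."""
    d = abs(a_deg - b_deg) % period
    return min(d, period - d)


# Front vs. back of a bus: maximally wrong as an ordinary pose error,
# but identical once poses are compared modulo 90 degrees.
front_back_full = azimuth_error(10.0, 190.0)               # period 360
front_back_mod90 = azimuth_error(10.0, 190.0, period=90.0)
partial_mod90 = azimuth_error(10.0, 80.0, period=90.0)
print(front_back_full, front_back_mod90, partial_mod90)
```

Under the ordinary metric the front/back confusion counts as a 180-degree error, while under the 90-degree-congruent metric it counts as no error at all.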

[Figure: tSNE embeddings of ImageNet categories (e.g., 'Chest'); panels: AlexNet, Ours]

ABSTRACTION OF 3D OBJECT POSE

To evaluate the abstract generalization abilities of our representation, we generated a sparse set of 88 images showing the exterior of a synthetic cube parametrized over different view angles. The images can be seen as an abstract pose of an object. We then performed NN search between these images and the images of the EPFL Multi-View Car dataset using our representation and several baselines. As is apparent in the following figure, our representation retrieves meaningful NNs, while the baselines mostly overfit to appearance and retrieve either an incorrect NN or always the same NN. This suggests that our representation, unlike the baselines, has been able to abstract away the appearance details that are irrelevant to a basic 3D understanding.
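The cross-domain NN search above, and the "always the same NN" failure mode, can be sketched as follows. This is an illustrative toy, not the authors' code: the cosine similarity, the 2-D feature vectors, and the `retrieval_diversity` measure (fraction of distinct gallery items retrieved) are assumptions introduced for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))


def nearest_neighbor(query, gallery):
    """Index of the gallery feature most similar to the query."""
    return max(range(len(gallery)),
               key=lambda i: cosine_similarity(query, gallery[i]))


def retrieval_diversity(queries, gallery):
    """Retrieved indices and the fraction of distinct NNs among them.

    A value near 1/len(queries) means every query collapses onto the
    same gallery image.
    """
    nns = [nearest_neighbor(q, gallery) for q in queries]
    return nns, len(set(nns)) / len(queries)


# Hypothetical features: queries stand in for synthetic cube views,
# the gallery stands in for real car images.
gallery = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
good_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
collapsed_queries = [[1.0, 0.01], [1.0, 0.02], [1.0, 0.03]]

good_nns, good_div = retrieval_diversity(good_queries, gallery)
bad_nns, bad_div = retrieval_diversity(collapsed_queries, gallery)
print(good_nns, good_div, bad_nns, bad_div)
```

Diversity alone does not show that the retrieved NNs are correct; it only flags the degenerate case in which a representation returns the same image for every query.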

PASCAL 3D

The following figure shows cross-category NN search results for our representation along with several baselines. This also evaluates a certain level of abstraction, as some of the object categories can look drastically different. We also quantitatively evaluated 3D object pose estimation on PASCAL3D, with the results available in the following table. Our representation outperforms the network trained from scratch and comes close to AlexNet, which has seen thousands of ImageNet images from the same categories as well as other objects.

Quantitative results on PASCAL3D benchmark

Scene layout estimation

We evaluated our representation on the LSUN dataset. The right table provides the results of layout estimation using a simple NN classifier on our representation along with two supervised baselines, showing that our representation achieves a performance close to Hedau et al.'s supervised method on this unseen task. The left table provides the results of 'layout classification' using an NN classifier on our representation and on AlexNet's FC7 and Pool5 representations.

Quantitative results on LSUN benchmark

ABSTRACTION OF Layout Estimation

We performed an abstraction experiment on layout estimation (similar to the one on 3D object pose shown above). We performed NN retrieval between a set of 88 images showing the interior of a synthetic cube and the images of the LSUN dataset. The same observation as in the abstraction experiment on 3D object pose holds here: our NNs are meaningful, while the baselines mostly overfit to appearance and show no clear geometric abstraction.

Representation Learning Pipeline

How we collected the Data

Sample Collected Image Bundles

Results: Supervised Tasks

Qualitative Results of Camera Pose Estimation

Qualitative Results of Matching

Quantitative Evaluation of Camera Pose Estimation

Quantitative Evaluation of Feature Matching

Evaluations on Brown's feature learning Benchmark (the metric is FPR@95)
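FPR@95 is usually read as the false positive rate at 95% recall: the fraction of non-matching pairs that are accepted at the distance threshold where 95% of the matching pairs are accepted (lower is better). The sketch below computes it from descriptor distances; the function name and the distances are hypothetical, and the benchmark's own protocol should be consulted for exact details.

```python
import math


def fpr_at_recall(match_dists, nonmatch_dists, recall=0.95):
    """False positive rate at the distance threshold that accepts the
    requested fraction of matching pairs."""
    s = sorted(match_dists)
    # Number of matching pairs that must be accepted; the small epsilon
    # guards against floating-point error in recall * len(s).
    k = math.ceil(recall * len(s) - 1e-9)
    threshold = s[k - 1]
    false_positives = sum(d <= threshold for d in nonmatch_dists)
    return false_positives / len(nonmatch_dists)


# Hypothetical descriptor distances: 20 matching pairs, 4 non-matching pairs.
match_dists = [i / 20 for i in range(1, 21)]   # 0.05, 0.10, ..., 1.00
nonmatch_dists = [0.5, 0.9, 1.2, 1.5]

fpr95 = fpr_at_recall(match_dists, nonmatch_dists)
print(fpr95)
```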

Evaluation on Mikolajczyk & Schmid's feature matching benchmark

Live Demo

What's Next?

Publication

TEAM

Amir R Zamir

Tilman Wekel

Pulkit Agrawal

Colin Wei

Jitendra Malik

Silvio Savarese

Acknowledgements: Te-Lin Wu

Contact Us
