Attribute and Simile Classifiers for Face Verification
In this work, we advance the state-of-the-art for face verification ("are
these two images of the same person?") in uncontrolled settings with
non-cooperative subjects. To this end, we present two novel and complementary
methods for face verification. Common to both methods is the idea of
extracting and comparing "high-level" visual features, or traits, of a face
image that are insensitive to pose, illumination, expression, and other imaging
conditions. Our first method -- based on attribute classifiers -- uses binary
classifiers trained to recognize the presence, absence, or degree of
describable visual attributes (gender, race, age, hair color, etc.).
Our second method -- based on simile classifiers -- removes the manual labeling
required to train attribute classifiers. The simile classifiers are binary
classifiers trained to recognize the similarity of faces, or regions of
faces, to specific reference people. The idea is to automatically learn
similes that distinguish a person from the general population. An unseen face
might be described as having a mouth that looks like Barack Obama's
and a nose that looks like Owen Wilson's.
Comparing two faces is simply a matter of comparing their trait vectors (i.e.,
the outputs of the attribute and/or simile classifiers). We present experimental
evaluation results on the challenging Labeled Faces in the Wild (LFW)
data set. This data set is notable for its variability, exhibiting all of the
imaging differences mentioned above. Remarkably, both the attribute and simile
classifiers achieve state-of-the-art results on the LFW "restricted images"
benchmark, and a hybrid of the two yields a 31.68% drop in error rate compared
to the previous best. To our knowledge, this is the first time that such a set
of visual traits has been used for face verification. For testing
beyond the LFW data set, we introduce PubFig -- a new data set of real-world
images of public figures (celebrities and politicians) acquired from the
internet. The PubFig data set is both larger (60,000 images) and deeper (on
average 300 images per individual) than existing data sets, and allows us to
present verification results broken out by pose, illumination, and expression.
Finally, we measure human performance on LFW, showing that humans do very well
on this task: given image pairs, they can verify identity almost without
error.
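To make the final comparison concrete, the following minimal sketch trains a pair verifier on top of precomputed trait vectors. The pair features and the scikit-learn RBF SVM shown here are illustrative stand-ins, not the exact classifier used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

def pair_features(t1, t2):
    """Combine two trait vectors into one feature vector for the image pair.
    Element-wise absolute differences and products are one simple choice."""
    return np.concatenate([np.abs(t1 - t2), t1 * t2])

# Placeholder training data: trait vectors for labeled same/different pairs.
rng = np.random.default_rng(0)
n_pairs, n_traits = 200, 65
traits_a = rng.normal(size=(n_pairs, n_traits))  # trait vectors of image 1 in each pair
traits_b = rng.normal(size=(n_pairs, n_traits))  # trait vectors of image 2 in each pair
same = rng.integers(0, 2, size=n_pairs)          # 1 = same person, 0 = different people

X = np.array([pair_features(a, b) for a, b in zip(traits_a, traits_b)])
verifier = SVC(kernel="rbf", gamma="scale").fit(X, same)

# Verify a new pair given its two trait vectors t1 and t2.
t1, t2 = rng.normal(size=n_traits), rng.normal(size=n_traits)
print("same person" if verifier.predict([pair_features(t1, t2)])[0] == 1 else "different people")
```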
This research was funded in part by NSF award IIS-03-25867 and ONR award
N00014-08-1-0638. We are grateful to Omron
Technologies for providing us the OKAO face detection
system.
Publications
"Attribute and Simile Classifiers for Face Verification," N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, IEEE International Conference on Computer Vision (ICCV), Oct, 2009. [PDF] [bib] [©]
Images
Training images for attribute classifiers:
Each row shows training examples
of face images that match the given attribute label (positive examples) and
those that don't (negative examples). We have over a thousand training images
for each of our 65 attributes. Accuracies for each attribute classifier are
shown in the next image.
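As a rough illustration of this training step, the sketch below fits one binary attribute classifier in Python. The random features stand in for whatever low-level face descriptors are extracted in practice, and the RBF-kernel SVM is just one plausible classifier choice.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Stand-ins for low-level descriptors (e.g., color and gradient features
# extracted from aligned face regions) of the training images.
pos = rng.normal(loc=0.5, size=(1000, 128))   # images labeled "attribute present"
neg = rng.normal(loc=-0.5, size=(1000, 128))  # images labeled "attribute absent"

X = np.vstack([pos, neg])
y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])

# One binary classifier per describable attribute ("male", "curly hair", ...).
attribute_clf = SVC(kernel="rbf", gamma="scale").fit(X, y)

# The signed decision value serves as that attribute's score for a new face.
new_face = rng.normal(size=(1, 128))
print(attribute_clf.decision_function(new_face))
```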
Accuracies of attribute classifiers:
We present accuracies of the 65 attribute classifiers trained for our system.
Example training images for the attributes in bold are shown in the
previous image.
Amazon Mechanical Turk job for labeling attributes:
We use Amazon Mechanical Turk to label images
with attributes. This online service allows us to easily and inexpensively
label images using large numbers of human workers. This image shows an example
of our attribute labeling jobs. We were able to collect over 125,000 human
labels in a month, for $5,000.
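One simple way to turn redundant worker responses into training labels is majority voting over each (image, attribute) pair. The sketch below is an illustrative aggregation with made-up responses, not the exact procedure used to build our label set.

```python
from collections import Counter

# Hypothetical worker responses: (image_id, attribute, answer) triples.
responses = [
    ("img_001", "male", "yes"), ("img_001", "male", "yes"), ("img_001", "male", "no"),
    ("img_002", "curly hair", "no"), ("img_002", "curly hair", "no"),
]

# Tally votes per (image, attribute) pair and keep the majority answer.
votes = {}
for image_id, attribute, answer in responses:
    votes.setdefault((image_id, attribute), Counter())[answer] += 1
labels = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

print(labels)  # {('img_001', 'male'): 'yes', ('img_002', 'curly hair'): 'no'}
```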
Attribute classifier outputs:
An attribute classifier can be trained to recognize the presence or absence of
a describable aspect of visual appearance. The responses for several
such attribute classifiers are shown for a pair of images of Halle Berry. Note
that the "flash" and "shiny skin" attributes produce very different responses,
while the responses for the remaining attributes are in strong agreement
despite the changes in pose, illumination, expression, and image quality.
Training images for simile classifiers:
Each simile classifier is trained using several images of a specific reference
person, limited to a small face region such as the eyes, nose, or mouth. We
show here three positive and three negative examples for four regions on two of
the reference people used to train these classifiers.
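The simile classifiers can be sketched the same way: for one reference person and one face region, the positives are that region cropped from the reference person's images and the negatives are the same region from other people. The random features and the RBF SVM below are again placeholders for the actual descriptors and classifier choices.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Stand-ins for descriptors of one face region (say, the eyes) cropped from
# images of the reference person (positives) and from other people (negatives).
reference_eyes = rng.normal(loc=0.5, size=(300, 64))
other_eyes = rng.normal(loc=-0.5, size=(300, 64))

X = np.vstack([reference_eyes, other_eyes])
y = np.concatenate([np.ones(len(reference_eyes)), np.zeros(len(other_eyes))])

# One simile classifier per (reference person, face region) combination.
simile_clf = SVC(kernel="rbf", gamma="scale").fit(X, y)

# Its decision value answers: how much do these eyes resemble the reference person's?
query_eyes = rng.normal(size=(1, 64))
print(simile_clf.decision_function(query_eyes))
```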
Simile classifier outputs:
We use a large number of "simile" classifiers trained to recognize the
similarities of parts of faces to specific reference people. The
responses for several such simile classifiers are shown for a pair of images of
Harrison Ford. R_j denotes reference person j, so the first bar on the left
displays the similarity to the eyes of reference person 1. Note that the
responses are, for the most part, in agreement despite the changes in pose,
illumination, and expression.
Face Verification Results on LFW:
The performance of our attribute classifiers, simile classifiers, and a hybrid of
the two is shown in solid red, blue, and green, respectively. All three of our
methods outperform all previous methods (dashed lines). Our highest accuracy is
85.29%, which corresponds to a 31.68% lower error rate than the previous
state-of-the-art.
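For context, the 31.68% figure is a relative reduction in error rate; combined with our 85.29% accuracy, it implies a previous best of roughly 78.5%, as the short calculation below (using only the numbers reported here) shows.

```python
# Relative error reduction = 1 - (new error) / (previous error).
acc_new = 0.8529
err_new = 1.0 - acc_new                  # 0.1471
relative_drop = 0.3168                   # the reported 31.68% drop

err_prev = err_new / (1.0 - relative_drop)
print(f"implied previous best accuracy: {1.0 - err_prev:.2%}")  # ~78.47%
```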
Amazon Mechanical Turk job for human verification:
We asked human users on Amazon Mechanical Turk
to perform the face verification task on the LFW data set. This image shows an
example of what these jobs looked like. Using a total of 240,000 user
responses, we were able to plot human performance on LFW
|
Human Face Verification Results on LFW:
Human performance on LFW is almost perfect (99.20%) when people are shown the
original images (red line). Showing a more tightly cropped version of the images
(blue line) drops their accuracy to 97.53%, due to the lack of context
available. The green line shows that even with an inverse crop, i.e., when
only the context is shown, humans still perform amazingly well, at
94.27%. This highlights the strong context cues available in the LFW data set.
All of our methods mask out the background to avoid using this information.
The PubFig Data Set:
We show example images for the 140 people used for verification tests on the
PubFig benchmark. Below each image is the total number of face images for that
person in the entire data set.
Face Verification Results on PubFig:
Our performance on the entire benchmark set of 20,000 pairs using attribute
classifiers is shown in black. Performance on the pose, illumination, and
expression subsets of the benchmark is shown in red, blue, and green,
respectively. For each subset, the solid lines show results for the "easy" case
(frontal pose/lighting or neutral expression), and dashed lines show results
for the "difficult" case (non-frontal pose/lighting, non-neutral
expression).
Slides
ICCV 2009 presentation
Database
PubFig Database:
As a complement to the LFW data set, we have created a data set of images of
public figures, named PubFig. This data set consists of 60,000 images of
200 people. The larger number of images per person (as compared to LFW)
allows us to construct subsets of the data across different poses, lighting
conditions, and expressions, while still maintaining a sufficiently large
number of images within each set.
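As a rough picture of how such subsets can be formed, the sketch below filters hypothetical per-image pose, lighting, and expression annotations; the column names and label values are illustrative, not the actual PubFig metadata format.

```python
import pandas as pd

# Hypothetical per-image annotations; the real PubFig metadata may differ.
images = pd.DataFrame([
    {"person": "A", "file": "a_01.jpg", "pose": "frontal",     "lighting": "frontal", "expression": "neutral"},
    {"person": "A", "file": "a_02.jpg", "pose": "non-frontal", "lighting": "frontal", "expression": "smiling"},
    {"person": "B", "file": "b_01.jpg", "pose": "frontal",     "lighting": "side",    "expression": "neutral"},
])

# "Easy" pose subset: frontal faces only; the "difficult" subset is its complement.
easy_pose = images[images["pose"] == "frontal"]
hard_pose = images[images["pose"] != "frontal"]
print(len(easy_pose), "easy,", len(hard_pose), "difficult")
```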
FaceTracer: A Search Engine for Large Collections of Images with Faces
Face Swapping: Automatically Replacing Faces in Photographs
Appearance Matching
Labeled Faces in the Wild (UMass Project)