3D Morphable Face Models - Past, Present and Future
BERNHARD EGGER, Massachuses Institute of Technology, USA
WILLIAM A. P. SMITH, University of York, UK
AYUSH TEWARI, Max Planck Institute for Informatics & Saarland Informatics Campus, Germany
STEFANIE WUHRER, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
MICHAEL ZOLLHOEFER, Stanford University, USA
THABO BEELER, Disney Research|Studios, Switzerland
FLORIAN BERNARD, Max Planck Institute for Informatics & Saarland Informatics Campus, Germany
TIMO BOLKART, Max Planck Institute for Intelligent Systems, Germany
ADAM KORTYLEWSKI, Johns Hopkins University, USA
SAMI ROMDHANI, IDEMIA, France
CHRISTIAN THEOBALT, Max Planck Institute for Informatics & Saarland Informatics Campus, Germany
VOLKER BLANZ, University of Siegen, Germany
THOMAS VETTER, University of Basel, Switzerland
Fig. 1. 20 years of 3D Morphable Models. Fitting results from the original paper [Blanz and Vetter 1999], the first publicly available Morphable Model [Paysan
et al. 2009a], and state-of-the-art facial re-enactment results [Kim et al. 2018a] and GAN-based models [Gecer et al. 2019b].
In this paper, we provide a detailed survey of 3D Morphable Face Models
over the 20 years since they were first proposed. The challenges in building
and applying these models, namely capture, modeling, image formation,
and image analysis, are still active research topics, and we review the state-
of-the-art in each of these areas. We also look ahead, identifying unsolved
challenges, proposing directions for future research and highlighting the
broad range of current and future applications.
Keywords: 3D Computer Vision, Computer Graphics, Statistical Modelling,
Analysis-by-Synthesis, Generative Models
1 INTRODUCTION
It is 20 years since 3D Morphable Face Models were first presented
at SIGGRAPH '99. They were proposed as a general face representa-
tion and a principled approach to image analysis. Blanz and Vetter
[1999] introduced and tackled many subsidiary problems and the
results were considered groundbreaking. The impact of the original
paper has been long term, recognized by an impact paper award, and
the approach and applications are accessible to a wide audience (the
original supplementary video was one of the most popular videos
in the early days of YouTube). However, the approach is not just of
historical interest. In the past two years, 3D Morphable Face Models
have been re-discovered in the context of deep learning and are
incorporated into many state-of-the-art solutions for face analysis.
This survey aims to provide a starting point for researchers new to the
topic, to act as a reference guide for the community around 3D Mor-
phable Models, and to introduce exciting open research questions.
1.1 Definition
A 3D Morphable Face Model is a generative model for face shape
and appearance that is based on two key ideas: First, all faces are
in dense point-to-point correspondence, which is usually estab-
lished on a set of example faces in a registration procedure and then
maintained throughout any further processing steps. Due to this
Figure 2 diagram: a 3D Database feeds a Modeler, which builds the Morphable Face Model; a Face Analyzer maps 2D Input to 3D Output.
Fig. 2. The visual abstract of the seminal work by Blanz and Vetter [1999]. It
proposes a statistical model for faces to perform 3D reconstruction from 2D
images and a parametric face space which enables controlled manipulation.
correspondence, linear combinations of faces may be defined in a
meaningful way, producing morphologically realistic faces (morphs).
The second idea is to separate facial shape and color and to disen-
tangle these from external factors such as illumination and camera
parameters. The Morphable Model may involve a statistical model of
the distribution of faces, which was a principal component analysis
in the original work [Blanz and Vetter 1999] and has included other
learning techniques in subsequent work.
1.2 History
The initial research question behind the idea of 3D Morphable Mod-
els (3DMM) was how a visual system, biological or artificial, can
cope with the high variety of images that a single class of objects
can generate, and how objects are represented to solve vision tasks.
The leading assumption for the development of 3DMMs was that
prior knowledge about object classes plays an important role in
vision and helps to solve otherwise ill-posed problems. 3DMMs are
designed to capture such prior knowledge, and they are learned
automatically from a set of examples. The representation is general,
so it may be applied to different objects and tasks.
Representations of faces and the task of face recognition have
been a focus of vision research for a long time. An important
and very influential paradigm shift in this field was the Eigenfaces
approach by Sirovich and Kirby [1987] and Turk and Pentland [1991],
which learned an explicit face representation from examples and
operated entirely on grey-levels in the image domain. Eigenfaces
treated images of faces as a vector space and performed a principal
component analysis, with the eigenvectors representing the main
modes of variation in that space. The drawback of Eigenfaces was
not only that it was limited to a fixed pose and illumination, but
that it had no effective representation of shape differences: when
the coefficients in linear combinations of eigenvectors are changed
continuously, structures will fade in and out, rather than shift along
the image plane. As a consequence, the model fails to find a single
parameter for, say, the distance between the eyes. The Eigenfaces
approach was also extended to 3D face surfaces by Atick et al. [1996]
to model shading variations in faces, yet with essentially the same
limitation.
Several research groups proceeded by adding an Eigendecompo-
sition of 2D shape variations between individual faces. This pro-
vided both an explicit shape model and, after warping the images,
an aligned Eigenface model without blurring and ghosting arti-
facts. While in the original Eigenface approach the images were
only aligned by a single point (e.g., the tip of the nose), the new
methods established correspondence on significantly more points.
Landmark-based face warping for image analysis was introduced
by Craw and Cameron [1991]. Using approximately 200 landmarks,
the first statistical shape model was proposed in Active Shape Mod-
els [Cootes et al. 1995]. While this model used shape only, Active
Appearance Models [Cootes et al. 1998] proposed a combination
of shape and appearance that turned out to be very successful and
influential. Other groups computed dense pixel-wise image corre-
spondences with optic-flow algorithms for modeling the facial shape
variations [Hallinan et al. 1999; Jones and Poggio 1998]. In all these
correspondence-based approaches, images are warped to a common
template, and the appearance variation is then modeled in the
same way as in the original Eigenfaces, but on the shape-normalized
images. The shape model, on the other hand, provides a powerful
and compact representation of shape differences by shifting pixels in
the image plane. However, compared to the simple linear projection
in Eigenfaces, the image analysis task is transformed into a more
challenging nonlinear model fitting problem.
These 2D models efficiently covered the shape variation for a
fixed pose and illumination setting. The framework was extended
to variations across pose by Vetter and Poggio [1997] and to other
object classes, such as images of cars [Jones and Poggio 1998]. All
this groundwork demonstrated that a separation of shape and tex-
ture information in images can model the variation of faces. On
the other hand, the price to pay for taking pose and illumination
variations into account was high: eventually, it would require many
separate models, each limited to a small range of poses and illu-
minations. In contrast, the progress of 3D Computer Graphics in
the 1990s demonstrated that variations in pose and illumination are
easy to simulate, including self-occlusion and shadowing. Adapt-
ing methods from graphics to face modeling and computer vision
led to the new face representation in 3DMMs and the idea of using
analysis-by-synthesis to map between the 3D and 2D domain. Those
were the two key contributions in the first paper on 3DMMs [Blanz
and Vetter 1999]; see Figure 2. The name Morphable Model
was derived from their 2D counterpart [Jones and Poggio 1998], and
in fact, Jones and Poggio strongly inuenced the ideas that led to
3DMMs.
3DMMs and 2D Morphable Models rely on dense correspondence,
rather than only a set of facial feature points. In the original work,
this was established by an optical flow algorithm for image regis-
tration. The image synthesis algorithm used a standard rendering
model with perspective projection, ambient and directional lighting,
and a Phong model of surface reflectance that includes a specu-
lar component. However, in analysis-by-synthesis, this approach
comes at a computational price because shape-camera [Smith 2016]
and illumination-albedo [Egger 2018] ambiguities lead to a hard
ill-posed optimization problem. Moreover, the optimization is costly
and is prone to end in unwanted local optima. Just as it is already
dramatically more complicated to fit an Active Appearance Model
to a 2D image, compared to the simple projection needed for Eigen-
faces, the complexity of 3DMM fitting raises additional problems
which have remained challenging to researchers after 20 years of
development.
At the time the initial 3DMM was developed, image-based models
were dominating computer vision and even animation [Ezzat et al. 2002],
and they were rather elaborate at that time. It was a key
decision to take the best of both the 2D and the 3D world, by using 3D
models to manipulate existing images on the one hand, and applying
2D algorithms to 3D surfaces: Unlike mesh-based algorithms, the
original 3DMM used optical flow, multi-resolution approaches and
interpolation algorithms on parameterized surfaces of faces. With
the initial face scanner delivering surfaces in a two-dimensional
cylinder parameterization, all those steps were performed in 2D,
and most of the methods involved were replaced with their 3D
equivalent only many years later. It is interesting to see that after
a development towards 3D, the computer vision community came
back to 2D representations by using deep learning, and now evolves
again to 3D, e.g., by integrating 3DMMs.
Over the past years, 3DMMs were applied beyond faces. Models
were built for the surface of the human body [Allen et al. 2003;
Anguelov et al. 2005; Loper et al. 2015] and for other specific parts
of the body like ears [Dai et al. 2018] and hands [Khamis et al. 2015],
animals [Sun and Murata 2020; Zuffi et al. 2018] and even
cars [Shelton 2000]. In this survey, we focus on 3DMMs to model
the human face, though many of the techniques and challenges are
the same across different object classes.
The 3DMM was developed in a time where algorithms and data
were rarely shared across researchers and institutions. 10 years
later the rst publicly available 3DMM was released [Paysan et al
.
2009a] and in the last 10 years, all individual data and algorithmic
components needed to build and use 3DMMs were released by
various researchers. We collected a list of all available resources and
will further maintain it [Community 2019].
The 3DMM was built as a general representation for faces, not just
aiming at one specic task. Even though the model is outperformed
for some very specic applications such as face recognition, it is
unique in its generality across dierent tasks and applications.
1.3 Organization
There is a recent state-of-the-art report on monocular 3D reconstruc-
tion, tracking and applications [Zollhoefer et al. 2018]. This focuses
on the most recent advances, particularly related to the specific
task of tracking and reconstruction. In contrast, in this paper, we
instead focus on the 3DMM, all involved methods, and reflect the
major contributions over the past 20 years while at the same time
highlighting challenges and future directions.
This survey is organized from building to applying a 3DMM.
We start with Section 2 where we present methods to acquire 3D
facial data for model building. We then describe in Section 3 the
various approaches to model the 3D shape and facial appearance.
In Section 4 we discuss the methods to generate a 2D image from
our 3D model using computer graphics. Our Section 5 surveys the
major application of 3DMMs, namely the reconstruction of a 3D
face from a 2D image. Section 6 summarizes the impact of 3DMMs
in the recent advances in the eld of deep learning and how deep
learning can be used to improve the modeling and analysis. Section
7 summarizes the various applications where 3DMMs were used in
the past 20 years. Every section summarizes the major challenges
the authors see regarding the current limitations of 3DMMs. We also
collect challenges that are shared across multiple sections in Section
8, where we also venture an outlook on what we expect to see in the
next 10 to 20 years and how the 3DMM will keep impacting how
faces are represented.
2 FACE CAPTURE
The key ingredient to any 3DMM is a representative set of 3D shapes,
usually coupled with corresponding appearance data. The typical
way to construct such a sample pool is by acquiring data from the
real world. In this section, we give a brief overview of different ap-
proaches that have been used to acquire facial data as well as data of
facial parts. As we are concerned with the creation of input datasets
for 3DMMs, we limit the discussion to acquisition under controlled
conditions, as opposed to the more challenging in-the-wild setting.
Note that controlled 3D face capture may not always be necessary.
There have been attempts to learn 3DMMs directly from images
[Cashman and Fitzgibbon 2012] and state-of-the-art deep learning-
based methods simultaneously learn a 3DMM and regression-based
fitting from 2D training data (see Section 6.3). In this section we
begin by covering shape acquisition methods in Section 2.1 includ-
ing geometric, photometric and hybrid methods. Sections 2.2, 2.3
and 2.4 describe methods for capture of appearance, face parts and
dynamics respectively. Section 2.5 lists publicly available 3D face
datasets that could be exploited for building 3DMMs. Finally, we
consider open challenges related to face capture in Section 2.6.
2.1 Shape Acquisition
The three-dimensional shape is arguably the most important in-
gredient to a 3DMM. The issue of shape representation has not
been widely considered in the context of 3DMMs. By far the most
commonly used representation is a triangle mesh. Rare exceptions
include cylindrical [Atick et al
.
1996] and orthographic [Dovgard
and Basri 2004] depth maps (though these representations do not per-
mit meaningful dense correspondence), per-vertex surface normals
[Aldrian and Smith 2012], and, more recently, volumetric orientation
elds [Saito et al
.
2018] and signed distance functions [Park et al
.
2019]. Using a triangle mesh representation, dense correspondence
requires that all samples exhibit the same topology and that the
vertices encode the same semantic point on all samples. Establishing
correspondence across the samples is a challenging topic in itself,
discussed in Section 3.5. In this section, we focus on the acquisition
of raw 3D data, before establishing correspondence.
2.1.1 Geometric methods. Geometric methods directly estimate the
3D coordinates of a shape either by observing the same surface
point from two or more viewpoints (in which case the challenge is
identifying corresponding points between images) or by observing
a projected pattern (in which case the challenge is identifying the
correspondence between the known pattern and an image of its
projection). Methods can either be considered active, i.e. they emit
light or other signals into the scene, or passive. Laser scanners,
(a) Diuse albedo
(b) Specular albedo
(c) MVS geometry (d) Hybrid geometry (e) Rendering
Fig. 3. Capture of intrinsic face properties using a hybrid geometric/photometric method [Seck et al
.
2016; Smith et al
.
2020]. Multi view stereo (MVS) is used
to reconstruct a coarse mesh (c). A photometric light stage [Ma et al
.
2007] is used to capture diuse and specular albedo maps (a,b) and surface normals that
are merged with the MVS mesh to produce a mesh with fine surface detail (d). Together, these can be used to synthesize highly realistic images of the face (e).
Time-of-Flight sensors, and Structured Light systems are active
systems, whereas multi-view photogrammetry is a passive alternative.
Active multi-view photogrammetry may be considered a hybrid
active/passive approach, as it relies on passive photogrammetry to
reconstruct the shape, but augments the object with a well-defined
texture projection that benefits the reconstruction [Zhang et al. 2004].
Unlike structured light, the origin of the light does not matter,
as the projected texture is solely meant to augment the texture
used for multi-view stereo matching. This type of technology is
used, for example, by the Intel® RealSense D435 camera
(https://www.intelrealsense.com/depth-camera-d435/). In the
early days of 3DMMs, active systems were the only real option
to acquire 3D shapes at a reasonable quality. The original paper
of Blanz and Vetter [1999] relied on laser scanning [Levoy et al. 2000],
where the face is rasterized via one or more laser beams.
The laser beam illuminates the face surface at a point and, using
the known camera/laser arrangement, the 3D position of this point
may be triangulated. The biggest drawback of laser scanners is the
acquisition time: as only very few samples are gathered at any given
time, even at very high frame rates such systems require the
subjects to sit still for several seconds.
Structured light scanners [Geng 2011] overcome this limitation
to some extent by injecting not only a few beams but leveraging
projectors that offer millions of them. The challenge here is to iden-
tify which beam is illuminating the object at a given point. This is
addressed by structuring the projected light in a way that allows one
to clearly identify the origin of any ray. The simplest approach is
binary encoding, which projects black and white patterns assign-
ing a unique binary code to each pixel. The required number of
patterns is still quite substantial: for VGA resolution one needs 19
distinct patterns and for 4K resolution 23 patterns, and hence this
approach is most suited for capturing static objects. However, tech-
nical improvements have begun to make these approaches viable
for dynamic capture of faces. The Intel® RealSense SR300 uses
only 9 binary patterns to obtain VGA, while the most recent Re-
alSense depth camera produces VGA resolution at 60 depth FPS with
a scanning laser technology. Other more complex structured light
methods have been proposed, such as gray codes or (colored) fringe
patterns, which can reduce the number of required frames further,
in extreme cases even to a single frame. A very popular commercial
system that was used to create face datasets [Cao et al. 2014b]
and that employs structured light is the first-generation Kinect sensor
(https://en.wikipedia.org/wiki/Kinect#Kinect_for_Xbox_360_(2010)).
The device employs a structured dot pattern, which allows
reconstructing depth from a single frame by sacrificing spatial reso-
lution. Resolution may be improved by accumulating several frames
[Newcombe et al. 2011]. With the increased resolution and quality
of consumer cameras, passive systems have become the method of
choice in most cases, since they are simpler to assemble and op-
erate, and off-the-shelf photogrammetry software solutions, both
commercial such as Agisoft (https://www.agisoft.com) or
RealityCapture (https://www.capturingreality.com), as well as open-
source solutions such as Meshroom (https://alicevision.org/),
provide very good results on human faces. Also, complete systems
can be purchased that come with both hard- and software
(e.g., https://www.canfieldsci.com/imaging-systems/vectra-m3-3d-imaging-system/,
http://www.di4d.com, http://www.3dmd.com/). These methods typically do not
require the aggregation of information over time and hence offer
themselves for single-shot acquisition [Beeler et al. 2010] as well as
full-frame rate performance capture [Beeler et al. 2011; Bradley et al. 2010;
Furukawa and Ponce 2009]. A potential disadvantage of the
aforementioned systems is their form factor, since they all require
at least some separation between the different participating compo-
nents, i.e. the cameras or lights, often referred to as the baseline. An
alternative which becomes more and more viable due to the push of
the mobile industry are time-of-flight sensors, where the elements
can be located close to each other. The second-generation Kinect
sensor (https://en.wikipedia.org/wiki/Kinect#Kinect_for_Xbox_One_(2013))
belongs to this family, as well as many depth sensors that
are shipped with modern mobile phones. A challenge that time-of-
flight sensors share with most of the active systems is that color
information has to be acquired separately and is not intrinsically
aligned with the 3D data, which is another advantage of passive
setups.
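As a rough, back-of-the-envelope illustration of the binary-coding argument above (a minimal sketch with a made-up helper name; the exact counts depend on the chosen code and projector resolution), the number of patterns can be estimated as the number of bits needed to give every projector column and row a unique label:

    import math

    def binary_pattern_count(width: int, height: int) -> int:
        # One pattern per bit needed to uniquely label every projector
        # column and every projector row.
        return math.ceil(math.log2(width)) + math.ceil(math.log2(height))

    print(binary_pattern_count(640, 480))    # VGA: 10 + 9 = 19 patterns
    print(binary_pattern_count(3840, 2160))  # 4K-class: 12 + 12 = 24 patterns

This reproduces the 19 patterns quoted above for VGA; the figure for 4K depends on the exact resolution and encoding assumed, and Gray codes, fringe patterns, or dot patterns trade the pattern count against spatial resolution, as discussed above.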
2.1.2 Photometric methods. Photometric methods typically esti-
mate surface orientation, from which the 3D shape may be recov-
ered via integration. The challenge here is to select models that
accurately capture the reflectance properties of the surface and to
obtain sufficient measurements such that the inversion of these models is
well-posed. Compared to geometric methods, photometric methods
typically offer higher shape detail and do not rely on the presence
of matchable features (so are applicable to smooth, featureless sur-
faces), but often suffer from low-frequency bias in the reconstructed
positions caused by modeling errors in reflectance and illumination.
Photometric stereo [Ackermann et al. 2015] estimates the surface
normal at each pixel by observing a scene from a fixed position
under at least three different illumination conditions, which can be
spectrally multiplexed [Hernández et al. 2007] in order to reduce
the number of frames required. Early work assumed known lighting
directions and perfectly diffuse reflectance. When illumination is
uncalibrated and a more suitable glossy reflectance model is used,
generic face priors can be used to resolve the resulting ambiguity
[Georghiades 2003]. Typically more lighting conditions are used to
increase robustness and coverage, such as four [Zafeiriou et al. 2013]
or even nine [Gotardo et al. 2015]. Gradient-based illumination takes
the number of conditions to the extreme, by illuminating the subject
not with discrete individual point lights, but by an ideally continu-
ous, omnidirectional incident illumination gradient. An advantage
of this setup is that hard light source occlusions (cast shadows) are
replaced by soft partial occlusions of the illuminating hemisphere
(ambient occlusion). In practice, the omnidirectional illumination is
realized via a light stage [Debevec et al. 2000], which discretizes the
gradient with a large number (several hundred) of light sources. The
original work of Ma et al. [2007] suggests the use of four distinct
gradients, which has later been extended using complementary gra-
dients [Wilson et al. 2010]. Again, variants of temporal, spectral and
polarization multiplexing have been proposed to reduce the number
of required conditions.
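To make the classical calibrated setting concrete, the following is a minimal sketch of Lambertian photometric stereo (hypothetical function and variable names; it assumes known light directions, no shadows and no specularities): with at least three images per pixel, the albedo-scaled normal is recovered by linear least squares.

    import numpy as np

    def lambertian_photometric_stereo(images, light_dirs):
        # images:     (k, h, w) stack of grayscale images under k known lights
        # light_dirs: (k, 3) unit light direction per image
        k, h, w = images.shape
        I = images.reshape(k, -1)                           # (k, h*w) intensities
        # Lambertian model: I = L @ g, with g = albedo * normal per pixel.
        g, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)  # (3, h*w)
        albedo = np.linalg.norm(g, axis=0)
        normals = (g / np.maximum(albedo, 1e-8)).T.reshape(h, w, 3)
        return normals, albedo.reshape(h, w)

Uncalibrated lights, glossy reflectance, or gradient illumination replace this simple linear inversion with the more involved formulations cited above.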
2.1.3 Hybrid methods. Hybrid methods combine the strengths of
geometric and photometric methods; specifically, they reduce the
low-frequency bias typically present in photometric methods and
increase the high-frequency details when compared to geometric
methods. Nehab et al. [2005] propose a method for merging the low
frequencies of positional information and the high frequencies of
surface normals. The method is particularly efficient, involving only
the solution of a sparse linear system of equations, and has been
used in the context of 3DMM fitting [Patel and Smith 2012]. Various
combinations of geometric and photometric methods have been
considered. For example, Zivanov et al. [2009] combine structured
light with photometric stereo, Ma et al. [2007] combine structured
light with gradient-based illumination, Ghosh et al. [2011] combine
multi-view stereo with gradient-based illumination, and Beeler et al. [2010]
combine passive multi-view photogrammetry with shape-
from-shading. Figure 3(d) shows the output of a hybrid method in
which photometric surface normals are merged with a multi-view
stereo mesh.
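The following is a deliberately simplified 1D toy version of this position/normal merging idea (a sketch under our own simplifying assumptions, not the actual method of Nehab et al. [2005]): the low frequencies come from geometric depth samples and the high frequencies from slopes derived from photometric normals, combined in one small screened least-squares problem.

    import numpy as np

    def merge_depth_and_slopes(z_geo, slopes, lam=0.1):
        # z_geo:  (n,) depth from a geometric method (reliable at low frequencies)
        # slopes: (n-1,) slopes from photometric normals (reliable at high frequencies)
        # Minimizes ||D z - slopes||^2 + lam * ||z - z_geo||^2 over z.
        n = len(z_geo)
        D = np.diff(np.eye(n), axis=0)        # forward-difference operator, (n-1, n)
        A = D.T @ D + lam * np.eye(n)
        b = D.T @ slopes + lam * z_geo
        return np.linalg.solve(A, b)

A small weight lam keeps the solution anchored to the geometric depth at low frequencies while the slope term restores fine detail, mirroring the intuition behind the hybrid results in Figure 3(d).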
2.2 Appearance Capture
In addition to shape, appearance is also required for many 3DMM
tasks, such as synthesizing images (see Section 4) and inverse ren-
dering (see Section 5). Unlike shapes, which are almost exclusively
represented as triangular meshes, appearance representation varies
substantially. While in theory, every vertex of the mesh could have
an associated appearance property, typically shapes are parameter-
ized to the 2D domain and textures are used to store appearance
properties. Appearance can be as simple as backprojecting the color
of the images onto the shapes, which causes shading effects to be
baked in. Self-occlusion, in particular when only a single viewpoint
is available, results in missing data in the occluded areas, which
must be hallucinated somehow. Booth et al. [2018b] use 3DMM
fits to in-the-wild images and Principal Component Pursuit with
missing values to complete the unobserved texture. They build their
appearance model directly on the sampled textures. Such a sim-
plistic approach, however, does not allow intrinsic face appearance
properties to be separated from shading/shadowing (and hence illu-
mination/geometry). A partial solution to this problem is to control
illumination conditions during capture, for example by using mul-
tiple light sources to create approximately ambient lighting. Note
that a truly Lambertian convex surface observed under truly ambi-
ent light gives exactly the albedo [Lee et al. 2005]. The appearance
models in the most popular 3DMMs [Booth et al. 2018a; Dai et al.
2017; Paysan et al. 2009a] use this approach, combining images from
multiple cameras to provide full coverage of the face with diffuse
lighting to approximate albedo. A better approach is to explicitly
separate shading from skin color, often referred to as intrinsic de-
composition. This allows relighting of the face under novel incident
illumination conditions, and a 3DMM built on such data truly mod-
els intrinsic characteristics of the face. Several approaches have
been presented over the years to acquire reflectance data suited for
parametric rendering, measuring surface reflectance [Marschner
et al. 1999] and even subsurface scattering properties [Ghosh et al.
2008]. The polarised spherical illumination environment used by
Ma et al. [2007] enables diffuse albedo to be captured in a single shot
and specular albedo in two images (see Figure 3(a) and (b)). While
such approaches have predominately used active setups, recently
capture under passive conditions has been demonstrated [Gotardo
et al. 2018].
2.3 Face part specific methods
Certain parts of the human face require more targeted acquisition
methods and devices since they do not conform with the assump-
tions typically made by the abovementioned approaches. For example,
the frontmost part of the eye, the cornea, is for obvious reasons
fully transparent and distorts the appearance of the underlying iris
due to refraction. Bérard et al. [2014] leverage a combination of
several specialized algorithms, including shape-from-specularity,
in order to reconstruct all visible components of the eye. Another
challenging example is teeth [Wu et al. 2016a], which exhibit ex-
tremely challenging appearance [Velinov et al. 2018]. Hair violates
the common assumption that the reconstructed shape is a smooth
continuous surface, and requires specialized approaches that esti-
mate hair fibers [Beeler et al. 2012], hair strands [Hu et al. 2014a;
Luo et al. 2013] and braiding [Hu et al. 2014b], or even encode hair
as a surface [Echevarria et al. 2014] for manufacturing. While most
hair acquisition focuses on static reconstruction, some do capture
hair in motion [Xu et al. 2014] or estimate physical properties for
hair simulation [Hu et al. 2017a]. Especially challenging is the acqui-
sition of partially or completely hidden properties, such as the tongue
[Hewer et al. 2018], the skull [Achenbach et al. 2018; Beeler and
Bradley 2014], or the jaw [Zoss et al. 2019, 2018], where oftentimes
specialized imaging systems are required, such as Computed Tomog-
raphy (CT), Magnetic Resonance Imaging (MRI), or Electromagnetic
Articulography (EMA). Lastly, even skin itself requires specialized
treatment in some areas, such as lips [Garrido et al. 2016b] or eyelids
[Bermano et al. 2015], where the local appearance and deformation
exceed the capabilities of the more generic methods.
2.4 Dynamic capture
Historically, 3DMMs have been mostly concerned with static shapes,
for example with a set of neutral shapes from different individuals
or with a discrete set of expressions per individual, neglecting how
the face transitions between expressions. Most capture systems used
to build 3DMMs were hence static systems, focused on capturing
individual shapes rather than full performances. As the field begins
to integrate more temporal information into the models, the need
for dynamic capture systems will rise. Active systems have been
considered, both geometric [Zhang et al. 2004] and photometric
[Wilson et al. 2010]. However, passive systems [Beeler et al. 2011;
Bradley et al. 2010] are currently the technologies of choice, since
they do not require temporal multiplexing and still deliver high-
quality shapes, and more recently even per-frame reflectance data
[Gotardo et al. 2018]. A beneficial side-effect of such technologies is
that they often provide shapes that are already in correspondence,
removing the need to establish correspondence in a post-processing
step (Section 3.5), and making them attractive solutions even when
only a discrete set of shapes is desired. Available commercial solu-
tions include Di4D (http://www.di4d.com/), 3dMD (http://www.3dmd.com/),
and the Medusa system (https://studios.disneyresearch.com/medusa/).
2.5 Publicly available face datasets
A relatively large number of publicly available datasets exist that
could be leveraged in the construction of 3DMMs, though many
have never been used for this purpose. We believe there is not broad
awareness of the range of 3D datasets available and so collect them
together in Table 1. We hope that this will encourage work that
seeks to exploit multiple datasets for 3DMM building.
2.6 Open challenges
The eld of face capture is far ahead of face modeling in general and
3DMMs in particular. There is a large gap between the quality of
data that can be captured and the data actually used to build 3DMMs.
There is a further gap between the quality of this already-deficient
data and what a 3DMM is able to synthesize (see Section 3). Hence,
from the perspective of 3DMMs, the open challenges in capture do
not generally relate to improving the acquisition quality, but to the
lack of publicly available data. While there is a decent number of
datasets publicly available (see Section 2.5), most of these contain
only moderate quality shape data and no appearance information,
with the exception of [Stratou et al. 2011], which consists of 23
identities only. We believe that the lack of high-quality datasets is
due to a variety of reasons. On the one hand, high-quality acquisi-
tion devices that can capture both shape and appearance are not
readily available. Most of them are custom-built, cannot easily be
purchased or licensed, and require expert knowledge for operation.
On the other hand, acquiring and processing data may be a time
and resource-intense eort, since many systems in the research
community were not conceived for scalable deployment but for
experimental use; slow capture methods are not applicable to young
or elderly people, expensive setups are challenging to replicate on a
global scale to capture whole populations, and methods requiring
very bright illumination make it unpleasant to be captured with
eyes open. Furthermore, most high-quality systems, in particular
ones that also measure appearance, generally require controlled lab
conditions which makes it dicult to capture large numbers of the
general public. Advances in face capture may alleviate some of these
issues.
Additionally, there are many important broader questions related
to data acquisition that remain unanswered. How many faces do we
really need to capture in order to build a representative (universal)
model? How can we ensure we capture natural expressions? Most
people are not trained to perform specific expressions (i.e. FACS,
https://en.wikipedia.org/wiki/Facial_Action_Coding_System),
and will have difficulties performing naturally when put in a capture
setup, leading to a biased dataset. How should we deal with bias
in general and what is the right sampling strategy with respect to
age, gender, ethnicity and so on? Are the capture methods them-
selves biased? For example, capturing faces with very dark skin is
challenging for both photometric and geometric methods. Should
we accept that we cannot hope to capture sufficiently broad data
and therefore rely on synthesizing additional data or using captured
data to build a bootstrap model that is refined on large 2D datasets?
These approaches are discussed in Section 6.
Finally, there are some philosophical and ethical issues to con-
sider. The human face is unique and highly personal. Once a face
has been captured in high detail, it is possible to synthesize new
images that are almost indistinguishable from photos. If captured
datasets are made publicly available, it is very difficult to control the
distribution and use of such data. Obtaining proper informed con-
sent is, therefore, both legally and ethically important but perhaps
even this does not go far enough, particularly when consent for
minors is given by parents. These issues are beyond the expertise
dataset | format and resolution | coverage | no. samples | scanner
Spacetime faces [Zhang et al. 2004] | triangle mesh (23k vertices, consistent topology) | inner face only | 1 individual × 384-frame dynamic sequence | structured light
CASIA 3D Face Database [cas 2005] | 640 × 480 depth map and texture image | face, neck, sometimes ears | 123 individuals × 37-38 scans (expression, pose, illumination) | Minolta Vivid910
BU-3DFE [Yin et al. 2006] | triangle mesh (20k-35k triangles), two texture images (1,300 × 900) | face, neck, sometimes ears | 100 individuals × 25 expressions | 3dMD
BU-4DFE [Yin et al. 2008] | triangle mesh (35k vertices), texture image (1,040 × 1,329) | face, neck, sometimes ears | 101 individuals × six 100-frame expression sequences | Dimensional Imaging
Bosphorus [Savran et al. 2008] | 1,600 × 1,200 depth map and texture image | inner face only | 105 individuals × up to 35 expressions per subject + 13 poses | Inspeck Mega Capturor II
York 3D Face Database [Heseltine et al. 2008] | depth map containing 5k-6k points, texture image | inner face only | 350 individuals × 15 expressions | projected pattern stereo
B3D(AC)^2 [Fanelli et al. 2010] | raw scan: triangle mesh (55k vertices), 780 × 580 texture image; processed: triangle mesh (23k vertices, consistent topology), 1,024 × 768 UV texture map | inner face only | 14 individuals × around 80 dynamic sequences (speech-4D) | structured light stereo
Florence 3D Faces [Bagdanov et al. 2011] | triangle mesh (60k-80k triangles), 4 MPixel texture, additional 2D HD video | face, neck, sometimes ears | 53 individuals | 3dMD
D3DFACS [Cosker et al. 2011] | triangle mesh (30k vertices), 1,024 × 1,280 UV texture map | face, neck, sometimes ears | 10 individuals × around 52 dynamic sequences, FACS coded | 3dMD
3DRFE [Stratou et al. 2011] | triangle mesh (1.2M vertices), 1,296 × 1,944 diffuse and specular albedo maps and hybrid normal maps | inner face, neck | 23 individuals × 15 expressions | light stage
Hi4D-ADSIP [Matuszewski et al. 2012] | triangle mesh (20k vertices), texture image | inner face only | 80 individuals × around 42 dynamic sequences | Dimensional Imaging
BP4D-Spontaneous [Zhang et al. 2014] | triangle mesh (30k-50k vertices), texture image (1,040 × 1,329) | face, neck, sometimes ears | 41 individuals × eight one-minute dynamic sequences | Dimensional Imaging
3D Dynamic Database for Unconstrained Face Recognition [Alashkar et al. 2014] | 3.5k vertices for dynamic, 50k vertices for static, texture image | inner face only | 58 individuals × one static scan + seven dynamic sequences | Artec
FaceWarehouse [Cao et al. 2014b] | raw: 640 × 480 RGBD; processed: triangle mesh (11k vertices, consistent topology) | | 150 individuals × 20 expressions | Microsoft Kinect
MMSE [Zhang et al. 2016a] | triangle mesh (30k-50k vertices), 1,040 × 1,392 texture image | inner face only | 140 individuals × four dynamic sequences | Dimensional Imaging
Headspace [Dai et al. 2017] | triangle mesh (180k vertices), 2,973 × 3,055 UV texture map | full head including face, neck, ears | 1,519 individuals | 3dMD
4DFAB [Cheng et al. 2018] | triangle mesh (60k-75k vertices), UV texture map | face, neck and ears | 180 individuals × 4k-16k frames of dynamic sequences | Dimensional Imaging
CoMA [Ranjan et al. 2018] | triangle mesh (80k-140k vertices), texture images (avg. resolution 3,700 × 3,200), six raw camera images (each 1,600 × 1,200), alignments in FLAME topology | full head including face, neck, ears | 12 individuals × 12 extreme expression sequences | 3dMD
VOCASET [Cudeiro et al. 2019] | triangle mesh (80k-140k vertices), texture images (avg. resolution 3,700 × 3,200), six raw camera images (each 1,600 × 1,200), alignments in FLAME topology | full head including face, neck, ears, speech | 12 individuals × 40 dynamic sequences (speech-4D) | 3dMD
Table 1. Overview of publicly available 3D shape and/or appearance scans of human faces.
of computer graphics and vision researchers and perhaps suggest a
need for discussion and debate with other disciplines.
3 MODELING
This section outlines how to compute a 3DMM by modeling the
variations of digitized 3D human faces. In particular, the following
three types of variations are commonly considered. First, geometric
variations across dierent identities are captured in a shape model,
as outlined in Section 3.1. Commonly used models include global
models, which represent variations of the entire face surface, and
local models, which represent variations of facial parts. Second,
geometric variations across dierent facial expressions are captured
in an expression model, as outlined in Section 3.2. Commonly used
models can be mainly classied into additive and multiplicative
models. More recently, nonlinear expression models are starting to
be explored. Third, variation in appearance and illumination are
captured in a separate appearance model as outlined in Section 3.3.
It is interesting to note that the landmark paper on 3DMMs pub-
lished 20 years ago [Blanz and Vetter 1999] proposed rst models
for all three types of variation that are still commonly used today.
To compute shape, expression, or appearance models, statistics
are performed over a database of face data, where traditionally 3D
scans of faces were used, and more recently some approaches also
learn face models directly from 2D images, as outlined in Section 6.3.
This computation of statistics requires correspondence information,
that is, anatomically corresponding parts of the faces need to be com-
pared, and hence known either explicitly or implicitly. An overview
of how correspondence information is computed for faces is given
in Section 3.5. The most commonly used approach is to compute
correspondence information explicitly before computing the 3DMM.
Some recent methods compute correspondence information at the
same time as the 3DMM is built.
3DMMs are generative models, and the ability to synthesize novel
faces is a key feature, briefly discussed in Section 3.6. Finally,
this section provides a list of available models and discusses open
challenges on 3D face modeling in Sections 3.7 and 3.8, respectively.
3.1 Shape models
This section considers modeling geometric variation across different
subjects computed using classical modelling approaches that use
3D data. To use a set of 3D scans as training data, we require a dis-
tance measure between any pair of scans, and computing a distance
between raw scans consisting of different numbers of unstructured
vertices is a complex problem. Most commonly, the community
proceeds by rst pre-processing the dataset by deforming a tem-
plate mesh to all scans, which establishes anatomic correspondences
between the points of the scans (see Section 3.5). We denote the
surface of such a pre-processed mesh by $S$ in the following. The
$i$-th vertex of $S$ is denoted by $v_i \in \mathbb{R}^3$, and its associated vector
$c \in \mathbb{R}^{3n}$ contains the coordinates of the $v_i$ in a fixed order. All
meshes share a common triangulation. We denote the $i$-th triangle
by $t_i = (t_i^1, t_i^2, t_i^3) \in \{1, \dots, n\}^3$, where $t_i^1, t_i^2, t_i^3$ provide indices to
the associated vertices $v_{t_i^1}, v_{t_i^2}, v_{t_i^3}$, and we denote the complete
triangulation by $T = (t_1, \dots, t_m)$. Distances between shapes $S_1$ and
$S_2$ are computed as the difference between $c_1$ and $c_2$ after rigidly
aligning $S_1$ and $S_2$ in $\mathbb{R}^3$.
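As a concrete illustration of this rigid alignment step, the following is a minimal sketch of orthogonal Procrustes (Kabsch) alignment (the function name is made up, and row-wise dense correspondence between the two vertex arrays is assumed); $S_2$ is aligned to $S_1$ before taking the coordinate difference:

    import numpy as np

    def rigid_align(V1, V2):
        # V1, V2: (n, 3) vertex arrays in dense correspondence (row i matches row i).
        # Returns V2 rigidly aligned (rotation + translation) to V1.
        mu1, mu2 = V1.mean(axis=0), V2.mean(axis=0)
        A, B = V1 - mu1, V2 - mu2
        U, _, Vt = np.linalg.svd(B.T @ A)
        if np.linalg.det(U @ Vt) < 0:        # avoid reflections
            U[:, -1] *= -1
        R = U @ Vt
        return (V2 - mu2) @ R + mu1

    # Shape distance as described in the text: norm of c1 - c2 after alignment.
    # dist = np.linalg.norm(V1 - rigid_align(V1, V2))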
3DMMs most often follow Dryden and Mardia [2002] for their
denition of shape
S
as containing the geometric information re-
maining after having removed dierences caused by translation,
rotation, and sometimes uniform scaling. While scaling is typically
not removed for human faces, this is often done in geometric mor-
phometrics (e.g., [Dryden and Mardia 2002, Section 2]).
A shape space is traditionally defined as the set of all configu-
rations of $n$ vertices in $\mathbb{R}^3$ with fixed connectivity. Since we are
interested in modeling human faces only in the context of 3DMMs,
in the following, the term shape space refers to a $d$-dimensional
parameter space (with $d \ll n$) that represents plausible 3D human
faces. In this way, each 3D face has an associated parameter vector
$w \in \mathbb{R}^d$.
In 3DMMs, statistical shape analysis is used as a generative model,
i.e. the shape space has an associated probability distribution, called
a prior, that is defined by a density function $f(w)$ and that measures
the likelihood that a realistic 3D face would be represented by a
particular vector $w$ in shape space. With a slight abuse of notation,
we interpret $c$ as a generator function in the following as
$$c : \mathbb{R}^d \to \mathbb{R}^{3n} \quad (1)$$
that maps the low-dimensional parameter vector $w$ to the vector
of all vertex coordinates $c(w) \in \mathbb{R}^{3n}$. We again use $v_i(w) \in \mathbb{R}^3$ to
refer to the $i$-th vertex of the mesh given by $w$. While the resolution
(number of vertices) of the model is usually fixed, a progressive
mesh representation based on edge collapse simplification of the
generator function has been considered [Patel and Smith 2011].
This part considers the case where all faces in the training data
have a similar (typically neutral) expression; generator functions
that additionally model varying expressions are discussed in Sec-
tion 3.2. As in [Brunton et al. 2014b], our discussion distinguishes
global models that model the entire face or head area from local
models that perform statistics over localized areas.
3.1.1 Global models. Let $\{S_i\}_i$ denote the training shapes and
$\{c_i\}_i$ their associated coordinate vectors. The seminal work on
3DMMs [Blanz and Vetter 1999] proposed a global shape model
that uses principal component analysis (PCA) to compute the linear
generator function as
$$c(w) = \bar{c} + E w, \quad (2)$$
where $\bar{c}$ is the mean computed over the training data, $E \in \mathbb{R}^{3n \times d}$
is a matrix that contains the $d$ most dominant eigenvectors of the
covariance matrix computed over the shape differences $\{c_i - \bar{c}\}_i$, and
$w$ is the low-dimensional shape parameter vector. One hypothesis
of this model is that training faces can be linearly interpolated to
generate new 3D faces. Another hypothesis is that the 3D faces
in the reduced parameter space $\mathbb{R}^d$ follow a multivariate normal
distribution, which can be directly deduced from the eigenvalues
corresponding to $E$. This implies that the density function $f(w)$
evaluating the likelihood of the parametric representation $w$ in
shape space is simply the Mahalanobis distance of $w$ to the origin.
The 3DMM was originally computed over 200 subjects and has
proven to be useful in a variety of applications thanks to its power
to generate plausible shapes, and its simple underlying model. A
recent study rebuilds such a model from a very large dataset con-
taining 9,663 3D scans and revisits best practices [Booth et al. 2016],
demonstrating that the originally proposed generator function for
shape remains highly relevant in the research community.
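As a minimal sketch of how such a global PCA model (Equation (2)) can be built and sampled (variable names and the scaling convention are our assumptions; whether the basis is scaled by the per-component standard deviations is a modeling choice):

    import numpy as np

    def build_pca_model(C, d):
        # C: (m, 3n) registered training shapes, one flattened coordinate vector per row.
        c_bar = C.mean(axis=0)
        X = C - c_bar
        # Eigenvectors of the sample covariance via SVD of the centered data.
        _, s, Vt = np.linalg.svd(X, full_matrices=False)
        E = Vt[:d].T                              # d most dominant eigenvectors, (3n, d)
        sigma = s[:d] / np.sqrt(C.shape[0] - 1)   # standard deviation per component
        return c_bar, E, sigma

    def generate(c_bar, E, sigma, w):
        # Equation (2) with w expressed in units of standard deviations,
        # so that a standard normal prior on w applies.
        return c_bar + E @ (sigma * w)

    # Sampling a random plausible face under the Gaussian prior:
    # c_bar, E, sigma = build_pca_model(C_train, d=50)
    # face = generate(c_bar, E, sigma, np.random.randn(50))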
One observation by Blanz and Vetter [1999] is that moving the
representation vector $w$ away from the mean face increases its
distinctiveness, eventually leading to caricatures of the identity. In
order to model distinctive facial identities, Patel and Smith [2016]
propose an alternative density function $f(w)$ based on the following
observation. Consider the squared Mahalanobis distances from the
mean for a set of $d$-dimensional vectors that follow a multivariate
Gaussian distribution. These distances form a $\chi^2_d$-distribution, which
has expected value $d$. Hence, to preserve the shape distinctiveness
related to identity, Patel and Smith restrict the representation $w$ to
have Mahalanobis distance $\sqrt{d}$ from the mean. Lewis et al. [2014b]
propose a similar argument showing that, even if faces are truly
Gaussian distributed (which has been shown for the Basel data by
a Kolmogorov-Smirnov test for shape and per-vertex color, where
the marginal distribution for the shape is close to a Gaussian [Egger
et al. 2016b]), methods that make the assumption that typical faces
lie near the mean are not valid.
Recently, Lüthi et al. [2018] proposed a nonlinear shape space
that models deformations from the mean as Gaussian processes.
3.1.2 Local models. Using a global generator function in Equa-
tion (1) is known to lead to representations that do not model
fine-scale geometric details. To improve the modeling of impor-
tant localized areas, such as the eye or nose regions, Blanz and
Vetter [1999] initially experimented with manually segmenting the face
into regions and learning separate PCA models per region. Their
results demonstrate that this localized modeling allows for recon-
structions of higher fidelity. This idea has been extended since with
representations that achieve much higher accuracy than the global
PCA model, and this comes in general at the cost of a less compact
representation $w$.
First local models segmented the face manually [Basso and Verri
2007; Kakadiaris et al. 2007; ter Haar and Veltkamp 2008]. Smet
and Gool [2010] and Tena et al. [2011] propose automatic ways
of segmenting the faces into areas based on information learned
over the displacements of corresponding vertices in the training
set. Brunton et al. [2011] propose a model that combines shape vari-
ations that are localized in different areas with a multi-resolution
framework that uses a wavelet decomposition of the 3D face mod-
els. Fine-scale geometric detail can alternatively be modeled using
hierarchical pyramids that consider differences between a smooth
face and increasingly high-resolution geometry representing, e.g.,
wrinkles [Golovinskiy et al. 2006].
It is also possible to perform localized analysis using different
statistical approaches than PCA. Neumann et al. [2013] propose the
use of sparse PCA combined with a group sparsity constraint to
identify localized deformation components over the training data.
Ferrari et al. [2015] follow a related idea and learn a dictionary of
deformation components over sampled regions for the application
of face recognition. Wu et al. [2016b] combine a local deformation
subspace model with an anatomical bone structure that acts as
a regularizer of the deformation. The local deformation subspace
is computed over overlapping localized patches, and the statistical
model explicitly factors the rigid and non-rigid deformations applied
to each patch.
3.2 Expression models
While simple linear models similar to the ones described can be used to
model expression variation for one subject, this section considers
models that capture variations of both identity and expression. Un-
like simple linear models learned over a dataset of varying identities
and expressions (e.g., [Booth et al. 2017]), our focus is on models
that explicitly decouple the influence of identity and expression by
modeling them in separate coefficients. We classify these methods
into additive, multiplicative, and nonlinear models, depending on
how the two sets of coefficients are combined.
3.2.1 Additive models. Given two shapes of the same subject, one
with expression $c_{\text{exp}}$ and one neutral shape $c_{\text{ne}}$, Blanz and Vetter
[1999] transferred expressions between subjects by adding the ex-
pression offsets $\Delta c := c_{\text{exp}} - c_{\text{ne}}$ to the neutral shape of another
subject.
Several other methods then built on this idea, and model expres-
sion variations as an additive offset to an identity model with a
neutral expression. Formally, additive models are given by
$$c(w_s, w_e) = \bar{c} + E_s w_s + E_e w_e, \quad (3)$$
where $\bar{c}$ is a mean, $E_s$ and $E_e$ are the matrices of basis vectors of
the shape and expression space, and $w_s$ and $w_e$ are the shape and
expression coefficients. Note that the basis vectors of the expression
space can be interpreted as a data-driven blendshape model, where
the basis vectors are orthogonal and do not carry interpretable
semantic meaning in general [Lewis et al. 2014a].
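Read directly as code, Equation (3) and the offset-based expression transfer described above look as follows (a minimal sketch; the function and variable names are ours, and dense correspondence between all shapes is assumed):

    import numpy as np

    def additive_face(c_bar, E_s, E_e, w_s, w_e):
        # Equation (3): identity and expression as additive offsets to the mean.
        return c_bar + E_s @ w_s + E_e @ w_e

    def transfer_expression(c_exp, c_ne, c_target_neutral):
        # Add the offset delta_c = c_exp - c_ne of one subject to another
        # subject's neutral shape (Blanz and Vetter [1999] style transfer).
        return c_target_neutral + (c_exp - c_ne)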
Starting with Blanz et al. [2003], several methods propose to learn
two PCA models, one over shape and one over expression to derive
$E_s$ and $E_e$, and to compute $\bar{c}$ as the mean over training data, either in
neutral expression, or as the sum of two means (one over shape and one
over expression). Blanz et al. [2003] learned the expression space
from a single subject captured in multiple expressions. Amberg et al.
[2008] extended this work to include expression data from multiple
subjects. This leads to a statistical expression model which does
not enable control over specific facial expressions. It is therefore
feasible for analysis-by-synthesis tasks but limited for controlling
or synthesizing specific interpretable expression variation. Thies
et al. [2015] use blendshapes as the basis vectors of the expression
space. These expression blendshapes are not orthogonal and hence
the information of different blendshapes is potentially redundant.
3.2.2 Multiplicative models. Another body of work models shape
and expression variations in a multiplicative manner. Li et al. [2010]
propose a method to adapt a pre-defined blendshape model to a
specific subject given a small number of static face scans in different
expressions, which provides a personalized facial rig. Bouaziz et al.
[2013] combine a morphable shape model $c(w_s)$ (Eq. 2) with a set
of $d_e$ linear expression transfer operators $T_j : \mathbb{R}^{3n} \to \mathbb{R}^{3n}$ that
transform the neutral shape to generate personalized blendshapes.
Formally, this model is defined as
$$c(w_s, w_e) = \sum_{j=1}^{d_e} w_e^j \left( T_j\left( c(w_s) + \delta_s \right) + \delta_e^j \right), \quad (4)$$
where $\delta_s$ and $\delta_e^j$ are corrective vectors to adapt the blendshapes to
the tracked subject, and $w_e^j$ is the $j$-th coefficient of $w_e$.
A commonly used multiplicative model is the multilinear model
that extends the idea of PCA of performing a singular value de-
composition to tensor data by performing a higher-order singular
value decomposition (HOSVD) of 3D face data stacked into a training
tensor. In particular, given a training set of different identities all
captured in the same set of expressions, the vertex coordinates are
stacked into a data tensor on which HOSVD is performed. This
allows to model correlations of shape changes caused by identities
and expressions. This model was first applied to 3D face modeling
by Vlasic et al. [2005a], and can be defined as
$$c(w_s, w_e) = M \times_2 w_s \times_3 w_e, \quad (5)$$
where $M \in \mathbb{R}^{3n \times d_s \times d_e}$ denotes the multilinear model tensor, and
$\times_i$ denotes the tensor mode-product. Thanks to its expressiveness
and simplicity, this model is being used extensively for various ap-
plications [Bolkart and Wuhrer 2015a; Dale et al. 2011; Fried et al.
2016; Mpiperis et al. 2008; Yang et al. 2012]. To allow modeling local-
ized variations, the multilinear model has been applied to wavelet
coefficients at different levels of detail [Brunton et al. 2014a].
Computing a multilinear model with HOSVD requires a com-
plete tensor of data, where each identity needs to be present in all
expressions, and the data need to be in semantic correspondence
specied by expression labels. This severely limits the kind of data
that can be used for training. Recently, a number of methods have
been proposed to address this limitation using an optimization ap-
proach [Bolkart and Wuhrer 2016], a custom tensor decomposition
method [Wang et al
.
2017], and an autoencoder structure [Fernán-
dez Abrevaya et al. 2018], respectively.
3.2.3 Nonlinear models. Facial shape and expression are mostly
modeled with a linear subspace, often assuming a Gaussian prior
distribution. Few methods exist to model facial variations with
nonlinear transformations. Li et al. [2017] introduce FLAME, an
articulated expressive head model that provides nonlinear control
over facial expressions by combining jaw articulation with linear
expression blendshapes. Ichim et al. [2017] use a muscle activation
model driven by physical simulation. Koppen et al. [2018] use a
Gaussian mixture model instead of a single Gaussian distribution to
represent facial shape and texture. In another line of work, Shin et al.
[2014] capture facial wrinkles in multi-scale maps and nonlinearly
transfer them to other faces to enhance realism.
Recently, several deep learning-based models were published that
fall into this group of nonlinear models [Bagautdinov et al. 2018;
Lombardi et al. 2018; Ranjan et al. 2018; Tewari et al. 2019, 2018;
Tran and Liu 2018a]. Section 6 covers these models in more detail.
3.3 Appearance models
This section describes approaches for modelling the facial appear-
ance, where we distinguish between linear and nonlinear models.
The appearance of a face is influenced by its albedo and illumination.
However, most 3DMMs do not completely separate these factors,
so that oftentimes the illumination is baked into the albedo. Hence,
in the following, we call the problem of statistically capturing this
information appearance modeling. The most common way to build
an appearance model is by performing statistics on appearance in-
formation of the training shapes, where the appearance information
is usually either represented in terms of per-vertex values or as a
texture in uv-space.
3.3.1 Linear per-vertex models. Usually, color information is modeled as a low-dimensional subspace that explains the color variations. This leads to a model analogous to the linear shape model:
$d(w_t) = \bar{d} + E_t w_t$,  (6)
where $\bar{d}$ and $E_t$ share the same number of rows as $\bar{c}$ and $E$, and $w_t$ is the low-dimensional texture parameter vector.
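As an illustration, the following sketch builds a PCA texture basis of the form of Eq. (6) from a hypothetical matrix of registered per-vertex colors and draws a random sample from the resulting Gaussian prior; all array shapes are placeholders rather than values from a specific dataset.

```python
# Minimal sketch of the linear per-vertex appearance model of Eq. (6):
# PCA on stacked per-vertex colors of registered training faces (placeholder data).
import numpy as np

colors = np.random.rand(200, 3 * 5000)       # 200 training faces, n = 5000 vertices (RGB)
d_bar = colors.mean(axis=0)                   # mean texture, \bar{d}
U, s, _ = np.linalg.svd(colors - d_bar, full_matrices=False)
k = 80                                        # number of retained components
E_t = (colors - d_bar).T @ U[:, :k] / s[:k]   # orthonormal basis (columns = principal components)
sigma = s[:k] / np.sqrt(colors.shape[0] - 1)  # per-component standard deviations

def texture(w_t):
    """Evaluate Eq. (6): d(w_t) = d_bar + E_t w_t."""
    return d_bar + E_t @ w_t

sample = texture(sigma * np.random.randn(k))  # draw a random plausible texture
```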
Booth et al. [2017] and Booth et al. [2018b] use a convex matrix factorization formulation for learning a per-vertex appearance model from images based on back-projection, where it is assumed that the 3D geometry of the face in the image is known. Their appearance model is not built from the color images directly but rather from features computed from the images, for example, SIFT. This brings the advantages that the features may be somewhat invariant to illumination changes and also that they depend on a local neighborhood, which may widen the basin of convergence. In a similar vein, Wang et al. [2009] construct a linear model of spherical harmonic bases (see Section 4). This jointly models texture (more precisely diffuse albedo) and fine-scale shape (surface normal orientation) such that appearance under any illumination can be synthesized as a linear function of the basis.
3.3.2 Linear texture-space models. A downside of per-vertex models is that they require compatible resolutions between the shape and appearance representation. This is rather uncommon in computer graphics, where usually a low(er) resolution geometry model (oftentimes including normals) is used in conjunction with a high(er) resolution 2D texture map. Working with a 2D texture also has other advantages, such as the possibility of using image processing techniques to modify the texture maps. With that, such a representation is also amenable to being processed by convolutional neural networks (CNNs), as will be addressed in the next section.
We now turn our attention towards works that build linear appearance models in texture space. The original work by Blanz and Vetter [1999] used a texture-based representation by representing the face in a cylindrical parameterization. Later, texture-based representations were used to add textural details like wrinkles [Pascal 2010], or to segment skin and detect moles [Pierrard 2008]. Cosker et al. [2011] model appearance variation in uv-space based on sequences of facial images recorded from different views. The images of the dynamic sequences are aligned based on a non-rigid registration so that the color variation can be modeled using a linear subspace model based on PCA. Dai et al. [2017] also use a uv-space appearance representation that is defined for the entire head. Huber et al. [2016] use a per-vertex appearance variation model based on PCA, but in addition, also define a common uv-mapping so that the model can be textured based on given facial images. Moschoglou et al. [2018] formulate a robust matrix factorization problem in order to learn attributed facial uv-maps from a collection of training textures. A study on the effect of different uv-space embeddings of the texture was presented by Booth and Zafeiriou [2014].
3.3.3 Nonlinear models. Traditionally, the facial appearance is modeled as a linear subspace, where oftentimes a Gaussian distribution is assumed. However, as empirically shown by Egger et al. [2016a], the Gaussian assumption is not very accurate and may lead to a sub-optimal facial appearance model. Hence, the authors proposed to replace a PCA-based appearance model with a Copula Component Analysis model [Han and Liu 2012]. Subsequently, this idea was extended to jointly model facial shape, texture, and attributes [Egger et al. 2016b]. Recent work learned a joint shape and texture model using neural networks with an adversarial loss [Gecer et al. 2019a]. Alotaibi and Smith [2017] use the observation that skin color forms a nonlinear manifold in RGB space, approximately spanned by the colors of the pigments melanin and hemoglobin. They inverse render maps of these parameters and then construct a linear statistical model in the parameter space. The resulting biophysical 3DMM is guaranteed to produce plausible skin colors. In addition to global facial appearance models, there are also approaches that consider models of local skin variations. For example, Dessein et al. [2015] use a texture model based on small overlapping patches that are extracted from a face database, and Schneider et al. [2018] have presented a stochastic model that is able to synthesize freckles.
More recently, a range of appearance modeling approaches based on deep learning have been proposed, where many of these methods are also built within an analysis-by-synthesis framework. These aspects will be discussed in-depth in Secs. 5 and 6.
3.4 Joint shape and appearance models
Blanz and Vetter [1999] originally proposed building separate, independent models for shape and texture. Interestingly, in 2D the Active Appearance Model [Cootes et al. 1998] was originally proposed with a combined shape and appearance model. The advantage of such a joint model is that correlations between shape and texture can be learned and exploited as a constraint during fitting, with fewer parameters. On the other hand, separate models are more flexible and, since shape and texture parameters can be adjusted independently, sequential algorithms can fit the two models independently. However, 3DMMs that jointly model shape and texture have subsequently been considered. Schumacher and Blanz [2015] use canonical correlation analysis to study shape/texture correlations and also correlations between face parts. Egger et al. [2016a] use copula component analysis that can deal with the different scales of shape and texture data. Zhou et al. [2019] propose a deep convolutional colored mesh autoencoder that learns a joint nonlinear model of shape and texture.
3.5 Correspondence
The previously discussed models typically require data with point-to-point correspondence between all shapes. In the following, we refer to the process of establishing such a dense correspondence between scans as registration.
Many methods exist to establish point-to-point correspondence for general classes of objects (e.g., [Tam et al. 2013; van Kaick et al. 2011]), yet the space of face deformations is strongly constrained. Most commonly used face registration methods follow the principle of deforming a template mesh to each scan in the dataset. This registration process typically starts with a rough alignment (often using sparse correspondences) and ends with dense correspondences.
While several image-based methods can also be seen as jointly learning correspondence (between images) and building a statistical model (e.g., [Tewari et al. 2019; Tran and Liu 2018a]), we cover such deep learning-based methods in more detail in Section 6.
3.5.1 Sparse correspondence computation. Several methods exist to establish a sparse correspondence for a dataset of 3D scans by predicting landmarks, i.e. a common set of salient points, for each scan. This sparse correspondence then typically serves as automatic initialization for dense correspondence methods.
Most of the methods use some local descriptors, or a combination of local descriptors and connectivity information between descriptors, to predict salient points. While landmark localization in images is widely researched (e.g., Bulat and Tzimiropoulos [2017]), our focus is on methods that establish sparse correspondence between 3D scans.
Existing methods use combinations of different geometric descriptors. Passalis et al. [2011] use shape index and spin image features, Berretti et al. [2011] use curvature and scale-invariant feature transform (SIFT) features, and Creusot et al. [2013] consider combinations of local features such as Gaussian curvature, mean curvature, and a volumetric descriptor, and learn the statistical distribution of these descriptors for each landmark.
Further, existing methods use geometric relations between landmarks along with geometric feature descriptors. Guo et al. [2013] project a scan into an image and predict landmarks with a 2D PCA model and geometric relations with additional texture constraints. Salazar et al. [2014], similarly to Creusot et al. [2013], learn the statistical distribution of local surface descriptors with an additional Markov network to also consider connections between landmarks. Bolkart and Wuhrer [2015a] extend this further to sequences by additionally considering temporal edges within the Markov network.
3.5.2 Dense correspondence computation. Methods that deform a template to establish correspondence mostly differ in the parameterization of the deformation. We group existing methods according to the type of scan data they register. We distinguish here between static methods, i.e. methods that aim at registering static 3D scans, and dynamic methods that register 3D motion sequences. Blanz and Vetter [1999] use a bootstrapping approach to iteratively fit a 3DMM to a scan, refine the correspondence between the model fit and the scan with a flow field, and refine the model. Blanz et al. [2003] later extend this approach to expressive scans. Amberg et al. [2008] register expressive scans with a non-rigid ICP. Hutton et al. [2001] establish a thin-plate spline (TPS) mapping to warp each scan to a reference and establish correspondence using nearest neighbor search.
Passalis et al. [2011] register scans by deforming an annotated face model (AFM) [Kakadiaris et al. 2005], i.e. an average 3D face template that is segmented into different annotated areas, by solving
a second-order dierent equation. Mpiperis et al
.
[2008] initially
t a subdivision surface to a scan, where the deformation of the
base mesh (i.e. the mesh of the lowest subdivision level) is guided
by a sparse landmark correspondence. After registering a training
set, they parametrize the deformation of the base mesh with a PCA
model over the training data. Salazar et al
.
[2014] use a generic
expression blendshape model to t the expression of the scan, fol-
lowed by a non-rigid ICP to closely t the surface of the scan. [Gerig
et al
.
2018] establish dense correspondence with a Gaussian process
deformation model with the spatially varying kernel.
Several methods exist to sequentially register motion sequences. Weise et al. [2009] use an identity PCA model to register a neutral scan, and then track motion sequences by optimizing sparse and dense optical flow between consecutive frames. Fang et al. [2012] and Li et al. [2017] initialize the optimization with the registration of the previous frame to exploit temporal information. Fang et al. [2012] use an AFM, while Li et al. [2017] use a non-rigid ICP regularized by FLAME. Fernández Abrevaya et al. [2018] use a spatiotemporal method that registers entire motion sequences, iteratively refining the registration by explicitly encoding temporal information with a Discrete Cosine Transform (DCT).
Further, registration methods exist that are not based on template fitting. Sun et al. [2010] use a conformal mapping to parameterize two meshes and establish dense correspondence between the resulting planar meshes by extrapolating sparse landmark correspondences. Ferrari et al. [2015] segment the face scans into non-overlapping parts divided by geodesic curves between selected landmark pairs, and consistently re-sample each part.
3.5.3 Jointly solving for correspondence and statistical model. Li et al. [2013] and Bouaziz et al. [2013] jointly update person-specific blendshape models and register motion sequences in a real-time facial animation framework. During tracking, Li et al. [2013] use an adaptive PCA model that combines the person-specific blendshapes with additional corrective basis vectors that are successively updated, while Bouaziz et al. [2013] optimize for corrective deformation fields (Equation 4).
Bolkart and Wuhrer [2015b] and Zhang et al. [2016b] instead optimize correspondence for a dataset of different subjects in multiple expressions in a groupwise fashion. Bolkart and Wuhrer [2015b] jointly update the point correspondence within the mesh surface by minimizing an objective function that measures the compactness of a multilinear face model. Zhang et al. [2016b] optimize functional maps across the entire dataset.
3.6 Synthesis of novel model instances
3DMMs can be used to synthesize new 3D faces that are different from any of the observed training data, yet realistic. This can be achieved by altering the coefficients in parameter space (i.e. shape space, expression space, or appearance space). Common operations in parameter space include interpolating or extrapolating between the coefficients of training samples. Furthermore, any of the generative models presented in this section can be used to directly synthesize new 3D faces by drawing random samples in parameter space according to the prior distribution. Depending on the model, this sampling allows synthesizing or altering the identity, expression, or appearance of a static 3D face. Synthesis is heavily used for entertainment purposes, and the corresponding works are discussed in Section 7.2.
Synthesis of static 3D faces notably includes the generation of face caricatures by moving the identity coefficients linearly away from the mean [Blanz and Vetter 1999], which is mainly explored to study the human face processing system, as discussed in Section 7.5. With 3DMMs that encode and decouple identity and expression information, it is easy to synthesize dynamic sequences by fixing the identity coefficients while modifying the expression coefficients. Some works aim to synthesize coherent dynamic 3D face videos of a fixed identity with the help of 3DMMs. These include works that synthesize 4D videos from a static 3D mesh paired with semantic label information [Bolkart and Wuhrer 2015a], and from a static 3D mesh and audio information [Cudeiro et al. 2019].
3.7 Publicly available models
In Table 2 we list publicly available shape and/or appearance models of human faces. Figure 4 visualizes geometry or appearance variations of some models. We also refer to the curated list of 3DMM software and data that we collect, share and update [Community 2019].
3.8 Open challenges
While 3D face modeling has received considerable attention during the past two decades, some challenges remain. First, the statistics of most models are limited to the face and do not include information on eyes, mouth interior or hair. These details are however crucial for many applications, and it is not straightforward to combine a 3DMM with specific models, e.g., for hair. Second, the interpretability of the representations would benefit from being improved. PCA is the most commonly used method to perform statistics on 3D faces, and as it is an unsupervised method, the principal components do not coincide with attributes that humans would use to describe a face. Third, methods that incorporate different levels of detail typically come at the cost of a less compact representation, and it is unknown how many parameters are required to accurately represent facial geometry and appearance at varying levels of detail. Fourth, the different models presented in this section have different advantages and drawbacks, making them most suitable for specific applications. It is unknown whether one integrated model that is optimal for all applications exists. Fifth, all currently available models, even the large-scale ones, have a very strong racial bias towards white faces. This can be alleviated in the future by scanning efforts in different parts of the world. Another potential solution to overcome a racial bias can be the generative model itself, as generative models allow synthetic data to be generated and added to biased datasets. Sixth, learning from inhomogeneous data presents another open challenge. There are many available datasets with different resolution, coverage, noise characteristics, biases and so on (see Table 1). Making the best use of this data requires methods that can learn models from all data sources simultaneously, which in turn requires explicit ways to deal with data inhomogeneity. Some very recent work begins to look at this problem [Liu et al. 2019b]. Finally, there are some fundamental open questions related to the statistical modeling of shape. Two face shapes differ by a nonlinear shape deformation superposed on
Fig. 4. Model variations of existing face models. Top left: CFHM [Ploumpis et al. 2019] shape variations. Top right: FaceWarehouse [Cao et al. 2014b] shape and expression variations (while the original model is not available to the best of our knowledge, the visualized multilinear face model is trained from the published FaceWarehouse dataset). Middle: BFM 2019 [Gerig et al. 2018] shape, expression, and appearance variations. Bottom: FLAME [Li et al. 2017] shape, expression, pose, and appearance variation. For shape, expression, and appearance variations, three principal components are visualized at ±3 standard deviations. The FLAME pose variations are visualized at ±π/6 (components three and four) and at 0, π/8 (component six).
top of rigid body motion. Conventionally, this is dealt with by first rigidly aligning, then modeling the residual shape differences, but this makes the model dependent on the choice of alignment metric. For faces specifically, estimated skull position has been used for rigid alignment [Beeler and Bradley 2014]. Although not applied to faces, a recent method uses a rigid body motion invariant distance measure to learn nonlinear principal components [Heeren et al. 2018].
4 IMAGE FORMATION
A 3DMM provides a parametric representation of face geometry
and appearance. One key usage of such a model is synthesis, which
involves two steps. First, generating a new model instance via sam-
pling from the parameter space or manual interaction with model
parameters (see Section 3.6). Second, rendering the generated model
into a 2D image via a simulation of the image formation process,
i.e. the computer graphics pipeline. The synthesis also forms an im-
portant part of 3DMM-based face analysis, either through classical
analysis-by-synthesis (see Section 5) or as a model-based decoder
Table 2. Overview of publicly available 3D shape and/or appearance models of human faces.
- Basel Face Model (BFM) 2009 [Paysan et al. 2009b]. Geometry: shape. Appearance: per-vertex. Data: 200 individuals, each in neutral expression. Comment: includes separate models for facial parts.
- FaceWarehouse [Cao et al. 2014b]. Geometry: shape, expression. Appearance: none. Data: 150 individuals, each with 20 expressions.
- Global and local linear model [Brunton et al. 2014b]. Geometry: shape. Appearance: none. Data: 100 individuals.
- Multilinear Wavelet model [Brunton et al. 2014a]. Geometry: shape, expression. Appearance: none. Data: 99 individuals, 25 expressions.
- Multilinear face model [Bolkart and Wuhrer 2015b]. Geometry: shape, expression. Appearance: none. Data: 2500 scans (100 individuals, 25 expressions).
- Multilinear face model [Bolkart and Wuhrer 2016]. Geometry: shape, expression. Appearance: none. Data: 2510 scans (205 individuals, up to 23 expressions).
- Large Scale Facial Model (LSFM) [Booth et al. 2016]. Geometry: shape. Appearance: none. Data: 9663 individuals.
- Surrey Face Model [Huber et al. 2016]. Geometry: shape, expression. Appearance: per-vertex. Data: 169 individuals. Comment: multi-resolution.
- Liverpool-York Head Model (LYHM) [Dai et al. 2017]. Geometry: shape. Appearance: per-vertex. Data: 1212 individuals. Comment: full head (no hair, no eyes).
- Faces Learned with an Articulated Model and Expressions (FLAME) [Li et al. 2017]. Geometry: shape, expression, head pose. Appearance: texture. Data: 3800 individuals for shape, 8000 for head pose, 21000 frames for expression. Comment: female, male, and gender neutral models, full head (no hair).
- Basel Face Model (BFM) 2017 [Gerig et al. 2018]. Geometry: shape, expression. Appearance: per-vertex. Data: 200 individuals for shape and appearance, a total of 160 expression scans. Comment: BFM 2019 with full head and multi-resolution.
- York Ear Model [Dai et al. 2018]. Geometry: shape. Appearance: none. Data: 20 3D ear scans, augmented with 605 landmark-annotated 2D ear images. Comment: ear only.
- Multilinear autoencoder [Fernández Abrevaya et al. 2018]. Geometry: shape, expression. Appearance: none. Data: 5000 scans from 195 individuals, 500000 after augmentation.
- Convolutional Mesh Autoencoder (CoMA) [Ranjan et al. 2018]. Geometry: shape, expression. Appearance: none. Data: 12 individuals, 12 extreme expressions, 20466 meshes in total. Comment: full head (no hair).
- Combined Face & Head Model (CFHM) [Ploumpis et al. 2019]. Geometry: shape. Appearance: none. Data: merged from the LYHM and LSFM models. Comment: full head (no hair).
- Morphable Face Albedo Model [Smith et al. 2020]. Geometry: none. Appearance: per-vertex diffuse and specular albedo. Data: 73 individuals (50 scanned + 23 3DRFE [Stratou et al. 2011]). Comment: extends BFM 2017.
within a deep learning architecture (see Section 6). In this section, we focus on modeling the image formation process. This potentially encompasses the whole of the rendering literature, so we restrict our attention to techniques and models that have been applied in the context of 3DMMs. We cover the geometry and photometry of image formation in Sections 4.1 and 4.2, the rendering pipelines used for 3DMM fitting in Section 4.3 and finally in Section 4.4 we highlight where there are future opportunities for exploiting state-of-the-art rendering techniques to improve 3DMM synthesis.
4.1 Geometric image formation
A camera model describes the geometry of image formation, specifically, how positions in the 3D world project to 2D locations in the image plane. A variety of camera models have been used in the 3DMM literature, which are described here in order of increasing accuracy with respect to a real camera. We denote the projection of a 3D point $v = [u, v, w]^T$ onto the 2D point $x = [x, y]^T$ by $x = \mathrm{project}[C, v] \in \mathbb{R}^2$, where $\mathrm{project}$ represents one of the camera projection models below and $C = (C_{\mathrm{intrinsic}}, C_{\mathrm{extrinsic}})$ contains the camera parameters. $C_{\mathrm{extrinsic}} = (R, t)$ describes the pose in terms of a rotation $R \in SO(3)$ and translation $t \in \mathbb{R}^3$ that transform from world to camera coordinates. $C_{\mathrm{intrinsic}}$ is a set of internal parameters specific to each projection model. The task of estimating $C$ is known as camera calibration or camera resectioning and is usually done from known or estimated 2D-3D correspondences. Estimating $C_{\mathrm{extrinsic}}$ with known $C_{\mathrm{intrinsic}}$ is called pose estimation or, in the case of a perspective camera model, the perspective-n-point problem.
Scaled orthographic. The scaled orthographic projection model comprises an orthographic projection whose sole parameter is a uniform scaling $s \in \mathbb{R}_{>0}$:
$\mathrm{ortho}[v, R, t, s] = sP(Rv + t) = s \begin{bmatrix} r_1 & t_1 \\ r_2 & t_2 \end{bmatrix} \tilde{v} = C \tilde{v}, \quad P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix},$  (7)
where $\tilde{v} = [u, v, w, 1]^T$ is the homogeneous representation of $v$ and $r_1$, $r_2$ are the first two rows of $R$. This model is not physically meaningful but is linear in vertex position, translation and scale, and avoids the size/distance/perspective ambiguities introduced by more realistic camera models. Since $R$ must be restricted to $SO(3)$, the projection is nonlinear in any parameterization of $R$. In the context of 3DMMs, scaled orthographic projection has been used for example by Bas et al. [2017b]; Blanz et al. [2004a]; Knothe et al. [2006]; Patel and Smith [2009]. The scaled orthographic model can be interpreted as an approximation to perspective projection when the distance between the surface and the camera is large relative to the depth variation. Concretely, when $\max(w) - \min(w) \ll \bar{w}$, with $\bar{w} = \mathrm{mean}(w)$ the mean distance between the surface and the camera, the nonlinear division in perspective projection can be approximated by a fixed scale $s = f/\bar{w}$, where $f$ is the focal length of the camera. This gives physical meaning to the scaled orthographic model.
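As a concrete illustration of Eq. (7), the following minimal sketch projects a set of vertices with the scaled orthographic model; the rotation, translation and scale values are arbitrary placeholders, not fitted quantities.

```python
# Minimal sketch of scaled orthographic projection (Eq. 7); R is a rotation matrix,
# t a 3D translation, s > 0 a uniform scale. All values are illustrative.
import numpy as np

def ortho_project(V, R, t, s):
    """Project an (n, 3) array of vertices with the scaled orthographic model."""
    P = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])       # drop the depth coordinate
    return s * (V @ R.T + t) @ P.T        # (n, 2) image coordinates

# example: rotate 30 degrees about the vertical axis and project random vertices
theta = np.deg2rad(30)
R = np.array([[ np.cos(theta), 0, np.sin(theta)],
              [ 0,             1, 0            ],
              [-np.sin(theta), 0, np.cos(theta)]])
x = ortho_project(np.random.randn(10, 3), R, np.zeros(3), s=2.0)
```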
Ane. The ane camera generalizes the orthographic model by
allowing arbitrary ane transformations. Specically, this addition-
ally allows non-uniform scaling and skew transformations which
can approximate perspective eects whilst remaining linear. An
ane camera can be represented by an arbitrary matrix
C R
2×4
with the projection given simply by
x = ane[v, C] = C
˜
v
. The
projection is linear in
C
and since its 8 entries are unconstrained,
they can be estimated using linear least squares (though note that
numerical stability entails rst performing a normalization proce-
dure). In the context of 3DMMs, the ane camera has been used for
example by Aldrian and Smith [2013]; Huber et al. [2016].
Perspective. A nonlinear perspective projection is given by the pinhole camera model $x = \mathrm{pinhole}[v, K, R, t]$. The matrix
$K = \begin{bmatrix} f_x & \gamma & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$
contains the intrinsic parameters of the camera, namely the focal length in the $x$ and $y$ directions $f_x, f_y \in \mathbb{R}_{>0}$, the skew $\gamma \in \mathbb{R}$ and the principal point $[c_x, c_y] \in \mathbb{R}^2$. Common assumptions are that the pixels are square (in which case a single focal length parameter $f = f_x = f_y$ is used), that the camera sensor is perpendicular to the camera view vector (in which case $\gamma = 0$) and that the principal point is in the centre of the image ($c_x = w/2$ and $c_y = h/2$). Note that $f$ is actually a product of two quantities: the physical focal length in world units (e.g., mm) and the conversion factor from world units to pixels (i.e. with units of pixels/mm). The nonlinear perspective projection can be written in linear terms by using homogeneous representations:
$\lambda \tilde{x} = K [R \;\; t] \, \tilde{v} = C \tilde{v},$  (8)
where $\tilde{x} = [x, y, 1]^T$, $\lambda$ is an arbitrary scaling factor and $C \in \mathbb{R}^{3 \times 4}$ is known as the camera matrix. The final image coordinate is obtained by the nonlinear homogenization of $\tilde{x}$. Unlike the linear models, the pinhole model captures the effect of distance on projected shape. This becomes important when a face is close to the camera. At “selfie” distance (e.g., 0.5m), the difference between perspective and orthographic projection of 3D face landmarks is about 6% of the interocular distance [Bas and Smith 2019]. For this reason, perspective projection is commonly used in the context of 3DMMs, for example, in the original Blanz and Vetter [1999] paper and more recently in a shape-from-landmarks setting [Cao et al. 2014a, 2013; Saito et al. 2016]. Unfortunately, since calibration information is rarely available, the increased complexity of this model introduces ambiguities between shape, scale and focal length that have only recently been studied [Bas and Smith 2019; Smith 2016], though the ambiguity has often been hinted at in the literature. For example, the original Blanz and Vetter [1999] paper relied on a fixed, manually provided subject-camera distance. Booth et al. [2018b] state “we found that it is beneficial to keep the focal length constant in most cases, due to its ambiguity with $t_z$”. Schönborn et al. [2017] explored the ambiguity of the estimated distance from the camera under perspective projection, observed a very high posterior standard deviation, and found that the distance cannot be resolved even by using a strong prior for the face shape. In Tewari et al. [2017], a similar effect is observed, indicated by the learning rate on the $z$ translation (i.e. subject-camera distance), which is set three orders of magnitude lower than for all other parameters. Both approaches in practice fix the face distance to avoid this difficulty.
4.2 Photometric image formation
The appearance of a face is determined by the interaction of light with the material of the face, predominantly skin. Hence, the photometry of both illumination and reflectance must be modeled in order to simulate the image formation process.
Reectance models. The reection of light from a surface is often
described using a Bidirectional Reectance Distribution Function
(BRDF). This describes the directional dependence of local light
reection from an opaque surface. It is represented by a four dimen-
sional function
f
r
(ω
i
, ω
o
)
that gives the ratio of outgoing reected
radiance in direction
ω
o
to incoming incident irradiance from direc-
tion
ω
i
. A BRDF allows us to express irradiance
L
o
(ω
o
)
in direction
16 B. Egger et al.
ω
o
as a function of light reected from all incident directions:
L
o
(ω
o
) =
(n)
f
r
(ω
i
, ω
o
)L
i
(ω
i
)cos θ
i
dω
i
, (9)
where
L
i
(ω
i
)
is incident irradiance from direction
ω
i
,
(n)
is the
hemisphere around the local surface normal
n
and
θ
i
is the an-
gle between
ω
i
and
n
. Note
cosθ
i
= n · ω
i
, where
·
denotes the
inner product. Physically-valid BRDFs must exhibit a number of
properties: nonnegativity (
f
r
(ω
i
, ω
o
)
0), Helmholtz reciprocity
(f
r
(ω
i
, ω
o
) = f
r
(ω
o
, ω
i
)) and conservation of energy:
ω
i
,
(n)
f
r
(ω
i
, ω
o
)cos θ
i
dω
o
1. (10)
A particularly simple and commonly used physically valid BRDF is the Lambertian model for a perfectly diffuse reflector. The Lambertian model assumes incident light is scattered equally in all directions, resulting in a constant BRDF: $f_{\mathrm{Lambert}}(\omega_i, \omega_o) = \rho_d / \pi$. Here, $\rho_d \in [0, 1]$ is the diffuse reflectivity or albedo, which is usually wavelength dependent and can be thought of as the color of the object. Work predating the original 3DMM used a linear statistical 3D face shape model with the Lambertian reflectance model in a shape-from-shading context [Atick et al. 1996]. Subsequently, the Lambertian model has been used for 3DMM fitting in the context of the spherical harmonic lighting model [Zhang and Samaras 2006] (see below), where its simplicity yields closed-form expressions. This is now very common, including in the current state of the art, e.g., [Tran et al. 2019; Tran and Liu 2018b]. In general, the Lambertian model is a poor approximation for the complex reflectance properties of facial skin, hair, eyes, etc., and so more sophisticated models have been considered.
Blanz and Vetter [1999] originally used the Phong model, which augments the Lambertian term with a constant ambient term and a phenomenological specular model enabling the simulation of glossy reflectance. The Phong model can be described in terms of the following BRDF:
$f_{\mathrm{Phong}}(\omega_i, \omega_o) = \dfrac{\rho_a + \rho_s (r \cdot \omega_o)^\eta}{n \cdot \omega_i} + \rho_d,$  (11)
where $r$ is the reflection of $\omega_i$ about $n$, $\eta$ is the shininess that controls the width of the specular lobe, and $\rho_a$, $\rho_s$ are ambient and specular “albedos”. In the context of 3DMMs, usually only $\rho_d$ is allowed to vary spatially. Note that the Phong BRDF does not satisfy the constraints above for physical validity. In graphics, extremely complex, physically-valid BRDF models have been developed specifically for materials of relevance to face 3DMMs, for example for skin [Krishnaswamy and Baranoski 2004] and hair [Marschner et al. 2003]. Note that skin is a layered, partially translucent material and so a local BRDF model is inadequate to describe the actual subsurface scattering effects that take place. More complex 8-dimensional bidirectional subsurface scattering reflectance distribution functions (BSSRDFs) have been proposed for such materials. However, both these and the more complex BRDFs have proven too complex to integrate into 3DMM fitting pipelines and so the majority of work has used Lambertian or non-physical models of moderate complexity.

Lighting. In (9), $L_i(\omega_i)$ represents the hemispherical incident illumination environment at the surface point. Natural illumination is usually complex, comprising multiple, possibly extended sources as well as secondary illumination reflected from other surfaces. A common assumption is that the illumination environment is distant relative to the size of the object, in which case it can be represented by a constant 2D environment map, a discrete approximation of $L_i(\omega_i)$ that is used for every point on the surface. However, the space of possible natural illumination is very high dimensional and rendering with an environment map is computationally expensive, so a number of further simplifications are commonly used.
The simplest illumination model is a point source, in which $L_i(\omega_i)$ is a delta function characterized by a unit vector in the light source direction, $s$, and an intensity, $L_i$. Ignoring constants and assuming image intensity is proportional to surface radiance, we can plug in the simple BRDF models above and obtain the following shading models:
$I_{\mathrm{Lambert}} = L_i \rho_d \, n \cdot s, \qquad I_{\mathrm{Phong}} = L_i \left( \rho_a + \rho_d \, n \cdot s + \rho_s (r \cdot v)^\eta \right),$  (12)
where $v$ is a unit vector in the viewer direction. Usually, the light source intensity and albedos would be RGB values. Often $\rho_a = \rho_d$, representing the intrinsic color of the surface. In a 3DMM, this is described by the statistical texture model (6). Note that these simple models are purely local; this means that they neglect self-occlusion of the light source, i.e. cast shadows. These can be added at the cost of computing the occlusion function, which is not differentiable.
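The following sketch evaluates the two point-source shading models of Eq. (12) per vertex; all inputs (normals, albedos, light and view directions) are illustrative placeholders, and contributions from back-facing light are simply clamped to zero.

```python
# Minimal sketch of the point-source shading models of Eq. (12), per vertex.
import numpy as np

def lambert_shading(N, albedo, s_dir, L_i=1.0):
    """I = L_i * rho_d * max(n . s, 0)."""
    return L_i * albedo * np.clip(N @ s_dir, 0.0, None)[:, None]

def phong_shading(N, albedo, ambient, specular, shininess, s_dir, v_dir, L_i=1.0):
    """I = L_i * (rho_a + rho_d * (n . s) + rho_s * (r . v)^eta), r = reflected light direction."""
    n_dot_s = N @ s_dir
    r = 2.0 * n_dot_s[:, None] * N - s_dir                       # reflection of s about n
    diffuse = albedo * np.clip(n_dot_s, 0.0, None)[:, None]
    spec = np.clip(r @ v_dir, 0.0, None) ** shininess * (n_dot_s > 0)  # no specular from back-facing light
    return L_i * (ambient + diffuse + specular * spec[:, None])

n = 5000
N = np.random.randn(n, 3); N /= np.linalg.norm(N, axis=1, keepdims=True)
albedo = np.random.rand(n, 3)
s_dir = np.array([0.0, 0.0, 1.0]); v_dir = np.array([0.0, 0.0, 1.0])
I = phong_shading(N, albedo, ambient=0.1 * albedo, specular=0.3,
                  shininess=20.0, s_dir=s_dir, v_dir=v_dir)
```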
A better approximation of complex natural illumination is provided by the spherical harmonic illumination model [Ramamoorthi and Hanrahan 2001]. Spherical harmonics provide an orthonormal basis for functions on the sphere, analogous to a Fourier basis in Euclidean space:
$I_{\mathrm{SH}}(n) = \sum_{l \geq 0} \sum_{m=-l}^{l} l_{l,m} B_{l,m}(n),$  (13)
where $B_{l,m}(n)$ are the orthonormal basis functions, and $l_{l,m}$ are coefficients describing reflectance and illumination. The subscript $l$ denotes the degree and $m$ the order of the spherical harmonics. In the Lambertian case, the contribution from the reflectance function is constant and 98% of the energy of the reflectance function can be captured for any illumination environment using an order 2 ($l \in \{0, 1, 2\}$) approximation. In practice, this means that a good approximation of appearance can be obtained using 9 illumination coefficients per color channel. Combining this model with the linear texture model for albedo $d(w_t)$ yields the following:
$i_{\mathrm{SH}} = d(w_t) \odot \mathrm{vec}(B(w_s) L),$  (14)
where $\odot$ is the Hadamard (element-wise) product. The matrix $B(w_s) \in \mathbb{R}^{n \times 9}$ contains the spherical harmonic basis for each vertex, which depends on the vertex normal direction and hence the geometry, which in turn is determined by the shape parameter vector. The matrix $L \in \mathbb{R}^{9 \times 3}$ contains the lighting coefficients for each color channel. Zhang and Samaras [2006] were the first to fit this model in the context of 3DMMs. Aldrian and Smith [2013] additionally used the same model for specular reflection, showing that the coarse structure of the illumination environment can be recovered from a face image. They also introduced priors to help resolve lighting/texture ambiguities. Egger et al. [2018] went further by learning a low dimensional illumination prior from spherical harmonic lighting coefficients estimated from real in-the-wild images.
An alternative model that takes a step towards capturing global illumination effects is the ambient occlusion model. Here, it is assumed that $L_i(\omega_i)$ is constant everywhere, i.e. that illumination is perfectly diffuse. In this case, shading depends only upon the degree to which the incident hemisphere is occluded. The ambient occlusion, $A_v$, at vertex $v$ is given by:
$A_v = \frac{1}{\pi} \int_{\Omega(n)} V(v, \omega)(n \cdot \omega) \, d\omega,$  (15)
where $V(v, \omega)$ is the visibility function, defined as zero if vertex $v$ is occluded in direction $\omega$, and one otherwise. One can also define the bent normal as the average unoccluded direction. Using the bent normal with the spherical harmonic illumination model and scaling the result by the ambient occlusion provides a rough approximation of global illumination effects. Ambient occlusion and bent normal direction depend on the geometry and hence the 3DMM shape parameters. Aldrian and Smith [2012] proposed to learn a linear model of ambient occlusion and bent normals and included this in their 3DMM synthesis model. Zivanov et al. [2013] similarly construct a joint linear model of spherical harmonic bases and ambient occlusion.
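As an illustration of Eq. (15), the following sketch estimates the ambient occlusion at a single vertex by Monte Carlo integration. It assumes some scene-specific visibility oracle (a placeholder, not defined here), and uses cosine-weighted sampling so that the (n · ω)/π factor is absorbed into the sampling density.

```python
# Minimal sketch of a Monte Carlo estimator of the ambient occlusion integral (Eq. 15).
import numpy as np

def cosine_sample_hemisphere(normal, rng):
    """Sample a direction around `normal` with probability proportional to cos(theta)."""
    u1, u2 = rng.random(), rng.random()
    r, phi = np.sqrt(u1), 2 * np.pi * u2
    local = np.array([r * np.cos(phi), r * np.sin(phi), np.sqrt(1 - u1)])
    # build an orthonormal frame around the normal
    a = np.array([1.0, 0.0, 0.0]) if abs(normal[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t = np.cross(normal, a); t /= np.linalg.norm(t)
    b = np.cross(normal, t)
    return local[0] * t + local[1] * b + local[2] * normal

def ambient_occlusion(p, normal, visible, n_samples=256, seed=0):
    """A_v estimated as the fraction of unoccluded cosine-weighted directions."""
    rng = np.random.default_rng(seed)
    hits = sum(visible(p, cosine_sample_hemisphere(normal, rng)) for _ in range(n_samples))
    return hits / n_samples

# example with a trivial oracle (nothing occluded, so A_v is close to 1)
A = ambient_occlusion(np.zeros(3), np.array([0.0, 0.0, 1.0]), visible=lambda p, w: True)
```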
The most complex global model of appearance considered in the context of 3DMMs is the precomputed radiance transfer (PRT) model [Sloan et al. 2002]. This uses an efficient representation (such as spherical harmonics) to approximate the local light transport at each vertex, accounting for shadowing and inter-reflection. These are precomputed but can then be used with any incident illumination at render time. Schneider et al. [2017] learn a linear model of PRT transfer matrices as a function of the 3DMM shape coefficients and use this in a rendering framework.
We denote by $\mathcal{L}$ the set of illumination parameters for any of the above illumination models.
Color transformation. In a real camera, the actual image irradiance measured by the sensor is usually transformed in a complex way in order to achieve a pleasing visual appearance. Often, this amounts to multiplication by a 3 × 3 color transformation matrix followed by a nonlinearity. The color transformation matrix can be decomposed into a product of three 3 × 3 matrices: $T = T_{\mathrm{xyz2rgb}} T_{\mathrm{raw2xyz}} T_{\mathrm{wb}}$, where $T_{\mathrm{wb}}$ is a diagonal matrix that performs white balancing (compensating for the color of the illumination), $T_{\mathrm{raw2xyz}}$ is specific to each camera and maps from the native color space to the standardized XYZ space, and $T_{\mathrm{xyz2rgb}}$ is a fixed matrix that transforms to sRGB space. Unfortunately, introducing such a color transformation into the 3DMM image formation model further exacerbates the lighting/albedo ambiguity by providing an additional explanation for observed color. Finally, a nonlinearity is applied, which can be approximated by $i_{\mathrm{sRGB}} = i_{\mathrm{linRGB}}^{1/\gamma}$, where usually $\gamma = 2.2$. This nonlinear transformation is important because it means the, often linear, reflectance and illumination models described above cannot explain the final image appearance.
Despite their importance, camera color transformations and nonlinear gamma are almost always ignored in the context of 3DMMs. There are some notable exceptions. Schneider et al. [2017] apply gamma correction to input images to transform back to a linear space. Blanz and Vetter [2003] estimate a per-channel scale and offset as well as a scalar color contrast, allowing them to synthesize grayscale images. The same model was used by Aldrian and Smith [2013] and Hu et al. [2013].
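The following sketch applies such a color pipeline to linear intensities. The white balance gains and the raw-to-XYZ matrix are placeholders (only the XYZ-to-linear-sRGB matrix is a standard constant), so it illustrates the structure of the transformation rather than any particular camera.

```python
# Minimal sketch of the camera color pipeline described above:
# a 3x3 color transformation followed by a gamma nonlinearity.
import numpy as np

T_wb = np.diag([1.1, 1.0, 0.9])                       # white balance gains (placeholder)
T_raw2xyz = np.eye(3)                                  # camera-specific raw -> XYZ (placeholder)
T_xyz2rgb = np.array([[ 3.2406, -1.5372, -0.4986],
                      [-0.9689,  1.8758,  0.0415],
                      [ 0.0557, -0.2040,  1.0570]])    # standard XYZ -> linear sRGB matrix
T = T_xyz2rgb @ T_raw2xyz @ T_wb

def camera_pipeline(i_raw, gamma=2.2):
    """Apply the color transform and gamma to raw linear intensities of shape (..., 3)."""
    i_lin = np.clip(i_raw @ T.T, 0.0, 1.0)
    return i_lin ** (1.0 / gamma)

i_srgb = camera_pipeline(np.random.rand(5000, 3) * 0.8)
```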
4.3 Rendering and visibility
3DMM fitting algorithms differ in whether they synthesize a discrete image in image space (i.e. one color per pixel) or perform rendering in object space (i.e. one color per vertex or per triangle). The former holds the advantage that it is straightforward to incorporate a texture model built in a high-resolution UV space and also that the output is a regular pixel grid that can be passed to a CNN, for example for an adversarial loss [Shamai et al. 2020]. Methods that work in object space compute an appearance error by projecting the model vertices into the image and sampling image intensities onto the visible vertices. Visibility can also be computed in either object space or image space, with the latter usually being more efficient.
The original Blanz and Vetter [1999] paper used object space rendering in which a single color was computed for each triangle center (equivalent to flat shading), with image space z-buffering used for visibility testing. Many subsequent methods also worked in object space, but usually with per-vertex colors computed using the reflectance models described above with per-vertex surface normals. This has begun to change recently as more conventional rasterization pipelines have been included in 3DMM synthesis. Rasterization associates with each pixel $(x, y) \in \mathcal{I}$, where $\mathcal{I} = \{1, \ldots, w\} \times \{1, \ldots, h\}$, a triangle index or a NULL value if the pixel is not covered by a triangle:
$\mathrm{raster}_{C, T, w_s, w_e} : \mathcal{I} \to \{1, \ldots, m, \mathrm{NULL}\},$  (16)
recalling that $w_s$, $w_e$ are the shape and expression parameters respectively and $T$ the mesh triangulation. Since this is a discrete function, it is not smooth and not differentiable. In addition, for each pixel, three weights are calculated that are associated with the vertices of the rasterized triangle: $a_{C, T, w_s, w_e}(x, y) \in \mathbb{R}^3_{\geq 0}$. These weights depend on the projected positions of the vertices
$v_{t_i^j}, \quad j = \mathrm{raster}_{C, T, w_s, w_e}(x, y), \quad i \in \{1, 2, 3\},$  (17)
where $t_i^j$ denotes the $i$-th vertex index of triangle $j$. Often, these weights are the barycentric coordinates of the pixel center within the triangle. The weights are a smooth function of the vertex positions and hence of the shape and camera parameters. Hence, rendering is differentiable up to a change in rasterization, i.e. so long as the triangle index associated with each pixel does not change. Tran and Liu [2018a] incorporate such a conventional rasterization pipeline into an in-network differentiable renderer.
Collecting together all of the parameters relating to the camera, illumination, face geometry and texture, $\Theta = (C, \mathcal{L}, w_s, w_e, w_t)$, we can write the rendered appearance in object space of vertex $j$ as $I^j_{\mathrm{model}}(\Theta)$. For an image space rendering we denote the appearance of the model at pixel $(x, y)$ by $I^{x,y}_{\mathrm{model}}(\Theta)$. In the simplest case, the image space rendering is computed directly from the object space rendering using Gouraud interpolation shading:
$I^{x,y}_{\mathrm{model}}(\Theta) = a_{C, T, w_s, w_e}(x, y)^T \begin{bmatrix} I^{t_1^j}_{\mathrm{model}}(\Theta) \\ I^{t_2^j}_{\mathrm{model}}(\Theta) \\ I^{t_3^j}_{\mathrm{model}}(\Theta) \end{bmatrix},$  (18)
where $j = \mathrm{raster}_{C, T, w_s, w_e}(x, y)$. Other rasterization strategies may be more complex. For example, Genova et al. [2018] use rasterization in a differentiable deferred shading renderer more akin to Phong interpolation shading. Here, vertex normals and colors are rasterized and interpolated, then reflectance calculations are done in image space.
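The following sketch implements the interpolation of Eq. (18) given the outputs one would obtain from a rasterizer (a per-pixel triangle index and barycentric weights); the array names and shapes are assumptions for illustration, not part of any specific renderer.

```python
# Minimal sketch of Gouraud-style image-space rendering (Eq. 18): interpolate
# per-vertex model colors using per-pixel barycentric weights from a rasterizer.
import numpy as np

def shade_pixels(tri_index, bary, triangles, vertex_colors, background=0.0):
    """tri_index: (h, w) int array, -1 for NULL; bary: (h, w, 3); triangles: (m, 3) vertex ids."""
    h, w = tri_index.shape
    image = np.full((h, w, 3), background, dtype=float)
    fg = tri_index >= 0                                           # pixels covered by a triangle
    corner_ids = triangles[tri_index[fg]]                         # (p, 3) vertex indices t_1^j, t_2^j, t_3^j
    corner_colors = vertex_colors[corner_ids]                     # (p, 3, 3) colors of the three corners
    image[fg] = np.einsum('pk,pkc->pc', bary[fg], corner_colors)  # weighted sum a^T [I^{t_i^j}]
    return image
```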
Note that overcoming the non-differentiable nature of rasterization is an open problem. Kato et al. [2018] present an approximately differentiable renderer based on rasterization. Liu et al. [2019a] propose a rasterizer in which triangles make a soft (and hence differentiable) contribution to image appearance. More ambitiously, differentiable rendering using other pipelines is now also being considered, for example, differentiable path tracing [Li et al. 2018].
Very recently, the explicit fixed models used in conventional rendering are being augmented or replaced by learned components, so-called neural rendering. For example, Kim et al. [2018a] train an image-to-image network that transforms a low-quality 3DMM rendering into a photorealistic video frame.
4.4 Open challenges
The image formation models used in the context of 3DMMs are much simpler than those used in graphics and many other areas of computer vision. For example, we are not aware of any work that allows for a center of projection different from the center of the image, even though many face image datasets consist of images cropped (probably non-centrally) from larger images. Similarly, nonlinear lens distortion is always ignored. The effect of this assumption is not understood. In other fields, like structure-from-motion, it is standard to impose constraints derived from metadata, knowledge of physical camera parameters and so on. This is not currently being done to a significant extent with 3DMMs.
Advances in rendering in computer graphics are only slowly propagated into the world of 3DMMs, and especially into the analysis-by-synthesis process. One of the reasons is that almost every model extension makes the model adaptation more complicated, and many methods rely on the rendering process being differentiable. There is a dramatic gap between what current computer graphics and deep learning-based image generation methods are capable of and what is state of the art for 3DMMs. Generated instances also usually lack facial details like wrinkles or moles, which are challenging to render properly. Recent work addresses these challenges by using generative adversarial networks as texture models [Slossberg et al. 2018], but such details are not modeled in the shape and are not specially treated during rendering. A possible future direction is to either model or learn the gap between current 3DMM renderings and state-of-the-art computer graphics or real-world 2D images.
An interesting open challenge is to better exploit the constraints of the 3DMM. Existing work uses generic pipelines for tasks such as rasterization or visibility calculation. However, the geometry is defined by a low-dimensional parameter vector from which the per-vertex visibility could presumably be inferred more efficiently than by treating the resulting mesh as a generic shape. The attempt of Schneider et al. [2017] to learn the relationship between PRT coefficients and shape parameters is a first step in this direction.
5 ANALYSIS-BY-SYNTHESIS
3DMMs have been widely used for image-based reconstruction. Reconstructing a 3D face from one or more observed images involves estimating the 3DMM coefficients which can best explain the observation. This is the inverse of the image synthesis process covered in the previous section.
Analysis-by-synthesis refers to a class of optimization problems which solve this by minimizing the difference between the observed image(s) and the synthesis of an estimated 3D face. Such an optimization problem can be ill-posed, with several ambiguities and multiple minima. This is a widely researched problem, with a variety of solutions exploring different input modalities (Sec. 5.1), energy functions (Sec. 5.2) and optimization strategies (Sec. 5.3). We present publicly available approaches in Table 3.
Analysis-by-synthesis techniques have also recently been used in combination with deep learning architectures for learning-based reconstruction algorithms. We will discuss these methods in Sec. 6.
5.1 Input Modalities
Analysis-by-synthesis methods have been explored using multiple image modalities, from multi-view to monocular images and videos. While multi-view methods produce very detailed and high-quality results, capturing such data requires expensive setups. A lot of recent focus has been on obtaining similar quality reconstructions with much lower cost solutions, e.g., using a single RGB image. This has also led to an increase in commercial applications for the mass market. Fitting a 3DMM to 3D scans can also be considered as analysis-by-synthesis. This is related to registration techniques, covered in Sec. 3.
Multi-View Systems. We will start our discussion with multi-view solutions, which minimize the photometric error between the multi-view images and the synthesis of the estimated reconstruction. Most multi-view methods, such as those covered in Sec. 2, do not require a strong prior in the form of 3DMMs. However, there are several methods which use 3DMMs to aid reconstruction in stereo camera systems. Model-based stereo reconstruction was explored in Wallraven et al. [1999]. The reconstruction quality was improved by eliminating the estimation of illumination and reflectance in Amberg et al. [2007]; Fransens et al. [2005]. 3DMMs also prove to be very valuable in low-resolution settings where high-quality image textures cannot be exploited, or under occlusions [Romeiro and Zickler 2007; Thies et al. 2018b]. Most of the methods discussed here solve very large optimization problems, and are not real-time. Thies et al. [2018a] is one real-time method, which has a data-parallel implementation on a GPU.
Table 3. Overview of publicly available model adaptation and registration frameworks for 3DMMs.
- Edge fitting [Bas et al. 2017b]. Input: 2D image, landmarks. Estimates: pose, shape. Approach: edge features, ICP.
- Eos fitting library [Huber et al. 2016]. Input: 2D image, landmarks. Estimates: pose, shape. Approach: landmark and contour fitting. Comment: Huber [2017] handles expressions.
- Basel Face Pipeline [Gerig et al. 2018]. Input: 2D image, landmarks. Estimates: pose, shape, expression, texture, illumination. Approach: MCMC sampling. Comment: estimates posterior distribution, Egger et al. [2018] handles occlusion.
- Deep 3D Face Reconstruction [Deng et al. 2019]. Input: 2D image(s). Estimates: pose, shape, expression, texture, illumination. Approach: deep (ResNet).
- PRNet [Feng et al. 2018]. Input: 2D image. Estimates: pose, shape. Approach: deep (convolutional). Comment: outputs mesh in BFM topology.
- Expression-Net [Chang et al. 2018]. Input: 2D image. Estimates: pose, shape, expression, texture. Approach: deep (ResNet). Comment: bundles [Chang et al. 2017; Tran et al. 2017].
- RingNet [Sanyal et al. 2019]. Input: 2D image. Estimates: pose, shape, expression. Approach: deep (ResNet). Comment: handles occlusion.
- Pix2vertex [Sela et al. 2017]. Input: 2D image. Estimates: pose, shape, expression. Approach: deep + shape from shading. Comment: shape beyond 3DMM.
- Facial Details Synthesis [Chen et al. 2019]. Input: 2D image. Estimates: pose, shape, expression, appearance. Approach: UNet for details.
- 3DMMs as STNs [Bas et al. 2017a]. Input: 2D image. Estimates: pose, shape, expression. Approach: spatial transformer network.
- 3D Face Reconstruction [Tran et al. 2018]. Input: 2D image, output of [Tran et al. 2017]. Estimates: shape details. Approach: estimate bump map using encoder-decoder architecture. Comment: handles occlusions.
- FLAME [Li et al. 2017]. Input: 2D / 3D landmarks. Estimates: pose, shape, expression. Approach: landmark fitting.
- Basel Face Pipeline [Gerig et al. 2018]. Input: 3D scan, landmarks. Estimates: pose, shape, expression, texture. Approach: Gaussian process regression, nonrigid ICP.
- LSFM Pipeline [Booth et al. 2016]. Input: 3D scan. Estimates: pose, shape, expression. Approach: nonrigid ICP. Comment: fully automatic.
- Model Fitting [Brunton et al. 2014b]. Input: 3D scan, landmarks. Estimates: pose, shape. Approach: nonrigid ICP, template and model fitting. Comment: handles occlusions.
- Multilinear Model Fitting [Bolkart and Wuhrer 2015a; Brunton et al. 2014a]. Input: 3D scan, landmarks. Estimates: pose, shape, expression. Approach: nonrigid ICP, global model in Bolkart and Wuhrer [2015a], local model in Brunton et al. [2014a]. Comment: handles occlusions.
Monocular RGBD. RGB-D sensors capture RGB as well as depth information of the scene. Consumer depth cameras use either passive stereo, IR projection-mapping, or time-of-flight technology. The depth channel in the input helps in resolving depth ambiguities due to the lack of multiple views. Thus, in addition to photometric consistency, these methods also minimize depth inconsistencies using point-to-point and point-to-plane distances, see Sec. 3. Since monocular reconstruction methods solve a smaller optimization problem compared to multi-view methods, many real-time solutions exist [Bouaziz et al. 2013; Hsieh et al. 2015; Li et al. 2013; Thies et al. 2015; Weise et al. 2011a]. While most methods heavily rely on 3DMMs, some try to adapt them to capture user-specific details. Weise et al. [2011a] build a user-specific expression model by adapting a general one. This is done in an offline stage before the online tracking. Bouaziz et al. [2013]; Li et al. [2013] adapt the 3DMM online, thus removing the need for an offline step. Hsieh et al. [2015] introduced an occlusion-robust tracking system using face segmentation masks. Liang et al. [2014] reconstruct a single image by retrieving instances of 3D shapes from a dataset and merging them, thus avoiding the need for 3DMMs.
Monocular RGB. Without the presence of the depth channel, the analysis-by-synthesis problem becomes even more ill-posed. These methods cannot easily resolve depth ambiguities. Thus, the prior knowledge of a 3DMM becomes important. Monocular RGB videos can provide more constraints. The identity component, in this case, can be estimated by fusing information from multiple frames in a preprocessing step. Many methods can track the face in real time [Cao et al. 2015, 2014a, 2013, 2016a; Ichim et al. 2015; Thies et al. 2016]. As in the case of RGB-D based methods, there are methods which try to add details over the 3DMM reconstructions to make the results user-specific and detailed. Garrido et al. [2016b] add medium-scale correctives based on spectral basis vectors. Cao et al. [2015]; Garrido et al. [2013, 2016b]; Shi et al. [2014]; Suwajanakorn et al. [2014] also add high-frequency wrinkle-level details. Wu et al. [2016b] use local blendshape models to capture more details compared to global blendshape based methods. Cao et al. [2013, 2016a]; Ichim et al. [2015] compute user-specific 3DMMs using images of a person performing specific known expressions.
Photo-collections, i.e., collections of images of a person, can also be used to constrain the identity components of the reconstructions [Kemelmacher-Shlizerman and Seitz 2011; Liang et al. 2016; Piotraschke and Blanz 2016; Roth et al. 2015, 2016; Suwajanakorn et al. 2015]. This is a more unconstrained setting compared to multi-view images where all views are captured at the same time in the same environment. Approaches which use photo-collections and videos are more practical than multi-view images since such data is widely available for most people.
Reconstruction from a single image is the most challenging scenario. However, the original work of Blanz and Vetter [1999] already proposed an analysis-by-synthesis solution, see Fig. 5. While they required manual initialization for the optimization problem, several approaches made the approach more robust to enable automatic reconstruction [Aldrian and Smith 2011a; Bas et al. 2017b; Egger et al. 2018; Fried et al. 2016; Hu et al. 2017b; Kortylewski et al. 2018c; Paysan et al. 2009b; Schneider et al. 2017; Schönborn et al. 2017; Tewari et al. 2018]. Most analysis-by-synthesis approaches evaluate the photometric consistency between the observations and the estimates. Some approaches have explored the use of other image features, such as edges, or SIFT [Booth et al. 2017; Romdhani and Vetter 2005], in order to obtain higher fidelity reconstructions. Occlusion-robust reconstruction by jointly solving for segmentation has been explored in Egger et al. [2018]. Monocular reconstruction methods primarily differ in their formulated energy functions. We will look at these in detail in Sec. 5.2.
5.2 Energy Functions
The analysis-by-synthesis paradigm involves the solution of a nonlinear optimization problem made up of a number of energy functions. Methods differ in their combination and precise design of these energy functions, their relative weights and (dealt with in the following subsection) the optimization strategy used to minimize the energy. Here we describe the most commonly used energy
Fig. 5. The analysis-by-synthesis pipeline used by Blanz and Vetter [1999] for reconstruction from a single image (stages shown in the figure: rough interactive alignment of the 3D average head to the 2D input, automated 3D shape and texture reconstruction, and illumination-corrected texture extraction). The different steps include initialization, optimization, and refinement of the optimized 3DMM texture.
functions for tting to RGB images. For single image reconstruc-
tion, the energy functions are expressed in terms of a single set
of unknown parameters
Θ
. In the case of multi-view images of a
static face, the camera parameters are indexed by viewpoint while
all others are xed across views. In the case of an image sequence
of a dynamic face, camera, expression and lighting parameters are
indexed by frame while neutral shape and texture parameters are
xed throughout the sequence.
Appearance error. The key ingredient of analysis-by-synthesis is
to measure the dierence between observed data and a synthesis
using the model. Most directly, this is the appearance error between
an input image and the rendered face. A number of variants of this
term have been used. The pixel-wise formulation sums the appear-
ance error over the pixels of the image, necessitating rasterization
of the model:
E
pixel
appearance
(Θ, I
obs
) =
Õ
(x,y)foreground
I
obs
(x, y) I
x,y
model
(Θ)
2
(19)
3D Morphable Face Models - Past, Present and Future 21
where
foreground = {(x, y) I|raster
C, T, w
s
, w
e
(x, y) , NULL}
is the set of pixels covered by the union of all triangles. This formu-
lation naturally weights the contribution of model vertices in terms
of their contribution to the appearance of a pixel. An alternative is
to compute the appearance error vertex-wise by rendering in model
space and sampling the image intensities onto the vertices:
E
vertex
appearance
(Θ, I
obs
) =
Õ
j visible
interp[I
obs
, project[C, v
j
(w
s
, w
e
)]] I
j
model
(Θ)
2
(20)
where $\text{interp}[X, (x,y)]$ represents differentiable interpolation of 2D object $X$ at location $(x,y)$ and $\text{visible}$ is the set of visible vertices. A common variant of this approach uses a random subset of vertices rather than all of them. This is more efficient, introduces stochasticity that may help avoid local minima and avoids overly conservative fits near the boundary where background may be sampled. Differentiable interpolation of the image can either be done with explicit differentiable sampling (e.g., bilinear sampling as in Jaderberg et al. [2015]) or by precomputing the image gradient and then interpolating this along with the image intensities (as was done in the original Blanz and Vetter [1999] paper). A drawback of the vertex-wise error is that regions of the image with dense coverage from projected vertices are weighted more heavily than more sparsely sampled regions. This can be overcome by using weights related to projected area. Blanz and Vetter [1999] accomplished the same effect by using the triangle area as the probability with which a triangle is selected in their random sampling. For multi-image methods, the above energies are simply summed over each image.
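To make the vertex-wise formulation concrete, the following minimal NumPy sketch evaluates Eq. (20) over a random subset of visible vertices. The projected vertex positions, per-vertex model colors and visibility set are assumed to be provided by a camera model and the 3DMM; only the differentiable-style bilinear sampling and the summed residual are shown.

```python
import numpy as np

def bilinear_interp(image, xy):
    """Sample an H x W x 3 image at continuous (x, y) locations (N x 2)."""
    x, y = xy[:, 0], xy[:, 1]
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    h, w = image.shape[:2]
    x0, x1 = np.clip(x0, 0, w - 1), np.clip(x1, 0, w - 1)
    y0, y1 = np.clip(y0, 0, h - 1), np.clip(y1, 0, h - 1)
    return ((1 - wx)[:, None] * (1 - wy)[:, None] * image[y0, x0]
            + wx[:, None] * (1 - wy)[:, None] * image[y0, x1]
            + (1 - wx)[:, None] * wy[:, None] * image[y1, x0]
            + wx[:, None] * wy[:, None] * image[y1, x1])

def vertex_appearance_energy(image, vertices_2d, model_colors, visible, n_samples=500):
    """Vertex-wise appearance error (Eq. 20) over a random subset of visible vertices.

    vertices_2d : N x 2 projected vertex positions, i.e., project[C, v_j(w_s, w_e)]
    model_colors: N x 3 per-vertex colors predicted by the model, I_model^j(Theta)
    visible     : indices of visible vertices
    """
    rng = np.random.default_rng(0)
    subset = rng.choice(visible, size=min(n_samples, len(visible)), replace=False)
    sampled = bilinear_interp(image, vertices_2d[subset])
    residuals = sampled - model_colors[subset]
    return np.sum(residuals ** 2)
```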
Feature-based energies. There are other, less direct, ways to compute an error between the observed data and model. This is done by first computing features from observed data and then measuring the difference between those features and the corresponding ones in the model. By far the most commonly used features are landmarks (alternatively known as keypoints or fiducial points), which are often used for initialization and are still important in much of the state-of-the-art, e.g., [Sanyal et al. 2019]. A landmark detector returns a set of 2D landmark coordinates $\{\mathbf{x}_j\}_{j=1}^{J}$ with $\mathbf{x}_j \in \mathbb{R}^2$. As a one-off procedure, each landmark is associated with the corresponding vertex in the 3DMM such that the $j$th landmark corresponds to vertex index $k_j \in \{1, \ldots, n\}$. The reprojection error of the model landmarks with respect to detected positions is then given by:
$$E_{\text{landmarks}}(\Theta, \{\mathbf{x}_j\}_{j=1}^{J}) = \sum_{j=1}^{J} \left\lVert \mathbf{x}_j - \text{project}[C, \mathbf{v}_{k_j}(\mathbf{w}_s, \mathbf{w}_e)] \right\rVert^2. \qquad (21)$$
Sometimes the landmarks are allowed to slide on the face surface
such that each landmark has a set of vertices to which it could
correspond [Zhu et al. 2015].
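As a simple illustration, the sketch below evaluates the landmark reprojection energy of Eq. (21) under a scaled orthographic camera; the camera model and the landmark-to-vertex index mapping are assumptions of the example, not part of any specific published system.

```python
import numpy as np

def project_scaled_orthographic(vertices, scale, R, t):
    """Placeholder scaled orthographic camera: rotate, keep x and y, scale and translate."""
    return scale * (vertices @ R.T)[:, :2] + t

def landmark_energy(detected_xy, vertices_3d, landmark_vertex_idx, scale, R, t):
    """Sum of squared reprojection errors between detected landmarks and the
    projections of their associated model vertices (Eq. 21)."""
    model_xy = project_scaled_orthographic(vertices_3d[landmark_vertex_idx], scale, R, t)
    return np.sum((detected_xy - model_xy) ** 2)

# Toy usage with random data
rng = np.random.default_rng(0)
V = rng.normal(size=(5000, 3))          # mesh vertices v_j(w_s, w_e)
idx = rng.integers(0, 5000, size=68)    # landmark-to-vertex correspondences k_j
x = rng.normal(size=(68, 2))            # detected landmarks x_j
print(landmark_energy(x, V, idx, 1.0, np.eye(3), np.zeros(2)))
```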
Edges directly convey geometric information about occluding boundaries and texture edges. Misalignments between model and image edges seriously degrade the perceptual quality of a reconstruction and lead to the wrong part of the face, or the background, being sampled onto the mesh. Moghaddam et al. [2003] were the first to exploit this cue by fitting to multi-view silhouettes. Romdhani and Vetter [2005] computed the distance transform of detected edges in an input image, providing a distance-to-edge cost surface that was sampled at projected positions of vertices lying on model texture edges or the occluding boundary. Amberg et al. [2007] extended this to multiple views and improved robustness by averaging the cost surface over different parameters of the edge detector. Keller et al. [2007] showed that these cost functions are neither continuous nor differentiable. Bas et al. [2017b] transformed edge fitting into landmark fitting by alternating between computing an explicit correspondence between edge pixels and model edges and minimizing the resulting landmark energy. Sánchez-Escobedo et al. [2016] directly regress shape parameters from a set of multi-view occluding contours.
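A minimal sketch of the alternation used in such edge-based fitting, in the spirit of Bas et al. [2017b]: each projected model-edge vertex is matched to its nearest detected image edge pixel, and those matches are then treated as landmark targets. The projection, edge detector and landmark optimizer are placeholders; only the correspondence step is shown.

```python
import numpy as np

def nearest_edge_correspondences(projected_edge_vertices, edge_pixels):
    """For each projected model edge vertex (M x 2), return the nearest
    detected image edge pixel (K x 2) as its landmark target."""
    # Pairwise squared distances (M x K); brute force for clarity
    d2 = ((projected_edge_vertices[:, None, :] - edge_pixels[None, :, :]) ** 2).sum(-1)
    return edge_pixels[np.argmin(d2, axis=1)]

# One outer iteration of the alternation (projection and optimizer omitted):
# targets = nearest_edge_correspondences(project(model_edge_vertices), edge_pixels)
# theta   = minimize_landmark_energy(targets, model_edge_vertex_indices, theta)
```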
Finally, some other features have been considered. Romdhani and Vetter [2005] used the position of specularities in the image to constrain the surface normal direction at the corresponding location on the model via a specular reflection model. Booth et al. [2017] and Booth et al. [2018b] compute dense SIFT features from the input image and compare these to the SIFT features on which their statistical texture model is built, in a similar fashion to the vertex-wise appearance error above.
Background Modeling. A common challenge when optimizing for pose and shape is the varying visibility of vertices for vertex-wise errors and the varying number of pixels covered by the face for pixel-wise errors. This commonly leads to the undesired effect of shrinking: having the model cover fewer pixels or having fewer vertices visible leads to an undesired local optimum of most error terms. Common strategies to overcome this are fixed visibility, restrictive regularization, relying on landmarks, enforcing edge or contour terms, or explicit image segmentation. Schönborn et al. [2015] demonstrated the problems with the implicit background model that is present in all error formulations and showed that even simple background models $b$, like a constant, a Gaussian or an image histogram-based model, can solve this issue. The background model can easily be added to the existing formulations, e.g., for the pixel-wise formulation as:
$$E^{\text{image}}_{\text{appearance}}(\Theta, I_{\text{obs}}) = E^{\text{pixel}}_{\text{appearance}}(\Theta, I_{\text{obs}}) + \sum_{(x,y) \in \text{background}} b(I_{\text{obs}}(x,y)). \qquad (22)$$
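The sketch below combines a foreground appearance term with a simple histogram-based background model in the spirit of Eq. (22); the rendered model image, the foreground mask and the normalized grayscale histogram are assumptions of the example, not a particular published implementation.

```python
import numpy as np

def histogram_background_cost(pixels, bin_edges, hist):
    """Negative log-likelihood of grayscale pixel intensities under a
    precomputed histogram model b (hist normalized to a probability distribution)."""
    bins = np.clip(np.digitize(pixels, bin_edges) - 1, 0, len(hist) - 1)
    return -np.log(hist[bins] + 1e-8).sum()

def image_appearance_energy(observed, rendered, foreground_mask, bin_edges, hist):
    """Eq. (22): pixel-wise appearance error on the foreground plus a
    background model evaluated on the remaining pixels."""
    fg = foreground_mask
    e_fg = np.sum((observed[fg] - rendered[fg]) ** 2)
    gray_bg = observed[~fg].mean(axis=-1)   # grayscale values of background pixels
    e_bg = histogram_background_cost(gray_bg, bin_edges, hist)
    return e_fg + e_bg
```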
Occlusions and Segmentation. Occlusion of faces by other objects that are not part of the generative model is a common challenge for the error terms presented so far and for analysis-by-synthesis in general. Various methods have been presented for identifying occlusions. They range from appearance-based methods [De Smet et al. 2006; Pierrard 2008] to detection [Morel-Forster 2016] and segmentation-based methods [Egger et al. 2018; Saito et al. 2016]. They share the basic idea that occluded pixels are excluded from the model evaluation:
$$E^{\text{semantic}}_{\text{appearance}}(\Theta, I_{\text{obs}}) = \sum_{l \in \text{label}} \sum_{(x,y) \in R(l)} E^{\text{pixel}}_{\text{label}}(\Theta, I_{\text{obs}}(x,y), l), \qquad (23)$$
where each $E^{\text{pixel}}_{\text{label}}$ is a separate model per label, and $R(l)$ is the image region covered by label $l$. These labels could, e.g., be face, occlusion and background, or also contain more detailed labels like beards. Whilst the segmentation is based on detection and fixed in Morel-Forster [2016]; Pierrard [2008]; Saito et al. [2016], other methods solve for segmentation and model parameter estimation jointly in an Expectation-Maximization-based manner [De Smet et al. 2006; Egger et al. 2018].
Priors. A 3DMM is a statistical model and so provides a natural probabilistic prior over the parameter space. Under the assumption that the original data is Gaussian distributed, the natural cost function to express this prior for either the shape or texture model is:
$$E_{\text{prior}}(\Theta) = \sum_{i=1}^{d} \frac{w_i^2}{\sigma_i^2}, \qquad (24)$$
where $\sigma_i^2$ is the variance associated with the $i$th principal component. The drawback of this prior is that it is minimized by the mean face and, if weighted heavily, leads to model dominance where recovered faces are too close to the average. There has been discussion in the literature [Lewis et al. 2014b; Patel and Smith 2016] as to whether this prior is appropriate in high-dimensional spaces, and alternatives have been considered, as will be described next.
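For concreteness, the statistical prior of Eq. (24) and its gradient are only a few lines of NumPy; the per-component variances `sigma2` would come from the PCA of a specific model, which we only assume here.

```python
import numpy as np

def prior_energy(w, sigma2):
    """Eq. (24): squared model coefficients weighted by the inverse PCA
    variances. Minimized by w = 0, i.e., the mean face."""
    return np.sum(w ** 2 / sigma2)

def prior_energy_grad(w, sigma2):
    """Gradient of Eq. (24) with respect to the coefficients w."""
    return 2.0 * w / sigma2
```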
One class of techniques allows reconstructed shape and/or texture to deviate from the 3DMM subspace, enabling recovery of fine-scale detail not captured by the model. Allowing arbitrary shape or albedo changes transforms the problem into classical shape-from-shading and becomes highly ill-posed. For this reason, additional generic priors are used. Patel and Smith [2012] use a piecewise smoothness prior on per-vertex diffuse albedo, which is allowed to vary per-vertex along with surface normals to satisfy a shape-from-shading constraint. This is regularized using the squared vertex distance between the updated shape and the closest shape in the 3DMM space. Richardson et al. [2017] use the same regularization, though expressed in terms of per-pixel depth. To ensure smoothness, they also use the L1 norm of the discrete Laplacian of the depth map. The L2 norm of the mesh Laplacian has also been used as a smoothness prior [Garrido et al. 2016a; Tewari et al. 2018].
When reconstructing a dynamic face from video, parameters can either be assumed fixed (if identity dependent) or smoothly varying (pose, expression, lighting). These latter parameters can, therefore, be regularized with generic temporal smoothness priors. A common and simple way to express this prior is to initialize each frame with the estimate from the previous one. This encourages convergence to a local minimum close to the solution for the previous frame. More sophisticated priors have also been considered. For example, Cao et al. [2013]; Weise et al. [2011a] build a Gaussian mixture model over expression parameters from the previous $k$ frames. This model is then used to regularize the estimate for the current frame.
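A simplified version of such a temporal prior, shown below, fits a single Gaussian (rather than the full mixture of Cao et al. [2013]; Weise et al. [2011a]) to the expression parameters of the previous k frames and penalizes the Mahalanobis distance of the current estimate; this is only a hedged illustration of the idea.

```python
import numpy as np

def temporal_prior(current_expr, previous_exprs, eps=1e-3):
    """Penalize deviation of the current expression parameters from a Gaussian
    fit to the previous k frames (simplified single-Gaussian variant)."""
    mu = previous_exprs.mean(axis=0)
    cov = np.cov(previous_exprs, rowvar=False) + eps * np.eye(previous_exprs.shape[1])
    diff = current_expr - mu
    return float(diff @ np.linalg.solve(cov, diff))
```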
5.3 Optimization
From the perspective of optimizing the energy functions above, there are a number of significant challenges. First, most of the energy terms are nonconvex in theory, and we observe in practice that there are many local minima. Second, the appearance error is not even continuous, due to rasterization/vertex visibility and shadowing all being noncontinuous functions. Third, the appearance error has a small basin of convergence: when a model feature is completely misaligned with the image (or, in the extreme case, the whole model is aligned entirely to background), the gradient of the appearance error conveys no useful information. Fourth, all parameters have global influence. Fifth, computing the appearance error and its gradient is computationally expensive, amounting to the rendering of an image. For these reasons, a significant effort has gone into the selection of optimization algorithms and engineering of the optimization schedule to develop methods that are sufficiently fast and robust.
The majority of existing approaches optimize based on gradient information of the energy function. The original Blanz and Vetter [1999] approach used first-order gradient descent, as have other more recent methods [Bouaziz et al. 2013; Fried et al. 2016; Ichim et al. 2015]. Since they computed the appearance error over only a small subset of randomly selected triangles, this is, strictly speaking, stochastic gradient descent (SGD). An interesting parallel here is that modern deep learning-based methods (see Section 6) are usually trained with SGD and use similar energy functions, so they are learning from the same signal used in the original method.
Since the energy terms above can easily be formulated as nonlinear least-squares problems, specialized pseudo-second-order methods like Gauss-Newton or Levenberg-Marquardt have often been used [Garrido et al. 2013, 2016a,b; Romdhani and Vetter 2005; Thies et al. 2015, 2016]. Booth et al. [2017] use a project-out strategy in which appearance parameters are implicitly solved in a least squares sense and optimization takes place only over geometric parameters. General pseudo-second-order methods such as BFGS have been used [Cao et al. 2013; Weise et al. 2011a], as well as genuine second-order methods, specifically a stochastic variant of Newton's method [Blanz and Vetter 2003]. As the problem size increases, as in the case of shape-from-shading, gradient descent becomes the most common optimization approach [Garrido et al. 2016a; Shi et al. 2014; Suwajanakorn et al. 2014; Tewari et al. 2018]. In all the above methods, the discontinuity of the appearance function is dealt with by fixing rasterization/visibility when computing gradients or even keeping them fixed for a certain number of iterations. Importantly, this means that the gradient cannot convey information about a change in visibility. Many other tricks have been considered, for example, hierarchical optimization (both in parameter space and spatially, i.e., multiresolution [Thies et al. 2016]) and using an optimization schedule in which different energy functions are switched on or weighted differently at different phases of the optimization [Blanz and Vetter 1999].
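The outer loop below sketches this common pattern, assuming hypothetical `rasterize` and `energy_grad` callbacks: visibility is recomputed only every few iterations and held fixed while gradients are taken, and energy weights follow a simple schedule. It is a generic sketch, not a specific published method.

```python
import numpy as np

def fit_3dmm(theta, rasterize, energy_grad, n_iters=200,
             refresh_visibility_every=10, lr=1e-3):
    """Gradient descent on the combined energy, with rasterization/visibility
    held fixed between periodic updates and a simple weight schedule."""
    visibility = rasterize(theta)               # initial rasterization / visibility
    for it in range(n_iters):
        if it % refresh_visibility_every == 0:
            visibility = rasterize(theta)       # discontinuous step, taken outside the gradient
        # schedule: emphasize landmarks early, appearance later
        w_landmarks = max(0.0, 1.0 - it / 50.0)
        w_appearance = min(1.0, it / 50.0)
        grad = energy_grad(theta, visibility, w_landmarks, w_appearance)
        theta = theta - lr * np.asarray(grad)
    return theta
```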
Several approaches have decomposed the energy terms into several smaller, often linear, problems (sometimes with closed-form solutions) that can be solved efficiently and in sequence [Aldrian and Smith 2010, 2011a, 2013, 2011b; Bas et al. 2017b; Cao et al. 2014a, 2013; Hu et al. 2017c; Romdhani et al. 2002; Saito et al. 2016; Zhu et al. 2015]. These alternating approaches are usually very efficient but not guaranteed to obtain the optimum solution that comes from optimizing all parameters simultaneously.
Gradient-based methods are typically initialized by tting only
to landmarks, i.e. to optimize the landmark energy in isolation.
Originally, the landmark positions were provided manually but
Fig. 6. An example of the albedo-illumination ambiguity presented by Egger [2018]. The target image in the first row (a), its color rendered under ambient illumination (b) and its illumination rendered on the mean albedo of the Basel Face Model (c). The second row shows a model instance with different color (e) and illumination (f) parameters but very similar appearance (d).
Originally, the landmark positions were provided manually, but combining with an automatic landmark detector provided fully automatic methods [Breuer et al. 2008]. From a landmark detector that outputs many hypothesized landmark locations, including many false positives, Amberg and Vetter [2011] use Branch and Bound to select the subset configuration of landmarks that is most consistent with the 3DMM. Bas and Smith [2019] show how to express the landmark energy as a separable nonlinear least squares problem.
While gradient-based methods are widely used, mainly due to computational efficiency and ease of implementation, these methods are sensitive to initialization and often end up in local minima. Probabilistic methods based on Bayesian inference were proposed to deal with these limitations [Egger et al. 2018; Kortylewski et al. 2018c; Schneider et al. 2017; Schönborn et al. 2017]. These methods do not require any gradient computation of the energy terms to update the estimates. They are stochastic and thus less susceptible to getting stuck in local minima. Different from optimization-based methods, which only provide a single solution, these approaches approximate the full posterior distribution and thus provide access to a manifold of possible solutions.
5.4 Open challenges
Reconstructing 3D shape and albedo from a 2D image is an ill-posed problem. Ambiguities like the perspective face shape ambiguity [Smith 2016] and the albedo-illumination ambiguity [Egger 2018] have been demonstrated (see Figure 6). These ambiguities cannot be resolved completely, and priors are our best approach to at least find a reasonable estimate. They are the major reason why there is a huge gap between the estimates we can get from multi-view and 3D data vs. from monocular images. Even in the state-of-the-art, it is often evident that overall skin color is explained using the lighting while the albedo colors are similar for very different skin types [Tewari et al. 2018]. This is somewhat improved by discriminative methods that do not need to synthesize the same appearance as a given image, only an image with the same identity [Genova et al. 2018], thereby sidestepping explicit estimation of illumination and camera parameters. Reporting the geometric errors obtained by the model mean is not common. Only three papers demonstrated their 3D reconstructions to be closer in mesh distance to the ground truth face than the model mean [Aldrian and Smith 2013; Sanyal et al. 2019; Schönborn et al. 2017].
Current state-of-the-art techniques also lack dramatically in accuracy across pose and in terms of matching contours and edges. It is very difficult to evaluate these beyond qualitative evaluation, which makes it difficult to compare different approaches. Recently, a first benchmark with natural images and ground truth shape was published and will help to better compare competing methods [Sanyal et al. 2019]. However, as methods become more accurate, the mesh distance errors get close to the range of error in computing "ground truth" using multi-view methods. This makes it difficult to quantitatively compare different approaches.
Another challenge which is usually neglected is occlusion. Faces are mostly occluded by objects which are frequently in front of faces, like glasses, cigarettes, hands or microphones, but can also be occluded by virtually any other object. Analysis-by-synthesis methods fail when they do not explicitly model occlusions. Furthermore, reconstruction methods based on 3DMMs are limited to the space of faces covered by the models. A lot of residual error in the results stems from the fact that 3DMMs do not model detailed and high-frequency geometry and texture. Furthermore, most approaches use simple lighting models which cannot explain many in-the-wild images. These limitations are also shared with the learning-based methods which use analysis-by-synthesis in their pipeline, see Sec. 6.
Recent techniques have focused on reconstruction from a single face individually. The aim of face image analysis would, however, go beyond interpreting a single face or each face separately. We would like to analyze and interpret interactions between people and perhaps also ease the analysis task by exploiting scene constraints, such as shared illumination parameters to deal with the albedo-illumination ambiguity, or constraints on the perspective face shape ambiguity, by analyzing multiple faces jointly.
6 DEEP LEARNING
So far, we have mainly discussed classical face modeling and parameter estimation techniques based on optimization-based inverse graphics. We now discuss how these processes can be replaced by or combined with deep learning, see Figure 7. There are a number of reasons for wanting to do this. On the modeling side, the use of nonlinear, deep representations offers the possibility to surpass classical linear or multilinear models in terms of generalization, compactness and specificity [Styner et al. 2003]. On the parameter estimation side, we can exploit the speed and robustness of deep networks to achieve reliable performance on uncontrolled images. We begin by discussing deep modeling and deep model fitting before finally discussing methods that simultaneously learn both the model and how to fit it within a single deep network.
Fig. 7. The relationship between classical analysis-by-synthesis and deep learning approaches. (a) analysis-by-synthesis, e.g., [Blanz and Vetter 1999]. (a)+(c) training a regressor based on the output of an analysis-by-synthesis algorithm, e.g., [Tran et al. 2017], (b)+(c) training a regressor using synthetic data generated by a model, e.g., [Richardson et al. 2016], (d) self-supervision, e.g., [Tewari et al. 2017].
6.1 Deep Face Models
The traditional modeling techniques discussed in Section 3 aim to represent face shape, expression, and appearance as a vector $\mathbf{w}$ in a low-dimensional latent space $\mathbb{R}^d$. The projection into (respectively reconstruction from) this latent space is defined by linear or multilinear operations, and can be thought of as encoding (respectively decoding) the high-dimensional information in $\mathbb{R}^d$. Deep learning provides a new tool for building 3DMMs using nonlinearities both in the encoder and the decoder. This way of building morphable models is currently a very active area of research.
We can see the relationship between the encoder and decoder learned using deep learning and classical works using the example of linear models commonly used for shape and texture modeling. In the context of deep learning, such a linear model, formalized in Equation (2), is exactly equivalent to a fully connected layer in a neural network. Concretely, the parameter vector $\mathbf{w}$ plays the role of the input features, the principal components $\mathbf{e}_j$ are the weights and the mean $\bar{\mathbf{c}}$ is the bias. This can be viewed as decoding from the latent parameter space to the data space $\mathbf{c}$. Projection onto the model can similarly be viewed as encoding with a fully connected layer in which the input features are the data, the weights are the rows of the transposed principal component matrix and the biases are given by $-\mathbf{e}_j^{\mathsf{T}} \bar{\mathbf{c}}$. Concluding the analogy, a PCA can be accomplished by combining the encoder and decoder as a linear autoencoder with a single hidden layer. Such an autoencoder with $d$ neurons in the hidden layer will learn a latent space with the same span as a $d$-dimensional PCA, though without the guarantee of orthogonality (though this could be ensured with appropriate loss functions).
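To illustrate the analogy, the sketch below writes the PCA decoder and encoder as fully connected layers in PyTorch; the principal component matrix `E` (columns $\mathbf{e}_j$) and mean `c_bar` are assumed to come from an existing linear model, and a random orthonormal basis stands in for them in the toy round trip.

```python
import torch
import torch.nn as nn

def pca_as_linear_layers(E, c_bar):
    """Build encoder/decoder fully connected layers equivalent to a linear 3DMM.

    E     : (m, d) matrix whose columns are the principal components e_j
    c_bar : (m,) mean shape/texture vector
    """
    m, d = E.shape
    decoder = nn.Linear(d, m)                 # c = E w + c_bar
    encoder = nn.Linear(m, d)                 # w = E^T (c - c_bar)
    with torch.no_grad():
        decoder.weight.copy_(E)
        decoder.bias.copy_(c_bar)
        encoder.weight.copy_(E.t())
        encoder.bias.copy_(-E.t() @ c_bar)
    return encoder, decoder

# Toy round trip: a sample lying in the PCA subspace is reconstructed exactly
E, _ = torch.linalg.qr(torch.randn(300, 10))   # orthonormal toy basis
c_bar = torch.randn(300)
enc, dec = pca_as_linear_layers(E, c_bar)
w = torch.randn(10)
print(torch.allclose(dec(enc(dec(w))), dec(w), atol=1e-5))
```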
Given this close relationship between classical methods and deep learning, it is natural to ask if there exist more powerful nonlinear models that can be trained based on current advances in deep neural networks. As in the classical work, this has been considered for the 2D case. Duong et al. [2019] propose a deep appearance model for 2D facial images that extends 2D AAMs to model nonlinearities. This is achieved using deep Boltzmann machines to model 2D shape and texture information. For modeling 3D faces, first successful models using autoencoders, GANs, and hybrid structures have been proposed, as detailed in the following.
Fernández Abrevaya et al. [2018] proposed the first encoder-decoder architecture to model the 3D geometry of faces. The encoder first projects the 3D face to a 2D image and uses a standard image-based encoder, while the decoder is fixed to a classical tensor-based face model. This allows decoupling shape variations caused by identity and expression. Bagautdinov et al. [2018] introduced a VAE that models different levels of detail of facial geometry by representing global and increasingly localized shape variations in different layers of the network. The 3D geometry is again represented using a two-dimensional mapping, and convolutions are performed in the image domain. This work allows representing highly detailed geometric information in latent space. Lombardi et al. [2018] extend this work to jointly encode variations in appearance and geometry, for the application of highly detailed facial rendering from novel viewpoints. Ranjan et al. [2018] proposed the first autoencoder architecture for the geometry of faces that performs convolutions in 3D mesh space directly, instead of going through a 2D image representation. The model, named CoMA, allows for very compact representations of the facial geometry. This work was recently extended to encode both texture and shape information jointly [Zhou et al. 2019].
An alternative line of work considers learning GANs for 3D face modeling. Slossberg et al. [2018] proposed the first 3DMM using GANs. In this work, the facial texture is mapped to a coherent 2D image domain, and two-dimensional convolutions are employed to build a GAN of facial texture. This is combined with a standard PCA-based 3DMM for facial geometry, where for a generated face texture, a suitable PCA-based geometry is computed. Recently, multiple methods were proposed to generate 3D facial geometry, possibly with texture information. Fernández Abrevaya et al. [2019] proposed to train a GAN for the geometry of 3D faces that is able to decouple different factors of variation such as identity and expression. Shamai et al. [2020] proposed a GAN architecture to generate both facial geometry and texture, with a focus on highly detailed texture information, by mapping the face to a unit rectangle. Cheng et al. [2019] proposed the first intrinsic GAN architecture that operates directly on 3D meshes. As in the case of 2D images, GANs are generally able to generate more detailed and realistic 3D faces than autoencoders, at the cost of being more difficult to train.
Finally, hybrid structures can be eective to learn nonlinear
3DMMs. Tran and Liu [2018a] jointly learn a 3DMM and 3D re-
construction from a 2D image using a dierentiable renderer in the
training loss, see also Section 6.3. The network takes as input a 2D
image and encodes it into projection, shape and texture parameters.
Two decoders are then used to infer 3D shape and texture, respec-
tively. Wang et al
.
[2019b] proposed an adversarial auto-encoder
structure that allows disentangling factors of variation such as iden-
tity, expression, or pose of 2D facial images, and that is trained in
an unsupervised way. While the method’s input and output are 2D
images, the 3D geometry of the face can be reconstructed.
Recently, appearance modeling approaches based on deep learning have also been proposed. The rise of deep learning methods made it possible to learn per-vertex appearance models directly from images, as done by Tewari et al. [2018], who learn per-vertex albedo model offsets in order to improve the generalization ability of an existing PCA-based model. Similarly, Tewari et al. [2019] learn a per-vertex albedo model from scratch based on video data. Zhou et al. [2019] train a mesh decoder that jointly models the texture and shape on a per-vertex basis, which, however, relies on the availability of 3D shape and appearance data. There are also several deep learning approaches that consider texture-based appearance modelling. Without the need for 3D data, Tran and Liu [2018a] learn a nonlinear facial appearance model represented in uv-space based on CNNs, which, however, does not explicitly consider lighting. In follow-up work, the authors considered a more elaborate model where the albedo and the lighting are separately modeled [Tran et al. 2019; Tran and Liu 2018b]. Moreover, a range of generative methods that synthesize facial textures have been proposed, e.g., by Saito et al. [2017], Slossberg et al. [2018], Deng et al. [2018], Lombardi et al. [2018], Nagano et al. [2018] and Yamaguchi et al. [2018]. Gecer et al. [2019b] use a GAN-based texture model for the task of 3D face reconstruction, and Nagano et al. [2019] use GAN-based texture models for the task of face normalization.
6.2 Deep Face Reconstruction
In the following, we discuss dense monocular face reconstruction approaches that are based on deep neural networks. We discuss requirements on the training data used, as well as different training strategies. Let us first have a closer look at the reconstruction problem. Blanz and Vetter [1999] tackle monocular face reconstruction by fitting a parametric model using an optimization approach, i.e., gradient descent. Deep learning approaches follow a similar optimization strategy but, instead of solving the optimization problem at 'test' time, they, for example, train a parameter regressor based on a large dataset of training images, see Figure 7. The regressor can be interpreted as an encoder network that takes a 2D image as input and outputs the low-dimensional face representation. Learned encoders can be combined with decoders based on classical face models to give rise to end-to-end encoder-decoder architectures. This methodology is widely used and enables the fusion of classical model-based and deep learning approaches.
6.2.1 Supervised Reconstruction. Supervised regression approaches are trained based on paired training data, i.e., a set of monocular images and the corresponding ground truth 3DMM parameters. One of the essential questions here is how to efficiently obtain the ground truth for such a supervised learning task. In the following, we categorize the approaches based on the type of employed ground truth training data.
One option would be to let users annotate the ground truth. While this is a popular strategy, which is often employed for sparse reconstruction problems [Saragih et al. 2011], the accurate annotation of dense geometry, appearance, and scene illumination is almost intractable. A related approach is, for example, employed in the work of Olszewski et al. [2016], where three professional animators manually created the blendshape animation to match a video clip.
For dense reconstruction tasks, some approaches [Laine et al. 2017] are trained based on images captured in a controlled multi-view capture setup. Thus, ground truth can be obtained by a multi-view reconstruction approach followed by fitting a 3DMM to the resulting 3D data. Normally, the ground truth is of very high quality, but the distribution of the captured monocular images does not match in-the-wild data, which can lead to generalization problems at test time.
The approach of Tran et al. [2017] performs monocular reconstruction for multiple images of the same person and computes a consolidated face identity based on simple averaging of the 3DMM parameters.
Currently, many approaches [Feng et al. 2018; Kim et al. 2018b; Klaudiny et al. 2017; McDonagh et al. 2016; Richardson et al. 2016; Sela et al. 2017; Yu et al. 2017] in the research community are trained on synthetic training data, since it is easy to acquire and comes by design with perfect annotations. Given a face 3DMM, random identities and expressions can be sampled in parameter space. Afterward, the models can be rendered under randomized illumination conditions and from different viewpoints to create the monocular images. Often, background augmentation is employed by rendering the generated faces on top of a large variety of real-world background images. Since all the parameters are controlled, they are explicitly known and can be used as ground truth. While it is easy to get access to synthetic training data, there is often a large domain gap between synthetic and real-world images, which severely impacts generalization to real images. For example, hair, facial hair, torsos, or mouth interiors are often not modeled at all. One possibility to counteract this problem in the future would be better models that include all these components.
To leverage the advantages of both real and synthetic training data, many current approaches [Kim et al. 2018b; Richardson et al. 2017] are trained on a mixture of data from these two domains. The hope here is that the approach learns to deal with real-world images, while the perfect ground truth of the synthetic training data can be used to stabilize training. One interesting variant of this is self-supervised bootstrapping [Kim et al. 2018b] of the training corpus. Other approaches that can be trained without requiring ground truth data are presented in the next sections.
6.2.2 Self-Supervised Reconstruction. Supervised training of a convolutional neural network requires an annotated dataset. Most of the methods we have discussed so far use such datasets, either synthetic or real. Recently, some approaches explored self-supervised learning, i.e., training on real image datasets without any 3D labels.
This was made possible by a combination of analysis-by-synthesis (Sec. 4) and deep learning techniques. Tewari et al. [2017] introduced a model-based encoder-decoder architecture, which replaces the trainable decoder with an expert-designed fixed decoder. This expert-designed decoder takes the 3DMM parameters (latent code) predicted by an encoder as input and transforms them into a 3D reconstruction using the 3DMM. It further renders a synthetic image of the reconstruction using a differentiable renderer. Extrinsic parameters required for rendering are also predicted by the encoder. The loss function used is very similar to those used in analysis-by-synthesis (Sec. 5.2), consisting of photometric alignment and statistical regularization. We can think of such a technique as a joint analysis-by-synthesis optimization problem over a large training dataset, instead of a single image, see Figure 7. This allows for training a parameter regressor without any 3D supervision. This concept, usually in combination with supervised synthetic data, has also been explored using higher-level loss functions like identity preservation [Genova et al. 2018; Sanyal et al. 2019], or perceptual and adversarial losses [Tran et al. 2017]. Gecer et al. [2019b] employ GANs in combination with differentiable rendering to learn a powerful generator of facial texture. Richardson et al. [2017] and Sengupta et al. [2018] refine 3DMM predictions for higher quality or more detailed results. Deng et al. [2019] and Sanyal et al. [2019] extend the network architecture to allow for training using multiple images of a person as a constraint. Bas et al. [2017a] use a 3DMM as a spatial transformer network such that model fitting is learned as a by-product of solving a downstream task.
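A training step of such a model-based encoder-decoder can be sketched as below in PyTorch. The encoder backbone, the fixed 3DMM decoder and the differentiable renderer are placeholders (a plain MLP encoder, a linear basis and an identity "renderer"), so this only illustrates how the photometric and regularization losses of Sec. 5.2 are reused as self-supervised training losses; it is not the architecture of any specific paper.

```python
import torch
import torch.nn as nn

class ModelBasedAutoencoder(nn.Module):
    """Trainable encoder + fixed, expert-designed decoder (a sketch).

    The decoder here is a plain linear 3DMM followed by an identity 'renderer'
    that treats per-vertex colors as the output image; a real system would use
    a differentiable rasterizer instead.
    """
    def __init__(self, image_dim, n_params, basis, mean):
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in for a CNN backbone
            nn.Linear(image_dim, 256), nn.ReLU(),
            nn.Linear(256, n_params))
        self.register_buffer("basis", basis)     # fixed 3DMM basis (image_dim x n_params)
        self.register_buffer("mean", mean)       # fixed 3DMM mean (image_dim)

    def forward(self, images):
        theta = self.encoder(images)                      # predicted parameters
        rendered = theta @ self.basis.t() + self.mean     # fixed decoder / 'renderer'
        return rendered, theta

def self_supervised_loss(rendered, images, theta, sigma2, w_reg=1e-3):
    """Photometric alignment plus statistical regularization (cf. Eq. 24)."""
    photometric = ((rendered - images) ** 2).mean()
    prior = (theta ** 2 / sigma2).mean()
    return photometric + w_reg * prior

# Toy training step on random data
torch.manual_seed(0)
model = ModelBasedAutoencoder(image_dim=1024, n_params=40,
                              basis=torch.randn(1024, 40), mean=torch.zeros(1024))
opt = torch.optim.Adam(model.encoder.parameters(), lr=1e-4)
images = torch.rand(8, 1024)
rendered, theta = model(images)
loss = self_supervised_loss(rendered, images, theta, sigma2=torch.ones(40))
loss.backward()
opt.step()
```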
6.3 Joint Learning of Model and Reconstruction
Model-based encoder-decoder networks consist of a trainable encoder and a fixed decoder, where the decoder implements a 3DMM. However, the 3DMM itself could be trainable. We could simply update its values using the gradients from the loss function. This would allow face model learning using only 2D supervision. Learning 3D models entirely from 2D data was first shown in [Cashman and Fitzgibbon 2012] without the use of deep learning. Several deep learning approaches have explored refining an existing 3DMM using large image datasets [Lin et al. 2020; Tewari et al. 2018; Tran et al. 2019; Tran and Liu 2018b]. Nonlinear convolutional decoders have also been used to build nonlinear face models [Tran et al. 2019; Tran and Liu 2018b]. Models learned from 2D data generalize better to different identities, as the image datasets contain significantly more identities than the 3D datasets used to compute 3DMMs. Recently, an extension of the model-based encoder-decoder architecture was used to learn the identity component of a face model from videos [Tewari et al. 2019].
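In the simplified setting of the sketch above, making the decoder trainable amounts to registering the 3DMM basis and mean as parameters instead of buffers, so that they receive gradients from the same 2D losses; the snippet below illustrates this change with hypothetical names and a linear decoder standing in for the full model.

```python
import torch
import torch.nn as nn

class TrainableLinear3DMM(nn.Module):
    """Linear 3DMM whose basis and mean receive gradients from the 2D losses,
    so the model itself is refined while the regressor is trained."""
    def __init__(self, basis_init, mean_init):
        super().__init__()
        self.basis = nn.Parameter(basis_init.clone())   # initialized from an existing 3DMM
        self.mean = nn.Parameter(mean_init.clone())

    def forward(self, theta):
        return theta @ self.basis.t() + self.mean

# Encoder and decoder parameters would now be optimized jointly, e.g.:
# opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
```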
6.4 Open Challenges
Applying deep learning to the analysis of 3D face data is an active research topic that the community has only started to explore during the past few years, with many ongoing advances. Hence, many challenges currently remain to be solved. The most pressing ones include analyzing the limitations of current methods and providing comprehensive comparisons. This includes a clear analysis of the methods' tendency to overfit, especially when mostly synthetic data is used for training, and of the interpretability of the learned representations. It also includes a clear analysis of whether training in the 2D or 3D domain offers clear benefits for different applications.

It is interesting that deep learning methods are learning from essentially the same energy functions as classical methods, using similar optimization approaches (e.g., stochastic gradient descent). The difference is that backpropagation updates are averaged over batches and whole datasets, seemingly alleviating problems of local minima or overfitting to a single sample. The problem then becomes overfitting to the distribution of faces in the training set. The training data used in these learning-based methods are often biased (e.g., Liu et al. [2015] includes mostly smiling faces). This leads to biases in the reconstruction methods. A practical question that needs to be solved is to determine the minimum amount of data required to apply deep learning methods. This is important when high-quality data is used for supervised training.
As learning-based and analysis-by-synthesis methods come together through self-supervised reconstruction methods, there are many shared challenges, such as perspective face shape ambiguities and dealing with occlusions (e.g., Tran et al. [2018] already took a first step in this direction). Learning-based methods are typically very fast and robust to initialization but achieve lower quality results compared to analysis-by-synthesis methods. One way to combine the desirable properties of these different paradigms is to use the learning-based solution as initialization for analysis-by-synthesis optimization [Tewari et al. 2018].
While some recent methods have tried to build 3DMMs just from 2D data for better generalization, the resulting models are not as high-quality and lack details due to the low resolution of faces in currently available in-the-wild images. Bridging the gap in terms of details between models trained using high-quality data and those built using only 2D data is an important open challenge.

Other challenges include extending recently developed methods to new applications. For instance, while monocular face reconstruction has started being explored, there is not yet much work on reconstructing a coherently deforming facial geometry from 2D video data.
7 APPLICATIONS
Parametric face models enable many compelling applications. In
the following, we will discuss applications in the domains of face
recognition, entertainment, medical applications, forensics, cogni-
tive science, neuroscience, and psychology. All these applications
have been pushed by the availability of publicly shared models and
code (see Tab. 2), as well as other resources [Community 2019].
7.1 Face Recognition
In the context of face recognition, 3DMMs have a manifold of potential applications. Blanz et al. [2002] proposed to perform face recognition using the cosine angle between the shape and color coefficients estimated from a pair of 2D images as a distance metric for identification and recognition. This distance metric exploits the natural disentanglement of 3DMMs, separating identity (shape and color) from camera and illumination variation.
Fig. 8. The first real-time facial reenactment approach [Thies et al. 2015] was based on RGB-D sensors. The approach tracks the facial expressions of a source and target actor, transfers the expression from source to target, and re-renders the target actor with the new expression on top of the input video stream.
It was shown that this 3DMM-based distance metric enables recognizing faces across large pose and illumination variations [Blanz et al. 2002; Blanz and Vetter 2003; Paysan et al. 2009a], while being robust to facial expressions [Gerig et al. 2018], as well as enabling recognition from features in the facial texture [Pierrard and Vetter 2007]. Recently, Tran et al. [2017] have shown that the performance of face recognition with 3DMM parameters can be enhanced by specifically taking the face recognition task into account when regressing the 3DMM parameters. Whilst most work focuses on face recognition from 2D images, the 3DMM has also been applied to the 3D face recognition task, focusing on the shape coefficients and robust recognition with respect to facial expressions [Amberg et al. 2008; Paysan et al. 2009a; ter Haar and Veltkamp 2008].
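As a minimal illustration of this identification scheme, the sketch below scores a probe against a gallery using the cosine angle between concatenated shape and color coefficient vectors; the coefficients themselves are assumed to come from a fitting procedure such as those in Sec. 5.

```python
import numpy as np

def identity_descriptor(shape_coeffs, color_coeffs):
    """Concatenate fitted shape and color coefficients into one identity vector."""
    return np.concatenate([shape_coeffs, color_coeffs])

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def identify(probe, gallery):
    """Return the index of the gallery descriptor most similar to the probe."""
    scores = [cosine_similarity(probe, g) for g in gallery]
    return int(np.argmax(scores)), max(scores)
```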
Although 3DMMs have shown promising results at face recognition in controlled settings, they did not achieve convincing performance on in-the-wild data. This arises from the ill-posed problem of estimating shape and color parameters from a 2D image, while at the same time high precision of this estimation is needed for face recognition. Therefore, purely data-driven approaches have remained the dominant approach to face recognition, in particular since the advancement of deep learning technology [Parkhi et al. 2015; Schroff et al. 2015; Taigman et al. 2014]. However, data-driven approaches have fundamental problems, such as their dependence on large-scale training data and their lack of generalization to out-of-distribution samples [Klare et al. 2012]. One of the main issues for face recognition is the alignment of images. Careful alignment of face images has a big impact on face recognition accuracy, even for state-of-the-art deep learning systems. 3DMMs are particularly useful for tackling these limitations, e.g., by using 3DMMs as a tool for face frontalization [Blanz et al. 2005; Hassner et al. 2015; Tena et al. 2007]. In this context, it was shown that 9 out of 10 2D algorithms in the Face Recognition Vendor Test 2002 [Phillips et al. 2003] improved considerably when combined with a 3DMM for face frontalization [Blanz et al. 2005]. Other applications of 3DMMs include augmenting real-world data in 3D [Masi et al. 2016], the generation of synthetic data for training [Kortylewski et al. 2018b; Sela et al. 2017] and the analysis of the effects of dataset bias on face recognition systems [Kortylewski et al. 2018a, 2019].
Almost all applications of 3DMMs in the context of face recognition would benefit from improvements of the parametric model as well as the fitting process. A more realistic texture model including textural details and modeling hair would enhance the quality of synthetic data, possibly further reducing the amount of real-world data needed to train data-driven models. A more accurate fitting process would enhance the model's performance on face frontalization and face recognition from 3DMM parameters.
7.2 Entertainment
3DMMs are an integral building block for many compelling applications in the entertainment sector. Such applications normally have to work in the wild and with a low number of sensors, e.g., only the images captured by a single color camera are accessible. In such underconstrained scenarios, the statistical prior that is encapsulated in the face model is a powerful tool to better constrain the underlying reconstruction problems. In the following, we discuss several entertainment applications in detail. These applications are also covered in more depth in the state-of-the-art report of Zollhoefer et al. [2018].
7.2.1 Controlling 3D Avatars for Games and VR. Realistic 3D face avatars can be reconstructed based on multi-view video [Lombardi et al. 2018], a few images [Cao et al. 2016b; Ichim et al. 2015] or given only a single image [Hu et al. 2017b; Wang et al. 2019a]. Such avatars or even artist-designed characters can be controlled in gaming scenarios based on dense trackers that employ a parametric face model. Such vision-based control was first demonstrated in an off-line setting [Chai et al. 2003; Chuang and Bregler 2002; Pighin and Lewis 2006; Wang et al. 2004; Weise et al. 2009]. Nowadays, dense facial performance capture is feasible at real-time rates based on RGB-D [Li et al. 2013; Thies et al. 2015; Weise et al. 2011b] and color [Bouaziz et al. 2013; Thies et al. 2016] cameras. Besides vision-based animation, there is extensive work on audio-based control [Cudeiro et al. 2019; Karras et al. 2017; Kshirsagar and Magnenat-Thalmann 2003; Taylor et al. 2017]. Face tracking can also be used to enable face-to-face communication [Li et al. 2015a; Lombardi et al. 2018; Olszewski et al. 2016; Thies et al. 2018c] in virtual reality.
7.2.2 Virtual Try-On and Make-Up. Face reconstruction and track-
ing based on a parametric face model can also be employed to build
virtual mirrors that enable the try-on of accessories or make-up.
To this end, rst, a personalized model of the face is recovered
and tracked across the video resulting in a dense set of correspon-
dences. These enable spatio-temporal re-texturing, e.g., to virtually
place tattoos [Garrido et al
.
2014] and can be used to add facial
make-up [Bronstein et al
.
2007] and try out dierent suggestions
[Scherbaum et al
.
2011]. Virtual make-up can be applied based on
a reectance/shading decomposition [Li et al
.
2014b, 2015b]. Sim-
ilar techniques enable the try-on of accessories, e.g., eyeglasses
[Azevedo et al. 2016; Niswar et al. 2011].
7.2.3 Face Replacement a.k.a. Face Swap. Face replacement enables the replacement of the inner face region in a target video with that from a source video. To this end, both persons are reconstructed based on the same parametric model, resulting in dense inter-person correspondences. First approaches enabled face replacement between images [Bitouk et al. 2008; Blanz et al. 2004b; Jones et al. 2008; Kemelmacher-Shlizerman 2016]. Later works extended those ideas, including skin and hair segmentation to deal with glasses and occlusion by hair [Pierrard 2008]. Other techniques focus on swapping faces between video sequences [Dale et al. 2011; Garrido et al. 2014]. Today the effect is mostly known under the term 'face swap' and has been popularized by a Snapchat (https://www.snapchat.com/) filter.
7.2.4 Face Reenactment and Visual Dubbing. Facial reenactment is the process of transferring the facial expressions from a source to a target video. First, off-line techniques were proposed [Blanz et al. 2003; Bregler et al. 1997; Kemelmacher-Shlizerman et al. 2010; Li et al. 2014a, 2012; Theobald et al. 2009; Vlasic et al. 2005b]. The first real-time facial reenactment approach [Thies et al. 2015] was based on an RGB-D sensor, see Fig. 8. Afterward, real-time techniques for reenacting standard video were also proposed [Thies et al. 2016]. Other approaches make it possible to take control of a single image [Averbuch-Elor et al. 2017; Saragih et al. 2011]. Follow-up work focused on controlling more than just the face region, e.g., the complete upper body [Thies et al. 2018b]. Nowadays, many reenactment approaches are based on deep generative models [Kim et al. 2018a; Pumarola et al. 2018].
Facial reenactment [Kim et al. 2018a; Thies et al. 2016] can also be applied to the problem of visual dubbing, i.e., the task of adapting the mouth motion of a target actor to match a new audio track. More sophisticated visual dubbing approaches [Garrido et al. 2015] directly take the new audio track into account for better audio-visual alignment. There is also some work on audio-based animation of video [Brand 1999; Suwajanakorn et al. 2017].
7.3 Medical Applications
The clinical applications of the 3DMM cover both analysis and synthesis. The dominant applications lie in analysis, where diseases can be recognized from facial shape. One example of such an effect is the classification and early diagnosis of fetal alcohol spectrum disorder [Suttie et al. 2013] or epilepsy [Ahmedt Aristizabal 2019]. Similarly, Hammond et al. [2004] demonstrated both visualization and recognition of congenital craniofacial growth disorders. Both these works used 3D data. However, the capability of 3D reconstruction from 2D images was explored for the screening of acromegaly [Learned-Miller et al. 2006] and genetic disorders [Tu et al. 2018].
In the direction of synthesis, the 3D shape model has been explored to perform reconstruction of missing face parts based on the model statistics [Basso and Vetter 2005; Mueller et al. 2011]. Such a reconstruction can be applied for personalized implant design. Another work explored the synthesis capabilities for analysis and generated controlled stimuli to study responses in the fusiform face area, correlating them with autism spectrum disorder [Jiang et al. 2013].

3DMMs and statistical shape models in general are a popular standard framework in the field of medical imaging for segmentation and as models of variation in anatomical structures [Zheng et al. 2017]. A lot of those applications deal with pathologies in young or elderly people, which are underrepresented even in the biggest face models [Ploumpis et al. 2019]. Those applications would profit from models built from a wider population or models that can better generalize beyond the data they are trained on.
7.4 Forensics
Applications in forensics range from identikit pictures over virtual aging to face reconstruction from dry skulls and, recently, also the detection of manipulated videos.

Describing faces from vague mental images is a challenging task. A tool based on a 3DMM [Blanz et al. 2006] allows exploring correlations within the face to generate identikit pictures when providing descriptions based on vague features.
Virtual aging is a challenging task and can be helpful to later nd
missing children or victims of sexual abuse. The 3DMM helps to
reduce the subjectivity of age progression methods. Several works in
this direction are modeling age trajectories on 3DMM shape [Hutton
et al
.
2003; Koudelová et al
.
2015; Shen et al
.
2014] and at least two
attempts have been made to do so for both 3DMM shape and texture
[Hunter and Tiddeman 2009; Scherbaum et al
.
2007]. Most methods
focus on children and neglect textural details or wrinkles which are
modeled in Pascal [2010]; Schneider et al. [2019].
Face reconstruction from dry skulls is an ill-posed problem. The mapping from skull to face is not one-to-one, but one-to-many. Models allow controlling attributes for this reconstruction [Paysan et al. 2009b], explicitly estimating the posterior over possible faces per skull [Madsen et al. 2018] or modeling soft tissue thickness directly, grounded by a 3DMM [Gietzen et al. 2019].
Recently, 3DMMs have been used successfully to generate or manipulate images and videos, as discussed in Section 7.2.4. At the same time, 3DMMs are also helpful to detect such manipulations produced by state-of-the-art methods with high accuracy [Rossler et al. 2019].
7.5 Cognitive Science, Neuroscience, and Psychology
The ability to generate faces that can be controlled via parameters is very popular when studying how the human and non-human primate brain processes faces. Studies with stimuli generated from a 3DMM can be found in Cognitive Science, Neuroscience, Psychology, and Social Science.
One of the earliest works using 3DMMs presented high-level aftereffects that indicate a model related to a statistical face model
in the human brain. Those aftereects were demonstrated using
caricaturized faces and antifaces [Leopold et al. 2001]. Later it was
shown that those results can not only be observed as aftereects
but also as responses of single neurons across caricaturization in
macaque monkey to principal axes of a 3DMM [Leopold et al
.
2006].
Later those aftereects were shown to incorporate 3D information
[Jiang et al
.
2009a]. The eects based on caricatures for recognition
were recently also investigated with 3DMMs in articial neural
networks trained on face recognition [Hill et al. 2018].
A topic that has been heavily researched over the past decades and is still under investigation is how much 3D shape contributes to face perception and whether the face representation in our brain is built as a 3D model. Early studies based on functional MRI and behavioral techniques evaluated a shape-based model of human face discrimination [Jiang et al. 2006]. Later studies investigated the importance of 3D shape and surface reflectance [Jiang et al. 2009b] and showed that event-related potentials to 3D shape are faster than to surface reflectance [Caharel et al. 2009]. Other work explored how well humans can estimate a profile picture from a frontal view [Schumacher and Blanz 2012].
Recently, it was shown that a face-processing system based on stepwise inverse rendering correlates better with neural measurements in the macaque monkey than state-of-the-art artificial neural networks [Yildirim et al. 2020].
Face image manipulation is another key application of the 3DMM to generate stimuli [Walker and Vetter 2009], e.g., to investigate social judgments based on facial appearance. Again, the ability to control exactly what is manipulated is key for those research results, which sometimes measure subtle effects [Walker et al. 2011]. Recently, a dataset of controlled manipulated images was released to perform such experiments [Walker et al. 2018].
One of the major limitations compared to 2D-based methods is that 3DMMs do not include hair. In a lot of studies, faces and hair are not separated since faces without hair appear less face-like. For those models, it plays a substantial role to have control over the parameters and that the parameters can be interpreted, which secures the future of 3DMMs in those fields.
8 PERSPECTIVE
In this last section, we want to look beyond the state of the art. We explicitly highlight the unsolved challenges in the field. In addition to focusing on face models, we look further and share our thoughts about the scalability of 3DMMs beyond faces. We also share our thoughts on the applicability of models, including data, model and algorithm sharing, together with their potential for misuse. We close with an outlook on what a 3DMM could look like in 10 or 20 years.
8.1 Global Challenges
In this section, we summarize the major open challenges that are shared across the different parts of 3DMMs. Local challenges that are specific to capturing, modeling, image formation or analysis-by-synthesis are mentioned in the respective sections.

One of the leading challenges is the balance between a low-dimensional parametric model and the degree of detail we are capable of modeling. Parametric models for eyes, teeth, hair, skin details, soft tissue or even anatomically grounded muscles are not available. Additional complexity also renders analysis-by-synthesis even more challenging. Building faces with all those details is currently possible for a single face with a lot of manual labor, but automatic methods to extract those details or build models on top of them are in their beginnings. Current state-of-the-art methods, from capture to modeling, over image formation to analysis-by-synthesis, use a lot of oversimplifying assumptions. Besides including more facial details, there are also models that exploit the knowledge that a face is part of the body. Whilst faces and bodies are mostly analyzed separately, there exist first models that include faces and bodies jointly [Joo et al. 2018; Pavlakos et al. 2019]. Pavlakos et al. [2019] presented first results indicating that fitting the whole body is also beneficial for the quality measured in the face region only.
Another major challenge is the comparability of all the components of a 3DMM. Already the modeling itself can only be evaluated on specific tasks, and different models have a different focus and might perform better on a specific task. For analysis-by-synthesis, comparing the performance of a model and also of the model adaptation algorithm is an unsolved problem. Current state-of-the-art research frequently focuses on task-specific qualitative results, and those results can barely be compared across models and algorithms. The current trend in the community to share source code and models helps to compare and reproduce results; however, there is a lack of useful benchmarks. A first step in this direction is a new dataset providing natural images in combination with a 3D scan of the same individual [Sanyal et al. 2019]. However, this is focused on shape reconstruction only; there is no single benchmark for 3D reconstruction from 2D images including illumination and albedo estimation.
The last challenges are of an ethical nature. Concerns around image analysis and synthesis, especially of faces, are currently discussed within the scientific community as well as in the media and the broad public. The current algorithmic development in computer vision and graphics allows recognizing faces and generating or manipulating images and video. In addition, most methods around 3DMMs elicit some dataset bias. Saito et al. [2017] approached this by using the Chicago Face Database [Ma et al. 2015] to build a face model with balanced ethnicities. Those challenges are not purely scientific, but also political. We start to see regulations of those technologies, and there will likely be more regulations across the world in the near future. As a community, we can choose what projects we focus our work on, and there are plenty of meaningful and valuable applications of 3DMMs, face analysis, and face synthesis, as presented in Section 7, which could be explored less under restrictive regulations.
8.2 Scalability
Research on parametric models of human faces has seen a lot of progress in recent years. This raises the question of how scalable the found solutions are to other types of real-world entities beyond humans. On the one hand, human faces are highly challenging, as we are attuned to noticing even the slightest inaccuracies in their modeling. At the same time, they are also more amenable to statistical modeling as their structure is relatively regular and correspondence
across faces is quite well-dened. Other types of real-world entities,
or even humans in clothing or the human head with full hair, are
exhibiting much stronger appearance, structure, and shape variation
that may require additional methodical innovations to empower
proper modeling. The vision and graphics communities have begun
to build and learn statistical models of other types of shape cate-
gories. Researchers also increasingly attempt to learn such models in
an unsupervised or weakly supervised way for better real-world scal-
ability. These approaches partially build on many concepts learned
from the models described in this article but introduce additional
representation innovations, like learned implicit representations
[Cole et al
.
2017; Eslami et al
.
2018; Sitzmann et al
.
2019], to handle
their specic structural properties. Future research will certainly
see more work in this direction that answers the question of what
is the right shape, appearance and deformation representations for
a wider range of real-world object classes.
8.3 Application
An additional challenge for our research community will be to agree
on ecient ways to share and combine research eorts performed
by dierent research groups. We should agree on common data
formats and dissemination channels for available scan databases,
which would simplify building integrated models, and enable us to
better test and compare them. In that context, ever more pressing
questions of privacy and security will also need to be addressed.
On the one hand, it is needless to say that we have to adhere to the highest standards of privacy protection in the data sets we share, so as not to reveal personal data or identities beyond what is needed and permitted by law or by the captured individuals. For handling this,
community-wide procedures for providing consent on the use of
data that are compatible with legal regulations could be agreed on
and shared.
However, beyond this, increasingly powerful methods to build
and reconstruct such face models from image and video will in
the future enable us to build highly believable 3D human avatars
from casually captured imagery. These avatars will enable us to
create virtual renditions of real people with unprecedented accuracy to populate computer-generated virtual spaces at high visual fidelity. However, algorithmic tools that prevent the reconstruction or use of such avatars in undesired or questionable applications to which the reconstructed person did not consent should be investigated as well. Advanced reconstruction algorithms on the basis of parametric models may also make it possible to extract semantic information about people from imagery that they may not want to reveal (e.g., about their emotional state, health, or physical condition). Therefore, algorithmic strategies to balance personal privacy
and reconstruction ability shall be investigated and provided by our
research community.
Also, the continuously improving performance of algorithms to
reconstruct detailed human models from single images or videos
enables advanced new ways to synthesize new face imagery or even
modify existing face images and videos at very high visual fidelity. As an example, some recent combinations of model-based reconstruction algorithms and adversarially trained neural networks have shown impressive results in that respect. Such advanced synthesis algorithms will simplify many existing applications and open up entirely new ones, for instance in content creation for animation and visual effects, in content creation for virtual and augmented reality, in telepresence, visual dubbing, or advanced video editing. However,
they might also be used to create or modify media content with
malicious intent. Therefore, as a community focusing on basic research, we will continue our efforts to objectively inform the general
public about the great possibilities opened up by advanced paramet-
ric models of face, body and other real-world entities to build the
next generation of intelligent, interactive and creative computing
systems. At the same time, we will use our essential basic expertise
about the underlying algorithmic principles to develop new ways to detect unwanted media synthesis and modification and to prevent such unwanted modifications algorithmically.
8.4 Outlook
The big question we ask is: what will a generative face model look like in 10 or 20 years? What will be the representation, and will it be a complete model of the human face with all its variation and details? Currently, we are experiencing a divergence of 3DMMs. Different research teams set a different focus and model some parts in more detail while lacking other details or statistical variation. Recent modeling advances focus on building task-specific representations rather than a more general face model applicable to multiple tasks. For some applications, the model itself is the limiting factor, whilst other applications profit from a simple model based on PCA. The requirements in terms of quality, realism, generalization, and performance are very different, e.g., for content creation vs. computer vision. The gap between state-of-the-art computer-generated renderings of a single face, including expressions, and generative, parametric face models based on statistics is dramatic.
Current advances in the field of machine learning will contribute to building more general and, at the same time, more realistic models. The core of the face model was always interpreted as a learning problem; recent advances lifted the analysis-by-synthesis task from a per-image optimization task to a learning challenge. However, this loop is not yet closed: why not learn or improve the model itself? There are already first works in the direction of model learning (compare Section 6.3), but they are limited by modeling assumptions very similar to those of traditional 3DMMs. First steps to overcome those were recently taken in the directions of neural rendering [Eslami et al. 2018; Thies et al. 2019], 3D representation learning [Sitzmann et al. 2019], and unsupervised shape model learning [Szabó et al. 2019]. Other modeling approaches such as generative adversarial networks [Goodfellow et al. 2014; Karras et al. 2019b,a] currently operate in 2D image space. Such parametric models can be used to embed faces of real people in a latent space [Abdal et al. 2019a,b], but the resulting embedding is hard for humans to interpret.
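As a rough illustration of such an embedding, the following sketch, loosely in the spirit of Image2StyleGAN [Abdal et al. 2019b], optimizes a latent code of a pretrained generator so that the generated image matches a target photograph. The generator interface, the plain pixel loss, and all hyperparameters are illustrative assumptions; published methods typically add perceptual losses and more careful initialization.

import torch
import torch.nn.functional as F

def embed_image(generator, target, latent_dim=512, steps=500, lr=0.01):
    """Embed a target photograph into a pretrained generator's latent space.

    `generator` is a placeholder for any differentiable image generator G(w);
    `target` is the photograph as a tensor with the same shape as G(w).
    """
    w = torch.zeros(1, latent_dim, requires_grad=True)   # latent code to optimize
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(generator(w), target)          # pixel-wise reconstruction loss
        loss.backward()
        opt.step()
    return w.detach()  # individual latent dimensions carry no obvious semantics

The recovered code can reproduce the input image well, but, unlike 3DMM coefficients, its dimensions do not correspond to pose, illumination, or identity in any interpretable way.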
20 years ago, 3DMMs were part of a revolution in computer graphics and computer vision to move away from 2D image processing towards 3D modeling. The computer vision community is currently focusing again on mainly 2D-based approaches, and we have to propose the missing key to move the community to 3D once more. Additionally, one of the leading benefits of 3DMMs is the natural disentanglement of shape, color, illumination, and camera parameters. Such a disentanglement is very hard to derive purely from data [Locatello et al. 2019]; for faces, 3DMMs build it manually based on the image formation process. According to "Pattern Theory" [Grenander 1996; Mumford and Desolneux 2010], it is a prerequisite for any high-performance image analysis system to find and separate conditionally independent parameters that describe the image to analyze. The discovery and separation of such parameters purely from 2D data is still an unsolved challenge. 3DMMs directly implement models using the parameters also used in physics and geometry to model light and three-dimensional objects.
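The following minimal sketch illustrates this built-in separation of parameters as described throughout this survey: face statistics enter only through linear shape and albedo models, while illumination and camera pose enter only through the physically motivated rendering step. All function and argument names are placeholders for illustration, not a specific published API.

import numpy as np

def synthesize_face_image(shape_coeffs, albedo_coeffs, illumination, pose,
                          shape_mean, shape_basis, albedo_mean, albedo_basis, render):
    """Schematic 3DMM image formation with explicitly separated parameter groups.

    shape_basis / albedo_basis are assumed to be PCA bases (3N x K NumPy arrays),
    illumination e.g. spherical-harmonics coefficients, pose a rigid camera
    transform, and render any (ideally differentiable) renderer.
    """
    # Statistical face model: mean plus a weighted combination of basis vectors.
    vertices = shape_mean + shape_basis @ shape_coeffs    # 3D geometry, length 3N
    albedo = albedo_mean + albedo_basis @ albedo_coeffs   # per-vertex color, length 3N

    # Illumination and camera parameters only affect the image formation step,
    # keeping them disentangled from the face-specific statistics.
    return render(np.reshape(vertices, (-1, 3)), np.reshape(albedo, (-1, 3)),
                  illumination, pose)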
One direction which might be particularly interesting is to break
out of the common modeling assumptions and oversimplification
but at the same time automate the tedious manual work behind the
photo-realistic generation of faces. We expect some kind of living
3DMM to evolve from the community. Automation will be the lead-
ing modeling idea. A living 3DMM should be able to learn from 3D
data as well as 2D data, both still and in motion. We imagine the
model to be learned from a minimal seed like a mean face, a sphere
or just a rough prototype based on the first few data points. The optimal living model would not be task-specific but should be able
to generalize to various tasks. The face model must, therefore, be
hierarchical in some form to represent multiple degrees of detail
but share statistics across those levels. Such an optimal face model
would be general enough to be applicable to real-time computer vision tasks, analysis-by-synthesis from currently challenging images, as well as photorealistic rendering with a high level of facial detail.
Last but not least, some tasks rely on an interpretable parametrization and not just a black-box learning machine. Basic knowledge
of geometry and physics would not only ease the learning but also
at least disentangle pose and illumination variation from the facial
shape and appearance. Building such a general face model might
remain a challenge for the next 10 or 20 years but would align with
the original idea behind 3DMMs.
ACKNOWLEDGMENTS
This survey paper was initiated at the Dagstuhl Seminar 19102 on 3D Morphable Models [Egger et al. 2019] and contains ideas resulting from discussions at this seminar. This survey paper was partially funded by an Early Postdoc Mobility Grant of the Swiss National Science Foundation (P2BSP2_178643), the ERC Consolidator Grant 4DRepLy, and the Max Planck Center for Visual Computing and Communications (MPC-VCC). We thank Barış Geçer for his help on the teaser figure,
and Haiwen Feng for providing the FLAME texture space. We thank
the anonymous reviewers whose comments have greatly improved
this manuscript.
REFERENCES
2005. CASIA-3D FaceV1. (2005). http://biometrics.idealtest.org/
Rameen Abdal, Yipeng Qin, and Peter Wonka. 2019a. Image2StyleGAN++: How to Edit
the Embedded Images?
Rameen Abdal, Yipeng Qin, and Peter Wonka. 2019b. Image2StyleGAN: How to Em-
bed Images Into the StyleGAN Latent Space?. In Proc. International Conference on
Computer Vision (ICCV).
Jascha Achenbach, Robert Brylka, Thomas Gietzen, Katja zum Hebel, Elmar Schömer,
Ralf Schulze, Mario Botsch, and Ulrich Schwanecke. 2018. A multilinear model for
bidirectional craniofacial reconstruction. In Proc. Eurographics Workshops. Euro-
graphics Association, 67–76.
Jens Ackermann, Michael Goesele, et al. 2015. A survey of photometric stereo techniques. Foundations and Trends in Computer Graphics and Vision 9, 3-4 (2015), 149–254.
David Esteban Ahmedt Aristizabal. 2019. Multi-modal analysis for the automatic evalu-
ation of epilepsy. Ph.D. Dissertation. Queensland University of Technology.
Taleb Alashkar, Boulbaba Ben Amor, Mohamed Daoudi, and Stefano Berretti. 2014. A
3D Dynamic Database for Unconstrained Face Recognition. In Proc. International
Conference and Exhibition on 3D Body Scanning Technologies.
Oswald Aldrian and WA Smith. 2010. A linear approach of 3d face shape and texture
recovery using a 3d morphable model. In Proc. British Machine Vision Conference
(BMVC).
Oswald Aldrian and William AP Smith. 2011a. Inverse rendering in suv space with a
linear texture model. In Proc. International Conference on Computer Vision (ICCV)
Workshops. IEEE, 822–829.
Oswald Aldrian and William AP Smith. 2012. Inverse rendering of faces on a cloudy
day. In Proc. European Conference on Computer Vision (ECCV). 201–214.
Oswald Aldrian and William AP Smith. 2013. Inverse rendering of faces with a 3D
morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence
35, 5 (2013), 1080–1093.
Oswald Aldrian and William A. P. Smith. 2011b. Inverse Rendering with a Morphable
Model: A Multilinear Approach. In Proc. British Machine Vision Conference (BMVC).
Brett Allen, Brian Curless, and Zoran Popović. 2003. The space of
human body shapes: reconstruction and parameterization from range scans. In ACM
Transactions on Graphics, Vol. 22. ACM, 587–594.
Sarah Alotaibi and William AP Smith. 2017. A Biophysical 3D Morphable Model of Face
Appearance. In Proc. International Conference on Computer Vision (ICCV) Workshops.
IEEE, 824–832.
Brian Amberg, Andrew Blake, Andrew Fitzgibbon, Sami Romdhani, and Thomas Vetter.
2007. Reconstructing high quality face-surfaces using model based stereo. In Proc.
International Conference on Computer Vision (ICCV). IEEE, 1–8.
Brian Amberg, Reinhard Knothe, and Thomas Vetter. 2008. Expression invariant 3D face
recognition with a morphable model. In Proc. International Conference on Automatic
Face and Gesture Recognition. IEEE, 1–6.
Brian Amberg and Thomas Vetter. 2011. Optimal landmark detection using shape
models and branch and bound. In Proc. International Conference on Computer Vision
(ICCV). IEEE, 455–462.
Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers,
and James Davis. 2005. SCAPE: shape completion and animation of people. In ACM
Transactions on Graphics (Proceedings of SIGGRAPH). 408–416.
Joseph J Atick, Paul A Griffin, and A Norman Redlich. 1996. Statistical approach to
shape from shading: Reconstruction of three-dimensional face surfaces from single
two-dimensional images. Neural computation 8, 6 (1996), 1321–1340.
Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F. Cohen. 2017.
Bringing Portraits to Life. ACM Transactions on Graphics 36, 6 (2017), 196:1–196:13.
Pedro Azevedo, Thiago Oliveira-Santos, and Edilson De Aguiar. 2016. An Augmented
Reality Virtual Glasses Try-On System. In Symposium on Virtual Reality. 1–9. https:
//doi.org/10.1109/SVR.2016.12
Timur Bagautdinov, Chenglei Wu, Jason Saragih, Pascal Fua, and Yaser Sheikh. 2018.
Modeling Facial Geometry Using Compositional VAEs. In Proc. IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
Andrew D. Bagdanov, Alberto Del Bimbo, and Iacopo Masi. 2011. The Florence 2D/3D
Hybrid Face Dataset. In Joint ACM Workshop on Human Gesture and Behavior
Understanding (J-HGBU '11). ACM, New York, NY, USA, 79–80. https://doi.org/10.1145/2072572.2072597
Anil Bas, Patrik Huber, William AP Smith, Muhammad Awais, and Josef Kittler. 2017a.
3D Morphable Models as Spatial Transformer Networks. In Proc. International Con-
ference on Computer Vision (ICCV) Workshops. IEEE, 895–903.
Anil Bas and William A. P. Smith. 2019. What does 2D geometric information really
tell us about 3D face shape? International Journal of Computer Vision 127 (2019).
Anil Bas, William A. P. Smith, Timo Bolkart, and Stefanie Wuhrer. 2017b. Fitting a 3D
Morphable Model to Edges: A Comparison Between Hard and Soft Correspondences.
In Asian Conference on Computer Vision Workshops, Chu-Song Chen, Jiwen Lu, and
Kai-Kuang Ma (Eds.). Springer International Publishing, Cham, 377–391.
Curzio Basso and Alessandro Verri. 2007. Fitting 3D morphable models using implicit
representations. In Proc. International Joint Conference on Computer Vision, Imaging
and Computer Graphics Theory and Applications. (VISIGRAPP). 45–52.
Curzio Basso and Thomas Vetter. 2005. Statistically motivated 3D faces reconstruc-
tion. In Proc. International Conference on Reconstruction of Soft Facial Parts, Vol. 31.
Citeseer.
Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. 2010. High-
quality single-shot capture of facial geometry. In ACM Transactions on Graphics
(Proceedings of SIGGRAPH), Vol. 29. 40.
Thabo Beeler, Bernd Bickel, Gioacchino Noris, Paul Beardsley, Steve Marschner,
Robert W Sumner, and Markus Gross. 2012. Coupled 3D reconstruction of sparse
facial hair and skin. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 31, 4
(2012), 117.
Thabo Beeler and Derek Bradley. 2014. Rigid stabilization of facial expressions. ACM
Transactions on Graphics 33, 4 (2014), 44.
Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman,
Robert W. Sumner, and Markus Gross. 2011. High-quality Passive Facial Performance
Capture Using Anchor Frames. In ACM Transactions on Graphics (Proceedings of
SIGGRAPH). ACM, New York, NY, USA, Article 75, 10 pages. https://doi.org/10.
1145/1964921.1964970
Pascal Bérard, Derek Bradley, Maurizio Nitti, Thabo Beeler, and Markus Gross. 2014.
High-quality capture of eyes. ACM Transactions on Graphics (Proceedings of SIG-
GRAPH Asia) 33, 6 (2014), 223.
Amit Bermano, Thabo Beeler, Yeara Kozlov, Derek Bradley, Bernd Bickel, and Markus
Gross. 2015. Detailed spatio-temporal reconstruction of eyelids. ACM Transactions
on Graphics (Proceedings of SIGGRAPH) 34, 4 (2015), 44.
Stefano Berretti, Boulbaba Ben Amor, Mohamed Daoudi, and Alberto del Bimbo. 2011.
3D facial expression recognition using SIFT descriptors of automatically detected
keypoints. The Visual Computer 27, 11 (2011), 1021–1036.
Dmitri Bitouk, Neeraj Kumar, Samreen Dhillon, Peter Belhumeur, and Shree K. Na-
yar. 2008. Face Swapping: Automatically Replacing Faces in Photographs. ACM
Transactions on Graphics 27, 3 (2008), 39:1–39:8.
Volker Blanz, Irene Albrecht, Jörg Haber, and H-P Seidel. 2006. Creating face models
from vague mental images. In Computer Graphics Forum, Vol. 25. Wiley Online
Library, 645–654.
Volker Blanz, Curzio Basso, Tomaso Poggio, and Thomas Vetter. 2003. Reanimating
faces in images and video. In Computer Graphics Forum, Vol. 22. Wiley Online
Library, 641–650.
Volker Blanz, Patrick Grother, P Jonathon Phillips, and Thomas Vetter. 2005. Face
recognition based on frontal views generated from non-frontal images. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2. IEEE,
454–461.
Volker Blanz, Albert Mehl, Thomas Vetter, and Hans-Peter Seidel. 2004a. A Statistical
Method for Robust 3D Surface Reconstruction from Sparse Data. In Proc. 3D Data
Processing Visualization and Transmission. 293–300.
Volker Blanz, Sami Romdhani, and Thomas Vetter. 2002. Face identification across different poses and illuminations with a 3d morphable model. In Proc. International
Conference on Automatic Face and Gesture Recognition. IEEE, 202–207.
Volker Blanz, Kristina Scherbaum, Thomas Vetter, and Hans-Peter Seidel. 2004b. Ex-
changing Faces in Images. Computer Graphics Forum 23, 3 (2004), 669–676.
Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D
faces. In ACM Transactions on Graphics (Proceedings of SIGGRAPH). 187–194.
Volker Blanz and Thomas Vetter. 2003. Face recognition based on fitting a 3d morphable
model. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 9 (2003),
1063–1074.
Timo Bolkart and Stefanie Wuhrer. 2015a. 3D Faces in Motion: Fully Automatic Reg-
istration and Statistical Analysis. Computer Vision and Image Understanding 131
(2015), 100–115.
Timo Bolkart and Stefanie Wuhrer. 2015b. A Groupwise Multilinear Correspondence
Optimization for 3D Faces. In Proc. International Conference on Computer Vision
(ICCV). 3604–3612.
Timo Bolkart and Stefanie Wuhrer. 2016. A Robust Multilinear Model Learning Frame-
work for 3D Faces. In Proc. IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR). 4911–4919.
James Booth, Epameinondas Antonakos, Stylianos Ploumpis, George Trigeorgis, Yannis
Panagakis, and Stefanos Zafeiriou. 2017. 3D face morphable models "In-The-Wild".
In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE,
5464–5473.
James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos
Zafeiriou. 2018a. Large scale 3D morphable models. International Journal of Com-
puter Vision 126, 2-4 (2018), 233–254.
James Booth, Anastasios Roussos, Evangelos Ververas, Epameinondas Antonakos,
Stylianos Ploumpis, Yannis Panagakis, and Stefanos Zafeiriou. 2018b. 3D Recon-
struction of In-the-Wild Faces in Images and Videos. IEEE Transactions on Pattern
Analysis and Machine Intelligence 40, 11 (2018), 2638–2652.
James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniahy, and David Dun-
away. 2016. A 3D Morphable Model Learnt from 10,000 Faces. In Proc. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR). 5543–5552.
James Booth and Stefanos Zafeiriou. 2014. Optimal uv spaces for facial morphable
model construction. In Proc. IEEE International Conference on Image Processing.
Soen Bouaziz, Yangang Wang, and Mark Pauly. 2013. Online Modeling for Realtime
Facial Animation. ACM Transactions on Graphics 32, 4 (2013), 40:1–40:10.
Derek Bradley, Wolfgang Heidrich, Tiberiu Popa, and Alla Sheffer. 2010. High resolution
passive facial performance capture. In ACM Transactions on Graphics, Vol. 29. ACM,
41.
Matthew Brand. 1999. Voice Puppetry. In ACM Transactions on Graphics. ACM
Press/Addison-Wesley Publishing Co., New York, NY, USA, 21–28.
Christoph Bregler, Michele Covell, and Malcolm Slaney. 1997. Video Rewrite: Driving
Visual Speech with Audio. In ACM Transactions on Graphics (Proceedings of SIG-
GRAPH). ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 353–360.
Pia Breuer, Kwang-In Kim, Wolf Kienzle, Bernhard Scholkopf, and Volker Blanz. 2008.
Automatic 3D face reconstruction from single images or video. In Proc. International
Conference on Automatic Face and Gesture Recognition. IEEE, 1–8.
Alexander M. Bronstein, Michael M. Bronstein, and Ron Kimmel. 2007. Calculus of non-
rigid surfaces for geometry and texture manipulation. Transactions on Visualization
and Computer Graphics 13, 5 (2007), 902–913.
Alan Brunton, Timo Bolkart, and Stefanie Wuhrer. 2014a. Multilinear wavelets: A
statistical shape space for human faces. In Proc. European Conference on Computer
Vision (ECCV). 297–312.
Alan Brunton, Augusto Salazar, Timo Bolkart, and Stefanie Wuhrer. 2014b. Review
of statistical shape spaces for 3D data with comparative analysis for human faces.
Computer Vision and Image Understanding 128, 0 (2014), 1 – 17.
Alan Brunton, Chang Shu, Jochen Lang, and Eric Dubois. 2011. Wavelet Model-based
Stereo for Fast, Robust Face Reconstruction. In Proc. Canadian Conference on Com-
puter and Robot Vision.
Adrian Bulat and Georgios Tzimiropoulos. 2017. How Far Are We From Solving the 2D
& 3D Face Alignment Problem? (And a Dataset of 230,000 3D Facial Landmarks). In
Proc. International Conference on Computer Vision (ICCV).
Stéphanie Caharel, Fang Jiang, Volker Blanz, and Bruno Rossion. 2009. Recognizing an
individual face: 3D shape contributes earlier than 2D surface reflectance information.
Neuroimage 47, 4 (2009), 1809–1818.
Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. 2015. Real-time High-fidelity
Facial Performance Capture. ACM Transactions on Graphics 34, 4, Article 46 (July
2015), 9 pages. https://doi.org/10.1145/2766943
Chen Cao, Qiming Hou, and Kun Zhou. 2014a. Displaced Dynamic Expression Regres-
sion for Real-time Facial Tracking and Animation. ACM Transactions on Graphics
33, 4, Article 43 (July 2014), 10 pages. https://doi.org/10.1145/2601097.2601204
Chen Cao, Yanlin Weng, Stephen Lin, and Kun Zhou. 2013. 3D Shape Regression for
Real-time Facial Animation. ACM Transactions on Graphics 32, 4, Article 41 (July
2013), 10 pages. https://doi.org/10.1145/2461912.2462012
Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. 2014b. FaceWarehouse:
A 3D facial expression database for visual computing. Transactions on Visualization
and Computer Graphics 20, 3 (2014), 413–425.
Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016a. Real-time
Facial Animation with Image-based Dynamic Avatars. ACM Transactions on Graphics
35, 4, Article 126 (July 2016), 12 pages. https://doi.org/10.1145/2897824.2925873
Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016b. Real-time
Facial Animation with Image-based Dynamic Avatars. ACM Transactions on Graphics
35, 4 (2016), 126:1–126:12.
Thomas J Cashman and Andrew W Fitzgibbon. 2012. What shape are dolphins? building
3d morphable models from 2d images. IEEE Transactions on Pattern Analysis and
Machine Intelligence 35, 1 (2012), 232–244.
Jin-xiang Chai, Jing Xiao, and Jessica Hodgins. 2003. Vision-based Control of 3D
Facial Animation. In Proc. ACM SIGGRAPH / Eurographics Symposium on Computer
Animation (SCA). Eurographics Association, Aire-la-Ville, Switzerland, Switzerland,
193–206.
Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, and Ger-
ard Medioni. 2018. ExpNet: Landmark-free, deep, 3D facial expressions. In Proc.
International Conference on Automatic Face and Gesture Recognition. IEEE, 122–129.
Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, and Gerard
Medioni. 2017. Faceposenet: Making a case for landmark-free face alignment. In
Proceedings of the IEEE International Conference on Computer Vision. 1599–1608.
Anpei Chen, Zhang Chen, Guli Zhang, Ziheng Zhang, Kenny Mitchell, and Jingyi Yu.
2019. Photo-Realistic Facial Details Synthesis from Single Image. Proc. International
Conference on Computer Vision (ICCV) (2019).
Shiyang Cheng, Michael Bronstein, Yuxiang Zhou, Irene Kotsia, Maja Pantic, and
Stefanos Zafeiriou. 2019. MeshGAN: Non-linear 3D Morphable Models of Faces.
arXiv preprint arXiv:1903.10384 (2019).
Shiyang Cheng, Irene Kotsia, Maja Pantic, and Stefanos Zafeiriou. 2018. 4DFAB: A
Large Scale 4D Database for Facial Expression Analysis and Biometric Applications.
In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Erika Chuang and Chris Bregler. 2002. Performance-driven Facial Animation using Blend
Shape Interpolation. Technical Report CS-TR-2002-02. Stanford University.
Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, and
William T Freeman. 2017. Synthesizing normalized faces from facial identity fea-
tures. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
3703–3712.
3DMM Community. 2019. Curated List of 3D Morphable Model
Software and Data. https://github.com/3d-morphable-models/
curated-list-of-awesome-3D-Morphable-Model-software-and-data.
Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. 1998. Active Appear-
ance Models. In Proc. European Conference on Computer Vision (ECCV).
Timothy F. Cootes, Christopher J. Taylor, David H. Cooper, and Jim Graham. 1995.
Active Shape Models-Their Training and Application. Computer Vision and Image
Understanding 61, 1 (1995), 38 – 59. https://doi.org/10.1006/cviu.1995.1004
Darren Cosker, Eva Krumhuber, and Adrian Hilton. 2011. A FACS valid 3D dynamic
action unit database with applications to 3D dynamic morphable facial modeling.
In Proc. International Conference on Computer Vision (ICCV). 2296–2303.
Ian Craw and Peter Cameron. 1991. Parameterising images for recognition and recon-
struction. In Proc. British Machine Vision Conference (BMVC). Springer, 367–370.
Clement Creusot, Nick Pears, and Jim Austin. 2013. A Machine-Learning Approach
to Keypoint Detection and Landmarking on 3D Meshes. International Journal of
Computer Vision 102, 1-3 (2013), 146–179.
Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J. Black.
2019. Capture, Learning, and Synthesis of 3D Speaking Styles. In Proc. IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR).
Hang Dai, Nick Pears, and William Smith. 2018. A Data-augmented 3D Morphable
Model of the Ear. In Proc. International Conference on Automatic Face and Gesture
Recognition. IEEE, 404–408.
Hang Dai, Nick Pears, William A. P. Smith, and Christian Duncan. 2017. A 3D Morphable
Model of Craniofacial Shape and Texture Variation. In Proc. International Conference
on Computer Vision (ICCV).
Kevin Dale, Kalyan Sunkavalli, Micah K. Johnson, Daniel Vlasic, Wojciech Matusik, and
Hanspeter Pster. 2011. Video Face Replacement. ACM Transactions on Graphics 30,
6 (2011), 130:1–130:10.
Michael De Smet, Rik Fransens, and Luc Van Gool. 2006. A generalized EM approach
for 3D model based face recognition under occlusions. In Proc. IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Vol. 2. IEEE, 1423–1430.
Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin,
and Mark Sagar. 2000. Acquiring the reflectance field of a human face. In Proc.
Conference on Computer graphics and Interactive Techniques. ACM Press/Addison-
Wesley Publishing Co., 145–156.
Jiankang Deng, Shiyang Cheng, Niannan Xue, Yuxiang Zhou, and Stefanos Zafeiriou.
2018. Uv-gan: Adversarial facial uv map completion for pose-invariant face recogni-
tion. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
7093–7102.
Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. 2019. Accu-
rate 3D Face Reconstruction with Weakly-Supervised Learning: From Single Image
to Image Set. In Proc. IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops.
Arnaud Dessein, William A.P. Smith, Richard C. Wilson, and Edwin R. Hancock. 2015.
Example-Based Modeling of Facial Texture from Deficient Data. In Proc. International
Conference on Computer Vision (ICCV). 3898–3906.
Roman Dovgard and Ronen Basri. 2004. Statistical symmetric shape from shading for
3D structure recovery of faces. In Proc. European Conference on Computer Vision
(ECCV). Springer, 99–113.
Ian Dryden and Kanti Mardia. 2002. Statistical Shape Analysis. Wiley.
Chi Nhan Duong, Khoa Luu, Kha Gia Quach, and Tien D. Bui. 2019. Deep Appearance
Models: A Deep Boltzmann Machine Approach for Face Modeling. International
Journal of Computer Vision 127, 5 (2019), 437–455.
Jose I Echevarria, Derek Bradley, Diego Gutierrez, and Thabo Beeler. 2014. Capturing
and stylizing hair for 3D fabrication. ACM Transactions on Graphics (Proceedings of
SIGGRAPH) 33, 4 (2014), 125.
Bernhard Egger. 2018. Semantic Morphable Models. Ph.D. Dissertation. University of
Basel.
Bernhard Egger, Dinu Kaufmann, Sandro Schönborn, Volker Roth, and Thomas Vetter.
2016a. Copula eigenfaces. In Proc. International Joint Conference on Computer Vision,
Imaging and Computer Graphics Theory and Applications. (GRAPP). 50–58.
Bernhard Egger, Dinu Kaufmann, Sandro Schönborn, Volker Roth, and Thomas Vetter.
2016b. Copula Eigenfaces with Attributes: Semiparametric Principal Component
Analysis for a Combined Color, Shape and Attribute Model. In Communications in
Computer and Information Science. Springer, 95–112.
Bernhard Egger, Sandro Schönborn, Andreas Schneider, Adam Kortylewski, Andreas
Morel-Forster, Clemens Blumer, and Thomas Vetter. 2018. Occlusion-Aware 3D
Morphable Models and an Illumination Prior for Face Image Analysis. International
Journal of Computer Vision 126, 12 (01 Dec 2018), 1269–1287. https://doi.org/10.
1007/s11263-018-1064-8
Bernhard Egger, William Smith, Christian Theobalt, and Thomas Vetter. 2019. 3D
Morphable Models (Dagstuhl Seminar 19102). Dagstuhl Reports 9, 3 (2019), 16–38.
https://doi.org/10.4230/DagRep.9.3.16
SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos,
Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor,
et al. 2018. Neural scene representation and rendering. Science 360, 6394 (2018), 1204–1210.
Tony Ezzat, Gadi Geiger, and Tomaso Poggio. 2002. Trainable Videorealistic Speech
Animation. In ACM Transactions on Graphics (Proceedings of SIGGRAPH). ACM, New
York, NY, USA, 388–398. https://doi.org/10.1145/566570.566594
Gabrielle Fanelli, Jürgen Gall, Harald Romsdorfer, Thibaut Weise, and Luc van Gool.
2010. A 3D Audio-Visual Corpus of Affective Communication. IEEE MultiMedia 12, 6 (2010), 591–598.
Tianhong Fang, Xi Zhao, Omar Ocegueda, Shishir K. Shah, and Ioannis A. Kakadiaris.
2012. 3D/4D facial expression analysis: An advanced annotated face model approach.
Image and Vision Computing 30, 10 (2012), 738–749.
Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. 2018. Joint 3D Face
Reconstruction and Dense Alignment with Position Map Regression Network. In
Proc. European Conference on Computer Vision (ECCV).
Victoria Fernández Abrevaya, Adnane Boukhayma, Stefanie Wuhrer, and Edmond Boyer.
2019. A Generative 3D Facial Model by Adversarial Training. In Proc. International
Conference on Computer Vision (ICCV).
Victoria Fernández Abrevaya, Stefanie Wuhrer, and Edmond Boyer. 2018. Multilin-
ear Autoencoder for 3D Face Model Learning. In Proc. IEEE Winter Conference on
Applications of Computer Vision (WACV).
Victoria Fernández Abrevaya, Stefanie Wuhrer, and Edmond Boyer. 2018. Spatiotempo-
ral Modeling for Efficient Registration of Dynamic 3D Faces. In Proc. IEEE Interna-
tional Conference on 3D Vision (3DV). 371–380.
Claudio Ferrari, Giuseppe Lisanti, Stefano Berretti, and Alberto Del Bimbo. 2015. Dictio-
nary Learning based 3D Morphable Model Construction for Face Recognition with
Varying Expression and Pose. In Proc. IEEE International Conference on 3D Vision
(3DV). 509–517.
Rik Fransens, Christoph Strecha, and Luc Van Gool. 2005. Parametric stereo for multi-
pose face recognition and 3D-face modeling. In Proc. International Conference on
Automatic Face and Gesture Recognition. Springer, 109–124.
Ohad Fried, Eli Shechtman, Dan B Goldman, and Adam Finkelstein. 2016. Perspective-
aware manipulation of portrait photos. ACM Transactions on Graphics 35, 4 (2016),
128.
Yasutaka Furukawa and Jean Ponce. 2009. Dense 3D motion capture for human faces. In
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1674–1681.
Pablo Garrido, Levi Valgaerts, Ole Rehmsen, Thorsten Thormaehlen, Patrick Pérez,
and Christian Theobalt. 2014. Automatic Face Reenactment. In Proc. IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society,
Washington, DC, USA, 4217–4224.
Pablo Garrido, Levi Valgaerts, Hamid Sarmadi, Ingmar Steiner, Kiran Varanasi, Patrick
Pérez, and Christian Theobalt. 2015. VDub: Modifying Face Video of Actors for
Plausible Visual Alignment to a Dubbed Audio Track. Computer Graphics Forum
34, 2 (2015), 193–204.
Pablo Garrido, Levi Valgaerts, Chenglei Wu, and Christian Theobalt. 2013. Reconstruct-
ing detailed dynamic face geometry from monocular video. ACM Transactions on
Graphics 32, 6 (2013), 158–1.
Pablo Garrido, Michael Zollhoefer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick
Pérez, and Christian Theobalt. 2016a. Reconstruction of personalized 3D face rigs
from monocular video. ACM Transactions on Graphics 35, 3 (2016), 28.
Pablo Garrido, Michael Zollhoefer, Chenglei Wu, Derek Bradley, Patrick Pérez, Thabo
Beeler, and Christian Theobalt. 2016b. Corrective 3D reconstruction of lips from
monocular video. ACM Transactions on Graphics 35, 6 (2016), 219–1.
Baris Gecer, Alexander Lattas, Stylianos Ploumpis, Jiankang Deng, Athanasios Pa-
paioannou, Stylianos Moschoglou, and Stefanos Zafeiriou. 2019a. Synthesizing
Coupled 3D Face Modalities by Trunk-Branch Generative Adversarial Networks.
arXiv preprint arXiv:1909.02215 (2019).
Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. 2019b. GANFIT:
Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction.
In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Jason Geng. 2011. Structured-light 3D surface imaging: a tutorial. Advances in Optics
and Photonics 3, 2 (Jun 2011), 128–160.
Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and
William T Freeman. 2018. Unsupervised training for 3d morphable model regres-
sion. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
8377–8386.
Athinodoros S Georghiades. 2003. Incorporating the Torrance and Sparrow Model of
Reectance in Uncalibrated Photometric Stereo. In Proc. International Conference on
Computer Vision (ICCV). 816.
Thomas Gerig, Andreas Morel-Forster, Clemens Blumer, Bernhard Egger, Marcel Lüthi,
Sandro Schönborn, and Thomas Vetter. 2018. Morphable Face Models - An Open
Framework. In Proc. International Conference on Automatic Face and Gesture Recog-
nition. 75–82.
Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and
Paul Debevec. 2011. Multiview face capture using polarized spherical gradient
illumination. In ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia),
Vol. 30. 129.
Abhijeet Ghosh, Tim Hawkins, Pieter Peers, Sune Frederiksen, and Paul Debevec. 2008.
Practical modeling and acquisition of layered facial reflectance. In ACM Transactions
on Graphics, Vol. 27. ACM, 139.
Thomas Gietzen, Robert Brylka, Jascha Achenbach, Katja zum Hebel, Elmar Schömer,
Mario Botsch, Ulrich Schwanecke, and Ralf Schulze. 2019. A method for automatic
forensic facial reconstruction based on dense statistics of soft tissue thickness. PloS
one 14, 1 (2019), e0210257.
Aleksey Golovinskiy, Wojciech Matusik, Hanspeter Pfister, Szymon Rusinkiewicz, and
Thomas Funkhouser. 2006. A statistical model for synthesis of detailed facial ge-
ometry. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 25, 3 (2006),
1025–1034.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In
Proc. Advances in neural information processing systems (NeurIPS). 2672–2680.
Paulo Gotardo, Jérémy Riviere, Derek Bradley, Abhijeet Ghosh, and Thabo Beeler. 2018.
Practical Dynamic Facial Appearance Modeling and Acquisition. ACM Transactions
on Graphics (Proceedings of SIGGRAPH Asia) 37, 6 (2018), 232:1–232:13.
Paulo F. U. Gotardo, Tomas Simon, Yaser Sheikh, and Iain Matthews. 2015. Photogeomet-
ric Scene Flow for High-Detail Dynamic 3D Reconstruction. In Proc. International
Conference on Computer Vision (ICCV).
Ulf Grenander. 1996. Elements of pattern theory. JHU Press.
Jianya Guo, Xi Mei, and Kun Tang. 2013. Automatic landmark annotation and dense
correspondence registration for 3D human facial images. BMC Bioinformatics 14, 1
(2013).
Peter L. Hallinan, Gaile G. Gordon, Alan L. Yuille, Peter Giblin, and David Mumford.
1999. Two- and Three-Dimensional Patterns of the Face. A K Peters/CRC Press.
Peter Hammond, Tim J Hutton, Judith E Allanson, Linda E Campbell, Raoul CM Hen-
nekam, Sean Holden, Michael A Patton, Adam Shaw, I Karen Temple, Matthew
Trotter, et al. 2004. 3D analysis of facial morphology. American Journal of Medical
Genetics Part A 126, 4 (2004), 339–348.
Fang Han and Han Liu. 2012. Semiparametric principal component analysis. In Proc.
Advances in neural information processing systems (NeurIPS). 171–179.
Tal Hassner, Shai Harel, Eran Paz, and Roee Enbar. 2015. Effective face frontalization
in unconstrained images. In Proc. IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). 4295–4304.
Behrend Heeren, Chao Zhang, Martin Rumpf, and William Smith. 2018. Principal
Geodesic Analysis in the Space of Discrete Shells. In Computer Graphics Forum,
Vol. 37. 173–184.
Carlos Hernández, George Vogiatzis, Gabriel J Brostow, Bjorn Stenger, and Roberto
Cipolla. 2007. Non-rigid photometric stereo with colored lights. In Proc. International
Conference on Computer Vision (ICCV). IEEE, 1–8.
Thomas Heseltine, Nick Pears, and Jim Austin. 2008. Three-dimensional face recognition
using combinations of surface feature map subspace components. Image and Vision
Computing 26, 3 (2008), 382–396.
Alexander Hewer, Stefanie Wuhrer, Ingmar Steiner, and Korin Richmond. 2018. A
multilinear tongue model derived from speech related MRI data of the human vocal
tract. Computer Speech and Language 51 (2018), 68–92.
Matthew Q Hill, Connor J Parde, Carlos D Castillo, Y Ivette Colon, Rajeev Ranjan,
Jun-Cheng Chen, Volker Blanz, and Alice J O’Toole. 2018. Deep Convolutional
Neural Networks in the Face of Caricature: Identity and Image Revealed. (2018).
Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Neural 3D Mesh Renderer.
In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Pei-Lu Hsieh, Chongyang Ma, Jihun Yu, and Hao Li. 2015. Unconstrained Realtime
Facial Performance Capture. In Proc. IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). 1675–1683.
Guosheng Hu, Pouria Mortazavian, Josef Kittler, and William Christmas. 2013. A facial
symmetry prior for improved illumination fitting of 3D morphable model. In Proc.
International Conference on Biometrics (ICB). IEEE, 1–6.
Guosheng Hu, Fei Yan, Josef Kittler, William Christmas, Chi Ho Chan, Zhenhua Feng,
and Patrik Huber. 2017c. Efficient 3D morphable face model fitting. Pattern Recogni-
tion 67 (2017), 366–379.
Liwen Hu, Derek Bradley, Hao Li, and Thabo Beeler. 2017a. Simulation-ready hair
capture. In Computer Graphics Forum, Vol. 36. 281–294.
Liwen Hu, Chongyang Ma, Linjie Luo, and Hao Li. 2014a. Robust hair capture using
simulated examples. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 33,
4 (2014), 126.
Liwen Hu, Chongyang Ma, Linjie Luo, Li-Yi Wei, and Hao Li. 2014b. Capturing braided
hairstyles. ACM Transactions on Graphics 33, 6 (2014), 225.
Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman
Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017b. Avatar Digitization from a
Single Image for Real-time Rendering. ACM Transactions on Graphics 36, 6, Article
195 (Nov. 2017), 14 pages. https://doi.org/10.1145/3130800.31310887
Patrik Huber. 2017. Real-time 3D morphable shape model fitting to monocular in-the-wild
videos. Ph.D. Dissertation. University of Surrey.
Patrik Huber, Guosheng Hu, Jose Rafael Tena, Pouria Mortazavian, Willem P. Koppen,
William J. Christmas, Matthias Rätsch, and Josef Kittler. 2016. A Multiresolution 3D
Morphable Face Model and Fitting Framework. In Proc. International Joint Confer-
ence on Computer Vision, Imaging and Computer Graphics Theory and Applications.
(VISIGRAPP).
David William Hunter and Bernard Paul Tiddeman. 2009. Visual ageing of human faces
in three dimensions using morphable models and projection to latent structures.
In Proc. International Joint Conference on Computer Vision, Imaging and Computer
Graphics Theory and Applications. (VISAPP).
Tim J. Hutton, Bernard F. Buxton, and Peter Hammond. 2001. Dense Surface Point
Distribution Models of the Human Face. In Proc. IEEE Workshop on Mathematical
Methods in Biomedical Image Analysis (MMBIA). 153–.
Tim J. Hutton, Bernard F. Buxton, Peter Hammond, and Henry WW Potts. 2003. Esti-
mating average growth trajectories in shape-space using kernel smoothing. IEEE
transactions on medical imaging 22, 6 (2003), 747–753.
Alexandru Eugen Ichim, Soen Bouaziz, and Mark Pauly. 2015. Dynamic 3D Avatar
Creation from Hand-held Video Input. ACM Transactions on Graphics 34, 4, Article
45 (July 2015), 14 pages. https://doi.org/10.1145/2766974
Alexandru-Eugen Ichim, Petr Kadlecek, Ladislav Kavan, and Mark Pauly. 2017. Phace:
Physics-based Face Modeling and Animation. ACM Transactions on Graphics 36, 4 (2017), 153:1–14.
Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer
networks. In Proc. Advances in neural information processing systems (NeurIPS).
2017–2025.
Fang Jiang, Volker Blanz, and Alice J O’Toole. 2009a. Three-dimensional information
in face representations revealed by identity aftereffects. Psychological Science 20, 3
(2009), 318–325.
Fang Jiang, Laurence Dricot, Volker Blanz, Rainer Goebel, and Bruno Rossion. 2009b.
Neural correlates of shape and surface reflectance information in individual faces.
Neuroscience 163, 4 (2009), 1078–1091.
Xiong Jiang, Angela Bollich, Patrick Cox, Eric Hyder, Joette James, Saqib Ali Gowani,
Nouchine Hadjikhani, Volker Blanz, Dara S Manoach, Jason JS Barton, et al. 2013. A quantitative link between face discrimination deficits and neuronal selectivity for
faces in autism. NeuroImage: Clinical 2 (2013), 320–331.
Xiong Jiang, Ezra Rosen, Thomas Zero, John VanMeter, Volker Blanz, and Maximilian
Riesenhuber. 2006. Evaluation of a shape-based model of human face discrimination
using FMRI and behavioral techniques. Neuron 50, 1 (2006), 159–172.
Andrew Jones, Jen-Yuan Chiang, Abhijeet Ghosh, Magnus Lang, Matthias Hullin, Jay
Busch, and Paul Debevec. 2008. Real-time Geometry and Reflectance Capture for
Digital Face Replacement. Technical Report 4s. University of Southern California.
Michael J Jones and Tomaso Poggio. 1998. Multidimensional morphable models. In
IJCV. IEEE, 683–688.
Hanbyul Joo, Tomas Simon, and Yaser Sheikh. 2018. Total Capture: A 3D Deformation
Model for Tracking Faces, Hands, and Bodies. In Proc. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
Ioannis A. Kakadiaris, Georgios Passalis, Theoharis Theoharis, George Toderici, I.
Konstantinidis, and N. Murtuza. 2005. Multimodal face recognition: combination
of geometry with physiological information. In Proc. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Vol. 2. 1022–1029.
Ioannis A. Kakadiaris, Georgios Passalis, George Toderici, Mohammed N Murtuza, Yun-
liang Lu, Nikos Karamelpatzis, and Theoharis Theoharis. 2007. Three-dimensional
face recognition in the presence of facial expressions: An annotated deformable
model approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 29,
4 (2007), 640–649.
Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-
driven Facial Animation by Joint End-to-end Learning of Pose and Emotion. ACM
Transactions on Graphics 36, 4, Article 94 (July 2017), 12 pages.
Tero Karras, Samuli Laine, and Timo Aila. 2019b. A Style-Based Generator Architecture
for Generative Adversarial Networks. In Proc. IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo
Aila. 2019a. Analyzing and Improving the Image Quality of StyleGAN. arXiv preprint
arXiv:1912.04958 (2019).
Michael Keller, Reinhard Knothe, and Thomas Vetter. 2007. 3D reconstruction of human
faces from occluding contours. In MIRAGE. Springer, 261–273.
Ira Kemelmacher-Shlizerman. 2016. Transfiguring Portraits. ACM Transactions on
Graphics 35, 4, Article 94 (July 2016), 8 pages.
Ira Kemelmacher-Shlizerman, Aditya Sankar, Eli Shechtman, and Steven M. Seitz. 2010.
Being John Malkovich. In Proc. European Conference on Computer Vision (ECCV),
Vol. 6311. Springer, 341–353.
Ira Kemelmacher-Shlizerman and Steven M. Seitz. 2011. Face Reconstruction in the Wild.
In Proc. International Conference on Computer Vision (ICCV). IEEE Computer Society,
Washington, DC, USA, 1746–1753. https://doi.org/10.1109/ICCV.2011.6126439
Sameh Khamis, Jonathan Taylor, Jamie Shotton, Cem Keskin, Shahram Izadi, and
Andrew Fitzgibbon. 2015. Learning an efficient model of hand shape variation from
depth images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). 2540–2548.
Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias
Nießner, Patrick Pérez, Christian Richardt, Michael Zollhoefer, and Christian
Theobalt. 2018a. Deep Video Portraits. ACM Transactions on Graphics (2018).
Hyeongwoo Kim, Michael Zollhoefer, Ayush Tewari, Justus Thies, Christian Richardt,
and Christian Theobalt. 2018b. InverseFaceNet: Deep Single-Shot Inverse Face
Rendering From A Single Image. In Proc. IEEE Conference on Computer Vision and
Pattern Recognition (CVPR).
Brendan F Klare, Mark J Burge, Joshua C Klontz, Richard W Vorder Bruegge, and Anil K
Jain. 2012. Face recognition performance: Role of demographic information. IEEE
Transactions on Information Forensics and Security 7, 6 (2012), 1789–1801.
Martin Klaudiny, Steven McDonagh, Derek Bradley, Thabo Beeler, and Kenny Mitchell.
2017. Real-Time Multi-View Facial Capture with Synthetic Training. In Computer
Graphics Forum, Vol. 36. 325–336.
Reinhard Knothe, Sami Romdhani, and Thomas Vetter. 2006. Combining PCA and LFA
for Surface Reconstruction from a Sparse Set of Control Points. In Proc. International
Conference on Automatic Face and Gesture Recognition. 637–644.
Paul Koppen, Zhen-Hua Feng, Josef Kittler, Muhammad Awais, William Christmas,
Xiao-Jun Wu, and He-Feng Yin. 2018. Gaussian mixture 3D morphable face model.
Pattern Recognition 74 (2018), 617–628.
Adam Kortylewski, Bernhard Egger, Andreas Schneider, Thomas Gerig, Andreas Morel-
Forster, and Thomas Vetter. 2018a. Empirically analyzing the effect of dataset biases
on deep face recognition systems. In Proc. IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) Workshops. 2093–2102.
Adam Kortylewski, Bernhard Egger, Andreas Schneider, Thomas Gerig, Andreas Morel-
Forster, and Thomas Vetter. 2019. Analyzing and Reducing the Damage of Dataset
Bias to Face Recognition With Synthetic Data. In Proc. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) Workshops.
Adam Kortylewski, Andreas Schneider, Thomas Gerig, Bernhard Egger, Andreas Morel-
Forster, and Thomas Vetter. 2018b. Training deep face recognition systems with
synthetic data. arXiv preprint arXiv:1802.05891 (2018).
Adam Kortylewski, Mario Wieser, Andreas Morel-Forster, Aleksander Wieczorek, Sonali
Parbhoo, Volker Roth, and Thomas Vetter. 2018c. Informed MCMC with Bayesian
Neural Networks for Facial Image Analysis. Proc. Advances in neural information
processing systems workshops (NeurIPS) (2018).
Jana Koudelová, Ján Dupej, Jaroslav Bružek, Petr Sedlak, and Jana Velemínská. 2015.
Modelling of facial growth in Czech children based on longitudinal data: Age pro-
gression from 12 to 15 years using 3D surface models. Forensic Science International
248 (2015), 33–40.
Aravind Krishnaswamy and Gladimir VG Baranoski. 2004. A biophysically-based
spectral model of light interaction with human skin. Computer Graphics Forum
23, 3 (2004), 331–340.
Sumedha Kshirsagar and Nadia Magnenat-Thalmann. 2003. Visyllable Based Speech
Animation. Computer Graphics Forum 22, 3 (2003), 632–640.
Samuli Laine, Tero Karras, Timo Aila, Antti Herva, Shunsuke Saito, Ronald Yu, Hao Li,
and Jaakko Lehtinen. 2017. Production-level Facial Performance Capture Using Deep
Convolutional Neural Networks. In Proc. ACM SIGGRAPH / Eurographics Symposium
on Computer Animation (SCA). ACM, New York, NY, USA, Article 10, 10 pages.
Erik Learned-Miller, Qifeng Lu, Angela Paisley, Peter Trainer, Volker Blanz, Katrin
Dedden, and Ralph Miller. 2006. Detecting acromegaly: screening for disease with
a morphable model. In Proc. International Conference on Medical Image Computing
and Computer-Assisted Intervention (MICCAI). Springer, 495–503.
Kuang-Chih Lee, Jerey Ho, and David J Kriegman. 2005. Acquiring linear subspaces
for face recognition under variable lighting. IEEE Transactions on Pattern Analysis
and Machine Intelligence 5 (2005), 684–698.
David A Leopold, Igor V Bondar, and Martin A Giese. 2006. Norm-based face encoding
by single neurons in the monkey inferotemporal cortex. Nature 442, 7102 (2006),
572.
David A Leopold, Alice J O’Toole, Thomas Vetter, and Volker Blanz. 2001. Prototype-
referenced shape encoding revealed by high-level aftereffects. Nature Neuroscience 4, 1 (2001), 89.
Marc Levoy, Kari Pulli, Brian Curless, Szymon Rusinkiewicz, David Koller, Lucas Pereira,
Matt Ginzton, Sean Anderson, James Davis, Jeremy Ginsberg, et al. 2000. The digital
Michelangelo project: 3D scanning of large statues. In Proc. Conference on Computer
graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co.,
131–144.
JP Lewis, Zhenyao Mo, Ken Anjyo, and Taehyun Rhee. 2014b. Probable and improbable
faces. In Mathematical Progress in Expressive Image Synthesis I. 21–30.
J. P. Lewis, Ken Anjyo, Taehyun Rhee, Mengjie Zhang, Fred Pighin, and Zhigang Deng.
2014a. Practice and Theory of Blendshape Facial Models. In Computer Graphics
Forum (Eurographics State of the Art Reports).
Chen Li, Kun Zhou, and Stephen Lin. 2014b. Intrinsic Face Image Decomposition
with Human Face Priors. In Proc. European Conference on Computer Vision (ECCV),
Vol. 8693. Springer, 218–233.
Chen Li, Kun Zhou, and Stephen Lin. 2015b. Simulating makeup through physics-based
manipulation of intrinsic image layers. In Proc. IEEE Conference on Computer Vision
and Pattern Recognition (CVPR). IEEE Computer Society, 4621–4629.
Hao Li, Laura Trutoiu, Kyle Olszewski, Lingyu Wei, Tristan Trutna, Pei-Lun Hsieh,
Aaron Nicholls, and Chongyang Ma. 2015a. Facial Performance Sensing Head-
Mounted Display. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 34, 4
(July 2015).
Hao Li, Thibaut Weise, and Mark Pauly. 2010. Example-based facial rigging. ACM
Transactions on Graphics (Proceedings of SIGGRAPH) 29, 4 (2010), 32:1–32:6.
Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. 2013. Realtime Facial Animation with
On-the-y Correctives. ACM Transactions on Graphics 32, 4 (2013), 42:1–42:10.
Kai Li, Qionghai Dai, Ruiping Wang, Yebin Liu, Feng Xu, and Jue Wang. 2014a. A Data-
Driven Approach for Facial Expression Retargeting in Video. IEEE Transactions on
Multimedia 16, 2 (2014), 299–310.
Kai Li, Feng Xu, Jue Wang, Qionghai Dai, and Yebin Liu. 2012. A data-driven approach
for facial expression synthesis in video. In Proc. IEEE Conference on Computer Vision
and Pattern Recognition (CVPR). IEEE Computer Society, 57–64.
Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. 2017. Learning a
model of facial shape and expression from 4D scans. ACM Transactions on Graphics
36, 6 (2017).
Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. 2018. Differentiable Monte Carlo ray tracing through edge sampling. ACM Transactions on Graphics 37, 6 (2018), 222.
Shu Liang, Ira Kemelmacher-Shlizerman, and Linda G. Shapiro. 2014. 3D Face Hallu-
cination from a Single Depth Frame. In Proc. IEEE International Conference on 3D
Vision (3DV), Vol. 1. 31–38. https://doi.org/10.1109/ThreeDV.2014.67
Shu Liang, Linda G Shapiro, and Ira Kemelmacher-Shlizerman. 2016. Head Reconstruc-
tion from Internet Photos. In Proc. European Conference on Computer Vision (ECCV).
Springer, 360–374.
Jiangke Lin, Yi Yuan, Tianjia Shao, and Kun Zhou. 2020. Towards High-Fidelity 3D Face
Reconstruction from In-the-Wild Images Using Graph Convolutional Networks. In
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Feng Liu, Luan Tran, and Xiaoming Liu. 2019b. 3D Face Modeling from Diverse Raw
Scan Data. In Proc. International Conference on Computer Vision (ICCV).
Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. 2019a. Soft Rasterizer: A Differen-
tiable Renderer for Image-based 3D Reasoning. In Proc. International Conference on
Computer Vision (ICCV).
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face
Attributes in the Wild. In Proc. International Conference on Computer Vision (ICCV).
Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and
Olivier Bachem. 2019. Challenging common assumptions in the unsupervised learn-
ing of disentangled representations. In Proc. International Conference on Machine
Learning.
Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep ap-
pearance models for face rendering. ACM Transactions on Graphics 37, 4 (2018),
68.
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J.
Black. 2015. SMPL: A Skinned Multi-Person Linear Model. ACM Transactions on
Graphics (Proceedings of SIGGRAPH Asia) 34, 6 (Oct. 2015), 248:1–248:16.
Linjie Luo, Hao Li, and Szymon Rusinkiewicz. 2013. Structure-aware hair capture. ACM
Transactions on Graphics 32, 4 (2013), 76.
Marcel Lüthi, Thomas Gerig, Christoph Jud, and Thomas Vetter. 2018. Gaussian process
morphable models. IEEE Transactions on Pattern Analysis and Machine Intelligence
40, 8 (2018), 1860–1873.
Debbie S Ma, Joshua Correll, and Bernd Wittenbrink. 2015. The Chicago face database:
A free stimulus set of faces and norming data. Behavior research methods 47, 4 (2015),
1122–1135.
Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and
Paul Debevec. 2007. Rapid acquisition of specular and diffuse normal maps from
polarized spherical gradient illumination. In Proc. Eurographics Workshops. 183–194.
Dennis Madsen, Marcel Lüthi, Andreas Schneider, and Thomas Vetter. 2018. Probabilistic
Joint Face-Skull Modelling for Facial Reconstruction. In Proc. IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). 5295–5303.
Stephen R Marschner, Henrik Wann Jensen, Mike Cammarano, Steve Worley, and Pat
Hanrahan. 2003. Light scattering from human hair fibers. ACM Transactions on
Graphics 22, 3 (2003), 780–791.
Stephen R Marschner, Stephen H Westin, Eric PF Lafortune, Kenneth E Torrance, and Donald P Greenberg. 1999. Image-Based BRDF Measurement Including Human
Skin. In Proc. Eurographics Workshops. 131.
Iacopo Masi, Anh Tuan Tran, Tal Hassner, Jatuporn Toy Leksut, and Gérard Medioni. 2016. Do we really need to collect millions of faces for effective face recognition?.
In Proc. European Conference on Computer Vision (ECCV). Springer, 579–596.
Bogdan J. Matuszewski, Wei Quan, Lik-Kwan Shark, Alison S. McLoughlin, Catherine E.
Lightbody, Hedley C.A. Emsley, and Caroline L. Watkins. 2012. Hi4D-ADSIP 3-D
dynamic facial articulation database. Image and Vision Computing 30, 10 (2012), 713
– 727. 3D Facial Behaviour Analysis and Understanding.
Steven McDonagh, Martin Klaudiny, Derek Bradley, Thabo Beeler, Iain Matthews, and
Kenny Mitchell. 2016. Synthetic Prior Design for Real-Time Face Tracking. In Proc.
IEEE International Conference on 3D Vision (3DV). 639–648.
Baback Moghaddam, Jinho Lee, Hanspeter Pfister, and Raghu Machiraju. 2003. Model-
Based 3D Face Capture with Shape-from-Silhouettes. In Proc. International Confer-
ence on Computer Vision (ICCV) Workshops. IEEE Computer Society, 20.
Andreas Morel-Forster. 2016. Generative shape and image analysis by combining Gauss-
ian processes and MCMC sampling. Ph.D. Dissertation. University of Basel.
Stylianos Moschoglou, Evangelos Ververas, Yannis Panagakis, Mihalis A. Nicolaou,
and Stefanos Zafeiriou. 2018. Multi-attribute robust component analysis for facial
uv maps. IEEE Journal of Selected Topics in Signal Processing 12, 6 (2018), 1324–1337.
Iordanis Mpiperis, Sotiris Malassiotis, and Michael G. Strintzis. 2008. Bilinear Models
for 3-D Face and Facial Expression Recognition. IEEE Transactions on Information
Forensics and Security 3, 3 (2008), 498–511.
Andreas Mueller, Pascal Paysan, Ralf Schumacher, Hans-Florian Zeilhofer, Isabelle
Berg-Boerner, Juerg Maurer, Thomas Vetter, Erik Schkommodau, Philipp Juergens,
and Katja Schwenzer-Zimmerer. 2011. Missing facial parts computed by a mor-
phable model and transferred directly to a polyamide laser-sintered prosthesis: an
innovation study. British Journal of Oral and Maxillofacial Surgery 49, 8 (2011),
e67–e71.
David Mumford and Agnès Desolneux. 2010. Pattern theory: the stochastic analysis of
real-world signals. AK Peters/CRC Press.
Koki Nagano, Huiwen Luo, Zejian Wang, Jaewoo Seo, Jun Xing, Liwen Hu, Lingyu
Wei, and Hao Li. 2019. Deep Face Normalization. ACM Transactions on Graphics
(Proceedings of SIGGRAPH) 38, 6 (2019), 183:1–16.
Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral
Agarwal, Jens Fursund, Hao Li, Richard Roberts, et al. 2018. paGAN: real-time
avatars using dynamic textures. In ACM Transactions on Graphics (Proceedings of
SIGGRAPH Asia). ACM, 258.
Diego Nehab, Szymon Rusinkiewicz, James Davis, and Ravi Ramamoorthi. 2005. Effi-
ciently combining positions and normals for precise 3D geometry. ACM Transactions
on Graphics (Proceedings of SIGGRAPH) 24, 3 (2005), 536–543.
Thomas Neumann, Kiran Varanasi, Stephan Wenger, Markus Wacker, Marcus Magnor,
and Christian Theobalt. 2013. Sparse Localized Deformation Components. ACM
Transactions on Graphics (Proceedings of SIGGRAPH Asia) 32, 6 (2013), 179:1–179:10.
Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim,
Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew
Fitzgibbon. 2011. KinectFusion: Real-time dense surface mapping and tracking.
In Proc. IEEE International Symposium on Mixed and Augmented Reality (ISMAR).
127–136.
Arthur Niswar, Ishtiaq Khan, and Farzam Farbiz. 2011. Virtual try-on of eyeglasses using
3D model of the head. Proc. International Conference on Virtual Reality Continuum
and Its Applications in Industry (12 2011). https://doi.org/10.1145/2087756.2087838
Kyle Olszewski, Joseph J. Lim, Shunsuke Saito, and Hao Li. 2016. High-Fidelity Facial
and Speech Animation for VR HMDs. ACM Transactions on Graphics (Proceedings of
SIGGRAPH Asia) 35, 6 (December 2016).
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Love-
grove. 2019. Deepsdf: Learning continuous signed distance functions for shape
representation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition
(CVPR).
Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. 2015. Deep Face Recognition.
In Proc. British Machine Vision Conference (BMVC).
Pascal Paysan. 2010. Statistical modeling of facial aging based on 3D scans. Ph.D.
Dissertation. University of Basel.
Georgios Passalis, Panagiotis Perakis, Theoharis Theoharis, and Ioannis A. Kakadiaris.
2011. Using Facial Symmetry to Handle Pose Variations in Real-World 3D Face
Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 10
(2011), 1938–1951.
Ankur Patel and William A.P. Smith. 2009. 3D morphable face models revisited. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1327–1334.
Ankur Patel and William A.P. Smith. 2011. Simplification of 3D morphable models. In
Proc. International Conference on Computer Vision (ICCV). 271–278.
Ankur Patel and William AP Smith. 2012. Driving 3D morphable models using shading
cues. Pattern Recognition 45, 5 (2012), 1993–2004.
Ankur Patel and William AP Smith. 2016. Manifold-based constraints for operations in
face space. Pattern Recognition 52 (2016), 206–217.
Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A.
Osman, Dimitrios Tzionas, and Michael J. Black. 2019. Expressive Body Capture: 3D
Hands, Face, and Body from a Single Image. In Proc. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter.
2009a. A 3D face model for pose and illumination invariant face recognition. In
Proc. IEEE International Conference on Advanced Video and Signal Based Surveillance.
IEEE, 296–301.
Pascal Paysan, Marcel Lüthi, Thomas Albrecht, Anita Lerch, Brian Amberg, Francesco
Santini, and Thomas Vetter. 2009b. Face reconstruction from skull shapes and
physical attributes. In Deutsche Arbeitsgemeinschaft für Mustererkennung Symposium
(DAGM). Springer, 232–241.
P Jonathon Phillips, Patrick Grother, Ross Micheals, Duane M Blackburn, Elham Tabassi,
and Mike Bone. 2003. Face recognition vendor test 2002. In Proc. International SOI
Conference. IEEE.
Jean-Sébastien Pierrard. 2008. Skin segmentation for robust face image analysis. Ph.D.
Dissertation. University of Basel.
Jean-Sébastien Pierrard and Thomas Vetter. 2007. Skin detail analysis for face recogni-
tion. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Fred Pighin and J.P. Lewis. 2006. Performance-Driven Facial Animation. In ACM
Transactions on Graphics (Proceedings of SIGGRAPH).
Marcel Piotraschke and Volker Blanz. 2016. Automated 3D Face Reconstruction from
Multiple Images Using Quality Measures. In Proc. IEEE Conference on Computer Vision
and Pattern Recognition (CVPR). 3418–3427. https://doi.org/10.1109/CVPR.2016.372
Stylianos Ploumpis, Haoyang Wang, Nick Pears, William A. P. Smith, and Stefanos
Zafeiriou. 2019. Combining 3D Morphable Models: A Large Scale Face-And-Head
Model. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, and Francesc
Moreno-Noguer. 2018. GANimation: Anatomically-aware Facial Animation from a
Single Image. In Proc. European Conference on Computer Vision (ECCV).
Ravi Ramamoorthi and Pat Hanrahan. 2001. An efficient representation for irradiance
environment maps. In ACM Transactions on Graphics (Proceedings of SIGGRAPH).
ACM, 497–500.
Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J. Black. 2018. Generating
3D faces using Convolutional Mesh Autoencoders. In Proc. European Conference on
Computer Vision (ECCV). 725–741.
Elad Richardson, Matan Sela, and Ron Kimmel. 2016. 3D face reconstruction by learning
from synthetic data. In Proc. IEEE International Conference on 3D Vision (3DV). 460–
469.
Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. 2017. Learning detailed face
reconstruction from a single image. In Proc. IEEE Conference on Computer Vision and
Pattern Recognition (CVPR). 1259–1268.
Sami Romdhani, Volker Blanz, and Thomas Vetter. 2002. Face identification by fitting a
3d morphable model using linear shape and texture error functions. In Proc. European
Conference on Computer Vision (ECCV). Springer, 3–19.
Sami Romdhani and Thomas Vetter. 2005. Estimating 3D shape and texture using pixel
intensity, edges, specular highlights, texture constraints and a prior. In Proc. IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2. 986–993 vol.
2. https://doi.org/10.1109/CVPR.2005.145
Fabiano Romeiro and Todd Zickler. 2007. Model-based stereo with occlusions. In Proc.
International Conference on Automatic Face and Gesture Recognition. Springer, 31–45.
Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and
Matthias Nießner. 2019. Faceforensics++: Learning to detect manipulated facial
images. In Proc. International Conference on Computer Vision (ICCV). 1–11.
Joseph Roth, Yiying Tong, and Xiaoming Liu. 2015. Unconstrained 3D Face Reconstruc-
tion. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Boston, MA.
Joseph Roth, Yiying Tong, and Xiaoming Liu. 2016. Adaptive 3D Face Reconstruction
from Unconstrained Photo Collections. IEEE Transactions on Pattern Analysis and
Machine Intelligence 39, 11 (December 2016), 2127–2141.
Shunsuke Saito, Liwen Hu, Chongyang Ma, Hikaru Ibayashi, Linjie Luo, and Hao Li.
2018. 3D hair synthesis using volumetric variational autoencoders. ACM Transactions
on Graphics (Proceedings of SIGGRAPH Asia) 37, 6 (2018).
Shunsuke Saito, Tianye Li, and Hao Li. 2016. Real-Time Facial Segmentation and
Performance Capture from RGB Input. In Proc. European Conference on Computer
Vision (ECCV), Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.). Springer
International Publishing, Cham, 244–261.
Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, and Hao Li. 2017. Photorealistic
facial texture inference using deep neural networks. In Proc. IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). 5144–5153.
Augusto Salazar, Stefanie Wuhrer, Chang Shu, and Flavio Prieto. 2014. Fully automatic
expression-invariant face correspondence. Machine Vision and Applications 25, 4
(2014), 859–879.
Dalila Sánchez-Escobedo, Mario Castelán, and William AP Smith. 2016. Statistical
3D face shape estimation from occluding contours. Computer Vision and Image
Understanding 142 (2016), 111–124.
Soubhik Sanyal, Timo Bolkart, Haiwen Feng, and Michael Black. 2019. Learning to
Regress 3D Face Shape and Expression from an Image without 3D Supervision. In
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. 2011. Real-time avatar animation
from a single image. In Proc. International Conference on Automatic Face and Gesture
Recognition. IEEE Computer Society, 117–124.
Arman Savran, Neşe Alyüz, Hamdi Dibeklioglu, Oya Celiktutan, Berk Gökberk, Bülent
Sankur, and Lale Akarun. 2008. Bosphorus database for 3D face analysis. In Proc.
European Workshop on Biometrics and Identity Management. 47–56.
Kristina Scherbaum, Tobias Ritschel, Matthias Hullin, Thorsten Thormählen, Volker
Blanz, and Hans-Peter Seidel. 2011. Computer-Suggested Facial Makeup. Computer
Graphics Forum (2011).
Kristina Scherbaum, Martin Sunkel, H-P Seidel, and Volker Blanz. 2007. Prediction
of individual non-linear aging trajectories of faces. In Computer Graphics Forum,
Vol. 26. Wiley Online Library, 285–294.
Andreas Schneider, Ghazi Bouabene, Ayet Shaiek, Sandro Schönborn, and Thomas Vet-
ter. 2019. Photo-Realistic Exemplar-Based Face Aging. Proc. International Conference
on Automatic Face and Gesture Recognition (2019).
Andreas Schneider, Bernhard Egger, and Thomas Vetter. 2018. A Parametric Freckle
Model for Faces. In Proc. International Conference on Automatic Face and Gesture
Recognition.
Andreas Schneider, Sandro Schönborn, Lavrenti Frobeen, Bernhard Egger, and Thomas
Vetter. 2017. Ecient global illumination for morphable models. In Proc. International
Conference on Computer Vision (ICCV). 3865–3873.
Sandro Schönborn, Bernhard Egger, Andreas Forster, and Thomas Vetter. 2015. Back-
ground modeling for generative image models. Computer Vision and Image Under-
standing 136 (2015), 117–127.
Sandro Schönborn, Bernhard Egger, Andreas Morel-Forster, and Thomas Vetter. 2017.
Markov Chain Monte Carlo for Automated Face Image Analysis. International
Journal of Computer Vision 123, 2 (01 Jun 2017), 160–183. https://doi.org/10.1007/
s11263-016-0967-5
Florian Schro, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unied
embedding for face recognition and clustering. In Proc. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). 815–823.
Matthaeus Schumacher and Volker Blanz. 2012. Which facial profile do humans expect
after seeing a frontal view? a comparison with a linear face model. ACM Transactions
on Applied Perception 9, 3 (2012), 11.
Matthaeus Schumacher and Volker Blanz. 2015. Exploration of the correlations of
attributes and features in faces. In Proc. International Conference on Automatic Face
and Gesture Recognition, Vol. 1. IEEE, 1–8.
Alassane Seck, William AP Smith, Arnaud Dessein, Bernard Tiddeman, Hannah Dee,
and Abhishek Dutta. 2016. Ear-to-ear capture of facial intrinsics. arXiv preprint
arXiv:1609.02368 (2016).
Matan Sela, Elad Richardson, and Ron Kimmel. 2017. Unrestricted Facial Geometry
Reconstruction Using Image-to-Image Translation. In Proc. International Conference
on Computer Vision (ICCV). IEEE Computer Society, 1585–1594.
Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David W Jacobs. 2018.
SfSNet: Learning Shape, Reflectance and Illuminance of Faces 'in the Wild'. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6296–6305.
Gil Shamai, Ron Slossberg, and Ron Kimmel. 2020. Synthesizing facial photometries and
corresponding geometries using generative adversarial networks. ACM Transactions
on Multimedia Computing, Communications, and Applications 15, 3 (2020), #87:1–24.
Christian R Shelton. 2000. Morphable surface models. International Journal of Computer
Vision 38, 1 (2000), 75–91.
Cheng-Ta Shen, Fay Huang, Wan-Hua Lu, Sheng-Wen Shih, and Hong-Yuan Mark Liao.
2014. 3D Age Progression Prediction in Children’s Faces with a Small Exemplar-
Image Set. Journal of Information Science & Engineering 30, 4 (2014).
Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. 2014. Automatic acquisition
of high-fidelity facial performances using monocular videos. ACM Transactions on
Graphics 33, 6 (2014), 222.
Il-Kyu Shin, A Cengiz Öztireli, Hyeon-Joong Kim, Thabo Beeler, Markus Gross, and
Soo-Mi Choi. 2014. Extraction and transfer of facial expression wrinkles for facial
performance enhancement. In Proc. of The Pacific Conference on Computer Graphics
and Applications. 113–118.
Lawrence Sirovich and Michael Kirby. 1987. Low-dimensional procedure for the char-
acterization of human faces. Journal of the Optical Society of America A 4, 3 (1987),
519–524.
Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. 2019. Scene Representa-
tion Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In
Proc. Advances in neural information processing systems (NeurIPS).
Peter-Pike Sloan, Jan Kautz, and John Snyder. 2002. Precomputed radiance transfer
for real-time rendering in dynamic, low-frequency lighting environments. ACM
Transactions on Graphics 21, 3 (2002), 527–536.
Ron Slossberg, Gil Shamai, and Ron Kimmel. 2018. High quality facial surface and
texture synthesis via generative adversarial networks. In Proc. European Conference
on Computer Vision (ECCV). 0–0.
Michael De Smet and Luc Van Gool. 2010. Optimal regions for linear model-based 3D
face reconstruction. In Asian Conference on Computer Vision (ACCV). 276–289.
William AP Smith. 2016. The perspective face shape ambiguity. In Perspectives in Shape
Analysis. Springer, 299–319.
William A. P. Smith, Alassane Seck, Hannah Dee, Bernard Tiddeman, Joshua Tenen-
baum, and Bernhard Egger. 2020. A Morphable Face Albedo Model. In Proc. IEEE
Conference on Computer Vision and Pattern Recognition (CVPR).
Giota Stratou, Abhijeet Ghosh, Paul Debevec, and Louis-Philippe Morency. 2011. Effect
of illumination on automatic expression recognition: a novel 3D relightable facial
database. In Proc. International Conference on Automatic Face and Gesture Recognition.
IEEE, 611–618.
Martin A. Styner, Kumar T. Rajamani, Lutz-Peter Nolte, Gabriel Zsemlye, Gábor Székely,
Christopher J. Taylor, and Rhodri H. Davies. 2003. Evaluation of 3D Correspondence
Methods for Model Building. In Proc. International conference on Information Pro-
cessing in Medical Imaging (IPMI), Chris Taylor and J. Alison Noble (Eds.). Springer
Berlin Heidelberg, Berlin, Heidelberg, 63–75.
Yi Sun, Xiaochen Chen, Matthew Rosato, and Lijun Yin. 2010. Tracking Vertex Flow
and Model Adaptation for Three-Dimensional Spatiotemporal Face Analysis. IEEE
Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 40, 3
(2010), 461–474.
Yifan Sun and Noboru Murata. 2020. CAFM: A 3D Morphable Model for Animals. In
Proc. IEEE Winter Conference on Applications of Computer Vision (WACV).
Michael Suttie, Tatiana Foroud, Leah Wetherill, Joseph L Jacobson, Christopher D
Molteno, Ernesta M Meintjes, H Eugene Hoyme, Nathaniel Khaole, Luther K Robin-
son, Edward P Riley, et al. 2013. Facial dysmorphism across the fetal alcohol spectrum.
Pediatrics 131, 3 (2013), e779.
Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M. Seitz. 2014.
Total Moving Face Reconstruction. In Proc. European Conference on Computer Vi-
sion (ECCV), David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.).
Springer International Publishing, Cham, 796–812.
Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2015. What
makes tom hanks look like tom hanks. In Proc. International Conference on Computer
Vision (ICCV). 3952–3960.
Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. 2017.
Synthesizing Obama: Learning Lip Sync from Audio. ACM Transactions on Graphics
36, 4, Article 95 (July 2017), 13 pages.
Attila Szabó, Givi Meishvili, and Paolo Favaro. 2019. Unsupervised Generative 3D
Shape Learning from Natural Images. arXiv preprint arXiv:1910.00287 (2019).
Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. 2014. Deepface:
Closing the gap to human-level performance in face verification. In Proc. IEEE
Conference on Computer Vision and Pattern Recognition (CVPR). 1701–1708.
Gary K.L. Tam, Zhi-Quan Cheng, Yu-Kun Lai, Frank C. Langbein, Yonghuai Liu, David
Marshall, Ralph R. Martin, Xian-Fang Sun, and Paul L. Rosin. 2013. Registration
of 3D point clouds and meshes: A survey from rigid to Nonrigid. Transactions on
Visualization and Computer Graphics 19, 7 (2013), 1199–1217.
Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia
Rodriguez, Jessica Hodgins, and Iain Matthews. 2017. A Deep Learning Approach
for Generalized Speech Animation. ACM Transactions on Graphics 36, 4 (2017),
93:1–93:11.
Jose Rafael Tena, Fernando De la Torre, and Iain Matthews. 2011. Interactive Region-
based Linear 3D Face Models. ACM Transactions on Graphics 30, 4 (July 2011),
76:1–76:10.
Jose Rafael Tena, Raymond S Smith, Miroslav Hamouz, Josef Kittler, Adrian Hilton, and
John Illingworth. 2007. 2d face pose normalisation using a 3d morphable model. In
Proc. IEEE International Conference on Advanced Video and Signal Based Surveillance.
IEEE, 51–56.
Frank B ter Haar and Remco C Veltkamp. 2008. 3D face model fitting for recognition.
In Proc. European Conference on Computer Vision (ECCV). Springer, 652–664.
Ayush Tewari, Florian Bernard, Pablo Garrido, Gaurav Bharaj, Mohamed Elgharib,
Hans-Peter Seidel, Patrick Pérez, Michael Zollhoefer, and Christian Theobalt. 2019.
FML: Face Model Learning from Videos. In Proc. IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
Ayush Tewari, Michael Zollhoefer, Florian Bernard, Pablo Garrido, Hyeongwoo Kim,
Patrick Perez, and Christian Theobalt. 2018. High-Fidelity Monocular Face Re-
construction based on an Unsupervised Model-based Face Autoencoder. IEEE
Transactions on Pattern Analysis and Machine Intelligence (2018), 1–1. https:
//doi.org/10.1109/TPAMI.2018.2876842
Ayush Tewari, Michael Zollhoefer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim,
Patrick Pérez, and Christian Theobalt. 2018. Self-supervised multi-level face model
learning for monocular reconstruction at over 250 hz. In Proc. IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). 2549–2559.
Ayush Tewari, Michael Zollhoefer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard,
Patrick Perez, and Christian Theobalt. 2017. MoFA: Model-based Deep Convolutional
Face Autoencoder for Unsupervised Monocular Reconstruction. In Proc. International
Conference on Computer Vision (ICCV).
Barry-John Theobald, Iain Matthews, Michael Mangini, Jeffrey R Spies, Timothy R
Brick, Jeffrey F Cohn, and Steven M Boker. 2009. Mapping and manipulating facial
expression. Language and Speech 52, 2–3 (2009), 369–386.
Justus Thies, Michael Zollhoefer, Matthias Nießner, Levi Valgaerts, Marc Stamminger,
and Christian Theobalt. 2015. Real-time Expression Transfer for Facial Reenactment.
ACM Transactions on Graphics 34, 6 (2015), 183:1–183:14.
Justus Thies, Michael Zollhoefer, Marc Stamminger, Christian Theobalt, and Matthias
Nießner. 2016. Face2face: Real-time face capture and reenactment of rgb videos. In
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2387–2395.
Justus Thies, Michael Zollhoefer, Marc Stamminger, Christian Theobalt, and Matthias
Nießner. 2018a. FaceVR: Real-Time Gaze-Aware Facial Reenactment in Virtual
Reality. ACM Transactions on Graphics 37, 2, Article 25 (June 2018), 15 pages.
https://doi.org/10.1145/3182644
Justus Thies, Michael Zollhoefer, Marc Stamminger, Christian Theobalt, and Matthias
Nießner. 2018b. HeadOn: Real-time Reenactment of Human Portrait Videos. ACM
Transactions on Graphics (2018).
Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred neural rendering:
Image synthesis using neural textures. ACM Transactions on Graphics (Proceedings
of SIGGRAPH) 38, 4 (2019), 1–12.
Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias
Nießner. 2018c. FaceVR: Real-Time Facial Reenactment and Eye Gaze Control in
Virtual Reality. ACM Transactions on Graphics (Proceedings of SIGGRAPH) (2018).
Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gérard Medioni. 2017. Regressing robust
and discriminative 3D morphable models with a very deep neural network. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5163–5172.
Anh Tuan Tran, Tal Hassner, Iacopo Masi, Eran Paz, Yuval Nirkin, and Gérard G Medioni.
2018. Extreme 3D Face Reconstruction: Seeing Through Occlusions.. In Proc. IEEE
Conference on Computer Vision and Pattern Recognition (CVPR). 3935–3944.
Luan Tran, Feng Liu, and Xiaoming Liu. 2019. Towards High-fidelity Nonlinear 3D
Face Morphable Model. In Proc. IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Luan Tran and Xiaoming Liu. 2018a. Nonlinear 3D Face Morphable Model. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City,
UT.
Luan Tran and Xiaoming Liu. 2018b. On learning 3d face morphable model from
in-the-wild images. IEEE Transactions on Pattern Analysis and Machine Intelligence
(2018).
Liyun Tu, Antonio R Porras, Alec Boyle, and Marius George Linguraru. 2018. Anal-
ysis of 3D Facial Dysmorphology in Genetic Syndromes from Unconstrained 2D
Photographs. In International Conference on Medical Image Computing and Computer-
Assisted Intervention. Springer, 347–355.
Matthew Turk and Alex Pentland. 1991. Eigenfaces for recognition. Journal of Cognitive
Neuroscience 3, 1 (1991), 71–86.
Oliver van Kaick, Hao Zhang, Ghassan Hamarneh, and Daniel Cohen-Or. 2011. A Survey
on Shape Correspondence. Computer Graphics Forum 30, 6 (2011), 1681–1707.
Zdravko Velinov, Marios Papas, Derek Bradley, Paulo Gotardo, Parsa Mirdehghan, Steve
Marschner, Jan Novák, and Thabo Beeler. 2018. Appearance Capture and Modeling
of Human Teeth. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia) 37,
6 (2018), 207:1–207:13.
Thomas Vetter and Tomaso Poggio. 1997. Linear Object Classes and Image Synthesis
from a Single Example Image. IEEE Transactions on Pattern Analysis and Machine
Intelligence 19 (1997), 733–742.
Daniel Vlasic, Matthew Brand, Hanspeter Pster, and Jovan Popović. 2005a. Face
transfer with multilinear models. ACM Transactions on Graphics (Proceedings of
SIGGRAPH) 24, 3 (2005), 426–433.
Daniel Vlasic, Matthew Brand, Hanspeter Pster, and Jovan Popović. 2005b. Face
Transfer with Multilinear Models. ACM Transactions on Graphics 24, 3 (2005),
426–433.
Mirella Walker, Fang Jiang, Thomas Vetter, and Sabine Sczesny. 2011. Universals
and cultural dierences in forming personality trait judgments from faces. Social
Psychological and Personality Science 2, 6 (2011), 609–617.
Mirella Walker, Sandro Schönborn, Rainer Greifeneder, and Thomas Vetter. 2018. The
Basel Face Database: A validated set of photographs reflecting systematic differences
in Big Two and Big Five personality dimensions. PLoS ONE 13, 3 (2018), e0193190.
Mirella Walker and Thomas Vetter. 2009. Portraits made to measure: Manipulating
social judgments about individuals with a statistical face model. Journal of Vision 9,
11 (2009), 12–12.
Christian Wallraven, Volker Blanz, and Thomas Vetter. 1999. 3D-Reconstruction of Faces:
Combining Stereo with Class-Based Knowledge. In Deutsche Arbeitsgemeinschaft
für Mustererkennung Symposium (DAGM), Wolfgang Förstner, Joachim M. Buhmann,
Annett Faber, and Petko Faber (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg,
405–412.
Mengjiao Wang, Yannis Panagakis, Patrick Snape, and Stefanos Zafeiriou. 2017. Learn-
ing the Multilinear Structure of Visual Data. In Proc. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
Mengjiao Wang, Zhixin Shu, Shiyang Cheng, Yannis Panagakis, Dimitris Samaras, and
Stefanos Zafeiriou. 2019b. An Adversarial Neuro-Tensorial Approach for Learning
Disentangled Representations. International Journal of Computer Vision 127 (2019),
743–762.
Ruizhe Wang, Chih-Fan Chen, Hao Peng, Xudong Liu, Oliver Liu, and Xin Li. 2019a.
Digital Twin: Acquiring High-Fidelity 3D Avatar from a Single Image. arXiv preprint
arXiv:1912.03455 (2019).
Yang Wang, Xiaolei Huang, Chan-Su Lee, Song Zhang, Zhiguo Li, Dimitris Samaras,
Dimitris Metaxas, Ahmed Elgammal, and Peisen Huang. 2004. High Resolution
Acquisition, Learning and Transfer of Dynamic 3-D Facial Expressions. Computer
Graphics Forum 23, 3 (2004), 677–686.
Yang Wang, Lei Zhang, Zicheng Liu, Gang Hua, Zhen Wen, Zhengyou Zhang, and
Dimitris Samaras. 2009. Face relighting from a single image under arbitrary unknown
lighting conditions. IEEE Transactions on Pattern Analysis and Machine Intelligence
31, 11 (2009), 1968–1984.
Thibaut Weise, Soen Bouaziz, Hao Li, and Mark Pauly. 2011a. Realtime performance-
based facial animation. ACM Transactions on Graphics (Proceedings of SIGGRAPH)
30, 4 (2011), 77:1–77:10.
Thibaut Weise, Soen Bouaziz, Hao Li, and Mark Pauly. 2011b. Realtime Performance-
based Facial Animation. ACM Transactions on Graphics 30, 4 (2011), 77:1–77:10.
Thibaut Weise, Hao Li, Luc Van Gool, and Mark Pauly. 2009. Face/Off: Live Facial Pup-
petry. In Proc. ACM SIGGRAPH / Eurographics Symposium on Computer Animation
(SCA). ACM, 7–16.
Cyrus A Wilson, Abhijeet Ghosh, Pieter Peers, Jen-Yuan Chiang, Jay Busch, and Paul
Debevec. 2010. Temporal upsampling of performance geometry using photometric
alignment. ACM Transactions on Graphics 29, 2 (2010), 17.
Chenglei Wu, Derek Bradley, Pablo Garrido, Michael Zollhoefer, Christian Theobalt,
Markus Gross, and Thabo Beeler. 2016a. Model-Based Teeth Reconstruction. ACM
Transactions on Graphics (Proceedings of SIGGRAPH Asia) 35, 6 (2016).
Chenglei Wu, Derek Bradley, Markus Gross, and Thabo Beeler. 2016b. An anatomically-
constrained local deformation model for monocular face capture. ACM Transactions
on Graphics (Proceedings of SIGGRAPH) 35, 4 (2016), 115:1–12.
Zexiang Xu, Hsiang-Tao Wu, Lvdi Wang, Changxi Zheng, Xin Tong, and Yue Qi. 2014.
Dynamic hair capture using spacetime optimization. ACM Transactions on Graphics
33, 6 (2014).
Shugo Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Ol-
szewski, Shigeo Morishima, and Hao Li. 2018. High-fidelity facial reflectance and
geometry inference from an unconstrained image. ACM Transactions on Graphics
37, 4 (2018), 162.
Fei Yang, Lubomir Bourdev, Eli Shechtman, Jue Wang, and Dimitris Metaxas. 2012.
Facial expression editing in video using a temporally-smooth factorization. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 861–868.
Ilker Yildirim, Mario Belledonne, Winrich Freiwald, and Josh Tenenbaum. 2020. Efficient
inverse graphics in biological face processing. Science Advances 6, 10 (2020).
Lijun Yin, Xiaochen Chen, Yi Sun, Tony Worm, and Michael Reale. 2008. A high-
resolution 3D dynamic facial expression database. In Proc. International Conference
on Automatic Face and Gesture Recognition. 1–6.
Lijun Yin, Xiaozhou Wei, Yi Sun, Jun Wang, and Matthew J. Rosato. 2006. A 3D Facial
Expression Database for Facial Behavior Research. In Proc. International Conference
on Automatic Face and Gesture Recognition. 211–216.
Ronald Yu, Shunsuke Saito, Haoxiang Li, Duygu Ceylan, and Hao Li. 2017. Learning
Dense Facial Correspondences in Unconstrained Images. In Proc. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR). 4723–4732.
Stefanos Zafeiriou, Gary A Atkinson, Mark F Hansen, William AP Smith, Vasileios
Argyriou, Maria Petrou, Melvyn L Smith, and Lyndon N Smith. 2013. Face recog-
nition and verication using photometric stereo: The photoface database and a
comprehensive evaluation. IEEE Transactions on Information Forensics and Security
8, 1 (2013), 121–135.
Chao Zhang, William Smith, Arnaud Dessein, Nick Pears, and Hang Dai. 2016b. Func-
tional Faces: Groupwise Dense Correspondence using Functional Maps. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Lei Zhang and Dimitris Samaras. 2006. Face recognition from a single training image
under arbitrary unknown lighting using spherical harmonics. IEEE Transactions on
Pattern Analysis and Machine Intelligence 28, 3 (2006), 351–363.
Li Zhang, Noah Snavely, Brian Curless, and Steven M Seitz. 2004. Spacetime faces: high
resolution capture for modeling and animation. In ACM Transactions on Graphics
(Proceedings of SIGGRAPH), Vol. 23. 548–558.
Xing Zhang, Lijun Yin, Jeffrey F. Cohn, Shaun Canavan, Michael Reale, Andy Horowitz,
Peng Liu, and Jeffrey M. Girard. 2014. BP4D-Spontaneous: a high-resolution sponta-
neous 3D dynamic facial expression database. Image and Vision Computing 32, 10
(2014), 692 – 706.
Zheng Zhang, Jerey M. Girard, Yue Wu, Xing Zhang, Peng Liu, Umur Ciftci, Shaun
Canavan, Michaele Reale, Andrew Horowitz, Huiyuan Yang, Jerey F. Cohn, Qiang Ji,
and Lijun Yin. 2016a. Multimodal Spontaneous Emotion Corpus for Human Behavior
Analysis. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
3438–3446.
Guoyan Zheng, Shuo Li, and Gabor Szekely. 2017. Statistical shape and deformation
analysis: methods, implementation and applications. Academic Press.
Yuxiang Zhou, Jiankang Deng, Irene Kotsia, and Stefanos Zafeiriou. 2019. Dense 3D
Face Decoding over 2500FPS: Joint Texture and Shape Convolutional Mesh Decoders.
In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Xiangyu Zhu, Zhen Lei, Junjie Yan, Dong Yi, and Stan Z Li. 2015. High-fidelity pose and
expression normalization for face recognition in the wild. In Proc. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR). 787–796.
Jasenko Zivanov, Andreas Forster, Sandro Schönborn, and Thomas Vetter. 2013. Hu-
man face shape analysis under spherical harmonics illumination considering self
occlusion. In Proc. International Conference on Biometrics (ICB). IEEE, 1–8.
Jasenko Zivanov, Pascal Paysan, and Thomas Vetter. 2009. Facial normal map capture
using four lights – an effective and inexpensive method of capturing the fine scale
detail of human faces using four point lights. In Proc. International Joint Conference on
Computer Vision, Imaging and Computer Graphics Theory and Applications (GRAPP).
Michael Zollhoefer, Justus Thies, Pablo Garrido, Derek Bradley, Thabo Beeler, Patrick
Pérez, Marc Stamminger, Matthias Nießner, and Christian Theobalt. 2018. State of the
Art on Monocular 3D Face Reconstruction, Tracking, and Applications. Computer
Graphics Forum (Eurographics State of the Art Reports) 37, 2 (2018).
Gaspard Zoss, Thabo Beeler, Markus Gross, and Derek Bradley. 2019. Accurate marker-
less jaw tracking for facial performance capture. ACM Transactions on Graphics 38,
4 (2019), 50.
Gaspard Zoss, Derek Bradley, Pascal Bérard, and Thabo Beeler. 2018. An empirical rig
for jaw animation. ACM Transactions on Graphics 37, 4 (2018), 59.
Silvia Zu, Angjoo Kanazawa, and Michael J. Black. 2018. Lions and Tigers and Bears:
Capturing Non-Rigid, 3D, Articulated Shape from Images. In Proc. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society.