The Avatar: 3D Face reconstruction from two orthogonal pictures with application to facial makeover.* 

*Draft report: please refer to the PDF version for the final report.
Nikhil Rasiwasia 

Abstract
In this report an algorithm is presented for the fast reconstruction of a textured 3D face model of an individual from two orthogonal pictures, a frontal view and a profile view. A possible application to the facial makeover of the individual is also conceptualized. The algorithm needs minimal human intervention and does not need any special setup or the camera calibration parameters required by stereo based algorithms. First the facial features are identified and extracted, giving the coordinates of the feature points; then a generic model is deformed using Radial Basis Functions (RBF). The reconstructed model is in the standard Virtual Reality Modeling Language (VRML) format, so that it can be viewed online in common web browsers with the use of a plugin.

1. Introduction
3D face reconstruction has been an active area of research in computer vision and graphics. The need for a fast reconstruction algorithm has emerged in the recent past because of its application to realistic facial animation for low data rate video conferencing. Another important application is in the gaming industry, where demand for realistic-looking games is booming: a game gains popularity if the player can see himself as a part of the game. Facial reconstruction has also been applied to the problem of face recognition. There is also a large demand for such applications in the movie and special effects industry, where a realistic face is animated to obtain the desired effects. Later in this paper two more applications are introduced, with the possibility of turning them into a business plan.

Generating the 3D model is not an easy task. Various algorithms and abundant literature are available on the subject. The approaches fall into three main categories [19]: pure image based rendering techniques, hybrid image based techniques and 3D scanning techniques. Algorithms in the first category generate the 3D model from the images only; they do not try to estimate the real 3D structure but just interpolate between the given images. Hybrid techniques combine some approximate information about the 3D geometry with image based rendering to obtain more accurate results. The aim of both these classes of algorithms is to obtain a coherent view of the real scene, not metric measurements. Algorithms in the third category recover the complete 3D structure. 3D scanning techniques can be active (3D data from a range scanner or coded light) or passive. The passive methods are commonly known as Shape from X methods [22].

The commonly used passive methods are:
* Shape from Motion [20, 21], where one or more videos are used to find correspondences, from which the 3D shape is extracted.
* Shape from Optical Flow, in which establishing correspondences is not important; instead the apparent velocity of the pixels given by the optical flow field is used.
* Shape from Texture, for which there is clear psychophysical evidence that humans use texture to perceive depth. The perspective distortion and the size of the texels extracted by low-level processing are used to find the 3D shape.
* Shape from Contour or Silhouette, which aims to describe a 3D shape as seen from one or more different directions [19].
* Shape from Shading [23], based on the diffuse reflection properties of Lambertian surfaces.
* Shape from Stereo, a widely used method for facial reconstruction [16, 17, 18, 19] that typically consists of three steps: camera calibration, establishing point correspondences between the left and the right image, and reconstruction of the 3D coordinates of the points in the scene.

The algorithms mentioned above suffer from two major problems. Firstly, the quality of the 3D reconstruction from Shape from X algorithms is often not satisfactory. Secondly, most of the algorithms require a special setup or predefined lighting to recover the 3D shape, making them unsuitable for the home environment. Many researchers have successfully tackled the first problem by using a generic model of the face and deforming it to fit the feature points obtained from the previous algorithms. For the second issue, researchers [2, 3, 4, 5] have used two orthogonal views of the subject to simplify the process of obtaining the feature point information. If the two images are taken from arbitrary angles around the head, corner point matching may fail because the skin can be smooth and free of blemishes.

A frontal view and a profile view of the subject have been used for this problem, an arrangement that does not suffer from any of the issues mentioned above. We also use these images to obtain the coordinates of the facial features in three dimensions and then deform a generic model using Procrustes Analysis and Radial Basis Functions.


2.1. Geometric Model Choice
The human face has a basic structure that consists of features such as the nose, mouth and eyes, common to different people. A generic face model can easily encompass such features, but within these features there are differences that distinguish one person from another. A generic model should be modifiable to cover the space of all such faces and should be structurally supportive of facial animation. The wireframe model can be built in software such as 3D Studio Max, Maya or Poser. This system uses a model that incorporates the eyes, nose, mouth, forehead and chin and the areas in between. A complete head model including ears and hair has been deferred in this implementation. The facial model consists of 1683 3D vertices and 3186 faces (Fig. 2.1). The generic model is stored in the Virtual Reality Modeling Language format, also referred to as VRML97 [26]. The advantage of this format is that the model can be viewed from different angles using commonly available plugins for standard browsers.


2.2. Importing the Images
The frontal and profile views can be obtained with a stereo camera setup as explained in [5], but for simplicity we also allow images from a handheld camera. These images are aligned for processing as described in the next step. This relaxation introduces an error, because the images are treated as orthogonal projections whereas in reality they are perspective images. A correction factor can be introduced in the later stages to minimize the error due to this assumption. Another important constraint is that the images are taken under normal white light against a background free of skin-colored objects. The background may or may not be cluttered (Fig 2.3).


2.3. Alignment of the profile view
If the images are not taken from a stereo camera setup, they need to be aligned by scaling, rotation and translation so that corresponding features in the frontal and profile views lie on the same horizontal line. The user is asked to mark 4 specific points (nose, eye, ear and mouth) in each image, and the transformations are calculated from these points.


3. Image based Facial Feature Point extraction
The X, Y and Z coordinates of the 35 feature points are extracted from the frontal and profile views. The frontal view gives the X and Y coordinates (xf, yf). The profile view gives the Z coordinate along with the Y coordinate (zp, yp). This Y coordinate is approximately the same as yf as a result of the alignment done in section 2.3. The final positions of the feature points are:

(xf[i], (yf[i] + yp[i])/2, zp[i]), where i = 1, 2, 3, …, 35.    (3.1)

In the current implementation 9 feature points (4 for each eye and 1 for the mouth) are extracted automatically, as explained in sections 3.2 and 3.3. For this the face region is first identified in both images. The remaining 26 feature points are found by transforming a predetermined set of 35 feature points, derived from the generic model, such that the squared distance between the 9 automatically extracted points and their counterparts in the set is minimal.
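As a small illustration of Eq. 3.1, the sketch below merges the per-point frontal and profile measurements into 3D coordinates. The function name `combine_views` and the tuple layout of its inputs are assumptions made for this example, not part of the described system.

```python
def combine_views(frontal, profile):
    """Merge (xf, yf) and (zp, yp) pairs into (x, y, z) triples as in Eq. 3.1.

    frontal: list of (xf, yf) tuples, one per feature point (hypothetical layout).
    profile: list of (zp, yp) tuples in the same point order (hypothetical layout).
    """
    if len(frontal) != len(profile):
        raise ValueError("both views must provide the same number of feature points")
    points = []
    for (xf, yf), (zp, yp) in zip(frontal, profile):
        # X comes from the frontal view, Z from the profile view,
        # and Y is the average of the two (nearly equal) measurements.
        points.append((xf, (yf + yp) / 2.0, zp))
    return points
```

Averaging the two Y values only makes sense after the alignment of section 2.3, since it assumes yf and yp measure the same height on the face.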


3.1. Localization of the Face
Experience suggests that the human face has a distinctive color that is generally not found in nature or in common man-made objects. Face localization based on color allows for fast processing (Fig 3.1). It is robust to geometric variations of the skin pattern and to partial occlusion. This project employs a simple pixel based skin model that explicitly defines the skin region [25]. The model has been slightly modified to tighten the conditions for a pixel to be classified as skin. It is based on the knowledge that in RGB space skin has a higher red content than the other components. (R,G,B) is classified as skin if: R > 95 and G > 40 and B > 20 and
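Only the part of the rule that is stated above can be sketched here. The example below applies the three thresholds given in the text and, as an explicit assumption, reads "higher red content than the other components" as R > G and R > B. The remaining conditions of the modified model are not reproduced, so this is a necessary-condition test rather than the complete classifier; the function name is hypothetical.

```python
def is_skin_candidate(r, g, b):
    """Partial pixel-based skin test for 8-bit (R, G, B) values.

    Implements only the conditions stated in the text; the full rule
    used by the system contains further conditions not shown here.
    """
    # Thresholds stated in the text.
    passes_thresholds = r > 95 and g > 40 and b > 20
    # Assumption: red dominance interpreted as R > G and R > B.
    red_dominant = r > g and r > b
    return passes_thresholds and red_dominant
```

A pixel that fails this test cannot be skin under the stated thresholds; a pixel that passes would still need to satisfy the rest of the rule.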


3.2. Extraction of the Eyes feature points from frontal image.
The first step in determining the feature points is to localize the eye area (containing both eyes) within the face. The image of the subject is converted to a binary image and a template matching algorithm is applied [12, 13, 14]. Since template matching is time-intensive, the resolution of the face image is reduced to a width of 64 px. The template (Fig. 3.2) is generic and captures the general properties of the eye and its surrounding area. Each eye region is extracted from the localized area and vertical edge detection is performed using the Prewitt operator. The only vertical edges in and near the eye region belong to the eye itself (iris and eye boundary); all surrounding regions are either smooth or have strong horizontal components in their edges (eyebrows). The region with the most vertical edges gives the bounding box of the eye. From this the X and Y coordinates of the 4 points of each eye can be calculated easily (Fig. 3.3).
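The vertical edge detection step can be sketched with NumPy as follows. The Prewitt response is standard; how the "region with the most vertical edges" is turned into a bounding box is not specified in the text, so the simple thresholding in `eye_bounding_box` (and its `threshold` parameter) is an assumption made for illustration only.

```python
import numpy as np

def prewitt_vertical_edges(gray):
    """Absolute response of the Prewitt operator that detects vertical edges."""
    g = np.asarray(gray, dtype=float)
    # Sum each pixel with its upper and lower neighbours (Prewitt smoothing).
    col_sum = g[:-2, :] + g[1:-1, :] + g[2:, :]
    # Right column minus left column (Prewitt horizontal derivative).
    return np.abs(col_sum[:, 2:] - col_sum[:, :-2])

def eye_bounding_box(gray, threshold):
    """(top, bottom, left, right) of pixels with a strong vertical-edge response.

    Assumption: the eye is taken to be the extent of all responses above
    `threshold`; returns None when no response exceeds it.
    """
    edges = prewitt_vertical_edges(gray)
    rows, cols = np.nonzero(edges > threshold)
    if rows.size == 0:
        return None
    # +1 compensates for the one-pixel border lost by the 3x3 operator.
    return (int(rows.min()) + 1, int(rows.max()) + 1,
            int(cols.min()) + 1, int(cols.max()) + 1)
```

Under this assumption, the 4 points of an eye could be read off the box as the midpoints of its four sides.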


3.3. Extraction of the Mouth
Once the eye region is correctly identified, finding the gross mouth region is straightforward. A rectangular region is cropped from the face image with the outer extremities of the eyes as the left and right boundaries and the lower edge of the eyes as the upper boundary. A horizontal histogram technique is used to find the location of the mouth in this region. We first apply a gray level morphological image reconstruction to enhance the gray levels of the mouth and then convert the image to binary. Dark shadows and the background around the chin and neck interfere with the process, so such regions are removed by deleting the black connected components that touch the image boundary. The location of the mouth is given by the first peak from the top that exceeds a certain threshold in the horizontal histogram (Fig 3.4). The center of the mouth is determined from the vertical histogram of the binary image of the localized mouth region.
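The histogram step can be sketched as below for a binary image in which dark (mouth) pixels are 1. The text does not say how the center is read off the vertical histogram, so taking its centroid is an assumption, as are the function name and the `half_height` parameter that defines the localized mouth band. The `threshold` is assumed to be non-negative.

```python
import numpy as np

def locate_mouth(binary, threshold, half_height=1):
    """Return (row, center_col) of the mouth, or None if no row exceeds threshold."""
    mask = np.asarray(binary, dtype=bool)
    # Horizontal histogram: number of dark pixels in every row.
    row_hist = mask.sum(axis=1)
    above = np.nonzero(row_hist > threshold)[0]
    if above.size == 0:
        return None
    # Start at the first row above the threshold and climb to the top of that peak.
    row = int(above[0])
    while row + 1 < row_hist.size and row_hist[row + 1] > row_hist[row]:
        row += 1
    # Vertical histogram over a small band of rows around the peak.
    top = max(row - half_height, 0)
    bottom = min(row + half_height + 1, mask.shape[0])
    col_hist = mask[top:bottom].sum(axis=0)
    # Assumption: the mouth center is the centroid of the vertical histogram.
    center_col = float((np.arange(col_hist.size) * col_hist).sum() / col_hist.sum())
    return row, center_col
```

The threshold keeps isolated dark specks above the mouth from being mistaken for the first peak.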


3.4. Extraction of other features
The X, Y coordinates of the 9 points are used to deform a predefined set of 35 points derived from the generic model. The corresponding 9 points of the predefined set are used to calculate the scaling and translation in the X and Y dimensions and the rotation. The problem can be formulated as minimizing the squared distance between the 9 model points after transformation and the observed points. The transformation is written in homogeneous coordinates, where x(i), y(i) are points from the predefined model. The function to be minimized is the sum over the 9 points of the squared distances between (x(i)mod, y(i)mod) and (a(i), b(i)), where x(i)mod and y(i)mod are given by Eq 3.3 and (a(i), b(i)) are the coordinates of the points obtained from the frontal image. For the profile image, the tip of the nose is found using the skin model developed in section 3.1. Using other heuristics, a coarse estimate of (zp, yp) can be obtained.
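The formulation above can be sketched in NumPy as a homogeneous transformation plus the squared-distance objective. The order in which scaling, rotation and translation are composed is not recoverable from the text, so the order used here (scale, then rotate, then translate) and the parameter names are assumptions. The objective can be handed to any general-purpose minimizer.

```python
import numpy as np

def transform_matrix(sx, sy, theta, tx, ty):
    """3x3 homogeneous matrix: scale by (sx, sy), rotate by theta, translate by (tx, ty).

    Assumption: this composition order is chosen for illustration only.
    """
    c, s = np.cos(theta), np.sin(theta)
    scale = np.diag([sx, sy, 1.0])
    rotate = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    translate = np.array([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])
    return translate @ rotate @ scale

def squared_distance(params, model_xy, observed_ab):
    """Sum of squared distances between transformed model points and observed points."""
    model_xy = np.asarray(model_xy, dtype=float)
    observed_ab = np.asarray(observed_ab, dtype=float)
    # Append a column of ones to express the points in homogeneous coordinates.
    homogeneous = np.column_stack([model_xy, np.ones(len(model_xy))])
    transformed = (transform_matrix(*params) @ homogeneous.T).T[:, :2]
    return float(np.sum((transformed - observed_ab) ** 2))
```

Once the five parameters are found from the 9 automatically extracted points, the same matrix would be applied to all 35 points of the predefined set to obtain the coarse estimate of the remaining 26.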


4. Interactive Facial Feature Point Editing
Even if all 35 points are obtained using automatic methods, some errors may creep in due to noise or ill-posed images. Since this system obtains only a coarse estimate of 26 of the points by automated methods, an interactive technique is needed to fine-tune the coordinates. A GUI has been developed in Matlab to facilitate this fine tuning. Initially the user is shown the frontal image with the 35 feature points marked (Fig 4.1). The coordinates of these points can be changed by simply dragging them over the image. A prototype model is displayed alongside as a reference for the tuning process.


5. Generic Model Deformation
Generic model deformation is in fact a problem of 3D space deformation. A set of control points is identified in this space and their displacements are computed. An interpolation function is then chosen that accommodates the displacements of the control points. The other points in the 3D space, in particular the vertices of the generic model, are then computed using the interpolation function. We follow a two step process for deformation. In the first step, global deformation, the entire set of 3D model vertices is brought as close as possible to the feature points while treating the model as a rigid body. In the second step the 3D model is treated as a non-rigid object and its points are matched in a more realistic manner, moving closer still to the feature points.


5.1. Global Deformations
To achieve the goals of global deformation, the 3D model needs to be translated, scaled and rotated to match the feature points. This is similar to the problem tackled in section 3.4, except that the dimension increases from 2D to 3D and the number of points from 9 to 35. Solving it with the optimization approach is computationally expensive. A computationally simpler approach is to use Procrustes Analysis [5, 27]. One drawback is that the scale factors in the X, Y and Z coordinates are not independent and take an averaged value (Fig 5.1). In Eq. 5.1, Pcalculated is a 35 x 3 matrix containing the feature points from the face images and Pmodel contains the corresponding points from the model.
a) Compute the mean of both Pmodel and Pcalculated
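As a hedged illustration, a textbook SVD-based Procrustes fit (a single scale, a rotation and a translation) can be sketched as follows. It begins with step a) above, but the remaining steps are the standard formulation and not necessarily the exact sequence used in this system; the function names are hypothetical. Points are stored as rows, matching the 35 x 3 layout of Pmodel and Pcalculated.

```python
import numpy as np

def procrustes_align(p_model, p_calculated):
    """Return (scale, rotation, translation) mapping p_model onto p_calculated.

    The mapping is: scale * p_model @ rotation + translation.
    """
    p_model = np.asarray(p_model, dtype=float)
    p_calculated = np.asarray(p_calculated, dtype=float)
    # Step a): compute the mean of both point sets and centre them.
    mean_model = p_model.mean(axis=0)
    mean_calc = p_calculated.mean(axis=0)
    model_c = p_model - mean_model
    calc_c = p_calculated - mean_calc
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, sing, vt = np.linalg.svd(model_c.T @ calc_c)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    flip = np.diag([1.0] * (p_model.shape[1] - 1) + [d])
    rotation = u @ flip @ vt
    # One scale factor shared by X, Y and Z (the drawback noted in the text).
    scale = (sing * np.diag(flip)).sum() / (model_c ** 2).sum()
    translation = mean_calc - scale * mean_model @ rotation
    return scale, rotation, translation

def apply_similarity(points, scale, rotation, translation):
    """Apply the fitted transform to any set of row-vector points."""
    return scale * np.asarray(points, dtype=float) @ rotation + translation
```

The transform is estimated from the 35 feature points only and would then be applied with `apply_similarity` to all vertices of the generic model.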


5.2. Local Deformations
In this second step a more realistic model is generated, in which the feature points of the generic model are mapped exactly onto the calculated feature points. An interpolation function is estimated from the displacements of the feature points, and the remaining points are mapped using this function. Constructing such an interpolation function is a standard problem in scattered data interpolation, and several techniques exist in the literature [9, 10]. An intuitive way to tackle the problem is to use distance weighted methods. Another approach is to construct the interpolant as a linear combination of basis functions and then determine the coefficients of the basis functions.
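The basis function approach can be sketched as a radial basis function interpolation of the feature point displacements. The choice of basis function is not given in this section, so the linear kernel phi(r) = r used below is an assumption, as are the function and parameter names; no polynomial term is included, to keep the sketch short.

```python
import numpy as np

def rbf_deform(control_src, control_dst, vertices):
    """Move `vertices` by an RBF interpolant of the control point displacements.

    control_src: feature points on the (globally aligned) generic model.
    control_dst: the corresponding calculated feature points.
    vertices:    model vertices to deform.
    """
    control_src = np.asarray(control_src, dtype=float)
    control_dst = np.asarray(control_dst, dtype=float)
    vertices = np.asarray(vertices, dtype=float)

    def kernel(a, b):
        # Assumption: linear basis function phi(r) = r on pairwise distances.
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # Coefficients chosen so the interpolant reproduces every control displacement.
    weights = np.linalg.solve(kernel(control_src, control_src),
                              control_dst - control_src)
    # Evaluate the interpolated displacement at each vertex and apply it.
    return vertices + kernel(vertices, control_src) @ weights
```

By construction the control points land exactly on their targets, while all other vertices receive a smoothly blended displacement.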


6. Texture Mapping
The final step in the reconstruction is texturing the deformed 3D model using the frontal image. We do not use the profile image for texturing because ears and hair are not part of the generic model to be textured. The texture coordinates are saved with the generic model in the VRML format. The task is to deform the texture coordinates so that the model features overlap the actual features in the frontal image. Procrustes analysis and RBF can again be used, now in 2 dimensions, for this transformation, as the actual coordinates of the feature points are already known.


7. Conclusion and Future work
In this paper we have described an image based approach to reconstructing a 3D face model from two orthogonal images, a frontal view and a profile view. A generic model has been successfully deformed to generate an individualized model of the subject. A sub-process of this required calculating the feature points on the face, 9 of which are currently calculated automatically. Finally the texture is mapped onto the model to make it more realistic.

Possible directions for future work include:
* Using a higher resolution model of the face with more points and vertices.


8. References
[1] J. Ahlberg, "CANDIDE-3, an updated parameterized face," Technical Report No. LiTH-ISY-R-2326, Dept. of Electrical Engineering, Linköping University