Automatic landmark annotation and dense correspondence registration for 3D human facial images

Background: Traditional anthropometric studies of the human face rely on manual measurements of simple features, which are labor intensive and support only limited inference. Dense surface registration of three-dimensional (3D) human facial images holds great potential for high-throughput quantitative analyses of complex facial traits. However, automatic high-density registration methods for 3D facial images are lacking. Furthermore, current approaches to landmark recognition require further improvement in accuracy to support anthropometric applications. Results: Here we describe a novel non-rigid registration method for fully automatic 3D facial image mapping. This method comprises two steps: first, seventeen facial landmarks are automatically annotated, mainly via PCA-based feature recognition following a 3D-to-2D data transformation. Second, an efficient thin-plate spline (TPS) protocol is used to establish dense anatomical correspondence between facial images, under the guidance of the predefined landmarks. We demonstrate that this method is highly accurate in landmark recognition, with an average RMS error of ~1.7 mm. The registration process is highly robust, even across different ethnicities. Conclusion: This method supports fully automatic registration of dense 3D facial images, with 17 landmarks annotated at greatly improved accuracy. Stand-alone software has been implemented to assist high-throughput, high-content anthropometric analysis.


Introduction
Large-scale, high-throughput phenotyping is becoming increasingly important in the post-genomics era. Advanced image processing technologies are increasingly used to collect deep and comprehensive morphological data from different organisms, such as yeast [1], plants [2], and mice [3]. The soft tissue of the human face is a complex geometric surface composed of many important organs, including the eyes, nose, ears, and mouth. Given its essential biological functions, the human face has been a key research subject in a wide range of fields including anthropology [4], medical genetics [5,6,7,8], forensics [9,10], psychology [11,12], and aging [13,14]. With the development of non-invasive 3D image acquisition technologies, high resolution 3D data of the human face are becoming readily available for various applications (www.3dmd.com).
In many research fields, only a small fraction of the high-resolution 3D image data is used, which usually comprises a set of landmarks and/or their mutual distances and angles [4,15,16]. However, methods have been developed to register the 3D surfaces using their dense surface meshes, which allows the inclusion of the complete data set for powerful inferences and analyses [8,17,18]. In general, surface registration methods can be classified into two groups: rigid and non-rigid techniques. The former aligns surfaces by rigid transformation, e.g., rotation and translation, while the latter employs deformations to get a close alignment between surfaces. For rigid registration, the Iterative Closest Point (ICP) algorithm is the most widely used [19]. However, as the affine transformations do not fully capture the anatomical variability, the closest corresponding points between two surfaces calculated by the ICP algorithm are not necessarily biologically homologous, especially when comparing faces which differ significantly. In order to establish the anatomical correspondence more effectively, it is necessary to employ non-rigid transformations. A common method for deforming 3D surfaces is the thin-plate spline (TPS) algorithm [20]. The process of using TPS warping involves minimizing a bending energy function for a transformation over a set of fiducial points (landmarks), thereby bringing the corresponding fiducial points on each surface into alignment with each other.
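To make the rigid case concrete, a minimal ICP sketch (an illustration, not the cited implementation) alternates closest-point matching with a least-squares rigid fit:

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(A, B):
    """Least-squares rotation R and translation t mapping points A onto B (Kabsch/SVD)."""
    cA, cB = A.mean(0), B.mean(0)
    H = (A - cA).T @ (B - cB)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cB - R @ cA

def icp(source, target, n_iter=20):
    """Rigidly align `source` to `target` by iterating closest-point matching."""
    tree = cKDTree(target)
    src = source.copy()
    for _ in range(n_iter):
        _, idx = tree.query(src)               # closest target point per source point
        R, t = best_rigid_transform(src, target[idx])
        src = src @ R.T + t
    return src
```

Note how the matching step pairs each source point with its geometrically nearest target point; as the text observes, nothing guarantees those pairs are biologically homologous.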
Although it is a powerful method, TPS has a key drawback that limits its use in the analysis of large-scale, open 3D facial datasets: namely, it requires a set of landmarks to be annotated first. For many existing registration methods, landmarks have to be manually labeled on the facial surfaces [21,22,23,24], which is time-consuming and introduces human errors. Methods have been developed that combine ICP-based landmark annotation and TPS warping to automate the registration [25,26]. However, the landmark correspondences found by ICP are not exactly anatomically homologous, as previously discussed. Many automatic landmark localization methods exist [9,27,28,29,30,31]. At some stage, most of these approaches use local, curvature-based facial features due to their invariance to surface translation and rotation. The two most frequently adopted features are the HK curvature and the shape index [27,28,29,32].
However, curvature-based descriptors often suffer from surface irregularities, especially near eye and mouth corners [33]. Other studies have used appearance-based methods where the facial features are modeled by basis vectors calculated from transformations such as Principal Component Analysis (PCA) [34,35], Gabor wavelets [36,37], or the Discrete Cosine Transform (DCT) [31].
In this study, we develop an automatic registration method which combines a novel workflow of landmark localization and an efficient protocol of TPS-based surface registration. The landmark localization mainly employs PCA to extract landmarks on surfaces by use of both shape and texture information. For the surface registration, a new TPS warping protocol that avoids the complication of inverse TPS warping (a compulsory procedure in the conventional registration method) is used to resample the meshes according to the reference mesh. We show that this method is highly accurate and robust across different ethnicities. We also propose a new spherical resampling algorithm for re-meshing surfaces which efficiently removes defects such as holes and improves the mesh structure. Furthermore, the associated texture is also included in the registered data for visualization and various analyses.

Ethics Statement
Sample collection in this study was carried out in accordance with the ethical standards of the ethics committee of the Shanghai Institutes for Biological Sciences (SIBS) and the Declaration of Helsinki, and has been specifically surveyed and approved by SIBS. A written statement of informed consent was obtained from every participant, with his/her authorizing signature. The participants, whose transformed facial images are used in this study as necessary illustrations of our methodology, have been shown the manuscript and corresponding figures. Aside from the informed consent for data sampling, a consent for publication was shown and explained to each participant, and their authorizing signature was obtained as well.

The 3D face data set
Three-dimensional facial images were acquired from individuals of age 18 to 28 years old, among which 316 (114 males and 202 females) were Uyghurs from Urumqi,

Workflow
The workflow is briefly described as follows (figure 2). Starting with a set of raw 3D face scans, the nose tip is first automatically localized on each face using a sphere fitting approach, and pose normalization is performed to align all sample faces to a uniform frontal view. For the landmark annotation, the six most salient landmarks were first manually labeled on a set of training samples; Principal Component Analysis (PCA) was then employed to localize these 6 landmarks on the sample surfaces, and 11 additional landmarks were heuristically annotated afterwards. A reference face was then chosen, re-meshed using spherical sampling, and TPS-warped to each sample face using the 17 landmarks as fiducial points. A dense, biological correspondence was thus built by re-meshing the sample face according to the reference face. The correspondence is further improved by using the average face as the reference and repeating the registration process.

Preliminary nose tip localization and pose normalization
In 3D facial image processing, pose normalization and landmark localization are highly dependent on each other since pose normalization is typically guided by landmarks. The features commonly used for pose correction are the nose tip and inner eye corners as they are easier to detect [27], less sensitive to pose variation, and invariant to facial expressions [32,34,38,39]. On the other hand, most existing landmark localization approaches rely on the assumption of frontal or approximately frontal poses and are therefore sensitive to roll and yaw rotation [30,32,40]. In order to fully automate the pose normalization and landmark annotation, we first identify the most robust and prominent landmark, the nose tip.
Since the area around the nose tip can be approximated as a semi-sphere with a diameter specific to the nose, we identify the nose tip by fitting a sphere around every vertex using its surrounding vertices. A vertex is likely the nose tip if its neighboring points fit a sphere very well and the sphere diameter approaches the value specific to the nose tip. As this method is insensitive to the pose of the face, the spherical area on the nose tip can be seen as a rotation invariant descriptor (RID). The algorithm is described as follows. Let F = {p_i}, i = 1, …, N, denote a facial mesh composed of N points. The best-fit sphere T around a point p_j is determined by two parameters. The optimal nose-tip radius r_0 was estimated empirically and showed little variance across the four test groups of different sex and ethnicity. The two spherical parameters can then be combined with the optimal radius r_0 into one statistic f (equation 1) which describes how well a given point fits the criteria for a nose tip. The f value should be very small around the nose tip region, and this is indeed what we found. Once the nose tip has been located, pose correction becomes straightforward. Correcting the pose basically consists of resetting the viewing coordinate system, for which an origin point and two axes must be defined. In some studies, ICP matches are applied [41,42].
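The sphere-fitting step can be sketched as follows. This is a hedged reconstruction: `fit_sphere` is a standard algebraic least-squares fit, while `nose_tip_score` is only one plausible form of the statistic f, since equation (1) is not reproduced in this excerpt (both the combination rule and the default r0 value are our assumptions):

```python
import numpy as np

def fit_sphere(pts):
    """Algebraic least-squares sphere fit: |p - c|^2 = r^2 rewritten as a
    linear system in the center c and the quantity r^2 - |c|^2."""
    A = np.c_[2 * pts, np.ones(len(pts))]
    b = (pts ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:3]
    radius = np.sqrt(sol[3] + center @ center)
    rms = np.sqrt(np.mean((np.linalg.norm(pts - center, axis=1) - radius) ** 2))
    return center, radius, rms

def nose_tip_score(pts, r0=10.0):
    """Hypothetical statistic f: small when a vertex neighbourhood fits a
    sphere well AND the fitted radius is close to the nose-specific optimum
    r0 (an assumed stand-in for the paper's equation (1))."""
    _, r, rms = fit_sphere(pts)
    return rms + abs(r - r0)
```

Scanning `nose_tip_score` over every vertex's neighbourhood and taking the minimum mimics the RID search described above.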
Other studies try to find landmarks (i.e. inner eye corners) other than the nose tip to determine the pose [20,30]. However, in this study we followed a rather practical solution in which all vertices within 50mm of the nose tip are used to correct the pose via the Hotelling transformation [41,42].
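The Hotelling transformation amounts to rotating the nose-tip neighbourhood into its principal axes. A minimal sketch (function name and axis ordering are our choices):

```python
import numpy as np

def hotelling_transform(pts):
    """Rotate a point cloud into its principal axes (Hotelling / PCA
    transform). Returns the centred, axis-aligned points and the rotation."""
    centred = pts - pts.mean(axis=0)
    cov = centred.T @ centred / len(pts)
    _, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    R = eigvecs[:, ::-1]               # largest-variance axis first
    if np.linalg.det(R) < 0:           # keep a proper rotation
        R[:, -1] *= -1
    return centred @ R, R
```

In the protocol above, this would be applied to the vertices within 50 mm of the detected nose tip, and the resulting rotation applied to the whole face.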

Localization of the six most salient landmarks using PCA
Here we propose a novel landmark localization method. The basic idea is to transform the 3D shape and texture data into a 2D space. A 2D PCA algorithm is then used to identify the six most salient landmarks, namely the inner and outer corners of the eyes and both corners of the mouth (figure 1b). Only the gray scale values are used as color information for this step. For any 3D face image, the plane spanned by the x and y axes is taken as the target 2D space. The 3D surface and its corresponding 2D texture are then resampled on a uniform square grid at 1 mm resolution to obtain the corresponding z coordinate values and gray scale values.
These values are directly mapped to the target 2D space (figure 4). To minimize data noise, the z coordinate and gray scale values are de-noised using a 3×3 decision-based median filter [29]. For areas where the 3D surface folds into multiple layers along the z-axis, only the values of the outermost layer are transformed to 2D, following the z-buffering algorithm [43]. Holes that may occur inside the surface are closed by bicubic interpolation as previously described [44]. The interpolation is done separately on the texture and the 2.5D image data. The resulting 2D image combines both shape and texture information, which serves as the basis for the PCA-based landmark localization. PCA is a commonly used approach for accurate pattern recognition in 2D data [24,43,45]. It involves retrieving feature signatures by dissecting the training data with PCA, followed by projecting the sample data into the PCA eigenspace to determine similarity. In this study, the landmark signature is obtained by defining a patch of a given size, say s mm × s mm, around each training landmark; the z coordinate values and gray scale values within the patch are combined into one patch vector P that specifies the shape and texture properties around the landmark (equation 2). The signature eigenspace U is then calculated from the training patch vectors using PCA (see Appendix II for details). To find the landmarks in a sample face, a patch of s mm × s mm is similarly defined for every point in the corresponding 2D grid, and a sample patch vector P_s is derived following equation (2). P_s is subsequently projected into the space U to evaluate its closeness to the origin of U; two measurements of closeness are used (see Appendix II). Noticing that the region around the nasion point is approximately saddle-shaped and that around the chin point is ellipsoidal or sphere-shaped, both characterized by two-way symmetry, we locate these two points by finding the maximum local symmetry scores. The earlobe points are easily found by locating the tips with sheer slopes along the z-axis.
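The patch-eigenspace step can be sketched as follows, assuming two standard eigenface-style closeness measures, distance in feature space and distance from feature space (reconstruction residual); the paper's exact equation (2) and its two measures may differ from this sketch:

```python
import numpy as np

def build_eigenspace(patches, k=16):
    """PCA eigenspace from training patch vectors (one patch per row).
    Returns the mean patch and the top-k eigenvectors, computed via SVD."""
    mean = patches.mean(axis=0)
    _, _, Vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, Vt[:k].T              # columns span the eigenspace U

def closeness(p, mean, U):
    """Two closeness measures for a candidate patch vector p:
    distance IN feature space (norm of the weight vector) and
    distance FROM feature space (reconstruction residual)."""
    w = U.T @ (p - mean)               # weight vector in U
    difs = np.linalg.norm(w)
    dffs = np.linalg.norm((p - mean) - U @ w)
    return difs, dffs
```

Scanning `closeness` over every grid point and taking the best-scoring location yields one landmark candidate per signature.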

Spherical resampling and surface remeshing
During 3D image acquisition, the surface meshes often suffer from structural irregularities and/or defects such as mesh holes (see figure 1c for an example). Surface remeshing is often used to solve such problems [46]. In this work, we apply spherical sampling to the reference surface to obtain a well-structured mesh. Spherical sampling is preferred because human faces are approximately ellipsoidal. We first perform a spherical parameterization of the surface using geographic coordinates, converting each vertex from Cartesian coordinates to a longitude θ and a latitude φ. The x-coordinate is multiplied by a factor of 2 before the coordinate conversion, to compensate for the face aspect ratio (height to width) [42]. When plotted against θ and φ, the parameterized surface unfolds into a nearly flat plane. This surface is then trimmed with an oval path to remove the irregular edges and re-sampled on a uniform square grid with an interval of 0.005 for both θ and φ. The re-sampled data points are then converted back to the Cartesian coordinate system to define a new surface mesh.
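A sketch of the spherical re-sampling, assuming the cloud is already centred on the face (e.g. after pose normalization) and using linear interpolation of the radius over the (θ, φ) grid; the interpolation scheme and the exact axis conventions are our assumptions:

```python
import numpy as np
from scipy.interpolate import griddata

def spherical_resample(verts, step=0.005, stretch=2.0):
    """Re-mesh a face-like point cloud on a uniform (theta, phi) grid.
    Assumes a centred cloud; x is scaled by `stretch` before conversion
    to compensate for the face aspect ratio, as in the text."""
    x, y, z = stretch * verts[:, 0], verts[:, 1], verts[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    theta = np.arctan2(x, z)                     # longitude around vertical axis
    phi = np.arcsin(np.clip(y / r, -1.0, 1.0))   # latitude
    ti = np.arange(theta.min(), theta.max(), step)
    pj = np.arange(phi.min(), phi.max(), step)
    T, P = np.meshgrid(ti, pj)
    R = griddata((theta, phi), r, (T, P), method='linear')  # NaN outside hull
    X = R * np.cos(P) * np.sin(T) / stretch      # undo the x stretch
    Y = R * np.sin(P)
    Z = R * np.cos(P) * np.cos(T)
    return np.stack([X, Y, Z], axis=-1)
```

Trimming the grid with an oval mask before the back-conversion, as described above, would remove the ragged NaN border.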

Surface registration for dense correspondence
In order to preserve the anatomical correspondence across the facial surfaces, we adopted the idea of the TPS-based registration method proposed previously [23]. In that study, all surfaces were first manually annotated for a set of landmarks. The sample surfaces and the reference were all TPS warped to the cross-sample average landmarks.
Each sample surface was then re-meshed by the closest points to the reference vertices, and further inverse TPS warped back to the original shape. Mathematically, TPS warping is not invertible. Although an approximation exists, it is computationally intensive and error prone [47]. In our study, we designed an alternative scheme. First, a well-structured surface with few defects is chosen as the reference face, and spherically remeshed as described above. Then only the reference surface is TPS warped to each sample surface, taking the 17 landmarks as the fiducial points. The TPS warping is done as previously described [48]. Thereafter the vertices on the reference surface find their closest projections on the sample surface, which define the new mesh vertices of the sample surface [48,49]. The dense correspondence is established after all the sample surfaces are remeshed using the same reference. This approach eliminates the need for inverse TPS warping, and enhances the computational efficiency as well.
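The reference-to-sample warping step can be sketched with SciPy's thin-plate-spline radial basis function. This is a sketch of the protocol, not the authors' code: where the paper projects warped vertices onto the sample surface, this sketch simply takes the nearest sample vertex:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.spatial import cKDTree

def tps_register(ref_verts, ref_landmarks, sample_verts, sample_landmarks):
    """Warp the reference surface onto a sample with a 3D thin-plate spline
    anchored at corresponding landmarks, then re-mesh the sample by taking,
    for each warped reference vertex, the closest sample vertex."""
    tps = RBFInterpolator(ref_landmarks, sample_landmarks,
                          kernel='thin_plate_spline')
    warped = tps(ref_verts)
    # vertex-level closest-point step (the paper projects onto the surface)
    _, idx = cKDTree(sample_verts).query(warped)
    return sample_verts[idx]
```

Because only the reference is deformed, no inverse warp is ever needed: the re-meshed sample keeps its original, undeformed coordinates.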

Accuracy of the landmark localization
In this section we demonstrate the accuracy of the proposed algorithm for automatic landmark localization. The accuracy is measured by the deviation of the automatically annotated landmarks from those manually annotated.

Robustness of the registration method
One way to evaluate the robustness of the registration method is to determine how the use of different references would affect the correspondence mapping. We performed such an evaluation, as shown in figure 5. First, we obtained the average Han Chinese

Calculating average faces with the 3D face registration
We applied the proposed 3D face registration method to the whole 3D face sample set.
In total, 363 male and 321 female Han Chinese and 114 male and 202 female Uyghur participants were included in this analysis. All surfaces were automatically annotated. One Han Chinese face with few defects and fully exposed skin was chosen as the reference, to which all the sample faces were registered. Generalized Procrustes Analysis (GPA) was then used to align all the registered surfaces to a common coordinate system [50]. The average faces were then calculated as the average meshes colored by the corresponding average texture pixels across all the samples in each group. Figure 6 shows the average faces of the four groups. As can be seen, the average faces well retain the morphological and textural features of each group. Specifically, the Uyghurs are characterized by a more protruding nose and eyebrow ridges, while the Han Chinese have wider cheeks. The skin pigmentation also appears lighter for the Uyghurs than for the Han Chinese, although this difference could not be confirmed because the environmental lighting was not well controlled when the 3D images were taken.
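The GPA alignment step can be sketched as follows; this is a minimal rotation-and-translation-only variant, and the scaling treatment in [50] may differ:

```python
import numpy as np

def procrustes_align(shape, ref):
    """Optimal rotation (after centring) of `shape` onto `ref`, no scaling."""
    A = shape - shape.mean(axis=0)
    B = ref - ref.mean(axis=0)
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    if np.linalg.det(R) < 0:           # keep a proper rotation
        U[:, -1] *= -1
        R = U @ Vt
    return A @ R

def gpa(shapes, n_iter=10):
    """Generalized Procrustes Analysis: iteratively align every shape
    to the current mean shape until the mean stabilizes."""
    aligned = [s - s.mean(axis=0) for s in shapes]
    for _ in range(n_iter):
        mean = np.mean(aligned, axis=0)
        aligned = [procrustes_align(s, mean) for s in aligned]
    return aligned, np.mean(aligned, axis=0)
```

With all registered surfaces sharing the same vertex count and ordering, the returned mean is the average mesh used for the average faces.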

Discussion
In this work we propose a fully automatic registration method for high resolution 3D facial images. This method combines automatic landmark annotation and TPS-based registration. Up to now, there have been few studies that have fully automated the entire dense 3D facial surface registration process. Methods have been developed to combine the ICP-based landmark localization with the TPS-based registration [25,26].
Nonetheless, ICP may fail to detect biological correspondence when surfaces differ greatly in shape [26]. Most established solutions for landmark localization deal with only 2.5D data, leaving out the texture information. In particular, our landmark localization errors compare favorably with those reported by Perakis et al. (table 1). If the less salient landmarks, such as the earlobe tips, are excluded from the analysis, the errors decrease further. The gain in accuracy may be partially attributed to the higher image resolution of our data (~30,000 vertices per surface on average) compared to the previous work (~10,000 vertices per surface). Nonetheless, we believe that the use of both shape and texture information played a key role in improving the landmark localization accuracy. We found that the positions of some salient landmarks, such as the eye corners, are ambiguous even to manual annotation when the texture is removed. Texture gives rich information about the facial anatomical layout, such as the boundaries between different skin/tissue types. In fact, texture is almost the only information source for pattern recognition in 2D images and has been shown to give good performance. We projected both the shape and texture data into the 2D space, where the well-established PCA algorithm was used to detect the key landmarks. We also made use of the texture information for detecting certain other landmarks. This design seems to improve the accuracy and robustness of our method significantly. Furthermore, due to the use of simple and optimized algorithms, the landmark annotation is also very efficient and does not require large amounts of memory. Hundreds of surfaces can be annotated within several minutes on a standard Windows PC. It should be noted that the aim of this study is not to propose a general scheme of landmark recognition on 3D surfaces.
Rather, it gives a pragmatic solution for automatically annotating a fixed set of salient landmarks on the human face. Nonetheless, our new idea of combining both the shape and texture information in the PCA framework may be extended to improve general feature recognition on 3D surfaces.
In this work, we also proposed a new protocol for TPS-based registration, in which the TPS warping is applied only to the reference face while the sample faces remain undeformed. This avoids the step of inverse TPS warping and further increases the efficiency of our method. On the other hand, the robustness of the registration is well retained, as can be seen from the results. To address surface mesh irregularities such as cavities and uneven vertex densities, we proposed a spherical sampling step to re-mesh the reference surface. When used in the registration, this reference surface resulted in similarly well-structured meshes in the sample surfaces. The registration quality can be further improved by replacing the original reference face with a sample-wide average face, followed by a second registration. This further reduces the dependency of the results on a specific reference.
It is interesting to note that both the automatic landmark annotation and the TPS-based registration steps work equally well for two different ethnicities, namely Han Chinese and Uyghur, despite the fact that they differ substantially in both genetic background and facial appearance. Han Chinese are representative of East Asian populations, while the Uyghur are an ancient admixed population whose ancestries came from both East Asians and Caucasians (Europeans) [53]. As a result, Uyghur participants exhibit many Caucasian facial features such as sunken eyes and high nose ridges (figure 8). This method was also tested on other ethnic groups and showed consistent robustness (data not shown). Such ethnicity independence is very important when this method is used to study cross-population facial morphological variation in humans. Limited by the sample collection, we were not able to analyze the sensitivity of our parameters with respect to very different age groups, or to atypical faces caused by disease or obesity. To assist studies of other facial morphological variations, such as face-deforming diseases, facial growth, and aging, this method has to be extended and tested on the specific datasets. Furthermore, the anatomical correspondence can be further improved by including additional features, such as the eyebrows, eyelid lines, and lip lines, as landmarks. These features may provide discrimination power for different facial expressions. Nonetheless, we believe the approach described here can provide a good basis for dense 3D face registration in general.
In summary, this study proposes a new scheme to build accurate and robust anatomical correspondence across dense surfaces of 3D facial images.

Here k is the actual number of eigenvectors to be used, which is set to 16 in our case. U therefore defines an eigenspace in which the sample patches P can be evaluated for similarity.
For a sample face, every point in the 2D grid is given a 21 mm × 21 mm patch, and a sample patch vector P_s is similarly derived following equation (2). P_s is then mean-centered by subtracting the training mean patch vector and projected into the eigenspace U to give the weight vector, which can serve as another indicator of pattern similarity.