Ortiz_Face_Recognition_in_2013_CVPR_paper
Experimental Methods for Facial Feature Swapping

Introduction
Facial feature swapping is a research area in which computer techniques are used to exchange facial features between face images.
Such methods transfer one person's facial features onto another person's face image while leaving the remaining identity and appearance attributes of the target image unchanged, thereby achieving the feature swap; this capability has considerable practical value.
This article introduces the experimental methods used for facial feature swapping and their applications.
Face Feature Extraction and Landmarking
Before a facial feature swapping experiment can be carried out, the face images must first undergo feature extraction and landmark annotation.
Feature extraction refers to extracting face-related information from a face image, such as the facial contour and the positions of the eyes and mouth.
Commonly used approaches include deep-learning-based methods and traditional computer vision methods.
Deep-learning-based methods usually employ convolutional neural networks (CNNs) for feature extraction.
By training a CNN model, high-level feature representations can be learned from face images.
Commonly used CNN architectures include VGG and ResNet.
In a facial feature swapping experiment, a pre-trained CNN model can be used directly for feature extraction, as sketched below.
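As an illustration of extracting features with a pre-trained CNN, the following is a minimal sketch using a torchvision ResNet backbone. The choice of ResNet-18, the 224x224 input size, and the use of generic ImageNet weights (rather than a face-specific network such as a VGG-Face model) are assumptions made for the example, not part of the original text.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Generic ImageNet-pretrained backbone as a stand-in for a face-specific network (torchvision >= 0.13).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()       # drop the classifier head, keep the 512-d feature vector
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(path):
    """Return an L2-normalised 512-d embedding for the face image at `path` (hypothetical input)."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return F.normalize(backbone(x), dim=1)
```

In practice, a network fine-tuned on face data would replace the ImageNet backbone, but the extraction pipeline has the same shape.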
Traditional computer vision methods mainly rely on face analysis algorithms for feature extraction.
Commonly used techniques include facial landmark annotation, contour extraction, and texture extraction.
These algorithms extract facial features by detecting key points, appearance, shape, and other characteristics of the face; a landmark-detection sketch is given below.
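One widely used way to obtain such key points is dlib's 68-point facial landmark detector. The sketch below assumes the pre-trained model file shape_predictor_68_face_landmarks.dat has been downloaded separately; the image path is illustrative.

```python
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # model file assumed present

image = dlib.load_rgb_image("face.jpg")                  # hypothetical input path
for face_rect in detector(image, 1):                     # upsample once to find smaller faces
    shape = predictor(image, face_rect)
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    # In the 68-point scheme, indices 36-47 cover the eyes and 48-67 the mouth.
```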
Facial Feature Alignment and Warping
During a facial feature swapping experiment, the two face images must be aligned and warped.
Feature alignment maps the facial features of the two face images to the same positions so that the correspondence between them is accurate.
Commonly used alignment methods include:
1. Landmark-based alignment: extract key points from each face image (e.g., eyes, nose, and mouth), match the corresponding key points across the two images, and estimate the transformation between them (rotation, translation, scaling, etc.) so that the facial features are brought into alignment (a code sketch is given at the end of this section).
2. Texture-based alignment: extract texture features from the face images, compute the similarity between textures, find the most similar texture regions in the two images, and align them.
After the facial features are aligned, they must also be warped.
Warping consists of shape warping and texture warping.
Shape warping deforms one person's facial features toward the other person's so that the two face shapes become as similar as possible.
Texture warping transforms one person's facial texture toward the other person's so that the two facial textures become as similar as possible.
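A minimal sketch of landmark-based alignment with OpenCV follows: a similarity transform (rotation, uniform scale, translation) is estimated from corresponding key points and used to warp the source face into the target's coordinate frame. The function and variable names are illustrative.

```python
import cv2
import numpy as np

def align_face(src_image, src_points, ref_points, output_size):
    """Warp src_image so that its landmarks (src_points) map onto ref_points.

    src_points, ref_points: lists of corresponding (x, y) landmark coordinates.
    output_size: (width, height) of the aligned output image.
    """
    # Estimate a similarity transform (rotation + uniform scale + translation) between the point sets.
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(src_points), np.float32(ref_points))
    return cv2.warpAffine(src_image, matrix, output_size)

# Usage: aligned = align_face(source_img, source_landmarks, target_landmarks, (256, 256))
```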
Robust Face Recognition via Sparse Representation

Robust Face Recognition via Sparse Representation -- A Q&A about the recent advances in face recognition and how to protect your facial identity
Allen Y. Yang (yang@), Department of EECS, UC Berkeley, July 21, 2008

Q: What is this technique all about?
A: The technique, called robust face recognition via sparse representation, provides a new solution that uses a computer program to classify human identity from frontal facial images, i.e., the well-known problem of face recognition. Face recognition has been one of the most extensively studied problems in the areas of artificial intelligence and computer vision. Its applications include human-computer interaction, multimedia data compression, and security, to name a few. The significance of face recognition is also highlighted by the contrast between humans' high accuracy in recognizing face images under various conditions and computers' historically poor accuracy. This technique proposes a highly accurate recognition framework. Extensive experiments have shown that, for the first time, the method can achieve recognition accuracy similar to that of human vision. In some cases, the method has outperformed what human vision can achieve in face recognition.

Q: Who are the authors of this technique?
A: The technique was developed in 2007 by Mr. John Wright, Dr. Allen Y. Yang, Dr. S. Shankar Sastry, and Dr. Yi Ma. The technique is jointly owned by the University of Illinois and the University of California, Berkeley. A provisional US patent was filed in 2008. The technique is also being published in the IEEE Transactions on Pattern Analysis and Machine Intelligence [Wright 2008].

Q: Why is face recognition difficult for computers?
A: There are several issues that have historically hindered the improvement of face recognition in computer science.

1. High dimensionality, namely, the data size of face images is large. When we take a picture of a face, the face image under certain color metrics will be stored as an image file on a computer, e.g., the image shown in Figure 1. Because the human brain is a massively parallel processor, it can quickly process a 2-D image and match it against the other images learned in the past. However, modern computer algorithms can only process 2-D images sequentially, that is, pixel by pixel. Hence, although the image file usually takes less than 100 kilobytes to store on a computer, if we treat each image as a sample point, it sits in a space of 10,000-100,000 dimensions (each pixel owns an individual dimension). Any pattern recognition problem in a high-dimensional space (>100 D) is known to be difficult in the literature.

Fig. 1. A frontal face image from the AR database [Martinez 1998]. The size of a JPEG file for this image is typically about 60 KB.

2. The number of identities to classify is high. To make the situation worse, an adult human being can learn to recognize thousands if not tens of thousands of different human faces over the span of his/her life. To ask a computer to match a similar ability, it has to first store tens of thousands of learned face images, which in the literature are called the training images. Then, using whatever algorithm, the computer has to process the massive data and quickly identify the correct person from a new face image, which is called the test image.

Fig. 2. An ensemble of 28 individuals in the Yale B database [Lee 2005]. A typical face recognition system needs to recognize 10-100 times more individuals.
Arguably an adult can recognize thousands of times more individuals in daily life. Combining the above two problems, we are solving a pattern recognition problem that must carefully partition a high-dimensional data space into thousands of domains, where each domain represents the possible appearance of one individual's face images.

3. Face recognition has to be performed under various real-world conditions. When you walk into a drug store to take a passport photo, you are usually asked to pose with a frontal, neutral expression in order to qualify for a good passport photo. The store associate will also control the photo resolution, background, and lighting conditions by using a uniform color screen and a flash light. In the real world, however, a computer program is asked to identify humans without any of the above constraints. Although past solutions exist that achieve recognition under very limited relaxations of these constraints, to this day no algorithm, including the technique we present, can answer all the possible challenges. To further motivate the issue, human vision can accurately recognize learned human faces under different expressions, backgrounds, poses, and resolutions [Sinha 2006]. With professional training, humans can also identify face images with facial disguise. Figure 3 demonstrates this ability using images of Abraham Lincoln.

Fig. 3. Images of Abraham Lincoln under various conditions (available online). Arguably humans can recognize the identity of Lincoln from each of these images.

A natural question arises: are we simply asking too much of a computer algorithm? For some applications, such as security check-points, we can mandate that individuals pose a frontal, neutral face in order to be identified. However, in most other applications this requirement is simply not practical. For example, we may want to search our photo albums to find all the images that contain our best friends under normal indoor/outdoor conditions, or we may need to identify a criminal suspect from a murky, low-resolution hidden camera who would naturally try to disguise his identity. Therefore, the study of recognizing human faces under real-world conditions is motivated not only by pure scientific rigor, but also by urgent demands from practical applications.

Q: What is the novelty of this technique? Why is the method related to sparse representation?
A: The method is built on a novel pattern recognition framework that relies on a scientific concept called sparse representation. In fact, sparse representation is not a new topic in many scientific areas. Particularly in human perception, scientists have discovered that accurate low-level and mid-level visual perception is a result of sparse representation of visual patterns using highly redundant visual neurons [Olshausen 1997, Serre 2006]. Without diving into technical detail, let us consider an analogy. Assume that a normal individual, Tom, is very good at identifying different types of fruit juice such as orange juice, apple juice, lemon juice, and grape juice. Now he is asked to identify the ingredients of a fruit punch, which contains an unknown mixture of drinks. Tom discovers that when the ingredients of the punch are highly concentrated on a single type of juice (e.g., 95% orange juice), he has no difficulty identifying the dominant ingredient. On the other hand, when the punch is a largely even mixture of multiple drinks (e.g., 33% orange, 33% apple, and 33% grape), he has the most difficulty identifying the individual ingredients.
In this example, a fruit punch drink can be represented as a sum of the amounts of individual fruit drinks. We say such representation is sparse if the majority of the juice comes from a single fruit type. Conversely, we say the representation is not sparse. Clearly in this example, sparse representation leads to easier and more accurate recognition than nonsparse representation.The human brain turns out to be an excellent machine in calculation of sparse representation from biological sensors. In face recognition, when a new image is presented in front of the eyes, the visual cortex immediately calculates a representation of the face image based on all the prior face images it remembers from the past. However, such representation is believed to be only sparse in human visual cortex. For example, although Tom remembers thousands of individuals, when he is given a photo of his friend, Jerry, he will assert that the photo is an image of Jerry. His perception does not attempt to calculate the similarity of Jerry’s photo with all the images from other individuals. On the other hand, with the help of image-editing software such as Photoshop, an engineer now can seamlessly combine facial features from multiple individuals into a single new image. In this case, a typical human would assert that he/she cannot recognize the new image, rather than analytically calculating the percentage of similarities with multiple individuals (e.g., 33% Tom, 33% Jerry, 33% Tyke) [Sinha 2006].Q: What are the conditions that the technique applies to?A: Currently, the technique has been successfully demonstrated to classify frontal face images under different expressions, lighting conditions, resolutions, and severe facial disguise and image distortion. We believe it is one of the most comprehensive solutions in face recognition, and definitely one of the most accurate.Further study is required to establish a relation, if any, between sparse representation and face images with pose variations.Q: More technically, how does the algorithm estimate a sparse representation using face images? Why do the other methods fail in this respect?A: This technique has demonstrated the first solution in the literature to explicitly calculate sparse representation for the purpose of image-based pattern recognition. It is hard to say that the other extant methods have failed in this respect. Why? Simply because previously investigators did not realize the importance of sparse representation in human vision and computer vision for the purpose of classification. For example, a well-known solution to face recognition is called the nearest-neighbor method. It compares the similarity between a test image with all individual training images separately. Figure 4 shows an illustration of the similarity measurement. The nearest-neighbor method identifies the test image with a training image that is most similar to the test image. Hence the method is called the nearest neighbor. We can easily observe that the so-estimated representation is not sparse. This is because a single face image can be similar to multiple images in terms of its RGB pixel values. Therefore, an accurate classification based on this type of metrics is known to be difficult.Fig. 4. A similarity metric (the y-axis) between a test face image and about 1200 training images. The smaller the metric value, the more similar between two images. 
Our technique abandons the conventional wisdom of comparing similarities between the test image and individual training images or individual training classes. Rather, the algorithm attempts to calculate a representation of the input image with respect to all available training images as a whole. Furthermore, the method imposes one extra constraint: the optimal representation should use the smallest number of training images. Hence, the majority of the coefficients in the representation should be zero, and the representation is sparse (as shown in Figure 5).

Fig. 5. An estimation of sparse representation w.r.t. a test image and about 1200 training images. The dominant coefficients in the representation correspond to the training images with the same identity as the input image. In this example, the recognition is based on downgraded 12-by-10 low-resolution images. Yet, the algorithm can correctly identify the input image as Subject 1.
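To make the idea concrete, here is a minimal sketch of sparse-representation-based classification in Python. It uses scikit-learn's Lasso (an l1-regularised least-squares solver) as a stand-in for the exact l1-minimization routine described in [Wright 2008]; the function name, the regularisation strength, and the class-wise residual rule follow the general recipe rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso  # l1-regularised least squares, a convex surrogate for l1 minimization


def src_classify(A, labels, y, alpha=0.01):
    """Sparse-representation-based classification (SRC), simplified.

    A      : (n_pixels, n_train) matrix whose columns are vectorised, unit-norm training images.
    labels : (n_train,) array of identity labels, one per column of A.
    y      : (n_pixels,) vectorised test image.
    """
    solver = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    solver.fit(A, y)
    x = solver.coef_                               # sparse coefficient vector over all training images

    # Classify by the smallest class-wise reconstruction residual.
    best_id, best_res = None, np.inf
    for c in np.unique(labels):
        x_c = np.where(labels == c, x, 0.0)        # keep only the coefficients belonging to class c
        res = np.linalg.norm(y - A @ x_c)
        if res < best_res:
            best_id, best_res = c, res
    return best_id, x
```

With A holding, say, the roughly 1,200 training images mentioned above as columns, the identity with the smallest residual is returned, mirroring the classification rule illustrated in Figure 5.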
Q: How does the technique handle severe facial disguise in the image?
A: Facial disguise and image distortion pose one of the biggest challenges to the accuracy of face recognition. The types of distortion that can be applied to face images are manifold. Figure 6 shows some examples.

Fig. 6. Examples of image distortion on face images. Some of the cases are beyond a human's ability to perform reliable recognition.

One of the notable advantages of the sparse representation framework is that the problem of compensating for image distortion combined with face recognition can be rigorously reformulated within the same framework. In this case, a distorted face image presents two types of sparsity: one representing the locations of the distorted pixels in the image, and the other representing the identity of the subject as before. Our technique has been shown to be able to handle and eliminate all of the image distortions in Figure 6 while maintaining high accuracy. In the following, we present an example to illustrate a simplified solution for one type of distortion. For more detail, please refer to our paper [Wright 2008]. Figure 7 demonstrates the process of recognizing a face image with severe facial disguise by sunglasses. The algorithm first partitions the test image on the left into eight local regions and individually recovers a sparse representation per region. Notice that, with the sunglasses occluding the eye regions, the corresponding representations from those regions do not provide correct classification. However, when we look at the overall classification result over all regions, the non-occluded regions provide a high consensus for the image to be classified as Subject 1 (as shown in red circles in the figure). Therefore, the algorithm simultaneously recovers the subject identity and the facial regions that are being disguised.

Fig. 7. Solving for part-based sparse representation using local face regions. Left: Test image. Right: Estimation of sparse representation and the corresponding classification in the titles. The red circles identify the correct classification.

Q: What is the quantitative performance of this technique?
A: Most of the representative results from our extensive experiments are documented in our paper [Wright 2008]. The experiments were based on two established face recognition databases, namely, the Extended Yale B database [Lee 2005] and the AR database [Martinez 1998]. In the following, we highlight some of the notable results. On the Extended Yale B database, the algorithm achieved 92.1% accuracy using 12-by-10 resolution images, 93.7% using single-eye-region images, and 98.3% using mouth-region images. On the AR database, the algorithm achieves 97.5% accuracy on face images with sunglasses disguise, and 93.5% with scarf disguise.

Q: Does the estimation of sparse representation cost more computation and time compared to other methods?
A: The complexity and speed of an algorithm are important to the extent that they do not hinder the application of the algorithm to real-world problems. Our technique uses some of the best-studied numerical routines in the literature, namely l-1 minimization. These routines belong to a family of optimization algorithms called convex optimization, which is known to be extremely efficient to solve on a computer. In addition, considering the rapid growth of today's technology for producing advanced microprocessors, we do not believe there is any significant risk in implementing a real-time commercial system based on this technique.

Q: With this type of highly accurate face recognition algorithm available, is it becoming more and more difficult to protect biometric information and personal privacy in urban environments and on the Internet?
A: Believe it or not, a government agency, a company, or even a total stranger can capture and permanently log your biometric identity, including your facial identity, much more easily than you can imagine. According to a Time magazine report [Grose 2008], a resident living or working in London will likely be captured on camera 300 times per day! One can assume that people living in other western metropolitan cities are enjoying similar "free services." If you prefer to stay indoors and blog on the Internet, your public photo albums can be easily accessed over unprotected websites, and have probably been permanently logged by search engines such as Google and Yahoo!. With the ubiquitous camera technologies of today, completely preventing your facial identity from being obtained by others is difficult, unless you never step into a downtown area of a big city and never apply for a driver's license. However, there are ways to prevent illegal and involuntary access to your facial identity, especially on the Internet. One simple step that everyone can take to stop a third party from exploiting your face images online is to prevent these images from being linked to your identity. Any classification system needs a set of training images to learn the possible appearance of your face. If you put your personal photos on a public website and frequently give away the names of the people in the photos, over time a search engine will be able to link the identities of the people with the face images in those photos. Therefore, to prevent an unauthorized party from "crawling" your website and sifting through the valuable private information, you should place these photo websites under password protection. Do not make a large number of personal images available online without consent while at the same time providing the names of the people on the same website. Previously we have mentioned many notable applications that involve face recognition. The technology, if properly utilized, can also revolutionize the IT industry to better protect personal privacy. For example, an assembly factory can install a network of cameras to improve the safety of the assembly line while at the same time blurring out the facial images of the workers in the surveillance videos.
A cellphone user who is teleconferencing can activate a face recognition function to track only his/her facial movements and exclude other people in the background from being transmitted to the other party. All in all, face recognition is a rigorous scientific study. Its sole purpose is to hypothesize, model, and reproduce the image-based recognition process with accuracy comparable or even superior to human perception. The scope of its final extension and impact on our society will rest on the shoulders of the government, the industry, and each individual end user.

References
[Grose 2008] T. Grose. When surveillance cameras talk. Time (online), Feb. 11, 2008.
[Lee 2005] K. Lee et al. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, 2005.
[Martinez 1998] A. Martinez and R. Benavente. The AR face database. CVC Tech Report No. 24, 1998.
[Olshausen 1997] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, vol. 37, 1997.
[Serre 2006] T. Serre. Learning a dictionary of shape-components in visual cortex: Comparison with neurons, humans and machines. PhD dissertation, MIT, 2006.
[Sinha 2006] P. Sinha et al. Face recognition by humans: Nineteen results all computer vision researchers should know about. Proceedings of the IEEE, vol. 94, no. 11, November 2006.
[Wright 2008] J. Wright et al. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008 (in press).
Face Recognition Technology and Biometric Identification Training (presentation slides)

Algorithm optimization: continuously refine the face recognition algorithms to improve recognition speed and accuracy.
Hardware upgrades: upgrade the hardware to improve the processing capacity and response speed of the face recognition system.
Data and training: train with large-scale, diverse datasets to improve the generalization ability of the face recognition model.
Training Content and Practice
Basic training on face recognition technology
Principles of face recognition: a detailed introduction to the principles, algorithms, and implementation of face recognition, covering key techniques such as feature extraction, matching, and recognition.
Classification: according to the type of feature used, biometric identification technologies can be divided into those based on physiological characteristics and those based on behavioral characteristics. Commonly used biometric identification technologies include the following.
Fingerprint recognition: uses the uniqueness and stability of fingerprints for identity verification.
Iris recognition: verifies identity by analyzing the texture of the iris of the eye.
Retina recognition: verifies identity by analyzing the structure of the retina of the eye.
Face recognition: verifies identity by analyzing a person's facial features.
Contents
• Overview of face recognition technology
• Key technologies of face recognition
• Introduction to biometric identification technologies
• Challenges of face recognition technology and their solutions
• Training content and practice
• Summary and outlook
Overview of Face Recognition Technology
Definition and principles of face recognition technology
Summary: face recognition is a biometric identification technology that automatically identifies and verifies a person's identity by means of computer image processing and artificial intelligence algorithms.
Matching: the extracted features are compared with the features stored in a database to perform face identification or verification.
Applications of deep learning in face recognition
Deep learning models: deep learning models such as convolutional neural networks (CNNs) are widely used in face recognition and can automatically extract high-level feature representations.
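A minimal sketch of the matching step described above, assuming face embeddings (e.g., from a CNN) have already been extracted and L2-normalised; the similarity threshold and variable names are illustrative choices.

```python
import numpy as np

def identify(probe_embedding, gallery_embeddings, gallery_ids, threshold=0.6):
    """1:N identification by cosine similarity.

    probe_embedding    : (d,) L2-normalised embedding of the query face.
    gallery_embeddings : (n, d) L2-normalised embeddings of enrolled faces.
    gallery_ids        : length-n list of identity labels.
    """
    similarities = gallery_embeddings @ probe_embedding   # cosine similarity for unit-norm vectors
    best = int(np.argmax(similarities))
    if similarities[best] < threshold:                    # open-set reject option: no match in the gallery
        return None, float(similarities[best])
    return gallery_ids[best], float(similarities[best])
```

For 1:1 verification, the same cosine score is simply compared against the threshold for a single claimed identity.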
A Face Detection Method Based on Improved AdaBoost, Skin Color, and 2DPCA

PEI Zhen; XU Zhongren
Abstract: To improve the detection rate and speed of multi-face detection against complex backgrounds, a face detection method based on improved AdaBoost, skin-color detection, and two-dimensional principal component analysis (2DPCA) is proposed. The method first uses a pyramid structure to detect faces quickly and obtain candidate face regions, then filters the candidate regions with skin-color detection to remove falsely detected non-face regions, and finally applies 2DPCA detection to the key facial parts according to their geometric positions. Simulation results show that the method achieves fast detection and accurate localization of multiple faces against complex backgrounds, effectively reduces the false detection rate, and makes the detection results more precise.
Journal: Electronic Design Engineering
Year (volume), issue: 2014, 22(8)
Pages: 4 (pp. 116-119)
Keywords: detection rate; AdaBoost algorithm; skin-color detection; 2DPCA
Authors: PEI Zhen; XU Zhongren
Affiliation: School of Information and Control Engineering, Liaoning Shihua University, Fushun 113001, Liaoning, China
Language: Chinese
CLC number: TN707

Face detection refers to searching an arbitrary given image with a certain strategy to determine whether it contains faces and, if so, returning the position, size, and pose of each face [1].
With the wide application of face detection in video conferencing, image search, digital video processing, video surveillance, and video compression coding, research on face detection has become increasingly mature.
However, face detection still faces several difficulties: 1) faces differ in skin color, appearance, and expression, so the pattern is highly variable; 2) faces are sometimes partly covered by glasses, beards, headwear, and other accessories; 3) light sources and imaging angles cause reflections and shadows on the face.
Many researchers have proposed combining AdaBoost with other features to detect faces in images [2-3].
The skin-color detection method proposed in [4] is suitable for face detection against simple backgrounds; against complex backgrounds, however, relying on skin-color detection alone introduces large errors and a relatively high false detection rate.
The algorithm proposed in [5], which combines an improved AdaBoost with skin-color features, adopts modified sample weights and effectively avoids the weight over-amplification phenomenon.
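For illustration, the sketch below combines OpenCV's stock Haar cascade (standing in for the paper's improved AdaBoost detector) with a YCrCb skin-color filter; the 2DPCA verification stage is omitted, and the color bounds, ratio threshold, and file path are common illustrative values rather than those of the paper.

```python
import cv2

# Stage 1: a boosted Haar cascade proposes candidate face windows.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
image = cv2.imread("group_photo.jpg")                   # hypothetical input path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
candidates = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Stage 2: reject candidates whose skin-color ratio in YCrCb space is too low.
ycrcb = cv2.cvtColor(image, cv2.COLOR_BGR2YCrCb)
skin_mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))   # commonly used Cr/Cb bounds
faces = []
for (x, y, w, h) in candidates:
    skin_ratio = cv2.countNonZero(skin_mask[y:y + h, x:x + w]) / float(w * h)
    if skin_ratio > 0.3:                                # illustrative threshold
        faces.append((x, y, w, h))
```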
Face Recognition Technology: Translated Foreign Literature (edited)

Document information
Title: Face Recognition Techniques: A Survey
Author: V. Vijayakumari
Source: World Journal of Computer Application and Technology, 2013, 1(2): 41-50
Word count: 3,186 English words (17,705 characters); 5,317 Chinese characters

Face Recognition Techniques: A Survey

Abstract: Face is the index of mind. It is a complex multidimensional structure and needs a good computing technique for recognition. While using automatic systems for face recognition, computers are easily confused by changes in illumination, variation in poses and changes in the angles of faces. Numerous techniques are being used for security and authentication purposes, including areas in detective agencies and for military purposes. This survey gives the existing methods in automatic face recognition and formulates a way to still increase the performance.
Keywords: Face Recognition, Illumination, Authentication, Security

1. Introduction
Developed in the 1960s, the first semi-automated system for face recognition required the administrator to locate features (such as eyes, ears, nose, and mouth) on the photographs before it calculated distances and ratios to a common reference point, which were then compared to reference data. In the 1970s, Goldstein, Harmon, and Lesk used 21 specific subjective markers such as hair color and lip thickness to automate the recognition. The problem with both of these early solutions was that the measurements and locations were manually computed. The face recognition problem can be divided into two main stages: face verification (or authentication), and face identification (or recognition). The detection stage is the first stage; it includes identifying and locating a face in an image. The recognition stage is the second stage; it includes feature extraction, where important information for the discrimination is saved, and the matching, where the recognition result is given with the aid of a face database.

2. Methods
2.1. Geometric Feature Based Methods
The geometric feature based approaches are the earliest approaches to face recognition and detection. In these systems, the significant facial features are detected and the distances among them, as well as other geometric characteristics, are combined in a feature vector that is used to represent the face. To recognize a face, first the feature vector of the test image and of the image in the database is obtained. Second, a similarity measure between these vectors, most often a minimum distance criterion, is used to determine the identity of the face. As pointed out by Brunelli and Poggio, the template based approaches will outperform the early geometric feature based approaches.

2.2. Template Based Methods
The template based approaches represent the most popular technique used to recognize and detect faces. Unlike the geometric feature based approaches, the template based approaches use a feature vector that represents the entire face template rather than the most significant facial features.

2.3. Correlation Based Methods
Correlation based methods for face detection are based on the computation of the normalized cross correlation coefficient Cn. The first step in these methods is to determine the location of the significant facial features such as eyes, nose or mouth. The importance of robust facial feature detection for both detection and recognition has resulted in the development of a variety of different facial feature detection algorithms.
The facial feature detection method proposed by Brunelli and Poggio uses a set of templates to detect the position of the eyes in an image, by looking for the maximum absolute values of the normalized correlation coefficient of these templates at each point in test image. To cope with scale variations, a set of templates atdifferent scales was used.The problems associated with the scale variations can be significantly reduced by using hierarchical correlation. For face recognition, the templates corresponding to the significant facial feature of the test images are compared in turn with the corresponding templates of all of the images in the database, returning a vector of matching scores computed through normalized cross correlation. The similarity scores of different features are integrated to obtain a global score that is used for recognition. Other similar method that use correlation or higher order statistics revealed the accuracy of these methods but also their complexity.Beymer extended the correlation based on the approach to a view based approach for recognizing faces under varying orientation, including rotations with respect to the axis perpendicular to the image plane(rotations in image depth). To handle rotations out of the image plane, templates from different views were used. After the pose is determined ,the task of recognition is reduced to the classical correlation method in which the facial feature templates are matched to the corresponding templates of the appropriate view based models using the cross correlation coefficient. However this approach is highly computational expensive, and it is sensitive to lighting conditions.2.4.Matching Pursuit Based MethodsPhilips introduced a template based face detection and recognition system that uses a matching pursuit filter to obtain the face vector. The matching pursuit algorithm applied to an image iteratively selects from a dictionary of basis functions the best decomposition of the image by minimizing the residue of the image in all iterations. The algorithm describes by Philips constructs the best decomposition of a set of images by iteratively optimizing a cost function, which is determined from the residues of the individual images. The dictionary of basis functions used by the author consists of two dimensional wavelets, which gives a better image representation than the PCA (Principal Component Analysis) and LDA(Linear Discriminant Analysis) based techniques where the images were stored as vectors. For recognition the cost function is a measure of distances between faces and is maximized at each iteration. For detection the goal is to find a filter that clusters together in similar templates (themean for example), and minimized in each iteration. The feature represents the average value of the projection of the templates on the selected basis.2.5.Singular Value Decomposition Based MethodsThe face recognition method in this section use the general result stated by the singular value decomposition theorem. Z.Hong revealed the importance of using Singular Value Decomposition Method (SVD) for human face recognition by providing several important properties of the singular values (SV) vector which include: the stability of the SV vector to small perturbations caused by stochastic variation in the intensity image, the proportional variation of the SV vector with the pixel intensities, the variances of the SV feature vector to rotation, translation and mirror transformation. 
The above properties of the SV vector provide the theoretical basis for using singular values as image features. In addition, it has been shown that compressing the original SV vector into the low dimensional space by means of various mathematical transforms leads to the higher recognition performance. Among the various dimensionality reducing transformations, the Linear Discriminant Transform is the most popular one.2.6.The Dynamic Link Matching MethodsThe above template based matching methods use an Euclidean distance to identify a face in a gallery or to detect a face from a background. A more flexible distance measure that accounts for common facial transformations is the dynamic link introduced by Lades et al. In this approach , a rectangular grid is centered all faces in the gallery. The feature vector is calculated based on Gabor type wavelets, computed at all points of the grid. A new face is identified if the cost function, which is a weighted sum of two terms, is minimized. The first term in the cost function is small when the distance between feature vectors is small and the second term is small when the relative distance between the grid points in the test and the gallery image is preserved. It is the second term of this cost function that gives the “elasticity” of this matching measure. While the grid of the image remains rectangular, the grid that is “best fit” over the test image is stretched. Under certain constraints, until the minimum of the cost function is achieved. The minimum value of the cost function isused further to identify the unknown face.2.7.Illumination Invariant Processing MethodsThe problem of determining functions of an image of an object that are insensitive to illumination changes are considered. An object with Lambertian reflection has no discriminative functions that are invariant to illumination. This result leads the author to adopt a probabilistic approach in which they analytically determine a probability distribution for the image gradient as a function of the surfaces geometry and reflectance. Their distribution reveals that the direction of the image gradient is insensitive to changes in illumination direction. Verify this empirically by constructing a distribution for the image gradient from more than twenty million samples of gradients in a database of thousand two hundred and eighty images of twenty inanimate objects taken under varying lighting conditions. Using this distribution, they develop an illumination insensitive measure of image comparison and test it on the problem of face recognition. In another method, they consider only the set of images of an object under variable illumination, including multiple, extended light sources, shadows, and color. They prove that the set of n-pixel monochrome images of a convex object with a Lambertian reflectance function, illuminated by an arbitrary number of point light sources at infinity, forms a convex polyhedral cone in IR and that the dimension of this illumination cone equals the number of distinct surface normal. Furthermore, the illumination cone can be constructed from as few as three images. In addition, the set of n-pixel images of an object of any shape and with a more general reflectance function, seen under all possible illumination conditions, still forms a convex cone in IRn. These results immediately suggest certain approaches to object recognition. 
Throughout, they present results demonstrating the illumination cone representation.2.8.Support Vector Machine ApproachFace recognition is a K class problem, where K is the number of known individuals; and support vector machines (SVMs) are a binary classification method. By reformulating the face recognition problem and reinterpreting the output of the SVM classifier, they developed a SVM-based face recognition algorithm. The facerecognition problem is formulated as a problem in difference space, which models dissimilarities between two facial images. In difference space we formulate face recognition as a two class problem. The classes are: dissimilarities between faces of the same person, and dissimilarities between faces of different people. By modifying the interpretation of the decision surface generated by SVM, we generated a similarity metric between faces that are learned from examples of differences between faces. The SVM-based algorithm is compared with a principal component analysis (PCA) based algorithm on a difficult set of images from the FERET database. Performance was measured for both verification and identification scenarios. The identification performance for SVM is 77-78% versus 54% for PCA. For verification, the equal error rate is 7% for SVM and 13% for PCA.2.9.Karhunen- Loeve Expansion Based Methods2.9.1.Eigen Face ApproachIn this approach, face recognition problem is treated as an intrinsically two dimensional recognition problem. The system works by projecting face images which represents the significant variations among known faces. This significant feature is characterized as the Eigen faces. They are actually the eigenvectors. Their goal is to develop a computational model of face recognition that is fact, reasonably simple and accurate in constrained environment. Eigen face approach is motivated by the information theory.2.9.2.Recognition Using Eigen FeaturesWhile the classical eigenface method uses the KLT (Karhunen- Loeve Transform) coefficients of the template corresponding to the whole face image, the author Pentland et.al. introduce a face detection and recognition system that uses the KLT coefficients of the templates corresponding to the significant facial features like eyes, nose and mouth. For each of the facial features, a feature space is built by selecting the most significant “eigenfeatures”, which are the eigenvectors corresponding to the largest eigen values of the features correlation matrix. The significant facial features were detected using the distance from the feature space and selecting the closest match. The scores of similarity between the templates of the test image and thetemplates of the images in the training set were integrated in a cumulative score that measures the distance between the test image and the training images. The method was extended to the detection of features under different viewing geometries by using either a view-based Eigen space or a parametric eigenspace.2.10.Feature Based Methods2.10.1.Kernel Direct Discriminant Analysis AlgorithmThe kernel machine-based Discriminant analysis method deals with the nonlinearity of the face patterns’ distribution. This method also effectively solves the so-called “small sample size” (SSS) problem, which exists in most Face Recognition tasks. The new algorithm has been tested, in terms of classification error rate performance, on the multiview UMIST face database. 
Results indicate that the proposed methodology is able to achieve excellent performance with only a very small set of features being used, and its error rate is approximately 34% and 48% of those of two other commonly used kernel FR approaches, the kernel-PCA (KPCA) and the Generalized Discriminant Analysis (GDA), respectively.2.10.2.Features Extracted from Walshlet PyramidA novel Walshlet Pyramid based face recognition technique used the image feature set extracted from Walshlets applied on the image at various levels of decomposition. Here the image features are extracted by applying Walshlet Pyramid on gray plane (average of red, green and blue. The proposed technique is tested on two image databases having 100 images each. The results show that Walshlet level-4 outperforms other Walshlets and Walsh Transform, because the higher level Walshlets are giving very coarse color-texture features while the lower level Walshlets are representing very fine color-texture features which are less useful to differentiate the images in face recognition.2.10.3.Hybrid Color and Frequency Features ApproachThis correspondence presents a novel hybrid Color and Frequency Features (CFF) method for face recognition. The CFF method, which applies an Enhanced Fisher Model(EFM), extracts the complementary frequency features in a new hybrid color space for improving face recognition performance. The new color space, the RIQcolor space, which combines the component image R of the RGB color space and the chromatic components I and Q of the YIQ color space, displays prominent capability for improving face recognition performance due to the complementary characteristics of its component images. The EFM then extracts the complementary features from the real part, the imaginary part, and the magnitude of the R image in the frequency domain. The complementary features are then fused by means of concatenation at the feature level to derive similarity scores for classification. The complementary feature extraction and feature level fusion procedure applies to the I and Q component images as well. Experiments on the Face Recognition Grand Challenge (FRGC) show that i) the hybrid color space improves face recognition performance significantly, and ii) the complementary color and frequency features further improve face recognition performance.2.10.4.Multilevel Block Truncation Coding ApproachIn Multilevel Block Truncation coding for face recognition uses all four levels of Multilevel Block Truncation Coding for feature vector extraction resulting into four variations of proposed face recognition technique. The experimentation has been conducted on two different face databases. The first one is Face Database which has 1000 face images and the second one is “Our Own Database” which has 1600 face images. To measure the performance of the algorithm the False Acceptance Rate (FAR) and Genuine Acceptance Rate (GAR) parameters have been used. The experimental results have shown that the outcome of BTC (Block truncation Coding) Level 4 is better as compared to the other BTC levels in terms of accuracy, at the cost of increased feature vector size.2.11.Neural Network Based AlgorithmsTemplates have been also used as input to Neural Network (NN) based systems. Lawrence et.al proposed a hybrid neural network approach that combines local image sampling, A self organizing map (SOM) and a convolutional neural network. The SOP provides a set of features that represents a more compact and robust representation of the image samples. 
These features are then fed into the convolutional neural network. This architecture provides partial invariance to translation, rotation, scale and facedeformation. Along with this the author introduced an efficient probabilistic decision based neural network (PDBNN) for face detection and recognition. The feature vector used consists of intensity and edge values obtained from the facial region of the down sampled image in the training set. The facial region contains the eyes and nose, but excludes the hair and mouth. Two PDBNN were trained with these feature vectors and used one for the face detection and other for the face recognition.2.12.Model Based Methods2.12.1.Hidden Markov Model Based ApproachIn this approach, the author utilizes the face that the most significant facial features of a frontal face which includes hair, forehead, eyes, nose and mouth which occur in a natural order from top to bottom even if the image undergo small variation/rotation in the image plane perpendicular to the image plane. One dimensional HMM (Hidden Markov Model) is used for modeling the image, where the observation vectors are obtained from DCT or KLT coefficients. They given c face images for each subject of the training set, the goal of the training set is to optimize the parameters of the Hidden Markov Model to best describe the observations in the sense of maximizing the probability of the observations given in the model. Recognition is carried out by matching the best test image against each of the trained models. To do this, the image is converted to an observation sequence and then model likelihoods are computed for each face model. The model with the highest likelihood reveals the identity of the unknown face.2.12.2.The Volumetric Frequency Representation of Face ModelA face model that incorporates both the three dimensional (3D) face structure and its two-dimensional representation are explained (face images). This model which represents a volumetric (3D) frequency representation (VFR) of the face , is constructed using range image of a human head. Making use of an extension of the projection Slice Theorem, the Fourier transform of any face image corresponds to a slice in the face VFR. For both pose estimation and face recognition a face image is indexed in the 3D VFR based on the correlation matching in a four dimensional Fourier space, parameterized over the elevation, azimuth, rotation in the image planeand the scale of faces.3.ConclusionThis paper discusses the different approaches which have been employed in automatic face recognition. In the geometrical based methods, the geometrical features are selected and the significant facial features are detected. The correlation based approach needs face template rather than the significant facial features. Singular value vectors and the properties of the SV vector provide the theoretical basis for using singular values as image features. The Karhunen-Loeve expansion works by projecting the face images which represents the significant variations among the known faces. Eigen values and Eigen vectors are involved in extracting the features in KLT. Neural network based approaches are more efficient when it contains no more than a few hundred weights. The Hidden Markov model optimizes the parameters to best describe the observations in the sense of maximizing the probability of observations given in the model .Some methods use the features for classification and few methods uses the distance measure from the nodal points. 
The drawbacks of the methods are also discussed based on the performance of the algorithms used in the approaches. Hence this gives some idea about the existing methods for automatic face recognition.
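To make the Karhunen-Loeve (eigenface) approach surveyed in Section 2.9 concrete, here is a minimal sketch using scikit-learn. The variable names, number of components, and the 1-nearest-neighbour classifier are illustrative choices, not part of the original survey.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# X_train: (n_images, n_pixels) matrix of flattened grayscale face images; y_train: identity labels.
def train_eigenfaces(X_train, y_train, n_components=50):
    pca = PCA(n_components=n_components, whiten=True).fit(X_train)   # the eigenfaces are pca.components_
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(pca.transform(X_train), y_train)                         # classify in the projected subspace
    return pca, clf

def recognize(pca, clf, X_test):
    return clf.predict(pca.transform(X_test))
```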
A WPD-HOG Pyramid Feature Extraction Method for Face Recognition

LIU Wenpei; LI Fenglian; ZHANG Xueying; TIAN Yuchu
Abstract: Face recognition can be applied in many surveillance and security settings, and it involves key technologies such as feature extraction and recognition models. The feature extraction method directly affects recognition performance, and existing feature extraction methods suffer from incomplete feature representation and high computational complexity. Accordingly, a face feature extraction method based on a WPD-HOG pyramid is proposed, which combines wavelet packet decomposition (WPD), image pyramids, and histograms of oriented gradients (HOG) to represent face images effectively; the resulting WPD-HOG pyramid features are then classified with an SVM classifier. Experiments on the ORL face database compare the method with four baselines: HOG, HOG pyramid, FWPD-HOG, and FWPD-HOG pyramid. The results show that the WPD-HOG pyramid method achieves a higher recognition rate than the baselines and is more robust to noise.
Journal: Computer Engineering and Applications
Year (volume), issue: 2018, 54(22)
Pages: 6 (pp. 150-155)
Keywords: face recognition; feature extraction; wavelet packet decomposition; image pyramid; histogram of oriented gradients
Authors: LIU Wenpei; LI Fenglian; ZHANG Xueying; TIAN Yuchu
Affiliations: College of Information Engineering, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China; School of Electrical Engineering and Computer Science, Queensland University of Technology, Queensland, Australia
Language: Chinese
CLC number: TP391.41

1 Introduction
As a form of biometric identification, face recognition is increasingly used in identity verification because of its stability, convenience, and uniqueness.
Face recognition identifies a person from facial characteristics, so the performance of the feature extraction method directly determines the recognition result.
At present, commonly used feature extraction methods fall roughly into two categories. The first are global feature extraction methods, which effectively describe the overall contour of the face, such as principal component analysis (PCA) [1], linear discriminant analysis (LDA) [2], and Eigenfaces [3]. The second are local feature extraction methods, which capture the detailed features of the face, such as local binary patterns (LBP) [4], Gabor features [5], and histograms of oriented gradients (HOG) [6].
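As a rough illustration of the HOG-pyramid part of such a pipeline (the wavelet packet decomposition stage is omitted here), the sketch below uses scikit-image; the pyramid scales, HOG parameters, and input path are illustrative defaults rather than the paper's settings.

```python
import numpy as np
from skimage import io, color, transform
from skimage.feature import hog

# Build a crude three-level image pyramid and concatenate the HOG descriptor of each level.
image = color.rgb2gray(io.imread("face.png"))            # hypothetical input path
levels = [transform.resize(image, (size, size)) for size in (128, 64, 32)]
feature = np.concatenate([
    hog(level, orientations=9, pixels_per_cell=(8, 8),
        cells_per_block=(2, 2), block_norm="L2-Hys")
    for level in levels
])
# `feature` can then be fed to an SVM classifier, e.g. sklearn.svm.SVC.
```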
A Low-Dimensional Illumination Space Representation of Face Images under Arbitrary Lighting Conditions

Vol.33,No.1ACTA AUTOMATICA SINICA January,2007 A Low-dimensional Illumination Space Representation ofHuman Faces for Arbitrary Lighting ConditionsHU Yuan-Kui1WANG Zeng-Fu1Abstract The proposed method for low-dimensional illumination space representation(LDISR)of human faces can not only synthesize a virtual face image when given lighting conditions but also estimate lighting conditions when given a face image.The LDISR is based on the observation that9basis point light sources can represent almost arbitrary lighting conditions for face recognition application and different human faces have a similar LDISR.The principal component analysis(PCA)and the nearest neighbor clustering method are adopted to obtain the9basis point light sources.The9basis images under the9basis point light sources are then used to construct an LDISR which can represent almost all face images under arbitrary lighting conditions. Illumination ratio image(IRI)is employed to generate virtual face images under different illuminations.The LDISR obtained from face images of one person can be used for other people.Experimental results on image reconstruction and face recognition indicate the efficiency of LDISR.Key words LDISR,basis image,illumination ratio image,face recognition1IntroductionIllumination variation is one of the most important fac-tors which reduce significantly the performance of face recognition system.It has been proved that the variations between images of the same face due to illumination are almost always larger than image variations due to change in face identity[1].So eliminating the effects due to illumi-nation variations relates directly to the performance and practicality of face recognition system.To handle face image variations due to changes in ligh-ting conditions,many methods have been proposed thus far.Generally,the approaches to cope with variation in appearance due to illumination fall into three kinds[2]: invariant features,such as edge maps,imagesfiltered with2D Gabor-like functions,derivatives of the gray-level image,images with Log transformations and the re-cently reported quotient image[3]and self-quotient image[4]; variation-modeling,such as subspace methods[5∼7],illumi-nation cone[8∼10];and canonical forms,such as methods in [11,12].This paper investigates the subspace methods for illumi-nation representation.Hallinan et al.[5,6]proposed an eigen subspace method for face representation.This method firstly collected frontal face images of the same person un-der different illuminations as training set,and then used principal component analysis(PCA)method to get the eigenvalues and eigenvectors of the training set.They concluded that5±2eigenvectors would suffice to model frontal face images under arbitrary illuminations.The ex-perimental results indicated that this method can recon-struct frontal face images with variant lightings using a few eigenvectors.Different from Hallinan,Shashua[7]pro-posed that under the assumption of Lambertian surface, three basis images shot under three linearly independent light sources could reconstruct frontal face images under arbitrary lightings.This method was proposed to discount the lighting effects but not to explain lighting conditions. 
Belhumeur et al.[8,9]proved that face images with the same pose under different illumination conditions form a convex cone,called illumination cone,and the cone can be repre-sented in a9dimensional space[10].This method performs well but it needs no less than seven face images for each Received January11,2006;in revised form March28,2006 Supported by Open Foundation of National Laboratory of Pattern Recognition,P.R.China.1.Department of Automation,University of Science and Technol-ogy of China,Hefei230027,P.R.ChinaDOI:10.1360/aas-007-0009person to estimate the3D face shape and the irradiance map.Basri&Jacobs[13]and Ramamoorthi[14,15]indepen-dently applied the spherical harmonic representation and explained the low dimensionality of differently illuminated face images.They theoretically proved that the images of a convex Lambertian object obtained under a wide variety of lighting conditions can be approximated accurately with a 9D linear subspace,explaining prior empirical results[5∼7]. However,both of them assumed that the3D surface normal and albedo(or unit albedo)were known.This assumption limits the application of this algorithm.The above research results theoretically and empirically indicate that frontal face images obtained under a wide variety of lighting conditions can be approximated accu-rately with a low-dimensional linear subspace.However, all the above subspace methods construct a subspace from training images for each human face,which is not only cor-responding to the illumination conditions but also to the face identity.The subspaces,in which the intrinsic infor-mation(shape and albedo)and the extrinsic information (lightings)are mixed,are not corresponding to the lighting conditions distinctly.Otherwise,a large training image set would be needed in the learning stage and3d face model might be needed.In this paper,a low-dimensional illumination space rep-resentation(LDISR)of human faces for arbitrary lighting conditions is proposed,which can handle the problems that can not be solved well in the existing methods to a certain extent.The key idea underlying our model is that any lighting condition can be represented by9basis point light sources.The9basis images under the9basis point light sources construct an LDISR,which separates the intrinsic and the extrinsic information and can both estimate ligh-ting conditions when given a face image and synthesize a virtual face image when given lighting condition combin-ing with the illumination ratio image(IRI)method.The method in[10]and the proposed method in this paper have some similarities,but they have some essential differences also.The former needs to build one subspace for each per-son,and the latter only needs to build one subspace for one selected person.Furthermore,the9D illumination space built in the former case is not corresponding to the lighting conditions distinctly,and in our case once the correspon-ding illumination space is built,it can be used to generate virtual frontal face images of anybody under arbitrary illu-minations by using the warping technology and IRI method developed.These virtual images are then used for the pur-pose of both training and recognition.The experiments onc 2007by Acta Automatica Sinica.All rights reserved.10ACTA AUTOMATICA SINICA Vol.33 Fig.1The positions corresponding to the dominant pointlight sourcesYale Face Database B indicate that the proposed methodcan improve the performance of face recognition efficiently.2Constructing the LDISRSince any given set of lighting conditions can be 
exactlyexpressed as a sum of point light sources,a surface patch sradiance illuminated by two light sources is the sum of thecorresponding radiances when the two light sources are ap-plied separately.More detail was discussed in[5].In thissection,PCA and clustering based method are adopted tofind the basis point light sources,which are able to repre-sent arbitrary lighting conditions.The needed3D face model was obtained using a3D ima-ging machine3DMetrics TM.Then the3D face model ob-tained was used to generate the training images.Moveafloodlight by increments of10degrees to each position(θi,ϕj)to generate image p(θi,ϕj),whereθis the eleva-tion andϕis the azimuth.Typicallyϕ∈[−120◦,120◦]andθ∈[−90◦,90◦].Totally,427images were generated,denoted as{pk ,k=1,···,427}.We use PCA tofind the dominant components for the finite set of images.Since the PCA is used on the images of the same human face with different lighting conditions, the dominant eigenvectors do not reflect the facial shape but the lighting conditions.So the above eigenvectors can be used to represent lighting conditions.In this paper,the lighting subspace is constructed not using the eigenvectors directly but the light sources corresponding to the eigen-vectors.According to the ratio of the corresponding eigen-value to the sum of all the eigenvalues,thefirst60 eigenvalues containing the99.9%energy were selected. And the60corresponding eigenvectors were selected as the principal components.Denote thefirst60eigenvectors as{u i,i=1,···,60}.For the i th eigenvector u i,thecorresponding training image is pj ,where u i and pjsatisfyu T i pj =maxk∈{1, (427){u T i pk}(1)The positions of the60dominant point light sources are shown in Fig.1.By investigating the positions of the dominant point light sources,it can be found that the dominant point light sources are distributed by certain rules.They are distributed almost symmetrically and cluster together in regions such as the frontal,the side,the below,and the above of head.The nearest neighbor clustering method is adopted here to get the basis light positions.Considering the effects of point light sources in different elevation and azimuth,some rules are employed for clustering:1.When the elevation is below−60◦or above60◦,clus-tering is done based on the differences of values in elevation.2.When the elevation is in range[−60◦,60◦],clusteringis donebased on the Euclidian distances in space.Fig.2The clustering result of thefirst60eigenvectors.By adopting the nearest nerghbor clustering method,the 60dominant light sources can be classified into9classes. 
The clustering result is shown in Fig.2.When the geometric center of each class is regarded as the basis position,the9 basis light positions are shown in Table1.From the above procedure,it is known that point light sources in the9basis positions are dominant and princi-pal components in the lighting space,and they can express arbitrary lighting conditions.The9basis images obtained under the9basis point light sources respectively construct a low-dimensional illumination space representation(LD-ISR)of human face,which can express frontal face images under arbitrary illuminations.Because different human faces have similar3D shapes[3,16],the LDISR of different faces is also similar.As an approximation,it can be as-sumed that different persons have the same LDISR,which has been discussed in[17].Denote the9basis images obtained under9basis lights are I i,i=1,···,9,the LDISR of human face can be de-noted as A=[I1,I2,···,I9].The face image under lighting s x can be expressed asI x=Aλλ(2) whereλ=[λ1,λ2,···,λ9]T,0≤λi≤1is the lighting pa-rameters of image I x and can be calculated by minimizing the energy function E(λ):E(λ)= Aλλ−I x 2(3) So we can getλ=A+I x(4) whereA+=(A T A)−1A T(5)No.1Hu Yuan-Kui and WANG Zeng-Fu:A Low-dimensional Illumination Space Representation of (11)Table1Positions of the9basis light sourceslight123456789Elevationθ(degree)017.525.7364468.6-33.3-35-70Elevationϕ(degree)0-47.544.4-10888-385-9522.5(a)Input image(b)ASM alignment(c)Warped mean shape(d)The virtual images generated under different lightingsFig.3Generating virtual images using the9D illuminationspace and the IRIGiven an image of human face for learning images,thelighting parametersλcan be calculated by(4),and thevirtual face images can be generated by(2)by using thelighting conditionλ.In order to use the LDISR learnedfrom one human face to generate virtual images of other hu-man faces,the illumination ratio image(IRI)based methodis adopted in next section.3Generating virtual images u-sing illumination ratio-image(IRI)Denote the light sources as s i,i=0,1,2,···,respec-tively,where s0is the normal light source,and I ji the imageunder light source s i for the person with index j.The IRIis based on the assumption that a face is a convex surfacewith a Lambertian function.A face image can be describedasI(u,v)=ρ(u,v)n(u,v)·l(6)where,ρ(u,v)is the albedo of point(u,v),n(u,v)is thesurface normal at(u,v),and l is the direction of light.Different from the quotient image[3],illumination ratioimage is defined as follows[11,18,19,20].R i(u,v)=I ij(u,v)I0j(u,v)(7)From(6)and(7),we haveR i(u,v)=ρj(u,v)n T(u,v)·s iρj(u,v)n T(u,v)·s0=n T(u,v)·s in T(u,v)·s0(8)Equation(8)shows that the IRI can be determined only by the surface normal of a face and the light sources,which is independent of specific albedo.Since different human faces have the similar surface normal[3,16],the IRIs of dif-ferent people under the same lighting condition can be con-sidered to be the same.In order to eliminate the effect due to shapes of different faces,the following procedure should be done.Firstly,all faces can be warped to the same shape, and then the IRI is computed.In this paper,an ASM based method is used to perform the face alignment and all faces will then be warped to the predefined mean shape.After the procedure,all faces will have a quite similar3D shape. That is to say,with the same illumination,IRI is the same for different people.The corresponding face image under arbitrary lighting condition can be generated from the IRI. 
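The lighting-parameter estimate in Eqs. (2)-(5) above reduces to an ordinary least-squares problem. Below is a minimal numerical sketch, assuming the nine basis images are available as the columns of a NumPy array; the clipping to [0, 1] is only a crude way to honour the stated constraint 0 <= lambda_i <= 1, not the authors' own procedure.

```python
import numpy as np

def estimate_lighting(A, I_x):
    """A: (n_pixels, 9) matrix whose columns are the nine basis images; I_x: (n_pixels,) probe image."""
    lam, *_ = np.linalg.lstsq(A, I_x, rcond=None)   # least-squares solution of A @ lam = I_x, cf. Eq. (4)
    return np.clip(lam, 0.0, 1.0)                   # crude enforcement of 0 <= lambda_i <= 1

def synthesize(A, lam):
    return A @ lam                                  # Eq. (2): virtual image under lighting parameters lam
```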
Finally the face image is warped back to its original shape.From(7),we haveI ij(u,v)=I0j(u,v)R i(u,v)(9)Equation(9)means that,given the IRI under s i and the face image under the normal lighting,we can relight the face under s i.The face relighting problem can be defined as follows. Given one image,I a0,of people A under the normal ligh-ting s0,and one image,I bx,of another people B under some specific lighting s x,how to generate the image,I ax, of people A under lighting S x.Unlike[11,18],the IRI under each lighting is unknown in this paper.Given image I bx,the IRI under lighting s x can be calcu-lated using the LDISR described in Section2.Assume the LDISR,A,is learned from images of people M.The ligh-ting parameter,λx,of image I bx is solved by the least-square methodA T Aλλx=A T I bx(10) Aλλx is the image of people M under lighting s x,denoted as I mx.The IRI under lighting s x can be calculated byR x(u,v)=I xm(u,v)/I0m(u,v)(11)where I0m is the image of people M under normal lighting. After the IRI under lighting s x is calculated,the face image of people A can be relit under lighting s x by I xa(u,v)= I0a(u,v)R x(u,v).In general,given face image I0y of arbitrary face Y under lighting s0,face image of Y under arbitrary lighting can be generated by the following procedure:1.Detect face region I0y and align it using ASM;2.Warp I0y to the mean shape T0;3.Relight T0using the IRI under lighting s k:T k(u,v)=T0(u,v)R k(u,v);4.Reverse-warp the texture T k to its original shape toget the relit image I kyFig.3shows some relighting results on Yale Face Database B.In the experiments,the LDISR was con-structed by the nine basis images of people DZF(not in-cluded in Yale Face Database B).For each image under12ACTA AUTOMATICA SINICA Vol.33abec fd gFig.4Results of image reconstruction.a)Original images.b)Images reconstructed by 5-eigenimages.c)Images reconstructed by 3-basis images.d)Images reconstructed by the LDISR.e)The differences corresponding to the images in b).f)The differences corresponding to the images in c).g)The differences corresponding to the images in d).normal lighting in Yale Face Database B,the virtual im-ages under other 63lightings were generated.It should be highlighted that in the original IRI method [11,18],to calculate the IRI,the image under nor-mal lighting and the image under specific lighting must be of the same people.The LDISR based method proposed in this paper breaks this limitation and the face image used in the algorithm can be of different people.In addition,when no face image under normal lighting is available,the virtual image can be generated by using the given λx from (2).And the IRI will then be calculated according to the virtual image.4Experimental results4.12D image reconstructionThe experiment was based on the 427frontal face images under different lightings described in Section 2.In this experiment,three image reconstruction methods were im-plemented:5-eigenimages representation method proposed by Hallinan [5],a linear combination of 3-basis images pro-posed by Shashua [7],and the LDISR based method.The face images under different lightings were reconstructed and the performances were evaluated by the differences between the original and the reconstructed images.According to [5],PCA was adopted to train the 427images and the eigenvectors corresponding to the first 5eigenvalues were selected to construct face illumination sub-space I.According to [7],the selected 3basis images under three point light sources respectively were used to construct face 
Fig. 3 shows some relighting results on Yale Face Database B. In these experiments, the LDISR was constructed from the nine basis images of person DZF (not included in Yale Face Database B). For each image under normal lighting in Yale Face Database B, the virtual images under the other 63 lightings were generated.

It should be highlighted that in the original IRI method [11, 18], to calculate the IRI, the image under normal lighting and the image under a specific lighting must be of the same person. The LDISR-based method proposed in this paper breaks this limitation, and the face images used in the algorithm can be of different people. In addition, when no face image under normal lighting is available, a virtual image can be generated from the given \lambda_x using (2), and the IRI is then calculated from that virtual image.

4  Experimental results

4.1  2D image reconstruction

The experiment was based on the 427 frontal face images under different lightings described in Section 2. Three image reconstruction methods were implemented: the 5-eigenimages representation method proposed by Hallinan [5], the linear combination of 3 basis images proposed by Shashua [7], and the LDISR-based method. The face images under different lightings were reconstructed, and the performances were evaluated by the differences between the original and the reconstructed images.

According to [5], PCA was applied to the 427 images and the eigenvectors corresponding to the first 5 eigenvalues were selected to construct face illumination subspace I. According to [7], 3 selected basis images, each under a different point light source, were used to construct face illumination subspace II. The LDISR constructed from the nine basis images was face illumination subspace III. All 427 face images were reconstructed by the three face illumination subspaces, respectively. Some original images are shown in Fig. 4 a), and the images reconstructed using face illumination subspaces I, II, and III are shown in Fig. 4 b), c), and d), respectively. The corresponding differences are shown in Fig. 4 e), f), and g), respectively.

Fig. 4  Results of image reconstruction. a) Original images. b) Images reconstructed by 5 eigenimages. c) Images reconstructed by 3 basis images. d) Images reconstructed by the LDISR. e) The differences corresponding to the images in b). f) The differences corresponding to the images in c). g) The differences corresponding to the images in d).

It can be concluded from Fig. 4 that the performances of the 5-eigenimages representation method and the LDISR are comparable, and both are better than that of the 3-basis-images representation method. When the variation due to the lighting condition is large (Fig. 4 c), columns 2, 3, and 4), the differences between the original and the reconstructed images are very large (Fig. 4 f), columns 2, 3, and 4), especially when there are shadows in the face images.

To evaluate more rigorously, the fit function defined in [5] was adopted. The quality of the reconstruction is measured by the goodness of fit:

\varepsilon = 1 - \|I_{rec} - I_{in}\|^2 / \|I_{in}\|^2    (12)

where I_rec is the reconstructed image and I_in is the original image. The values of the fit function for all 427 reconstructions by the three methods are shown in Fig. 5.

Fig. 5  The values of the fit function corresponding to the reconstructions by the three methods.

From Fig. 5, it can be seen that the fit of the images reconstructed by the 5-eigenimages representation method and by the LDISR to the original images is very good, while that of the 3-basis-images representation method is not. When the variation in lighting is larger (corresponding to abscissas 50 and 280 in Fig. 5), the performance of the LDISR is better than that of the 5-eigenimages representation method. Besides, the 5-eigenimages and 3-basis-images representation methods need multiple images of each person and train one model per person, whereas the LDISR trains one model from the 9 basis images of a single person and can be applied to other persons via the warping technique.

4.2  Face recognition under varying lighting based on virtual images

In this experiment, the LDISR and the IRI method were combined to generate virtual face images, which were then used for face recognition under varying lighting. The experiments were based on the Yale Face Database B [10]. 64 frontal face images of each person under 64 different lightings were selected, giving 640 images of 10 persons. The images are divided into five subsets according to the angle the light source direction makes with the camera axis [10]: subset 1 (up to 12°), subset 2 (up to 25°), subset 3 (up to 50°), subset 4 (up to 77°), and subset 5 (the others).

Correlation, PCA, and LDA methods were adopted for face recognition. For the correlation method, the image under normal lighting of each person was the template image, and the remaining 63 images of each person were test images. For the PCA and LDA methods, three images of each person (those for which the angle between the light source direction and the camera axis is smallest) were training images, and the rest were test images.

The LDISR was constructed from the nine basis images of person DZF (not included in Yale Face Database B). For each frontal face image in Yale Face Database B, the virtual images corresponding to the other 63 lightings were generated using the LDISR and the IRI. In order to decrease the effect of illumination, we used gamma intensity correction (GIC), with γ = 4.
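The paper does not spell out the exact GIC mapping, so the following one-line Python sketch should be read only as a plausible form of gamma intensity correction (a 1/γ power law on normalized intensities), not as the authors' implementation.

import numpy as np

def gamma_intensity_correction(img, gamma=4.0):
    # Assumes img is a float array scaled to [0, 1]; compresses the dynamic
    # range so that differently lit images become more comparable.
    return np.power(np.clip(img, 0.0, 1.0), 1.0 / gamma)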
The three recognition methods were applied to the original images, the images with GIC, and the virtual images with GIC. The results are shown in Fig. 6, where "correlation", "PCA", and "LDA" correspond to the results for the original images; "GIC correlation", "GIC PCA", and "GIC LDA" correspond to the results for the images with GIC; and "GIC virtual correlation", "GIC virtual PCA", and "GIC virtual LDA" correspond to the results for the virtual images with GIC.

Fig. 6  The results of face recognition on Yale Face Database B.

Fig. 6 illustrates that the recognition accuracy on the virtual images is greatly improved; the larger the variation due to illumination, the greater the improvement. The recognition rates of correlation, PCA, and LDA on the virtual images are 87.24%, 87.99%, and 90.5%, respectively. For subsets 1, 2, and 3, in which the variations due to illumination are small, the performances of the three recognition methods are comparable, while on subsets 4 and 5, LDA performs better. This indicates that the classification ability of LDA is better than that of the other methods. In the future, we will validate the proposed method on larger face databases.

5  Conclusion

This paper proposes a method to construct an LDISR using the 9 basis images captured under the 9 basis point light sources. The LDISR can represent almost all face images under arbitrary lighting conditions. Combined with the IRI, the LDISR relates explicitly to the lighting conditions: it can estimate the lighting condition given a face image, and synthesize a virtual face image given a lighting condition. The reconstruction experiments illustrate that the representation ability of the LDISR is better than that of the 5-eigenimages and 3-basis-images representation methods. The experiments on Yale Face Database B confirm the ability of the LDISR to synthesize virtual face images and indicate that the virtual face images can greatly improve the accuracy of face recognition under varying lighting. The main advantage of the proposed model is that it can generate virtual images of anybody from only the 9 basis face images of one person; at the same time, the method needs neither knowledge of the lighting conditions nor a pre-calculated IRI.

References
1  Moses Y, Adini Y, Ullman S. Face recognition: the problem of compensating for changes in illumination direction. In: Proceedings of the Third European Conference on Computer Vision, Stockholm, Sweden. Springer-Verlag, 1994, 286-296
2  Sim T, Kanade T. Combining models and exemplars for face recognition: an illuminating example. In: Proceedings of the CVPR 2001 Workshop on Models versus Exemplars in Computer Vision, Hawaii, USA. IEEE, 2001, 1-10
3  Shashua A, Riklin-Raviv T. The quotient image: class-based re-rendering and recognition with varying illuminations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(2): 129-139
4  Wang H, Li Stan Z, Wang Y. Face recognition under varying lighting conditions using self quotient image. In: Proceedings of the 6th International Conference on Automatic Face and Gesture Recognition, Seoul, Korea. IEEE, 2004, 819-824
5  Hallinan P W. A low-dimensional representation of human faces for arbitrary lighting conditions. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, USA. IEEE, 1994, 995-999
6  Epstein R, Hallinan P W, Yuille A L. 5±2 eigenimages suffice: an empirical investigation of low dimensional lighting models. In: Proceedings of the IEEE Workshop on Physics-Based Vision, Boston, USA. IEEE, 1995, 108-116
7  Shashua A. On photometric issues in 3D visual recognition from a single 2D image. International Journal of Computer Vision, 1997, 21(1): 99-122
8  Belhumeur P, Kriegman D. What is the set of images of an object under all possible illumination conditions? International Journal of Computer Vision, 1998, 28(3): 245-260
9  Georghiades A S, Belhumeur P N, Kriegman D J. From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(6): 643-660
10  Lee K, Ho J, Kriegman D. Nine points of light: acquiring subspaces for face recognition under variable lighting. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Hawaii, USA. IEEE, 2001, 1: 519-526
11  Gao W, Shan S, Chai X, Fu X. Virtual face image generation for illumination and pose insensitive face recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong. IEEE, 2003, 776-779
12  Zhao W, Chellappa R. Robust face recognition using symmetric shape-from-shading. Technical Report CAR-TR-919, Center for Automation Research, University of Maryland, College Park, MD, 1999
13  Basri R, Jacobs D. Lambertian reflectance and linear subspaces. In: Proceedings of the 8th IEEE International Conference on Computer Vision, Vancouver, Canada. IEEE, 2001, 2: 383-390
14  Ramamoorthi R, Hanrahan P. An efficient representation for irradiance environment maps. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, USA. ACM Press, 2001, 497-500
15  Ramamoorthi R. Analytic PCA construction for theoretical analysis of lighting variability in images of a Lambertian object. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(10): 1322-1333
16  Chen H F, Belhumeur P N, Jacobs D W. In search of illumination invariants. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, South Carolina, USA. IEEE, 2000, 1: 254-261
17  Wang H, Li Stan Z, Wang Y, Zhang W. Illumination modeling and normalization for face recognition. In: Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures, Nice, France. IEEE, 2003, 104-111
18  Zhao J, Su Y, Wang D, Luo S. Illumination ratio image: synthesizing and recognition with varying illuminations. Pattern Recognition Letters, 2003, 24(15): 2703-2710
19  Wen Z, Liu Z, Huang T S. Face relighting with radiance environment maps. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison, USA. IEEE, 2003, 2: 158-165
20  Qing L, Shan S, Chen X. Face relighting for face recognition under generic illumination. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada. IEEE, 2004, 5: 733-736

HU Yuan-Kui  Ph.D. candidate in the Department of Automation at the University of Science and Technology of China. His research interests include face recognition, image processing, and pattern recognition.

WANG Zeng-Fu  Professor in the Department of Automation at the University of Science and Technology of China. His current research interests include audio and vision information processing, intelligent robots, and pattern recognition. Corresponding author of this paper. E-mail: zfwang@
Facial Expression Recognition (lecture slides)

… University of Science and Technology, Nanjing University of Science and Technology, Northern Jiaotong University, and other institutions all have researchers working on facial expression recognition.
The main current approaches to facial expression recognition:
➢ Template-matching-based facial expression recognition
➢ Neural-network-based facial expression recognition
➢ Rule-based facial expression recognition
➢ Stochastic-sequence-model-based facial expression recognition
➢ Other methods, such as support vector machines and wavelet analysis
Main work of the thesis
4. Radial basis function (RBF) neural networks are applied to the fusion of facial expression features, and facial expression recognition based on RBF-network multi-feature fusion is proposed.
Information fusion and facial expression analysis
➢ Information fusion is the reasoning process of merging and condensing target information from multiple information sources into an output with a unified representation. Its basic premise is, through reasonable control and use of the information provided by these sources, to exploit the multiple sources in …
Facial expression recognition based on cascaded neural networks
➢ Structure of the cascaded-network facial expression recognizer
➢ Algorithm flow of the BP network
➢ Experimental results of cascaded-network facial expression recognition
Structure of the cascaded-network facial expression recognizer
[Flowchart: a 320×243 original image undergoes automatic localization and face cropping to produce a 50×60 cropped image; the face image is then preprocessed (shape normalization and gray-level normalization); SOM-based feature extraction, fusion, and expression classification produce the recognition result.]
Facial expression recognition based on feature-level fusion
In this approach, features are extracted from the observation data of each sensor to obtain a feature vector; these feature vectors are then fused, and facial expression recognition and classification are carried out on the fused feature vector (a minimal code sketch follows the diagram below).
[Diagram: each face image passes through feature extraction; the resulting facial feature vectors are fused before expression recognition.]
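A minimal sketch of feature-level fusion as described above (Python/NumPy). The extractor and classifier objects are placeholders, not part of the original slides; any concrete feature extractors (e.g., geometric and texture features) and any trained classifier (e.g., an RBF network) could be plugged in.

import numpy as np

def fuse_and_classify(feature_extractors, face_image, classifier):
    # One feature vector per source, concatenated into a single fused vector,
    # which a single classifier then maps to an expression label.
    parts = [np.asarray(f(face_image), dtype=float).ravel()
             for f in feature_extractors]
    fused = np.concatenate(parts)
    return classifier(fused)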
Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-based Classification

Enrique G. Ortiz, Alan Wright, and Mubarak Shah
Center for Research in Computer Vision, University of Central Florida, Orlando, FL
eortiz@, alanwright@, shah@

Abstract

This paper presents an end-to-end video face recognition system, addressing the difficult problem of identifying a video face track using a large dictionary of still face images of a few hundred people, while rejecting unknown individuals. A straightforward application of the popular ℓ1-minimization for face recognition on a frame-by-frame basis is prohibitively expensive, so we propose a novel algorithm, Mean Sequence SRC (MSSRC), that performs video face recognition using a joint optimization leveraging all of the available video data and the knowledge that the face track frames belong to the same individual. By adding a strict temporal constraint to the ℓ1-minimization that forces individual frames in a face track to all reconstruct a single identity, we show the optimization reduces to a single minimization over the mean of the face track. We also introduce a new Movie Trailer Face Dataset collected from 101 movie trailers on YouTube. Finally, we show that our method matches or outperforms the state-of-the-art on three existing datasets (YouTube Celebrities, YouTube Faces, and Buffy) and our unconstrained Movie Trailer Face Dataset. More importantly, our method excels at rejecting unknown identities by at least 8% in average precision.

1. Introduction

Face recognition has received widespread attention for the past three decades due to its wide applicability. Only recently has this interest spread into the domain of video, where the problem becomes more challenging due to the person's motion and changes in both illumination and occlusion. However, video also has the benefit of providing many samples of the same person, thus offering the opportunity to convert many weak examples into a strong prediction of the identity.

Figure 1. This paper addresses the difficult problem of identifying a video face track using a large dictionary of still face images of a few hundred people, while rejecting unknown individuals.
As video search sites like YouTube have grown, video content-based search has become increasingly necessary. For example, a capable retrieval system should return all videos containing specific actors upon a user's request. On sites like YouTube, where a cast list or script may not be available, the visual content is the key to accomplishing this task successfully. The main drawback is the availability of annotated video face tracks.

With the advent of social networking and photo sharing, computer vision tasks on the Internet have become increasingly fascinating and viable. This avenue is one little exploited by video face recognition. Although large collections of annotated individuals in videos are not freely available, collecting annotated still images is easily doable, as witnessed by datasets like Labeled Faces in the Wild (LFW) [12] and Public Figures (PubFig) [16]. Due to this wide availability, we employ large databases of still images to recognize individuals in videos, as depicted in Figure 1.

Existing video face recognition methods tend to perform classification on a frame-by-frame basis and later combine those predictions using an appropriate metric. A straightforward application of ℓ1-minimization in this fashion is very computationally expensive. In contrast, we propose a novel method, Mean Sequence Sparse Representation-based Classification (MSSRC), that performs a joint optimization over all faces in the track at once. Though this seems expensive, we show that this optimization reduces to a single ℓ1-minimization over the mean face track, thus reducing a many-classification problem to a single one with inherent computational and practical benefits.

Figure 2. Video Face Recognition Pipeline. With a video as input, we perform face detection and track a face throughout the video clip. Then we extract, PCA, and concatenate three features: Gabor, LBP, and HOG. Finally, we perform face recognition using our novel algorithm MSSRC with an input face track and a dictionary of still images.

Our proposed method aims to perform video face recognition across domains, leveraging thousands of labeled still images gathered from the Internet, specifically the PubFig and LFW datasets, to perform face recognition on real-world, unconstrained videos. To do this, we collected 101 movie trailers from YouTube and automatically extracted and tracked faces in the videos to create a dataset for video face recognition ( ). Furthermore, we explore the often little-studied open-universe scenario, in which it is important to recognize and reject unknown identities, i.e., we identify famous actors appearing in movie trailers while rejecting background faces that represent unknown extras. We show our method outperforms existing methods in precision and recall, exhibiting the ability to better reject unknown or uncertain identities.

The contributions of this paper are summarized as follows: (1) We develop a fully automatic end-to-end system for video face recognition, which includes face tracking and recognition, leveraging information from both still images for the known dictionary and video for recognition. (2) We propose a novel algorithm, MSSRC, that performs video face recognition using an optimization leveraging all of the available video data. (3) We show that our method matches or outperforms the state-of-the-art on three existing datasets (YouTube Faces, YouTube Celebrities, and Buffy) and our unconstrained Movie Trailer Face Dataset.

The rest of this paper is organized as follows: Section 2 discusses the related work on video face recognition. Then
Section 3 describes our entire framework for video face recognition, from tracking to recognition. Next, in Section 4, we describe our unconstrained Movie Trailer Face Dataset. Section 5 exhaustively evaluates our method on existing video datasets and our new dataset. Finally, we end with a summary of conclusions and future work in Section 6.

2. Related Work

For a complete survey of video-based face recognition refer to [18]; here we focus on an overview of the most related methods. Current video face recognition techniques fall into one of three categories: key-frame based, temporal model based, and image-set matching based.

Key-frame based methods generally predict the identity of each key frame in a face track, followed by probabilistic fusion or majority voting to select the best match. Due to the large variations in the data, key-frame selection is crucial in this paradigm [4]. Zhao et al.'s [25] work is most similar to ours in that they use a database of still images collected from the Internet. They learn a model over this dictionary by learning key faces via clustering. These cluster centers are compared to test frames using a nearest-neighbor search, followed by majority, probabilistic voting to make a final prediction. We, on the other hand, use a classification scheme that enhances robustness by finding an agreement amongst the individual frames in a single optimization.

Temporal model based methods learn the temporal, facial dynamics of the face throughout a video. Several methods employ Hidden Markov Models (HMMs) for this end, e.g., [14]. Most related to us, Hadid et al. [10] use a still-image training library by imposing motion information on it to train an HMM, and Zhou et al. [26] probabilistically generalize a still-image library to do video-to-video matching.
Generally, training these models is prohibitively expensive, especially when the dataset size is large.

Image-set matching based methods allow a face track to be modeled as an image set. Many methods, like [24], use a mutual subspace distance, where each face track is modeled in its own subspace and a distance is computed between subspaces. They are effective with clean data, but these methods are very sensitive to the variations inherent in video face tracks. Other methods take a more statistical approach, like [5], which used Logistic Discriminant-based Metric Learning (LDML) to learn a relationship between images in face tracks such that the inter-class distances are maximized. LDML is very computationally expensive and focuses more on learning relationships within the data, whereas we directly relate the test track to the training data.

Character recognition methods have been very popular due to their application to movies and sitcoms. [8, 19] perform person identification, where they use all available information, e.g., clothing appearance and audio, to identify the cast rather than the facial information alone. Another [3] used a small, user-selected sample of characters in the given movie with a pixel-wise Euclidean distance to handle occlusion, while others [2] use a manifold for known characters which successfully clusters input frames. While character recognition is suitable for a long-running series, the use of clothing and other contextual cues is not helpful in the task of identifying actors across movies, TV shows, or unrelated video clips. In these scenarios, our approach of focusing on the facial recognition aspect from still images is more adept in unconstrained environments.

Still-image based literature is vast, but a popular approach is Wright et al.'s [23] Sparse Representation-based Classification (SRC), in which they present the principle that a given test image can be represented by a linear combination of images from a large dictionary of faces. The key concept is enforcing sparsity, since a test face is best reconstructed from a small subset of the large dictionary, i.e., training faces of the same class. A straightforward adaptation of this method would be to perform estimation on each frame and fuse the results probabilistically, similarly to key-frame based methods. However, ℓ1-minimization is known to be computationally expensive, thus we propose a constrained optimization with the knowledge that the images within a face track are of the same person. We show that imposing this fact reduces the problem to computing a single ℓ1-minimization over the average face track.

3. Video Face Recognition Pipeline

In this section, we describe our end-to-end video face recognition system. First, we detail our algorithm for face tracking based on face detections from video. Next, we chronicle the features we use to describe the faces and handle variations in pose, lighting, and occlusion. Finally, we derive our optimization for video face recognition that classifies a video face track based on a dictionary of still images.

3.1. Face Tracking

Our method performs the difficult task of face tracking based on face detections extracted using the high-performance SHORE face detection system [15], and generates a face track based on two metrics. To associate a new detection with an existing track, our first metric determines the ratio of the maximum-sized bounding box encompassing both face detections to the size of the larger bounding box of the two detections. The formulation is as follows:

d_{spatial} = \frac{w \cdot h}{\max(h_1 w_1,\, h_2 w_2)},    (1)
where (x_1, y_1, w_1, h_1) and (x_2, y_2, w_2, h_2) are the (x, y) locations and the widths and heights of the detections in the previous and current frames, respectively. The overall width w and height h are computed as w = \max(x_1 + w_1, x_2 + w_2) - \min(x_1, x_2) and h = \max(y_1 + h_1, y_2 + h_2) - \min(y_1, y_2). Intuitively, this metric encodes the dimensional similarity of the current and previous bounding boxes, intrinsically considering the spatial information.

The second tracking metric takes into account the appearance information via a local color histogram of the face. We compute the distance as the ratio of the intersection of the RGB histograms (30 bins per channel) of the last face of a track and the current detection to the total summation of the histogram bins:

d_{appearance} = \sum_{i=1}^{n} \min(a_i, b_i) \Big/ \sum_{i=1}^{n} (a_i + b_i),    (2)

where a and b are the histograms of the current and previous faces. We compare each new face detection to the existing tracks; if the location and appearance metrics are similar, the face is added to the track, otherwise a new track is created. Finally, we use a global histogram of the entire frame, encoding scene information, to detect scene boundaries, and we impose a lifespan of 20 frames without a detection to end tracks.
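The two association metrics translate directly into code. The sketch below (Python/NumPy) is illustrative only: box coordinates follow the (x, y, w, h) convention of equations (1) and (2), and the surrounding track-management logic (thresholds, scene-cut detection, the 20-frame lifespan) is omitted.

import numpy as np

def spatial_similarity(box_prev, box_curr):
    # Eq. (1): enclosing-box area relative to the larger of the two detections.
    x1, y1, w1, h1 = box_prev
    x2, y2, w2, h2 = box_curr
    w = max(x1 + w1, x2 + w2) - min(x1, x2)
    h = max(y1 + h1, y2 + h2) - min(y1, y2)
    return (w * h) / max(h1 * w1, h2 * w2)

def appearance_similarity(hist_prev, hist_curr):
    # Eq. (2): histogram intersection normalized by the total mass of both
    # RGB histograms (30 bins per channel in the paper).
    a = np.asarray(hist_prev, dtype=float)
    b = np.asarray(hist_curr, dtype=float)
    return np.minimum(a, b).sum() / (a + b).sum()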
3.2. Feature Extraction

Because real-world datasets contain pose variations even after alignment, we use three fast and popular local features: Local Binary Patterns (LBP) [1], Histogram of Oriented Gradients (HOG) [7], and Gabor wavelets [17]. More features aid recognition, but at a higher computational cost.

Before feature extraction, all images are first eye-aligned using eye locations from SHORE and normalized by subtracting the mean, removing the first-order brightness gradient, and performing histogram equalization. Gabor wavelets were extracted with one scale λ = 4 at four orientations θ = {0°, 45°, 90°, 135°} with a tight face crop at a resolution of 25x30 pixels. A null Gabor filter includes the raw pixel image (25x30) in the descriptor. The standard LBP^{u2}_{8,2} and HOG descriptors are extracted from 72x80 loosely cropped images with histogram sizes of 59 and 32 over 9x10 and 8x8 pixel patches, respectively. All descriptors were scaled to unit norm, dimensionality reduced with PCA to 1536 dimensions each, and zero-meaned.

3.3. Mean Sequence Sparse Representation-based Classification (MSSRC)

Given a test image y and training set A, we know that the images of the class to which y should match are a small subset of A, and their relationship is modeled by y = Ax, where x is the coefficient vector relating them. Therefore, the coefficient vector x should only have non-zero entries for those few images of the same class and zeros for the rest. Imposing this sparsity constraint upon the coefficient vector x results in the following formulation:

\hat{x}_1 = \arg\min_x \|y - Ax\|_2^2 + \lambda \|x\|_1,    (3)

where the ℓ1-norm enforces a sparse solution by minimizing the absolute sum of the coefficients.

Algorithm 1  Mean Sequence SRC (MSSRC)
1. Input: training gallery A, test face track Y = [y_1, y_2, ..., y_M], and sparsity weight parameter λ.
2. Normalize the columns of A to have unit ℓ2-norm.
3. Compute the mean of the track, \bar{y} = \sum_{m=1}^{M} y_m / M.
4. Normalize \bar{y} to unit ℓ2-norm.
5. Solve the ℓ1-minimization problem \tilde{x}_1 = \arg\min_x \|\bar{y} - Ax\|_2^2 + \lambda \|x\|_1.
6. Compute the residual error for each class j ∈ [1, C]: r_j(\bar{y}) = \|\bar{y} - A_j x_j\|_2.
7. Output: identity I(\bar{y}) = \arg\min_j r_j(\bar{y}) and confidence P(I ∈ [1, C] | \bar{y}) = (C \cdot \max_j \|x_j\|_1 / \|\tilde{x}\|_1 - 1)/(C - 1).

The leading principle of our method is that all of the images y in the face track Y = [y_1, y_2, ..., y_M] belong to the same person. Because all images in a face track belong to the same person, one would expect a high degree of correlation amongst the sparse coefficient vectors x_j for all j ∈ [1, ..., M], where M is the length of the track. Therefore, we can look for an agreement on a single coefficient vector x determining the linear combination of training images A that makes up the unidentified person. In fact, with sufficient similarity between the faces in a track, one might expect nearly the same coefficient vector to be recovered for each frame. This provides the intuition for our approach: we enforce a single coefficient vector for all frames. Mathematically, this means the sum of squared residual errors over the frames should be minimized. We enforce this constraint on the ℓ1 solution of Eqn. 3 as follows:

\tilde{x}_1 = \arg\min_x \sum_{m=1}^{M} \|y_m - Ax\|_2^2 + \lambda \|x\|_1    (4)

where we minimize the ℓ2 error over the entire image sequence, while assuming the coefficient vector x is sparse and the same over all of the images.

Focusing on the first part of the equation, more specifically the ℓ2 portion, we can rearrange it as follows:

\sum_{m=1}^{M} \|y_m - Ax\|_2^2 = \sum_{m=1}^{M} \|y_m - \bar{y} + \bar{y} - Ax\|_2^2
 = \sum_{m=1}^{M} \left( \|y_m - \bar{y}\|_2^2 + 2(y_m - \bar{y})^T(\bar{y} - Ax) + \|\bar{y} - Ax\|_2^2 \right),    (5)

where \bar{y} = \sum_{m=1}^{M} y_m / M. However, the cross term vanishes:

\sum_{m=1}^{M} 2(y_m - \bar{y})^T(\bar{y} - Ax) = 2\left(\sum_{m=1}^{M} y_m - M\bar{y}\right)^T (\bar{y} - Ax) = 0.

Thus, Eq. 5 becomes

\sum_{m=1}^{M} \|y_m - Ax\|_2^2 = \sum_{m=1}^{M} \|y_m - \bar{y}\|_2^2 + M \|\bar{y} - Ax\|_2^2,    (6)

where the first part of the sum is a constant. Therefore, we obtain the final simplification of our original minimization:

\tilde{x}_1 = \arg\min_x \sum_{m=1}^{M} \|y_m - Ax\|_2^2 + \lambda \|x\|_1
 = \arg\min_x M\|\bar{y} - Ax\|_2^2 + \lambda \|x\|_1
 = \arg\min_x \|\bar{y} - Ax\|_2^2 + \lambda \|x\|_1    (7)

where M, by division, is absorbed into the constant weight λ.

By this sequence, our optimization reduces to the ℓ1-minimization of x for the mean face track \bar{y}. This conclusion, that enforcing a single, consistent coefficient vector x across all images in a face track Y is equivalent to a single ℓ1-minimization over the average of all the frames in the face track, is key to keeping our approach robust yet fast. Instead of performing M individual ℓ1-minimizations over each frame and classifying via some voting scheme, our approach performs a single ℓ1-minimization on the mean of the face track, which is not only a significant speed-up, but also theoretically sound. Furthermore, we empirically validate in subsequent sections that our approach outperforms other forms of temporal fusion and voting amongst individual frames.

Finally, we classify the average test track \bar{y} by determining the class of training samples that best reconstructs the face from the recovered coefficients:

I(\bar{y}) = \arg\min_j r_j(\bar{y}) = \arg\min_j \|\bar{y} - A_j x_j\|_2,    (8)

where the label I(\bar{y}) of the test face track is given by the minimal residual (reconstruction error) r_j(\bar{y}), and x_j denotes the recovered coefficients from the global solution \tilde{x}_1 that belong to class j. Confidence in the determined identity is obtained using the Sparsity Concentration Index (SCI), which measures how distributed the coefficients are across classes:

SCI = \frac{C \cdot \max_j \|x_j\|_1 / \|\tilde{x}\|_1 - 1}{C - 1} \in [0, 1],    (9)

ranging from 0 (the test face is represented equally by all classes) to 1 (the test face is fully represented by one class).
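For illustration, the sketch below follows Algorithm 1 and equations (7)-(9) using scikit-learn's Lasso as a generic ℓ1 solver; the choice of solver, the helper names, and the input layout (A as a d×n matrix of training images, labels as an n-vector of class ids, track as an m×d array of frame features) are assumptions, not the authors' implementation.

import numpy as np
from sklearn.linear_model import Lasso

def mssrc(A, labels, track, lam=0.01):
    # Steps 2-4: unit-norm columns, mean of the track, unit-norm mean.
    A = A / np.linalg.norm(A, axis=0, keepdims=True)
    y_bar = track.mean(axis=0)
    y_bar = y_bar / np.linalg.norm(y_bar)
    # Step 5 / eq. (7): a single l1-regularized fit on the mean face track.
    x = Lasso(alpha=lam, fit_intercept=False, max_iter=5000).fit(A, y_bar).coef_
    # Step 6 / eq. (8): class-wise reconstruction residuals.
    classes = np.unique(labels)
    residuals = np.array([np.linalg.norm(y_bar - A[:, labels == c] @ x[labels == c])
                          for c in classes])
    identity = classes[residuals.argmin()]
    # Step 7 / eq. (9): Sparsity Concentration Index as the confidence.
    C = len(classes)
    max_class_l1 = max(np.abs(x[labels == c]).sum() for c in classes)
    sci = (C * max_class_l1 / (np.abs(x).sum() + 1e-12) - 1.0) / (C - 1.0)
    return identity, sci

Because only one ℓ1 fit is needed per track, the cost is independent of the track length, which is the practical point of the reduction in (7).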
4. Movie Trailer Face Dataset

Existing datasets do not capture the large-scale identification scope we wish to evaluate. The YouTube Celebrities Dataset [14] has unconstrained videos from YouTube; however, they are of very low quality and there are only 3 unique videos per person, which they segment. The YouTube Faces Dataset [22] and the Buffy Dataset [5] also exhibit more challenging scenarios than traditional video face recognition datasets; however, YouTube Faces is geared towards face verification (same vs. not same), and Buffy only contains 8 actors. Thus, both are ill-suited for the large-scale face identification of our proposed video retrieval framework.

Figure 3. The distribution of face tracks across the identities in PubFig+10.

We built our Movie Trailer Face Dataset using 101 movie trailers from YouTube from the 2010 release year that contained celebrities present in the supplemented PubFig+10 dataset. These videos were then processed to generate face tracks using the method described above. The resulting dataset contains 4,485 face tracks, 65% consisting of unknown identities (not present in PubFig+10) and 35% known. The class distribution is shown in Fig. 3, with the number of face tracks per celebrity in the movie trailers ranging from 5 to 60 labeled samples. The fact that half of the public figures do not appear in any of the movie trailers presents an interesting test scenario in which the algorithm must be able to distinguish the subject of interest from within a large pool of potential identities.

5. Experiments

In this section, we first compare our tracking method to a standard method used in the literature. Then, we evaluate our video face recognition method on three existing datasets: YouTube Faces, YouTube Celebrities, and Buffy. We also evaluate several algorithms, including MSSRC (ours), on our new Movie Trailer Face Dataset, showing the strengths and weaknesses of each and thus experimentally validating our algorithm.

5.1. Tracking Results

To analyze the quality of our automatically generated face tracks, we ground-truthed five movie trailers from the dataset: 'The Killer Inside', 'My Name is Khan', 'Biutiful', 'Eat, Pray, Love', and 'The Dry Land'. Following the tracking literature [13], we use two CLEAR MOT metrics, Multiple Object Tracking Precision and Accuracy (MOTP and MOTA), for evaluation; these account for the issues faced by trackers better than standard accuracy, precision, or recall. The MOTA tells us how well the tracker did overall with respect to all of the ground-truth labels, while the MOTP appraises how well the tracker performed on the detections that exist in the ground-truth.
images from LFW. The YouTube Faces Dataset consists of5,000video pairs, half same and half not.The videos are divided into10splits each with500pairs.The results are averaged over the ten splits,where for each split one is used for testing and the remaining nine for training.Thefinal results are presented in terms of accuracy,area under the curve,and equal error rate.As seen in Table4,we obtain competitive results withMethod Accuracy(%)HMM[14]71.24MDA[20]67.20SANP[11]65.03COV+PLS[21]70.10UISA[6]74.60MSSRC(Ours)80.75Table3.YouTube Celebrities Dataset.We outperform the best reported result by6%.Method Accuracy(%)LDML[5]85.88MSSRC(Ours)86.27Table4.Buffy Dataset.We obtain a slight gain in accuracy over the reported method.the top performing method MBGS[22],within1%in terms of accuracy,and MSSRC even surpasses it in terms of area under the curve(AUC)by just below1%with a lower equal error rate by0.7%.We perform all experiments with the same LBP data provided by[22]and aτvalue of0.0005.5.3.YouTube Celebrities DatasetThe YouTube Celebrities Dataset[14]consists of47 celebrities(actors and politicians)in1910video clips downloaded from YouTube and manually segmented to the portions where the celebrity of interest appears.There are approximately41clips per person segmented from3unique videos per actor.The dataset is challenging due to pose,il-lumination,and expression variations,as well as high com-pression and low ing our tracker,we successfully tracked92%of the videos as compared to the80%tracked in their paper[14].The standard experimental setup selects 3training clips,1from each unique video,and6test clips, 2from each unique video,per person.In Table3,we sum-marize reported results on YouTube Celebrities,where we outperform the state-of-the-art by at least6%.5.4.Buffy DatasetThe Buffy Dataset consists of639manually annotated face tracks extracted from episodes9,21,and45from dif-ferent seasons of the TV series“Buffy the Vampire Slayer”. 
They generated tracks using the KLT-based method[8] (available on the author’s website).For features,we com-pute SIFT descriptors at9fiducial points as described in[5] and use their experimental setup with312tracks for train-ing and327testing.They present a Logistic Discriminant-based Metric Learning(LMDL)method that learns a sub-space.In their supervised experiments,they tried several classifiers with each obtaining similar results.However,us-ing our classifier,there is a slight improvement.MethodAP (%)Recall (%)NN 9.530.00SVM50.069.69LDML [5]19.480.00L236.160.00SRC (First Frame)42.1513.39SRC (V oting)54.8823.47MSSRC (Ours)58.7030.23Table 5.Movie Trailer Face Dataset.MSSRC outperforms all of the non-SRC methods by at least 8%in AP and 20%recall at 90%precision.1020304050607080901000102030405060708090100P r e c i s i o n (%)Recall (%)NN SVM LDML L2SRC (1 Frame)SRC (Voting)MSSRC (Ours)Figure 4.Precision vs.Recall for the Movie Trailer Face Dataset.MSSRC rejects unknowns or distractors better than all others.5.5.Movie Trailer Face DatasetIn this section,we present results on our unconstrained Movie Trailer Face Dataset that allows us to test larger scale face identification,as well as each algorithms ability to re-ject unknown identities.In our test scenario,we chose the Public Figures (PF)[16]dataset as our training gallery,sup-plemented by images collected of 10actors and actresses from web searches for additional coverage of face tracks extracted from movie trailers.We also cap the maximum number of training images per person in the dataset to 200for better performance due to the fact that predictions are otherwise skewed towards the people with the most exam-ples.The distribution of face tracks across all of the identi-ties in the PubFig+10dataset are shown in Fig.3.In total,PubFig+10consists of 34,522images and our Movie Trailer Face Dataset has 4,485face tracks,which we use to conduct experiments on several algorithms.5.5.1Algorithmic ComparisonThe tested methods include NN,LDML,SVM,L2,SRC,and our method MSSRC.For the experiments with NN,LDML,SVM,L2,and SRC,we test each individual frame of the face track and predict its final identity via probabilis-tic voting and its confidence is an average over the predicted distances or decision values.The confidence values are used to reject predictions to evaluate the precision and recall of the system.Note all MSSRC experiments are performed with a λvalue of 0.01.We present results in terms of preci-sion and recall as defined in [8].Table 5presents the results for the described methods on the Movie Trailer Face Dataset in terms of two measures,average precision and recall at 90%precision.NN performs very poorly in terms of both metrics,which explains why NN based methods have focused on finding “good”key-frames to test on.LMDL struggles with the larger num-ber of training classes vs.the Buffy experiment with only 19.48%average precision.The L2method performs sur-prisingly well for a simple method.We also tried Mean L2with similar performance.The SVM and SRC based meth-ods perform very closely at high recall,but not in terms of AP and recall at 90%precision with MSSRC outperforming SVM by 8%and 20%respectively.In Fig.4,the SRC based methods reject unknown identities better than the others.The straightforward application of SRC on a frame-by-frame basis and our efficient method MSSRC perform within 4%of each other,thus experimentally validating that MSSRC is computationally equivalent to performing stan-dard SRC on each individual frame.Instead of 
computing SRC on each frame,which takes approximately 45minutes per track,we reduce a face track to a single feature vector for 1-minimization (1.5min/track).Surprisingly,MSSRC obtains better recall at 90%precision by 7%and 4%in aver-age precision.Instead of fusing results after classification,as done on the frame by frame methods,MSSRC benefits in better rejection of uncertain predictions.In terms of timing,the preprocessing steps of tracking runs identically for SRC and MSSRC at 20fps and feature extraction runs at 30fps.For identification,MSSRC classifies at 20milliseconds per frame,whereas SRC on a single frame takes 100millisec-onds.All other methods classify in less than 1ms,however with a steep drop in precision and recall.5.5.2Effect of Varying Track LengthThe question remains,do we really need all of the images?To answer this question we select the first m frames for each track and test the two best performing methods from the previous experiments:MSSRC and SVM.Fig.5shows that at just after 20frames performance plateaus,which is close to the average track length of 22frames.Most impor-tantly,the results show that using multiple frames is ben-eficial since moving from using 1frame to 20frames re-sults in a 5.57%and 16.03%increase in average precision and recall at 90%precision respectively for MSSRC.Fur-。