2009 - Research Status of Facial Expression Robots (岳翠萍)


Application of CNN-Based Emotion Recognition for the Elderly in Robot Care

With the aging of the population, care for the elderly is receiving increasing attention.

Traditional approaches to elder care require substantial manpower and resources, and they struggle to meet the personalized needs of older people.

In recent years, emotion recognition for the elderly based on convolutional neural networks has gradually been applied in care robots, offering smarter and more convenient care services.

A convolutional neural network (CNN) is an artificial-intelligence model loosely inspired by the structure of biological neural networks.

Through stacked convolution, pooling, and related operations, it extracts features from inputs such as images and sound and performs classification or recognition.

In elderly emotion recognition, a CNN can analyze an older person's facial expressions, speech, and other cues to estimate their emotional state, allowing the robot to provide more intelligent care.

First, the technique can judge emotional state from facial expressions.

Facial expressions are a primary channel of human emotion, and changes in expression can be used to infer a person's emotional state.

A CNN extracts expression features through its convolution and pooling layers and passes them to a classifier that assigns an emotion category.

Trained on a large set of labeled samples, the network learns the correspondence between facial expressions and emotions and can then judge an older person's state in real time.
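The paragraph above can be made concrete with a minimal sketch of such a classifier. This is only an illustrative example, not a system described in the text: the PyTorch framework, the 48x48 grayscale input size, the layer widths, and the seven emotion classes are all assumptions chosen for clarity.

import torch
import torch.nn as nn

class ExpressionCNN(nn.Module):
    """Minimal CNN mapping 48x48 grayscale face crops to 7 emotion classes."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                  # 48x48 -> 24x24
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                  # 24x24 -> 12x12
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                  # 12x12 -> 6x6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),      # logits over emotion categories
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Usage sketch: a batch of preprocessed face crops from the care robot's camera.
model = ExpressionCNN()
faces = torch.randn(8, 1, 48, 48)             # placeholder input
probs = torch.softmax(model(faces), dim=1)    # per-class emotion probabilities

In practice such a model would be trained on a labeled facial-expression dataset before its predictions are used to adapt the robot's behavior.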

Second, the technique can also judge emotional state from speech.

Speech is another important channel of emotional expression; intonation and other acoustic features of the voice carry cues to a person's emotional state.

By applying convolution and pooling to a suitable representation of the speech signal, a CNN extracts acoustic features and feeds them to a classifier for emotion categories.

Trained on a large corpus of speech samples, the network learns the correspondence between vocal characteristics and emotions and can judge the speaker's state in real time.
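For the speech channel, a common first step is to turn the waveform into a time-frequency "image" that a CNN can process. The sketch below assumes the librosa library, a 16 kHz sampling rate, and 64 mel bands; these are illustrative choices, not details given in the text.

import numpy as np
import librosa
import torch

def speech_to_logmel(path: str, sr: int = 16000, n_mels: int = 64) -> torch.Tensor:
    """Convert a speech recording into a normalized log-mel spectrogram for a CNN."""
    y, sr = librosa.load(path, sr=sr)                      # mono waveform
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel, ref=np.max)          # log scale (dB)
    logmel = (logmel - logmel.mean()) / (logmel.std() + 1e-6)
    return torch.from_numpy(logmel).float()[None, None]    # shape (1, 1, n_mels, frames)

# The resulting 2-D "image" can then be classified by a CNN of the same
# convolution + pooling + classifier form as the facial-expression sketch above.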

Applied in care robots, CNN-based emotion recognition helps the robot better understand an older person's needs and provide personalized care.

By recognizing the person's emotional state in real time, the robot can adjust its behavior to match their needs.

Research Advances in Facial Expression Recognition for Human-Computer Interaction (薛雨丽)

Journal of Image and Graphics (中国图象图形学报), Vol. 14, No. 5, May 2009. Research Advances in Facial Expression Recognition in Human-Computer Interaction. 薛雨丽, 毛峡, 郭叶, 吕善伟 (School of Electronic and Information Engineering, Beihang University, Beijing 100191). Abstract: With the rapid development of human-computer interaction and affective computing technologies, facial expression recognition has become a research hotspot.

To clarify the research directions and progress of facial expression recognition in human-computer interaction, the paper reviews the state of the art in facial expression databases, expression feature extraction, expression classification methods, robust expression recognition, fine-grained expression recognition, mixed expression recognition, and non-basic expression recognition.

It concludes by summarizing current hotspots and trends, pointing out the limitations of existing work, and offering an outlook on future development.

Keywords: expression recognition; emotional facial expression; facial expression database. CLC number: TP391.4; Document code: A; Article ID: 1006-8961(2009)05-764-09.
The Research Advance of Facial Expression Recognition in Human Computer Interaction. XUE Yu-li, MAO Xia, GUO Ye, LV Shan-wei (School of Electronic and Information Engineering, Beihang University, Beijing 100191). Abstract: Along with the rapid progress of the technologies in human computer interaction and affective computing, facial expression recognition has been actively researched. To clarify the research direction and development of facial expression recognition in human computer interaction, the research states of facial expression recognition are analyzed from the aspects of facial expression database, facial feature extraction, facial expression classification methods, robust facial expression recognition, fine-graded facial expression recognition, mixed facial expression recognition, and non-basic facial expression recognition. Finally, the study hotspots and trends of facial expression recognition are concluded, the limits in facial expression recognition are pointed out, and the expectation of the development of facial expression recognition is given. Keywords: facial expression recognition, emotional facial expression, facial expression database.
Funding: National Natural Science Foundation of China (60572044); National High-Tech R&D Program of China (863 Program, 2006AA01Z135); Specialized Research Fund for the Doctoral Program of Higher Education (20070006057). Received 2008-10-11; revised 2008-12-12. First author: 薛雨丽 (b. 1980), female.

Humanoid Robots (lecture slides)

Development History
• Research on humanoid robots can be traced back to 1893, when Georges Moore built the first steam-driven humanoid walking machine [Rosheim, 1994], although its detailed construction and operating principle are not known. Later, during the First World War, Thring invented a legged farming machine [Thring, 1983]. Up to about 1970, many researchers worked on walking machines to assist human locomotion; for example, in 1948 Bernstein, at the Moscow prosthetics design research centre, developed a leg exoskeleton fitted with electronic devices [Karsten, 2003].
• Figure 1.3: the HRP series from Japan's National Institute of Advanced Industrial Science and Technology (AIST) [3]. In 2000, SONY also unveiled a small robot, SDR-3X [Kuroki, Ishida, and Yamaguchi, 2001] (Figure 1-3), 50 cm tall and 5 kg in weight, with six degrees of freedom in each leg; it could dance and even stand on one leg. In 2002, SONY released the next generation, SDR-4X [Fujita, Kuroki, Ishida, and Doi, 2003] (Figure 1-4), 58 cm tall and 6.5 kg, again with six degrees of freedom per leg. Beyond the capabilities of SDR-3X, it could walk over surfaces with 10 mm of unevenness, climb a 10-degree slope, and even stand up again after being pushed over, a significant step toward the goal of a household robot.
Chapter 3: Humanoid Robots - 陈黄祥 (research interests: robotics, innovation capability research)
Assignment Review and Comments
Independent assignment: a design proposal for a Mead Johnson vending robot. Discussion topics: 1. form; 2. functions.
What kind of robot counts as a humanoid robot?
Current State of Humanoid Robot Development
• Humanoid robotics is an interdisciplinary field that draws on bionics, mechanism design, control theory, and artificial intelligence. Compared with wheeled and multi-legged robots, humanoid robots are better able to adapt to complex terrain and offer more flexible locomotion and greater variation in speed.

2009: Transmission of Facial Expressions of Emotion and Their Efficient Decoding in the Brain: Behavioral and Brain Evidence

Transmission of Facial Expressions of Emotion Co-Evolved with Their Efficient Decoding in the Brain: Behavioral and Brain EvidencePhilippe G.Schyns1,2*,Lucy S.Petro1,2,Marie L.Smith1,21Centre for Cognitive Neuroimaging(CCNi),University of Glasgow,Glasgow,United Kingdom,2Department of Psychology,University of Glasgow,Glasgow,United KingdomAbstractCompetent social organisms will read the social signals of their peers.In primates,the face has evolved to transmit the organism’s internal emotional state.Adaptive action suggests that the brain of the receiver has co-evolved to efficiently decode expression signals.Here,we review and integrate the evidence for this hypothesis.With a computational approach, we co-examined facial expressions as signals for data transmission and the brain as receiver and decoder of these signals.First,we show in a model observer that facial expressions form a lowly correlated signal set.Second,using time-resolved EEG data,we show how the brain uses spatial frequency information impinging on the retina to decorrelate expression categories.Between140to200ms following stimulus onset,independently in the left and right hemispheres,an information processing mechanism starts locally with encoding the eye,irrespective of expression,followed by a zooming out to processing the entire face,followed by a zooming back in to diagnostic features(e.g.the opened eyes in‘‘fear’’,the mouth in‘‘happy’’).A model categorizer demonstrates that at200ms,the left and right brain have represented enough information to predict behavioral categorization performance.Citation:Schyns PG,Petro LS,Smith ML(2009)Transmission of Facial Expressions of Emotion Co-Evolved with Their Efficient Decoding in the Brain:Behavioral and Brain Evidence.PLoS ONE4(5):e5625.doi:10.1371/journal.pone.0005625Editor:Antonio Verdejo Garcı´a,University of Granada,SpainReceived September12,2008;Accepted April10,2009;Published May20,2009Copyright:ß2009Schyns et al.This is an open-access article distributed under the terms of the Creative Commons Attribution License,which permits unrestricted use,distribution,and reproduction in any medium,provided the original author and source are credited.Funding:BBSRC.The funders had no role in study design,data collection and analysis,decision to publish,or preparation of the manuscript.Competing Interests:The authors have declared that no competing interests exist.*E-mail:philippe@IntroductionPrimates use their faces to transmit facial expressions to their peers and communicate emotions.If the face evolved in part as a device to optimize transmission of facial expressions then the primate brain probably co-evolved as an efficient decoder of these signals:Competent social interaction implies a fast,on-line identification of facial expressions to enable adaptive actions.Here, we propose a computational theory addressing the two important questions of how and when the brain individuates facial expressions of emotion in order to categorize them quickly and accurately.The manuscript is organized in two main parts.The first part outlines the evidence for the proposal that the face has evolved in part as a sophisticated system for signaling affects to peers.We conclude that the face communicates lowly correlated emotive signals.Turning to the receiver characteristics of these affects,we review the evidence that the brain comprises a sophisticated network of structures involved in the fast decoding and categori-zation of emotion signals,using inputs from low-level vision.These inputs 
represent information at different spatial scales analyzed across a bank of Spatial Frequency filters in early vision.The second part of the paper develops our integrative computational account building from the data of Smith et al[1] and Schyns et al[2].We also present new data to support the integration of the different parts of the research.Integrating the data of Smith et al[1]and Schyns et al[2]in a meta-analysis,we first show that the six basic categories of emotion(happy,fear,surprise,disgust,anger,sadness plus neutral[3])constitute a set of lowly correlated signals for data transmission by the face.We then show that correct categorization behavior critically depends on using lowly correlated expression signals.The brain performs this decorrelation from facial information represented at different scales(i.e.across different spatial frequency bands)impinging on the observer’s retina.Integrating the data of Schyns et al[2],we show how and when decoding and decorrelation of facial information occur in the brain.For decoding,the left and right hemispheres cooperate to construct contra-lateralized representa-tions of features across spatial frequency bands(e.g.the left eye is represented in the right brain;the right eye in the left brain. Henceforth,we will use a viewer-centered description of features. For example,‘‘the left eye’’of a face will be the eye as it appears to the observer.).Irrespective of expression and observer,this construction follows a common routine that is summarized in three stages.Sensitivity to facial features starts at Stage1[140–152ms],which contra-laterally encodes the eyes of the face at a local scale(i.e.high spatial frequencies).Stage2[156–176ms] zooms out from the local eyes to encode more global face information(ing high and low spatial frequencies).Stage3 [180–200ms],most critical here,zooms back in to locally and contra-laterally encode the features that individuate each expres-sion(i.e.diagnostic features such as the eyes in‘‘fear’’,the corners of the nose in‘‘disgust’’,again using higher spatial frequencies).At the end of this time window,the brain has decorrelated the expression signals and has encoded sufficient information toenable correct behavior.In a novel analysis,we demonstrate this point with a Model Categorizer that uses the information encoded in the brain every4ms between140and200ms to attempt to classify the incoming facial expression at75%correct,as human observers did.The face as a transmitter of facial affectsAlthough humans have acquired the capabilities of spoken language,the role of facial expressions in social interaction remains considerable.For over a century we have deliberated whether facial expressions are universal across cultures,or if expressions evolve within civilizations via biological mechanisms, with Charles Darwin’s The Expression of the Emotions in Man and Animals[4]central to much of this research.Irrespective of whether facial expressions are inextricably linked to the internal emotion and therefore part of a structured emotional response,or whether cultures develop their own expressions,a facial expression is a visible manifestation,under both automatic and voluntary neural control,that can be measured.The Facial Action Coding System(FACS)details the anatomical basis of facial movement to describe how facial signals are exhibited based on the muscles that produce them.Ekman&Friesen[5]developed FACS by determining how the contraction of each facial muscle transforms the appearance of the face,and how muscles act 
both singly and in combination to produce cognitive categories of expressions[6]. On this basis,the face can be construed as a signaling system:as a system that transmits a signal about the emotional state of the transmitter.Smith et al[1]measured the characteristics of facial expressions as ing Bubbles([7]and Figure1)they sampled information from5one-octave Spatial Frequency bands and computed the facial features that observers required to be75% correct,independently with each of the7Ekman&Friesen[3] categories of emotion(see also Schyns et al[2]and Figure2).In addition,they constructed a model observer that performed the same task,adding white noise to the sampled information to modulate performance.In Smith et al[1]the model observer provides a benchmark of the information that the face signals about each expression.With classification image techniques,Smith et al[1] found that the transmitted information formed a set with low average correlation.That is,they found that different parts of the face transmitted different expressions,resulting in low Pearson correla-tions between the features of the face transmitting the expressions.For example,the wide-opened eyes were mostly involved in‘‘fear’’,the wrinkled corners of the nose in‘‘disgust’’and the wide-opened mouth in‘‘happy’’.They also found that human categorization behavior depended on using these decorrelated cues.We have argued that emotion signals have high adaptive value and there is evidence that they have evolved into a lowly correlated set.Turning to the receiver of the emotional signals,i.e. the brain,we can examine the coupling that exists between the encoding of the expression by the face for transmission and the decoding of the signals in the brain.Facial expressions of emotion represent particularly good materials to study the particulars of this transmitter-receiver coupling because most of us are expression experts.Thus,brain circuits are likely to have evolved to decode expression signals fast and efficiently,given the wide range of viewing conditions in which facial expressions are typically experienced.We will briefly review where and when in the brain emotion identification is proposed to happen.The brain as a decoder of facial affects:Where are the cortical and subcortical networks?Emotional stimuli may hold a privileged status in the brain[8], commanding a distributed neural network of cortical and subcortical structures for representing different facial expressions and determining adaptive responses to such stimuli[9,10,11,12] As established by single-cell analysis,neuroimaging and lesion studies,this network has contributions from the amygdala, cingulate gyrus,hippocampus,right inferior parietal cortex, ventromedial occipito-temporal cortex,inferotemporal cortex and the orbitofrontal cortex[12,13,14,15].With respect to functional roles,the occipito-temporal cortical pathway(in particular the fusiform gyrus and superior temporal sulcus)may be involved in the early perceptual encoding that is essential for differentiating between expressions.Ensuing catego-rization may require neural structures including the amygdala and orbitofrontal cortex to integrate perceptual encoding of the face with prior knowledge of emotion categories[16,17,18]. 
Another,independent subcortical pathway sidestepping striate cortex could allow a coarse,very fast processing of facial expression.In particular,fear or threat-related signals may benefit from a direct subcortical route to the amygdala,via the superior colliculus and pulvinar thalamus(see[19,20,21,22]).Evidence for a subcortical route arises in part from perception of emotional content without conscious experience,but this is still a controver-sial topic[21,22,23,24,25,26].The brain as a decoder of facial affects:When does the decoding happen?Estimates of the time course of emotion processing in the brain can be derived from event-related potential(ERP)studies.The N170 is a face-sensitive ERP,making it a good candidate to study the time course of emotions signalled by the face.Peaking between140–200ms after stimulus onset and occurring at occipito-temporal sites [27,28],what this potential actually reflects remains somewhat controversial–i.e.if it is linked to the structural encoding of faces,a response to the eyes[27,29],or whether it can be modulated by emotional facial expression[30,31,32].Despite some studies reporting no effect of emotion,due to its robust sensitivity to faces, the N170remains a good measure of early facial expression discrimination.Indeed,Schyns et al[2]demonstrated that activity in the50ms preceding the N170peak reflects a process that integrates facial information.This integration starts with the eyes and progresses down on the face.Integration stops,and the N170peaks, when the diagnostic features of an expression have been integrated (e.g.the eyes in‘‘fear’’,the corners of the nose in‘‘disgust’’and the mouth in‘‘happy’’).Consequently,distance of the diagnostic features from the eyes determines the latency of the N170. Thus,evidence from brain imaging techniques suggests that cortical and subcortical networks are both involved in the fast processing of emotions.In the cortical route,there is evidence that emotion decoding happens over the time course of the N170ERP, in a time window spanning140to200ms following stimulus onset.If the face transmits affects for the visual brain to decode,we must turn to the visual system for an understanding of the specific visual inputs to the network of brain structures involved in this fast decoding.Decoding Facial Affects In Visual Signals:The Role of Spatial FrequenciesA classic finding of vision research is that the visual system analyzes the retinal input,therefore including facial expressions, into light-dark transitions,at different spatial frequencies.A set of filters,called‘‘Spatial Frequency Channels’’,performs this analysis:Each channel is tuned to a preferential frequency band, with declining sensitivity to increasingly different frequencies.A ‘‘bandwidth’’characterizes the range of frequencies to which achannel is sensitive,and channel bandwidths are mostly in the range of 1to 1.5octaves–where an octave is a doubling of frequency,e.g.,from 2to 4cycles per deg (c/deg)of visual angle,4to 8c/deg,16to 32c/deg and so forth.In total,approximately six channels constitute the bank of spatial filters analyzing the retinal input (see [33]for a review).At the centre of the research agenda is the debate of how high-level cognition interacts with inputs from low-level spatial frequency channels to extract information relevant for visual categorization (see [33]for a review).Top-down control implies that the visual system can actively modulate information extraction from one,or a combination of spatial frequencychannels for stimulus 
encoding and categorization.For example,if categorization of ‘‘fear’’requires extraction of the wide-opened eyes from the retinal input,and because the wide-opened eyes are fine scale features,their accurate encoding should draw informa-tion from higher spatial frequency filters.In contrast,the wide-opened mouth of ‘‘happy’’is a large scale feature allowing encoding to be more distributed across the filters.Top-down control of spatial frequency channels,often cast in terms of modulated attention,implies such flexible tuning of the visual system to encode the combination of spatial channels representing categorization-relevant information (with e.g.,involvement of different channels for ‘‘the eyes’’and ‘‘themouth’’).Figure 1.Illustration of Bubbles Sampling [1,2].Human.A randomly selected face (from a set of 7expressions 610exemplars =70)is decomposed into six spatial frequency bands of one octave each,starting at 120–60cycles per face.Only five bands are shown.The sixth band served as constant background.At each spatial frequency band,randomly positioned Gaussian windows (with sigma =.36to 5.1cycles/deg of visual angle)sampled information from the face,as shown in the second and third rows of pictures.Summing the pictures on the third row across the five spatial frequency bands plus the constant background from the sixth band produced one experimental stimulus.This is illustrated as the rightmost picture on the third row.Model.The bottom row illustrates the modification of the stimuli to be used in the model.We added white noise to the original picture,which was then decomposed into five spatial frequency bands and sampled with bubbles as described above to produce the experimental stimulus.doi:10.1371/journal.pone.0005625.g001Several researchers have argued for a special role of the low frequency bands in face processing [34,35,36,37,38]particularly so in the categorization of facial expressions.Subcortical structures [39,40,41],more sensitive to low spatial frequencies [42,43],would directly activate the amygdala (and related structures [44])in response to fearful faces represented at low spatialfrequencies.Figure 2.Meta-Analysis of the Behavioral Data.For each spatial frequency band (1to 5),a classification image reveals the significant (p ,.001,corrected,[51])behavioural information required for 75%correct categorization of each of the seven expressions.All bands.For each expression,a classification image represents the sum of the five classification images derived for each of the five spatial frequency bands.Colored Figures.The colored figures represent the ratio of the human performance over the model performance.For each of the five spatial we computed the logarithm of the ratio of the human classification image divided by the model classification image.We then summed these ratios over the five spatial frequency bands and normalized (across all expressions)the resulting logarithms between 21and 1.Green corresponds to values close to 0,indicating optimal use of information and optimal adaptation to image statistics.Dark blue corresponds to negative values,which indicate suboptimal use of information by humans (e.g.low use of the forehead in ‘‘fear’’).Yellow to red regions correspond to positive values,indicating a human bias to use more information from this region of the face than the model observer (e.g.strong use of the corners of the mouth in ‘‘happy’’).doi:10.1371/journal.pone.0005625.g002Vuilleumier et al[44]also noted a sensitivity of the fusiform cortex to higher spatial 
frequency ranges,but an implied dissociation of subcortical and cortical pathways to process SF information remains debatable[45].The idea of a coarse,fast representation via low spatial frequencies[46]finds echo in Bar et al[47]who suggest a fast feedforward pathway to orbitofrontal cortex,which in turn directs precise,high spatial frequency information extraction in the visual input via the fusiform gyrus(see also [42,48].So,not only are spatial frequency bands important because they represent the building blocks of visual representa-tions;spatial frequency bands also appear to play a central role in emotion processing in the brain.Transmitting Decorrelated Facial Expressions, Constructing their Spatial Frequency Representations in the BrainWe have reviewed the evidence that muscle groups in the face produce lowly correlated signals about the affective state of the transmitter.These facial signals impinge on the retina where banks of spatial filters analyze their light-dark transitions at different scales and orientations.This information is then rapidly processed (i.e.before200ms)in cortical and subcortical pathways for the purpose of categorization.In what follows,we will be concerned with the coupling transmitter-receiver for the categorization of Ekman’s six basic expressions of emotion plus neutral.Merging the data of Smith et al[1]and Schyns et al[2],using Bubbles,we will perform an analysis to characterize the facial expressions of emotion across spatial frequency bands as signals.In line with Smith et al[1],we will show that they form a lowly correlated set of signals.We also perform a meta-analysis on the behavioral data of17observers to understand how categorization behavior depends on facial cues represented across multiple spatial frequency bands.We will show that their behavior relies on the decorrelations present in the expression signals.With time resolved EEG,we will then report new analyses in three observers revealing how the left and right occipito-temporal regions of the brain extract facial information across spatial frequency bands,over the first200ms of processing,to construct decorrelated representations for classification(see[49]for a generic version of this point).A simple Model Categorizer which uses the information represented in the brain as input will demonstrate when(i.e.the specific time point at which)this representation becomes sufficient for75%correct categorizations of the seven expressions,as required of the observers’behavior. MethodsComputational meta-analysis:signalling emotions and categorizing them,model and human observersUsing classification image techniques(Bubbles,[7]),we will characterize with a model observer the information signalled in Ekman&Friesen[3]six basic categories of facial expressions of emotion plus‘‘neutral’’.In addition to characterizing signal transmission,we will characterize,using the same classification image techniques,how behavior uses information from the transmitted signals.These meta-analyses follow the methods developed in Smith et al[1],but pooling data drawn from Smith et al[1]and Schyns et al[2]to provide more power to the analyses. 
Participants.A total of7male and10female University of Glasgow students of normal or corrected to normal vision were paid to participate in the experiments.Their behavioural data were pooled from Smith et al[1]and from Schyns et al[2].For each participant,written informed consent was obtained prior to the experiment and ethics was granted by University of Glasgow FIMS ethics committee.Stimuli.Stimuli were generated from5male and5female faces,each displaying the six basic FACS-coded[5]emotions plus neutral,for a total of70original stimuli(these images are part of the California Facial Expression,CAFE´,database[50]). Procedure.Human Experiment.On each trial of the experiment(1200trials per expression in Smith et al[1];3000 trials per expression in Schyns et al[2],an original face stimulus was randomly chosen and its information randomly sampled with Bubbles as illustrated in Figure1.The stimulus was split into five non-overlapping spatial frequency bands of one octave each, starting at120–60cycles per image.A sixth spatial frequency band served as constant rmation was randomly sampled from each band with a number of randomly positioned Gaussian apertures,whose sigma was adjusted across spatial frequency band so as to sample6cycles per aperture.The information sampled per band was then recombined to form the stimulus presented on this trial.It is important to note that this version of Bubbles,which samples information across spatial frequency bands,has the advantage of sampling information across the scales of a face.On each trial,global and local face cues are simultaneously presented to the visual system.Thus,both type of cues can by used by face processing mechanisms. Observers indicated their categorization response by pressing one of7possible computer keys.Stimuli remained on the screen until response.A staircase procedure was used to adjust the number of bubbles(i.e.the sampling density)on each trial,so as to maintain categorization performance at75%correct,indepen-dently for each expression.This is important:All observers categorized each one of the seven expressions at the same level of 75%correct performance.Model Observer.Experimental stimulus sets are samples of a population of stimuli.To benchmark the information available in our stimulus set to perform the experiment,we built a model observer.We submitted the model to the same experiment as human observers(collapsed across Smith et al[1]and Schyns et al [2]data),using as parameters the average accuracy per expression (n=17observers),the average number of bubbles per expression and the total number of trials(25,800)per expression.To maintain performance of the model at75%correct for each expression,a staircase procedure adjusted a density of white noise,on a trial-per-trial basis,and independently for each expression.On each trial,the model computed the Pearson correlation between the input expression and the70original faces revealed with the same bubbles as the input(collapsing all the3806240images pixels65 spatial frequency bands into a1-dimensional vector).A winner-take-all scheme determined the winning face.Its emotion category was the model’s response for this trial.To the extent that the model compares the input to a memory of all stored faces,it can use all the information present in the original data set to respond. Thus,the model provides a benchmark of the information present in the stimulus set to perform the expression categorization task. 
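A minimal sketch of the winner-take-all comparison described for the model observer may help. The bubble-masked Pearson correlation and the winner-take-all rule follow the description above; the array shapes, variable names, and the simple loop are illustrative assumptions.

import numpy as np

def model_observer_response(stimulus, bubble_mask, originals, labels):
    """Winner-take-all Pearson-correlation classifier, as described for the model observer.

    stimulus    : bubbled (and noise-perturbed) input, shape (H, W, n_bands)
    bubble_mask : the same bubble apertures applied to the input, shape (H, W, n_bands)
    originals   : the 70 original stimuli, shape (70, H, W, n_bands)
    labels      : emotion category of each original, length 70
    """
    x = (stimulus * bubble_mask).ravel()
    best_r, best_idx = -np.inf, -1
    for i, face in enumerate(originals):
        y = (face * bubble_mask).ravel()       # reveal each original with the same bubbles
        r = np.corrcoef(x, y)[0, 1]            # Pearson correlation
        if r > best_r:
            best_r, best_idx = r, i
    return labels[best_idx]                    # winner-take-all: category of the best match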
Computational analysis:time course of spatial frequency processing in the brain to encode emotion signals Having shown that behavioural categorization performance requires lowly correlated emotion signals to be correct,we now turn to brain data(i.e.the EEG of three observers from Schyns et al[2]to understand how this decorrelation is accomplished. Participants.The three participants from Schyns et al[2], whose data served in the behavioural meta-analysis,had their scalp electric activity recorded on each trial of the expressioncategorization task described before(with3000trials per expression).EEG Recording.We used sintered Ag/AgCl electrodes mounted in a62-electrode cap(Easy-Cap TM)at scalp positions including the standard10-20system positions along with intermediate positions and an additional row of low occipital electrodes.Linked mastoids served as initial common reference, and electrode AFz as the ground.Vertical electro-oculogram (vEOG)was bipolarly registered above and below the dominant eye and the horizontal electro-oculogram(hEOG)at the outer canthi of both eyes.Electrode impedance was maintained below 10k V throughout recording.Electrical activity was continuously sampled at1024Hz.Analysis epochs were generated off-line, beginning500ms prior to stimulus onset and lasting for1500ms in total.We rejected EEG and EOG artefacts using a[230mV; +30mV]deviation threshold over200ms intervals on all electrodes.The EOG rejection procedure rejected rotations of the eyeball from0.9deg inward to1.5deg downward of visual angle–the stimulus spanned5.36deg63.7deg of visual angle on the screen.Artifact-free trials were sorted using EEProbe(ANT) software,narrow-band notch filtered at49–51Hz and re-referenced to average reference.Computation:Sensor-based EEG Classification Images.To determine the facial features systematically correlated with modulations of the EEG signals,we applied Bubbles to single trial raw electrode amplitudes.We selected,for each observer,a Left and a Right Occipito-temporal electrode (henceforth,OTL and OTR)for their highest N170amplitude peak on the left and right hemispheres(corresponding to P8and PO7for each observer).For each electrode of interest,EEG was measured every4ms, from2500ms to1s around stimulus onset.For each time point, and for each expression and SF band,we computed a classification image to estimate the facial features correlated with modulations of EEG amplitudes.This classification image was computed by summing all the bubble masks leading to amplitudes above(vs. below)the mean,at this time point.We repeated the procedure for each one of the five spatial frequency bands and for each one of the seven expressions and each one of the250time points. Subtracting the bubbles masks above and below the mean leads to one classification image per SF band,time point and expression. 
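The sensor-based classification image described above reduces to a mask-weighted contrast. The following sketch restates that computation in NumPy; the array layout and variable names are assumptions made for illustration.

import numpy as np

def eeg_classification_image(bubble_masks, amplitudes):
    """Classification image for one electrode, time point, SF band, and expression.

    bubble_masks : (n_trials, H, W) bubble sampling masks for the trials of one expression
    amplitudes   : (n_trials,) single-trial EEG amplitude at this electrode and time point
    """
    above = amplitudes > amplitudes.mean()
    # sum the masks of trials whose amplitude lies above vs. below the mean, then subtract
    ci = bubble_masks[above].sum(axis=0) - bubble_masks[~above].sum(axis=0)
    # the paper additionally thresholds this image with a corrected statistical test (p < .05)
    return ci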
This classification image represents the significant(p,.05, corrected,[51])facial information(if any)that is correlated with modulations of the EEG for that SF band,time point and expression(see Schyns et al[2,53];Smith et al[1],for further details).ResultsComputational meta-analysis:signalling emotions and categorizing them,model and human observersWe performed the same analysis for the human and model observers.To illustrate,consider the analysis of the expression ‘‘happy’’in the top row of Figure2.In each spatial frequency band,and for each pixel,we compute a proportion:the number of times this pixel led to a correct response over the number of times this pixel has been presented.Remember that performance was calibrated throughout the experiment at75%correct.On this basis,we determine the informative from the noninformative pixels of an expression:the proportion associated with informative pixels will be above.75.In each Spatial Frequency band,we found the statistically relevant pixels(corrected,[51],p,.001).The first row of images(labelled1–5)illustrates these pixels on one exemplar of the‘‘happy’’category,for each Spatial Frequency band.The sum of these images(under‘‘all bands’’)summarizes the information that the observers required for75%correct categorization behavior.To illustrate,‘‘happy’’requires the smiling mouth and the eyes,‘‘surprised’’the open mouth,‘‘anger’’the frowned eyes and the corners of the nose,‘‘disgust’’the wrinkles around the nose and the mouth,‘‘sadness’’the eyebrows and the corners of the mouth and‘‘fearful’’the wide-opened eyes. These facial features,in the context of equal categorization performance across expressions,provide the information that human observers required to choose amongst seven possible expressions,without much expectation as to what the input expression would be.A remaining question is the extent to which other information exists,between the seven emotion categories of this experiment,to accomplish the task at the same performance level(here,75%correct).To this end,we turn to the model observer for which we performed an identical analysis as that described for human observers,searching for pixels at the same level of statistical significance(corrected,[51],p,.001).For each expression(rows in Figure2)and Spatial Frequency band (columns of Figure2)we computed a measure of the optimality of information use by human observers:the logarithm of the ratio between human significant pixels and model significant pixels. 
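The behavioral analysis described in the Results reduces to a per-pixel proportion. Below is a NumPy sketch of that step under assumed array shapes; the corrected statistical test applied in the paper is only indicated by a comment.

import numpy as np

def behavioral_classification_image(bubble_masks, correct, threshold=0.75):
    """Proportion-based classification image for one expression and one SF band.

    bubble_masks : (n_trials, H, W) sampling masks (1 where a pixel was revealed)
    correct      : (n_trials,) 1 if the trial was categorized correctly, else 0
    """
    shown = bubble_masks.sum(axis=0)                               # times each pixel was presented
    correct_shown = (bubble_masks * correct[:, None, None]).sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        proportion = np.where(shown > 0, correct_shown / shown, 0.0)
    # pixels above the 75%-correct calibration level are candidate informative pixels;
    # the paper then keeps only those passing a corrected significance test (p < .001).
    return proportion, proportion > threshold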
Adding these logarithms across spatial frequency bands per expression and applying them to an original stimulus reveals in green an optimal use of information in humans; in blue, a suboptimal use of information by humans (e.g. the corners of the nose in "happy", the mouth region in "anger", the frown of the forehead in "fear"); and in red a particular bias to use information from this region, at least more so than the model observer (e.g. the corners of the mouth in "happy", the wrinkle of the nose and the mouth in "disgust"). Such human biases might reveal that in a larger population of stimuli, this information would be more informative than in the stimulus set of this particular experiment. Having characterized the use of facial information for expression categorization, we turn to the effectiveness of the face as a transmitter of emotion signals and introduce a new measurement for the effectiveness of the signals themselves. In signal processing, the more dissimilar the encoding of two signals, the less confusable they are in transmission. An ideal set of signals is "orthogonal" in the sense that they are fully decorrelated. To evaluate the decorrelation of the six basic expressions of emotion plus neutral, we examined the decorrelation of the significant pixels of the model observer. To this end, we turn the 380 x 240 image pixels x 5 Spatial Frequency bands representing each expression into a one-dimensional vector of 38 x 24 x 5 entries (by compressing the images by a factor of 10), and cross-correlate (Pearson) them to produce a symmetric correlation matrix where each (x, y) correlation represents the similarity between two expressions. If the expressions formed an orthogonal set, then the correlation matrix should be the identity matrix, with correlations (x, y) = 0 for each pair of expressions, and (x, x) = 1 on the diagonal, for the correlation of each expression with itself. The Backus-Gilbert Spread [52] measures the distance between the identity matrix and an observed matrix. We adapted the Backus-Gilbert Spread to provide a measure of decorrelation:

1 - \left[ \sum_{i,j} (X_{i,j} - I_{i,j})^2 \Big/ \sum_{i,j} (1 - I_{i,j})^2 \right]   (1)

Figure 3 illustrates the correlation matrices and their respective Backus-Gilbert Spread measurement of decorrelation (maximum decorrelation is 1; minimum decorrelation is 0). For human and
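Equation (1) can be restated directly in code. The sketch below assumes NumPy and uses random placeholder data in place of the real compressed classification-image vectors.

import numpy as np

def backus_gilbert_decorrelation(corr: np.ndarray) -> float:
    """Decorrelation measure adapted from the Backus-Gilbert spread (equation 1).

    corr : symmetric expression-by-expression Pearson correlation matrix.
    Returns 1 for a perfectly decorrelated (identity) matrix, 0 at the other extreme.
    """
    identity = np.eye(corr.shape[0])
    num = np.sum((corr - identity) ** 2)
    den = np.sum((np.ones_like(corr) - identity) ** 2)
    return 1.0 - num / den

# Example: seven expression signals flattened to vectors, cross-correlated, then scored.
vectors = np.random.rand(7, 38 * 24 * 5)      # placeholder for the compressed pixel vectors
corr = np.corrcoef(vectors)                   # 7 x 7 Pearson correlation matrix
print(backus_gilbert_decorrelation(corr))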

A Survey of Emotional Chatbot Technology

Article ID: 2096-1472(2021)-02-14-05; DOI: 10.19644/ki.issn2096-1472.2021.02.003. Software Engineering, Vol. 24, No. 2, Feb. 2021. A Survey of Emotional Chatbot Technology. 肖鹏 1,2, 于丹 1,2, 王建超 1,2, 来关军 1,2 (1. Dalian Neusoft University of Information, Dalian 116023, China; 2. Research Institute, Dalian Neusoft Education Technology Group Co., Ltd., Dalian 116023, China). Abstract: Chatbot technology has long been a research focus in human-computer interaction, and chatbots based on text or speech are already widely used in everyday life.

However, building a chatbot that can hold natural, fluent conversations with humans remains challenging.

Emotion, as an important aspect of anthropomorphism, can make human-computer interaction more natural and fluent.

Therefore, to advance chatbot technology, this paper systematically reviews research on emotional chatbots, covering related concepts, development history, emotion generation methods, design approaches, and evaluation methods.

Emotional chatbots fall into two main types: those that produce emotional responses of a designated category and those that generate emotional responses freely; generative emotional responses are the main trend for future development.

Keywords: chatbot; emotion; design; evaluation. CLC number: TP183; Document code: A.
Overview of Emotional Chatbot Technology. XIAO Peng 1,2, YU Dan 1,2, WANG Jianchao 1,2, LAI Guanjun 1,2 (1. Dalian Neusoft University of Information, Dalian 116023, China; 2. Research Institute, Dalian Neusoft Education Technology Group Co., Ltd., Dalian 116023, China). Abstract: Chatbot technology has always been a research focus in the field of human-computer interaction. Chatbots based on text or voice have been widely used in practice. However, it is still challenging to build chatbots that can converse with humans in a natural and fluent way. Emotion, an important aspect of anthropomorphism, can make human-computer interaction more natural and fluent. Therefore, in order to promote the development of chatbot technology, this paper provides a systematic review of emotional chatbots, including related concepts, development history, emotion generation methods, design ideas, and evaluation methods. Emotion-enabled chatbots are divided into emotional responses of designated categories and generative emotional responses, of which generative emotional responses are the main trend. Keywords: chatbot; emotion; design; evaluation.
1 Introduction. A chatbot can converse with humans in natural language, by voice or text, allowing people to communicate with machines easily.


0 Introduction

As industrial robot technology has matured, robots have begun to move into medicine, health care, the home, sports, and the service sector, and the demands placed on them have shifted from simple repetitive mechanical motions to humanoid robots with a high degree of intelligence and autonomy that can interact with other intelligent agents. Such robots no longer face fixed tasks in simple environments; they must interact with complex environments and emphasize personality and intelligent emotional expression. To do so, a humanoid robot needs to recognize and understand human emotional states, generate its own emotions in response to changes in the environment, possess strong emotion-processing capabilities, express emotion through appropriate behaviors and facial expressions, and thereby display a character of its own. Facial expression is the manifestation of emotion on the face and the main channel of emotional expression. Before an infant learns to speak, the facial expressions of others are a primary source of the infant's cognition and learning, and the infant's own expressions are its main means of conveying wishes and needs; interaction between infants and adults takes place through the medium of expression, and this transfer of feeling develops the child's social behavior and the relationship between child and adult. In adult social interaction, facial expression is an important aid to verbal communication. Because facial expression in humans is partly learned, it can be deliberately controlled: exaggerated or suppressed, masked or feigned. The well-known American psychologist Albert Mehrabian [1] found that when people express emotion, words convey only 7% of the content and tone of voice only 38%, while the remaining 55% is conveyed entirely by expression and gesture, which shows how necessary expressive communication is. Hence the growing importance of research on facial-expression communication between humans and robots.

1 Theoretical Basis for Realizing Facial Expressions
1.1 Influence of the Uncanny Valley Theory on Expression Robot Research

The "uncanny valley" theory [2] of the Japanese scientist Masahiro Mori states that as robots become fairly similar to humans in appearance and movement, people respond to them with positive feelings; but beyond a certain point the reaction suddenly turns to strong revulsion. At that point even a tiny difference from a human becomes glaring, making the whole robot seem stiff and eerie, as if one were facing the walking dead. However, as the similarity of the robot's appearance and movement to a human continues to rise, people's emotional response turns positive again, approaching the empathy felt between humans.


Research Status of Facial Expression Robots. 岳翠萍 1,2, 梅涛 1, 骆敏舟 1 (1 Center for Bionic Sensing and Control, Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031; 2 Department of Automation, University of Science and Technology of China, 230031)

Abstract: This paper first introduces the uncanny valley theory, which offers guidance for research on expression robots, and then presents the theoretical basis for realizing expressions, explaining the mechanism of expressive movement and the methods for implementing robot expressions through an analysis of the facial muscles. On that basis, it surveys recent examples of expression robot research at home and abroad, and finally discusses the research topics in this field and how to pursue them in greater depth. Keywords: robot; humanoid robot; facial expression; human-robot interaction; uncanny valley

Figure 1: The uncanny valley theory. [About the authors] 岳翠萍 (b. 1985), female, master's degree; research area: bionic robots. 梅涛 (b. 1962), male, Ph.D., professor; research areas: special-purpose robots, information acquisition, MEMS. 骆敏舟 (b. 1972), male, Ph.D., associate professor; research areas: special-purpose robots, robot grippers.

At the same time, movement makes the differences between a robot and a human easier to notice than appearance does, as shown in Figure 1 [3]. In the brain there are two or more pathways from the sensory apparatus (hypothalamus) to the emotional system (pituitary). One is a direct path; the other must first pass through the understanding and cognition system (the cerebral cortex) before reaching the emotional system. Over the direct, low-level path the emotional system receives only coarse visual stimulation. When a person sees an object that closely resembles a human, the signal arriving over the low-level path says the object is human, while the signal arriving over the higher path says the object differs greatly from a human. The conflict created by these competing signals produces a feeling of eeriness (see Figure 2 [4]).

If a robot's appearance is very human-like, people will treat the humanoid robot as a human, so even the slightest difference produces a strange feeling [4]. For example, when implementing a robot's smile, the speed of deformation is a very important factor; if it is not well controlled, the smile looks unnatural. Small changes in movement can be enough to drop the robot into the uncanny valley, and this has a strong influence on the development of humanoid robots. Some developers of expression robots, wary of falling into the valley, deliberately avoid building systems that closely resemble humans, keeping the similarity below the first peak shown in Figure 1 and producing robots that have some human features but are easily distinguished from people; others keep trying to use human-like behavior to cross the valley and reach the second peak of the curve.

1.2 Analysis of the Facial Muscles

Facial expressions are the result of muscle movement, so to understand how expressions arise we first analyze the facial muscles. Changes in facial expression are produced by the mimetic muscles lying beneath the facial skin: when they contract they pull the skin along with them, and when the various muscles work together the face takes on different expressions. These muscles generally originate on bone or fascia and insert into the skin, and they can be grouped into the epicranius, the muscles around the eyes, and the muscles around the mouth, among others (see Figure 3) [5].

1) Epicranius: composed of the paired occipital and frontal bellies with the galea aponeurotica between them. The occipital belly pulls the galea and scalp backward, while contraction of the frontal belly raises the eyebrows and wrinkles the skin of the forehead. 2) Muscles around the eyes: these include the orbicularis oculi and the corrugator supercilii. The orbicularis oculi surrounds the palpebral fissure and closes it when it contracts; the corrugator supercilii arises from the frontal bone, inserts into the skin of the middle and inner eyebrow, draws the brows downward and inward, and folds the skin between them. 3) Muscles around the mouth: these comprise the radially arranged muscles and the orbicularis oris. The radial muscles lie above and below the lips and can raise the upper lip, lower the lower lip, or pull the corners of the mouth upward, downward, or outward.

1.3 Correspondence between Emotions and Expressions

Each emotion has a specific pattern of facial muscle activity. A facial expression can be described by changes in the forehead-eyebrow region, the eye-nose region, and the mouth-lip region, and each concrete emotion is formed by a different combination of muscle movements in these three regions. The American psychologist Paul Ekman [6] labeled the muscle movements of these three facial regions for different emotions and proposed the Facial Action Coding System (FACS). Most current expression robots are developed on the basis of FACS and the six basic expressions. Of the 44 action units (AUs) in FACS, which describe movements of the facial muscles, 24 are related to human expressions, and the six basic expressions require 14 of them, as shown in Table 1. To realize the AU movements in the table, we study the displacement and direction of human facial muscle motion and obtain the best combination of action units for each of the six basic facial expressions, as shown in Table 2 (a small code sketch combining the two tables follows Table 2 below). For a robot to produce human-like expressions, expression control points corresponding to the AU points must be laid out on the robot's facial skin, as shown in Figure 4. Expression drive mechanisms (usually motors or pneumatic cylinders) are connected to the control points, and by varying the combination and displacement of these control points the robot can present different facial expressions.

Figure 3: Distribution of the facial mimetic muscles. Figure 2: The sense of eeriness caused by conflicting sensory and cognitive information.


Table 1: Action units related to facial expressions and the corresponding control points

AU | Expression change | Control points (left / right)
1 | inner brow raise | 2 / 3
2 | outer brow raise | 1 / 4
4 | brow lower | 5, 6 / 7, 8
5 | upper eyelid raise | 9 / 10
6 | cheek raise, eyelid tightening | 11 / 12
7 | eyelid squeeze | 9 / 10
9 | nose wrinkle | 13
10 | upper lip raise | 13
12 | mouth corner pull up | 11 / 12
15 | mouth corner pull down | 16 / 17
17 | chin raise | 18
20 | mouth stretch | 14 / 15
25 | lips part | 18
26 | jaw drop | 18 (and motor)

2 Examples of Expression Robot Research

A number of universities and research institutes at home and abroad are working on expression robots and have made new progress. Below, the research status of expression robots is reviewed according to the different methods used to drive the expressions.

2.1 Expression Robots Driven by Servo Motors

Because the expression robots developed in China all use motor-based actuation, we compare domestic and foreign examples separately.

2.1.1 Expression robots in China

The H&T Robot-II humanoid head robot of Harbin Institute of Technology, shown in Figure 5, has vision and expression recognition capabilities and successfully reproduces six human facial expressions: neutral, happiness, anger, disgust, sadness, and surprise [7-10].

Expression recognition tests with five groups of different people yielded a recognition success rate of 80%. H&T Robot-III adds the ability to produce mouth shapes matching its speech [9].

The complete system consists of the H&T Robot-II humanoid head itself, a CCD acquisition device, a host computer, expression recognition software, an intelligent control unit, a PCI interface card, and a single-chip microcontroller circuit. Expression images are captured by the CCD device and recognized by the expression recognition software, and the system then drives the humanoid head through the controller to reproduce the expression (a schematic control-loop sketch follows this paragraph). The expression drive mechanism uses motors together with a non-metallic rope traction mechanism. The traction mechanism is shown in Figure 6 [8]: spring 2 is made of steel wire of circular cross-section and is wrapped by tube 4; points 1 and 5 are fixed ends, and the drive motor pulls the non-metallic rope 3 so that it slides inside spring 2. In 2007, the University of Science and Technology Beijing combined mechanical design, sensor technology, servo control, embedded systems, and artificial intelligence to design an emotional robot head [11] that can converse with people and produce expressions, as shown in Figure 7. The expression robots developed in China can generally realize the six basic expressions, but they are not yet comparable to a real person and are not lifelike enough. The main problem is the lack of a suitable material close to human skin; it is difficult to find a material that satisfies both the sensory and the elastic requirements.

2.1.2 Expression robots abroad

Cynthia Breazeal, a computer scientist at the MIT Artificial Intelligence Laboratory, drew inspiration from the way human infants communicate with their caregivers and developed an infant-like robot named Kismet [12]. In its facial features the robot has 15 degrees of freedom,
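The capture-recognize-reproduce pipeline described for the H&T system can be summarized as a simple control loop. The sketch below is schematic only: the camera, recognizer, and head-controller objects and all of their method names are hypothetical placeholders, not the actual H&T software.

EXPRESSIONS = ["neutral", "happy", "angry", "disgusted", "sad", "surprised"]

def expression_mimicry_loop(camera, recognizer, head_controller):
    """Continuously reproduce the expression recognized in the camera image."""
    while True:
        frame = camera.capture_frame()                   # CCD image acquisition
        label = recognizer.recognize_expression(frame)   # e.g. one of EXPRESSIONS
        if label in EXPRESSIONS:
            # look up the control-point / motor targets for this expression and drive them
            targets = head_controller.targets_for(label)
            head_controller.move_control_points(targets)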

Figure 5: The H&T Robot humanoid head robot of Harbin Institute of Technology

Figure 6: Structure of the expression drive unit


Figure 8: The Kismet expression robot. Figure 4: Facial expression control points.

Table 2: Action units required for the basic expressions

Basic expression | Action units
surprise | 1+2+5+26
fear | 1+2+4+5+7+20+25, 26
disgust | 4+9+17
anger | 4+5+7+10+25, 26
happiness | 6+12+(26)
sadness | 1+4+15
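Tables 1 and 2 can be combined mechanically: given a basic expression, look up its action units and collect the control points those units move. The Python sketch below transcribes the two tables; the function and dictionary names themselves are illustrative.

AU_TO_CONTROL_POINTS = {        # from Table 1 (left/right control points merged)
    1: [2, 3], 2: [1, 4], 4: [5, 6, 7, 8], 5: [9, 10], 6: [11, 12],
    7: [9, 10], 9: [13], 10: [13], 12: [11, 12], 15: [16, 17],
    17: [18], 20: [14, 15], 25: [18], 26: [18],
}

EXPRESSION_TO_AUS = {           # from Table 2; 25/26 alternate in fear and anger,
    "surprise":  [1, 2, 5, 26], # and (26) is optional for happiness
    "fear":      [1, 2, 4, 5, 7, 20, 25, 26],
    "disgust":   [4, 9, 17],
    "anger":     [4, 5, 7, 10, 25, 26],
    "happiness": [6, 12, 26],
    "sadness":   [1, 4, 15],
}

def control_points_for(expression: str) -> set:
    """Return the set of control points that must move for a basic expression."""
    return {p for au in EXPRESSION_TO_AUS[expression] for p in AU_TO_CONTROL_POINTS[au]}

# e.g. control_points_for("happiness") -> {11, 12, 18}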
