This deliverable presents a state-of-the-art survey of the use of multiple biometric modalities for personal identity verification and recognition, which has been recognised as one of the potential strategies for building reliable biometric systems.

As of today, none of the existing biometric identification and verification approaches is fully satisfactory, so alternatives and enhancements have to be sought to develop the ideal product. One promising option is to integrate a number of low-cost modules which on their own cannot aspire to perfection but which jointly could complement each other and achieve the required level of performance and robustness.

This report aims to review the various approaches to integration and fusion. There are a number of different integration scenarios, each of which can be classified into one of two basic categories: heterogeneous integration and homogeneous integration. In the former category the integrated system relies on multiple biometric modalities but these are not used jointly for personal identity verification, while in the latter category each scenario involves a fusion of multiple expert decisions.



Biometrics has been heralded as an important countermeasure to crime and fraud for more than two decades. Biometric sensing and interpretation technologies are considered to provide an effective complementary security measure for Internet access to teleservices such as teleshopping, telebanking and television on demand. Although many biometric systems are already commercially available, such as fingerprint verification systems, iris scan and voice-based systems, their widespread dissemination is limited for reasons ranging from excessive costs, through unsatisfactory reliability in varied environmental conditions and inability to perform continuous identity verification during the period of access, to user acceptance due to the perceived connotation of a particular biometric modality (e.g. fingerprint).

With a perfect biometric modality at our disposal, there would hardly be any need to worry about the problem of integrating multiple sensor and decision-making components into a coherent system. Unfortunately, the criterion of perfection is multifaceted, including not only performance and robustness to environmental changes, but also costs, unobtrusiveness, ability to verify the user continuously as required, applicability and user acceptability. As of today, none of the existing biometric identification and verification approaches meets all these criteria simultaneously. Alternatives and enhancements have to be sought to develop the ideal product.

One promising option is to integrate a number of low-cost modules which on their own cannot aspire to perfection but which jointly could complement each other and achieve the required level of performance and robustness. Such integration can be approached in a number of different ways. For instance, one could use one biometric characteristic to design a control mechanism to enhance the performance of another module. This is exemplified by the use of lip shape or pose analysis to select a suitable probe or the most appropriate model for a frontal face identification/verification system.

More common are the attempts at homogeneous integration, where a number of modules are fused at the decision level. Multiple modules or experts can be constructed either by employing different decision-making schemes or by using more than one biometric characteristic to represent the user's identity. Even a single expert presented with multiple observations offers an opportunity for information fusion and performance amelioration. Interestingly, the fusion mechanisms involved in these diverse cases are conceptually different. Multiple experts expressing their opinions on a single biometric observation, or a single expert forming an opinion on multiple observations, can be seen as mechanisms for improved estimation of a decision function. The use of multiple biometric modalities, on the other hand, aims at bringing complementary information to bear on the decision-making problem. Especially when the modalities are independent, such as voice and frontal face or fingerprint, the fusion process can be expected to reduce the error rates of even the best modality.

The problem of integration of multiple biometric experts has been enjoying growing prominence in the literature over the last decade. Specialised conferences like AVBPA (Audio- and Video-based Biometric Person Authentication) [16, 10] and ICBA (International Conference on Biometric Authentication) [68] usually have a session on fusion and multimodal biometrics. Most papers describe experimental systems combining a wide range of modalities, e.g. face images and voice [21], facial profile and voice [61], fingerprints, face and voice [36], voice, lip motion and face [27]. A recently published book on biometrics by Jain, Bolle and Pankanti [35] has a chapter devoted to multimodal biometrics.

This report aims to review the various approaches to integration and fusion. As already indicated, there are a number of different integration scenarios, each of which can be classified into one of two basic categories: heterogeneous integration and homogeneous integration. In the former category the integrated system relies on multiple biometric modalities but these are not used jointly for personal identity verification. In contrast, in the latter category each scenario involves a fusion of multiple expert decisions. It cannot be overemphasised that this categorisation is convenient primarily from the point of view of structuring the presentation of the integration methodology in this paper. Practical systems will invariably employ a mix of integration methods to accrue the maximum benefit from the use of multiple experts and modalities.

The report is organised as follows. In the next section we discuss issues raised by heterogeneous integration. Section 3 overviews fusion strategies. Section 4 discusses several issues relating to fusion, including score normalisation and confidence measures. Section 5 reviews application studies involving multimodal biometrics. Section 6 draws conclusions and identifies the promising directions of future development in multimodal biometric system integration.



Figure 1 - Heterogeneous biometric integration: Data validation

The primary aim of heterogeneous integration is to use additional biometric modalities for control purposes of some kind. Here the meaning of biometric modality can be very loose and may even signify simply some kind of measurement that is indicative of other measured biometrics being live and genuine. Typically, it may be desirable to check the authenticity of biometric data before it is used for verification in order to prevent fraudulent access using, for instance, pre-recorded data. This is the liveness problem, a significant threat to biometric systems, especially to mono-modal and mono-expert systems. The situation is illustrated in Figure 1. A system where the authenticity of speech data is confirmed by detecting and analysing lip motion can serve as an example.

Another example is a system that tracks the face automatically to allow the use of fingerprint and face recognition modalities. Once the biometric data is validated, the identification or verification is performed using the automatically acquired data or data to be acquired by user collaboration. In this case the system performance is the same as that of the single modality used for decision making. The access systems are enhanced only in the sense of being more robust to so-called play-back attacks, i.e. access attempts using misappropriated biometric data. The approach is especially relevant in situations when the identification or identity verification performance of a system using a single biometric characteristic is adequate.

Another control scenario is depicted in Figure 2. It involves switching between two modalities. It is applicable, for instance, in providing secure access to networked services (confidential files, etc.). Here the decision concerning the initial user access, which must be associated with an extremely low false acceptance probability, is based on a very reliable biometric characteristic such as fingerprint. The subsequent monitoring of the user is based on a modality which may be less reliable but which can operate by passive acquisition of the relevant biometric data for user verification, such as a frontal face image.

Figure 2 - Heterogeneous architecture: Access and monitoring

An example of a non-cooperative liveness detection system uses automatic face and eye tracking of users who wish to access a physical or an abstract space [9]. The system combines real-time face tracking with the localisation of facial landmarks in order to improve the authenticity of fingerprint recognition. The purpose is to assist in securing public areas and in authenticating individuals, in addition to ensuring that the collected sensor data in a multimodal person authentication system originate from persons who are present, i.e. that the system is not under a play-back attack. Additionally, such systems enable the use of high-resolution biometrics requiring reliable knowledge of where to zoom autonomously, e.g. iris recognition. As an example, a pan and tilt unit is automatically controlled in real time to acquire face images of acceptable quality and scale for face recognition while optionally commanding a fingerprint sensor to be used in an attempt to reduce play-back attacks.

Figure 3 - Probe selection

Another scenario of heterogeneous integration involves the use of some user-related observations to improve the performance of a biometric modality. A typical example of this application has been reported in [47]. Here one modality is used to select the most appropriate model for another modality which is responsible for decision making. In particular, a lip localisation and tracking module described in [57] is employed to detect the state of the mouth (open, shut) in frontal face image probes. The information about the mouth state is used to select a client reference model of corresponding status. This has significantly improved the performance of the face verification system. The process is shown in more detail in Figure 3. The selection is based on the upper to lower lip distance, which is plotted on the vertical axis (in pixels). The horizontal axis shows the frame number. At 25 frames per second the track covers approximately five seconds. The minima and/or the maxima of the lip distance define the frame that is passed on to the frontal face verification module. If only the minima are used, we have a typical scenario of heterogeneous integration of biometric measurements. If both frontal face images with maximal and minimal lip distance are used, verification may be carried out by two experts operating on the same modality, but with models corresponding to different states of the face. The whole system is depicted in Figure 4. The left part shows the selection process. The probe selector effectively selects not only the probe, but also the model of the client for the given state. The right part of Figure 4 shows a standard expert fusion based on a single modality, discussed in Section 3.
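The probe selection step can be illustrated with a minimal sketch. It assumes that a lip tracker has already produced one upper-to-lower lip distance per frame; the function name, the example track and the model labels are hypothetical, and picking the global minimum and maximum is a simplification of the selection described in [47].

```python
from typing import Dict, Sequence, Tuple


def select_probe_frames(lip_distance: Sequence[float]) -> Tuple[int, int]:
    """Return (frame index of minimal lip distance, frame index of maximal lip distance)."""
    if not lip_distance:
        raise ValueError("empty lip-distance track")
    frames = range(len(lip_distance))
    closed = min(frames, key=lip_distance.__getitem__)
    opened = max(frames, key=lip_distance.__getitem__)
    return closed, opened


# Hypothetical track: lip distance in pixels, one value per video frame.
track = [4.0, 6.5, 12.0, 9.0, 3.0, 5.5]
closed_idx, open_idx = select_probe_frames(track)

# The selected frame also determines which client model the face expert uses.
model_for_frame: Dict[int, str] = {closed_idx: "mouth_shut", open_idx: "mouth_open"}
```

Using only `closed_idx` corresponds to the purely heterogeneous scenario; passing both frames on, each with its own model, corresponds to two experts operating on the same modality.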

Figure 4 - Model selection

A similar idea has been exploited in [30, 54] in the context of text-independent speaker verification. In their work a speaker-independent speech segmenter such as the ASLIP segmenter [20] is employed to detect and classify a section of the speech signal into acoustic categories which are linked with the state of the vocal tract. The information about the categorised acoustic events is used to select the corresponding expert, which is trained to model such events and is therefore likely to be more reliable. The process is shown in Figure 5.

Figure 5 - Context dependent speaker verification


In homogeneous integration of multiple biometric experts the goal is to treat all the expert decision outputs at the same level and use them to derive a final decision about the identity of a subject. This is very much in line with medical practice, where the opinion of several doctors is sought on a particular case and a consensus decision is made reflecting all the views put forth. The problem of emulating the process of combining multiple and often conflicting opinions in pattern analysis has been of interest over the last decade under different headings, depending on the community that has been addressing it. The terms classifier combination, multiple expert fusion and committee of experts all refer to the same research topic.

One of the first issues raised by homogeneous integration is that of score compatibility. Each biometric module will respond to a biometric stimulus by generating an output, a score, in support of a particular hypothesis. As the nature of the decision rules implemented by the distinct experts may be diametrically different, i.e. some may compute distances of varied ranges in different spaces and others a posteriori probabilities, the first task is to convert the respective outputs into comparable entities. The difficulty of this task depends very much on the formulation of the fusion problem. If multiple expert fusion is considered as a learning problem, where the outputs of individual experts are viewed as measurements which provide the input to the next decision-making stage where the experts' scores are fused, the prior homogenisation of the scores may not be essential. In any case, training data is necessary: if it is a learning problem, to train the classifier; if it is a score combination problem, to perform score homogenisation. It cannot be overemphasised that the training data must be distinct from that used for designing each individual biometric expert. In order to stress this distinction, this additional training data is referred to as evaluation data. It provides a means for an independent assessment of the reliability of each expert and the confidence in its decisions.

However, it should be noted that these estimates are likely to be optimistically biased. For other types of scores the homogenisation process involves a linear or non-linear mapping of the scores onto the [0, 1] interval. Such mappings are usually heuristic and as such do not guarantee optimal performance.

For the sake of our ensuing discussion we shall assume that the individual expert scores have been homogenised one way or another and that each expert $i$ is computing an estimate of the class a posteriori probabilities $P(\omega_j \mid x_i)$, $j = 1, \ldots, m$, based on the biometric measurement vector $x_i$ available to the expert, or some kind of matching score. The number of classes will depend on the actual application. In person identification the number of classes is given by the number of clients in the database. In identity verification, where the subject claims an identity in a co-operative manner, the number of classes is $m = 2$, signifying identity acceptance and rejection respectively.

Figure 6 - Serial fusion architecture

There are two basic architectures that can be adopted for expert fusion. In the serial architecture illustrated in Figure 6 the individual expert scores are considered sequentially. The scheme is particularly relevant when each expert operates with a reject option, i.e. it can distinguish between the situations when it can make decisions reliably and when it cannot. In the latter case the decision-making task is passed on to the next expert in the chain. The methodology used for designing a serial fusion architecture is basically that of decision tree construction. The serial scheme can be implemented in a number of variants. For instance, the elements in the decision-making chain can be arranged in ascending order of the cost of extracting the biometric characteristic. This will ensure that the overall cost of decision making is minimised. Alternatively, each stage can be used as a filter which reduces the number of hypotheses by eliminating those which clearly have no chance of being correct. Another possibility is to design each stage to operate in the reject subspace of the previous stage (boosting, decision trees, gating, etc.). For this scheme the hypothesis acceptance thresholds can be set in a conservative way to minimise the false positive rate. For instance, in the case of identity verification, this approach can be used to ensure zero false acceptances. If any of the experts accepts the claimed identity, access will be granted. Typical fusion methods falling into this category include class decision trees, cascaded AdaBoost, and class grouping methods [41].

In ordering models in a serial fusion architecture, in addition to the cost of extracting the modality (input) for the model, the cost of the model itself may also be critical. For example, if one model is linear and the other is k-nearest neighbour (k-NN), we would like to use k-NN only if the input is rejected by the linear model. This is the idea behind cascading [3, 39], where a first, simple rule-learner learns a general “rule” and the k-NN learns localised “exceptions” to the rule. This makes the system much faster than, for example, voting.
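A minimal sketch of a serial scheme with a reject option, ordered by ascending cost, is given below. The threshold-based experts, the cost values and the function names are illustrative assumptions, not the design of any system cited here.

```python
from typing import Callable, List, Optional, Tuple

# An expert returns True (accept the claim), False (reject the claim),
# or None when it exercises its reject option ("cannot decide reliably").
Expert = Callable[[float], Optional[bool]]


def make_threshold_expert(low: float, high: float) -> Expert:
    """Hypothetical expert: decides only when the score lies outside (low, high)."""
    def expert(score: float) -> Optional[bool]:
        if score >= high:
            return True
        if score <= low:
            return False
        return None
    return expert


def serial_fusion(stages: List[Tuple[float, Expert, float]]) -> Tuple[Optional[bool], float]:
    """stages holds (cost, expert, score) triples.

    Experts are consulted in ascending order of cost; the first one that does
    not abstain decides. Returns (decision, total cost spent).
    """
    spent = 0.0
    for cost, expert, score in sorted(stages, key=lambda stage: stage[0]):
        spent += cost
        decision = expert(score)
        if decision is not None:
            return decision, spent
    return None, spent  # every expert abstained


cheap = make_threshold_expert(low=0.2, high=0.8)   # abstains on ambiguous scores
costly = make_threshold_expert(low=0.5, high=0.5)  # always decides
```

With `cheap` placed first, the costly expert is only paid for when the cheap one abstains, which is the motivation for ordering the chain by cost.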

Figure 7 - Parallel fusion architecture

In a parallel architecture all the individual experts compute their respective scores S(i) (for expert i) simultaneously. The homogenised scores are then fed into a fusion stage as illustrated in Figure 7. A large number of combination strategies have been proposed in the literature. The reason for this is that there are very many facets to fusion and numerous variations on the theme relating to each facet. These can be gleaned from the following list. Some of these apply in the context of the serial architecture but they become more conspicuous in parallel integration.

Data versus decision fusion - In principle it would be possible to fuse biometric characteristics before they are input to a decision-making stage. For instance, multiple observations of a biometric modality could be registered first and then submitted to a single expert to reach a decision.

Model versus decision fusion - In certain situations each expert may employ a different model and these could be integrated to work in conjunction with a single expert.

Fusion of hard versus soft decisions - At the point of expert fusion one may either work with soft decision outputs, which in a sense are more informative but inconclusive, or with the outputs hardened by, for instance, a maximum selector, which are conclusive and concise.

Fusion of best hypotheses versus hypotheses lists - Each expert may be responsible for producing a ranked list of hypotheses rather than just the most probable hypothesis, and a fusion scheme then operates on such lists.

Recently, several papers have contributed to a better understanding of the relationships between these fusion approaches and of their relative merits [40, 44, 60, 29, 56, 11, 65, 42, 18]. We shall take the view that data fusion may in certain cases be impracticable because of the complexity of the models that would have to be developed. Working with hardened decision outputs can be considered as a special case of soft decision fusion where a fusion operator is preceded by a non-linear coarsening/clipping function which converts the soft outputs into hardened outcomes. Soft decision fusion has the advantage that it naturally maintains multiple hypotheses until the final fused decision. A review of the most common fusion strategies and their properties can be found elsewhere [44, 40, 47, 53].
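The view of hard decision fusion as soft fusion preceded by a clipping function can be made concrete with a small sketch. The maximum selector and the sum operator follow the text; the function names and the numerical values are made up for illustration.

```python
from typing import List


def harden(posteriors: List[float]) -> List[float]:
    """Maximum selector: 1.0 for the winning class, 0.0 for all others."""
    winner = max(range(len(posteriors)), key=posteriors.__getitem__)
    return [1.0 if j == winner else 0.0 for j in range(len(posteriors))]


def sum_fusion(expert_outputs: List[List[float]]) -> int:
    """Add the experts' outputs per class and return the index of the best class."""
    m = len(expert_outputs[0])
    totals = [sum(out[j] for out in expert_outputs) for j in range(m)]
    return max(range(m), key=totals.__getitem__)


# Hypothetical soft outputs of three experts for m = 2 classes.
soft = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]

soft_decision = sum_fusion(soft)                       # class totals 1.20 vs 1.80
hard_decision = sum_fusion([harden(p) for p in soft])  # votes 2 vs 1 (majority vote)
```

Summing hardened outputs is majority voting. In this example the two fusions disagree because hardening discards the strength of the third expert's opinion.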

Compressing all the measurement information into a single value may not be the optimal way to summarise the decision output of a biometric modality. One can use a multidimensional score signal or fuzzy votes to describe the “mind” state rather than a 1D score signal [7]. An example of this is to equip experts with fuzzy quality signals in addition to the traditional scores. The quality signal can be viewed as giving the experts a fuzzy “reject” option. A low quality signal means that the expert declines to make a decision, whereas a high quality signal means that the accompanying score should be processed, e.g. in a supervisor architecture. A two-dimensional score has recently been shown to outperform 1D scores [8, 25] on signature and fingerprint modalities. In such a scheme the fusion is adapted every time an identity hypothesis is put to a test by the expert. This allows the supervisor to discriminate, continuously rather than discretely, between a good expert that has to refuse to test a hypothesis (e.g. because of poor image quality) and an expert that confidently rejects the identity hypothesis. The situation is analogous to complex number representation, where the computation of the argument is less reliable the smaller the magnitude, although an argument (the traditional score) can still be computed. Ultimately, using multidimensional mind states has the potential to merge the benefits of serial and parallel architectures because the fusion stage will have the means to adapt the decision architecture to the current hypothesis testing conditions, such as biometric signal quality for the currently tested identity.
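As a rough illustration of how a quality signal can act as a fuzzy reject option, the sketch below weights each expert's score by its quality. This is a deliberately simple stand-in chosen for illustration, not the supervisor described in [7, 8, 25]; the function name and the [0, 1] quality range are assumptions.

```python
from typing import List, Optional


def quality_weighted_score(scores: List[float], qualities: List[float]) -> Optional[float]:
    """Fuse (score, quality) pairs into one score.

    A quality of 0 silences an expert completely (hard reject); intermediate
    values reduce its influence continuously. Returns None when every expert
    declines to decide.
    """
    total_quality = sum(qualities)
    if total_quality == 0:
        return None
    weighted = sum(s * q for s, q in zip(scores, qualities))
    return weighted / total_quality
```

For example, `quality_weighted_score([0.9, 0.1], [1.0, 0.0])` ignores the second expert entirely, whereas equal qualities reduce the rule to plain averaging.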

The same motivation is behind the use of ranking lists. Ranking lists have the advantage that the potential dynamic range of soft decision outputs is drastically reduced. This avoids the problem of expert outputs close to zero dominating the fusion by inhibiting the corresponding hypotheses. However, ranking hypothesis scores appears to be more meaningful for identification than for verification scenarios, where the experts have just two hypotheses to evaluate: the claimed identity is true, or false.

In summary, the parallel combination strategies discussed in this section can be viewed as a multistage process whereby the input data is used to compute the relevant scores, which in turn are used as input to the next processing stage. The problem is then to find class-separating surfaces in this new feature space. The sum rule and the averaging estimator and their weighted versions then implement linear separating boundaries in this space. The other combination strategies implement non-linear boundaries.
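A compact sketch of fixed combination rules of this kind is given below, assuming that each expert already outputs homogenised class posteriors. The dictionary of rules, the function name and the example values are illustrative.

```python
from math import prod
from statistics import median
from typing import Callable, Dict, List

# Each rule maps the experts' supports for one class to a single value.
RULES: Dict[str, Callable[[List[float]], float]] = {
    "sum": sum,
    "product": prod,
    "min": min,
    "max": max,
    "median": median,
}


def fuse(posteriors: List[List[float]], rule: str) -> int:
    """posteriors[i][j] is expert i's estimate of P(class j | x_i).

    Returns the index of the class with the largest combined support.
    """
    combine = RULES[rule]
    m = len(posteriors[0])
    support = [combine([p[j] for p in posteriors]) for j in range(m)]
    return max(range(m), key=support.__getitem__)


# Hypothetical outputs of three experts; the third almost vetoes class 0.
posteriors = [[0.9, 0.1], [0.6, 0.4], [0.02, 0.98]]
```

On this example the sum and median rules choose class 0, while the product, min and max rules choose class 1: a single output close to zero dominates the product and min rules.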

The idea can then be extended further and the problem of combination posed as one of training the second stage using these probabilities so as to minimise the recognition error. This is the approach adopted by various multistage combination strategies, as exemplified by Support Vector Machine fusion [5], the behaviour knowledge space method of Huang and Suen [32] and the techniques in [48, 66]. The decision template method [51] also falls into the category of trainable approaches. Most importantly, when the linear or non-linear combination functions are obtained by training, the distinctions between the two scenarios fade away and one can view classifier fusion in a unified way. This probably explains the success of many heuristic combination strategies that have been suggested in the literature without any concerns about the underlying theory.
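To illustrate what training the second stage means in practice, the following sketch fits a logistic regression to expert scores by plain gradient descent. Logistic regression is used here only as a simple trainable fusion function, not as one of the specific methods cited above; the evaluation scores, labels and hyperparameters are made up.

```python
import math
from typing import List, Tuple


def train_logistic_fusion(scores: List[List[float]], labels: List[int],
                          epochs: int = 2000, lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit w, b so that sigmoid(w . s + b) separates genuine (1) from impostor (0)."""
    n_experts = len(scores[0])
    w = [0.0] * n_experts
    b = 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_w = [0.0] * n_experts
        grad_b = 0.0
        for s, y in zip(scores, labels):
            z = sum(wi * si for wi, si in zip(w, s)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # gradient of the log-loss w.r.t. z
            for k in range(n_experts):
                grad_w[k] += err * s[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fused_accept(w: List[float], b: float, s: List[float]) -> bool:
    """Accept the identity claim when the fused posterior exceeds 0.5."""
    return sum(wi * si for wi, si in zip(w, s)) + b > 0.0


# Hypothetical evaluation data: two experts' scores per access attempt.
eval_scores = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.2, 0.3], [0.3, 0.1], [0.1, 0.4]]
eval_labels = [1, 1, 1, 0, 0, 0]
w, b = train_logistic_fusion(eval_scores, eval_labels)
```

The learned weights define a linear boundary in the score space, which is the trained counterpart of the weighted sum rule.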


There are several issues related to multimodal biometric expert fusion which have attracted considerable attention recently. These are addressed in the following subsections.

4.1 Score Normalisation

Although simple fusion rules (such as the Sum, Product, Min or Max rules) do not require a training phase, unlike learning-based classifiers, they do require that the outputs of the experts be normalised in some sense. In this section we present a review of various normalisation schemes used in combination methods.

Many normalisation schemes have so far been studied in the literature. They can be classified into two main categories: the first comprises normalisations that map the scores to a given interval; the second comprises score normalisations based on a posteriori class probabilities. Both categories of normalisation have been compared in [28].

In the first scheme, we find linear and non-linear mappings of scores [33, 34]. Widely used linear normalisations are: (i) the Min-Max normalisation, which maps the scores linearly to the [0, 1] interval (to guarantee this, values higher than the Max or lower than the Min are clipped by thresholding); and (ii) the Z-score normalisation, which transforms the scores linearly to a distribution with zero mean and a standard deviation of 1. Concerning non-linear normalisations, two types have emerged: those that only exploit the mean and standard deviation of each expert's scores (Tanh Estimator normalisation [33, 34]), and those that use, for each expert, the centre and width of the overlap of the genuine and impostor distributions. The latter normalisations are called adaptive since they decrease, by the use of those parameters and a non-linear function, the area of the overlap [33, 34, 14].
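The interval-mapping normalisations can be sketched as follows. The parameters are assumed to be estimated on evaluation data; the constant 0.01 in the tanh mapping is a commonly quoted choice and is an assumption here, as are the function names and the example scores.

```python
import math
from statistics import mean, pstdev


def min_max(score: float, lo: float, hi: float) -> float:
    """Map linearly to [0, 1]; scores outside [lo, hi] are clipped first."""
    clipped = min(max(score, lo), hi)
    return (clipped - lo) / (hi - lo)


def z_score(score: float, mu: float, sigma: float) -> float:
    """Shift and scale so that the evaluation scores have zero mean and unit deviation."""
    return (score - mu) / sigma


def tanh_estimator(score: float, mu: float, sigma: float, k: float = 0.01) -> float:
    """Non-linear mapping into (0, 1); k controls how fast the mapping saturates."""
    return 0.5 * (math.tanh(k * (score - mu) / sigma) + 1.0)


# Hypothetical evaluation scores of one expert.
evaluation_scores = [2.0, 4.0, 6.0, 8.0]
lo, hi = min(evaluation_scores), max(evaluation_scores)
mu, sigma = mean(evaluation_scores), pstdev(evaluation_scores)
```

Only Min-Max guarantees the [0, 1] range exactly, and only because of the clipping step; the Z-score output is unbounded.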

In the second scheme, score normalisation is achieved by estimating the class-conditional score distributions of each expert's scores and converting these into class a posteriori probabilities [44] by the Bayes rule. The estimation of class-conditional distributions can be performed with a parametric method by assuming a given distribution (for example Gaussian) and estimating the parameters of that distribution. Another possibility is to perform non-parametric estimation of the class-conditional distributions (for example with Parzen windows [24]). Of course, the quality of the estimation is crucial.
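For the parametric case, the conversion of a raw score into an a posteriori probability can be sketched as below. The Gaussian assumption follows the text; the parameter values, the equal priors and the function names are illustrative assumptions.

```python
import math
from typing import Tuple


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def genuine_posterior(score: float,
                      genuine: Tuple[float, float],
                      impostor: Tuple[float, float],
                      prior_genuine: float = 0.5) -> float:
    """Bayes rule: P(genuine | score) from class-conditional Gaussians.

    genuine and impostor are (mean, standard deviation) pairs that would be
    estimated on evaluation data.
    """
    joint_genuine = gaussian_pdf(score, *genuine) * prior_genuine
    joint_impostor = gaussian_pdf(score, *impostor) * (1.0 - prior_genuine)
    return joint_genuine / (joint_genuine + joint_impostor)
```

With `genuine=(0.8, 0.1)` and `impostor=(0.2, 0.1)`, a score of 0.5 lies midway between the two means and yields a posterior of 0.5, while a score of 0.8 yields a posterior very close to 1.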

All these normalisations have in common the fact that they rely in some way on the genuine and impostor distributions: by means of first and second order moments, or the area of overlap of the two distributions, or finally the distributions themselves. Estimating such statistical characteristics requires a dedicated database that has to be different from the training database of each expert. Such a database is often called the Evaluation Database and is in fact the Training Database of the fusion system.

4.2 Confidence, Competence and Ambiguity

Any score normalisation modifies the outputs and risks introducing a bias. When there is sufficient extra data (“evaluation data”), stacking is to be favoured; it takes classifier outputs as they are and there is no need for normalisation. K-fold cross-validation may be used when the size of the data set available for all aspects of training is limited. Simple voting assumes all models are equally reliable; weighted voting gives the same weight to a model regardless of the input; stacking learns to correct the biases of models and is preferred. Stacking, however, increases variance and risks overfitting on small datasets.

If the classifiers generate posterior probabilities, the highest posterior can be taken as the confidence (as is done in deciding when to reject). Otherwise, the difference between the highest and the second highest outputs can be taken as confidence and used as a weight in a weighted voting scheme [2].
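The margin-based confidence can be sketched as follows. The function names and the example outputs are hypothetical, and the voting scheme is one straightforward reading of the description above rather than the exact procedure of [2].

```python
from typing import List


def margin_confidence(outputs: List[float]) -> float:
    """Difference between the highest and the second highest output."""
    top, second = sorted(outputs, reverse=True)[:2]
    return top - second


def confidence_weighted_vote(expert_outputs: List[List[float]]) -> int:
    """Each expert votes for its top class with a weight equal to its margin."""
    m = len(expert_outputs[0])
    votes = [0.0] * m
    for outputs in expert_outputs:
        winner = max(range(m), key=outputs.__getitem__)
        votes[winner] += margin_confidence(outputs)
    return max(range(m), key=votes.__getitem__)


# Two hesitant experts favour class 0, one confident expert favours class 1.
outputs = [[0.55, 0.45], [0.52, 0.48], [0.10, 0.90]]
```

A simple majority vote would pick class 0 here, whereas the confidence-weighted vote picks class 1 because the third expert's margin outweighs the other two combined.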

The confidence in the decision of a mono-modal classifier, also called decision reliability, can be used to perform decision-level fusion for multimodal biometrics on a presentation-by-presentation basis. The decision reliability is defined as the probability that a mono-modal classifier has taken a correct accept or reject decision given the available evidence. That evidence can come from the decision domain, score domain, feature domain, signal domain, or a mix of these. In [58], the error behaviour of a speaker verification classifier and the associated log-likelihood ratio score distributions, as well as signal-to-noise ratios, are explicitly modelled. The model is then used to associate each classifier decision with a reliability figure, which indicates how likely it is that the classifier can be trusted.

This approach has recently been applied to multimodal biometrics for combining speech and face, where the reliability figure is used to break ties when uni-modal classifiers disagree [49]. An important aspect of this approach is that the reliabilities are assessed independently for each presentation and not fixed a priori, thereby exploiting modality-specific robustness.


The benefits of the multimodal biometrics approach have been demonstrated in a number of studies. The most popular modalities to fuse are facial image and voice. Pioneering work in this area can be traced back to the mid-1990s and includes [6, 61, 12, 4, 15, 19].

Combining face and speech has received most attention. Chibelushi et al. [17] were the first to combine speaker and face verification models. They used weighted summation to fuse the opinions of the two experts. They modelled each person in the database by two single-output Multi-Layer Perceptrons, one as a face expert and the other as a voice expert. These experts are combined at decision level. This output is used both in recognition and verification. For recognition the maximum of these outputs is used and for verification a threshold is set. They concluded by showing that the multimodal system performs better than both of the single experts.

Brunelli et al. [13] carried out similar work. They fused a speaker recognition expert and a face recognition expert with weighted product fusion. The optimal weights are found empirically on an independent test set. They used a vector quantization based speaker recognition expert and a geometric feature-based expert with 35 features. The resulting system again outperformed the single modalities.

In [12], Brunelli and Falavigna fused two speech experts and three face experts. The speech experts are based on vector quantization; one of them uses Mel Frequency Cepstral Coefficients (MFCC) and the other uses their deltas. The face experts are geometry based and they recognize the eye, nose and mouth areas. These five experts are combined with weighted product fusion, with weights adjusted according to a heuristic. The fused system significantly increased the recognition rate. The report of Duc [23] presents the first multimodal results on a publicly available database, M2VTS [55].

Dieckmann et al. [22] fused a face expert, a dynamic lip expert and a text-dependent speech expert. Their fusion technique is a hybrid of decision and opinion fusions. They used majority voting to fuse three classifiers. Since there are three classifiers, two of them should agree in order to issue a result. In addition, they forced the fused opinion to exceed a given threshold to obtain a more reliable system. The fused system was more successful than the single experts.
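A decision scheme of this kind, majority voting followed by a threshold on the fused opinion, can be sketched as below. How the fused opinion was actually computed in [22] is not stated here, so the mean support of the agreeing experts is an assumption, as are the function name and the example labels.

```python
from collections import Counter
from typing import List, Optional


def vote_with_threshold(decisions: List[str], scores: List[float],
                        threshold: float) -> Optional[str]:
    """decisions[i] is the identity chosen by expert i, scores[i] its support.

    Returns the identity only if a strict majority agrees on it and the mean
    support of the agreeing experts exceeds the threshold; otherwise None.
    """
    label, count = Counter(decisions).most_common(1)[0]
    if count * 2 <= len(decisions):
        return None  # no strict majority (fewer than two out of three)
    support = [s for d, s in zip(decisions, scores) if d == label]
    fused = sum(support) / len(support)
    return label if fused > threshold else None
```

For instance, `vote_with_threshold(["alice", "alice", "bob"], [0.9, 0.7, 0.8], 0.6)` returns `"alice"`, while the same votes with supports of 0.5 each for the agreeing experts are rejected by the threshold.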

Kittler et al. [46] used multiple images of a person to get multiple opinions and fused them with averaging (weighted summation) and order statistics rules (min, max and median). They used a single face verification system based on optimized robust correlation. With fusion, they obtained up to a 40% reduction in error rates. They showed that the performance gains saturated after the first few images. They also showed that median fusion was robust to outliers.

Kittler et al. [43] compared various combination schemes (sum rule, product rule, min rule, max rule, median rule and majority voting) experimentally. The sum rule outperformed the other methods. This was unexpected because the sum rule has the strongest assumptions. In this study, they used three experts: frontal face, face profile and text-dependent speech.

Luettin [52] combined speech and lip information through feature vector concatenation. Larger speech frames were used to match the frame rates of speech and lip features. This fusion led to only a small performance increase in the text-dependent case and worse results in the text-independent case.

Jourlin et al. [37] fused a text-dependent speech expert and a text-dependent lip expert using weighted summation. The optimal weight and verification threshold were found on a validation set. They concluded that the lip expert performed much worse than the speech expert, but the integrated system outperformed the speech expert.
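
Selecting the weight and the threshold on a validation set can be sketched as a simple grid search. The search strategy, the criterion (total number of false acceptances and false rejections) and the toy validation data are assumptions made for illustration; the source does not state how [37] carried out the optimisation.

```python
def fuse(speech_score, lip_score, weight):
    """Weighted summation of two expert scores; weight applies to speech."""
    return weight * speech_score + (1.0 - weight) * lip_score


def tune(validation, steps=20):
    """Return (errors, weight, threshold) with the fewest validation errors.

    validation: list of (speech_score, lip_score, is_true_claim) tuples.
    """
    best = None
    for i in range(steps + 1):
        weight = i / steps
        fused = [(fuse(s, l, weight), y) for s, l, y in validation]
        # Candidate thresholds are the fused scores themselves.
        for threshold in sorted({f for f, _ in fused}):
            errors = sum((f >= threshold) != y for f, y in fused)
            if best is None or errors < best[0]:
                best = (errors, weight, threshold)
    return best


# Hypothetical validation claims: (speech score, lip score, true claim?).
validation = [
    (0.9, 0.6, True), (0.8, 0.3, True), (0.7, 0.7, True),
    (0.4, 0.5, False), (0.3, 0.8, False), (0.5, 0.2, False),
]
errors, weight, threshold = tune(validation)
```

The selected pair is then kept fixed when the system is evaluated on unseen test claims.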

Abdeljaoued [1] used a Bayesian post-classifier to combine three experts for an identity verification task. The experts use parametric models for the true and impostor classes.

Ben-Yacoub et al. [4] compared several post-classifiers (Support Vector Machine, Minimum Cost Bayesian Classifier, Fisher Linear Discriminant, C4.5 and Multi-Layer Perceptron). There were three experts to be fused: a frontal face expert (using Elastic Graph Matching), a text-dependent speech expert (using a Hidden Markov Model) and a text-independent speech expert (arithmetic-harmonic sphericity based). In the experimental comparison, the Bayesian Classifier and the Support Vector Machine with a polynomial kernel outperformed the others.

A similar comparison was performed by Verlinde [61]. Besides various post-classifiers (decision tree, Multi-Layer Perceptron, logistic regression-based classifier, Bayesian classifier with Gaussian distributions, Fisher Linear Discriminant and k-NN), majority voting, AND and OR fusion methods were used to fuse a frontal face expert, a face profile expert and a text-independent speech expert. The logistic regression-based post-classifier was the most successful method.
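
To make the idea of a post-classifier concrete, the sketch below treats the vector of expert scores as a feature vector and applies a Bayesian classifier with Gaussian distributions. Modelling each expert score with an independent one-dimensional Gaussian per class, the variance floor and the toy training scores are assumptions made for illustration; they are not the configuration used in [61].

```python
import math


def fit_gaussian(values):
    """Return (mean, variance) of a list of scores, with a variance floor."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, 1e-6)


def log_pdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def train(score_vectors, labels):
    """Fit one Gaussian per expert and per class (True = true claim)."""
    model = {}
    for cls in (True, False):
        rows = [v for v, y in zip(score_vectors, labels) if y == cls]
        model[cls] = [fit_gaussian(column) for column in zip(*rows)]
    return model


def accept(model, scores, log_threshold=0.0):
    """Accept the claim if the log-likelihood ratio exceeds the threshold."""
    llr = sum(
        log_pdf(x, *model[True][i]) - log_pdf(x, *model[False][i])
        for i, x in enumerate(scores)
    )
    return llr > log_threshold


# Hypothetical training scores from three experts.
true_claims = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.8], [0.7, 0.7, 0.9]]
impostor_claims = [[0.2, 0.3, 0.1], [0.1, 0.2, 0.3], [0.3, 0.1, 0.2]]
model = train(true_claims + impostor_claims, [True] * 3 + [False] * 3)
```

The threshold on the log-likelihood ratio plays the role of the prior and cost terms of the Bayesian decision rule; zero corresponds to equal priors and equal costs.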

Frischholz and Dieckmann [26] integrated face, voice and lip movement recognizers through weighted summation, majority voting and AND fusion. This method is used in their commercial product, BioID. The user selects the fusion method and sets the parameters according to the desired security level (AND fusion for the highest security). Voice recognition is done using vector quantization: the codebook found from the feature vectors (cepstral coefficients) of an individual is used as a reference voice pattern for that user, and recognition is done using a minimum distance classifier. The lip movement recognizer calculates a vector field representing the local movements of images in a video sequence, using an optical-flow technique. The vector fields are orthogonalized and normalised and a compressed prototype is created. Test patterns are multiplied with the templates and the highest scalar product gives the resulting class. The face module uses the Hausdorff distance to locate the face, rotates and scales the image and classifies it in the same way as the lip module.
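
The vector quantization step with a minimum distance classifier can be sketched as follows. The codebooks are assumed to be already trained (codebook training is omitted), and the two-dimensional vectors, user names and the use of the average Euclidean distance as the distortion measure are assumptions made for illustration rather than details of BioID.

```python
import math


def distortion(features, codebook):
    """Average distance from each feature vector to its nearest codeword."""
    total = sum(min(math.dist(f, c) for c in codebook) for f in features)
    return total / len(features)


def identify(features, codebooks):
    """Return the user whose codebook gives the smallest distortion."""
    return min(codebooks, key=lambda user: distortion(features, codebooks[user]))


# Hypothetical codebooks; real ones would hold cepstral codewords.
codebooks = {
    "alice": [(0.0, 0.0), (1.0, 1.0)],
    "bob": [(5.0, 5.0), (6.0, 6.0)],
}
test_features = [(0.1, 0.2), (0.9, 1.1)]
best_user = identify(test_features, codebooks)
```

For verification rather than identification, the distortion against the claimed user's codebook would instead be compared with a threshold.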

Similarly, Yemez et al. [67] fused three modalities: speech, face and lip motion. Speech and lip motion are fused before mapping via vector concatenation. From the speech signal, MFCC vectors are extracted and they are concatenated with the optical-flow vectors. In order to synchronize the frame rates of these two features, the lip motion features are interpolated. The recognition is done using an HMM which runs on the concatenated and synchronised feature vectors. The output of an independent face recognizer, based on the Eigenface method, is then fused with the output of the HMM using weighted summation.
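
Fusion before mapping by interpolation and concatenation can be sketched as below. Linear interpolation, the function names and the toy feature values are assumptions made for illustration; the source does not state which interpolation scheme [67] uses.

```python
def interpolate(frames, target_len):
    """Linearly interpolate a sequence of feature vectors to target_len frames."""
    if len(frames) == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (len(frames) - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b for a, b in zip(frames[lo], frames[hi])])
    return out


def concatenate(speech_frames, lip_frames):
    """Upsample lip features to the speech frame rate, then concatenate."""
    lip_upsampled = interpolate(lip_frames, len(speech_frames))
    return [list(s) + l for s, l in zip(speech_frames, lip_upsampled)]


# Hypothetical features: 5 speech frames (2-D) and 3 lip frames (1-D).
speech = [[0.1, 0.2], [0.2, 0.3], [0.3, 0.4], [0.4, 0.5], [0.5, 0.6]]
lip = [[0.0], [1.0], [2.0]]
joint = concatenate(speech, lip)
```

Each joint vector has the dimension of the speech vector plus that of the lip vector, and the resulting sequence can be passed to a single classifier such as an HMM.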

The same authors also designed a text-dependent recognition system which integrates lip motion and speech information [38]. The lip motion module extracts eigenlip coefficients and the speech module extracts MFCCs. These vectors are synchronised by interpolating the lip vectors according to the speech frame rate and then concatenated before mapping. The resulting vectors are used to train an HMM. The integrated system outperforms the single modalities under noisy conditions, but performs worse than the speech-only recognizer under clean conditions.

Wark et al. [64] combined a text-independent speech expert and a text-independent lip expert using weighted summation. The primary aim of this work was to set the weights such that the contribution of the speech expert decreases when the signal to noise ratio (SNR) is low. The standard error of the difference between sample means is used as a measure for the discrimination ability of an expert. If there is less variation between opinions for the true and impostor claims (low standard error), the performance of the expert is high. This weight heuristic provided good results in clean conditions and moderate success in conditions with higher than 10 dB SNR.
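
The standard error of the difference between two sample means can be computed as in the sketch below. The unpooled form, sqrt(s1^2/n1 + s2^2/n2), and the toy opinion values are assumptions made for illustration; the source does not state which variant [64] uses, nor how the weights are derived from this quantity.

```python
import math
import statistics


def standard_error_of_difference(true_opinions, impostor_opinions):
    """Unpooled standard error of the difference between two sample means."""
    var_true = statistics.variance(true_opinions)
    var_impostor = statistics.variance(impostor_opinions)
    return math.sqrt(
        var_true / len(true_opinions) + var_impostor / len(impostor_opinions)
    )


# Hypothetical opinions of one expert for true and impostor claims.
true_opinions = [0.8, 0.9, 1.0]
impostor_opinions = [0.1, 0.2, 0.3]
se = standard_error_of_difference(true_opinions, impostor_opinions)
```

A smaller value indicates that the opinions within each class are tightly clustered, which the heuristic interprets as better discrimination.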

Wark et al. [63] improved their heuristic for weight optimisation so that the weights can be adjusted at test time. Speech experts are liable to perform worse in the presence of noise (low SNR levels). Text-independent systems are more affected by high noise levels than text-dependent experts are. They proposed a new weight heuristic for the speech expert based on the observation that, when there is no noise, the distance between a test opinion and the opinion model of true claims is small and the distance from the model of impostor claims is large. In that case, a larger weight is given to the speech expert. In noisy conditions, the contribution of the speech expert decreases.

Sanderson and Paliwal [59] proposed a method of weight adjustment which models MFCCs of noise segments using a Gaussian Mixture Model (GMM) and compares noise segments of test speech utterances with the model. Weights are adjusted according to the mismatch between the noise segments of test utterances and the model. This mismatch is mapped to the [0,1] interval using a sigmoid. The weight of the speech expert is close to zero for noisy test utterances and close to one for clean test utterances.
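
The mapping from mismatch to speech weight can be sketched with a decreasing sigmoid. The mismatch value is taken as a given input here, and the slope and midpoint parameters, the function names and the two-expert weighted summation are hypothetical choices made for illustration; they are not the values or the exact formulation of [59].

```python
import math


def speech_weight(mismatch, slope=2.0, midpoint=3.0):
    """Map a noise mismatch to a weight in [0, 1].

    Small mismatch (clean speech) gives a weight near one; large mismatch
    (noisy speech) gives a weight near zero. slope and midpoint are
    hypothetical tuning parameters.
    """
    return 1.0 / (1.0 + math.exp(slope * (mismatch - midpoint)))


def adaptive_fusion(speech_opinion, other_opinion, mismatch):
    """Weighted summation whose speech weight depends on the mismatch."""
    w = speech_weight(mismatch)
    return w * speech_opinion + (1.0 - w) * other_opinion


clean_weight = speech_weight(0.0)
noisy_weight = speech_weight(10.0)
```

Because the weight is computed from the test utterance itself, no separate estimate of the noise level is needed when the fused opinion is formed.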

The effect of combining face, speech and fingerprint is investigated in [36]. Similarly, the merits of the combination of fingerprint with other image type modalities have been explored by a number of researchers [31]. Face and iris have been combined in [62], while palm and hand geometry were integrated in [50]. Face and lips were jointly exploited in [45].


